Stanford Biomedical Event Parser (SBEP)

Event Extraction for the BioNLP 2009/2011 shared task

About

This software is the event parser component from the Stanford and FAUST submissions to the BioNLP shared task. It does not currently include the event reranker component (the performance of the parser alone is generally around 0.5-1% lower than that of the reranked parser). The event parser system is described in:

David McClosky, Mihai Surdeanu, and Christopher D. Manning. 2011. Event Extraction as Dependency Parsing. In Proceedings of the Association for Computational Linguistics - Human Language Technologies 2011 Conference (ACL-HLT 2011), Main Conference.

The underlying parser is MSTParser, created by Ryan McDonald and Jason Baldridge. Thanks to them for making their code available! The Stanford Event Parser code is licensed under the full GPL, which allows its use for research purposes, free software projects, software services, etc., but not in distributed proprietary software. The download requires Java 1.6.

Downloads

Download Stanford Biomedical Event Parser code (version 1.0, 1.9MB)

Download our parses, tokenizations, and triggers (version 1.0, 13MB)

Download our PubMed distributional similarity word clusters (version 1.0, 2MB)

Download Stanford CoreNLP (version 1.1.0, no models, 31MB)

Usage

  1. Obtaining BioNLP shared task data

    The shared task data sets are part of the training data for building event parser models. They can be downloaded from the BioNLP shared task website. If you use our parses, tokenizations, and triggers archive, you do not need anything from the supporting downloads page.

  2. Obtaining libraries

    To run the event parser, you will need the jar files from Stanford CoreNLP (version 1.1.0; if you need the event parser to work with a different version of CoreNLP, please let us know) and from GNU Trove (version 2.0.4 works for us; version 3 currently does not). Add the following jars to your classpath. From Stanford CoreNLP: stanford-corenlp-version.jar (where version is the version of the Stanford CoreNLP distribution), jgrapht.jar, jgraph.jar, and xom.jar. From GNU Trove: the appropriate trove-versionnumber.jar file.
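
    As a rough illustration of the classpath described above, the following shell sketch joins the jars with ':' (the classpath separator on Unix-like systems). The directory locations and the names CORENLP_DIR, TROVE_DIR, SBEP_CODE, and SBEP_CLASSPATH are placeholders chosen for this example, not part of the distribution. The literal word "version" in the CoreNLP jar name must be replaced with the one in your download, and the sketch assumes the event parser's own compiled code also has to be on the classpath.

```shell
#!/bin/sh
# Sketch only: every path below is a placeholder for your own install locations.
CORENLP_DIR="${CORENLP_DIR:-/path/to/stanford-corenlp}"
TROVE_DIR="${TROVE_DIR:-/path/to/trove}"
# Hypothetical location of the compiled event parser code from the SBEP download.
SBEP_CODE="${SBEP_CODE:-/path/to/event-parser}"

# join_classpath ENTRY...: join all arguments with ':' and print the result.
join_classpath() {
    joined=""
    for entry in "$@"; do
        if [ -z "$joined" ]; then
            joined="$entry"
        else
            joined="$joined:$entry"
        fi
    done
    printf '%s\n' "$joined"
}

# Replace "version" with the version string in your CoreNLP jar's file name.
SBEP_CLASSPATH=$(join_classpath \
    "$SBEP_CODE" \
    "$CORENLP_DIR/stanford-corenlp-version.jar" \
    "$CORENLP_DIR/jgrapht.jar" \
    "$CORENLP_DIR/jgraph.jar" \
    "$CORENLP_DIR/xom.jar" \
    "$TROVE_DIR/trove-2.0.4.jar")
echo "$SBEP_CLASSPATH"
```

    The resulting string can then stand in for classpath in the java commands in the later steps.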

  3. Filesystem setup

    The event parser expects its files to follow a specific directory structure, rooted at the directory named by the base.directory property. Be sure to set this property in all of your properties files. In this documentation, that location is referred to as base. Inside base, you should have a subdirectory for each dataset that you want to use. Each dataset is identified by a "shortName": GENIA is genia, Epigenetics is epi, and Infectious Diseases is infect. If you're using the combined GENIA and Infectious Diseases data (per our experiments on Infectious Diseases), the shortName is infect++5x. Extract the parses, tokenizations, and triggers archive inside the base directory.

    Directory
        Contents

    base/shortName/stanford-tokenizations
        Should contain a sentence-segmented and tokenized version of each document (regardless of whether the document is part of training, testing, etc.). Each file should be of the form docID.tok (e.g. PMID-9878621.tok). In each file, sentences should be newline separated and words should be separated by spaces.

    base/shortName/stanford-mccc-parses
        Should contain a parse for each document (regardless of whether the document is part of training, testing, etc.) from the biomedical McClosky-Charniak-Johnson parser. Each file should be of the form docID.ptb (e.g. PMID-9878621.ptb). (These can be found in the distributed parses and tokenizations file.)

    base/shortName/umass-tokenizations (optional)
        Same format as stanford-tokenizations but used when dataset.tokenizer=umass. These tokenizations were made by Sebastian Riedel.

    base/shortName/umass-mccc-parses (optional)
        Same format as stanford-mccc-parses but used when dataset.parser=umass-mccc. These should be the result of parsing umass-tokenizations with the McClosky-Charniak-Johnson parser.
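
    If you prepare your own data, a quick consistency check can catch documents that were tokenized but never parsed. The sketch below assumes only the directory layout and file naming listed above; the function name missing_parses is made up for this example and is not part of the distribution.

```shell
#!/bin/sh
# missing_parses BASE SHORTNAME: print the docID of every docID.tok file in
# stanford-tokenizations that has no matching docID.ptb in stanford-mccc-parses.
missing_parses() {
    tok_dir="$1/$2/stanford-tokenizations"
    ptb_dir="$1/$2/stanford-mccc-parses"
    for tok in "$tok_dir"/*.tok; do
        # If the glob matched nothing, "$tok" is the unexpanded pattern; skip it.
        [ -e "$tok" ] || continue
        doc_id=$(basename "$tok" .tok)
        [ -f "$ptb_dir/$doc_id.ptb" ] || printf '%s\n' "$doc_id"
    done
}
```

    For example, missing_parses /path/to/base genia (with a placeholder base path) prints nothing when every tokenized GENIA document has a parse.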

  4. Configuration

    Use the files trigger_classifier_defaults.props and event_parser_defaults.props from the code distribution as a starting point. Each file tells you which settings should be adjusted to fit your system and which ones can likely be left alone.
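
    One possible way to adjust a setting such as base.directory in a copy of a defaults file is sketched below. It assumes the .props files use Java-properties-style "key = value" lines, which this page does not state explicitly; the helper name set_prop is invented for this example.

```shell
#!/bin/sh
# set_prop FILE KEY VALUE: remove any existing line that sets KEY, then append
# "KEY = VALUE". Assumes "key = value" or "key=value" lines; comments are kept.
set_prop() {
    file="$1"; key="$2"; value="$3"
    tmp="$file.tmp.$$"
    # Escape dots so that a key like base.directory is matched literally.
    escaped_key=$(printf '%s' "$key" | sed 's/\./\\./g')
    # grep -v exits non-zero when it prints no lines, so ignore its status.
    grep -v "^[[:space:]]*${escaped_key}[[:space:]]*[=:]" "$file" > "$tmp" || true
    printf '%s = %s\n' "$key" "$value" >> "$tmp"
    mv "$tmp" "$file"
}
```

    A typical use would be to copy event_parser_defaults.props to a file of your own (say my_event_parser.props, a hypothetical name) and then run set_prop my_event_parser.props base.directory /path/to/base, leaving the distributed defaults untouched.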

  5. Running the tokenizer (optional)

    Our code sometimes doesn't work well with the default BioNLP 2011 task tokenizations. Our own tokenizations/segmentations can be found in the distributed parses and tokenizations file, or can be produced by running the class RunBioNLPTokenizer over your base directory:

    java -cp classpath edu.stanford.nlp.ie.machinereading.domains.bionlp.RunBioNLPTokenizer -base.directory base

  6. Running the trigger classifier (optional)

    This step is optional if you've downloaded the parses, tokenizations, and triggers archive and you're working off BioNLP 2011 data. If you're working on other (non-shared-task) data, you'll need to run the trigger classifier over it. The trigger classifier can be run with the following command:

    java -cp classpath edu.stanford.nlp.ie.machinereading.domains.bionlp.TriggerClassifier -props properties

    where properties is the filename for your trigger classifier properties from the Configuration step.

  7. Running the event parser

    The event parser can be run with the following command:

    java -cp classpath edu.stanford.nlp.ie.machinereading.domains.bionlp.EventParser -props properties

    where properties is the filename for your event parser properties from the Configuration step.
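
    The three commands from steps 5 to 7 can be chained in a small wrapper, sketched below. The class names and flags are the ones shown above; the function names run_step and run_pipeline, the SBEP_CLASSPATH variable, and the DRY_RUN switch are assumptions made for this example. The sketch runs all three steps, so drop the optional ones if you rely on the distributed tokenizations and triggers.

```shell
#!/bin/sh
SBEP_PACKAGE="edu.stanford.nlp.ie.machinereading.domains.bionlp"
# Placeholder: set this to your real classpath before running.
SBEP_CLASSPATH="${SBEP_CLASSPATH:-classpath}"

# run_step CLASS ARGS...: run one main class from the event parser package,
# or only print the command when DRY_RUN=1.
run_step() {
    class="$1"; shift
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf 'java -cp %s %s.%s %s\n' "$SBEP_CLASSPATH" "$SBEP_PACKAGE" "$class" "$*"
    else
        java -cp "$SBEP_CLASSPATH" "$SBEP_PACKAGE.$class" "$@"
    fi
}

# run_pipeline BASE TRIGGER_PROPS PARSER_PROPS: tokenize, classify triggers,
# then parse events, stopping at the first step that fails.
run_pipeline() {
    base="$1"; trigger_props="$2"; parser_props="$3"
    run_step RunBioNLPTokenizer -base.directory "$base" &&
    run_step TriggerClassifier -props "$trigger_props" &&
    run_step EventParser -props "$parser_props"
}
```

    Setting DRY_RUN=1 before calling run_pipeline prints the three java commands without executing them, which is a cheap way to check the classpath and file names first.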

Frequently Asked Questions

What is sanity check 1? Does it matter that it's failing frequently?

Sanity check 1 tests whether events and their arguments are in the same sentence. The BioNLP corpora contain quite a large number of cases where they are not, so frequent failures are not a cause for concern in themselves. However, edges connecting events and arguments across sentence boundaries are dropped, so these cases do matter if you're working on improving the event parser.

Other questions

Please email David McClosky and Mihai Surdeanu if you have other questions. The distribution is still in beta and likely needs more testing, so feel free to ask.

Release History


Version 1.0 (August 2, 2011): Initial release