This software is the event parser component from the Stanford and FAUST submissions to the BioNLP shared task. It does not currently include the event reranker component (the performance of the parser alone is generally around 0.5-1% lower than that of the reranked parser). The event parser system is described in:
David McClosky, Mihai Surdeanu, and Christopher D. Manning. 2011. Event Extraction as Dependency Parsing. In Proceedings of the Association for Computational Linguistics - Human Language Technologies 2011 Conference (ACL-HLT 2011), Main Conference.
The underlying parser is MSTParser, created by Ryan McDonald and Jason Baldridge. Thanks to them for making their code available!
The Stanford Event Parser code is licensed under the full GPL, which allows its use for research purposes, free software projects, software services, etc., but not in distributed proprietary software. The download requires Java 1.6.
These are part of the training data for building event parser models. These can be found here. If you use our parses, tokenizations, and triggers archive, you do not need anything from the supporting downloads page.
You will need the jar files from Stanford CoreNLP
(version 1.1.0 -- if you require the Event Parser to work against a different version of CoreNLP, please let us know) as well as GNU Trove (version 2.0.4 works for us, version 3 currently does not) to run this. You will need to add libraries from these files to your
classpath. From Stanford CoreNLP, add
version is the version of the Stanford CoreNLP distribution),
xom.jar. From GNU Trove, add the appropriate
The event parser expects files to have a specific directory structure.
This directory should be rooted in the
property. Be sure to set this property in all of your properties files.
For the purposes of this documentation, this location will be referred
base. Inside of
you should have a subdirectory for each dataset that you want to use.
Each dataset is identified by a "
shortName": GENIA is
epi, and Infectious Diseases is
If you're using the combined GENIA and Infectious Diseases (per our experiments on Infectious
infect++5x. You should extract the
parses, tokenizations, and triggers
archive inside the
Should contain a sentence-segmented and tokenized version of each document
(regardless of whether the document is part of training, testing,
etc.). Each file should be of the form
PMID-9878621.tok). In each file, sentences
should be newline separated and words should be separated by spaces.
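The tokenization format described above (one sentence per line, tokens separated by spaces) can be read with a few lines of Java. The following is only a minimal sketch for inspecting such files, not code from the event parser distribution: the class name TokFileSketch, the method parse, and the example sentences are hypothetical, and skipping blank lines is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the event parser code.
public class TokFileSketch {

    /** Splits the text of one tokenization file into sentences, each a list of tokens. */
    public static List<List<String>> parse(String contents) {
        List<List<String>> sentences = new ArrayList<List<String>>();
        // One sentence per line.
        for (String line : contents.split("\r?\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // assumption: blank lines do not represent sentences
            }
            // Tokens within a sentence are separated by spaces.
            sentences.add(Arrays.asList(trimmed.split(" +")));
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Made-up example text in the expected layout.
        String example = "IL-2 gene expression requires NF-kappa B .\n"
                       + "This is a second sentence .";
        for (List<String> sentence : parse(example)) {
            System.out.println(sentence.size() + " tokens: " + sentence);
        }
    }
}
```

A quick check like this can help confirm that your own tokenized files have one sentence per line before running the parser over them.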
Should contain a parse for each document (regardless of whether
the document is part of training, testing, etc.) from the biomedical
McClosky-Charniak-Johnson parser. Each file
should be of the form
PMID-9878621.ptb). (These can be found in the
distributed parses and tokenizations file.)
Same format as
stanford-tokenizations but used when
dataset.tokenizer=umass. These tokenizations were made by
Same format as
stanford-mccc-parses but used when
dataset.parser=umass-mccc. These should be the
result of parsing
umass-tokenizations with the
See the files
event_parser_defaults.props from the code distribution as
a basis. Each file tells you which settings should be adjusted to fit
your system and which ones can likely be left alone.
Our code sometimes doesn't work well with the default BioNLP 2011 task tokenizations.
Our own tokenizations/segmentations can be found in the distributed parses and tokenizations file,
or produced by the class
RunBioNLPTokenizer over your base directory:
java -cp classpath edu.stanford.nlp.ie.machinereading.domains.bionlp.RunBioNLPTokenizer -base.directory base
This step is optional if you've downloaded the parses, tokenizations, and triggers archive and you're working with BioNLP 2011 data. If you're working on other (non-shared-task) data, you'll need to run the trigger classifier over it. The trigger classifier can be run with the following command:
java -cp classpath edu.stanford.nlp.ie.machinereading.domains.bionlp.TriggerClassifier -props properties
properties is the filename for your trigger classifier properties from the Configuration step.
The event parser can be run with the following command:
java -cp classpath edu.stanford.nlp.ie.machinereading.domains.bionlp.EventParser -props properties
properties is the filename for your event parser properties from the Configuration step.
What is sanity check 1? Does it matter that it's failing frequently?
Sanity check 1 tests whether events and their arguments are in the same sentence. Since the BioNLP corpora contain quite a large number of cases where events and their arguments are not in the same sentence, frequent failures should not really be a concern. However, edges connecting events and arguments across sentences are dropped, so these cases do matter if you're working on improving the event parser.
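To illustrate the kind of test that sanity check 1 describes, here is a minimal sketch that decides whether an event trigger and one of its arguments fall inside the same sentence. This is not the event parser's actual implementation: the class SameSentenceSketch, its methods, and the use of character offsets with half-open sentence spans are all assumptions made for illustration.

```java
// Hypothetical illustration, not the event parser's own check.
public class SameSentenceSketch {

    /** Returns the index of the sentence whose span [start, end) contains offset, or -1 if none does. */
    public static int sentenceOf(int[] sentenceStarts, int[] sentenceEnds, int offset) {
        for (int i = 0; i < sentenceStarts.length; i++) {
            if (offset >= sentenceStarts[i] && offset < sentenceEnds[i]) {
                return i;
            }
        }
        return -1;
    }

    /** True only if both offsets fall inside the same known sentence. */
    public static boolean sameSentence(int[] sentenceStarts, int[] sentenceEnds,
                                       int triggerOffset, int argumentOffset) {
        int triggerSentence = sentenceOf(sentenceStarts, sentenceEnds, triggerOffset);
        return triggerSentence >= 0
            && triggerSentence == sentenceOf(sentenceStarts, sentenceEnds, argumentOffset);
    }

    public static void main(String[] args) {
        // Two made-up sentences covering characters [0, 40) and [40, 80).
        int[] starts = {0, 40};
        int[] ends = {40, 80};
        System.out.println("same sentence: " + sameSentence(starts, ends, 5, 20));
        System.out.println("cross sentence: " + sameSentence(starts, ends, 5, 45));
    }
}
```

An event-argument pair for which such a test fails corresponds to a cross-sentence edge, which is the kind of edge the answer above says is dropped.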
August 2nd, 2011