About | Downloads | Usage | Questions | Release history
This software is the event parser component from the Stanford and FAUST submissions to the BioNLP shared task. It does not currently include the event reranker component (the performance of the parser alone is generally around 0.5-1% lower than that of the reranked parser). The event parser system is described in:
David McClosky, Mihai Surdeanu, and Christopher D. Manning. 2011. Event Extraction as Dependency Parsing. In Proceedings of the Association for Computational Linguistics - Human Language Technologies 2011 Conference (ACL-HLT 2011), Main Conference. [pdf, bib]
The underlying parser is MSTParser, created by Ryan McDonald and Jason Baldridge. Thanks to them for making their code available! The Stanford Event Parser code is licensed under the full GPL, which allows its use for research purposes, free software projects, software services, etc., but not in distributed proprietary software. The download requires Java 1.6.
Download Stanford Biomedical Event Parser code (version 1.0, 1.9MB)
Download our parses, tokenizations, and triggers (version 1.0, 13MB)
Download our PubMed distributional similarity word clusters (version 1.0, 2MB)
These are part of the training data for building event parser models. These can be found here. If you use our parses, tokenizations, and triggers archive, you do not need anything from the supporting downloads page.
You will need the jar files from Stanford CoreNLP (version 1.1.0 -- if you require the Event Parser to work against a different version of CoreNLP, please let us know) as well as GNU Trove (version 2.0.4 works for us, version 3 currently does not) to run this. You will need to add libraries from these files to your classpath. From Stanford CoreNLP, add stanford-corenlp-version.jar (where version is the version of the Stanford CoreNLP distribution), jgrapht.jar, jgraph.jar and xom.jar. From GNU Trove, add the appropriate trove-versionnumber.jar file.
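As a minimal sketch of the classpath described above, the following shell function joins the listed jars into one string. The function name, its arguments, and the location of the event parser's own classes are assumptions made for illustration; only the jar file names come from the text above. It uses the Unix `:` separator (Windows uses `;`).

```shell
#!/bin/sh
# Sketch: assemble a classpath string for the event parser.
# Arguments (all hypothetical paths chosen by the user):
#   $1  location of the event parser classes or jar
#   $2  directory holding the Stanford CoreNLP jars
#   $3  version string used in the CoreNLP jar name
#   $4  path to the GNU Trove jar (e.g. a trove-2.0.4.jar)
build_classpath() {
    parser_classes=$1
    corenlp_dir=$2
    corenlp_version=$3
    trove_jar=$4
    # Join the entries with ':' (the Unix classpath separator).
    printf '%s:%s:%s:%s:%s:%s\n' \
        "$parser_classes" \
        "$corenlp_dir/stanford-corenlp-$corenlp_version.jar" \
        "$corenlp_dir/jgrapht.jar" \
        "$corenlp_dir/jgraph.jar" \
        "$corenlp_dir/xom.jar" \
        "$trove_jar"
}
```

The result can be stored in a variable and passed to `java -cp` in place of the classpath placeholder used in the commands on this page.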
The event parser expects files to be laid out in a specific directory structure. The root of this structure is given by the base.directory property. Be sure to set this property in all of your properties files. For the purposes of this documentation, this location will be referred to as base. Inside base, you should have a subdirectory for each dataset that you want to use. Each dataset is identified by a "shortName": GENIA is genia, Epigenetics is epi, and Infectious Diseases is infect. If you're using the combined GENIA and Infectious Diseases data (per our experiments on Infectious Diseases), the shortName is infect++5x. You should extract the parses, tokenizations, and triggers archive inside the base directory.
Directory | Contents |
base/shortName/stanford-tokenizations | Should contain a sentence-segmented and tokenized version of each document (regardless of whether the document is part of training, testing, etc.). Each file should be of the form docID.tok (e.g. PMID-9878621.tok). In each file, sentences should be newline separated and words should be separated by spaces. |
base/shortName/stanford-mccc-parses | Should contain a parse for each document (regardless of whether the document is part of training, testing, etc.) from the biomedical McClosky-Charniak-Johnson parser. Each file should be of the form docID.ptb (e.g. PMID-9878621.ptb). (These can be found in the distributed parses and tokenizations file.) |
base/shortName/umass-tokenizations (optional) | Same format as stanford-tokenizations but used when dataset.tokenizer=umass. These tokenizations were made by Sebastian Riedel. |
base/shortName/umass-mccc-parses (optional) | Same format as stanford-mccc-parses but used when dataset.parser=umass-mccc. These should be the result of parsing umass-tokenizations with the McClosky-Charniak-Johnson parser. |
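The layout in the table above can be checked before running anything. The sketch below is not part of the distribution; the function name and messages are hypothetical. It verifies that the two required directories exist for a dataset and that every docID.tok file has a matching docID.ptb parse, and it only reports the optional umass directories.

```shell
#!/bin/sh
# Sketch: check_layout BASE SHORTNAME
# Returns 0 if the required directories exist and every .tok file has a
# matching .ptb parse; returns 1 otherwise. Messages go to stderr.
check_layout() {
    base=$1
    short_name=$2
    status=0
    # Required directories from the table above.
    for sub in stanford-tokenizations stanford-mccc-parses; do
        if [ ! -d "$base/$short_name/$sub" ]; then
            echo "missing required directory: $base/$short_name/$sub" >&2
            status=1
        fi
    done
    # Optional directories: report only, do not fail.
    for sub in umass-tokenizations umass-mccc-parses; do
        if [ ! -d "$base/$short_name/$sub" ]; then
            echo "note: optional directory not present: $base/$short_name/$sub" >&2
        fi
    done
    # Every tokenized document should also have a parse.
    for tok in "$base/$short_name/stanford-tokenizations"/*.tok; do
        [ -e "$tok" ] || continue   # glob matched nothing
        doc_id=$(basename "$tok" .tok)
        if [ ! -f "$base/$short_name/stanford-mccc-parses/$doc_id.ptb" ]; then
            echo "no parse for document: $doc_id" >&2
            status=1
        fi
    done
    return $status
}
```

For example, `check_layout base genia` would check the GENIA dataset under the base directory.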
See the files trigger_classifier_defaults.props and event_parser_defaults.props from the code distribution as a basis for your own properties files. Each file tells you which settings should be adjusted to fit your system and which ones can likely be left alone.
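Since base.directory must be set in every properties file, a quick check can catch a file where it was forgotten. The following is a sketch under the assumption that the files use standard Java properties syntax (key followed by `=`, `:`, or whitespace, with `#` starting a comment); the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: check_base_directory FILE...
# Returns 0 only if every listed properties file contains an uncommented
# base.directory entry.
check_base_directory() {
    status=0
    for props in "$@"; do
        # Match the key at the start of a line, followed by a separator.
        # Lines starting with '#' do not match, so commented-out entries
        # are not counted.
        if ! grep -Eq '^[[:space:]]*base\.directory[[:space:]=:]' "$props"; then
            echo "base.directory is not set in $props" >&2
            status=1
        fi
    done
    return $status
}
```

It could be run over both files at once, for example on your own copies of the trigger classifier and event parser properties files.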
Our code sometimes doesn't work well with the default BioNLP 2011 task tokenizations. Our own tokenizations/segmentations can be found in the distributed parses and tokenizations file, or can be produced by running the class RunBioNLPTokenizer over your base directory:
java -cp classpath edu.stanford.nlp.ie.machinereading.domains.bionlp.RunBioNLPTokenizer -base.directory base
Running the trigger classifier is optional if you've downloaded the parses, tokenizations, and triggers archive and you're working with BioNLP 2011 data. If you're working on other (non-shared-task) data, you'll need to run the trigger classifier over it. The trigger classifier can be run with the following command:
java -cp classpath edu.stanford.nlp.ie.machinereading.domains.bionlp.TriggerClassifier -props properties
where properties is the filename for your trigger classifier properties from the Configuration step.
The event parser can be run with the following command:
java -cp classpath edu.stanford.nlp.ie.machinereading.domains.bionlp.EventParser -props properties
where properties is the filename for your event parser properties from the Configuration step.
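The three commands on this page can be chained in a small wrapper. The sketch below assumes they are run in the order they are presented here (tokenizer, trigger classifier, event parser); the function names and the DRY_RUN variable are hypothetical conveniences, while the class names and flags are copied from the commands above. If you use the distributed archive with BioNLP 2011 data, the first two steps may not be needed.

```shell
#!/bin/sh
# Package containing the three entry points used on this page.
PKG=edu.stanford.nlp.ie.machinereading.domains.bionlp

# Run a command, or just print it when DRY_RUN=1.
run_step() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Sketch: run_pipeline CLASSPATH BASE TRIGGER_PROPS PARSER_PROPS
# Stops at the first step that fails.
run_pipeline() {
    classpath=$1
    base=$2
    trigger_props=$3
    parser_props=$4
    run_step java -cp "$classpath" "$PKG.RunBioNLPTokenizer" -base.directory "$base" &&
    run_step java -cp "$classpath" "$PKG.TriggerClassifier" -props "$trigger_props" &&
    run_step java -cp "$classpath" "$PKG.EventParser" -props "$parser_props"
}
```

Setting DRY_RUN=1 before calling run_pipeline prints the three command lines without starting Java, which is a cheap way to confirm the classpath and file names first.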
What is sanity check 1? Does it matter that it's failing frequently?
Sanity check 1 tests whether events and their arguments are in the same sentence. Since the BioNLP corpora contain quite a large number of cases where events and their arguments are not in the same sentence, frequent failures should not really be a concern. However, edges connecting events and arguments across sentences are dropped, so these cases do matter if you're working on improving the event parser.
Other questions
Please email David McClosky and Mihai Surdeanu if you have other questions. The distribution is still in beta and likely in need of more testing, so feel free to ask.
Version 1.0 | August 2, 2011 | Initial release |