Self-trained biomedical parsing

Note: If you're looking for our biomedical event extraction software, please see this page instead.

I am now (June 16, 2009) distributing my division of the GENIA 1.0 trees in Penn Treebank format. You can download them here.
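Penn Treebank format encodes each parse as a bracketed s-expression: `(LABEL child1 child2 ...)`, with part-of-speech tags immediately dominating words. A minimal Python sketch of reading such a tree (the biomedical-flavored example sentence is invented, not taken from the GENIA distribution):

```python
# Minimal reader for Penn Treebank bracketed trees (a sketch; the example
# tree below is invented, not taken from the GENIA distribution).

def tokenize(text):
    # Split a bracketed tree string into '(', ')' and atom tokens.
    return text.replace('(', ' ( ').replace(')', ' ) ').split()

def parse_tree(tokens):
    # Recursively build (label, children) tuples; preterminals have a
    # single string child, the word.
    token = tokens.pop(0)
    assert token == '(', "tree must start with '('"
    label = tokens.pop(0)
    children = []
    while tokens[0] != ')':
        if tokens[0] == '(':
            children.append(parse_tree(tokens))
        else:
            children.append(tokens.pop(0))  # terminal word
    tokens.pop(0)  # consume ')'
    return (label, children)

def terminals(tree):
    # Yield (POS tag, word) pairs left to right.
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        yield (label, children[0])
    else:
        for child in children:
            yield from terminals(child)

tree = parse_tree(tokenize("(S (NP (NN p53)) (VP (VBZ binds) (NP (NN DNA))))"))
print(list(terminals(tree)))
# [('NN', 'p53'), ('VBZ', 'binds'), ('NN', 'DNA')]
```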

Using the above trees, I repeated the self-training experiments from our ACL 2008 paper using GENIA 1.0 trees as the labeled data. This also allowed me to create a GENIA reranker. The results (on the dev set from my division) are quite dramatic:

Model                                               f-score
WSJ                                                 74.9
WSJ + WSJ reranker                                  76.8
WSJ + PubMed (parsed by WSJ) + WSJ reranker         80.7 [1]
GENIA                                               83.6
GENIA + WSJ reranker                                84.5
GENIA + GENIA reranker                              85.7
GENIA + PubMed (parsed by GENIA) + GENIA reranker   87.6 [2]

[1] Original self-trained biomedical parsing model (ACL 2008)
[2] Improved self-trained biomedical parsing model (please see my thesis)
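The f-scores above are PARSEVAL-style labeled-bracket scores: the harmonic mean of precision and recall over labeled constituent spans, matched between the gold and parser trees. A minimal sketch of the computation (the toy gold/guess bracket sets below are invented, not drawn from GENIA):

```python
# Labeled-bracket f-score, as used in PARSEVAL-style parser evaluation
# (a sketch; the toy gold/guess bracket lists below are invented).
from collections import Counter

def f_score(gold_brackets, guess_brackets):
    # Brackets are (label, start, end) spans; duplicates count, so
    # intersect multisets rather than sets.
    gold = Counter(gold_brackets)
    guess = Counter(guess_brackets)
    matched = sum((gold & guess).values())
    precision = matched / sum(guess.values())
    recall = matched / sum(gold.values())
    return 2 * precision * recall / (precision + recall)

gold = [('S', 0, 5), ('NP', 0, 2), ('VP', 2, 5), ('NP', 3, 5)]
guess = [('S', 0, 5), ('NP', 0, 2), ('VP', 2, 5), ('PP', 3, 5)]
print(round(f_score(gold, guess), 3))
# 0.75: three of four brackets match in each tree
```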

Improved self-trained biomedical parsing model

Available here. Please cite my thesis if you use this model:

  • David McClosky. 2010. Any Domain Parsing: Automatic Domain Adaptation for Natural Language Parsing. Ph.D. thesis, Department of Computer Science, Brown University. [PDF] [thesis defense slides]

Original self-trained biomedical parsing model

Available here. This model is deprecated and is kept only for historical purposes.

The DATA/ directory is an alternate data directory, trained on WSJ plus 266,664 randomly selected biomedical abstracts from PubMed. Using the standard WSJ-trained reranker (included with the BLLIP reranking parser), this model achieves an f-score of 84.3% on the GENIA treebank beta 2 test set. For more details, please see:

  • David McClosky and Eugene Charniak. Self-Training for Biomedical Parsing. Proceedings of the Association for Computational Linguistics (ACL 2008, short papers), Columbus, Ohio. [PDF]

More information about self-training can be found in these papers:

  • David McClosky, Eugene Charniak, and Mark Johnson. Effective Self-Training for Parsing. Proceedings of the Conference on Human Language Technology and North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2006), Brooklyn, New York. [PDF] [slides]
  • David McClosky, Eugene Charniak, and Mark Johnson. Reranking and Self-Training for Parser Adaptation. Proceedings of the International Conference on Computational Linguistics and the Association for Computational Linguistics (COLING-ACL 2006), Sydney, Australia. [PDF] [slides]
  • David McClosky, Eugene Charniak, and Mark Johnson. When is Self-Training Effective for Parsing? Proceedings of the International Conference on Computational Linguistics (COLING 2008), Manchester, UK. [PDF] [slides]

Make sure you have a sufficiently recent release of the BLLIP reranking parser from here; older releases cannot handle this model's larger vocabulary.