Self-trained biomedical parsing
Note: If you're looking for our biomedical event extraction software, please see this page instead.
As of June 16, 2009, I am distributing my division of the GENIA 1.0 trees in Penn Treebank format. You can download them here.
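For readers unfamiliar with the format, each tree in the distribution is a standard Penn Treebank-style bracketing. Below is a minimal sketch of reading one such bracketing with NLTK; the example sentence is illustrative and not taken from the actual distribution.

```python
# Minimal sketch: reading a Penn Treebank-style bracketing with NLTK.
# The example string below is illustrative, not from the GENIA distribution.
from nltk.tree import Tree

bracketing = "(S (NP (NN p53)) (VP (VBZ induces) (NP (NN apoptosis))))"
tree = Tree.fromstring(bracketing)

print(tree.label())    # 'S'
print(tree.leaves())   # ['p53', 'induces', 'apoptosis']
tree.pretty_print()    # ASCII rendering of the tree
```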
Using these trees as the labeled training data, I repeated the self-training experiments from our ACL 2008 paper, which also let me train a GENIA reranker. The results (on the dev set from my division) are quite dramatic; the basic self-training recipe is sketched after the table:
Model | f-score (%)
--- | ---
WSJ | 74.9
WSJ + WSJ reranker | 76.8
WSJ + PubMed (parsed by WSJ) + WSJ reranker | 80.7 [1]
GENIA | 83.6
GENIA + WSJ reranker | 84.5
GENIA + GENIA reranker | 85.7
GENIA + PubMed (parsed by GENIA) + GENIA reranker | 87.6 [2]
[1] Original self-trained biomedical parsing model (ACL 2008)
[2] Improved self-trained biomedical parsing model (please see my thesis)
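As a rough outline, the self-training recipe behind rows [1] and [2] is: train a first-stage parser on the labeled treebank, parse a large pool of unlabeled PubMed text with it (with the reranker picking the best parse in the reranking setups), add the automatically parsed sentences to the training data, and retrain. The sketch below only illustrates that loop; train_parser and parse are stand-ins for the BLLIP parser's own training and parsing tools, not functions shipped with any of the models here.

```python
# Minimal sketch of the self-training recipe behind rows [1] and [2].
# train_parser and parse are placeholders passed in as callables; the real
# experiments use the BLLIP parser's training and parsing programs.

def self_train(train_parser, parse, labeled_trees, unlabeled_sentences):
    """Retrain a parser on gold trees plus its own parses of unlabeled text."""
    # 1. Train the base (first-stage) parser on the labeled treebank
    #    (WSJ for row [1], my division of GENIA 1.0 for row [2]).
    parser = train_parser(labeled_trees)

    # 2. Parse the unlabeled in-domain text (here, PubMed abstracts); in the
    #    reranking setups, the reranker selects the best parse from the n-best list.
    auto_trees = [parse(parser, sentence) for sentence in unlabeled_sentences]

    # 3. Retrain the parser on the gold trees plus the automatically parsed trees.
    return train_parser(labeled_trees + auto_trees)
```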
Improved self-trained biomedical parsing model
Available here. Please cite my thesis if you use this model:
- David McClosky. 2010. Any Domain Parsing: Automatic Domain Adaptation for Natural Language Parsing. Ph.D. thesis, Department of Computer Science, Brown University. [PDF] [thesis defense slides]
Original self-trained biomedical parsing model
Available here. This model is deprecated and kept only for historical purposes.
The DATA/ directory is an alternate data directory for the first-stage parser, trained on WSJ plus 266,664 randomly collected biomedical abstracts from PubMed. Using the standard WSJ-trained reranker (included with the BLLIP reranking parser), this model achieves an f-score of 84.3% on the GENIA treebank beta 2 test set. For more details, please see the following paper (a brief usage sketch appears after it):
- David McClosky and Eugene Charniak. 2008. Self-Training for Biomedical Parsing. Proceedings of the Association for Computational Linguistics (ACL 2008, short papers), Columbus, Ohio. [PDF]
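As a usage sketch only: one way to load a parsing model together with a reranker is through the bllipparser Python bindings, as below. The model directory path is illustrative, and it assumes the model has been packaged as a unified model directory (first-stage parser model plus reranker model); depending on how DATA/ is laid out, you may instead need to point the parser's command-line tools at it directly.

```python
# Minimal sketch using the bllipparser Python bindings (pip install bllipparser).
# The path below is illustrative; it assumes a unified model directory
# containing both a first-stage parser model and a reranker model.
from bllipparser import RerankingParser

rrp = RerankingParser.from_unified_model_dir('/path/to/biomedical-model')
print(rrp.simple_parse('MMP-9 induction requires NF-kappaB activation.'))
```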
More information about self-training can be found in these papers:
- David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective Self-Training for Parsing. Proceedings of the Conference on Human Language Technology and North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2006), Brooklyn, New York. [PDF] [slides]
- David McClosky, Eugene Charniak, and Mark Johnson. 2006. Reranking and Self-Training for Parser Adaptation. Proceedings of the Association for Computational Linguistics (COLING-ACL 2006), Sydney, Australia. [PDF] [slides]
- David McClosky, Eugene Charniak, and Mark Johnson. 2008. When is Self-Training Effective for Parsing? Proceedings of the International Conference on Computational Linguistics (COLING 2008), Manchester, UK. [PDF] [slides]