Self-trained biomedical parsing

Note: If you're looking for our biomedical event extraction software, please see this page instead.

I am now (June 16, 2009) distributing my division of the GENIA 1.0 trees in Penn Treebank format. You can download them here.
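Penn Treebank format encodes each parse as a bracketed s-expression: `(LABEL child1 child2 ...)`, with part-of-speech tags immediately dominating words. A minimal Python sketch of reading such a tree (the biomedical-flavored example sentence is invented, not taken from the GENIA distribution):

```python
# Minimal reader for Penn Treebank bracketed trees (a sketch; the example
# tree below is invented, not taken from the GENIA distribution).

def tokenize(text):
    # Split a bracketed tree string into '(', ')' and atom tokens.
    return text.replace('(', ' ( ').replace(')', ' ) ').split()

def parse_tree(tokens):
    # Recursively build (label, children) tuples; preterminals have a
    # single string child, the word.
    token = tokens.pop(0)
    assert token == '(', "tree must start with '('"
    label = tokens.pop(0)
    children = []
    while tokens[0] != ')':
        if tokens[0] == '(':
            children.append(parse_tree(tokens))
        else:
            children.append(tokens.pop(0))  # terminal word
    tokens.pop(0)  # consume ')'
    return (label, children)

def terminals(tree):
    # Yield (POS tag, word) pairs left to right.
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        yield (label, children[0])
    else:
        for child in children:
            yield from terminals(child)

tree = parse_tree(tokenize("(S (NP (NN p53)) (VP (VBZ binds) (NP (NN DNA))))"))
print(list(terminals(tree)))
# [('NN', 'p53'), ('VBZ', 'binds'), ('NN', 'DNA')]
```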

Using the above trees, I repeated the self-training experiments from our ACL 2008 paper using GENIA 1.0 trees as the labeled data. This also allowed me to create a GENIA reranker. The results (on the dev set from my division) are quite dramatic:

Model                                               f-score
WSJ                                                 74.9
WSJ + WSJ reranker                                  76.8
WSJ + PubMed (parsed by WSJ) + WSJ reranker         80.7 [1]
GENIA                                               83.6
GENIA + WSJ reranker                                84.5
GENIA + GENIA reranker                              85.7
GENIA + PubMed (parsed by GENIA) + GENIA reranker   87.6 [2]

[1] Original self-trained biomedical parsing model (ACL 2008)
[2] Improved self-trained biomedical parsing model (please see my thesis)
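The f-scores above are PARSEVAL-style labeled-bracket scores: the harmonic mean of precision and recall over labeled constituent spans, matched between the gold and parser trees. A minimal sketch of the computation (the toy gold/guess bracket sets below are invented, not drawn from GENIA):

```python
# Labeled-bracket f-score, as used in PARSEVAL-style parser evaluation
# (a sketch; the toy gold/guess bracket lists below are invented).
from collections import Counter

def f_score(gold_brackets, guess_brackets):
    # Brackets are (label, start, end) spans; duplicates count, so
    # intersect multisets rather than sets.
    gold = Counter(gold_brackets)
    guess = Counter(guess_brackets)
    matched = sum((gold & guess).values())
    precision = matched / sum(guess.values())
    recall = matched / sum(gold.values())
    return 2 * precision * recall / (precision + recall)

gold = [('S', 0, 5), ('NP', 0, 2), ('VP', 2, 5), ('NP', 3, 5)]
guess = [('S', 0, 5), ('NP', 0, 2), ('VP', 2, 5), ('PP', 3, 5)]
print(round(f_score(gold, guess), 3))
# 0.75: three of four brackets match in each tree
```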

Improved self-trained biomedical parsing model

Available here. Please cite my thesis if you use this model:

  • David McClosky. 2010. Any Domain Parsing: Automatic Domain Adaptation for Natural Language Parsing. Ph.D. thesis, Department of Computer Science, Brown University. [PDF] [thesis defense slides]

Original self-trained biomedical parsing model

Available here. This model is deprecated and is kept only for historical purposes.

The DATA/ directory is an alternate data directory, trained on WSJ plus 266,664 randomly selected biomedical abstracts from PubMed. Using the standard WSJ-trained reranker (included with the BLLIP reranking parser), this model achieves an f-score of 84.3% on the GENIA treebank beta 2 test set. For more details, please see:

  • David McClosky and Eugene Charniak. Self-Training for Biomedical Parsing. Proceedings of the Association for Computational Linguistics (ACL 2008, short papers), Columbus, Ohio. [PDF]

More information about self-training can be found in these papers:

  • David McClosky, Eugene Charniak, and Mark Johnson. Effective Self-Training for Parsing. Proceedings of the Conference on Human Language Technology and North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2006), Brooklyn, New York. [PDF] [slides]
  • David McClosky, Eugene Charniak, and Mark Johnson. Reranking and Self-Training for Parser Adaptation. Proceedings of the International Conference on Computational Linguistics and the Association for Computational Linguistics (COLING-ACL 2006), Sydney, Australia. [PDF] [slides]
  • David McClosky, Eugene Charniak, and Mark Johnson. When is Self-Training Effective for Parsing? Proceedings of the International Conference on Computational Linguistics (COLING 2008), Manchester, UK. [PDF] [slides]

Make sure you have a sufficiently recent release of the BLLIP reranking parser from here; older releases cannot handle this model's larger vocabulary.