Effective Self-training for Parsing

Self-trained parser model

Download the self-trained parser model. You may also want to visit this page, which lists the latest information about BLLIP Parser models.

The DATA/ directory is an alternate data directory containing a model trained on WSJ and NANC data using self-training. The WSJ data is given a relative weight of 5 and is combined with approximately 1,750k sentences from NANC (1,765,736 sentences total). On section 23 of the Penn Treebank, the model achieves an f-score of 92.1% with the reranking parser. For more details, please see:

David McClosky, Eugene Charniak, and Mark Johnson. Effective Self-Training for Parsing. Proceedings of the Conference on Human Language Technology and North American chapter of the Association for Computational Linguistics (HLT-NAACL 2006), Brooklyn, New York. [PDF] [slides]
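
As a rough illustration of what a relative weight of 5 means, the sketch below combines event counts from a hand-labeled corpus with counts from automatically parsed text, multiplying the labeled counts by the weight. This is only an assumed way of picturing the weighting, not the BLLIP Parser's actual training code; the function name, the rule strings, and all counts are hypothetical toy values.

```python
from collections import Counter


def combine_weighted_counts(labeled_counts, self_trained_counts, labeled_weight=5):
    """Merge two count tables, scaling the labeled counts by labeled_weight.

    Hypothetical helper for illustration only; not part of any parser release.
    """
    combined = Counter()
    # Each labeled event counts labeled_weight times, which has the same effect
    # as reading the labeled corpus that many times.
    for event, count in labeled_counts.items():
        combined[event] += labeled_weight * count
    # Events from the automatically parsed corpus are counted once each.
    for event, count in self_trained_counts.items():
        combined[event] += count
    return combined


# Toy, made-up rule counts standing in for WSJ (labeled) and NANC (self-trained).
wsj_counts = Counter({"NP -> DT NN": 3, "VP -> VBD NP": 2})
nanc_counts = Counter({"NP -> DT NN": 10, "NP -> NNP": 4})

combined = combine_weighted_counts(wsj_counts, nanc_counts, labeled_weight=5)
for event, count in sorted(combined.items()):
    print(event, count)
```

The weight keeps the much smaller labeled corpus from being swamped by the far larger automatically parsed one.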

More information about self-training can be found in these papers:

David McClosky, Eugene Charniak, and Mark Johnson. Reranking and Self-Training for Parser Adaptation. Proceedings of the Association for Computational Linguistics (COLING-ACL 2006), Sydney, Australia. [PDF] [slides]
David McClosky, Eugene Charniak, and Mark Johnson. When is Self-Training Effective for Parsing? Proceedings of the International Conference on Computational Linguistics (COLING 2008), Manchester, UK. [PDF] [slides]
David McClosky and Eugene Charniak. Self-Training for Biomedical Parsing. Proceedings of the Association for Computational Linguistics (ACL 2008, short papers), Columbus, Ohio. [PDF]

David McClosky
