Self-trained parser model
Download the self-trained parser model.
The DATA/ directory is an alternate data directory containing a model trained on WSJ and NANC data using self-training. The WSJ data is given a relative weight of 5 and is combined with approximately 1,750k sentences from NANC (1,765,736 sentences total). On section 23 of the Penn Treebank, the model achieves an f-score of 92.1% with the reranking parser. For more details, please see:
- David McClosky, Eugene Charniak, and Mark Johnson. Effective Self-Training for Parsing. Proceedings of the Conference on Human Language Technology and North American chapter of the Association for Computational Linguistics (HLT-NAACL 2006), Brooklyn, New York. [PDF] [slides]
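As a rough illustration of what a relative weight of 5 on the WSJ data can mean, the sketch below counts each event from the hand-annotated data five times and each event from the self-trained data once. This is an assumed reading of the weighting, not the parser's actual training code; the function name `combine_weighted_counts` and the toy rule strings are hypothetical.

```python
from collections import Counter


def combine_weighted_counts(gold_events, self_trained_events, gold_weight=5):
    """Merge event counts, counting each gold event `gold_weight` times.

    Hypothetical helper: it treats a relative weight as a count multiplier,
    which is equivalent to including `gold_weight` copies of the gold data.
    """
    combined = Counter()
    for event in gold_events:
        combined[event] += gold_weight
    for event in self_trained_events:
        combined[event] += 1
    return combined


# Toy stand-ins for events extracted from WSJ (gold) and NANC (self-trained).
wsj_events = ["NP -> DT NN", "VP -> VBD NP"]
nanc_events = ["NP -> DT NN", "NP -> NNP NNP", "NP -> DT NN"]

counts = combine_weighted_counts(wsj_events, nanc_events, gold_weight=5)
for event, count in sorted(counts.items()):
    print(event, count)
```

Under this reading, the much smaller hand-annotated set keeps more influence on the combined counts than it would if every sentence were counted once.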
More information about self-training can be found in these papers:
- David McClosky, Eugene Charniak, and Mark Johnson. Reranking and Self-Training for Parser Adaptation. Proceedings of the Association for Computational Linguistics (COLING-ACL 2006), Sydney, Australia. [PDF] [slides]
- David McClosky, Eugene Charniak, and Mark Johnson. When is Self-Training Effective for Parsing? Proceedings of the International Conference on Computational Linguistics (COLING 2008), Manchester, UK. [PDF] [slides]
- David McClosky and Eugene Charniak. Self-Training for Biomedical Parsing. Proceedings of the Association for Computational Linguistics (ACL 2008, short papers), Columbus, Ohio. [PDF]
Make sure you have a sufficiently recent release of the BLLIP reranking parser from here; older releases cannot handle this model's larger vocabulary.