Self-trained parser model

Download the self-trained parser model.

The DATA/ directory is an alternate data directory containing a model trained on WSJ and NANC data using self-training. The WSJ data is given a relative weight of 5 and is combined with approximately 1,750k sentences from NANC (1,765,736 sentences total). On section 23 of the Penn Treebank, the model achieves an f-score of 92.1% with the reranking parser. For more details, please see:

  • David McClosky, Eugene Charniak, and Mark Johnson. Effective Self-Training for Parsing. Proceedings of the Conference on Human Language Technology and North American chapter of the Association for Computational Linguistics (HLT-NAACL 2006), Brooklyn, New York. [PDF] [slides]
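
As a rough illustration of what a "relative weight of 5" can mean, the sketch below merges event counts from a hand-annotated corpus and a self-trained corpus, scaling the hand-annotated counts by the weight. This is one plausible reading of the weighting and not the parser's actual training code: the function name `combine_weighted_counts`, the rule strings, and all of the counts are hypothetical and chosen only for the example.

```python
from collections import Counter


def combine_weighted_counts(gold_counts, self_trained_counts, gold_weight=5):
    """Merge two sets of event counts, scaling the gold counts by gold_weight.

    Hypothetical helper for illustration only; it is not part of the
    BLLIP parser.
    """
    combined = Counter()
    # Each event seen in the hand-annotated data counts gold_weight times.
    for event, count in gold_counts.items():
        combined[event] += gold_weight * count
    # Events from the automatically parsed data count once each.
    for event, count in self_trained_counts.items():
        combined[event] += count
    return combined


# Toy, made-up counts standing in for WSJ (gold) and NANC (self-trained).
wsj_counts = Counter({"NP -> DT NN": 3, "VP -> VBD NP": 2})
nanc_counts = Counter({"NP -> DT NN": 10, "NP -> NNP": 4})

combined = combine_weighted_counts(wsj_counts, nanc_counts, gold_weight=5)
for event, count in sorted(combined.items()):
    print(event, count)
```

With this kind of weighting, the much smaller hand-annotated corpus is not swamped by the far larger automatically parsed one.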

More information about self-training can be found in these papers:

  • David McClosky, Eugene Charniak, and Mark Johnson. Reranking and Self-Training for Parser Adaptation. Proceedings of the Association for Computational Linguistics (COLING-ACL 2006), Sydney, Australia. [PDF] [slides]
  • David McClosky, Eugene Charniak, and Mark Johnson. When is Self-Training Effective for Parsing? Proceedings of the International Conference on Computational Linguistics (COLING 2008), Manchester, UK. [PDF] [slides]
  • David McClosky and Eugene Charniak. Self-Training for Biomedical Parsing. Proceedings of the Association for Computational Linguistics (ACL 2008, short papers), Columbus, Ohio. [PDF]

Make sure you have a sufficiently recent release of the BLLIP reranking parser from here; older releases will not be able to handle the larger vocabulary.