A traditional problem with newswire-trained statistical parsers and part-of-speech taggers is that they are not very good at parsing things like questions and imperatives, which are rare in newswire. This problem was partially addressed in 2006 by John Judge, who annotated 4000 questions:
John Judge, Aoife Cahill and Josef van Genabith. 2006. QuestionBank: Creating a Corpus of Parse Annotated Questions. Proceedings of COLING/ACL 2006, Sydney, Australia.
which he released here:
However, the quality of the original annotations is only moderate. In 2011, Chris Manning spent some time improving the parses in QuestionBank, and the results of that work appear here. This corrected version is released as a script that maps the original trees to the corrected ones.
However, I'm sure there are still more errors to be fixed in these trees. Feel free to pass along any that you find.
To use it, you need: Perl (v5), bash shell, Java (1.5+), and Stanford Tregex/Tsurgeon (http://nlp.stanford.edu/software/tregex.shtml). You then need to edit autofix.sh to provide flags to Java so it can find Tregex/Tsurgeon on the classpath (unless it is already there). That is, make JAVAFLAGS be:
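The exact value depends on where you installed Tregex/Tsurgeon. A plausible setting, assuming the distribution's jar is named stanford-tregex.jar and sits in the current directory (adjust the path for your install), would be:

```shell
# Hypothetical classpath flags for autofix.sh -- the jar name and its
# location are assumptions; point -cp at your own Tregex/Tsurgeon install.
JAVAFLAGS="-cp .:stanford-tregex.jar"
```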
Next, you need the original version of QuestionBank from the site linked above. You should then be able to run the code as a filter like this:
./autofix.sh < 4000qs-originalVersion.txt > 4000qs-fixedVersion-1.0.txt
Note one fine point: Just testing a parser against the improved treebank (or even training on some of the material and testing on the rest) will not necessarily give you higher parse accuracy numbers. This is because the original release tended to preserve errors made by statistical parsers trained on WSJ newswire rather than correcting them. Since WSJ material dominates parser training data, parsers typically make errors such as dispreferring WH- categories even when they are correct. Hence, such a result does not affect the claim that the trees in this release are much nearer correct according to Penn Treebank 3 annotation standards.
The scripts in this download are in the public domain. That is not necessarily true of the source data or the tools that the scripts use (Tregex is GPL). Use at your own risk. But the data should give you more correct question parses, according to the Penn Treebank version 3 conventions. (That is, this is "old Treebank" annotation, not the "new Treebank" annotation used in recent projects like OntoNotes, which adds categories such as NML and HYPH.)
There's also an option in the scripts where you can sample the 4000 questions down to 3924, by excluding the ones that appear in the test set used by Laura Rimell et al. in:
Laura Rimell, Stephen Clark and Mark Steedman. 2009. Unbounded Dependency Recovery for Parser Evaluation. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-09), pp. 813-821, Singapore.
Since the first 2000 questions in QuestionBank are TREC questions, while the second 2000 come from Dan Roth's group and ISI (mainly from answers.com), it seems best to sample training and test data from both halves. I'd suggest these splits:
Train: 1-1000, 2001-3000
Dev: 1001-1500, 3001-3500
Test: 1501-2000, 3501-4000
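The splits above can be extracted with a few sed commands, assuming the fixed file has exactly one tree per line (check your copy; multi-line trees would need tree-aware splitting instead). The qb-*.txt output names are just placeholders:

```shell
# Write the suggested train/dev/test splits, drawing from both halves
# of the corpus. Assumes one tree per line in the input file.
split_qb () {
  # $1 = path to the fixed QuestionBank file (one tree per line)
  sed -n '1,1000p;2001,3000p'    "$1" > qb-train.txt  # 2000 trees
  sed -n '1001,1500p;3001,3500p' "$1" > qb-dev.txt    # 1000 trees
  sed -n '1501,2000p;3501,4000p' "$1" > qb-test.txt   # 1000 trees
}
# usage: split_qb 4000qs-fixedVersion-1.0.txt
```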