This package contains implementations of three parsers for natural language text: an accurate unlexicalized probabilistic context-free grammar (PCFG) parser, a lexical dependency parser, and a factored, lexicalized probabilistic context-free grammar parser, which does joint inference over the first two. For many purposes, we recommend just using the unlexicalized PCFG. With a well-engineered grammar (as supplied), it is fast, accurate, and requires much less memory; in many circumstances, where lexical preferences are unavailable or unreliable across domains or genres, it performs just as well as a lexicalized parser. However, the factored parser will sometimes give greater accuracy through its knowledge of lexical dependencies. Using the dependency parser by itself is not very useful.
The factored parser and the unlexicalized PCFG parser are described in:
Dan Klein and Christopher D. Manning. 2003. Fast Exact Inference with a Factored Model for Natural Language Parsing. Advances in Neural Information Processing Systems 15 (NIPS 2002).
Dan Klein and Christopher D. Manning. 2003. Accurate Unlexicalized Parsing. Proceedings of the 41st Meeting of the Association for Computational Linguistics (ACL 2003).
All the internal guts of the parser are in one file, FactoredParser.java, and are not exposed as public. LexicalizedParser provides a simple interface for either training a parser from a treebank or parsing text using an already serialized grammar.
You need Java (preferably JDK 1.4+) installed on your system, with the java command on your PATH.
You need a machine with a fair amount of memory. The memory required depends on the choice of parser, the size of the grammar, and other factors such as the presence of numerous unknown words. To run the PCFG parser on sentences of up to 40 words you need 100 MB of memory; to handle longer sentences you need more (to parse sentences of up to 100 words, you need 400 MB). For the factored parser, 500-600 MB is needed to deal with sentences of up to 40 words (which are quite typical in newswire!). Training a new lexicalized parser requires about 1500 MB of memory; less for a PCFG.
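For example, the factored English parser might be run with a roughly 600 MB heap like this (the model and input file names are just those used in the examples below):
java -mx600m edu.stanford.nlp.parser.lexparser.LexicalizedParser serializedFactoredParser.gz test.txt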
You need a serialized parser model (grammars, lexicon, etc.). Four are provided (compressed), in /u/nlp/data/lexparser for local users, or in the root directory of the distributed version: serializedFactoredParser.gz and serializedPCFGParser.gz for English, and serializedChineseFactoredParser.gz and serializedChinesePCFGParser.gz for Chinese.
And you need the parser code accessible. This can be done by having the supplied javanlp.jar in your CLASSPATH.
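For example, on a Unix system with a Bourne-compatible shell (the jar location is just an illustration):
export CLASSPATH=/path/to/javanlp.jar:$CLASSPATH
or, equivalently, add -cp javanlp.jar to the java commands shown below.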
Then if you have some sentences in test.txt (as plain text), the following commands should work.
Parsing a local text file:
java -mx100m edu.stanford.nlp.parser.lexparser.LexicalizedParser
serializedPCFGParser.gz test.txt
Parsing a document over the web:
java -mx100m edu.stanford.nlp.parser.lexparser.LexicalizedParser
serializedPCFGParser.gz http://nlp.stanford.edu/~danklein/project-parsing.shtml
NB: This program just does very rudimentary stripping of HTML tags, and
so it'll work okay on plain text web pages, but it won't work
adequately on most complex commercial script-driven pages.
Parsing a Chinese sentence (in the default input encoding of GB18030 -- and you'll need the right fonts to see the output correctly):
java -mx100m edu.stanford.nlp.parser.lexparser.LexicalizedParser -tLPP
edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams
serializedChinesePCFGParser.gz chinese-onesent
or for Unicode (UTF-8) format files:
java -mx100m edu.stanford.nlp.parser.lexparser.LexicalizedParser -tLPP
edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams
-encoding UTF-8 serializedChinesePCFGParser.gz chinese-onesent-utf
The program has many options. The most useful end-user option is -maxLength n, which determines the maximum length of sentence that the parser will parse. Longer sentences are skipped, with a message printed to stderr.
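For instance, to have the PCFG parser skip sentences over 40 words (following the option placement used in the examples above):
java -mx100m edu.stanford.nlp.parser.lexparser.LexicalizedParser -maxLength 40 serializedPCFGParser.gz test.txt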
The parser supports many different input formats: tokenized or not, divided into sentences or not, and tagged or not.
The input may be tokenized or not, and users may supply their own tokenizers. By default, the input is assumed not to be tokenized; if the input is tokenized, supply the option -tokenized. If the input is not tokenized, you may supply the name of a tokenizer class with -tokenizer tokenizerClassName; otherwise the default tokenizer (edu.stanford.nlp.processor.PTBTokenizer) is used.
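For example, assuming test.txt already contains Penn Treebank-style tokens separated by whitespace, a command along these lines should work:
java -mx100m edu.stanford.nlp.parser.lexparser.LexicalizedParser -tokenized serializedPCFGParser.gz test.txt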
The input may already have been split into sentences or not. By default, the input is assumed not to be split; if sentences are split, supply the option -sentences delimitingToken, where the delimiting token may be any string. If the delimiting token is sDelimited, the parser will accept input in which sentences are marked XML-style with <s> ... </s> (the same format as the input to Eugene Charniak's parser). If the delimiting token is newline, the parser will assume that each line of the file is a sentence.
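For example, for a file with one sentence per line (the file name here is purely illustrative):
java -mx100m edu.stanford.nlp.parser.lexparser.LexicalizedParser -sentences newline serializedPCFGParser.gz one-sentence-per-line.txt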
Finally, the input may be tagged or not. If it is tagged, the parser assumes that words and tags are separated by a non-whitespace separating character such as '/' or '_'. You may supply the option -tagSeparator tagSeparator to specify a tag separator; otherwise the default '/' is used.
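For example, for input whose words carry tags like The_DT dog_NN barks_VBZ ._. (the file name is again only an illustration, and it is assumed here that tagged input is also supplied pre-tokenized):
java -mx100m edu.stanford.nlp.parser.lexparser.LexicalizedParser -tokenized -tagSeparator _ serializedPCFGParser.gz tagged-sentences.txt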
We do not at present provide a Chinese word segmenter. We assume that Chinese input has already been word-segmented according to Penn Chinese Treebank conventions. Choosing Chinese with -tLPP edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams makes space-separated words the default tokenization.
LexicalizedParser would more usually be called programmatically. It implements a couple of useful interfaces that provide for simple use: edu.stanford.nlp.parser.ViterbiParser and edu.stanford.nlp.process.Appliable.
The following simple class shows typical usage:
import java.util.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

class Demo {
  public static void main(String[] args) {
    LexicalizedParser lp = new LexicalizedParser("serializedPCFGParser.gz");
    String[] sent = { "This", "is", "an", "easy", "sentence", "." };
    Tree parse = (Tree) lp.apply(Arrays.asList(sent));
    parse.pennPrint();
    System.out.println(parse.dependencies(new CollinsHeadFinder()));
  }
}
In a usage such as this, the parser expects sentences already tokenized according to Penn Treebank conventions. For arbitrary text, prior processing must be done to achieve such tokenization (a simple example of doing this is provided in the main method of LexicalizedParser).
Implementation notes. The current version uses class-level (static) objects as temporary storage, to avoid short-lived object creation, and as global numberer spaces. Because of this, the parser does not support concurrent use from multiple threads.
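If several threads need to parse with one loaded grammar, a simple workaround is to funnel every call through a single synchronized wrapper, so that only one thread is ever inside the parser at a time. The following is a minimal sketch, not part of the distributed code; it reuses only the constructor and apply call shown in the Demo class above.

import java.util.List;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

// Hypothetical wrapper class (not part of this package): serializes all
// access to one shared parser instance, since the parser is not thread-safe.
class SharedParser {
  private final LexicalizedParser lp;

  SharedParser(String grammarFile) {
    lp = new LexicalizedParser(grammarFile);
  }

  // Only one thread at a time may run the underlying parser.
  synchronized Tree parse(List sentence) {
    return (Tree) lp.apply(sentence);
  }
}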
@author Dan Klein
@author Christopher Manning
@author Roger Levy
@author Teg Grenager