This package contains implementations of three parsers for natural language text: an accurate unlexicalized probabilistic context-free grammar (PCFG) parser, a lexical dependency parser, and a factored, lexicalized probabilistic context-free grammar parser, which does joint inference over the first two. For many purposes, we recommend just using the unlexicalized PCFG. With a well-engineered grammar (as supplied), it is fast and accurate, and requires much less memory; moreover, in the many circumstances where lexical preferences are unavailable or inaccurate across domains or genres, it will perform just as well as a lexicalized parser. However, the factored parser will sometimes provide greater accuracy through its knowledge of lexical dependencies. Using the dependency parser by itself is not very useful, since its accuracy is much lower.

The factored parser and the unlexicalized PCFG parser are described in:

The factored parser uses a product-of-experts model, in which the preferences of an unlexicalized PCFG parser and a lexicalized dependency parser are combined by a third parser, which does exact search using precalculated A* outside estimates.

Much of the internal machinery of the parser is in one file, FactoredParser.java, and is not exposed as public.

LexicalizedParser provides a simple interface for either training a parser from a treebank or parsing text using a saved parser.

The parser has been ported to multiple languages. German and Chinese grammars are included. The Chinese parser is described in:

End user usage

Requirements

You need Java (JDK 1.4+) installed on your system, with the java command on your PATH where commands are looked for.

You need a machine with a fair amount of memory. The memory required depends on the choice of parser, the size of the grammar, and other factors such as the presence of numerous unknown words. To run the PCFG parser on sentences of up to 40 words, you need 120 MB of memory. To handle longer sentences, you need more (to parse sentences of up to 100 words, you need 400 MB). Running the factored parser on sentences of up to 40 words (which are quite typical in newswire!) requires 600 MB. Training a new lexicalized parser requires about 1500 MB of memory; much less is needed for training a PCFG.

You need a saved parser model (grammars, lexicon, etc.), which can be represented either as a text file or in a binary (Java serialized object) representation. A number of models are provided, some compressed (in /u/nlp/data/lexparser for local users, or in the root directory of the distributed version). For instance, there are englishFactored.ser.gz and englishPCFG.ser.gz for English, and chineseFactored.ser.gz and chinesePCFG.ser.gz for Chinese.

Finally, you need the parser code accessible. This can be arranged by having the supplied javanlp.jar in your CLASSPATH. Then, if you have some sentences in test.txt (as plain text), the following commands should work.

Command line usage

Parsing a local text file:

java -mx120m edu.stanford.nlp.parser.lexparser.LexicalizedParser englishPCFG.ser.gz test.txt

Parsing a document over the web:

java -mx120m edu.stanford.nlp.parser.lexparser.LexicalizedParser englishPCFG.ser.gz http://nlp.stanford.edu/~danklein/project-parsing.shtml
NB: This program does only very rudimentary stripping of HTML tags, so it will work okay on plain text web pages, but it won't work adequately on most complex, script-driven commercial pages. If you want to handle these, you'll need to provide your own preprocessor and then call the parser on its output.
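As a rough illustration of such a preprocessing step, the following sketch strips script/style blocks and tags with regular expressions before the text would be handed to the parser. The class and method names are hypothetical, and this is not the parser's own HTML handling; a real preprocessor for commercial pages would need to do considerably more.

```java
public class HtmlStripper {
  /** Removes script/style blocks and all remaining tags, collapsing whitespace. */
  public static String strip(String html) {
    // Drop whole <script>...</script> and <style>...</style> blocks first,
    // so their contents don't leak into the extracted text.
    String text = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ");
    // Then drop any remaining tags and normalize the whitespace.
    text = text.replaceAll("(?s)<[^>]+>", " ");
    return text.replaceAll("\\s+", " ").trim();
  }

  public static void main(String[] args) {
    System.out.println(strip("<html><body><p>An easy sentence.</p></body></html>"));
  }
}
```

The output of strip() could then be written to a file (or passed in programmatically) for parsing.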

Parsing a Chinese sentence (in the default input encoding of GB18030 -- and you'll need the right fonts to see the output correctly):

java -mx120m edu.stanford.nlp.parser.lexparser.LexicalizedParser -tLPP edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams chinesePCFG.ser.gz chinese-onesent
or for Unicode (UTF-8) format files:
java -mx120m edu.stanford.nlp.parser.lexparser.LexicalizedParser -tLPP edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams -encoding UTF-8 chinesePCFG.ser.gz chinese-onesent-utf

Command-line options

The program has many options. The most useful end-user option is -maxLength n, which determines the maximum length of sentence that the parser will parse. Longer sentences are skipped, with a message printed to stderr.

Input formatting and tokenization options

The parser supports many different input formats: tokenized/not, sentences/not, and tagged/not.

The input may be tokenized or not, and users may supply their own tokenizers. By default, the input is assumed not to be tokenized; if the input is tokenized, supply the option -tokenized. If the input is not tokenized, you may supply the name of a tokenizer class with -tokenizer tokenizerClassName; otherwise the default tokenizer (edu.stanford.nlp.process.PTBTokenizer) is used. This tokenizer should perform well on typical plain newswire-style text.

The input may have already been split into sentences or not. The input is by default assumed to be not split; if sentences are split, supply the option -sentences delimitingToken, where the delimiting token may be any string. If the delimiting token is sDelimited the parser will accept input in which sentences are marked XML-style with <s> ... </s> (the same format as the input to Eugene Charniak's parser). If the delimiting token is newline the parser will assume that each line of the file is a sentence.
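To make the sDelimited convention concrete, the hypothetical snippet below pulls out sentences marked XML-style with &lt;s&gt; ... &lt;/s&gt;. This is illustrative only, not the parser's own input-reading code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SentenceExtractor {
  /** Returns the text between each <s> and </s> pair, one sentence per element. */
  public static List<String> extract(String input) {
    List<String> sentences = new ArrayList<String>();
    // Non-greedy match so each <s>...</s> pair yields exactly one sentence.
    Matcher m = Pattern.compile("<s>(.*?)</s>", Pattern.DOTALL).matcher(input);
    while (m.find()) {
      sentences.add(m.group(1).trim());
    }
    return sentences;
  }

  public static void main(String[] args) {
    System.out.println(extract("<s> This is one . </s> <s> This is two . </s>"));
  }
}
```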

Finally, the input may be tagged or not. If it is tagged, the program assumes that words and tags are joined by a non-whitespace separator character such as '/' or '_'. You may supply the option -tagSeparator tagSeparator to specify a separator; otherwise the default '/' is used.
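A minimal sketch of splitting a word/TAG token follows. The class name is hypothetical, and splitting on the last occurrence of the separator (so that words which themselves contain the separator stay intact) is an assumption here, not necessarily what the parser does internally.

```java
public class TaggedToken {
  /**
   * Splits a "word/TAG" token on the last occurrence of the separator,
   * so a word containing the separator (e.g. "1/2/CD") keeps it intact.
   * Returns {word, tag}, or {token, null} if no separator is present.
   */
  public static String[] split(String token, char tagSeparator) {
    int i = token.lastIndexOf(tagSeparator);
    if (i < 0) {
      return new String[] { token, null };
    }
    return new String[] { token.substring(0, i), token.substring(i + 1) };
  }

  public static void main(String[] args) {
    String[] wt = split("butter/VB", '/');
    System.out.println(wt[0] + " tagged as " + wt[1]);
  }
}
```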

For example, you can parse the example file with partial POS-tagging with this command:

java edu.stanford.nlp.parser.lexparser.LexicalizedParser -maxLength 20 -sentences newline -tokenized -tagSeparator / serializedPCFGParser.gz pos-sentences.txt

There are some restrictions on the interpretation of POS-tagged input:

For the examples in pos-sentences.txt:

  1. This sentence is parsed correctly with no tags given.
  2. It is also parsed correctly when the parser is told that butter is a verb.
  3. You get a different, worse parse when telling it that butter is a noun.
  4. You get the same parse as in 1. with all tags correctly supplied.
  5. The parser won't accept can as a VB, but does accept butter as a noun, so you get the same parse as in 3.
  6. People can butter can be an NP.
  7. Most words can be NN, but not common function words like their, with, or a.
Note that if the program is reading tags correctly, they aren't printed in the sentence it says it is parsing; only the words are printed there.

We do not at present provide a Chinese word segmenter. We assume that Chinese input has already been word-segmented according to Penn Chinese Treebank conventions. Choosing Chinese with -tLPP edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams makes space-separated words the default tokenization.

Programmatic usage

LexicalizedParser can be easily called within a larger application. It implements a couple of useful interfaces that provide for simple use: edu.stanford.nlp.parser.ViterbiParser and edu.stanford.nlp.process.Function. The following simple class shows typical usage:

import java.util.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

class Demo {
  public static void main(String[] args) {
    LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");
    String[] sent = { "This", "is", "an", "easy", "sentence", "." };
    Tree parse = (Tree) lp.apply(Arrays.asList(sent));
    parse.pennPrint();
    System.out.println(parse.dependencies(new CollinsHeadFinder()));
  }
}

In a usage such as this, the parser expects sentences already tokenized according to Penn Treebank conventions. For arbitrary text, prior processing must be done to achieve such tokenization (the main method of LexicalizedParser provides an example of doing this).
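As a very rough illustration of such pre-tokenization, the sketch below separates common sentence punctuation from adjacent words. It is a stand-in of our own devising, not PTBTokenizer, and it handles none of the harder Penn Treebank conventions (quotes, contractions, hyphenation, etc.); for real text, use the parser's supplied tokenizer as shown in the main method of LexicalizedParser.

```java
import java.util.Arrays;
import java.util.List;

public class SimpleTokenizer {
  /**
   * Crude whitespace-and-punctuation tokenizer: pads common punctuation
   * with spaces, then splits on whitespace. Illustrative only.
   */
  public static List<String> tokenize(String sentence) {
    String spaced = sentence.replaceAll("([.,!?;:])", " $1 ");
    return Arrays.asList(spaced.trim().split("\\s+"));
  }

  public static void main(String[] args) {
    System.out.println(tokenize("This is an easy sentence."));
  }
}
```

The resulting word list could then be passed to lp.apply(...) as in the Demo class above.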

Implementation notes

The current version uses class objects as temporary objects to avoid short-lived object creation, and as global numberer spaces. Because of this, the parser doesn't support concurrent usage in multiple threads.
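If a multi-threaded application must call the parser, one workable pattern is to serialize all parse requests through a single lock, so that only one thread is ever inside the parser at a time. The sketch below shows the pattern with a placeholder body standing in for the real lp.apply(...) call; the class name is hypothetical.

```java
public class ParserGate {
  // Single lock guarding all parser calls: since the parser keeps state in
  // class-level structures, only one thread may parse at a time.
  private static final Object PARSER_LOCK = new Object();

  /** Replace the body with a call to the real parser, e.g. lp.apply(words). */
  static String parse(String sentence) {
    synchronized (PARSER_LOCK) {
      return "(parsed: " + sentence + ")"; // placeholder for real parsing work
    }
  }

  public static void main(String[] args) {
    System.out.println(parse("This is an easy sentence ."));
  }
}
```

This gives thread safety at the cost of throughput: parses from different threads are executed strictly one after another.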

Writing and reading trained parsers to and from files

A trained parser consists of grammars, lexicons, and option values. Once a parser has been trained, it may be written to a file in one of two formats: binary serialized Java objects, or human-readable text data. A parser can also be quickly reconstructed (either programmatically or at the command line) from files containing a parser in either of these formats.

The binary serialized Java objects format is created using standard tools provided by the java.io package; it is binary, not human-readable text. To train and then save a parser as a binary serialized objects file, use a command line invocation of the form:

java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser -train trainFilePath start stop -saveToSerializedFile outputFilePath

The text data format is human readable and modifiable, and consists of four sections, appearing in the following order:

Each section is headed by a line consisting of multiple asterisks (*) and the name of the section. Note that the file format does not support rules of arbitrary arity, only binary and unary rules. To train and then save a parser as a text data file, use a command line invocation of the form:
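To illustrate the header convention, the following sketch splits such a text file into named sections, assuming each header is a run of asterisks followed by the section name. The class and the section names used in the example are our own illustrative choices, not taken from the actual file format.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class GrammarSections {
  /**
   * Splits a text-format grammar into named sections, assuming each section
   * starts with a header line of asterisks followed by the section name
   * (e.g. "***** LEXICON"). Lines before the first header are ignored.
   */
  public static Map<String, StringBuilder> split(String fileText) {
    Map<String, StringBuilder> sections = new LinkedHashMap<String, StringBuilder>();
    StringBuilder current = null;
    for (String line : fileText.split("\n")) {
      if (line.startsWith("*")) {
        // Header line: strip the asterisks and surrounding space to get the name.
        String name = line.replaceFirst("^\\*+\\s*", "");
        current = new StringBuilder();
        sections.put(name, current);
      } else if (current != null) {
        current.append(line).append('\n');
      }
    }
    return sections;
  }

  public static void main(String[] args) {
    String text = "***** OPTIONS\nfoo bar\n***** LEXICON\nword tag count\n";
    System.out.println(split(text).keySet());
  }
}
```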

java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser -train trainFilePath start stop -saveToTextFile outputFilePath

To parse a file with a saved parser, either in text data or serialized data format, use a command line invocation of the following form:

java -mx500m edu.stanford.nlp.parser.lexparser.LexicalizedParser parserFilePath test.txt
@author Dan Klein
@author Christopher Manning
@author Roger Levy
@author Teg Grenager