edu.stanford.nlp.parser.lexparser
Class LexicalizedParser

java.lang.Object
  extended by edu.stanford.nlp.parser.lexparser.LexicalizedParser
All Implemented Interfaces:
Function<Object,Tree>

public class LexicalizedParser
extends Object
implements Function<Object,Tree>

This class provides the top-level API and command-line interface to a set of reasonably good treebank-trained parsers. The name reflects the main factored parsing model, which provides a lexicalized PCFG parser implemented as a product model of a plain PCFG parser and a lexicalized dependency parser. But you can also run either component parser alone. In particular, it is often useful to do unlexicalized PCFG parsing by using just that component parser.
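As a rough illustration of what "product model" means here, the sketch below scores a candidate analysis by multiplying the two component probabilities, which is the same as adding their log-probabilities. The class and method names are hypothetical and are not part of this package. The real parser searches over both models jointly instead of re-scoring a fixed candidate list, so this only shows how the two scores combine.

```java
final class ProductModelSketch {
    /** Log-score under a product of two models: log(p1 * p2) = log p1 + log p2. */
    static double factoredLogScore(double pcfgLogProb, double dependencyLogProb) {
        return pcfgLogProb + dependencyLogProb;
    }

    /** Returns the index of the candidate with the highest combined log-score. */
    static int bestCandidate(double[] pcfgLogProbs, double[] dependencyLogProbs) {
        int best = 0;
        for (int i = 1; i < pcfgLogProbs.length; i++) {
            double current = factoredLogScore(pcfgLogProbs[i], dependencyLogProbs[i]);
            double bestSoFar = factoredLogScore(pcfgLogProbs[best], dependencyLogProbs[best]);
            if (current > bestSoFar) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical log-probabilities for two candidate analyses.
        double[] pcfg = {-10.0, -12.0};
        double[] dependency = {-8.0, -3.0};
        System.out.println("best candidate index: " + bestCandidate(pcfg, dependency));
    }
}
```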

See the package documentation for more details and examples of use. See the main method documentation for details of invoking the parser.

Note that training on a 1-million-word treebank requires a fair amount of memory. Try giving the JVM a larger heap, for example with -mx1500m.

Author:
Dan Klein (original version), Christopher Manning (better features, ParserParams, serialization), Roger Levy (internationalization), Teg Grenager (grammar compaction, tokenization, etc.), Galen Andrew (considerable refactoring)

Field Summary
static String DEFAULT_PARSER_LOC
           
 
Constructor Summary
LexicalizedParser()
          Construct a new LexicalizedParser object from a previously serialized grammar read from the System property edu.stanford.nlp.SerializedLexicalizedParser, or a default file location.
LexicalizedParser(ObjectInputStream in)
          Construct a new LexicalizedParser object from a previously assembled grammar read from an ObjectInputStream.
LexicalizedParser(Options op)
          Construct a new LexicalizedParser object from a previously serialized grammar read from a System property edu.stanford.nlp.SerializedLexicalizedParser, or a default file location (/u/nlp/data/lexparser/englishPCFG.ser.gz).
LexicalizedParser(Options op, String[] extraFlags)
           
LexicalizedParser(ParserData pd)
          Construct a new LexicalizedParser object from a previously assembled grammar.
LexicalizedParser(String[] extraFlags)
           
LexicalizedParser(String parserFileOrUrl, boolean isTextGrammar, Options op)
          Construct a new LexicalizedParser.
LexicalizedParser(String treebankPath, FileFilter filt, Options op)
           
LexicalizedParser(String parserFileOrUrl, Options op, String... extraFlags)
          Construct a new LexicalizedParser.
LexicalizedParser(String parserFileOrUrl, String... extraFlags)
           
LexicalizedParser(Treebank trainTreebank, DiskTreebank secondaryTrainTreebank, double weight, GrammarCompactor compactor, Options op)
          Construct a new LexicalizedParser.
LexicalizedParser(Treebank trainTreebank, GrammarCompactor compactor, Options op)
          Construct a new LexicalizedParser.
LexicalizedParser(Treebank trainTreebank, GrammarCompactor compactor, Options op, Treebank tuneTreebank)
          Construct a new LexicalizedParser.
LexicalizedParser(Treebank trainTreebank, Options op)
           
 
Method Summary
 Tree apply(Object in)
          Converts a Sentence/List/String into a Tree.
static Pair<List<Tree>,List<Tree>> getAnnotatedBinaryTreebankFromTreebank(Treebank trainTreebank, Treebank tuneTreebank, Options op)
           
 Lexicon getLexicon()
           
 Options getOp()
           
static ParserData getParserDataFromFile(String parserFileOrUrl, Options op)
           
static ParserData getParserDataFromSerializedFile(String serializedFileOrUrl)
           
protected static ParserData getParserDataFromTextFile(String textFileOrUrl, Options op)
           
protected  ParserData getParserDataFromTreebank(Treebank trainTreebank, DiskTreebank secondaryTrainTreebank, double weight, GrammarCompactor compactor)
          A method for training from two different treebanks, the second of which is presumed to be orders of magnitude larger.
 ParserData getParserDataFromTreebank(Treebank trainTreebank, GrammarCompactor compactor, Treebank tuneTreebank)
           
 TreePrint getTreePrint()
          Return a TreePrint for formatting parsed output trees.
static void main(String[] args)
          A main program for using the parser with various options.
 ParserData parserData()
           
 LexicalizedParserQuery parserQuery()
           
 Tree parseTree(List<? extends HasWord> sentence)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_PARSER_LOC

public static final String DEFAULT_PARSER_LOC

Constructor Detail

LexicalizedParser

public LexicalizedParser()
Construct a new LexicalizedParser object from a previously serialized grammar read from the System property edu.stanford.nlp.SerializedLexicalizedParser, or a default file location.


LexicalizedParser

public LexicalizedParser(String[] extraFlags)

LexicalizedParser

public LexicalizedParser(Options op)
Construct a new LexicalizedParser object from a previously serialized grammar read from a System property edu.stanford.nlp.SerializedLexicalizedParser, or a default file location (/u/nlp/data/lexparser/englishPCFG.ser.gz).

Parameters:
op - Options to the parser. These are overwritten by the Options read from the serialized parser; probably the only thing they determine is the encoding of the grammar, and only if it is a text grammar.
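The lookup order described above (System property first, then a default location) can be sketched with the standard two-argument System.getProperty call. The class and method names below are hypothetical, and whether the real constructor resolves the location in exactly this way is an assumption. The property name and default path are the ones quoted in this documentation.

```java
final class ParserLocationSketch {
    // Property name quoted in the constructor documentation.
    static final String PROPERTY = "edu.stanford.nlp.SerializedLexicalizedParser";
    // Default location quoted in the constructor documentation.
    static final String DEFAULT_LOCATION = "/u/nlp/data/lexparser/englishPCFG.ser.gz";

    /** Returns the property value if it is set, otherwise the default location. */
    static String resolveParserLocation() {
        return System.getProperty(PROPERTY, DEFAULT_LOCATION);
    }

    public static void main(String[] args) {
        System.out.println(resolveParserLocation());
    }
}
```

Under this assumption, a different grammar could be selected without code changes by passing the standard JVM flag -Dedu.stanford.nlp.SerializedLexicalizedParser=somePath when starting Java.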

LexicalizedParser

public LexicalizedParser(Options op,
                         String[] extraFlags)

LexicalizedParser

public LexicalizedParser(String parserFileOrUrl,
                         String... extraFlags)

LexicalizedParser

public LexicalizedParser(String parserFileOrUrl,
                         Options op,
                         String... extraFlags)
Construct a new LexicalizedParser. This loads a grammar that was previously assembled and stored as a serialized file.

Parameters:
parserFileOrUrl - Filename/URL to load parser from
op - Options for this parser. These will normally be overwritten by options stored in the file
Throws:
IllegalArgumentException - If parser data cannot be loaded

LexicalizedParser

public LexicalizedParser(String parserFileOrUrl,
                         boolean isTextGrammar,
                         Options op)
Construct a new LexicalizedParser. This loads a grammar that was previously assembled and stored.

Throws:
IllegalArgumentException - If parser data cannot be loaded

LexicalizedParser

public LexicalizedParser(ParserData pd)
Construct a new LexicalizedParser object from a previously assembled grammar.

Parameters:
pd - A ParserData object (not null)

LexicalizedParser

public LexicalizedParser(ObjectInputStream in)
                  throws Exception
Construct a new LexicalizedParser object from a previously assembled grammar read from an ObjectInputStream. Exactly one object (a ParserData) is read from the stream. The stream is not closed.

Parameters:
in - The ObjectInputStream
Throws:
Exception
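Because this constructor reads one object and leaves the stream open, closing the stream is the caller's job. The sketch below shows that pattern using only the standard library: a String stands in for the ParserData object, an in-memory byte array stands in for a grammar file, and the gzip wrapping is an assumption based on the .ser.gz suffix of the default location. All class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class SingleObjectStreamSketch {
    /** Reads exactly one object and leaves the stream open, as the constructor is documented to do. */
    static Object readOne(ObjectInputStream in) throws IOException, ClassNotFoundException {
        return in.readObject();
    }

    /** Serializes one object into gzip-compressed bytes (a stand-in for a .ser.gz file). */
    static byte[] writeGzipped(Object payload) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(bytes))) {
            out.writeObject(payload);
        }
        return bytes.toByteArray();
    }

    /** Writes the payload, reads it back, and closes the input stream on the caller's side. */
    static Object roundTrip(Object payload) {
        try {
            byte[] data = writeGzipped(payload);
            try (ObjectInputStream in = new ObjectInputStream(
                    new GZIPInputStream(new ByteArrayInputStream(data)))) {
                return readOne(in); // try-with-resources closes the stream afterwards
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("stand-in for ParserData"));
    }
}
```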

LexicalizedParser

public LexicalizedParser(Treebank trainTreebank,
                         GrammarCompactor compactor,
                         Options op)
Construct a new LexicalizedParser.

Parameters:
trainTreebank - a treebank to train from

LexicalizedParser

public LexicalizedParser(String treebankPath,
                         FileFilter filt,
                         Options op)

LexicalizedParser

public LexicalizedParser(Treebank trainTreebank,
                         Options op)

LexicalizedParser

public LexicalizedParser(Treebank trainTreebank,
                         GrammarCompactor compactor,
                         Options op,
                         Treebank tuneTreebank)
Construct a new LexicalizedParser.

Parameters:
trainTreebank - a treebank to train from
compactor - A class for compacting grammars. May be null.
op - Options for how the grammar is built from the treebank
tuneTreebank - a treebank to tune free params on (may be null)

LexicalizedParser

public LexicalizedParser(Treebank trainTreebank,
                         DiskTreebank secondaryTrainTreebank,
                         double weight,
                         GrammarCompactor compactor,
                         Options op)
Construct a new LexicalizedParser.

Parameters:
trainTreebank - a treebank to train from
secondaryTrainTreebank - another treebank to train from
weight - a weight factor to give the secondary treebank. If the weight is 0.25, each example in the secondaryTrainTreebank will be treated as 1/4 of an example sentence.
compactor - A class for compacting grammars. May be null.
op - Options for how the grammar is built from the treebank

Method Detail

getOp

public Options getOp()

apply

public Tree apply(Object in)
Converts a Sentence/List/String into a Tree. If it can't be parsed, it is made into a trivial tree in which each word is attached to a dummy tag ("X") and then to a start nonterminal (also "X").

Specified by:
apply in interface Function<Object,Tree>
Parameters:
in - The input Sentence/List/String
Returns:
A Tree that is the parse tree for the sentence. If the parser fails, a new Tree is synthesized which attaches all words to the root.
Throws:
IllegalArgumentException - If argument isn't a List or String
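To make the fallback shape concrete, the sketch below builds the bracketed form of such a trivial tree as a plain string: every word sits under a dummy "X" tag, and all of those sit under an "X" root. It does not use the Tree class. The class and method names are hypothetical, and the exact printed form of the parser's fallback tree is an assumption.

```java
import java.util.Arrays;
import java.util.List;

final class FallbackTreeSketch {
    /** Builds a bracketing with each word under a dummy "X" tag, all under an "X" root. */
    static String fallbackBracketing(List<String> words) {
        StringBuilder sb = new StringBuilder("(X");
        for (String word : words) {
            sb.append(" (X ").append(word).append(')');
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        // prints (X (X colorless) (X green) (X ideas))
        System.out.println(fallbackBracketing(Arrays.asList("colorless", "green", "ideas")));
    }
}
```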

getTreePrint

public TreePrint getTreePrint()
Return a TreePrint for formatting parsed output trees.

Returns:
A TreePrint for formatting parsed output trees.

parseTree

public Tree parseTree(List<? extends HasWord> sentence)

parserQuery

public LexicalizedParserQuery parserQuery()

getParserDataFromFile

public static ParserData getParserDataFromFile(String parserFileOrUrl,
                                               Options op)

parserData

public ParserData parserData()

getLexicon

public Lexicon getLexicon()

getParserDataFromTextFile

protected static ParserData getParserDataFromTextFile(String textFileOrUrl,
                                                      Options op)

getParserDataFromSerializedFile

public static ParserData getParserDataFromSerializedFile(String serializedFileOrUrl)

getAnnotatedBinaryTreebankFromTreebank

public static Pair<List<Tree>,List<Tree>> getAnnotatedBinaryTreebankFromTreebank(Treebank trainTreebank,
                                                                                 Treebank tuneTreebank,
                                                                                 Options op)
Returns:
A pair (binaryTrainTreebank, binaryTuneTreebank).

getParserDataFromTreebank

public final ParserData getParserDataFromTreebank(Treebank trainTreebank,
                                                  GrammarCompactor compactor,
                                                  Treebank tuneTreebank)

getParserDataFromTreebank

protected final ParserData getParserDataFromTreebank(Treebank trainTreebank,
                                                     DiskTreebank secondaryTrainTreebank,
                                                     double weight,
                                                     GrammarCompactor compactor)
A method for training from two different treebanks, the second of which is presumed to be orders of magnitude larger.

Trees are not read into memory but processed as they are read from disk.

A weight (typically <= 1) can be put on the second treebank.
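One way to read the weight is as a fractional count: with a weight of 0.25, four occurrences of a rule in the secondary treebank contribute as much as one occurrence in the primary treebank. The sketch below shows that arithmetic on plain rule strings. The class, the method names, and the use of rule strings as map keys are hypothetical simplifications, not the data structures this class actually uses.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class WeightedCountSketch {
    /** Adds the given weight to the count of every rule occurrence in the list. */
    static void addCounts(Map<String, Double> counts, List<String> rules, double weight) {
        for (String rule : rules) {
            counts.merge(rule, weight, Double::sum);
        }
    }

    /** Primary occurrences count as 1.0 each, secondary occurrences as `weight` each. */
    static Map<String, Double> combinedCounts(List<String> primary, List<String> secondary, double weight) {
        Map<String, Double> counts = new HashMap<>();
        addCounts(counts, primary, 1.0);
        addCounts(counts, secondary, weight);
        return counts;
    }

    public static void main(String[] args) {
        List<String> primary = Arrays.asList("NP -> DT NN", "VP -> VBD NP");
        List<String> secondary = Arrays.asList("NP -> DT NN", "NP -> DT NN", "NP -> DT NN", "NP -> DT NN");
        // "NP -> DT NN" ends up with 1.0 + 4 * 0.25 = 2.0
        System.out.println(combinedCounts(primary, secondary, 0.25));
    }
}
```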


main

public static void main(String[] args)
A main program for using the parser with various options. This program can be used for building and serializing a parser from treebank data, for parsing sentences from a file or URL using a serialized or text grammar parser, and (mainly for parser quality testing) for training and testing a parser on a treebank all in one go.

Sample Usages:
java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] -train trainFilesPath fileRange -saveToSerializedFile serializedGrammarFilename

java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] -train trainFilesPath fileRange -testTreebank testFilePath fileRange

java -mx512m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] serializedGrammarPath filename+

java -mx512m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] -loadFromSerializedFile serializedGrammarPath -testTreebank testFilePath fileRange

If the serializedGrammarPath ends in .gz, then the grammar is written and read as a compressed file (GZip). If the serializedGrammarPath is a URL, starting with http://, then the parser is read from the URL. A fileRange specifies a numeric value that must be included within a filename for it to be used in training or testing (this works well with most current treebanks). It can be specified like a range of pages to be printed, for instance as 200-2199 or 1-300,500-725,9000 or just as 1 (if all your trees are in a single file, just give a dummy argument such as 0 or 1). The parser can write a grammar as either a serialized Java object file or in a text format (or as both), specified with the following options:

java edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] -train trainFilesPath [fileRange] [-saveToSerializedFile grammarPath] [-saveToTextFile grammarPath]

If no files are supplied to parse, then a hardwired sentence is parsed.
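The fileRange notation described above (for instance 1-300,500-725,9000) can be interpreted roughly as in the following sketch. The class name and methods are hypothetical and this is not the filter the parser itself uses. In particular, comparing the last run of digits in the file name is an assumption, since the text only says that the number must be included within the file name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FileRangeSketch {
    private static final Pattern DIGITS = Pattern.compile("\\d+");
    private final List<int[]> ranges = new ArrayList<>();

    /** Parses a spec such as "200-2199", "1-300,500-725,9000" or "1". */
    FileRangeSketch(String spec) {
        for (String part : spec.split(",")) {
            String[] bounds = part.trim().split("-");
            int low = Integer.parseInt(bounds[0].trim());
            int high = bounds.length > 1 ? Integer.parseInt(bounds[1].trim()) : low;
            ranges.add(new int[] {low, high});
        }
    }

    /** True if n falls inside any of the parsed ranges (bounds inclusive). */
    boolean containsNumber(int n) {
        for (int[] range : ranges) {
            if (range[0] <= n && n <= range[1]) {
                return true;
            }
        }
        return false;
    }

    /** Assumption: the last run of digits in the file name is the number that is checked. */
    boolean accepts(String fileName) {
        Matcher matcher = DIGITS.matcher(fileName);
        int last = -1;
        while (matcher.find()) {
            last = Integer.parseInt(matcher.group());
        }
        return last >= 0 && containsNumber(last);
    }

    public static void main(String[] args) {
        FileRangeSketch range = new FileRangeSketch("1-300,500-725,9000");
        for (String name : new String[] {"wsj_0250.mrg", "wsj_0400.mrg", "wsj_9000.mrg"}) {
            System.out.println(name + " accepted: " + range.accepts(name));
        }
    }
}
```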

In the same position as the verbose flag (-v), many other options can be specified. The most useful to an end user are:

See also the package documentation for more details and examples of use.

Parameters:
args - Command line arguments, as above


Stanford NLP Group