edu.stanford.nlp.parser.lexparser
Class LexicalizedParser

java.lang.Object
  extended by edu.stanford.nlp.parser.lexparser.LexicalizedParser
All Implemented Interfaces:
ParserQueryFactory, Function<java.util.List<? extends HasWord>,Tree>, java.io.Serializable

public class LexicalizedParser
extends java.lang.Object
implements Function<java.util.List<? extends HasWord>,Tree>, ParserQueryFactory, java.io.Serializable

This class provides the top-level API and command-line interface to a set of reasonably good treebank-trained parsers. The name reflects the main factored parsing model, which provides a lexicalized PCFG parser implemented as a product model of a plain PCFG parser and a lexicalized dependency parser. But you can also run either component parser alone. In particular, it is often useful to do unlexicalized PCFG parsing by using just that component parser.

See the package documentation for more details and examples of use.

For information on invoking the parser from the command line, and for a more detailed list of options, see the main(java.lang.String[]) method.

Note that training on a 1-million-word treebank requires a fair amount of memory. Try the JVM flag -mx1500m to raise the maximum heap size.

Author:
Dan Klein (original version), Christopher Manning (better features, ParserParams, serialization), Roger Levy (internationalization), Teg Grenager (grammar compaction, tokenization, etc.), Galen Andrew (considerable refactoring), John Bauer (made threadsafe)
See Also:
Serialized Form

Field Summary
 BinaryGrammar bg
           
static java.lang.String DEFAULT_PARSER_LOC
           
 DependencyGrammar dg
           
 Lexicon lex
           
 Reranker reranker
           
 Index<java.lang.String> stateIndex
           
 Index<java.lang.String> tagIndex
           
 UnaryGrammar ug
           
 Index<java.lang.String> wordIndex
           
 
Constructor Summary
LexicalizedParser(Lexicon lex, BinaryGrammar bg, UnaryGrammar ug, DependencyGrammar dg, Index<java.lang.String> stateIndex, Index<java.lang.String> wordIndex, Index<java.lang.String> tagIndex, Options op)
           
 
Method Summary
 Tree apply(java.util.List<? extends HasWord> words)
          Converts a Sentence/List/String into a Tree.
 Lexicon getLexicon()
           
 Options getOp()
           
static LexicalizedParser getParserFromFile(java.lang.String parserFileOrUrl, Options op)
           
static LexicalizedParser getParserFromSerializedFile(java.lang.String serializedFileOrUrl)
           
protected static LexicalizedParser getParserFromTextFile(java.lang.String textFileOrUrl, Options op)
           
static LexicalizedParser getParserFromTreebank(Treebank trainTreebank, Treebank secondaryTrainTreebank, double weight, GrammarCompactor compactor, Options op, Treebank tuneTreebank, java.util.List<java.util.List<TaggedWord>> extraTaggedWords)
          A method for training from two different treebanks, the second of which is presumed to be orders of magnitude larger.
 TreebankLangParserParams getTLPParams()
           
 TreePrint getTreePrint()
          Return a TreePrint for formatting parsed output trees.
 LexicalizedParserQuery lexicalizedParserQuery()
           
static LexicalizedParser loadModel()
          Construct a new LexicalizedParser object from a previously serialized grammar read from a System property edu.stanford.nlp.SerializedLexicalizedParser, or a default classpath location (edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz).
static LexicalizedParser loadModel(java.io.ObjectInputStream ois)
          Reads one object from the given ObjectInputStream, which is assumed to be a LexicalizedParser.
static LexicalizedParser loadModel(Options op, java.lang.String... extraFlags)
          Construct a new LexicalizedParser object from a previously serialized grammar read from a System property edu.stanford.nlp.SerializedLexicalizedParser, or a default classpath location (edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz).
static LexicalizedParser loadModel(java.lang.String parserFileOrUrl, Options op, java.lang.String... extraFlags)
          Construct a new LexicalizedParser.
static LexicalizedParser loadModel(java.lang.String parserFileOrUrl, java.lang.String... extraFlags)
           
static LexicalizedParser loadModelFromZip(java.lang.String zipFilename, java.lang.String modelName)
           
static void main(java.lang.String[] args)
          A main program for using the parser with various options.
 Tree parse(java.util.List<? extends HasWord> lst)
          Parses the list of HasWord.
 Tree parse(java.lang.String sentence)
          Will parse the text in sentence as if it represented a single sentence by first processing it with a tokenizer.
 java.util.List<Tree> parseMultiple(java.util.List<? extends java.util.List<? extends HasWord>> sentences)
           
 java.util.List<Tree> parseMultiple(java.util.List<? extends java.util.List<? extends HasWord>> sentences, int nthreads)
          Launches multiple threads that call parse on each of the sentences in order, returning the resulting parse trees in the same order.
 ParserQuery parserQuery()
           
 Tree parseStrings(java.util.List<java.lang.String> lst)
          Will process a list of strings into a list of HasWord and return the parse tree associated with that list.
 Tree parseTree(java.util.List<? extends HasWord> sentence)
          Similar to parse(), but instead of returning an X tree on failure, returns null.
 void saveParserToSerialized(java.lang.String filename)
          Saves this parser to the given filename.
 void saveParserToTextFile(java.lang.String filename)
          Saves this parser to the given filename.
 void setOptionFlags(java.lang.String... flags)
          This will set options to the parser, in a way exactly equivalent to passing in the same sequence of command-line arguments.
static LexicalizedParser trainFromTreebank(java.lang.String treebankPath, java.io.FileFilter filt, Options op)
           
static LexicalizedParser trainFromTreebank(Treebank trainTreebank, GrammarCompactor compactor, Options op)
          Construct a new LexicalizedParser.
static LexicalizedParser trainFromTreebank(Treebank trainTreebank, Options op)
           
 TreebankLanguagePack treebankLanguagePack()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

lex

public Lexicon lex

bg

public BinaryGrammar bg

ug

public UnaryGrammar ug

dg

public DependencyGrammar dg

stateIndex

public Index<java.lang.String> stateIndex

wordIndex

public Index<java.lang.String> wordIndex

tagIndex

public Index<java.lang.String> tagIndex

reranker

public Reranker reranker

DEFAULT_PARSER_LOC

public static final java.lang.String DEFAULT_PARSER_LOC
Constructor Detail

LexicalizedParser

public LexicalizedParser(Lexicon lex,
                         BinaryGrammar bg,
                         UnaryGrammar ug,
                         DependencyGrammar dg,
                         Index<java.lang.String> stateIndex,
                         Index<java.lang.String> wordIndex,
                         Index<java.lang.String> tagIndex,
                         Options op)
Method Detail

getOp

public Options getOp()

getTLPParams

public TreebankLangParserParams getTLPParams()

treebankLanguagePack

public TreebankLanguagePack treebankLanguagePack()

loadModel

public static LexicalizedParser loadModel()
Construct a new LexicalizedParser object from a previously serialized grammar read from a System property edu.stanford.nlp.SerializedLexicalizedParser, or a default classpath location (edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz).
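
The lookup order described above (System property first, default classpath location second) can be sketched in plain Java. This is only an illustration of the documented order, not the library's code: the class and method names are hypothetical, and only the property name and the default path are taken from the description above.

```java
// Sketch only: mirrors the documented lookup order, not the library's implementation.
final class ModelLocationSketch {
    // Property name and default path are quoted from the documentation above.
    static final String PROPERTY = "edu.stanford.nlp.SerializedLexicalizedParser";
    static final String DEFAULT_LOCATION =
            "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz";

    /** Returns the property value if it is set, otherwise the default location. */
    static String resolveModelLocation() {
        return System.getProperty(PROPERTY, DEFAULT_LOCATION);
    }

    public static void main(String[] args) {
        // Run with -Dedu.stanford.nlp.SerializedLexicalizedParser=/path/to/grammar.ser.gz
        // to see the property take precedence over the default.
        System.out.println(resolveModelLocation());
    }
}
```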


loadModel

public static LexicalizedParser loadModel(Options op,
                                          java.lang.String... extraFlags)
Construct a new LexicalizedParser object from a previously serialized grammar read from a System property edu.stanford.nlp.SerializedLexicalizedParser, or a default classpath location (edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz).

Parameters:
op - Options to the parser. These get overwritten by the Options read from the serialized parser; the only thing they appear to determine is the encoding of the grammar, and only if it is a text grammar

loadModel

public static LexicalizedParser loadModel(java.lang.String parserFileOrUrl,
                                          java.lang.String... extraFlags)

loadModel

public static LexicalizedParser loadModel(java.lang.String parserFileOrUrl,
                                          Options op,
                                          java.lang.String... extraFlags)
Construct a new LexicalizedParser. This loads a grammar that was previously assembled and stored as a serialized file.

Parameters:
parserFileOrUrl - Filename/URL to load parser from
op - Options for this parser. These will normally be overwritten by options stored in the file
Throws:
java.lang.IllegalArgumentException - If parser data cannot be loaded

loadModel

public static LexicalizedParser loadModel(java.io.ObjectInputStream ois)
Reads one object from the given ObjectInputStream, which is assumed to be a LexicalizedParser. Throws a ClassCastException if this is not true. The stream is not closed.
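
The contract above (read exactly one object, cast it, leave the stream open) can be illustrated with standard Java serialization. The sketch below uses a hypothetical FakeModel class in place of a real parser and wraps checked exceptions in IllegalStateException; both choices are assumptions made for the example, not a description of the library's behavior.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Sketch only: FakeModel is a hypothetical stand-in for a serialized parser.
final class ReadOneObjectSketch {
    static final class FakeModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        FakeModel(String name) { this.name = name; }
    }

    /** Reads exactly one object and casts it; the stream is left open for the caller. */
    static FakeModel readModel(ObjectInputStream ois) {
        try {
            // The cast throws ClassCastException if the stream held something else.
            return (FakeModel) ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Could not read model", e);
        }
    }

    /** Helper: serializes the given objects in order and opens a stream over them. */
    static ObjectInputStream streamOf(Object... objects) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
                for (Object o : objects) {
                    oos.writeObject(o);
                }
            }
            return new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        try (ObjectInputStream ois = streamOf(new FakeModel("first"), new FakeModel("second"))) {
            System.out.println(readModel(ois).name); // first
            System.out.println(readModel(ois).name); // second: the stream was not closed
        }
    }
}
```

Because the stream is not closed, a caller can read further objects from it afterwards and remains responsible for closing it.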


loadModelFromZip

public static LexicalizedParser loadModelFromZip(java.lang.String zipFilename,
                                                 java.lang.String modelName)

trainFromTreebank

public static LexicalizedParser trainFromTreebank(Treebank trainTreebank,
                                                  GrammarCompactor compactor,
                                                  Options op)
Construct a new LexicalizedParser.

Parameters:
trainTreebank - a treebank to train from

trainFromTreebank

public static LexicalizedParser trainFromTreebank(java.lang.String treebankPath,
                                                  java.io.FileFilter filt,
                                                  Options op)

trainFromTreebank

public static LexicalizedParser trainFromTreebank(Treebank trainTreebank,
                                                  Options op)

apply

public Tree apply(java.util.List<? extends HasWord> words)
Converts a Sentence/List/String into a Tree. If it can't be parsed, it is made into a trivial tree in which each word is attached to a dummy tag ("X") and then to a start nonterminal (also "X"). In all circumstances, the input will be treated as a single sentence to be parsed.

Specified by:
apply in interface Function<java.util.List<? extends HasWord>,Tree>
Parameters:
words - The input sentence (a List of words)
Returns:
A Tree that is the parse tree for the sentence. If the parser fails, a new Tree is synthesized which attaches all words to the root.
Throws:
java.lang.IllegalArgumentException - If argument isn't a List or String
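
To make the shape of the fallback tree concrete, the sketch below builds a bracketed string with an "X" root over one "X" tag per word, as described above. It works on plain strings rather than the library's Tree and HasWord types, and the class name, method name, and exact bracketing format are assumptions for illustration only.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: builds a bracketed string, not a real Tree object.
final class FallbackTreeSketch {
    /** Returns "(X (X w1) (X w2) ...)": a root X over one dummy X tag per word. */
    static String fallbackBracketing(List<String> words) {
        StringBuilder sb = new StringBuilder("(X");
        for (String word : words) {
            sb.append(" (X ").append(word).append(')');
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        System.out.println(fallbackBracketing(Arrays.asList("colorless", "green", "ideas")));
        // prints "(X (X colorless) (X green) (X ideas))"
    }
}
```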

parse

public Tree parse(java.lang.String sentence)
Will parse the text in sentence as if it represented a single sentence by first processing it with a tokenizer.


parseStrings

public Tree parseStrings(java.util.List<java.lang.String> lst)
Will process a list of strings into a list of HasWord and return the parse tree associated with that list.


parse

public Tree parse(java.util.List<? extends HasWord> lst)
Parses the list of HasWord. If the parse fails for some reason, an X tree is returned rather than failing outright.


parseMultiple

public java.util.List<Tree> parseMultiple(java.util.List<? extends java.util.List<? extends HasWord>> sentences)

parseMultiple

public java.util.List<Tree> parseMultiple(java.util.List<? extends java.util.List<? extends HasWord>> sentences,
                                          int nthreads)
Launches multiple threads that call parse on each of the sentences in order, returning the resulting parse trees in the same order.
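
The order-preserving behavior described above can be sketched with a fixed thread pool: tasks are submitted in input order and their futures are collected in that same order, so results line up with inputs regardless of which thread finishes first. This is a generic illustration under that assumption, not the parser's actual threading code; mapInOrder and the stand-in "parse" function are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Sketch only: a generic order-preserving parallel map.
final class OrderedParallelSketch {
    static <I, O> List<O> mapInOrder(List<I> inputs, Function<I, O> fn, int nthreads) {
        ExecutorService pool = Executors.newFixedThreadPool(nthreads);
        try {
            // Submit in input order; futures keep that order.
            List<Future<O>> futures = new ArrayList<>();
            for (I input : inputs) {
                Callable<O> task = () -> fn.apply(input);
                futures.add(pool.submit(task));
            }
            // Collect in the same order, blocking on each result in turn.
            List<O> results = new ArrayList<>();
            for (Future<O> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<String>> sentences = Arrays.asList(
                Arrays.asList("the", "dog", "barks"),
                Arrays.asList("hello"));
        // Hypothetical stand-in for parsing: wraps the words in a flat bracket.
        Function<List<String>, String> fakeParse =
                words -> "(S " + String.join(" ", words) + ")";
        System.out.println(mapInOrder(sentences, fakeParse, 2));
        // prints "[(S the dog barks), (S hello)]"
    }
}
```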


getTreePrint

public TreePrint getTreePrint()
Return a TreePrint for formatting parsed output trees.

Returns:
A TreePrint for formatting parsed output trees.

parseTree

public Tree parseTree(java.util.List<? extends HasWord> sentence)
Similar to parse(), but instead of returning an X tree on failure, returns null.


parserQuery

public ParserQuery parserQuery()
Specified by:
parserQuery in interface ParserQueryFactory

lexicalizedParserQuery

public LexicalizedParserQuery lexicalizedParserQuery()

getParserFromFile

public static LexicalizedParser getParserFromFile(java.lang.String parserFileOrUrl,
                                                  Options op)

getLexicon

public Lexicon getLexicon()

saveParserToSerialized

public void saveParserToSerialized(java.lang.String filename)
Saves this parser to the given filename. If there is an error, a RuntimeIOException is thrown.


saveParserToTextFile

public void saveParserToTextFile(java.lang.String filename)
Saves this parser to the given filename. If there is an error, a RuntimeIOException is thrown.


getParserFromTextFile

protected static LexicalizedParser getParserFromTextFile(java.lang.String textFileOrUrl,
                                                         Options op)

getParserFromSerializedFile

public static LexicalizedParser getParserFromSerializedFile(java.lang.String serializedFileOrUrl)

getParserFromTreebank

public static LexicalizedParser getParserFromTreebank(Treebank trainTreebank,
                                                      Treebank secondaryTrainTreebank,
                                                      double weight,
                                                      GrammarCompactor compactor,
                                                      Options op,
                                                      Treebank tuneTreebank,
                                                      java.util.List<java.util.List<TaggedWord>> extraTaggedWords)
A method for training from two different treebanks, the second of which is presumed to be orders of magnitude larger.

Trees are not read into memory but processed as they are read from disk.

A weight (typically <= 1) can be put on the second treebank.

Parameters:
trainTreebank - A treebank to train from
secondaryTrainTreebank - Another treebank to train from
weight - A weight factor to give the secondary treebank. If the weight is 0.25, each example in the secondaryTrainTreebank will be treated as 1/4 of an example sentence.
compactor - A class for compacting grammars. May be null.
op - Options for how the grammar is built from the treebank
tuneTreebank - A treebank to tune free params on (may be null)
extraTaggedWords - A list of words to add to the Lexicon
Returns:
The trained LexicalizedParser
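
The effect of the weight parameter can be shown with simple counting: each event from the primary treebank adds 1 to its count, while each event from the secondary treebank adds only the weight. The sketch below uses plain strings for rules and a hypothetical weightedCounts helper; how the library actually accumulates its statistics is not specified here, so treat this purely as arithmetic on the documented idea.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: illustrates fractional counting, not the library's training code.
final class WeightedCountSketch {
    static Map<String, Double> weightedCounts(List<String> primary,
                                              List<String> secondary,
                                              double weight) {
        Map<String, Double> counts = new HashMap<>();
        for (String event : primary) {
            counts.merge(event, 1.0, Double::sum);    // full example
        }
        for (String event : secondary) {
            counts.merge(event, weight, Double::sum); // fractional example
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> primary = Arrays.asList("NP -> DT NN");
        List<String> secondary = Arrays.asList(
                "NP -> DT NN", "NP -> DT NN", "NP -> DT NN", "NP -> DT NN");
        Map<String, Double> counts = weightedCounts(primary, secondary, 0.25);
        // One primary example plus four secondary examples at weight 0.25.
        System.out.println(counts.get("NP -> DT NN")); // prints 2.0
    }
}
```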

setOptionFlags

public void setOptionFlags(java.lang.String... flags)
This will set options to the parser, in a way exactly equivalent to passing in the same sequence of command-line arguments. This is a useful convenience method when building a parser programmatically. The options passed in should be specified like command-line arguments, including the initial minus sign.

Notes: This can be used to set parsing-time flags for a serialized parser. You can also still change things serialized in Options, but this will probably degrade parsing performance. The vast majority of command line flags can be passed to this method, but you cannot pass in options that specify the treebank or grammar to be loaded, the grammar to be written, trees or files to be parsed or details of their encoding, nor the TreebankLangParserParams (-tLPP) to use. The TreebankLangParserParams should be set up on construction of a LexicalizedParser, by constructing an Options that uses the required TreebankLangParserParams, and passing that to a LexicalizedParser constructor. Note that despite this method being an instance method, many flags are actually set as static class variables.

Parameters:
flags - Arguments to the parser, for example, {"-outputFormat", "typedDependencies", "-maxLength", "70"}
Throws:
java.lang.IllegalArgumentException - If an unknown flag is passed in
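
The calling convention above (flags written like command-line arguments, each with an initial minus sign, and an IllegalArgumentException for unknown flags) can be sketched with a toy flag reader. The sketch recognizes only the two flags from the example above and assumes each takes exactly one value; the class, the method, and the returned map are hypothetical and say nothing about how the parser stores its options.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: a toy reader for "-flag value" pairs.
final class OptionFlagsSketch {
    static Map<String, String> parseFlags(String... flags) {
        Map<String, String> settings = new LinkedHashMap<>();
        int i = 0;
        while (i < flags.length) {
            String flag = flags[i];
            // Assumption: both example flags consume exactly one value.
            if (flag.equals("-outputFormat") || flag.equals("-maxLength")) {
                if (i + 1 >= flags.length) {
                    throw new IllegalArgumentException("Missing value for " + flag);
                }
                settings.put(flag.substring(1), flags[i + 1]);
                i += 2;
            } else {
                throw new IllegalArgumentException("Unknown option: " + flag);
            }
        }
        return settings;
    }

    public static void main(String[] args) {
        System.out.println(parseFlags("-outputFormat", "typedDependencies", "-maxLength", "70"));
        // prints "{outputFormat=typedDependencies, maxLength=70}"
    }
}
```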

main

public static void main(java.lang.String[] args)
A main program for using the parser with various options. This program can be used for building and serializing a parser from treebank data, for parsing sentences from a file or URL using a serialized or text grammar parser, and (mainly for parser quality testing) for training and testing a parser on a treebank all in one go.

Sample Usages:

If the serializedGrammarPath ends in .gz, then the grammar is written and read as a compressed file (GZip). If the serializedGrammarPath is a URL, starting with http://, then the parser is read from the URL. A fileRange specifies a numeric value that must be included within a filename for it to be used in training or testing (this works well with most current treebanks). It can be specified like a range of pages to be printed, for instance as 200-2199 or 1-300,500-725,9000 or just as 1 (if all your trees are in a single file, just give a dummy argument such as 0 or 1). The parser can write a grammar as either a serialized Java object file or in a text format (or as both), specified with the following options:

java edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] -train trainFilesPath [fileRange] [-saveToSerializedFile grammarPath] [-saveToTextFile grammarPath]

If no files are supplied to parse, then a hardwired sentence is parsed.
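
The fileRange syntax described above (for instance 200-2199 or 1-300,500-725,9000) can be read as a comma-separated list of single numbers and inclusive ranges. The sketch below checks whether a file name falls inside such a specification. It assumes the relevant number is the last run of digits in the file name; that choice, the class name, and the example file names are assumptions for illustration and may differ from what the parser's own file filter does.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: one plausible reading of the fileRange syntax.
final class FileRangeSketch {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    /** True if n falls in any comma-separated piece of spec, e.g. "1-300,500-725,9000". */
    static boolean inRanges(String spec, int n) {
        for (String piece : spec.split(",")) {
            String[] ends = piece.trim().split("-");
            int low = Integer.parseInt(ends[0].trim());
            int high = ends.length > 1 ? Integer.parseInt(ends[1].trim()) : low;
            if (low <= n && n <= high) {
                return true;
            }
        }
        return false;
    }

    /** Assumption: the number that matters is the last run of digits in the file name. */
    static boolean accepts(String spec, String fileName) {
        Matcher m = DIGITS.matcher(fileName);
        String last = null;
        while (m.find()) {
            last = m.group();
        }
        return last != null && inRanges(spec, Integer.parseInt(last));
    }

    public static void main(String[] args) {
        System.out.println(accepts("200-2199", "wsj_0200.mrg"));           // true
        System.out.println(accepts("200-2199", "wsj_0199.mrg"));           // false
        System.out.println(accepts("1-300,500-725,9000", "file9000.txt")); // true
    }
}
```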

In the same position as the verbose flag (-v), many other options can be specified. The most useful to an end user are:

See also the package documentation for more details and examples of use.

Parameters:
args - Command line arguments, as above


Stanford NLP Group