public class LexicalizedParser extends ParserGrammar implements java.io.Serializable
See the package documentation for more details and examples of use.
For information on invoking the parser from the command line, and for
a more detailed list of options, see the main(java.lang.String[])
method.
Note that training on a 1 million word treebank requires a fair amount of memory to run. Try -mx1500m to increase the memory allocated by the JVM.
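As a minimal sketch of typical programmatic use (assuming the Stanford parser jar and the englishPCFG model are on the classpath; the model path is the default location named elsewhere on this page):

```java
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.Sentence;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.trees.Tree;

public class ParseDemo {
  public static void main(String[] args) {
    // Load the default English PCFG grammar from the classpath.
    LexicalizedParser lp = LexicalizedParser.loadModel(
        "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");

    // Turn pretokenized words into HasWord tokens and parse them.
    List<HasWord> sentence = Sentence.toWordList(
        "The quick brown fox jumped over the lazy dog .".split(" "));
    Tree parse = lp.parse(sentence);
    parse.pennPrint();
  }
}
```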
Modifier and Type | Field and Description
---|---
BinaryGrammar | bg
static java.lang.String | DEFAULT_PARSER_LOC
DependencyGrammar | dg
Lexicon | lex
Reranker | reranker
Index&lt;java.lang.String&gt; | stateIndex
Index&lt;java.lang.String&gt; | tagIndex
UnaryGrammar | ug
Index&lt;java.lang.String&gt; | wordIndex
Constructor and Description
---
LexicalizedParser(Lexicon lex, BinaryGrammar bg, UnaryGrammar ug, DependencyGrammar dg, Index&lt;java.lang.String&gt; stateIndex, Index&lt;java.lang.String&gt; wordIndex, Index&lt;java.lang.String&gt; tagIndex, Options op)
Modifier and Type | Method and Description
---|---
static TreeAnnotatorAndBinarizer | buildTrainBinarizer(Options op)
static CompositeTreeTransformer | buildTrainTransformer(Options op)
static CompositeTreeTransformer | buildTrainTransformer(Options op, TreeAnnotatorAndBinarizer binarizer)
static LexicalizedParser | copyLexicalizedParser(LexicalizedParser parser)
java.lang.String[] | defaultCoreNLPFlags() — Returns a set of options which should be set by default when used in CoreNLP.
static Triple&lt;Treebank,Treebank,Treebank&gt; | getAnnotatedBinaryTreebankFromTreebank(Treebank trainTreebank, Treebank secondaryTreebank, Treebank tuneTreebank, Options op)
java.util.List&lt;Eval&gt; | getExtraEvals() — Returns a list of extra Eval objects to use when scoring the parser.
Lexicon | getLexicon()
Options | getOp()
static LexicalizedParser | getParserFromFile(java.lang.String parserFileOrUrl, Options op)
static LexicalizedParser | getParserFromSerializedFile(java.lang.String serializedFileOrUrl)
protected static LexicalizedParser | getParserFromTextFile(java.lang.String textFileOrUrl, Options op)
static LexicalizedParser | getParserFromTreebank(Treebank trainTreebank, Treebank secondaryTrainTreebank, double weight, GrammarCompactor compactor, Options op, Treebank tuneTreebank, java.util.List&lt;java.util.List&lt;TaggedWord&gt;&gt; extraTaggedWords) — A method for training from two different treebanks, the second of which is presumed to be orders of magnitude larger.
java.util.List&lt;ParserQueryEval&gt; | getParserQueryEvals() — Returns a list of Eval-style objects which care about the whole ParserQuery, not just the finished tree.
TreebankLangParserParams | getTLPParams()
TreePrint | getTreePrint() — Returns a TreePrint for formatting parsed output trees.
LexicalizedParserQuery | lexicalizedParserQuery()
static LexicalizedParser | loadModel() — Constructs a new LexicalizedParser object from a previously serialized grammar read from the System property edu.stanford.nlp.SerializedLexicalizedParser, or a default classpath location (edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz).
static LexicalizedParser | loadModel(java.io.ObjectInputStream ois) — Reads one object from the given ObjectInputStream, which is assumed to be a LexicalizedParser.
static LexicalizedParser | loadModel(Options op, java.lang.String... extraFlags) — Constructs a new LexicalizedParser object from a previously serialized grammar read from the System property edu.stanford.nlp.SerializedLexicalizedParser, or a default classpath location (edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz).
static LexicalizedParser | loadModel(java.lang.String parserFileOrUrl, java.util.List&lt;java.lang.String&gt; extraFlags)
static LexicalizedParser | loadModel(java.lang.String parserFileOrUrl, Options op, java.lang.String... extraFlags) — Constructs a new LexicalizedParser.
static LexicalizedParser | loadModel(java.lang.String parserFileOrUrl, java.lang.String... extraFlags)
static void | main(java.lang.String[] args) — A main program for using the parser with various options.
Tree | parse(java.util.List&lt;? extends HasWord&gt; lst) — Parses the list of HasWord.
java.util.List&lt;Tree&gt; | parseMultiple(java.util.List&lt;? extends java.util.List&lt;? extends HasWord&gt;&gt; sentences)
java.util.List&lt;Tree&gt; | parseMultiple(java.util.List&lt;? extends java.util.List&lt;? extends HasWord&gt;&gt; sentences, int nthreads) — Launches multiple threads which call parse on each of the sentences in order, returning the resulting parse trees in the same order.
ParserQuery | parserQuery()
Tree | parseStrings(java.util.List&lt;java.lang.String&gt; lst) — Processes a list of strings into a list of HasWord and returns the parse tree associated with that list.
Tree | parseTree(java.util.List&lt;? extends HasWord&gt; sentence) — Similar to parse(), but instead of returning an X tree on failure, returns null.
boolean | requiresTags() — Whether the model requires text to be pretagged.
void | saveParserToSerialized(java.lang.String filename) — Saves the parser to the given filename.
void | saveParserToTextFile(java.lang.String filename) — Saves the parser to the given filename in a text format.
void | setOptionFlags(java.lang.String... flags) — Sets options on the parser, in a way exactly equivalent to passing in the same sequence of command-line arguments.
static LexicalizedParser | trainFromTreebank(java.lang.String treebankPath, java.io.FileFilter filt, Options op)
static LexicalizedParser | trainFromTreebank(Treebank trainTreebank, GrammarCompactor compactor, Options op) — Constructs a new LexicalizedParser.
static LexicalizedParser | trainFromTreebank(Treebank trainTreebank, Options op)
TreebankLanguagePack | treebankLanguagePack()
Methods inherited from class ParserGrammar: apply, lemmatize, lemmatize, loadModelFromZip, loadTagger, parse, tokenize
public Lexicon lex
public BinaryGrammar bg
public UnaryGrammar ug
public DependencyGrammar dg
public Index<java.lang.String> stateIndex
public Index<java.lang.String> wordIndex
public Index<java.lang.String> tagIndex
public Reranker reranker
public static final java.lang.String DEFAULT_PARSER_LOC
public LexicalizedParser(Lexicon lex, BinaryGrammar bg, UnaryGrammar ug, DependencyGrammar dg, Index<java.lang.String> stateIndex, Index<java.lang.String> wordIndex, Index<java.lang.String> tagIndex, Options op)
public Options getOp()
Overrides: getOp in class ParserGrammar

public TreebankLangParserParams getTLPParams()
Overrides: getTLPParams in class ParserGrammar

public TreebankLanguagePack treebankLanguagePack()
Overrides: treebankLanguagePack in class ParserGrammar

public java.lang.String[] defaultCoreNLPFlags()
Description copied from class ParserGrammar: Returns a set of options which should be set by default when used in CoreNLP.
Overrides: defaultCoreNLPFlags in class ParserGrammar

public boolean requiresTags()
Description copied from class ParserGrammar: Whether the model requires text to be pretagged.
Overrides: requiresTags in class ParserGrammar
public static LexicalizedParser loadModel()
Construct a new LexicalizedParser object from a previously serialized grammar read from the System property edu.stanford.nlp.SerializedLexicalizedParser, or a default classpath location (edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz).

public static LexicalizedParser loadModel(Options op, java.lang.String... extraFlags)
Construct a new LexicalizedParser object from a previously serialized grammar read from the System property edu.stanford.nlp.SerializedLexicalizedParser, or a default classpath location (edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz).
Parameters: op - Options to the parser. These get overwritten by the Options read from the serialized parser; I think the only thing determined by them is the encoding of the grammar, and only if it is a text grammar.

public static LexicalizedParser loadModel(java.lang.String parserFileOrUrl, java.lang.String... extraFlags)

public static LexicalizedParser loadModel(java.lang.String parserFileOrUrl, java.util.List&lt;java.lang.String&gt; extraFlags)

public static LexicalizedParser loadModel(java.lang.String parserFileOrUrl, Options op, java.lang.String... extraFlags)
Construct a new LexicalizedParser.
Parameters: parserFileOrUrl - Filename/URL to load parser from. op - Options for this parser. These will normally be overwritten by options stored in the file.
Throws: java.lang.IllegalArgumentException - If parser data cannot be loaded.

public static LexicalizedParser loadModel(java.io.ObjectInputStream ois)
Reads one object from the given ObjectInputStream, which is assumed to be a LexicalizedParser.
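A hedged sketch of the loadModel overloads above (the model path is the default classpath location; extra flags shown are examples from elsewhere on this page):

```java
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.parser.lexparser.Options;

public class LoadDemo {
  public static void main(String[] args) {
    // No-argument form: reads the model named by the System property
    // edu.stanford.nlp.SerializedLexicalizedParser, falling back to the
    // default classpath location if the property is unset.
    System.setProperty("edu.stanford.nlp.SerializedLexicalizedParser",
        "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");
    LexicalizedParser lp1 = LexicalizedParser.loadModel();

    // Explicit file/URL form, with extra flags applied after loading.
    LexicalizedParser lp2 = LexicalizedParser.loadModel(
        "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz",
        "-maxLength", "80");

    // Form taking an Options object; the Options stored in the
    // serialized parser normally overwrite it.
    LexicalizedParser lp3 = LexicalizedParser.loadModel(
        "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz",
        new Options(), "-outputFormat", "penn");
  }
}
```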
public static LexicalizedParser copyLexicalizedParser(LexicalizedParser parser)

public static LexicalizedParser trainFromTreebank(Treebank trainTreebank, GrammarCompactor compactor, Options op)
Construct a new LexicalizedParser.
Parameters: trainTreebank - a treebank to train from

public static LexicalizedParser trainFromTreebank(java.lang.String treebankPath, java.io.FileFilter filt, Options op)

public static LexicalizedParser trainFromTreebank(Treebank trainTreebank, Options op)
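A sketch of training from a treebank path, under the assumption that the path below points at a local directory of treebank files (whether a null FileFilter is accepted may depend on the version; pass a real filter if not):

```java
import java.io.FileFilter;

import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.parser.lexparser.Options;

public class TrainDemo {
  public static void main(String[] args) {
    Options op = new Options();
    // Hypothetical local path to a directory of Penn Treebank files;
    // a null filter is intended here to mean "use every file".
    String treebankPath = "/path/to/treebank";
    FileFilter filter = null;
    LexicalizedParser lp =
        LexicalizedParser.trainFromTreebank(treebankPath, filter, op);

    // Persist the trained grammar for later use with loadModel.
    lp.saveParserToSerialized("myGrammar.ser.gz");
  }
}
```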
public Tree parseStrings(java.util.List&lt;java.lang.String&gt; lst)
Will process a list of strings into a list of HasWord and return the parse tree associated with that list.

public Tree parse(java.util.List&lt;? extends HasWord&gt; lst)
Parses the list of HasWord.
Overrides: parse in class ParserGrammar
Parameters: lst - The input sentence (a List of words)

public java.util.List&lt;Tree&gt; parseMultiple(java.util.List&lt;? extends java.util.List&lt;? extends HasWord&gt;&gt; sentences)

public java.util.List&lt;Tree&gt; parseMultiple(java.util.List&lt;? extends java.util.List&lt;? extends HasWord&gt;&gt; sentences, int nthreads)
Will launch multiple threads which call parse on each of the sentences in order, returning the resulting parse trees in the same order.

public TreePrint getTreePrint()
Return a TreePrint for formatting parsed output trees.

public Tree parseTree(java.util.List&lt;? extends HasWord&gt; sentence)
Similar to parse(), but instead of returning an X tree on failure, returns null.
Overrides: parseTree in class ParserGrammar
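The multithreaded parseMultiple overload can be sketched as follows (assuming the default model is available on the classpath; the thread count of 4 is an arbitrary example):

```java
import java.util.ArrayList;
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.Sentence;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.trees.Tree;

public class MultiParseDemo {
  public static void main(String[] args) {
    LexicalizedParser lp = LexicalizedParser.loadModel();

    List<List<? extends HasWord>> sentences = new ArrayList<>();
    sentences.add(Sentence.toWordList("This is the first sentence .".split(" ")));
    sentences.add(Sentence.toWordList("Here is another one .".split(" ")));

    // Parse with 4 worker threads; trees come back in input order.
    List<Tree> trees = lp.parseMultiple(sentences, 4);
    for (Tree t : trees) {
      t.pennPrint();
    }
  }
}
```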
public java.util.List&lt;Eval&gt; getExtraEvals()
Description copied from class ParserGrammar: Returns a list of extra Eval objects to use when scoring the parser.
Overrides: getExtraEvals in class ParserGrammar

public java.util.List&lt;ParserQueryEval&gt; getParserQueryEvals()
Description copied from class ParserGrammar: Returns a list of Eval-style objects which care about the whole ParserQuery, not just the finished tree.
Overrides: getParserQueryEvals in class ParserGrammar

public ParserQuery parserQuery()
Specified by: parserQuery in interface ParserQueryFactory
Overrides: parserQuery in class ParserGrammar
public LexicalizedParserQuery lexicalizedParserQuery()
public static LexicalizedParser getParserFromFile(java.lang.String parserFileOrUrl, Options op)
public Lexicon getLexicon()
public void saveParserToSerialized(java.lang.String filename)
public void saveParserToTextFile(java.lang.String filename)
protected static LexicalizedParser getParserFromTextFile(java.lang.String textFileOrUrl, Options op)
public static LexicalizedParser getParserFromSerializedFile(java.lang.String serializedFileOrUrl)
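A save/reload round trip using the methods above can be sketched as follows (output filenames are hypothetical; note that the text format may not be supported for every grammar type):

```java
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

public class SaveReloadDemo {
  public static void main(String[] args) {
    LexicalizedParser lp = LexicalizedParser.loadModel();

    // A .gz extension makes the file GZip-compressed on write and read.
    lp.saveParserToSerialized("englishCopy.ser.gz");
    // Text format is human-readable but slower to load.
    lp.saveParserToTextFile("englishCopy.txt");

    LexicalizedParser reloaded =
        LexicalizedParser.getParserFromSerializedFile("englishCopy.ser.gz");
  }
}
```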
public static TreeAnnotatorAndBinarizer buildTrainBinarizer(Options op)
public static CompositeTreeTransformer buildTrainTransformer(Options op)
public static CompositeTreeTransformer buildTrainTransformer(Options op, TreeAnnotatorAndBinarizer binarizer)
public static Triple<Treebank,Treebank,Treebank> getAnnotatedBinaryTreebankFromTreebank(Treebank trainTreebank, Treebank secondaryTreebank, Treebank tuneTreebank, Options op)
public static LexicalizedParser getParserFromTreebank(Treebank trainTreebank, Treebank secondaryTrainTreebank, double weight, GrammarCompactor compactor, Options op, Treebank tuneTreebank, java.util.List<java.util.List<TaggedWord>> extraTaggedWords)
Parameters:
trainTreebank - A treebank to train from
secondaryTrainTreebank - Another treebank to train from
weight - A weight factor to give the secondary treebank. If the weight is 0.25, each example in the secondaryTrainTreebank will be treated as 1/4 of an example sentence.
compactor - A class for compacting grammars. May be null.
op - Options for how the grammar is built from the treebank
tuneTreebank - A treebank to tune free parameters on (may be null)
extraTaggedWords - A list of words to add to the Lexicon

public void setOptionFlags(java.lang.String... flags)
This will set options to the parser, in a way exactly equivalent to passing in the same sequence of command-line arguments. One option that cannot be changed this way is which TreebankLangParserParams (-tLPP) to use. The TreebankLangParserParams should be set up on construction of a LexicalizedParser, by constructing an Options that uses the required TreebankLangParserParams, and passing that to a LexicalizedParser constructor. Note that despite this method being an instance method, many flags are actually set as static class variables.
Overrides: setOptionFlags in class ParserGrammar
Parameters: flags - Arguments to the parser, for example, {"-outputFormat", "typedDependencies", "-maxLength", "70"}
Throws: java.lang.IllegalArgumentException - If an unknown flag is passed in

public static void main(java.lang.String[] args)
A main program for using the parser with various options.
Sample usages:

java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] -train trainFilesPath [fileRange] -saveToSerializedFile serializedGrammarFilename

java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] -train trainFilesPath [fileRange] -testTreebank testFilePath [fileRange]

java -mx512m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] serializedGrammarPath filename [filename]*

java -mx512m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] -loadFromSerializedFile serializedGrammarPath -testTreebank testFilePath [fileRange]

If the serializedGrammarPath ends in .gz, then the grammar is written and read as a compressed (GZip) file. If the serializedGrammarPath is a URL starting with http://, then the parser is read from the URL.

A fileRange specifies a numeric value that must be included within a filename for it to be used in training or testing (this works well with most current treebanks). It can be specified like a range of pages to be printed, for instance as 200-2199 or 1-300,500-725,9000, or just as 1 (if all your trees are in a single file, either omit this parameter or just give a dummy argument such as 0).

If the filename to parse is "-" then the parser parses from stdin. If no files are supplied to parse, then a hardwired sentence is parsed.
The parser can write a grammar as either a serialized Java object file or in a text format (or as both), specified with the following options:

java edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] -train trainFilesPath [fileRange] [-saveToSerializedFile grammarPath] [-saveToTextFile grammarPath]
In the same position as the verbose flag (-v), many other options can be specified. The most useful to an end user are:

-tLPP class
Specify a different TreebankLangParserParams, for when using a different language or treebank (the default is English Penn Treebank). This option MUST occur before any other language-specific options that are used (or else they are ignored!). (It's usually a good idea to specify this option even when loading a serialized grammar; it is necessary if the language pack specifies a needed character encoding or you wish to specify language-specific options on the command line.)

-encoding charset
Specify the character encoding of the input and output files. This will override the value in the TreebankLangParserParams, provided this option appears after any -tLPP option.

-tokenized
Says that the input is already separated into whitespace-delimited tokens. If this option is specified, any tokenizer specified for the language is ignored, and a universal (Unicode) tokenizer, which divides only on whitespace, is used. Unless you also specify -escaper, the tokens must all be correctly tokenized tokens of the appropriate treebank for the parser to work well (for instance, if using the Penn English Treebank, you must have coded "(" as "-LRB-", etc.). (Note: we do not use the backslash escaping in front of / and * that appeared in Penn Treebank releases through 1999.)

-escaper class
Specify a class of type Function&lt;List&lt;HasWord&gt;,List&lt;HasWord&gt;&gt; to do customized escaping of tokenized text. This class will be run over the tokenized text and can fix the representation of tokens. For instance, it could change "(" to "-LRB-" for the Penn English Treebank. A provided escaper that does such things for the Penn English Treebank is edu.stanford.nlp.process.PTBEscapingProcessor.

-tokenizerFactory class
Specifies a TokenizerFactory class to be used for tokenization.

-tokenizerOptions options
Specifies options to a TokenizerFactory class to be used for tokenization, as a comma-separated list. For PTBTokenizer, options of interest include americanize=false and quotes=ascii (for German). Note that any choice of tokenizer options that conflicts with the tokenization used in the parser training data will likely degrade parser performance.

-sentences token
Specifies a token that marks sentence boundaries. A value of newline causes sentence breaking on newlines. A value of onePerElement causes each element (using the XML -parseInside option) to be treated as a sentence. All other tokens will be interpreted literally, and must be exactly the same as tokens returned by the tokenizer. For example, you might specify "|||" and put that symbol sequence as a token between sentences. If no explicit sentence breaking option is chosen, sentence breaking is done based on a set of language-particular sentence-ending patterns.

-parseInside element
Specifies that parsing should only be done for tokens inside the indicated XML-style elements (done as simple pattern matching, rather than XML parsing). For example, if this is specified as sentence, then the text inside the sentence element would be parsed. Using "-parseInside s" gives you support for the input format of Charniak's parser. Sentences cannot span elements. Whether the contents of the element are treated as one sentence or potentially multiple sentences is controlled by the -sentences flag. The default is potentially multiple sentences. This option gives support for extracting and parsing text from very simple SGML and XML documents, and is provided as a user convenience for that purpose. If you want to really parse XML documents before NLP parsing them, you should use an XML parser, and then call a LexicalizedParser on appropriate CDATA.

-tagSeparator char
Specifies to look for tags on words following the word and separated from it by the special character char. For instance, many tagged corpora have the representation "house/NN" and you would use -tagSeparator /. Notes: This option requires that the input be pretokenized. The separator has to be only a single character, and there is no escaping mechanism. However, splitting is done on the last instance of the character in the token, so that cases like "3\/4/CD" are handled correctly. The parser will in all normal circumstances use the tag you provide, but will override it in the case of very common words in cases where the tag that you provide is not one that it regards as a possible tagging for the word. The parser supports a format where only some of the words in a sentence have a tag (if you are calling the parser programmatically, you indicate them by having them implement the HasTag interface). You can do this at the command line by only having tags after some words, but you are limited by the fact that there is no way to escape the tagSeparator character.

-maxLength leng
Specify the longest sentence that will be parsed (and hence indirectly the amount of memory needed for the parser). If this is not specified, the parser will try to dynamically grow its parse chart when long sentences are encountered, but may run out of memory trying to do so.

-outputFormat styles
Choose the style(s) of output sentences: penn for prettyprinting as in the Penn treebank files, or oneline for printing sentences one per line, words, wordsAndTags, dependencies, typedDependencies, or typedDependenciesCollapsed. Multiple options may be specified as a comma-separated list. See the TreePrint class for further documentation.

-outputFormatOptions
Provide options that control the behavior of various -outputFormat choices, such as lexicalize, stem, markHeadNodes, or xml. Options are specified as a comma-separated list; see the TreePrint class for documentation.

-writeOutputFiles
Write output files corresponding to the input files, with the same name but a ".stp" file extension. The format of these files depends on the outputFormat option. (If not specified, output is sent to stdout.)

-outputFilesExtension
The extension that is appended to the filename that is being parsed to produce an output file name (with the -writeOutputFiles option). The default is stp. Don't include the period.

-outputFilesDirectory
The directory in which output files are written (when the -writeOutputFiles option is specified). If not specified, output files are written in the same directory as the input files.

-nthreads
Parsing files and testing on treebanks can use multiple threads. This option tells the parser how many threads to use. A negative number indicates to use as many threads as the machine has cores.
Parameters:
args - Command line arguments, as above
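The same options can be applied programmatically through setOptionFlags, which is documented above as exactly equivalent to passing the corresponding command-line arguments (the flag values below are the example from that method's documentation):

```java
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

public class FlagsDemo {
  public static void main(String[] args) {
    LexicalizedParser lp = LexicalizedParser.loadModel();

    // Exactly equivalent to passing these arguments on the command line;
    // an unknown flag raises IllegalArgumentException.
    lp.setOptionFlags("-outputFormat", "typedDependencies",
        "-maxLength", "70");
  }
}
```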