java.lang.Object
  edu.stanford.nlp.parser.lexparser.LexicalizedParser
public class LexicalizedParser
This class provides the top-level API and command-line interface to a set of reasonably good treebank-trained parsers. The name reflects the main factored parsing model, which provides a lexicalized PCFG parser implemented as a product model of a plain PCFG parser and a lexicalized dependency parser. But you can also run either component parser alone. In particular, it is often useful to do unlexicalized PCFG parsing by using just that component parser.
See the package documentation for more details and examples of use. See the main method documentation for details of invoking the parser.
Note that training on a 1-million-word treebank requires a lot of memory to run; try the JVM option -mx1500m.
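As a quick illustration of the API, here is a minimal sketch of programmatic use. The serialized grammar path (englishPCFG.ser.gz) is illustrative rather than a file bundled with this class; the methods used are those documented below.

  import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
  import edu.stanford.nlp.trees.Tree;
  import edu.stanford.nlp.trees.TreePrint;

  public class ParseDemo {
    public static void main(String[] args) {
      // Load a previously serialized grammar (the path here is an example).
      LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");
      // Flags are set exactly as on the command line; see setOptionFlags(String...).
      lp.setOptionFlags("-maxLength", "80");
      // parse(String) tokenizes and parses; it returns true on success.
      if (lp.parse("This is an easy sentence to parse.")) {
        Tree best = lp.getBestParse();      // best parse of the last sentence parsed
        TreePrint tp = lp.getTreePrint();   // formatter configured by -outputFormat
        tp.printTree(best);
      }
    }
  }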
Field Summary

protected KBestViterbiParser bparser
    The factored parser that combines the dependency and PCFG parsers.

protected TreeTransformer debinarizer

protected ExhaustiveDependencyParser dparser
    The dependency parser.

protected ExhaustivePCFGParser pparser
    The PCFG parser.
Constructor Summary

LexicalizedParser()
    Construct a new LexicalizedParser object from a previously serialized grammar read from a System property edu.stanford.nlp.SerializedLexicalizedParser, or a default file location.

LexicalizedParser(ObjectInputStream in)
    Construct a new LexicalizedParser object from a previously assembled grammar read from an InputStream.

LexicalizedParser(Options op)
    Construct a new LexicalizedParser object from a previously serialized grammar read from a System property edu.stanford.nlp.SerializedLexicalizedParser, or a default file location (/u/nlp/data/lexparser/englishPCFG.ser.gz).

LexicalizedParser(ParserData pd)
    Construct a new LexicalizedParser object from a previously assembled grammar.

LexicalizedParser(String parserFileOrUrl)

LexicalizedParser(String parserFileOrUrl, boolean isTextGrammar, Options op)
    Construct a new LexicalizedParser.

LexicalizedParser(String treebankPath, FileFilter filt, Options op)

LexicalizedParser(String parserFileOrUrl, Options op)
    Construct a new LexicalizedParser.

LexicalizedParser(Treebank trainTreebank, DiskTreebank secondaryTrainTreebank, double weight, GrammarCompactor compactor, Options op)

LexicalizedParser(Treebank trainTreebank, GrammarCompactor compactor, Options op)
    Construct a new LexicalizedParser.

LexicalizedParser(Treebank trainTreebank, GrammarCompactor compactor, Options op, Treebank tuneTreebank)
    Construct a new LexicalizedParser.

LexicalizedParser(Treebank trainTreebank, Options op)
Method Summary

Tree apply(Object in)
    Converts a Sentence/List/String into a Tree.

static Pair<List<Tree>,List<Tree>> getAnnotatedBinaryTreebankFromTreebank(Treebank trainTreebank, Treebank tuneTreebank, Options op)

Tree getBestDependencyParse()

Tree getBestDependencyParse(boolean debinarize)

Tree getBestParse()
    Return the best parse of the sentence most recently parsed.

Tree getBestPCFGParse()

Tree getBestPCFGParse(boolean stripSubcategories)

List<ScoredObject<Tree>> getKBestPCFGParses(int k)
    Returns the trees (and scores) corresponding to the k-best derivations of the sentence.

List<ScoredObject<Tree>> getKGoodFactoredParses(int k)

Lexicon getLexicon()

Options getOp()

static ParserData getParserDataFromFile(String parserFileOrUrl, Options op)

protected static ParserData getParserDataFromPetrovFiles(String grammarFile, String lexiconFile)

static ParserData getParserDataFromSerializedFile(String serializedFileOrUrl)

protected static ParserData getParserDataFromTextFile(String textFileOrUrl, Options op)

protected ParserData getParserDataFromTreebank(Treebank trainTreebank, DiskTreebank secondaryTrainTreebank, double weight, GrammarCompactor compactor)
    A method for training from two different treebanks, the second of which is presumed to be orders of magnitude larger.

ParserData getParserDataFromTreebank(Treebank trainTreebank, GrammarCompactor compactor, Treebank tuneTreebank)

double getPCFGScore()

double getPCFGScore(String goalStr)

TreePrint getTreePrint()
    Return a TreePrint for formatting parsed output trees.

static void main(String[] args)
    A main program for using the parser with various options.

boolean parse(LatticeReader lr)
    Parse a (speech) lattice with the PCFG parser.

boolean parse(List<? extends HasWord> sentence)
    Parse a sentence represented as a List of tokens.

boolean parse(List<? extends HasWord> sentence, String goal)
    Parse a Sentence.

boolean parse(String sentence)
    Tokenize and parse a sentence.

ParserData parserData()

void reset()
    Reinitializes the parser.

void setMaxLength(int maxLength)
    Set the maximum length of a sentence that the parser will be willing to parse.

void setOptionFlags(String... flags)
    This will set options to the parser, in a way exactly equivalent to passing in the same sequence of command-line arguments.

double testOnTreebank(Treebank testTreebank)
    Test the parser on a treebank.

Methods inherited from class java.lang.Object
    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
protected ExhaustivePCFGParser pparser
protected ExhaustiveDependencyParser dparser
protected KBestViterbiParser bparser
protected TreeTransformer debinarizer
Constructor Detail
public LexicalizedParser()
    Construct a new LexicalizedParser object from a previously serialized grammar read from a System property edu.stanford.nlp.SerializedLexicalizedParser, or a default file location.

public LexicalizedParser(Options op)
    Construct a new LexicalizedParser object from a previously serialized grammar read from a System property edu.stanford.nlp.SerializedLexicalizedParser, or a default file location (/u/nlp/data/lexparser/englishPCFG.ser.gz).
    Parameters:
        op - Options to the parser. These get overwritten by the Options read from the serialized parser; I think the only thing determined by them is the encoding of the grammar iff it is a text grammar.

public LexicalizedParser(String parserFileOrUrl, Options op)
    Construct a new LexicalizedParser.
    Parameters:
        parserFileOrUrl - Filename/URL to load parser from
        op - Options for this parser. These will normally be overwritten by options stored in the file.
    Throws:
        IllegalArgumentException - If parser data cannot be loaded

public LexicalizedParser(String parserFileOrUrl)

public LexicalizedParser(String parserFileOrUrl, boolean isTextGrammar, Options op)
    Construct a new LexicalizedParser.
    Throws:
        IllegalArgumentException - If parser data cannot be loaded

public LexicalizedParser(ParserData pd)
    Construct a new LexicalizedParser object from a previously assembled grammar.
    Parameters:
        pd - A ParserData object (not null)

public LexicalizedParser(ObjectInputStream in) throws Exception
    Construct a new LexicalizedParser object from a previously assembled grammar read from an InputStream.
    Parameters:
        in - The ObjectInputStream
    Throws:
        Exception

public LexicalizedParser(Treebank trainTreebank, GrammarCompactor compactor, Options op)
    Construct a new LexicalizedParser.
    Parameters:
        trainTreebank - a treebank to train from

public LexicalizedParser(String treebankPath, FileFilter filt, Options op)

public LexicalizedParser(Treebank trainTreebank, GrammarCompactor compactor, Options op, Treebank tuneTreebank)
    Construct a new LexicalizedParser.
    Parameters:
        trainTreebank - a treebank to train from
        compactor - A class for compacting grammars. May be null.
        op - Options for how the grammar is built from the treebank
        tuneTreebank - a treebank to tune free params on (may be null)

public LexicalizedParser(Treebank trainTreebank, DiskTreebank secondaryTrainTreebank, double weight, GrammarCompactor compactor, Options op)

public LexicalizedParser(Treebank trainTreebank, Options op)
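A grammar can also be trained programmatically. The following is a rough sketch using the (String treebankPath, FileFilter, Options) constructor above; the treebank directory is hypothetical, the accept-everything FileFilter is only for illustration, and a plain new Options() is assumed to give the default English Penn Treebank setup.

  import java.io.File;
  import java.io.FileFilter;
  import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
  import edu.stanford.nlp.parser.lexparser.Options;

  public class TrainDemo {
    public static void main(String[] args) {
      Options op = new Options();  // assumed default: English Penn Treebank settings
      // Accept every file under the (hypothetical) treebank directory; a numeric
      // fileRange-style filter could be substituted here.
      FileFilter allFiles = new FileFilter() {
        public boolean accept(File f) { return true; }
      };
      // Train a lexicalized grammar directly from the treebank files on disk.
      LexicalizedParser lp = new LexicalizedParser("/path/to/treebank", allFiles, op);
      // The trained parser can be used immediately, or its grammar obtained via parserData().
      System.out.println("Training done; grammar available via parserData().");
    }
  }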
Method Detail
public Options getOp()

public Tree apply(Object in)
    Converts a Sentence/List/String into a Tree.
    Specified by:
        apply in interface Function<Object,Tree>
    Parameters:
        in - The input Sentence/List/String
    Throws:
        IllegalArgumentException - If argument isn't a List or String

public TreePrint getTreePrint()
    Return a TreePrint for formatting parsed output trees.
public boolean parse(List<? extends HasWord> sentence, String goal)
    Parse a Sentence.
    Specified by:
        parse in interface Parser
    Parameters:
        sentence - The words to parse
        goal - What category to parse the words as

public boolean parse(String sentence)
    Tokenize and parse a sentence.
    Parameters:
        sentence - The sentence as a regular String

public boolean parse(List<? extends HasWord> sentence)
    Parse a sentence represented as a List of tokens.
    Specified by:
        parse in interface Parser
    Parameters:
        sentence - The sentence to parse
    Throws:
        UnsupportedOperationException - If the Sentence is too long or of zero length or the parse otherwise fails for resource reasons

public boolean parse(LatticeReader lr)
    Parse a (speech) lattice with the PCFG parser.
    Parameters:
        lr - a lattice to parse
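A minimal sketch of parsing pre-tokenized input through parse(List); Word (which implements HasWord) is used to wrap the tokens, and the grammar path is again only an example.

  import java.util.ArrayList;
  import java.util.List;
  import edu.stanford.nlp.ling.HasWord;
  import edu.stanford.nlp.ling.Word;
  import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
  import edu.stanford.nlp.trees.Tree;

  public class TokenizedParseDemo {
    public static void main(String[] args) {
      LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");  // example path
      List<HasWord> sentence = new ArrayList<HasWord>();
      for (String w : new String[] {"The", "tokens", "are", "supplied", "directly", "."}) {
        sentence.add(new Word(w));  // Word implements HasWord
      }
      // parse(List) expects treebank-style tokens (e.g. "-LRB-" for "(" with the PTB grammar).
      if (lp.parse(sentence)) {
        Tree tree = lp.getBestParse();
        tree.pennPrint();  // print the phrase-structure tree in Penn Treebank style
      }
    }
  }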
public Tree getBestParse()
    Return the best parse of the sentence most recently parsed.
    Specified by:
        getBestParse in interface ViterbiParser
    Throws:
        NoSuchElementException - If no sentence has previously been successfully parsed

public List<ScoredObject<Tree>> getKGoodFactoredParses(int k)

public List<ScoredObject<Tree>> getKBestPCFGParses(int k)
    Returns the trees (and scores) corresponding to the k-best derivations of the sentence.
    Parameters:
        k - The number of best parses to return
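For instance, a sketch of retrieving the k-best PCFG derivations of the most recently parsed sentence (the grammar path is illustrative):

  import java.util.List;
  import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
  import edu.stanford.nlp.trees.Tree;
  import edu.stanford.nlp.util.ScoredObject;

  public class KBestDemo {
    public static void main(String[] args) {
      LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");  // example path
      if (lp.parse("The parser can return more than one analysis.")) {
        // The 5 highest-scoring PCFG derivations of the sentence just parsed.
        List<ScoredObject<Tree>> kBest = lp.getKBestPCFGParses(5);
        for (ScoredObject<Tree> scored : kBest) {
          System.out.println("score: " + scored.score());
          scored.object().pennPrint();
        }
      }
    }
  }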
public Tree getBestPCFGParse()

public Tree getBestPCFGParse(boolean stripSubcategories)

public double getPCFGScore()

public double getPCFGScore(String goalStr)

public Tree getBestDependencyParse()

public Tree getBestDependencyParse(boolean debinarize)

public void setMaxLength(int maxLength)
    Set the maximum length of a sentence that the parser will be willing to parse.
    Parameters:
        maxLength - The maximum length sentence to parse

public static ParserData getParserDataFromFile(String parserFileOrUrl, Options op)
public ParserData parserData()

public Lexicon getLexicon()

protected static ParserData getParserDataFromPetrovFiles(String grammarFile, String lexiconFile)

protected static ParserData getParserDataFromTextFile(String textFileOrUrl, Options op)

public static ParserData getParserDataFromSerializedFile(String serializedFileOrUrl)

public static Pair<List<Tree>,List<Tree>> getAnnotatedBinaryTreebankFromTreebank(Treebank trainTreebank, Treebank tuneTreebank, Options op)

public final ParserData getParserDataFromTreebank(Treebank trainTreebank, GrammarCompactor compactor, Treebank tuneTreebank)

protected final ParserData getParserDataFromTreebank(Treebank trainTreebank, DiskTreebank secondaryTrainTreebank, double weight, GrammarCompactor compactor)
    A method for training from two different treebanks, the second of which is presumed to be orders of magnitude larger.

public void reset()
    Reinitializes the parser.
public double testOnTreebank(Treebank testTreebank)
    Test the parser on a treebank. More detailed output is produced if Test.verbose is true.
    Parameters:
        testTreebank - The treebank to parse
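A rough sketch of evaluating a loaded parser on a test treebank follows. It assumes that Options exposes its TreebankLangParserParams as a public tlpParams field with a diskTreebank() factory method, and that the paths exist; both are assumptions of this sketch rather than guarantees of this page.

  import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
  import edu.stanford.nlp.parser.lexparser.Options;
  import edu.stanford.nlp.trees.Treebank;

  public class EvalDemo {
    public static void main(String[] args) {
      LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");  // example path
      Options op = lp.getOp();
      // Assumption: tlpParams.diskTreebank() builds a Treebank with the right TreeReaderFactory.
      Treebank testTreebank = op.tlpParams.diskTreebank();
      testTreebank.loadPath("/path/to/testTreebank");  // hypothetical directory
      double score = lp.testOnTreebank(testTreebank);
      System.out.println("treebank score: " + score);
    }
  }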
public void setOptionFlags(String... flags)
    This will set options to the parser, in a way exactly equivalent to passing in the same sequence of command-line arguments. It cannot be used to change the TreebankLangParserParams (-tLPP) to use. The TreebankLangParserParams should be set up on construction of a LexicalizedParser, by constructing an Options that uses the required TreebankLangParserParams, and passing that to a LexicalizedParser constructor. Note that despite this method being an instance method, many flags are actually set as static class variables.
    Parameters:
        flags - Arguments to the parser, for example, {"-outputFormat", "typedDependencies", "-maxLength", "70"}
    Throws:
        IllegalArgumentException - If an unknown flag is passed in
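For example, a sketch of setting flags programmatically and then printing typed dependencies through the resulting TreePrint (the grammar path is illustrative):

  import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
  import edu.stanford.nlp.trees.Tree;

  public class FlagsDemo {
    public static void main(String[] args) {
      LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");  // example path
      // Exactly equivalent to giving these flags on the command line;
      // -outputFormat controls what getTreePrint() will later produce.
      lp.setOptionFlags("-outputFormat", "typedDependencies", "-maxLength", "70");
      if (lp.parse("Options can be set after the parser is constructed.")) {
        Tree tree = lp.getBestParse();
        lp.getTreePrint().printTree(tree);  // typed dependencies, per the flag above
      }
    }
  }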
public static void main(String[] args)
    A main program for using the parser with various options.

    Sample Usages:
  java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v]
      -train trainFilesPath fileRange -saveToSerializedFile serializedGrammarFilename

  java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v]
      -train trainFilesPath fileRange -testTreebank testFilePath fileRange

  java -mx512m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v]
      serializedGrammarPath filename+

  java -mx512m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v]
      -loadFromSerializedFile serializedGrammarPath -testTreebank testFilePath fileRange
If the serializedGrammarPath ends in .gz, then the grammar is written and read as a compressed file (GZip). If the serializedGrammarPath is a URL, starting with http://, then the parser is read from the URL.

A fileRange specifies a numeric value that must be included within a filename for it to be used in training or testing (this works well with most current treebanks). It can be specified like a range of pages to be printed, for instance as 200-2199 or 1-300,500-725,9000 or just as 1 (if all your trees are in a single file, just give a dummy argument such as 0 or 1).
The parser can write a grammar as either a serialized Java object file or in a text format (or as both), specified with the following options:

  java edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] -train
      trainFilesPath [fileRange] [-saveToSerializedFile grammarPath] [-saveToTextFile grammarPath]

If no files are supplied to parse, then a hardwired sentence is parsed.
In the same position as the verbose flag (-v), many other options can be specified. The most useful to an end user are:

-tLPP class
    Specify a different TreebankLangParserParams, for when using a different language or treebank (the default is English Penn Treebank). This option MUST occur before any other language-specific options that are used (or else they are ignored!). (It's usually a good idea to specify this option even when loading a serialized grammar; it is necessary if the language pack specifies a needed character encoding or you wish to specify language-specific options on the command line.)

-encoding charset
    Specify the character encoding of the input and output files. This will override the value in the TreebankLangParserParams, provided this option appears after any -tLPP option.

-tokenized
    Says that the input is already separated into whitespace-delimited tokens. If this option is specified, any tokenizer specified for the language is ignored, and a universal (Unicode) tokenizer, which divides only on whitespace, is used. Unless you also specify -escaper, the tokens must all be correctly tokenized tokens of the appropriate treebank for the parser to work well (for instance, if using the Penn English Treebank, you must have coded "(" as "-LRB-", "3/4" as "3\/4", etc.)

-escaper class
    Specify a class of type Function<List<HasWord>,List<HasWord>> to do customized escaping of tokenized text. This class will be run over the tokenized text and can fix the representation of tokens. For instance, it could change "(" to "-LRB-" for the Penn English Treebank. A provided escaper that does such things for the Penn English Treebank is edu.stanford.nlp.process.PTBEscapingProcessor.

-tokenizerFactory class
    Specifies a TokenizerFactory class to be used for tokenization.

-sentences token
    Specifies a token that marks sentence boundaries. A value of newline causes sentence breaking on newlines. A value of onePerElement causes each element (using the XML -parseInside option) to be treated as a sentence. All other tokens will be interpreted literally, and must be exactly the same as tokens returned by the tokenizer. If no explicit sentence breaking option is chosen, sentence breaking is done based on a set of language-particular sentence-ending patterns.

-parseInside element
    Specifies that parsing should only be done for tokens inside the indicated XML-style elements (done as simple pattern matching, rather than XML parsing). For example, if this is specified as sentence, then the text inside the sentence element would be parsed. Using "-parseInside s" gives you support for the input format of Charniak's parser. Sentences cannot span elements. Whether the contents of the element are treated as one sentence or potentially multiple sentences is controlled by the -sentences flag; the default is potentially multiple sentences. This option gives support for extracting and parsing text from very simple SGML and XML documents, and is provided as a user convenience for that purpose. If you want to really parse XML documents before NLP parsing them, you should use an XML parser, and then call a LexicalizedParser on appropriate CDATA.

-tagSeparator char
    Specifies to look for tags on words following the word and separated from it by a special character char. For instance, many tagged corpora have the representation "house/NN" and you would use -tagSeparator /. Notes: This option requires that the input be pretokenized. The separator has to be only a single character, and there is no escaping mechanism. However, splitting is done on the last instance of the character in the token, so that cases like "3\/4/CD" are handled correctly. The parser will in all normal circumstances use the tag you provide, but will override it in the case of very common words in cases where the tag that you provide is not one that it regards as a possible tagging for the word. The parser supports a format where only some of the words in a sentence have a tag (if you are calling the parser programmatically, you indicate them by having them implement the HasTag interface). You can do this at the command line by only having tags after some words, but you are limited by the fact that there is no way to escape the tagSeparator character.

-maxLength leng
    Specify the longest sentence that will be parsed (and hence indirectly the amount of memory needed for the parser). If this is not specified, the parser will try to dynamically grow its parse chart when long sentences are encountered, but may run out of memory trying to do so.

-outputFormat styles
    Choose the style(s) of output sentences: penn for prettyprinting as in the Penn treebank files, or oneline for printing sentences one per line, words, wordsAndTags, dependencies, typedDependencies, or typedDependenciesCollapsed. Multiple options may be specified as a comma-separated list. See the TreePrint class for further documentation.

-outputFormatOptions
    Provide options that control the behavior of various -outputFormat choices, such as lexicalize, stem, markHeadNodes, or xml. Options are specified as a comma-separated list.

-writeOutputFiles
    Write output files corresponding to the input files, with the same name but a ".stp" file extension. The format of these files depends on the outputFormat option. (If not specified, output is sent to stdout.)

-outputFilesExtension
    The extension that is appended to the filename that is being parsed to produce an output file name (with the -writeOutputFiles option). The default is stp. Don't include the period.

-outputFilesDirectory
    The directory in which output files are written (when the -writeOutputFiles option is specified). If not specified, output files are written in the same directory as the input files.

    Parameters:
        args - Command line arguments, as above