Interface | Description |
---|---|
AbstractCollinizer | Interface for the Collinizers. TODO: pass in both the guess and the gold. |
DependencyGrammar | An interface for DependencyGrammars. |
Extractor<T> | |
GrammarProjection | Maps between the states of a more split and less split grammar. |
LatticeScorer | |
Lexicon | An interface for lexicons interfacing to lexparser. |
Reranker | A scorer which the RerankingParserQuery can use to rescore sentences. |
RerankerQuery | Process a Tree and return a score. |
Rule | Interface for int-format grammar rules. |
Scorer | Interface for supporting A* scoring. |
TagProjection | An interface for projecting POS tags onto a reduced set for the dependency grammar. |
TreebankLangParserParams | Contains language-specific methods commonly necessary to get a parser to parse an arbitrary treebank. |
UnknownWordModel | This class defines the runtime interface for unknown words in lexparser. |
UnknownWordModelTrainer | An interface for training an UnknownWordModel. |
WordFeatureExtractor | An interface for getting features out of words for a feature-based lexicon. |

Class | Description |
---|---|
AbstractDependencyGrammar | An abstract base class for dependency grammars. |
AbstractTreebankParserParams | An abstract class providing a common method base from which to complete a TreebankLangParserParams implementing class. |
AbstractTreebankParserParams.AnnotatePunctuationFunction | Annotation function for mapping punctuation to PTB-style equivalence classes. |
AbstractTreeExtractor<T> | An abstract superclass for parser classes that extract counts from Trees. |
AbstractUnknownWordModelTrainer | |
AddTaggerToParser | A simple tool to add a tagger to the parser for reranking purposes. |
ArabicTreebankParserParams | A TreebankLangParserParams implementing class for the Penn Arabic Treebank. |
ArabicUnknownWordModel | This is a basic unknown word model for Arabic. |
ArabicUnknownWordModelTrainer | |
BaseLexicon | This is the default concrete instantiation of the Lexicon interface. |
BaseUnknownWordModel | An unknown word model for a generic language. |
BaseUnknownWordModelTrainer | |
BasicCategoryTagProjection | |
BiLexPCFGParser | Implements Eisner and Satta style algorithms for bilexical PCFG parsing. |
BiLexPCFGParser.N5BiLexPCFGParser | |
BinaryGrammar | Maintains efficient indexing of binary grammar rules. |
BinaryGrammarExtractor | |
BinaryHeadFinder | |
BinaryRule | Binary rules (ints for parent, left and right children). |
BoundaryRemover | Removes a boundary symbol (Lexicon.BOUNDARY_TAG or Lexicon.BOUNDARY), which is the rightmost daughter of a tree. |
ChineseCharacterBasedLexicon | |
ChineseCharacterBasedLexiconTraining | Includes a main file which trains a ChineseCharacterBasedLexicon. |
ChineseLexicon | A lexicon class for Chinese. |
ChineseLexiconAndWordSegmenter | This class lets you train a lexicon and segmenter at the same time. |
ChineseMarkovWordSegmenter | Performs word segmentation with a hierarchical Markov model over POS and over characters given POS. |
ChineseMaxentLexicon | A Lexicon class that computes the score of word\|tag according to a maxent model of tag\|word (divided by MLE estimate of P(tag)). |
ChineseSimWordAvgDepGrammar | A dependency grammar that smooths by averaging over similar words. |
ChineseTreebankParserParams | Parameter file for parsing the Penn Chinese Treebank. |
ChineseUnknownWordModel | Stores, trains, and scores with an unknown word model. |
ChineseUnknownWordModelTrainer | |
ChineseWordFeatureExtractor | |
CNFTransformers | |
CollinsPuncTransformer | This class manipulates punctuation in trees (used with training trees) in the same manner that Collins manipulated punctuation in trees when building his parsing model. |
Debinarizer | Debinarizes a binary tree from the parser. |
Edge | Class for parse edges. |
EnglishTreebankParserParams | Parser parameters for the Penn English Treebank (WSJ, Brown, Switchboard). |
EnglishTreebankParserParams.EnglishTest | |
EnglishTreebankParserParams.EnglishTrain | |
EnglishUnknownWordModel | This is a basic unknown word model for English. |
EnglishUnknownWordModelTrainer | |
ExactGrammarCompactor | |
ExhaustiveDependencyParser | An exhaustive O(n^4 t^2) time and O(n^2 t) space dependency parser. |
ExhaustivePCFGParser | An exhaustive generalized CKY PCFG parser. |
FactoredLexicon | |
FactoredLexiconEvent | |
FactoredParser | |
FastFactoredParser | Provides a much faster way to realize the factored parsing idea, including easily returning "k good" results at the expense of optimality. |
FrenchTreebankParserParams | TreebankLangParserParams for the French Treebank corpus. |
FrenchUnknownWordModel | |
FrenchUnknownWordModelTrainer | |
GenericTreebankParserParams | |
GermanUnknownWordModel | An unknown word model for German; relies on BaseUnknownWordModel plus number matching. |
GermanUnknownWordModelTrainer | |
GrammarCompactionTester | |
GrammarCompactor | |
GrammarCoverageChecker | Checks the coverage of rules in a grammar on a test treebank. |
HebrewTreebankParserParams | Initial version of a parser pack for the HTB. |
Hook | Class for parse table hooks. |
HookChart | A HookChart is a chart data structure designed for use with the efficient O(n^4) chart parsing mechanisms targeted at lexicalized parsing, which were introduced by Eisner and Satta. |
HTKLatticeReader | |
HTKLatticeReader.LatticeWord | |
HungarianTreebankParserParams | Bare-bones implementation of a ParserParams for the Hungarian SPMRL treebank. |
HungarianTreebankParserParams.HungarianSubcategoryStripper | |
IntDependency | Maintains a dependency between head and dependent where they are each an IntTaggedWord. |
IntTaggedWord | Represents a WordTag (in the sense that equality is defined on both components), where each half is represented by an int indexed by an Index. |
ItalianTreebankParserParams | Bare-bones implementation of a ParserParams for the Italian Turin treebank. |
ItalianTreebankParserParams.ItalianSubcategoryStripper | |
Item | Abstract class for parse items. |
IterativeCKYPCFGParser | Does iterative deepening search inside the CKY algorithm for faster parsing. |
Lattice | |
LatticeEdge | |
LatticeXMLReader | |
LexicalizedParser | This class provides the top-level API and command-line interface to a set of reasonably good treebank-trained parsers. |
LexicalizedParserQuery | |
LinearGrammarSmoother | Implements linear rule smoothing a la Petrov et al. |
MaxMatchSegmenter | A word-segmentation scheme using the max-match algorithm. |
MLEDependencyGrammar | |
MLEDependencyGrammarExtractor | Gathers statistics on tree dependencies and then passes them to an MLEDependencyGrammar for dependency grammar construction. |
NegraPennCollinizer | |
NegraPennTreebankParserParams | Parameter file for parsing the Penn Treebank format of the Negra Treebank (German). |
NodePruner | Gets rid of extra NP under NP nodes. |
Options | This class contains options to the parser which MUST be the SAME at both training and testing (parsing) time in order for the parser to work properly. |
Options.LexOptions | |
OutsideRuleFilter | This class is currently unused. |
ParentAnnotationStats | See what parent annotation helps in treebank, based on support and KL divergence. |
ParseFiles | Runs the parser over a set of files. |
RerankingParserQuery | Rerank trees from the ParserQuery based on scores from a Reranker. |
SisterAnnotationStats | See what sister annotation helps in treebank, based on support and KL divergence. |
SpanishTreebankParserParams | TreebankLangParserParams for the AnCora corpus. |
SpanishUnknownWordModel | |
SpanishUnknownWordModelTrainer | |
SplittingGrammarExtractor | This class is a reimplementation of Berkeley's state splitting grammar. |
TaggerReranker | Gives a score to a Tree based on how well it matches the output of a tagger. |
TestOptions | Options to the parser which affect performance only at testing (parsing) time. |
TestTagProjection | |
TrainOptions | Non-language-specific options for training a grammar from a treebank. |
TreeAnnotator | Performs non-language specific annotation of Trees. |
TreeAnnotatorAndBinarizer | |
TreebankAnnotator | Class for getting an annotated treebank. |
TreeBinarizer | Binarizes trees, typically in such a way that head-argument structure is respected. |
TreeCollinizer | Does detransformations to a parsed sentence to map it back to the standard treebank form for output or evaluation. |
TregexPoweredTreebankParserParams | An extension of AbstractTreebankParserParams which provides support for Tregex-powered annotations. |
TregexPoweredTreebankParserParams.AnnotateHeadFunction | Annotate a tree constituent with its lexical head. |
TregexPoweredTreebankParserParams.SimpleStringFunction | Annotates all nodes that match the tregex query with some string. |
TueBaDZParserParams | TreebankLangParserParams for the German Tuebingen corpus. |
UnaryGrammar | Maintains efficient indexing of unary grammar rules. |
UnaryRule | Unary grammar rules (with ints for parent and child). |
UnknownGTTrainer | This class trains a Good-Turing model for unknown words from a collection of trees. |

Enum | Description |
---|---|
TrainOptions.TransformMatrixType | |

This package contains implementations of three probabilistic parsers for natural language text. There is an accurate unlexicalized probabilistic context-free grammar (PCFG) parser, a probabilistic lexical dependency parser, and a factored, lexicalized probabilistic context-free grammar parser, which does joint inference over the product of the first two parsers. The parser supports various languages and input formats.

For English, for most purposes, we now recommend just using the unlexicalized PCFG. With a well-engineered grammar (as supplied for English), it is fast, accurate, and requires much less memory. In many real-world uses, lexical preferences are unavailable or inaccurate across domains or genres, and the unlexicalized parser will perform just as well as a lexicalized parser. However, the factored parser will sometimes provide greater accuracy on English through knowledge of lexical dependencies. Moreover, it is considerably better than the PCFG parser alone for most other languages (with less rigid word order), including German, Chinese, and Arabic. The dependency parser can be run alone, but this is usually not useful (its accuracy is much lower).

The output of the parser can be presented in various forms, such as just part-of-speech tags, phrase structure trees, or dependencies, and is controlled by options passed to the TreePrint class.

The factored parser and the unlexicalized PCFG parser are described in:
The factored parser uses a product model, where the preferences of an unlexicalized PCFG parser and a lexicalized dependency parser are combined by a third parser, which does exact search using A* outside estimates (which are Viterbi outside scores, precalculated during PCFG and dependency parsing of the sentence).
We have been splitting up the parser into public classes, but some of
the internals are still contained in the file FactoredParser.java.
The class LexicalizedParser provides an interface for either
training a parser from a treebank, or parsing text using a saved
parser. It can be called programmatically, or the command-line main()
method supports many options.
The parser has been ported to multiple languages. German, Chinese, and Arabic grammars are included. The first publication below documents the Chinese parser. The German parser was developed for and used in the second paper (but the paper contains very little detail on it).
The grammatical relations output of the parser is presented in:
You need Java 1.6+ installed on your system, and java in your PATH
where commands are looked for.
You need a machine with a fair amount of memory. Required memory depends on the choice of parser, the size of the grammar, and other factors like the presence of numerous unknown words. To run the PCFG parser on sentences of up to 40 words you need 100 MB of memory. To be able to handle longer sentences, you need more (to parse sentences up to 100 words, you need 400 MB). For running the Factored Parser, 600 MB is needed for dealing with sentences up to 40 words. Factored parsing of sentences up to 200 words requires around 3 GB of memory. Training a new lexicalized parser requires about 1500 MB of memory; much less is needed for training a PCFG.
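The figures above can be collected into a small lookup, which may be handy when choosing a heap flag for a batch job. This is only a sketch: `HeapGuide`, `documentedHeapMb` and `heapFlag` are hypothetical names that are not part of the parser, a length between two quoted limits is rounded up to the next quoted limit, and "around 3 GB" is taken as 3000 MB.

```java
import java.util.OptionalInt;

/** Hypothetical helper (not part of the parser): looks up the heap sizes quoted above. */
class HeapGuide {

    /**
     * Returns the heap size in MB quoted in the text for parsing sentences of up to
     * maxWords words, or an empty value when the text gives no figure for that case.
     * Lengths between two quoted limits are rounded up to the next limit (an assumption).
     */
    static OptionalInt documentedHeapMb(boolean factored, int maxWords) {
        if (factored) {
            if (maxWords <= 40) return OptionalInt.of(600);
            if (maxWords <= 200) return OptionalInt.of(3000); // "around 3 GB"
        } else {
            if (maxWords <= 40) return OptionalInt.of(100);
            if (maxWords <= 100) return OptionalInt.of(400);
        }
        return OptionalInt.empty(); // no figure is given beyond these lengths
    }

    /** Formats a size as a JVM heap flag in the -mx style used in this document. */
    static String heapFlag(int megabytes) {
        return "-mx" + megabytes + "m";
    }

    public static void main(String[] args) {
        OptionalInt mb = documentedHeapMb(false, 40);
        System.out.println(mb.isPresent() ? heapFlag(mb.getAsInt()) : "no documented figure");
    }
}
```

Real requirements also depend on the grammar and on unknown words, as noted above, so these values are best treated as starting points.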
For just parsing text, you need a saved parser model (grammars, lexicon,
etc.), which can be represented either as a text file or as a binary
(Java serialized object) representation, and which can be gzip
compressed. A number of models are provided in the supplied
stanford-parser-$VERSION-models.jar file in the distributed version,
and can be accessed from there by having this jar file on your
CLASSPATH and specifying them via a classpath entry such as
edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz.
(Stanford NLP people can also find the grammars in the directory
/u/nlp/data/lexparser.) Other available grammars include
englishFactored.ser.gz for English, and chineseFactored.ser.gz
for Chinese.
You need the parser code and grammars accessible. This can be done by
having the supplied jar files on your CLASSPATH. The examples below
assume you are in the parser distribution home directory. From there
you can set up the classpath with the command-line argument -cp "*"
(or perhaps -cp "*;" on certain versions of Windows).
Then if you have some sentences in testsent.txt (as plain text), the
following commands should work.
Parsing a local text file:
java -mx100m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser
edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz testsent.txt
Parsing a document over the web:
java -mx100m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser
-maxLength 40 edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz https://nlp.stanford.edu/software/lex-parser.html
Note the -maxLength flag: this will set a maximum length
sentence to parse. If you do not set one, the parser will try to parse
sentences up to any length, but will usually run out of memory when
trying to do this. This is important with web pages with text that may
not be real sentences (or just with technical documents that turn out to
have 300 word sentences).
The parser does only very rudimentary stripping of HTML tags, so
it will work okay on plain text web pages, but it won't work
adequately on most complex commercial script-driven pages. If you
want to handle these, you'll need to provide your own preprocessor,
and then call the parser on its output.
The parser will send parse trees to stdout and other information on
what it is doing to stderr, so one commonly wants to direct just
stdout to an output file, in the standard way.
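When the parser is launched from another Java program rather than from a shell, the same separation can be sketched with `ProcessBuilder`. The class and method names below (`ParserLauncher`, `parserCommand`) are hypothetical; the command line mirrors the local-file example above, and the sketch assumes Java 7 or later for `ProcessBuilder.Redirect` and a working directory equal to the parser distribution home.

```java
import java.io.File;
import java.util.Arrays;

/** Hypothetical launcher: sends parse trees (stdout) to a file, leaves stderr on the console. */
class ParserLauncher {

    static ProcessBuilder parserCommand(String model, String input, File treesOut) {
        // Same arguments as the command-line example; "*" is passed literally and
        // expanded by the java launcher itself as a classpath wildcard.
        ProcessBuilder pb = new ProcessBuilder(Arrays.asList(
                "java", "-mx100m", "-cp", "*",
                "edu.stanford.nlp.parser.lexparser.LexicalizedParser",
                model, input));
        pb.redirectOutput(ProcessBuilder.Redirect.to(treesOut)); // trees go to the file
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);       // progress stays visible
        return pb;
    }

    public static void main(String[] args) {
        ProcessBuilder pb = parserCommand(
                "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz",
                "testsent.txt", new File("trees.txt"));
        // Only prints the command; call pb.start().waitFor() to actually run it.
        System.out.println(String.join(" ", pb.command()));
    }
}
```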
Parsing a Chinese sentence (in the default input encoding for Chinese of GB18030 - note you'll need the right fonts to see the output correctly):
java -mx100m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -tLPP
edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams
edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz chinese-onesent
or for Unicode (UTF-8) format files:
java -mx100m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -tLPP
edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams
-encoding UTF-8 edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz chinese-onesent-utf
For Chinese, the package includes two simple word segmenters. One is a
lexicon-based maximum match segmenter, and the other uses the parser to
do Hidden Markov Model-based word segmentation. These segmentation
methods are okay, but if you would like a high quality segmentation of
Chinese text, you will have to segment the Chinese by yourself as a
preprocessing step. The supplied grammars assume that
Chinese input has already been word-segmented according to Penn
Chinese Treebank conventions. Choosing Chinese with
-tLPP edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams
makes space-separated words the default tokenization.
To do word segmentation within the parser, give one of the options
-segmentMarkov or -segmentMaxMatch.
The parser also supports other languages including German and French.
The program has many options. The most useful end-user option is
-maxLength n, which determines the maximum length sentence that the
parser will parse. Longer sentences are skipped, with a message
printed to stderr.
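The skipping behaviour can be pictured with a small stand-alone sketch. It is not the parser's own code: `MaxLengthFilter` and `keepParseable` are hypothetical names, the wording of the message is invented, and the sketch assumes that length is counted in tokens and that a sentence of exactly n tokens is still parsed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of -maxLength: over-long sentences are skipped, not truncated. */
class MaxLengthFilter {

    static List<List<String>> keepParseable(List<List<String>> sentences, int maxLength) {
        List<List<String>> kept = new ArrayList<>();
        for (List<String> sentence : sentences) {
            if (sentence.size() <= maxLength) {
                kept.add(sentence);
            } else {
                // Diagnostics go to stderr so that stdout carries only results.
                System.err.println("Skipping sentence of length " + sentence.size()
                        + " (maximum is " + maxLength + ")");
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> input = Arrays.asList(
                Arrays.asList("This", "is", "short", "."),
                Arrays.asList("This", "one", "is", "a", "little", "bit", "longer", "."));
        System.out.println(keepParseable(input, 5));
    }
}
```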
The parser supports many different input formats: tokenized/not, sentences/not, and tagged/not.
The input may be tokenized or not, and users may supply their own
tokenizers. The input is by default assumed to not be tokenized; if the
input is tokenized, supply the option -tokenized. If the
input is not tokenized, you may supply the name of a tokenizer class
with -tokenizer tokenizerClassName; otherwise the default
tokenizer (edu.stanford.nlp.process.PTBTokenizer) is
used. This tokenizer should perform well over typical plain
newswire-style text.
The input may have already been split into sentences or not. The input
is by default assumed to be not split; if sentences are split, supply
the option -sentences delimitingToken, where the delimiting token
may be any string. As a special case, if the delimiting token
is "newline" the parser will assume that each line of the
file is a sentence.
Simple XML can also be parsed. The main method does not incorporate an XML
parser, but one can fake certain simple cases with the
-parseInside regex option, which will only parse the tokens inside
elements matched by the regular expression regex. These
elements are assumed to be pure CDATA.
If you use -parseInside s, then the parser will accept
input in which sentences are marked XML-style with
<s> ... </s> (the same format as the input to
Eugene Charniak's parser).
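To make the idea concrete, here is a rough stand-alone sketch of selecting the text inside elements whose name matches a regular expression. It is not the parser's implementation: `ParseInsideSketch` and `textInside` are hypothetical names, and the sketch assumes simple attribute-free, non-nested tags and Java 7 or later (for named groups).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of the -parseInside idea using plain regular expressions. */
class ParseInsideSketch {

    /** Returns the text found inside each element whose name matches elementRegex. */
    static List<String> textInside(String input, String elementRegex) {
        // Named groups keep the back-reference valid even if elementRegex has its own groups.
        Pattern p = Pattern.compile(
                "<(?<name>" + elementRegex + ")>(?<body>.*?)</\\k<name>>",
                Pattern.DOTALL);
        Matcher m = p.matcher(input);
        List<String> spans = new ArrayList<>();
        while (m.find()) {
            spans.add(m.group("body").trim());
        }
        return spans;
    }

    public static void main(String[] args) {
        String doc = "<doc><s>This is one .</s> ignored text <s>And two .</s></doc>";
        for (String sentence : textInside(doc, "s")) {
            System.out.println(sentence);
        }
    }
}
```

Text outside the matched elements ("ignored text" in the usage example) is dropped, which corresponds to parsing only the tokens inside the matched elements.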
Finally, the input may be tagged or not. If it is tagged, the program
assumes that words and tags are separated by a non-whitespace
separating character such as '/' or '_'. You give the option
-tagSeparator tagSeparator to specify tagged text with a
tag separator. You also need to tell the parser to use a different
tokenizer, using the flags
-tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory
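The shape of such tagged input can be illustrated with a stand-alone sketch that splits whitespace-separated tokens into word and tag. It is not the parser's code: `TaggedTokenSketch` and its methods are hypothetical, and splitting at the last occurrence of the separator is an assumption made here so that words which themselves contain the separator keep it.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of word/TAG input with a configurable tag separator. */
class TaggedTokenSketch {

    /** Returns {word, tag}; tag is null when the token carries no tag. */
    static String[] splitTagged(String token, String tagSeparator) {
        int i = token.lastIndexOf(tagSeparator); // assumption: the last separator wins
        if (i <= 0 || i + tagSeparator.length() >= token.length()) {
            return new String[] { token, null }; // no usable separator: untagged word
        }
        return new String[] {
            token.substring(0, i),
            token.substring(i + tagSeparator.length())
        };
    }

    /** Splits a line on whitespace, then splits each token into word and optional tag. */
    static List<String[]> readTaggedLine(String line, String tagSeparator) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(splitTagged(token, tagSeparator));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] pair : readTaggedLine("The/DT dog barks/VBZ ./.", "/")) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

Tokens without a separator come back with a null tag, which matches the idea of partially tagged input where the remaining tags are left for the parser to choose.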
You can see examples of many of these options in the test directory.
As an example, you can parse the example file with partial POS-tagging
with this command:
java edu.stanford.nlp.parser.lexparser.LexicalizedParser -maxLength 20 -sentences newline -tokenized -tagSeparator / -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory englishPCFG.ser.gz pos-sentences.txt
There are some restrictions on the interpretation of POS-tagged input:
For the examples in pos-sentences.txt:
Note that if the program is reading tags correctly, they aren't printed in the sentence it says it is parsing. Only the words are printed there.
You can set how sentences are printed out by using the
-outputFormat format option. The native and default format is as
trees are formatted in the Penn Treebank, but there are a number of
other useful options:

- penn: The default.
- oneline: Printed out on one line.
- wordsAndTags: Use the parser as a POS tagger.
- latexTree: Help write your LaTeX papers (for use with Avery Andrews' trees.sty package).
- typedDependenciesCollapsed: Write sentences in a typed dependency format that represents sentences via grammatical relations between words. Suitable for representing text as a semantic network.

You can get each sentence printed in multiple formats by giving a comma-separated list of formats. See the TreePrint class for more information on available output formats and options.
LexicalizedParser can be easily called within a larger
application. It implements a couple of useful interfaces that
provide for simple use: edu.stanford.nlp.parser.ViterbiParser
and edu.stanford.nlp.process.Function.
The following simple class shows typical usage:

    import java.util.*;
    import edu.stanford.nlp.ling.*;
    import edu.stanford.nlp.trees.*;
    import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

    class ParserDemo {
      public static void main(String[] args) {
        // Load a saved model from the classpath and set options programmatically.
        LexicalizedParser lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");
        lp.setOptionFlags(new String[]{"-maxLength", "80", "-retainTmpSubcategories"});

        // Parse an already tokenized sentence and print the Penn Treebank style tree.
        String[] sent = { "This", "is", "an", "easy", "sentence", "." };
        List<CoreLabel> rawWords = Sentence.toCoreLabelList(sent);
        Tree parse = lp.apply(rawWords);
        parse.pennPrint();
        System.out.println();

        // Convert the tree to typed dependencies.
        TreebankLanguagePack tlp = new PennTreebankLanguagePack();
        GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
        GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
        List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
        System.out.println(tdl);
        System.out.println();

        // Print the same parse in several output formats at once.
        TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
        tp.printTree(parse);
      }
    }

In a usage such as this, the parser expects sentences already
tokenized according to Penn Treebank conventions. For arbitrary text,
prior processing must be done to achieve such tokenization (the
main method of LexicalizedParser provides an
example of doing this). The example shows how most command-line
arguments can also be passed to the parser when called
programmatically. Note that using the -retainTmpSubcategories
option is necessary to get the best results in the typed dependencies
output recognizing temporal noun phrases ("last week", "next February").
The following code fragment includes tokenization using Penn Treebank conventions:

    import java.io.StringReader;
    import java.util.List;
    import edu.stanford.nlp.trees.Tree;
    import edu.stanford.nlp.objectbank.TokenizerFactory;
    import edu.stanford.nlp.process.CoreLabelTokenFactory;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.process.PTBTokenizer;
    import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

    // Fragment: the declarations below belong inside your own class and setup code.
    LexicalizedParser lp = LexicalizedParser.loadModel("englishPCFG.ser.gz");
    lp.setOptionFlags(new String[]{"-outputFormat", "penn,typedDependenciesCollapsed", "-retainTmpSubcategories"});
    TokenizerFactory<CoreLabel> tokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(), "");

    // Tokenize raw text with Penn Treebank conventions, then parse it.
    public Tree processSentence(String sentence) {
      List<CoreLabel> rawWords = tokenizerFactory.getTokenizer(new StringReader(sentence)).tokenize();
      Tree bestParse = lp.parseTree(rawWords);
      return bestParse;
    }

A trained parser consists of grammars, a lexicon, and option values. Once a parser has been trained, it may be written to file in one of two formats: binary serialized Java objects or human readable text data. A parser can also be quickly reconstructed (either programmatically or at the command line) from files containing a parser in either of these formats.
The binary serialized Java
objects format is created using standard tools provided by the java.io
package, and is not text, and not human-readable. To train and then save a parser
as a binary serialized objects file, use a command line invocation of the form:
java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser
-train trainFilePath [fileRange] -saveToSerializedFile outputFilePath
The text data format is human readable and modifiable, and consists of four sections, appearing in the following order:
Each section is headed by a line consisting of multiple asterisks (*) and the name of the section. Note that the file format does not support rules of arbitrary arity, only binary and unary rules. To train and then save a parser as a text data file, use a command line invocation of the form:
java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser
-train trainFilePath start stop -saveToTextFile outputFilePath
To parse a file with a saved parser, either in text data or serialized data format, use a command line invocation of the following form:
java -mx500m edu.stanford.nlp.parser.lexparser.LexicalizedParser
parserFilePath test.txt
If you want to use the text grammars in another parser and duplicate our performance, you will need to know how we handle the POS tagging of rare and unknown words:
For more information, you should next look at the Javadocs for the
LexicalizedParser class. In particular, the main method of
that class documents more precisely a number of the input preprocessing
options that were presented chattily above.