edu.stanford.nlp.parser.lexparser (Stanford JavaNLP API)

Interface Summary
Interface	Description
AbstractCollinizer	Interface for the Collinizers. TODO: pass in both the guess and the gold.
DependencyGrammar	An interface for DependencyGrammars.
Extractor<T>
GrammarProjection	Maps between the states of a more split and less split grammar.
LatticeScorer
Lexicon	An interface for lexicons interfacing to lexparser.
Reranker	A scorer which the RerankingParserQuery can use to rescore sentences.
RerankerQuery	Process a Tree and return a score.
Rule	Interface for int-format grammar rules.
Scorer	Interface for supporting A* scoring.
TagProjection	An interface for projecting POS tags onto a reduced set for the dependency grammar.
TreebankLangParserParams	Contains language-specific methods commonly necessary to get a parser to parse an arbitrary treebank.
UnknownWordModel	This class defines the runtime interface for unknown words in lexparser.
UnknownWordModelTrainer	An interface for training an UnknownWordModel.
WordFeatureExtractor	An interface for getting features out of words for a feature-based lexicon.

Class Summary
Class	Description
AbstractDependencyGrammar	An abstract base class for dependency grammars.
AbstractTreebankParserParams	An abstract class providing a common method base from which to complete a `TreebankLangParserParams` implementing class.
AbstractTreebankParserParams.AnnotatePunctuationFunction	Annotation function for mapping punctuation to PTB-style equivalence classes.
AbstractTreeExtractor<T>	An abstract superclass for parser classes that extract counts from Trees.
AbstractUnknownWordModelTrainer
AddTaggerToParser	A simple tool to add a tagger to the parser for reranking purposes.
ArabicTreebankParserParams	A `TreebankLangParserParams` implementing class for the Penn Arabic Treebank.
ArabicUnknownWordModel	This is a basic unknown word model for Arabic.
ArabicUnknownWordModelTrainer
BaseLexicon	This is the default concrete instantiation of the Lexicon interface.
BaseUnknownWordModel	An unknown word model for a generic language.
BaseUnknownWordModelTrainer
BasicCategoryTagProjection
BiLexPCFGParser	Implements Eisner and Satta style algorithms for bilexical PCFG parsing.
BiLexPCFGParser.N5BiLexPCFGParser
BinaryGrammar	Maintains efficient indexing of binary grammar rules.
BinaryGrammarExtractor
BinaryHeadFinder
BinaryRule	Binary rules (ints for parent, left and right children)
BoundaryRemover	Removes a boundary symbol (Lexicon.BOUNDARY_TAG or Lexicon.BOUNDARY), which is the rightmost daughter of a tree.
ChineseCharacterBasedLexicon
ChineseCharacterBasedLexiconTraining	Includes a main file which trains a ChineseCharacterBasedLexicon.
ChineseLexicon	A lexicon class for Chinese.
ChineseLexiconAndWordSegmenter	This class lets you train a lexicon and segmenter at the same time.
ChineseMarkovWordSegmenter	Performs word segmentation with a hierarchical Markov model over POS and over characters given POS.
ChineseMaxentLexicon	A Lexicon class that computes the score of word|tag according to a maxent model of tag|word (divided by MLE estimate of P(tag)).
ChineseSimWordAvgDepGrammar	A Dependency grammar that smooths by averaging over similar words.
ChineseTreebankParserParams	Parameter file for parsing the Penn Chinese Treebank.
ChineseUnknownWordModel	Stores, trains, and scores with an unknown word model.
ChineseUnknownWordModelTrainer
ChineseWordFeatureExtractor
CNFTransformers
CollinsPuncTransformer	This class manipulates punctuation in trees (used with training trees) in the same manner that Collins manipulated punctuation in trees when building his parsing model.
Debinarizer	Debinarizes a binary tree from the parser.
Edge	Class for parse edges.
EnglishTreebankParserParams	Parser parameters for the Penn English Treebank (WSJ, Brown, Switchboard).
EnglishTreebankParserParams.EnglishTest
EnglishTreebankParserParams.EnglishTrain
EnglishUnknownWordModel	This is a basic unknown word model for English.
EnglishUnknownWordModelTrainer
ExactGrammarCompactor
ExhaustiveDependencyParser	An exhaustive O(n⁴t²) time and O(n²t) space dependency parser.
ExhaustivePCFGParser	An exhaustive generalized CKY PCFG parser.
FactoredLexicon
FactoredLexiconEvent
FactoredParser
FastFactoredParser	Provides a much faster way to realize the factored parsing idea, including easily returning "k good" results at the expense of optimality.
FrenchTreebankParserParams	TreebankLangParserParams for the French Treebank corpus.
FrenchUnknownWordModel
FrenchUnknownWordModelTrainer
GenericTreebankParserParams
GermanUnknownWordModel	An unknown word model for German; relies on BaseUnknownWordModel plus number matching.
GermanUnknownWordModelTrainer
GrammarCompactionTester
GrammarCompactor
GrammarCoverageChecker	Checks the coverage of rules in a grammar on a test treebank.
HebrewTreebankParserParams	Initial version of a parser pack for the HTB.
Hook	Class for parse table hooks.
HookChart	A HookChart is a chart data structure designed for use with the efficient O(n^4) chart parsing mechanisms targeted at lexicalized parsing, which were introduced by Eisner and Satta.
HTKLatticeReader
HTKLatticeReader.LatticeWord
HungarianTreebankParserParams	Bare-bones implementation of a ParserParams for the Hungarian SPMRL treebank.
HungarianTreebankParserParams.HungarianSubcategoryStripper
IntDependency	Maintains a dependency between head and dependent where they are each an IntTaggedWord.
IntTaggedWord	Represents a WordTag (in the sense that equality is defined on both components), where each half is represented by an int indexed by an Index.
ItalianTreebankParserParams	Bare-bones implementation of a ParserParams for the Italian Turin treebank.
ItalianTreebankParserParams.ItalianSubcategoryStripper
Item	Abstract class for parse items.
IterativeCKYPCFGParser	Does iterative deepening search inside the CKY algorithm for faster parsing.
Lattice
LatticeEdge
LatticeXMLReader
LexicalizedParser	This class provides the top-level API and command-line interface to a set of reasonably good treebank-trained parsers.
LexicalizedParserQuery
LinearGrammarSmoother	Implements linear rule smoothing a la Petrov et al.
MaxMatchSegmenter	A word-segmentation scheme using the max-match algorithm.
MLEDependencyGrammar
MLEDependencyGrammarExtractor	Gathers statistics on tree dependencies and then passes them to an MLEDependencyGrammar for dependency grammar construction.
NegraPennCollinizer
NegraPennTreebankParserParams	Parameter file for parsing the Penn Treebank format of the Negra Treebank (German).
NodePruner	Gets rid of extra NP under NP nodes.
Options	This class contains options to the parser which MUST be the SAME at both training and testing (parsing) time in order for the parser to work properly.
Options.LexOptions
OutsideRuleFilter	This class is currently unused.
ParentAnnotationStats	See what parent annotation helps in treebank, based on support and KL divergence.
ParseFiles	Runs the parser over a set of files.
RerankingParserQuery	Rerank trees from the ParserQuery based on scores from a Reranker.
SisterAnnotationStats	See what sister annotation helps in treebank, based on support and KL divergence.
SpanishTreebankParserParams	TreebankLangParserParams for the AnCora corpus.
SpanishUnknownWordModel
SpanishUnknownWordModelTrainer
SplittingGrammarExtractor	This class is a reimplementation of Berkeley's state splitting grammar.
TaggerReranker	Gives a score to a Tree based on how well it matches the output of a tagger.
TestOptions	Options to the parser which affect performance only at testing (parsing) time.
TestTagProjection
TrainOptions	Non-language-specific options for training a grammar from a treebank.
TreeAnnotator	Performs non-language specific annotation of Trees.
TreeAnnotatorAndBinarizer
TreebankAnnotator	Class for getting an annotated treebank.
TreeBinarizer	Binarizes trees, typically in such a way that head-argument structure is respected.
TreeCollinizer	Does detransformations to a parsed sentence to map it back to the standard treebank form for output or evaluation.
TregexPoweredTreebankParserParams	An extension of `AbstractTreebankParserParams` which provides support for Tregex-powered annotations.
TregexPoweredTreebankParserParams.AnnotateHeadFunction	Annotate a tree constituent with its lexical head.
TregexPoweredTreebankParserParams.SimpleStringFunction	Annotates all nodes that match the tregex query with some string.
TueBaDZParserParams	TreebankLangParserParams for the German Tuebingen corpus.
UnaryGrammar	Maintains efficient indexing of unary grammar rules.
UnaryRule	Unary grammar rules (with ints for parent and child).
UnknownGTTrainer	This class trains a Good-Turing model for unknown words from a collection of trees.

Enum Summary
Enum	Description
TrainOptions.TransformMatrixType

Package edu.stanford.nlp.parser.lexparser Description

This package contains implementations of three probabilistic parsers for natural language text: an accurate unlexicalized probabilistic context-free grammar (PCFG) parser, a probabilistic lexical dependency parser, and a factored, lexicalized probabilistic context-free grammar parser, which does joint inference over the product of the first two parsers. The parser supports various languages and input formats.

For English, for most purposes, we now recommend just using the unlexicalized PCFG. With a well-engineered grammar (as supplied for English), it is fast and accurate and requires much less memory. In many real-world uses, lexical preferences are unavailable or inaccurate across domains or genres, and the unlexicalized parser will perform just as well as a lexicalized parser. However, the factored parser will sometimes provide greater accuracy on English through knowledge of lexical dependencies. Moreover, it is considerably better than the PCFG parser alone for most other languages (with less rigid word order), including German, Chinese, and Arabic. The dependency parser can be run alone, but this is usually not useful (its accuracy is much lower).

The output of the parser can be presented in various forms, such as just part-of-speech tags, phrase structure trees, or dependencies, and is controlled by options passed to the TreePrint class.

References

The factored parser and the unlexicalized PCFG parser are described in:

Dan Klein and Christopher D. Manning. 2002. Fast Exact Inference with a Factored Model for Natural Language Parsing. Advances in Neural Information Processing Systems 15 (NIPS 2002).
Dan Klein and Christopher D. Manning. 2003. Accurate Unlexicalized Parsing. Proceedings of the Association for Computational Linguistics, 2003.

The factored parser uses a product model, where the preferences of an unlexicalized PCFG parser and a lexicalized dependency parser are combined by a third parser, which does exact search using A* outside estimates (which are Viterbi outside scores, precalculated during PCFG and dependency parsing of the sentence).
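As a rough illustration of the product model and A* ordering described above, the sketch below works in log space: an edge's priority is its inside score plus an outside estimate from each of the two component models, and the agenda pops the highest priority first. The class name, field names, and numbers are hypothetical and chosen only for this example; this is not the code in FactoredParser.java.

```java
import java.util.PriorityQueue;

final class FactoredScoreSketch {
    /** A hypothetical parse edge carrying log-space scores. */
    static final class Edge {
        final String label;
        final double insideLogScore;  // combined (PCFG + dependency) log score inside the span
        final double pcfgOutsideLog;  // precomputed Viterbi outside log score from the PCFG
        final double depOutsideLog;   // precomputed Viterbi outside log score from the dependency model

        Edge(String label, double insideLogScore, double pcfgOutsideLog, double depOutsideLog) {
            this.label = label;
            this.insideLogScore = insideLogScore;
            this.pcfgOutsideLog = pcfgOutsideLog;
            this.depOutsideLog = depOutsideLog;
        }

        /** A product of probabilities becomes a sum of log scores. */
        double priority() {
            return insideLogScore + pcfgOutsideLog + depOutsideLog;
        }
    }

    /** Agenda that returns the edge with the highest priority first. */
    static PriorityQueue<Edge> agenda() {
        return new PriorityQueue<Edge>((a, b) -> Double.compare(b.priority(), a.priority()));
    }

    public static void main(String[] args) {
        PriorityQueue<Edge> agenda = agenda();
        agenda.add(new Edge("NP[0,2]", -4.0, -3.0, -2.5));
        agenda.add(new Edge("VP[2,5]", -2.0, -6.0, -3.0));
        agenda.add(new Edge("S[0,5]", -7.0, 0.0, 0.0));
        while (!agenda.isEmpty()) {
            Edge e = agenda.poll();
            System.out.println(e.label + " priority=" + e.priority());
        }
    }
}
```

Because each outside estimate is the score of the best completion under one model alone, their sum never underestimates the best achievable combined score, which is what lets a best-first search of this kind remain exact.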

We have been splitting up the parser into public classes, but some of the internals are still contained in the file FactoredParser.java.

The class LexicalizedParser provides an interface for either training a parser from a treebank, or parsing text using a saved parser. It can be called programmatically, or from the command line, where its main() method supports many options.

The parser has been ported to multiple languages. German, Chinese, and Arabic grammars are included. The first publication below documents the Chinese parser. The German parser was developed for and used in the second paper (but the paper contains very little detail on it).

Roger Levy and Christopher D. Manning. 2003. Is it harder to parse Chinese, or the Chinese Treebank? ACL 2003, pp. 439-446.
Roger Levy and Christopher D. Manning. 2004. Deep dependencies from context-free statistical parsers: correcting the surface dependency approximation. ACL 2004, pp. 328-335.

The grammatical relations output of the parser is presented in:

Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. 2006. Generating Typed Dependency Parses from Phrase Structure Parses. LREC 2006.

End user usage

Requirements

You need Java 1.6+ installed on your system, and java in your PATH where commands are looked for.

You need a machine with a fair amount of memory. Required memory depends on the choice of parser, the size of the grammar, and other factors like the presence of numerous unknown words. To run the PCFG parser on sentences of up to 40 words you need 100 MB of memory. To be able to handle longer sentences, you need more (to parse sentences up to 100 words, you need 400 MB). For running the Factored Parser, 600 MB is needed for dealing with sentences up to 40 words. Factored parsing of sentences up to 200 words requires around 3 GB of memory. Training a new lexicalized parser requires about 1500 MB of memory; much less is needed for training a PCFG.

For just parsing text, you need a saved parser model (grammars, lexicon, etc.), which can be represented either as a text file or as a binary (Java serialized object) representation, and which can be gzip compressed. Several models are provided in the stanford-parser-$VERSION-models.jar file supplied with the distributed version; you can access them by having this jar file on your CLASSPATH and specifying them via a classpath entry such as: edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz. (Stanford NLP people can also find the grammars in the directory /u/nlp/data/lexparser.) Other available grammars include englishFactored.ser.gz for English, and chineseFactored.ser.gz for Chinese.

You need the parser code and grammars accessible. This can be done by having the supplied jar files on your CLASSPATH. The examples below assume you are in the parser distribution home directory. From there you can set up the classpath with the command-line argument -cp "*" (or perhaps -cp "*;" on certain versions of Windows). Then if you have some sentences in testsent.txt (as plain text), the following commands should work.

Command-line parsing usage

Parsing a local text file:

java -mx100m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz testsent.txt

Parsing a document over the web:

java -mx100m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -maxLength 40 edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz https://nlp.stanford.edu/software/lex-parser.html

Note the -maxLength flag: this will set a maximum length sentence to parse. If you do not set one, the parser will try to parse sentences up to any length, but will usually run out of memory when trying to do this. This is important with web pages with text that may not be real sentences (or just with technical documents that turn out to have 300 word sentences). The parser just does very rudimentary stripping of HTML tags, and so it'll work okay on plain text web pages, but it won't work adequately on most complex commercial script-driven pages. If you want to handle these, you'll need to provide your own preprocessor, and then to call the parser on its output.

The parser will send parse trees to stdout and other information on what it is doing to stderr, so one commonly wants to direct just stdout to an output file, in the standard way.
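If you launch the command-line parser from another Java program rather than from a shell, the same stdout/stderr split can be set up with java.lang.ProcessBuilder. The following is a minimal sketch under that assumption; the class name, method name, and output file name are invented for the example, and the flags are the ones shown in the commands above.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

final class ParserLauncherSketch {
    /** Builds (but does not start) a parser process whose stdout goes to a file. */
    static ProcessBuilder parserCommand(String model, String input, File stdoutFile) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-mx100m");
        cmd.add("-cp");
        cmd.add("*");  // no shell is involved, so the wildcard reaches the JVM unquoted
        cmd.add("edu.stanford.nlp.parser.lexparser.LexicalizedParser");
        cmd.add(model);
        cmd.add(input);
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.redirectOutput(stdoutFile);                      // parse trees
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);  // progress messages stay on the console
        return pb;
    }

    public static void main(String[] args) {
        ProcessBuilder pb = parserCommand(
            "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz",
            "testsent.txt",
            new File("parses.txt"));
        // Print the command instead of running it; pb.start().waitFor() would execute it.
        System.out.println(String.join(" ", pb.command()));
    }
}
```

The process must be started from the parser distribution home directory (for example via pb.directory(...)) for the "*" classpath to pick up the supplied jar files.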

Other languages: Chinese

Parsing a Chinese sentence (in the default input encoding for Chinese of GB18030 - note you'll need the right fonts to see the output correctly):

java -mx100m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -tLPP edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz chinese-onesent

or for Unicode (UTF-8) format files:

java -mx100m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -tLPP edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams -encoding UTF-8 edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz chinese-onesent-utf

For Chinese, the package includes two simple word segmenters. One is a lexicon-based maximum match segmenter, and the other uses the parser to do Hidden Markov Model-based word segmentation. These segmentation methods are okay, but if you would like a high quality segmentation of Chinese text, you will have to segment the Chinese by yourself as a preprocessing step. The supplied grammars assume that Chinese input has already been word-segmented according to Penn Chinese Treebank conventions. Choosing Chinese with -tLPP edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams makes space-separated words the default tokenization. To do word segmentation within the parser, give one of the options -segmentMarkov or -segmentMaxMatch.
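To make the idea of lexicon-based maximum matching concrete, here is a generic greedy forward max-match sketch. It is not the package's MaxMatchSegmenter; the class name, the toy lexicon, and the use of Latin letters in place of Chinese characters are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class MaxMatchSketch {
    /**
     * Greedy forward maximum match: at each position take the longest
     * lexicon entry that starts there, falling back to a single character.
     */
    static List<String> segment(String text, Set<String> lexicon, int maxWordLength) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int longest = Math.min(text.length(), i + maxWordLength);
            int matchEnd = i + 1;  // fallback: emit one character
            for (int j = longest; j > i; j--) {
                if (lexicon.contains(text.substring(i, j))) {
                    matchEnd = j;
                    break;
                }
            }
            words.add(text.substring(i, matchEnd));
            i = matchEnd;
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> lexicon = new HashSet<>(Arrays.asList(
            "the", "theta", "table", "bled", "own", "down", "there"));
        // The intended reading is "the table down there".
        System.out.println(segment("thetabledownthere", lexicon, 5));
    }
}
```

With this toy lexicon the greedy choice of "theta" at the start forces "bled own there" afterwards, which shows why such a segmenter is only adequate and why a dedicated segmenter is recommended for high-quality input.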

Other languages

The parser also supports other languages including German and French.

Command-line options

The program has many options. The most useful end-user option is -maxLength n, which determines the maximum length sentence that the parser will parse. Longer sentences are skipped, with a message printed to stderr.
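When calling the parser from your own code, a similar length filter can be applied before parsing. The sketch below is only an illustration of the skipping behaviour; the class and method names are hypothetical, the wording of the message is made up, and treating a sentence of exactly n tokens as parsable is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MaxLengthSketch {
    /** Keeps sentences of at most maxLength tokens; reports the others on stderr. */
    static List<List<String>> keepParsable(List<List<String>> sentences, int maxLength) {
        List<List<String>> kept = new ArrayList<>();
        for (List<String> sentence : sentences) {
            if (sentence.size() <= maxLength) {
                kept.add(sentence);
            } else {
                System.err.println("Skipping sentence of " + sentence.size()
                    + " tokens (limit " + maxLength + ")");
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> sentences = Arrays.asList(
            Arrays.asList("This", "is", "short", "."),
            Arrays.asList("This", "one", "is", "a", "bit", "too", "long", "."));
        System.out.println(keepParsable(sentences, 5));
    }
}
```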

Input formatting and tokenization options

The parser supports many different input formats: tokenized/not, sentences/not, and tagged/not.

The input may be tokenized or not, and users may supply their own tokenizers. The input is by default assumed to not be tokenized; if the input is tokenized, supply the option -tokenized. If the input is not tokenized, you may supply the name of a tokenizer class with -tokenizer tokenizerClassName; otherwise the default tokenizer (edu.stanford.nlp.process.PTBTokenizer) is used. This tokenizer should perform well over typical plain newswire-style text.

The input may have already been split into sentences or not. The input is by default assumed to be not split; if sentences are split, supply the option -sentences delimitingToken, where the delimiting token may be any string. As a special case, if the delimiting token is "newline" the parser will assume that each line of the file is a sentence.
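The effect of a delimiting token can be pictured with the following sketch, which splits whitespace-tokenized text into sentences. It is a simplified stand-in rather than the parser's own preprocessing: the class name is hypothetical, and dropping the delimiting token from the output is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SentenceSplitSketch {
    /** Splits text into sentences on a delimiting token, or per line for "newline". */
    static List<List<String>> splitSentences(String text, String delimitingToken) {
        List<List<String>> sentences = new ArrayList<>();
        if ("newline".equals(delimitingToken)) {
            // Special case: every non-empty line is one sentence.
            for (String line : text.split("\\r?\\n")) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    sentences.add(new ArrayList<>(Arrays.asList(trimmed.split("\\s+"))));
                }
            }
            return sentences;
        }
        List<String> current = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (token.equals(delimitingToken)) {
                if (!current.isEmpty()) {
                    sentences.add(current);
                }
                current = new ArrayList<>();
            } else {
                current.add(token);
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current);  // text after the last delimiter
        }
        return sentences;
    }

    public static void main(String[] args) {
        System.out.println(splitSentences("People can butter . Butter can melt .", "."));
        System.out.println(splitSentences("People can butter\nButter can melt\n", "newline"));
    }
}
```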

Simple XML can also be parsed. The main method does not incorporate an XML parser, but one can fake certain simple cases with the -parseInside regex which will only parse the tokens inside elements matched by the regular expression regex. These elements are assumed to be pure CDATA. If you use -parseInside s, then the parser will accept input in which sentences are marked XML-style with <s> ... </s> (the same format as the input to Eugene Charniak's parser).
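The kind of "faked" XML handling described above can be approximated with java.util.regex alone. The sketch below pulls out the text inside elements whose name matches a regular expression; it is an illustration only, the class and method names are invented, and it assumes (as the paragraph does) that the matched elements contain plain character data with no nesting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ParseInsideSketch {
    /** Returns the text content of every element whose name matches elementRegex. */
    static List<String> textInside(String input, String elementRegex) {
        // <name ...>body</name>, where the closing name must equal the opening one.
        Pattern element = Pattern.compile(
            "<(?<name>" + elementRegex + ")(?:\\s[^>]*)?>(?<body>.*?)</\\k<name>\\s*>",
            Pattern.DOTALL);
        List<String> bodies = new ArrayList<>();
        Matcher m = element.matcher(input);
        while (m.find()) {
            bodies.add(m.group("body").trim());
        }
        return bodies;
    }

    public static void main(String[] args) {
        String doc = "<doc><s>This is one .</s>\n<s>Here is another .</s><note>skip me</note></doc>";
        for (String sentence : textInside(doc, "s")) {
            System.out.println(sentence);
        }
    }
}
```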

Finally, the input may be tagged or not. If it is tagged, the program assumes that words and tags are separated by a non-whitespace separating character such as '/' or '_'. You give the option -tagSeparator tagSeparator to specify tagged text with a tag separator. You also need to tell the parser to use a different tokenizer, using the flags -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory

You can see examples of many of these options in the test directory. As an example, you can parse the example file with partial POS-tagging with this command:

java edu.stanford.nlp.parser.lexparser.LexicalizedParser -maxLength 20 -sentences newline -tokenized -tagSeparator / -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory englishPCFG.ser.gz pos-sentences.txt
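To show what partially tagged input with a tag separator looks like once it is split, here is a small sketch. It is not the parser's own reader: the class name is hypothetical, and splitting at the last occurrence of the separator (so that a word may itself contain the separator) is an assumption.

```java
final class TaggedTokenSketch {
    /**
     * Splits "word/TAG" into {word, tag}. A token with no usable separator
     * is returned as {token, null}, i.e. the word carries no tag constraint.
     */
    static String[] splitWordAndTag(String token, String separator) {
        int i = token.lastIndexOf(separator);
        boolean missing = i <= 0;                                   // absent, or nothing before it
        boolean trailing = i == token.length() - separator.length(); // nothing after it
        if (missing || trailing) {
            return new String[] {token, null};
        }
        return new String[] {token.substring(0, i), token.substring(i + separator.length())};
    }

    public static void main(String[] args) {
        // Only "butter" is tagged; the other words are left for the parser to tag.
        for (String token : "People can butter/VB their bread .".split("\\s+")) {
            String[] wt = splitWordAndTag(token, "/");
            System.out.println(wt[0] + "\t" + (wt[1] == null ? "(untagged)" : wt[1]));
        }
    }
}
```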

There are some restrictions on the interpretation of POS-tagged input:

The tagset must match the parser POS set. If you are using our supplied parser data files, that means you must be using Penn Treebank POS tags.
An indicated tagging will determine which of the taggings allowed by the lexicon will be used, but the parser will not accept tags not allowed by its lexicon. This is usually not problematic, since rare or unknown words are allowed to have many POS tags, but would be if you were trying to persuade it that are should be tagged as a noun in the sentence "100 are make up one hectare." since it will only allow are to have a verbal tagging.

For the examples in pos-sentences.txt:

1. This sentence is parsed correctly with no tags given.
2. So it is also parsed correctly telling the parser butter is a verb.
3. You get a different worse parse telling it butter is a noun.
4. You get the same parse as 1. with all tags correctly supplied.
5. It won't accept can as a VB, but does accept butter as a noun, so you get the same parse as 3.
6. People can butter can be an NP.
7. Most words can be NN, but not common function words like their, with, a.

Note that if the program is reading tags correctly, they aren't printed in the sentence it says it is parsing. Only the words are printed there.

Output formatting options

You can set how sentences are printed out by using the -outputFormat format option. The native and default format is as trees are formatted in the Penn Treebank, but there are a number of other useful options:

penn The default.
oneline Printed out on one line.
wordsAndTags Use the parser as a POS tagger.
latexTree Help write your LaTeX papers (for use with Avery Andrews' trees.sty package).
typedDependenciesCollapsed Write sentences in a typed dependency format that represents sentences via grammatical relations between words. Suitable for representing text as a semantic network.

You can get each sentence printed in multiple formats by giving a comma-separated list of formats. See the TreePrint class for more information on available output formats and options.

Programmatic usage

LexicalizedParser can be easily called within a larger application. It implements a couple of useful interfaces that provide for simple use: edu.stanford.nlp.parser.ViterbiParser and edu.stanford.nlp.process.Function. The following simple class shows typical usage:

 import java.util.*;
 import edu.stanford.nlp.ling.*;
 import edu.stanford.nlp.trees.*;
 import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

 class ParserDemo {
   public static void main(String[] args) {
     // Load a saved grammar from the models jar on the classpath.
     LexicalizedParser lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");
     // Command-line style options can be passed programmatically.
     lp.setOptionFlags(new String[]{"-maxLength", "80", "-retainTmpSubcategories"});

     // The input is already tokenized according to Penn Treebank conventions.
     String[] sent = { "This", "is", "an", "easy", "sentence", "." };
     List<CoreLabel> rawWords = Sentence.toCoreLabelList(sent);
     Tree parse = lp.apply(rawWords);
     parse.pennPrint();
     System.out.println();

     // Derive typed dependencies from the phrase structure tree.
     TreebankLanguagePack tlp = new PennTreebankLanguagePack();
     GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
     GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
     List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
     System.out.println(tdl);
     System.out.println();

     // Print the same parse in two output formats at once.
     TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
     tp.printTree(parse);
   }
 }

In a usage such as this, the parser expects sentences already tokenized according to Penn Treebank conventions. For arbitrary text, prior processing must be done to achieve such tokenization (the main method of LexicalizedParser provides an example of doing this). The example shows how most command-line arguments can also be passed to the parser when called programmatically. Note that using the -retainTmpSubcategories option is necessary to get the best results in the typed dependencies output recognizing temporal noun phrases ("last week", "next February").

Some code fragments that include tokenization using Penn Treebank conventions follow:

 import java.io.StringReader;
 import java.util.List;
 import edu.stanford.nlp.trees.Tree;
 import edu.stanford.nlp.objectbank.TokenizerFactory;
 import edu.stanford.nlp.process.CoreLabelTokenFactory;
 import edu.stanford.nlp.ling.CoreLabel;
 import edu.stanford.nlp.process.PTBTokenizer;
 import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

 // The following are fragments: fields and a method of an enclosing class.
 LexicalizedParser lp = LexicalizedParser.loadModel("englishPCFG.ser.gz");
 lp.setOptionFlags(new String[]{"-outputFormat", "penn,typedDependenciesCollapsed", "-retainTmpSubcategories"});
 TokenizerFactory<CoreLabel> tokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(), "");

 public Tree processSentence(String sentence) {
   // Tokenize the raw string, then parse the resulting tokens.
   List<CoreLabel> rawWords = tokenizerFactory.getTokenizer(new StringReader(sentence)).tokenize();
   Tree bestParse = lp.parseTree(rawWords);
   return bestParse;
 }

Writing and reading trained parsers to and from files

A trained parser consists of grammars, a lexicon, and option values. Once a parser has been trained, it may be written to file in one of two formats: binary serialized Java objects or human readable text data. A parser can also be quickly reconstructed (either programmatically or at the command line) from files containing a parser in either of these formats.

The binary serialized Java objects format is created using standard tools provided by the java.io package, and is not text, and not human-readable. To train and then save a parser as a binary serialized objects file, use a command line invocation of the form:

java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser -train trainFilePath [fileRange] -saveToSerializedFile outputFilePath

The text data format is human readable and modifiable, and consists of the following sections, appearing in this order:

Options - consists of variable-value pairs, one per line, which must remain constant across training and parsing.
Lexicon - consists of lexical entries, one per line, each of which is preceded by the keyword SEEN or UNSEEN, and followed by a raw count.
Unary Grammar - consists of unary rewrite rules, one per line, each of which is of the form A -> B, followed by the normalized log probability.
Binary Grammar - consists of binary rewrite rules, one per line, each of which is of the form A -> B C, followed by the normalized log probability.
Dependency Grammar

Each section is headed by a line consisting of multiple asterisks (*) and the name of the section. Note that the file format does not support rules of arbitrary arity, only binary and unary rules. To train and then save a parser as a text data file, use a command line invocation of the form:

java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser -train trainFilePath start stop -saveToTextFile outputFilePath
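As an illustration of the section layout described above, the sketch below groups the lines of a text grammar by section and reads a unary or binary rule line. It is written only from the description in this document: the class and method names are hypothetical, the exact header spelling used in the demo is invented, and real files may differ in details.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TextGrammarSketch {
    /** A rewrite rule with its log probability. */
    static final class Rule {
        final String parent;
        final List<String> children;
        final double logProb;

        Rule(String parent, List<String> children, double logProb) {
            this.parent = parent;
            this.children = children;
            this.logProb = logProb;
        }
    }

    /** Groups lines by section; a header is a line starting with two or more asterisks. */
    static Map<String, List<String>> splitSections(List<String> lines) {
        Map<String, List<String>> sections = new LinkedHashMap<>();
        List<String> current = null;
        for (String line : lines) {
            if (line.startsWith("**")) {
                current = new ArrayList<>();
                sections.put(line.replace("*", "").trim(), current);
            } else if (current != null && !line.trim().isEmpty()) {
                current.add(line.trim());
            }
        }
        return sections;
    }

    /** Parses "A -> B score" or "A -> B C score"; other arities are rejected. */
    static Rule parseRule(String line) {
        String[] sides = line.split("->", 2);
        if (sides.length != 2) {
            throw new IllegalArgumentException("Not a rule: " + line);
        }
        String[] rhs = sides[1].trim().split("\\s+");
        int arity = rhs.length - 1;  // the last field is the log probability
        if (arity < 1 || arity > 2) {
            throw new IllegalArgumentException("Only unary and binary rules are supported: " + line);
        }
        double logProb = Double.parseDouble(rhs[arity]);
        return new Rule(sides[0].trim(), Arrays.asList(rhs).subList(0, arity), logProb);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "***** UNARY GRAMMAR *****",   // made-up header text
            "NP -> NN -0.25",
            "***** BINARY GRAMMAR *****",
            "S -> NP VP -0.5");
        for (Map.Entry<String, List<String>> section : splitSections(lines).entrySet()) {
            for (String ruleLine : section.getValue()) {
                Rule r = parseRule(ruleLine);
                System.out.println(section.getKey() + ": " + r.parent + " -> " + r.children + " " + r.logProb);
            }
        }
    }
}
```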

To parse a file with a saved parser, either in text data or serialized data format, use a command line invocation of the following form:

java -mx500m edu.stanford.nlp.parser.lexparser.LexicalizedParser parserFilePath test.txt

A Note on Text Grammars

If you want to use the text grammars in another parser and duplicate our performance, you will need to know how we handle the POS tagging of rare and unknown words:

Unknown Words: rather than scoring all words unseen during training with a single distribution over tags, we score unknown words based on their word shape signatures, defined as follows. Beginning with the original string, all lowercase alphabetic characters are replaced with x, uppercase with X, digits with d, and other characters are unchanged. Then, consecutive duplicates are eliminated. For example, Formula-1 would become Xx-d. The probability of tags given signatures is estimated on words occurring in only the second half of the training data, then inverted. However, in the current release of the parser, this is all done programmatically, and so the text lexicon contains only a single UNK token. To duplicate our behavior, one would be best off building one's own lexicon with the above behavior.
Rare Words: all words with frequency less than a cut-off (of 100) are allowed to take tags with which they were not seen during training. In this case, they are eligible for (i) all tags that either they were seen with, or (ii) any tag an unknown word can receive (lexicon entry for UNK). The probability of a tag given a rare word is an interpolation of the word's own tag distribution and the unknown distribution for that word's signature. Because of the tag-splitting used in our parser, this ability to take out-of-lexicon tags is fairly important, and not represented in our text lexicon.
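The word shape signature defined under Unknown Words can be written down directly. The following sketch implements the rule exactly as stated there (character classes first, then removal of consecutive duplicates); the class and method names are hypothetical, and it is not the code the parser itself uses.

```java
final class WordShapeSketch {
    /** Maps each character to its class symbol and collapses consecutive duplicates. */
    static String signature(String word) {
        StringBuilder shape = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            char mapped;
            if (Character.isLowerCase(c)) {
                mapped = 'x';
            } else if (Character.isUpperCase(c)) {
                mapped = 'X';
            } else if (Character.isDigit(c)) {
                mapped = 'd';
            } else {
                mapped = c;  // punctuation and other characters are kept as they are
            }
            // Append only if it differs from the previous symbol.
            if (shape.length() == 0 || shape.charAt(shape.length() - 1) != mapped) {
                shape.append(mapped);
            }
        }
        return shape.toString();
    }

    public static void main(String[] args) {
        System.out.println(signature("iPhone"));  // xXx
        System.out.println(signature("U.S.A."));  // X.X.X.
        System.out.println(signature("B-52s"));   // X-dx
    }
}
```

Words that share a signature pool their tag statistics, so a capitalized hyphenated token and a lowercase word ending in digits end up with different tag distributions even when neither was seen in training.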

For additional information

For more information, you should next look at the Javadocs for the LexicalizedParser class. In particular, the main method of that class documents more precisely a number of the input preprocessing options that were presented chattily above.

Author: Dan Klein, Christopher Manning, Roger Levy, Teg Grenager, Galen Andrew