public class DependencyParser extends Object
This is an implementation of the method described in:

Danqi Chen and Christopher Manning. A Fast and Accurate Dependency Parser Using Neural Networks. In EMNLP 2014.

New models can be trained from the command line; see main(java.lang.String[]) for details on training options. This parser will also output CoNLL-X format predictions; again see main(java.lang.String[]) for available options.
This parser can also be used programmatically. The easiest way to prepare the parser with a pre-trained model is to call loadFromModelFile(String). Then call predict(edu.stanford.nlp.util.CoreMap) on the returned parser instance in order to get new parses.
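For example, a minimal end-to-end sketch of this workflow (the sample sentence is illustrative; the tagger model path is the default listed under the runtime parsing options below, and the CoreNLP models jar is assumed to be on the classpath):

```java
import java.io.StringReader;
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.parser.nndep.DependencyParser;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;
import edu.stanford.nlp.trees.GrammaticalStructure;

public class DependencyParserDemo {
  public static void main(String[] args) {
    // The parser requires part-of-speech tags, so tag the tokens first.
    MaxentTagger tagger = new MaxentTagger(
        "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger");
    DependencyParser parser = DependencyParser.loadFromModelFile(DependencyParser.DEFAULT_MODEL);

    String text = "I can almost always tell when movies use fake dinosaurs.";  // illustrative input
    DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(text));
    for (List<HasWord> sentence : tokenizer) {
      List<TaggedWord> tagged = tagger.tagSentence(sentence);
      GrammaticalStructure gs = parser.predict(tagged);
      System.out.println(gs.typedDependencies());
    }
  }
}
```

This uses the predict(List<? extends HasWord>) convenience overload, which requires the tokens to carry tag annotations; that is why the sentence is run through the tagger before parsing.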
Modifier and Type | Field and Description |
---|---|
static String | DEFAULT_MODEL |
Constructor and Description |
---|
DependencyParser(Properties properties) |
Modifier and Type | Method and Description |
---|---|
Dataset | genTrainExamples(List<CoreMap> sents, List<edu.stanford.nlp.parser.nndep.DependencyTree> trees) |
List<Integer> | getFeatures(Configuration c) |
int | getLabelID(String s) |
int | getPosID(String s) |
int | getWordID(String s) - Get an integer ID for the given word. |
static DependencyParser | loadFromModelFile(String modelFile) - Convenience method; see loadFromModelFile(String, java.util.Properties). |
static DependencyParser | loadFromModelFile(String modelFile, Properties extraProperties) - Load a saved parser model. |
void | loadModelFile(String modelFile) - Load a parser model file, printing out some messages about the grammar in the file. |
static void | main(String[] args) - A main program for training, testing and using the parser. |
GrammaticalStructure | predict(CoreMap sentence) - Determine the dependency parse of the given sentence using the loaded model. |
GrammaticalStructure | predict(List<? extends HasWord> sentence) - Convenience method for predict(edu.stanford.nlp.util.CoreMap). |
double | testCoNLL(String testFile, String outFile) - Run the parser in the modelFile on a testFile and perhaps save output. |
void | train(String trainFile, String modelFile) |
void | train(String trainFile, String devFile, String modelFile) |
void | train(String trainFile, String devFile, String modelFile, String embedFile) - Train a new dependency parser model. |
void | writeModelFile(String modelFile) |
public static final String DEFAULT_MODEL
public DependencyParser(Properties properties)
public int getWordID(String s)
Get an integer ID for the given word.
See Also: embeddings

public int getPosID(String s)

public int getLabelID(String s)

public List<Integer> getFeatures(Configuration c)

public Dataset genTrainExamples(List<CoreMap> sents, List<edu.stanford.nlp.parser.nndep.DependencyTree> trees)
public void writeModelFile(String modelFile)
public static DependencyParser loadFromModelFile(String modelFile)
Convenience method; see loadFromModelFile(String, java.util.Properties).

public static DependencyParser loadFromModelFile(String modelFile, Properties extraProperties)
Load a saved parser model.
Parameters:
modelFile - Path to serialized model (may be GZipped)
extraProperties - Extra test-time properties not already associated with model (may be null)
Returns: Loaded and initialized (see initialize(boolean)) model
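As a sketch, here is a load with one extra test-time property. The specific key shown, tagger.model, is taken from the runtime parsing options table below; whether a given option is honored at load time depends on the model and setup, so treat it as illustrative and pass null if there is nothing to override:

```java
import java.util.Properties;

import edu.stanford.nlp.parser.nndep.DependencyParser;

public class LoadWithProperties {
  public static void main(String[] args) {
    Properties extra = new Properties();
    // Illustrative override of the POS tagger used when parsing raw text.
    extra.setProperty("tagger.model",
        "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger");
    DependencyParser parser =
        DependencyParser.loadFromModelFile(DependencyParser.DEFAULT_MODEL, extra);
  }
}
```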
public void loadModelFile(String modelFile)
Load a parser model file, printing out some messages about the grammar in the file.
Parameters:
modelFile - The file (classpath resource, etc.) to load the model from.

public void train(String trainFile, String devFile, String modelFile, String embedFile)
Train a new dependency parser model.
Parameters:
trainFile - Training data
devFile - Development data (used for regular UAS evaluation of model)
modelFile - String to which model should be saved
embedFile - File containing word embeddings for words used in training corpus

public void train(String trainFile, String devFile, String modelFile)
See Also: train(String, String, String, String)

public void train(String trainFile, String modelFile)
See Also: train(String, String, String, String)
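A minimal sketch of programmatic training with the four-argument overload above (all file paths are placeholders; training options not set in the Properties fall back to the defaults in the training options table below):

```java
import java.util.Properties;

import edu.stanford.nlp.parser.nndep.DependencyParser;

public class TrainDemo {
  public static void main(String[] args) {
    // Placeholder paths: CoNLL-X train/dev treebanks, the output model path,
    // and a word embedding file in the format described below.
    DependencyParser parser = new DependencyParser(new Properties());
    parser.train("train.conllx", "dev.conllx", "parser-model.txt.gz", "embeddings.txt");
  }
}
```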
public GrammaticalStructure predict(CoreMap sentence)
Determine the dependency parse of the given sentence using the loaded model.
Throws:
IllegalStateException - If parser has not yet been loaded and initialized (see initialize(boolean))
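A sketch of the CoreMap path, using a tokenize/ssplit/pos pipeline to produce sentences that carry the tag annotations the parser needs (the input text is illustrative):

```java
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.parser.nndep.DependencyParser;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.trees.GrammaticalStructure;
import edu.stanford.nlp.util.CoreMap;

public class PredictCoreMapDemo {
  public static void main(String[] args) {
    // Tokenize, sentence-split, and POS-tag; the parser requires tags.
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    Annotation doc = new Annotation("The quick brown fox jumps over the lazy dog.");
    pipeline.annotate(doc);

    DependencyParser parser = DependencyParser.loadFromModelFile(DependencyParser.DEFAULT_MODEL);
    for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
      GrammaticalStructure gs = parser.predict(sentence);
      System.out.println(gs.typedDependencies());
    }
  }
}
```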
public GrammaticalStructure predict(List<? extends HasWord> sentence)
Convenience method for predict(edu.stanford.nlp.util.CoreMap). The tokens of the provided sentence must also have tag annotations (the parser requires part-of-speech tags).
See Also: predict(edu.stanford.nlp.util.CoreMap)
public double testCoNLL(String testFile, String outFile)
Run the parser in the modelFile on a testFile and perhaps save output.
Parameters:
testFile - File to parse. In CoNLL-X format. Assumed to have gold answers included.
outFile - File to write results to in CoNLL-X format. If null, no output is written.
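A sketch of evaluation against a gold treebank (paths are placeholders; we read the returned double as the evaluation score on the test set):

```java
import edu.stanford.nlp.parser.nndep.DependencyParser;

public class EvalDemo {
  public static void main(String[] args) {
    // Placeholder paths: a trained model, a gold CoNLL-X test file, and an
    // output file for predicted parses (pass null to skip writing output).
    DependencyParser parser = DependencyParser.loadFromModelFile("parser-model.txt.gz");
    double score = parser.testCoNLL("test.conllx", "test-output.conllx");
    System.out.println("Test score: " + score);
  }
}
```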
public static void main(String[] args)
A main program for training, testing and using the parser.
You can use this program to train new parsers from treebank data, evaluate on test treebank data, or parse raw text input.
Sample usages:

Train a parser with CoNLL treebank data:
java edu.stanford.nlp.parser.nndep.DependencyParser -trainFile trainPath -devFile devPath -embedFile wordEmbeddingFile -embeddingSize wordEmbeddingDimensionality -model modelOutputFile.txt.gz

Parse raw text from a file:
java edu.stanford.nlp.parser.nndep.DependencyParser -model modelOutputFile.txt.gz -textFile rawTextToParse -outFile dependenciesOutputFile.txt

Parse raw text from standard input, writing to standard output:
java edu.stanford.nlp.parser.nndep.DependencyParser -model modelOutputFile.txt.gz -textFile - -outFile -
See below for more information on all of these training / test options and more.
Input / output options:
Option | Required for training | Required for testing / parsing | Description |
---|---|---|---|
‑devFile | Optional | No | Path to a development-set treebank in CoNLL-X format. If provided, the dev set performance is monitored during training. |
‑embedFile | Optional (highly recommended!) | No | A word embedding file, containing distributed representations of English words. Each line of the provided file should contain a single word followed by the elements of the corresponding word embedding (space-delimited). It is not absolutely necessary that all words in the treebank be covered by this embedding file, though the parser's performance will generally improve if you are able to provide better embeddings for more words. |
‑model | Yes | Yes | Path to a model file. If the path ends in .gz, the model will be read as a Gzipped model file. During training, we write to this path; at test time we read a pre-trained model from this path. |
‑textFile | No | Yes (or testFile) | Path to a plaintext file containing sentences to be parsed. |
‑testFile | No | Yes (or textFile) | Path to a test-set treebank in CoNLL-X format for final evaluation of the parser. |
‑trainFile | Yes | No | Path to a training treebank in CoNLL-X format. |
Training options:
Option | Default | Description |
---|---|---|
‑adaAlpha | 0.01 | Global learning rate for AdaGrad training |
‑adaEps | 1e-6 | Epsilon value added to the denominator of AdaGrad update expression for numerical stability |
‑batchSize | 10000 | Size of mini-batch used for training |
‑clearGradientsPerIter | 0 | Clear AdaGrad gradient histories every n iterations. If zero, no gradient clearing is performed. |
‑dropProb | 0.5 | Dropout probability. For each training example we randomly choose some number of units to disable in the neural network classifier. This parameter controls the proportion of units "dropped out." |
‑embeddingSize | 50 | Dimensionality of word embeddings provided |
‑evalPerIter | 100 | Run full UAS (unlabeled attachment score) evaluation every time we finish this number of iterations. (Only valid if a development treebank is provided with ‑devFile.) |
‑hiddenSize | 200 | Dimensionality of hidden layer in neural network classifier |
‑initRange | 0.01 | Bounds of range within which weight matrix elements should be initialized. Each element is drawn from a uniform distribution over the range [-initRange, initRange]. |
‑maxIter | 20000 | Number of training iterations to complete before stopping and saving the final model. |
‑numPreComputed | 100000 | The parser pre-computes hidden-layer unit activations for particular input words at both training and testing time in order to speed up feedforward computation in the neural network. This parameter determines the number of words for which hidden-layer activations should be pre-computed. |
‑regParameter | 1e-8 | Regularization parameter for training |
‑saveIntermediate | true | If true, continually save the model version which gets the highest UAS value on the dev set. (Only valid if a development treebank is provided with ‑devFile.) |
‑trainingThreads | 1 | Number of threads to use during training. Note that depending on training batch size, it may be unwise to simply choose the maximum number of threads for your machine. On our 16-core test machines: a batch size of 10,000 runs fastest with around 6 threads; a batch size of 100,000 runs best with around 10 threads. |
‑wordCutOff | 1 | The parser can optionally ignore rare words by simply choosing an arbitrary "unknown" feature representation for words that appear with frequency less than n in the corpus. This n is controlled by the wordCutOff parameter. |
Runtime parsing options:
Option | Default | Description |
---|---|---|
‑escaper | N/A | If provided, use this word-escaper when parsing raw sentences. (Should be a fully-qualified class name like edu.stanford.nlp.trees.international.arabic.ATBEscaper.) |
‑sentenceDelimiter | N/A | If provided, assume that the given textFile has already been sentence-split, and that sentences are separated by this delimiter. |
‑tagger.model | edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger | Path to a part-of-speech tagger to use to pre-tag the raw sentences before parsing. |