edu.stanford.nlp.parser.lexparser
Class TrainOptions

java.lang.Object
  extended by edu.stanford.nlp.parser.lexparser.TrainOptions
All Implemented Interfaces:
java.io.Serializable

public class TrainOptions
extends java.lang.Object
implements java.io.Serializable

Non-language-specific options for training a grammar from a treebank. These options are not used at parsing time.

Author:
Dan Klein, Christopher Manning
See Also:
Serialized Form

Nested Class Summary
static class TrainOptions.TransformMatrixType
           
 
Field Summary
 boolean basicCategoryTagsInDependencyGrammar
          Whether to use the basic or split tags in the dependency grammar.
 boolean cheatPCFG
          Add all test set trees to training data for PCFG.
 boolean collinsPunc
          Promote/delete punctuation like Collins.
 int compactGrammar
          How to compact grammars as FSMs.
 int debugOutputSeconds
          If larger than 0, the parser may choose to output debug information every X seconds
static int DEFAULT_BATCH_SIZE
          When training the DV parsing method, how many trees to use in one batch
static double DEFAULT_DELTA_MARGIN
           
static int DEFAULT_DV_ITERATIONS
          When training the DV parsing method, how many iterations to loop
static int DEFAULT_K_BEST
          When training the DV parsing method, how many of the top K trees to analyze from the underlying parser
static double DEFAULT_LEARNING_RATE
           
static int DEFAULT_QN_ITERATIONS_PER_BATCH
          When training the DV parsing method, how many iterations to loop for one batch of trees
static double DEFAULT_REGCOST
          regularization constant
static double DEFAULT_SCALING_FOR_INIT
           
static java.lang.String DEFAULT_UNK_WORD
           
 java.util.Set<java.lang.String> deleteSplitters
           
 double deltaMargin
          How much to penalize the wrong trees for how different they are from the gold tree when training
 int dvBatchSize
           
 int dvIterations
           
 int dvKBest
           
 long dvSeed
           
 boolean dvSimplifiedModel
          Make the dv model as simple as possible
 double fractionBeforeUnseenCounting
          Start aggregating signature-tag pairs only for words unseen in the initial portion of the data, where this value is the fraction that portion covers.
 boolean gPA
          This variable controls doing 2 levels of parent annotation.
 int HSEL_CUT
           
 boolean hSelSplit
           
 double learningRate
          How fast to learn (can mean different things for different algorithms)
 boolean leftRec
          Left edge is right-recursive (X << X) Bad.
 boolean leftToRight
           
 boolean lowercaseWordVectors
          Whether or not to lowercase word vectors
 boolean markFinalStates
          Whether or not to mark final states in binarized grammar.
 boolean markovFactor
          Whether to do "horizontal Markovization" (as in ACL 2003 paper).
 int markovOrder
           
 int markUnary
          Mark all unary nodes specially.
 boolean markUnaryTags
          Mark POS tags which are the sole member of their phrasal constituent.
 int maxTrainTimeSeconds
           
 boolean noRebinarization
          When binarizing trees, don't binarize trees with two children.
 boolean noTagSplit
           
 int openClassTypesThreshold
          A POS tag has to have been attributed to more than this number of word types before it is regarded as an open-class tag.
 boolean PA
          This variable controls doing parent annotation of phrasal nodes.
 boolean postGPA
           
 boolean postPA
           
 java.util.Set postSplitters
           
 boolean postSplitWithBaseCategory
          Whether, in post-splitting of categories, nodes are annotated with the (grand)parent's base category or with its complete subcategorized category.
 boolean predictSplits
          Use the method reported by Berkeley for splitting and recombining states.
 TreeTransformer preTransformer
          A transformer to use on the training data before any other processing step.
 java.io.PrintWriter printAnnotatedPW
           
 boolean printAnnotatedRuleCounts
           
 boolean printAnnotatedStateCounts
           
 java.io.PrintWriter printBinarizedPW
           
 boolean printStates
           
 int printTreeTransformations
          Just for debugging: check that your tree transforms work correctly.
 int qnEstimates
          When training the DV parsing method, how many estimates to keep for the qn approximation.
 int qnIterationsPerBatch
           
 double qnTolerance
          When training the DV parsing method, the tolerance to use if we want to stop qn early
 double regCost
           
 boolean rightRec
          Right edge is right-recursive (X << X) Bad.
 double ruleDiscount
          Discounts the count of BinaryRule instances (only, apparently) in training data.
 boolean ruleSmoothing
          Enables linear rule smoothing during grammar extraction but before grammar compaction.
 double ruleSmoothingAlpha
           
 double scalingForInit
          How much to scale certain parameters when initializing models.
 boolean selectivePostSplit
           
 double selectivePostSplitCutOff
           
 boolean selectiveSplit
          Only split the "common high KL divergence" parent categories....
 double selectiveSplitCutOff
           
 boolean simpleBinarizedLabels
          When binarizing trees, don't annotate the labels with anything
 boolean sisterAnnotate
          Selective Sister annotation.
 java.util.Set<java.lang.String> sisterSplitters
           
 boolean smoothing
          TODO wsg2011: This is the old grammar smoothing parameter that no longer does anything in the parser.
 int splitCount
          If we are predicting splits, we loop this many times
 boolean splitPrePreT
          Mark all pre-preterminals (also does splitBaseNP: don't need both)
 double splitRecombineRate
          If we are predicting splits, we recombine states at this rate every loop
 java.util.Set<java.lang.String> splitters
          Set the splitter strings.
 java.lang.String taggedFiles
          A set of files to use as extra information in the lexicon.
 boolean tagPA
          Parent annotation on tags.
 boolean tagSelectivePostSplit
           
 double tagSelectivePostSplitCutOff
           
 boolean tagSelectiveSplit
          Do parent annotation on tags selectively.
 double tagSelectiveSplitCutOff
           
 int trainingThreads
          If the training algorithm allows for parallelization, how many threads to use
 int trainLengthLimit
           
 java.lang.String trainTreeFile
           
 TrainOptions.TransformMatrixType transformMatrixType
           
 boolean unknownCapsVector
          Whether or not to build an unknown word vector for words with caps in them
 boolean unknownChineseNumberVector
          Whether or not to build an unknown word vector to match Chinese numbers
 boolean unknownChinesePercentVector
          Whether or not to build an unknown word vector to match Chinese percentages
 boolean unknownChineseYearVector
          Whether or not to build an unknown word vector to match Chinese years
 boolean unknownDashedWordVectors
          Whether or not to handle unknown dashed words by taking the last part
 boolean unknownNumberVector
          Whether or not to build an unknown word vector specifically for numbers
 java.lang.String unkWord
          Some models will use external data sources which contain information about unknown words.
 boolean useContextWords
           
 
Constructor Summary
TrainOptions()
           
 
Method Summary
 int compactGrammar()
           
 void display()
           
 boolean outsideFactor()
          If true, declare early -- leave this on except maybe with markov on.
static void printTrainTree(java.io.PrintWriter pw, java.lang.String message, Tree t)
           
 java.lang.String toString()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

trainTreeFile

public java.lang.String trainTreeFile

trainLengthLimit

public int trainLengthLimit

cheatPCFG

public boolean cheatPCFG
Add all test set trees to training data for PCFG. (Currently only supported in FactoredParser main.)


markovFactor

public boolean markovFactor
Whether to do "horizontal Markovization" (as in ACL 2003 paper). False means regular PCFG expansions.
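
As an illustration only, the following sketch shows the idea behind horizontal Markovization when an n-ary rule is binarized left to right. The class `MarkovizationSketch`, the method `intermediateLabels`, and the `@PARENT|SISTERS` label format are hypothetical and are not the parser's actual binarizer or label scheme. Each intermediate symbol remembers at most `order` of the most recently generated sisters, so a small order collapses several intermediate symbols into one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MarkovizationSketch {

  /**
   * Names the intermediate nodes created when an n-ary rule is binarized
   * left to right, keeping only the last `order` sisters in each name.
   * The label format is made up for this sketch.
   */
  static List<String> intermediateLabels(String parent, List<String> children, int order) {
    List<String> labels = new ArrayList<>();
    // The intermediate node ending at `end` covers children[0, end).
    for (int end = 2; end < children.size(); end++) {
      int start = Math.max(0, end - order);
      labels.add("@" + parent + "|" + String.join("_", children.subList(start, end)));
    }
    return labels;
  }

  public static void main(String[] args) {
    List<String> rhs = Arrays.asList("DT", "JJ", "JJ", "NN");
    System.out.println(intermediateLabels("NP", rhs, 1)); // [@NP|JJ, @NP|JJ]
    System.out.println(intermediateLabels("NP", rhs, 2)); // [@NP|DT_JJ, @NP|JJ_JJ]
  }
}
```

With order 1 the two intermediate symbols are identical, which is how Markovization shrinks the grammar; with order 2 they stay distinct.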


markovOrder

public int markovOrder

hSelSplit

public boolean hSelSplit

HSEL_CUT

public int HSEL_CUT

markFinalStates

public boolean markFinalStates
Whether or not to mark final states in binarized grammar. This must be off to get most value out of grammar compaction.


openClassTypesThreshold

public int openClassTypesThreshold
A POS tag has to have been attributed to more than this number of word types before it is regarded as an open-class tag. Unknown words can only be tagged with open-class tags (unless flexiTag is on). If flexiTag is on, unknown words can be tagged with any POS for which the unseenMap has a nonzero count (that is, the tag was seen for a new word after unseen signature counting was started).
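
The threshold test can be sketched as follows. This is a simplified illustration, not the lexicon's actual code: the class `OpenClassSketch` and the `{word, tag}` array input are assumptions made for the example. A tag counts as open-class only when it has been seen with strictly more distinct word types than the threshold.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class OpenClassSketch {

  /** Each element of taggedWords is a {word, tag} pair (hypothetical input format). */
  static Set<String> openClassTags(String[][] taggedWords, int threshold) {
    // Collect the distinct word types observed with each tag.
    Map<String, Set<String>> typesPerTag = new HashMap<>();
    for (String[] pair : taggedWords) {
      typesPerTag.computeIfAbsent(pair[1], k -> new HashSet<>()).add(pair[0]);
    }
    Set<String> open = new TreeSet<>();
    for (Map.Entry<String, Set<String>> e : typesPerTag.entrySet()) {
      // Strictly more than the threshold, as the description says.
      if (e.getValue().size() > threshold) {
        open.add(e.getKey());
      }
    }
    return open;
  }

  static String[][] sampleData() {
    return new String[][] {
      {"the", "DT"}, {"a", "DT"},
      {"dog", "NN"}, {"cat", "NN"}, {"bird", "NN"}, {"dog", "NN"}
    };
  }

  public static void main(String[] args) {
    System.out.println(openClassTags(sampleData(), 2)); // [NN]
    System.out.println(openClassTags(sampleData(), 1)); // [DT, NN]
  }
}
```

Repeated tokens of the same word ("dog" above) count once, because the test is on word types rather than tokens.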


fractionBeforeUnseenCounting

public double fractionBeforeUnseenCounting
Start aggregating signature-tag pairs only for words unseen in the initial portion of the data, where this value is the fraction that portion covers.
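
One literal reading of this description is sketched below. The class `UnseenCountingSketch`, the token-level cutoff, and the three word-shape signatures are all assumptions made for illustration; the parser's own unknown-word models use different signatures and may apply the cutoff differently. Words are assumed to be non-empty.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class UnseenCountingSketch {

  /** Hypothetical word-shape signature; the parser's real signatures differ. */
  static String signature(String word) {
    if (Character.isDigit(word.charAt(0))) return "UNK-NUM";
    if (Character.isUpperCase(word.charAt(0))) return "UNK-CAPS";
    return "UNK";
  }

  /** Each element of taggedWords is a {word, tag} pair (hypothetical input format). */
  static Map<String, Integer> unseenCounts(String[][] taggedWords, double fraction) {
    int startIndex = (int) (fraction * taggedWords.length);
    // Words observed in the initial fraction of the data.
    Set<String> seenEarly = new HashSet<>();
    for (int i = 0; i < startIndex; i++) {
      seenEarly.add(taggedWords[i][0]);
    }
    // After the cutoff, count signature/tag pairs for words not seen early.
    Map<String, Integer> counts = new TreeMap<>();
    for (int i = startIndex; i < taggedWords.length; i++) {
      String word = taggedWords[i][0];
      String tag = taggedWords[i][1];
      if (!seenEarly.contains(word)) {
        counts.merge(signature(word) + "/" + tag, 1, Integer::sum);
      }
    }
    return counts;
  }

  static String[][] sampleData() {
    return new String[][] {
      {"the", "DT"}, {"dog", "NN"}, {"barks", "VBZ"}, {"loudly", "RB"},
      {"Rex", "NNP"}, {"the", "DT"}, {"42", "CD"}, {"Rex", "NNP"}
    };
  }

  public static void main(String[] args) {
    System.out.println(unseenCounts(sampleData(), 0.5)); // {UNK-CAPS/NNP=2, UNK-NUM/CD=1}
  }
}
```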


PA

public boolean PA
This variable controls doing parent annotation of phrasal nodes. Good.
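
A minimal sketch of what parent annotation does to a tree, using a hypothetical `Node` class rather than the library's `Tree` type. The `^` separator and the class name `ParentAnnotationSketch` are assumptions for the example. Only phrasal nodes are annotated; POS tags and words are left alone, and the root has no parent to record.

```java
import java.util.Arrays;
import java.util.List;

final class ParentAnnotationSketch {

  /** Tiny stand-in for a parse tree node (not the library's Tree class). */
  static final class Node {
    final String label;
    final List<Node> children;

    Node(String label, Node... children) {
      this.label = label;
      this.children = Arrays.asList(children);
    }

    boolean isLeaf() { return children.isEmpty(); }

    boolean isPreterminal() { return children.size() == 1 && children.get(0).isLeaf(); }
  }

  /** Returns a bracketed string with each phrasal label extended by its parent's base label. */
  static String annotate(Node node, String parentLabel) {
    if (node.isLeaf()) {
      return node.label;
    }
    String label = node.label;
    if (!node.isPreterminal() && parentLabel != null) {
      label = label + "^" + parentLabel;
    }
    StringBuilder sb = new StringBuilder("(").append(label);
    for (Node child : node.children) {
      // Pass the unannotated label down so only one level of ancestry is recorded.
      sb.append(' ').append(annotate(child, node.label));
    }
    return sb.append(')').toString();
  }

  static Node sampleTree() {
    return new Node("S",
        new Node("NP", new Node("DT", new Node("the")), new Node("NN", new Node("dog"))),
        new Node("VP", new Node("VBZ", new Node("barks"))));
  }

  public static void main(String[] args) {
    // (S (NP^S (DT the) (NN dog)) (VP^S (VBZ barks)))
    System.out.println(annotate(sampleTree(), null));
  }
}
```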


gPA

public boolean gPA
This variable controls doing 2 levels of parent annotation. Bad.


postPA

public boolean postPA

postGPA

public boolean postGPA

selectiveSplit

public boolean selectiveSplit
Only split the "common high KL divergence" parent categories.... Good.


selectiveSplitCutOff

public double selectiveSplitCutOff

selectivePostSplit

public boolean selectivePostSplit

selectivePostSplitCutOff

public double selectivePostSplitCutOff

postSplitWithBaseCategory

public boolean postSplitWithBaseCategory
Whether, in post-splitting of categories, nodes are annotated with the (grand)parent's base category or with its complete subcategorized category.


sisterAnnotate

public boolean sisterAnnotate
Selective Sister annotation.


sisterSplitters

public java.util.Set<java.lang.String> sisterSplitters

markUnary

public int markUnary
Mark all unary nodes specially. Good for just PCFG. Bad for factored. markUnary affects phrasal nodes. A value of 0 means to do nothing; a value of 1 means to mark the parent (higher) node of a unary rewrite. A value of 2 means to mark the child (lower) node of a unary rewrite. Values of 1 and 2 only apply if the child (lower) node is phrasal. (A value of 1 is better than 2 in combos.) A value of 1 corresponds to the old boolean -unary flag.
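
The three settings can be illustrated with a small sketch. The `Node` class, the class name `MarkUnarySketch`, and the `-U` suffix are hypothetical; the parser's actual annotation marker is not shown in this documentation. In the sample tree, only the ROOT over S rewrite qualifies, because the other unary rewrites have a POS tag or a word as their child.

```java
import java.util.Arrays;
import java.util.List;

final class MarkUnarySketch {

  /** Tiny stand-in for a parse tree node (not the library's Tree class). */
  static final class Node {
    final String label;
    final List<Node> children;

    Node(String label, Node... children) {
      this.label = label;
      this.children = Arrays.asList(children);
    }

    boolean isLeaf() { return children.isEmpty(); }

    boolean isPreterminal() { return children.size() == 1 && children.get(0).isLeaf(); }

    boolean isPhrasal() { return !isLeaf() && !isPreterminal(); }
  }

  /** Returns a bracketed string with unary rewrites marked according to markUnary (0, 1, or 2). */
  static String mark(Node node, int markUnary, boolean isOnlyChild) {
    if (node.isLeaf()) {
      return node.label;
    }
    String label = node.label;
    boolean unaryOverPhrase = node.children.size() == 1 && node.children.get(0).isPhrasal();
    if (markUnary == 1 && unaryOverPhrase) {
      label += "-U"; // mark the parent (higher) node of the unary rewrite
    }
    if (markUnary == 2 && isOnlyChild && node.isPhrasal()) {
      label += "-U"; // mark the child (lower) node of the unary rewrite
    }
    StringBuilder sb = new StringBuilder("(").append(label);
    for (Node child : node.children) {
      sb.append(' ').append(mark(child, markUnary, node.children.size() == 1));
    }
    return sb.append(')').toString();
  }

  static Node sampleTree() {
    return new Node("ROOT",
        new Node("S",
            new Node("NP", new Node("NN", new Node("dogs"))),
            new Node("VP", new Node("VBP", new Node("bark")))));
  }

  public static void main(String[] args) {
    System.out.println(mark(sampleTree(), 0, false)); // (ROOT (S (NP (NN dogs)) (VP (VBP bark))))
    System.out.println(mark(sampleTree(), 1, false)); // (ROOT-U (S (NP (NN dogs)) (VP (VBP bark))))
    System.out.println(mark(sampleTree(), 2, false)); // (ROOT (S-U (NP (NN dogs)) (VP (VBP bark))))
  }
}
```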


markUnaryTags

public boolean markUnaryTags
Mark POS tags which are the sole member of their phrasal constituent. This is like markUnary=2, applied to POS tags.


splitPrePreT

public boolean splitPrePreT
Mark all pre-preterminals (also does splitBaseNP: don't need both)


tagPA

public boolean tagPA
Parent annotation on tags. Good (for PCFG?)


tagSelectiveSplit

public boolean tagSelectiveSplit
Do parent annotation on tags selectively. Neutral, but fewer splits.


tagSelectiveSplitCutOff

public double tagSelectiveSplitCutOff

tagSelectivePostSplit

public boolean tagSelectivePostSplit

tagSelectivePostSplitCutOff

public double tagSelectivePostSplitCutOff

rightRec

public boolean rightRec
Right edge is right-recursive (X << X) Bad. (NP only is good)


leftRec

public boolean leftRec
Left edge is right-recursive (X << X) Bad.


collinsPunc

public boolean collinsPunc
Promote/delete punctuation like Collins. Bad (!)


splitters

public java.util.Set<java.lang.String> splitters
Set the splitter strings. These are a set of parent and/or grandparent annotated categories which should be split off.


postSplitters

public java.util.Set postSplitters

deleteSplitters

public java.util.Set<java.lang.String> deleteSplitters

printTreeTransformations

public int printTreeTransformations
Just for debugging: check that your tree transforms work correctly. This will print the transformations of the first printTreeTransformations trees.


printAnnotatedPW

public java.io.PrintWriter printAnnotatedPW

printBinarizedPW

public java.io.PrintWriter printBinarizedPW

printStates

public boolean printStates

compactGrammar

public int compactGrammar
How to compact grammars as FSMs.
0 = no compaction [uses makeSyntheticLabel1]
1 = no compaction but use label names that wrap from right to left in binarization [uses makeSyntheticLabel2]
2 = wrapping labels and materialize unary at top rewriting passive to active
3 = ExactGrammarCompactor
4 = LossyGrammarCompactor
5 = CategoryMergingGrammarCompactor
(May 2007 CDM note: options 4 and 5 don't seem to be functioning sensibly. 0, 1, and 3 seem to be the 'good' options. 2 is only useful as input to 3. There seems to be no reason not to use 0, despite the default.)


leftToRight

public boolean leftToRight

noTagSplit

public boolean noTagSplit

ruleSmoothing

public boolean ruleSmoothing
Enables linear rule smoothing during grammar extraction but before grammar compaction. The alpha term is the same as that described in Petrov et al. (2006), and has range [0,1].
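
As a rough sketch of what linear smoothing with an alpha in [0,1] can look like, the example below interpolates each value with the mean of its group, which is the general shape of the smoothing in Petrov et al. (2006). This documentation does not state the exact formula the parser applies, so the class `RuleSmoothingSketch`, the grouping of probabilities into one array, and the interpolation itself are assumptions.

```java
import java.util.Arrays;

final class RuleSmoothingSketch {

  /**
   * Linearly interpolates each probability with the group mean:
   * smoothed[i] = (1 - alpha) * probs[i] + alpha * mean(probs).
   * The total probability mass of the group is unchanged.
   */
  static double[] smooth(double[] probs, double alpha) {
    if (alpha < 0.0 || alpha > 1.0) {
      throw new IllegalArgumentException("alpha must be in [0,1]");
    }
    double mean = 0.0;
    for (double p : probs) {
      mean += p;
    }
    mean /= probs.length;
    double[] smoothed = new double[probs.length];
    for (int i = 0; i < probs.length; i++) {
      smoothed[i] = (1.0 - alpha) * probs[i] + alpha * mean;
    }
    return smoothed;
  }

  public static void main(String[] args) {
    // With alpha = 0.5, the values 0.2 and 0.6 move halfway toward their mean 0.4.
    System.out.println(Arrays.toString(smooth(new double[] {0.2, 0.6}, 0.5)));
  }
}
```

An alpha of 0 leaves the values untouched, and an alpha of 1 replaces every value with the mean.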


ruleSmoothingAlpha

public double ruleSmoothingAlpha

smoothing

public boolean smoothing
TODO wsg2011: This is the old grammar smoothing parameter that no longer does anything in the parser. It should be removed.


ruleDiscount

public double ruleDiscount
Discounts the count of BinaryRule instances (only, apparently) in training data.


printAnnotatedRuleCounts

public boolean printAnnotatedRuleCounts

printAnnotatedStateCounts

public boolean printAnnotatedStateCounts

basicCategoryTagsInDependencyGrammar

public boolean basicCategoryTagsInDependencyGrammar
Whether to use the basic or split tags in the dependency grammar.


preTransformer

public TreeTransformer preTransformer
A transformer to use on the training data before any other processing step. This is specified by using the -preTransformer flag when training the parser. A comma-separated list of classes will be turned into a CompositeTransformer. This can be used to strip subcategories, to run a tsurgeon pattern, or any number of other useful operations.


taggedFiles

public java.lang.String taggedFiles
A set of files to use as extra information in the lexicon. This can provide tagged words which are not part of trees.


predictSplits

public boolean predictSplits
Use the method reported by Berkeley for splitting and recombining states. This is an experimental reimplementation of that work and is still in development.


splitCount

public int splitCount
If we are predicting splits, we loop this many times


splitRecombineRate

public double splitRecombineRate
If we are predicting splits, we recombine states at this rate every loop


simpleBinarizedLabels

public boolean simpleBinarizedLabels
When binarizing trees, don't annotate the labels with anything


noRebinarization

public boolean noRebinarization
When binarizing trees, don't binarize trees with two children. Only applies when using inside markov binarization for now.


trainingThreads

public int trainingThreads
If the training algorithm allows for parallelization, how many threads to use


DEFAULT_K_BEST

public static final int DEFAULT_K_BEST
When training the DV parsing method, how many of the top K trees to analyze from the underlying parser

See Also:
Constant Field Values

dvKBest

public int dvKBest

DEFAULT_DV_ITERATIONS

public static final int DEFAULT_DV_ITERATIONS
When training the DV parsing method, how many iterations to loop

See Also:
Constant Field Values

dvIterations

public int dvIterations

DEFAULT_BATCH_SIZE

public static final int DEFAULT_BATCH_SIZE
When training the DV parsing method, how many trees to use in one batch

See Also:
Constant Field Values

dvBatchSize

public int dvBatchSize

DEFAULT_REGCOST

public static final double DEFAULT_REGCOST
regularization constant

See Also:
Constant Field Values

regCost

public double regCost

DEFAULT_QN_ITERATIONS_PER_BATCH

public static final int DEFAULT_QN_ITERATIONS_PER_BATCH
When training the DV parsing method, how many iterations to loop for one batch of trees

See Also:
Constant Field Values

qnIterationsPerBatch

public int qnIterationsPerBatch

qnEstimates

public int qnEstimates
When training the DV parsing method, how many estimates to keep for the qn approximation.


qnTolerance

public double qnTolerance
When training the DV parsing method, the tolerance to use if we want to stop qn early


debugOutputSeconds

public int debugOutputSeconds
If larger than 0, the parser may choose to output debug information every X seconds


dvSeed

public long dvSeed

DEFAULT_LEARNING_RATE

public static final double DEFAULT_LEARNING_RATE
See Also:
Constant Field Values

learningRate

public double learningRate
How fast to learn (can mean different things for different algorithms)


DEFAULT_DELTA_MARGIN

public static final double DEFAULT_DELTA_MARGIN
See Also:
Constant Field Values

deltaMargin

public double deltaMargin
How much to penalize the wrong trees for how different they are from the gold tree when training
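
One common way to realize such a penalty is a margin that grows with the number of constituents in a candidate tree that are absent from the gold tree. The sketch below follows that idea; it is an assumption about the general technique, not a description of the DV parser's actual cost function. The class `DeltaMarginSketch` and the `LABEL:start-end` span encoding are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class DeltaMarginSketch {

  /**
   * Returns deltaMargin times the number of hypothesis spans missing from the gold tree.
   * Spans are encoded as "LABEL:start-end" strings for this sketch only.
   */
  static double marginPenalty(Set<String> goldSpans, Set<String> hypothesisSpans, double deltaMargin) {
    int wrong = 0;
    for (String span : hypothesisSpans) {
      if (!goldSpans.contains(span)) {
        wrong++;
      }
    }
    return deltaMargin * wrong;
  }

  public static void main(String[] args) {
    Set<String> gold = new HashSet<>(Arrays.asList("S:0-3", "NP:0-2", "VP:2-3"));
    Set<String> hypothesis = new HashSet<>(Arrays.asList("S:0-3", "NP:0-1", "VP:1-3"));
    // Two hypothesis spans are not in the gold tree, so the penalty is 2 * deltaMargin.
    System.out.println(marginPenalty(gold, hypothesis, 0.1));
  }
}
```

A larger deltaMargin pushes training to separate the gold tree from badly wrong candidates by a wider score gap.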


unknownNumberVector

public boolean unknownNumberVector
Whether or not to build an unknown word vector specifically for numbers


unknownDashedWordVectors

public boolean unknownDashedWordVectors
Whether or not to handle unknown dashed words by taking the last part


unknownCapsVector

public boolean unknownCapsVector
Whether or not to build an unknown word vector for words with caps in them


dvSimplifiedModel

public boolean dvSimplifiedModel
Make the dv model as simple as possible


unknownChineseYearVector

public boolean unknownChineseYearVector
Whether or not to build an unknown word vector to match Chinese years


unknownChineseNumberVector

public boolean unknownChineseNumberVector
Whether or not to build an unknown word vector to match Chinese numbers


unknownChinesePercentVector

public boolean unknownChinesePercentVector
Whether or not to build an unknown word vector to match Chinese percentages


DEFAULT_SCALING_FOR_INIT

public static final double DEFAULT_SCALING_FOR_INIT
See Also:
Constant Field Values

scalingForInit

public double scalingForInit
How much to scale certain parameters when initializing models. For example, the DVParser uses this to rescale its initial matrices.


maxTrainTimeSeconds

public int maxTrainTimeSeconds

DEFAULT_UNK_WORD

public static final java.lang.String DEFAULT_UNK_WORD
See Also:
Constant Field Values

unkWord

public java.lang.String unkWord
Some models will use external data sources which contain information about unknown words. This variable is a way to provide the name of the unknown word in the external data source.


lowercaseWordVectors

public boolean lowercaseWordVectors
Whether or not to lowercase word vectors


transformMatrixType

public TrainOptions.TransformMatrixType transformMatrixType

useContextWords

public boolean useContextWords

Constructor Detail

TrainOptions

public TrainOptions()

Method Detail

outsideFactor

public boolean outsideFactor()
If true, declare early -- leave this on except maybe with markov on.

Returns:
Whether to do outside factorization in binarization of the grammar

compactGrammar

public int compactGrammar()

display

public void display()

toString

public java.lang.String toString()
Overrides:
toString in class java.lang.Object

printTrainTree

public static void printTrainTree(java.io.PrintWriter pw,
                                  java.lang.String message,
                                  Tree t)


Stanford NLP Group