edu.stanford.nlp.parser.lexparser
Class TrainOptions

java.lang.Object
  extended by edu.stanford.nlp.parser.lexparser.TrainOptions
All Implemented Interfaces:
Serializable

public class TrainOptions
extends Object
implements Serializable

Non-language-specific options for training a grammar from a treebank. These options are not used at parsing time. However, they are all static, so at present it is not possible to train multiple parsers in multiple threads with different options until this is changed.

Author:
Dan Klein, Christopher Manning
See Also:
Serialized Form

Field Summary
 boolean basicCategoryTagsInDependencyGrammar
          Whether to use the basic or split tags in the dependency grammar.
 boolean cheatPCFG
          Add all test set trees to training data for PCFG.
 boolean collinsPunc
          Promote/delete punctuation like Collins.
 int compactGrammar
          How to compact grammars as FSMs.
 Set<String> deleteSplitters
           
 double fractionBeforeUnseenCounting
          Start to aggregate signature-tag pairs only for words unseen in the first fraction of the data given by this value.
 boolean gPA
          This variable controls doing 2 levels of parent annotation.
 int HSEL_CUT
           
 boolean hSelSplit
           
 boolean leftRec
          Left edge is left-recursive (X << X). Bad.
 boolean leftToRight
           
 boolean markFinalStates
          Whether or not to mark final states in binarized grammar.
 boolean markovFactor
          Whether to do "horizontal Markovization" (as in ACL 2003 paper).
 int markovOrder
           
 int markUnary
          Mark all unary nodes specially.
 boolean markUnaryTags
          Mark POS tags which are the sole member of their phrasal constituent.
 boolean noTagSplit
           
 int openClassTypesThreshold
          A POS tag has to have been attributed to more than this number of word types before it is regarded as an open-class tag.
 boolean PA
          This variable controls doing parent annotation of phrasal nodes.
 boolean postGPA
           
 boolean postPA
           
 Set postSplitters
           
 boolean postSplitWithBaseCategory
          Whether, in post-splitting of categories, nodes are annotated with the (grand)parent's base category or with its complete subcategorized category.
 boolean predictSplits
          Use the method reported by Berkeley for splitting and recombining states.
 TreeTransformer preTransformer
          A transformer to use on the training data before any other processing step.
 PrintWriter printAnnotatedPW
           
 boolean printAnnotatedRuleCounts
           
 boolean printAnnotatedStateCounts
           
 PrintWriter printBinarizedPW
           
 boolean printStates
           
 int printTreeTransformations
          Just for debugging: check that your tree transforms work correctly.
 boolean rightRec
          Right edge is right-recursive (X << X). Bad.
 double ruleDiscount
          Discounts the count of BinaryRules (only, apparently) in training data.
 boolean ruleSmoothing
          Enables linear rule smoothing during grammar extraction but before grammar compaction.
 double ruleSmoothingAlpha
           
 boolean selectivePostSplit
           
 double selectivePostSplitCutOff
           
 boolean selectiveSplit
          Only split the "common high KL divergence" parent categories....
 double selectiveSplitCutOff
           
 boolean sisterAnnotate
          Selective Sister annotation.
 Set<String> sisterSplitters
           
 boolean smoothing
          TODO wsg2011: This is the old grammar smoothing parameter that no longer does anything in the parser.
 int splitCount
          If we are predicting splits, we loop this many times.
 boolean splitPrePreT
          Mark all pre-preterminals (also does splitBaseNP: don't need both)
 double splitRecombineRate
          If we are predicting splits, we recombine states at this rate every loop.
 Set<String> splitters
          Set the splitter strings.
 String taggedFiles
          A set of files to use as extra information in the lexicon.
 boolean tagPA
          Parent annotation on tags.
 boolean tagSelectivePostSplit
           
 double tagSelectivePostSplitCutOff
           
 boolean tagSelectiveSplit
          Do parent annotation on tags selectively.
 double tagSelectiveSplitCutOff
           
 int trainLengthLimit
           
 String trainTreeFile
           
 
Constructor Summary
TrainOptions()
           
 
Method Summary
 int compactGrammar()
           
 void display()
           
 boolean outsideFactor()
          If true, declare early -- leave this on except maybe with markov on.
static void printTrainTree(PrintWriter pw, String message, Tree t)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

trainTreeFile

public String trainTreeFile

trainLengthLimit

public int trainLengthLimit

cheatPCFG

public boolean cheatPCFG
Add all test set trees to training data for PCFG. (Currently only supported in FactoredParser main.)


markovFactor

public boolean markovFactor
Whether to do "horizontal Markovization" (as in ACL 2003 paper). False means regular PCFG expansions.


markovOrder

public int markovOrder
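The following is a minimal sketch of horizontal Markovization, not the parser's own binarization code. It assumes that markovOrder bounds how many already-generated sisters an intermediate symbol remembers. The class name, method name, and the "@parent|sisters" label format are hypothetical and chosen only for illustration.

```java
import java.util.List;

final class HorizontalMarkovSketch {

    // Hypothetical label format: "@" + parent + "|" + remembered sisters joined by "_".
    static String intermediateLabel(String parent, List<String> generatedSisters, int order) {
        if (order < 0) {
            throw new IllegalArgumentException("order must be >= 0");
        }
        // Keep only the last `order` sisters; older horizontal history is forgotten.
        int from = Math.max(0, generatedSisters.size() - order);
        List<String> remembered = generatedSisters.subList(from, generatedSisters.size());
        return "@" + parent + "|" + String.join("_", remembered);
    }

    public static void main(String[] args) {
        List<String> sisters = List.of("VBD", "NP", "PP");
        System.out.println(intermediateLabel("VP", sisters, 1));
        System.out.println(intermediateLabel("VP", sisters, 2));
    }
}
```

A lower order makes more intermediate symbols collapse into the same state, which yields a smaller grammar at the cost of less sister context.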

hSelSplit

public boolean hSelSplit

HSEL_CUT

public int HSEL_CUT

markFinalStates

public boolean markFinalStates
Whether or not to mark final states in binarized grammar. This must be off to get the most value out of grammar compaction.


openClassTypesThreshold

public int openClassTypesThreshold
A POS tag has to have been attributed to more than this number of word types before it is regarded as an open-class tag. Unknown words can be tagged only with open-class tags (unless flexiTag is on). If flexiTag is on, unknown words can be tagged with any POS for which the unseenMap has a nonzero count (that is, the tag was seen for a new word after unseen signature counting was started).
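The threshold test described above can be sketched as follows. This is an illustration under stated assumptions, not the lexicon's actual code: the class and method names are hypothetical, and each training token is represented as a two-element array of word and tag.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class OpenClassTagSketch {

    // Returns the tags that were seen with MORE than `threshold` distinct word types.
    static Set<String> openClassTags(List<String[]> taggedWords, int threshold) {
        Map<String, Set<String>> typesPerTag = new HashMap<>();
        for (String[] wordAndTag : taggedWords) {
            // A set per tag, so repeated tokens of the same word count once.
            typesPerTag.computeIfAbsent(wordAndTag[1], k -> new HashSet<>()).add(wordAndTag[0]);
        }
        Set<String> open = new TreeSet<>();
        for (Map.Entry<String, Set<String>> entry : typesPerTag.entrySet()) {
            if (entry.getValue().size() > threshold) {
                open.add(entry.getKey());
            }
        }
        return open;
    }

    public static void main(String[] args) {
        List<String[]> data = List.of(
            new String[] {"dog", "NN"}, new String[] {"cat", "NN"},
            new String[] {"bird", "NN"}, new String[] {"the", "DT"},
            new String[] {"a", "DT"}, new String[] {"dog", "NN"});
        System.out.println(openClassTags(data, 2));
    }
}
```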


fractionBeforeUnseenCounting

public double fractionBeforeUnseenCounting
Start to aggregate signature-tag pairs only for words unseen in the first fraction of the data given by this value.
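One possible reading of this option is sketched below. It is an assumption-laden illustration rather than the lexicon's implementation: the class name, the two-element word/tag arrays, and the toy signature function ("UNK-CAP" for capitalized words, "UNK" otherwise) are all hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class UnseenCountingSketch {

    // Hypothetical word signature; assumes a non-empty word.
    static String signature(String word) {
        return Character.isUpperCase(word.charAt(0)) ? "UNK-CAP" : "UNK";
    }

    // Counts signature/tag pairs for words first encountered after the initial fraction.
    static Map<String, Integer> unseenCounts(List<String[]> tokens, double fraction) {
        int start = (int) (tokens.size() * fraction);
        Set<String> seen = new HashSet<>();
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            String word = tokens.get(i)[0];
            String tag = tokens.get(i)[1];
            // Only count once past the initial fraction, and only for new words.
            if (i >= start && !seen.contains(word)) {
                counts.merge(signature(word) + "/" + tag, 1, Integer::sum);
            }
            seen.add(word);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> tokens = List.of(
            new String[] {"the", "DT"}, new String[] {"dog", "NN"},
            new String[] {"barks", "VBZ"}, new String[] {"Rex", "NNP"});
        System.out.println(unseenCounts(tokens, 0.5));
    }
}
```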


PA

public boolean PA
This variable controls doing parent annotation of phrasal nodes. Good.
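Parent annotation of phrasal nodes can be illustrated with a small stand-in tree. This is a sketch only: the Node class below is hypothetical and is not the library's Tree type, and the "^" separator and the decision to leave the root unannotated are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;

final class ParentAnnotationSketch {

    // Hypothetical minimal tree node; not the library's Tree class.
    static final class Node {
        String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }

        boolean isLeaf() { return children.isEmpty(); }

        boolean isPreterminal() { return children.size() == 1 && children.get(0).isLeaf(); }
    }

    // Appends "^parent" to every phrasal node; words and POS tags are left alone.
    static void annotate(Node node, String parentLabel) {
        if (node.isLeaf() || node.isPreterminal()) {
            return;
        }
        String original = node.label;
        if (parentLabel != null) {
            node.label = original + "^" + parentLabel;
        }
        for (Node child : node.children) {
            annotate(child, original); // children see the unannotated parent category
        }
    }

    static String render(Node node) {
        if (node.isLeaf()) {
            return node.label;
        }
        StringBuilder sb = new StringBuilder("(").append(node.label);
        for (Node child : node.children) {
            sb.append(' ').append(render(child));
        }
        return sb.append(')').toString();
    }

    static Node example() {
        return new Node("S",
            new Node("NP", new Node("DT", new Node("the")), new Node("NN", new Node("dog"))),
            new Node("VP", new Node("VBZ", new Node("barks"))));
    }

    public static void main(String[] args) {
        Node tree = example();
        annotate(tree, null);
        System.out.println(render(tree));
    }
}
```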


gPA

public boolean gPA
This variable controls doing 2 levels of parent annotation. Bad.


postPA

public boolean postPA

postGPA

public boolean postGPA

selectiveSplit

public boolean selectiveSplit
Only split the "common high KL divergence" parent categories.... Good.


selectiveSplitCutOff

public double selectiveSplitCutOff

selectivePostSplit

public boolean selectivePostSplit

selectivePostSplitCutOff

public double selectivePostSplitCutOff

postSplitWithBaseCategory

public boolean postSplitWithBaseCategory
Whether, in post-splitting of categories, nodes are annotated with the (grand)parent's base category or with its complete subcategorized category.


sisterAnnotate

public boolean sisterAnnotate
Selective Sister annotation.


sisterSplitters

public Set<String> sisterSplitters

markUnary

public int markUnary
Mark all unary nodes specially. Good for just PCFG. Bad for factored. markUnary affects phrasal nodes. A value of 0 means to do nothing; a value of 1 means to mark the parent (higher) node of a unary rewrite. A value of 2 means to mark the child (lower) node of a unary rewrite. Values of 1 and 2 only apply if the child (lower) node is phrasal. (A value of 1 is better than 2 in combos.) A value of 1 corresponds to the old boolean -unary flag.
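The three values can be illustrated on a stand-in tree. This is a sketch under assumptions: the Node class is hypothetical (not the library's Tree type), and the "-U" suffix is an invented marker, not necessarily the one the parser emits.

```java
import java.util.Arrays;
import java.util.List;

final class MarkUnarySketch {

    // Hypothetical minimal tree node; not the library's Tree class.
    static final class Node {
        String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }

        boolean isLeaf() { return children.isEmpty(); }

        boolean isPreterminal() { return children.size() == 1 && children.get(0).isLeaf(); }

        boolean isPhrasal() { return !isLeaf() && !isPreterminal(); }
    }

    // 0: do nothing; 1: mark the parent of a unary rewrite; 2: mark the child.
    // Marking only applies when the child of the unary rewrite is phrasal.
    static void mark(Node node, int markUnary) {
        if (node.children.size() == 1) {
            Node child = node.children.get(0);
            if (child.isPhrasal()) {
                if (markUnary == 1) {
                    node.label += "-U";   // hypothetical marker
                } else if (markUnary == 2) {
                    child.label += "-U";  // hypothetical marker
                }
            }
        }
        for (Node child : node.children) {
            mark(child, markUnary);
        }
    }

    static String render(Node node) {
        if (node.isLeaf()) {
            return node.label;
        }
        StringBuilder sb = new StringBuilder("(").append(node.label);
        for (Node child : node.children) {
            sb.append(' ').append(render(child));
        }
        return sb.append(')').toString();
    }

    static Node example() {
        return new Node("S", new Node("VP", new Node("VB", new Node("go"))));
    }

    public static void main(String[] args) {
        for (int mode = 0; mode <= 2; mode++) {
            Node tree = example();
            mark(tree, mode);
            System.out.println(mode + ": " + render(tree));
        }
    }
}
```

In the example tree, the rewrite of VP as VB is unary too, but VB is a POS tag rather than a phrasal node, so neither value marks it.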


markUnaryTags

public boolean markUnaryTags
Mark POS tags which are the sole member of their phrasal constituent. This is like markUnary=2, applied to POS tags.


splitPrePreT

public boolean splitPrePreT
Mark all pre-preterminals (also does splitBaseNP: don't need both)


tagPA

public boolean tagPA
Parent annotation on tags. Good (for PCFG?)


tagSelectiveSplit

public boolean tagSelectiveSplit
Do parent annotation on tags selectively. Neutral, but fewer splits.


tagSelectiveSplitCutOff

public double tagSelectiveSplitCutOff

tagSelectivePostSplit

public boolean tagSelectivePostSplit

tagSelectivePostSplitCutOff

public double tagSelectivePostSplitCutOff

rightRec

public boolean rightRec
Right edge is right-recursive (X << X). Bad. (NP only is good.)


leftRec

public boolean leftRec
Left edge is left-recursive (X << X). Bad.


collinsPunc

public boolean collinsPunc
Promote/delete punctuation like Collins. Bad (!)


splitters

public Set<String> splitters
Set the splitter strings. These are a set of parent and/or grandparent annotated categories which should be split off.
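A minimal sketch of how such a set could gate annotation, assuming the entries are strings of the form "category^parent". The class and method names are hypothetical, and the lookup shown here is an illustration rather than the parser's own logic.

```java
import java.util.Set;

final class SplitterLookupSketch {

    // Returns the parent-annotated label only if it is listed as a splitter;
    // otherwise the category is left unsplit.
    static String selectiveLabel(String category, String parent, Set<String> splitters) {
        String annotated = category + "^" + parent;
        return splitters.contains(annotated) ? annotated : category;
    }

    public static void main(String[] args) {
        Set<String> splitters = Set.of("NP^S", "VP^S");
        System.out.println(selectiveLabel("NP", "S", splitters));
        System.out.println(selectiveLabel("NP", "VP", splitters));
    }
}
```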


postSplitters

public Set postSplitters

deleteSplitters

public Set<String> deleteSplitters

printTreeTransformations

public int printTreeTransformations
Just for debugging: check that your tree transforms work correctly. This will print the transformations of the first printTreeTransformations trees.


printAnnotatedPW

public PrintWriter printAnnotatedPW

printBinarizedPW

public PrintWriter printBinarizedPW

printStates

public boolean printStates

compactGrammar

public int compactGrammar
How to compact grammars as FSMs. 0 = no compaction [uses makeSyntheticLabel1], 1 = no compaction but use label names that wrap from right to left in binarization [uses makeSyntheticLabel2], 2 = wrapping labels and materialize unary at top rewriting passive to active, 3 = ExactGrammarCompactor, 4 = LossyGrammarCompactor, 5 = CategoryMergingGrammarCompactor. (May 2007 CDM note: options 4 and 5 don't seem to be functioning sensibly. 0, 1, and 3 seem to be the 'good' options. 2 is only useful as input to 3. There seems to be no reason not to use 0, despite the default.)
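For quick reference, the values listed above can be written as a lookup. The class and method names are hypothetical; the descriptions and the set of values noted as working well (0, 1, and 3) are taken from the text above.

```java
final class CompactGrammarValuesSketch {

    // Maps each documented value to a short description.
    static String describe(int compactGrammar) {
        switch (compactGrammar) {
            case 0: return "no compaction (makeSyntheticLabel1)";
            case 1: return "no compaction, right-to-left wrapping labels (makeSyntheticLabel2)";
            case 2: return "wrapping labels, materialize unary at top rewriting passive to active";
            case 3: return "ExactGrammarCompactor";
            case 4: return "LossyGrammarCompactor";
            case 5: return "CategoryMergingGrammarCompactor";
            default: throw new IllegalArgumentException("Unknown compactGrammar value: " + compactGrammar);
        }
    }

    // Per the May 2007 note: 0, 1, and 3 seem to be the 'good' options.
    static boolean isNotedAsGood(int compactGrammar) {
        return compactGrammar == 0 || compactGrammar == 1 || compactGrammar == 3;
    }

    public static void main(String[] args) {
        for (int value = 0; value <= 5; value++) {
            System.out.println(value + " = " + describe(value)
                + (isNotedAsGood(value) ? " [noted as good]" : ""));
        }
    }
}
```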


leftToRight

public boolean leftToRight

noTagSplit

public boolean noTagSplit

ruleSmoothing

public boolean ruleSmoothing
Enables linear rule smoothing during grammar extraction but before grammar compaction. The alpha term is the same as that described in Petrov et al. (2006), and has range [0,1].


ruleSmoothingAlpha

public double ruleSmoothingAlpha
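As a rough illustration of linear smoothing with an alpha term in [0,1], the sketch below interpolates each probability with the mean of its group. It assumes ruleSmoothingAlpha supplies the alpha term mentioned for ruleSmoothing; which rules are grouped together, and whether the parser computes it this way, is not confirmed here. The class and method names are hypothetical.

```java
final class LinearSmoothingSketch {

    // Interpolates each value with the group mean: (1 - alpha) * p + alpha * mean.
    static double[] smooth(double[] probabilities, double alpha) {
        if (alpha < 0.0 || alpha > 1.0) {
            throw new IllegalArgumentException("alpha must be in [0,1]");
        }
        double mean = 0.0;
        for (double p : probabilities) {
            mean += p;
        }
        mean /= probabilities.length;

        double[] smoothed = new double[probabilities.length];
        for (int i = 0; i < probabilities.length; i++) {
            smoothed[i] = (1.0 - alpha) * probabilities[i] + alpha * mean;
        }
        return smoothed;
    }

    public static void main(String[] args) {
        double[] result = smooth(new double[] {0.8, 0.2}, 0.5);
        System.out.println(result[0] + " " + result[1]);
    }
}
```

With alpha at 0 the estimates are unchanged, and with alpha at 1 every estimate collapses to the mean, so the total probability mass of the group is preserved in both cases and in between.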

smoothing

public boolean smoothing
TODO wsg2011: This is the old grammar smoothing parameter that no longer does anything in the parser. It should be removed.


ruleDiscount

public double ruleDiscount
Discounts the count of BinaryRules (only, apparently) in training data.


printAnnotatedRuleCounts

public boolean printAnnotatedRuleCounts

printAnnotatedStateCounts

public boolean printAnnotatedStateCounts

basicCategoryTagsInDependencyGrammar

public boolean basicCategoryTagsInDependencyGrammar
Whether to use the basic or split tags in the dependency grammar.


preTransformer

public TreeTransformer preTransformer
A transformer to use on the training data before any other processing step. This is specified by using the -preTransformer flag when training the parser. A comma-separated list of classes will be turned into a CompositeTransformer. This can be used to strip subcategories, to run a tsurgeon pattern, or to perform any number of other useful operations.
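The idea of chaining several transformers into one can be sketched with plain functions. This example deliberately avoids the library's TreeTransformer and CompositeTransformer types: it uses UnaryOperator over bracketed strings as a hypothetical stand-in, and assumes the steps are applied in the order listed.

```java
import java.util.List;
import java.util.function.UnaryOperator;

final class CompositeTransformSketch {

    // Builds one transformer that applies each step in order.
    static UnaryOperator<String> compose(List<UnaryOperator<String>> steps) {
        return input -> {
            String result = input;
            for (UnaryOperator<String> step : steps) {
                result = step.apply(result);
            }
            return result;
        };
    }

    public static void main(String[] args) {
        // Step 1 strips subcategory suffixes such as "-SBJ"; step 2 renames a tag.
        UnaryOperator<String> stripSubcategories = s -> s.replaceAll("-[A-Z]+", "");
        UnaryOperator<String> renameTag = s -> s.replace("NN", "NOUN");

        UnaryOperator<String> composite = compose(List.of(stripSubcategories, renameTag));
        System.out.println(composite.apply("(NP-SBJ (NN dog))"));
    }
}
```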


taggedFiles

public String taggedFiles
A set of files to use as extra information in the lexicon. This can provide tagged words which are not part of trees.


predictSplits

public boolean predictSplits
Use the method reported by Berkeley for splitting and recombining states. This is an experimental reimplementation of that work and is still in development.


splitCount

public int splitCount
If we are predicting splits, we loop this many times.


splitRecombineRate

public double splitRecombineRate
If we are predicting splits, we recombine states at this rate every loop.

Constructor Detail

TrainOptions

public TrainOptions()
Method Detail

outsideFactor

public boolean outsideFactor()
If true, declare early -- leave this on except maybe with markov on.

Returns:
Whether to do outside factorization in binarization of the grammar

compactGrammar

public int compactGrammar()

display

public void display()

printTrainTree

public static void printTrainTree(PrintWriter pw,
                                  String message,
                                  Tree t)


Stanford NLP Group