edu.stanford.nlp.parser.lexparser
Class Train

java.lang.Object
  extended by edu.stanford.nlp.parser.lexparser.Train

public class Train
extends java.lang.Object

Non-language-specific options for training a grammar from a treebank. These options are not used at parsing time. Because they are all static, it is not currently possible to train multiple parsers with different options in separate threads; that will remain the case until this is changed.
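
The consequence of static options can be illustrated with a minimal sketch. StaticOptionsSketch and its describeRun method are hypothetical stand-ins invented for this illustration; only the field names PA and markovOrder echo this class, and the default values shown are assumptions.

```java
final class StaticOptionsSketch {
    // Hypothetical stand-ins for static option fields; the defaults are illustrative only.
    static boolean PA = true;
    static int markovOrder = 1;

    // A pretend training run that reads the shared static options when it executes.
    static String describeRun(String name) {
        return name + ": PA=" + PA + ", markovOrder=" + markovOrder;
    }

    public static void main(String[] args) {
        markovOrder = 2; // configuration intended for run A
        markovOrder = 3; // configuration intended for run B replaces it for everyone

        // Both runs observe the same, most recently written value.
        System.out.println(describeRun("A"));
        System.out.println(describeRun("B"));
    }
}
```

Since there is a single copy of each static field per class loader, two training runs in one JVM cannot hold different values at the same time.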

Author:
Dan Klein, Christopher Manning

Field Summary
static boolean basicCategoryTagsInDependencyGrammar
          Whether to use the basic or split tags in the dependency grammar.
static boolean cheatPCFG
          Add all test set trees to training data for PCFG.
static boolean collinsPunc
          Promote/delete punctuation like Collins.
static int compactGrammar
          How to compact grammars as FSMs.
static java.util.Set<java.lang.String> deleteSplitters
           
static double fractionBeforeUnseenCounting
          Start aggregating signature-tag pairs only for words not seen in the initial portion of the data; this value is the fraction of the data that makes up that portion.
static boolean gPA
          This variable controls doing 2 levels of parent annotation.
static int HSEL_CUT
           
static boolean hSelSplit
           
static int leaveItAll
          If true, leave all PTB (functional tag) annotations in place (bad).
static boolean leftRec
          Left edge is right-recursive (X << X) Bad.
static boolean leftToRight
           
static boolean markFinalStates
          Whether or not to mark final states in binarized grammar.
static boolean markovFactor
           
static int markovOrder
           
static int markUnary
          Mark all unary nodes specially.
static boolean markUnaryTags
          Mark POS tags which are the sole member of their phrasal constituent.
static boolean noTagSplit
           
static int openClassTypesThreshold
          A POS tag has to have been attributed to more than this number of word types before it is regarded as an open-class tag.
static boolean PA
          This variable controls doing parent annotation of phrasal nodes.
static boolean postGPA
           
static boolean postPA
           
static java.util.Set postSplitters
           
static boolean postSplitWithBaseCategory
          Whether, in post-splitting of categories, nodes are annotated with the (grand)parent's base category or with its complete subcategorized category.
static java.io.PrintWriter printAnnotatedPW
           
static boolean printAnnotatedRuleCounts
           
static boolean printAnnotatedStateCounts
           
static java.io.PrintWriter printBinarizedPW
           
static boolean printStates
           
static int printTreeTransformations
          Just for debugging: check that your tree transforms work correctly.
static boolean rightRec
          Right edge is right-recursive (X << X) Bad.
static double ruleDiscount
          Discounts the count of BinaryRules (only, apparently) in training data.
static boolean ruleSmoothing
          Enables linear rule smoothing during grammar extraction but before grammar compaction.
static double ruleSmoothingAlpha
           
static boolean selectivePostSplit
           
static double selectivePostSplitCutOff
           
static boolean selectiveSplit
          Only split the "common high KL divergence" parent categories....
static double selectiveSplitCutOff
           
static boolean sisterAnnotate
          Selective Sister annotation.
static java.util.Set<java.lang.String> sisterSplitters
           
static boolean smoothing
          TODO wsg2011: This is the old grammar smoothing parameter that no longer does anything in the parser.
static boolean splitPrePreT
          Mark all pre-preterminals (also does splitBaseNP: don't need both)
static java.util.Set<java.lang.String> splitters
          Set the splitter strings.
static boolean tagPA
          Parent annotation on tags.
static boolean tagSelectivePostSplit
           
static double tagSelectivePostSplitCutOff
           
static boolean tagSelectiveSplit
          Do parent annotation on tags selectively.
static double tagSelectiveSplitCutOff
           
static java.lang.String trainTreeFile
           
 
Method Summary
static int compactGrammar()
           
static void display()
           
static boolean outsideFactor()
          If true, declare early -- leave this on except maybe with markov on.
static void printTrainTree(java.io.PrintWriter pw, java.lang.String message, Tree t)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

trainTreeFile

public static java.lang.String trainTreeFile

leaveItAll

public static int leaveItAll
If true, leave all PTB (functional tag) annotations in place (bad).


cheatPCFG

public static boolean cheatPCFG
Add all test set trees to training data for PCFG. (Currently only supported in FactoredParser main.)


markovFactor

public static boolean markovFactor

markovOrder

public static int markovOrder

hSelSplit

public static boolean hSelSplit

HSEL_CUT

public static int HSEL_CUT

markFinalStates

public static boolean markFinalStates
Whether or not to mark final states in binarized grammar. This must be off to get most value out of grammar compaction.


openClassTypesThreshold

public static int openClassTypesThreshold
A POS tag has to have been attributed to more than this number of word types before it is regarded as an open-class tag. Unknown words can only be tagged with open-class tags (unless flexiTag is on). If flexiTag is on, unknown words can be tagged with any POS for which the unseenMap has a nonzero count (that is, the tag was seen for a new word after unseen signature counting was started).
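
The threshold test described above can be sketched as follows. This is an illustration under assumptions, not the parser's code: OpenClassTagSketch, openClassTags, and the (word, tag) pair representation are all hypothetical, and flexiTag is not modeled.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class OpenClassTagSketch {
    /**
     * Returns the tags that were seen with strictly more than `threshold`
     * distinct word types. Each element of taggedWords is {word, tag}.
     */
    static Set<String> openClassTags(List<String[]> taggedWords, int threshold) {
        Map<String, Set<String>> typesPerTag = new TreeMap<>();
        for (String[] wordAndTag : taggedWords) {
            typesPerTag.computeIfAbsent(wordAndTag[1], k -> new HashSet<>()).add(wordAndTag[0]);
        }
        Set<String> open = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : typesPerTag.entrySet()) {
            if (e.getValue().size() > threshold) {
                open.add(e.getKey());
            }
        }
        return open;
    }

    public static void main(String[] args) {
        List<String[]> data = Arrays.asList(
            new String[] {"dog", "NN"}, new String[] {"cat", "NN"},
            new String[] {"tree", "NN"}, new String[] {"dog", "NN"},
            new String[] {"the", "DT"}, new String[] {"a", "DT"});
        // NN has three word types and DT has two, so only NN exceeds a threshold of 2.
        System.out.println(openClassTags(data, 2));
    }
}
```

Repeated tokens of the same word count once, because the comparison is on word types rather than token frequency.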


fractionBeforeUnseenCounting

public static double fractionBeforeUnseenCounting
Start aggregating signature-tag pairs only for words not seen in the initial portion of the data; this value is the fraction of the data that makes up that portion.
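
One possible reading of this option is sketched below. Everything here is an assumption made for illustration: the class UnseenCountingSketch, the toy signature function, the "signature/tag" key format, and the choice to measure the fraction over word tokens rather than trees.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class UnseenCountingSketch {
    // Toy word signature; a hypothetical stand-in for a real unknown-word signature.
    static String signature(String word) {
        if (Character.isUpperCase(word.charAt(0))) return "UNK-CAP";
        if (word.endsWith("s")) return "UNK-s";
        return "UNK";
    }

    /** Counts signature/tag pairs for words first seen after the initial fraction of the data. */
    static Map<String, Integer> countUnseen(List<String[]> taggedWords, double fraction) {
        int start = (int) (taggedWords.size() * fraction);
        Set<String> seen = new HashSet<>();
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < taggedWords.size(); i++) {
            String word = taggedWords.get(i)[0];
            String tag = taggedWords.get(i)[1];
            // Only words that are new at this point, past the initial portion, are counted.
            if (i >= start && !seen.contains(word)) {
                counts.merge(signature(word) + "/" + tag, 1, Integer::sum);
            }
            seen.add(word);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> data = Arrays.asList(
            new String[] {"The", "DT"}, new String[] {"dog", "NN"},
            new String[] {"barks", "VBZ"}, new String[] {"Dogs", "NNS"},
            new String[] {"dog", "NN"}, new String[] {"runs", "VBZ"});
        System.out.println(countUnseen(data, 0.5));
    }
}
```

Words in the initial portion only populate the set of seen words; they contribute no signature counts themselves.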


PA

public static boolean PA
This variable controls doing parent annotation of phrasal nodes. Good.
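
Parent annotation of phrasal nodes can be sketched on a toy tree. The Node class is a hypothetical stand-in for the parser's Tree class, and the "^" separator is assumed here as a common notation for parent annotation; the labels the parser actually produces may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ParentAnnotationSketch {

    /** Minimal tree node; a hypothetical stand-in, not the parser's Tree class. */
    static final class Node {
        final String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }

        boolean isLeaf() { return children.isEmpty(); }

        boolean isPreTerminal() { return children.size() == 1 && children.get(0).isLeaf(); }

        boolean isPhrasal() { return !isLeaf() && !isPreTerminal(); }

        @Override
        public String toString() {
            if (isLeaf()) return label;
            StringBuilder sb = new StringBuilder("(").append(label);
            for (Node child : children) sb.append(' ').append(child);
            return sb.append(')').toString();
        }
    }

    /** Appends "^parent" to each phrasal node that has a parent; tags and words are left alone. */
    static Node annotate(Node node, String parentLabel) {
        if (!node.isPhrasal()) return node;
        String label = (parentLabel == null) ? node.label : node.label + "^" + parentLabel;
        List<Node> kids = new ArrayList<>();
        // Children are annotated with this node's original (unannotated) label.
        for (Node child : node.children) kids.add(annotate(child, node.label));
        return new Node(label, kids.toArray(new Node[0]));
    }

    static Node sampleTree() {
        return new Node("S",
            new Node("NP", new Node("DT", new Node("the")), new Node("NN", new Node("dog"))),
            new Node("VP", new Node("VBZ", new Node("barks"))));
    }

    public static void main(String[] args) {
        // prints (S (NP^S (DT the) (NN dog)) (VP^S (VBZ barks)))
        System.out.println(annotate(sampleTree(), null));
    }
}
```

POS tags are deliberately untouched in this sketch, since annotating tags is governed by the separate tagPA option.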


gPA

public static boolean gPA
This variable controls doing 2 levels of parent annotation. Bad.


postPA

public static boolean postPA

postGPA

public static boolean postGPA

selectiveSplit

public static boolean selectiveSplit
Only split the "common high KL divergence" parent categories.... Good.


selectiveSplitCutOff

public static double selectiveSplitCutOff

selectivePostSplit

public static boolean selectivePostSplit

selectivePostSplitCutOff

public static double selectivePostSplitCutOff

postSplitWithBaseCategory

public static boolean postSplitWithBaseCategory
Whether, in post-splitting of categories, nodes are annotated with the (grand)parent's base category or with its complete subcategorized category.


sisterAnnotate

public static boolean sisterAnnotate
Selective Sister annotation.


sisterSplitters

public static java.util.Set<java.lang.String> sisterSplitters

markUnary

public static int markUnary
Mark all unary nodes specially. Good for just PCFG. Bad for factored. markUnary affects phrasal nodes. A value of 0 means to do nothing; a value of 1 means to mark the parent (higher) node of a unary rewrite. A value of 2 means to mark the child (lower) node of a unary rewrite. Values of 1 and 2 only apply if the child (lower) node is phrasal. (A value of 1 is better than 2 in combos.) A value of 1 corresponds to the old boolean -unary flag.
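
The three values can be compared on a toy tree. This is a sketch under assumptions: the Node class is a hypothetical stand-in for the parser's Tree class, and the "-U" suffix is an invented marker used only for illustration.

```java
import java.util.Arrays;
import java.util.List;

final class MarkUnarySketch {

    /** Minimal tree node; a hypothetical stand-in, not the parser's Tree class. */
    static final class Node {
        final String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }

        boolean isLeaf() { return children.isEmpty(); }

        boolean isPreTerminal() { return children.size() == 1 && children.get(0).isLeaf(); }

        boolean isPhrasal() { return !isLeaf() && !isPreTerminal(); }

        @Override
        public String toString() {
            if (isLeaf()) return label;
            StringBuilder sb = new StringBuilder("(").append(label);
            for (Node child : children) sb.append(' ').append(child);
            return sb.append(')').toString();
        }
    }

    /** markUnary: 0 = no marking, 1 = mark the parent of a unary rewrite, 2 = mark the child. */
    static Node mark(Node node, int markUnary, boolean isUnaryChild) {
        if (!node.isPhrasal()) return node;
        // A unary rewrite only counts here when its single child is itself phrasal.
        boolean unaryOverPhrasal = node.children.size() == 1 && node.children.get(0).isPhrasal();
        String label = node.label;
        if (markUnary == 1 && unaryOverPhrasal) label += "-U"; // "-U" is a made-up marker
        if (markUnary == 2 && isUnaryChild) label += "-U";
        Node[] kids = new Node[node.children.size()];
        for (int i = 0; i < kids.length; i++) {
            kids[i] = mark(node.children.get(i), markUnary, unaryOverPhrasal);
        }
        return new Node(label, kids);
    }

    static Node sampleTree() {
        return new Node("ROOT",
            new Node("S",
                new Node("NP", new Node("NN", new Node("dogs"))),
                new Node("VP", new Node("VBZ", new Node("bark")))));
    }

    public static void main(String[] args) {
        for (int mode = 0; mode <= 2; mode++) {
            System.out.println(mode + ": " + mark(sampleTree(), mode, false));
        }
    }
}
```

In the sample tree only ROOT over S qualifies: NP over NN and VP over VBZ are unary, but their children are POS tags rather than phrasal nodes.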


markUnaryTags

public static boolean markUnaryTags
Mark POS tags which are the sole member of their phrasal constituent. This is like markUnary=2, applied to POS tags.


splitPrePreT

public static boolean splitPrePreT
Mark all pre-preterminals (also does splitBaseNP: don't need both)


tagPA

public static boolean tagPA
Parent annotation on tags. Good (for PCFG?)


tagSelectiveSplit

public static boolean tagSelectiveSplit
Do parent annotation on tags selectively. Neutral, but fewer splits.


tagSelectiveSplitCutOff

public static double tagSelectiveSplitCutOff

tagSelectivePostSplit

public static boolean tagSelectivePostSplit

tagSelectivePostSplitCutOff

public static double tagSelectivePostSplitCutOff

rightRec

public static boolean rightRec
Right edge is right-recursive (X << X) Bad. (NP only is good)


leftRec

public static boolean leftRec
Left edge is right-recursive (X << X) Bad.


collinsPunc

public static boolean collinsPunc
Promote/delete punctuation like Collins. Bad (!)


splitters

public static java.util.Set<java.lang.String> splitters
Set the splitter strings. These are a set of parent and/or grandparent annotated categories which should be split off.
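
How a splitter set might gate annotation is sketched below. The class SplitterSketch, the label method, and the "category^parent" string format are assumptions made for this illustration; the format the parser expects for splitter strings is not specified on this page.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class SplitterSketch {
    /**
     * Returns the parent-annotated label only when it is listed in the
     * splitter set; otherwise the plain category is kept.
     * The "category^parent" format is an assumption of this sketch.
     */
    static String label(String category, String parent, Set<String> splitters) {
        String annotated = category + "^" + parent;
        return splitters.contains(annotated) ? annotated : category;
    }

    public static void main(String[] args) {
        Set<String> splitters = new HashSet<>(Arrays.asList("NP^S", "PP^VP"));
        System.out.println(label("NP", "S", splitters));  // listed, so it is split off
        System.out.println(label("NP", "VP", splitters)); // not listed, so it stays NP
    }
}
```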


postSplitters

public static java.util.Set postSplitters

deleteSplitters

public static java.util.Set<java.lang.String> deleteSplitters

printTreeTransformations

public static int printTreeTransformations
Just for debugging: check that your tree transforms work correctly. This will print the transformations of the first printTreeTransformations trees.


printAnnotatedPW

public static java.io.PrintWriter printAnnotatedPW

printBinarizedPW

public static java.io.PrintWriter printBinarizedPW

printStates

public static boolean printStates

compactGrammar

public static int compactGrammar
How to compact grammars as FSMs. 0 = no compaction [uses makeSyntheticLabel1], 1 = no compaction but use label names that wrap from right to left in binarization [uses makeSyntheticLabel2], 2 = wrapping labels and materialize unary at top rewriting passive to active, 3 = ExactGrammarCompactor, 4 = LossyGrammarCompactor, 5 = CategoryMergingGrammarCompactor. (May 2007 CDM note: options 4 and 5 don't seem to be functioning sensibly. 0, 1, and 3 seem to be the 'good' options. 2 is only useful as input to 3. There seems to be no reason not to use 0, despite the default.)
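
For quick reference, the values listed above can be restated as a lookup. CompactGrammarSketch and describe are hypothetical names; the descriptions are taken from the text above, and rejecting other values is a choice made by this sketch, not documented behavior.

```java
final class CompactGrammarSketch {
    /** Maps a compactGrammar value to the description given in the field documentation. */
    static String describe(int compactGrammar) {
        switch (compactGrammar) {
            case 0: return "no compaction (makeSyntheticLabel1)";
            case 1: return "no compaction, right-to-left wrapping labels (makeSyntheticLabel2)";
            case 2: return "wrapping labels, materialize unary at top rewriting passive to active";
            case 3: return "ExactGrammarCompactor";
            case 4: return "LossyGrammarCompactor";
            case 5: return "CategoryMergingGrammarCompactor";
            default:
                // Rejecting unknown values is an assumption of this sketch.
                throw new IllegalArgumentException("unknown compactGrammar value: " + compactGrammar);
        }
    }

    public static void main(String[] args) {
        for (int value = 0; value <= 5; value++) {
            System.out.println(value + " = " + describe(value));
        }
    }
}
```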


leftToRight

public static boolean leftToRight

noTagSplit

public static boolean noTagSplit

ruleSmoothing

public static boolean ruleSmoothing
Enables linear rule smoothing during grammar extraction but before grammar compaction. The alpha term is the same as that described in Petrov et al. (2006), and has range [0,1].
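
A minimal sketch of linear smoothing follows, assuming the usual formulation in which each probability is interpolated with the mean of a group of related probabilities: p' = (1 - alpha) * p + alpha * mean. How the parser chooses the group and applies the result is not described on this page, so LinearSmoothingSketch and smooth are illustrative names only.

```java
import java.util.Arrays;

final class LinearSmoothingSketch {
    /**
     * Interpolates each probability with the mean of the group:
     * p'[i] = (1 - alpha) * p[i] + alpha * mean(p), with alpha in [0, 1].
     */
    static double[] smooth(double[] p, double alpha) {
        if (alpha < 0.0 || alpha > 1.0) {
            throw new IllegalArgumentException("alpha must be in [0, 1]: " + alpha);
        }
        double mean = 0.0;
        for (double v : p) mean += v;
        mean /= p.length;

        double[] out = new double[p.length];
        for (int i = 0; i < p.length; i++) {
            out[i] = (1.0 - alpha) * p[i] + alpha * mean;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] p = {0.2, 0.6};
        System.out.println(Arrays.toString(smooth(p, 0.0))); // alpha = 0 leaves values unchanged
        System.out.println(Arrays.toString(smooth(p, 0.5))); // values move halfway to the mean
        System.out.println(Arrays.toString(smooth(p, 1.0))); // alpha = 1 replaces values with the mean
    }
}
```

Under this formulation the sum of the group is preserved, so smoothing redistributes mass among the group's members without changing their total.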


ruleSmoothingAlpha

public static double ruleSmoothingAlpha

smoothing

public static boolean smoothing
TODO wsg2011: This is the old grammar smoothing parameter that no longer does anything in the parser. It should be removed.


ruleDiscount

public static double ruleDiscount
Discounts the count of BinaryRules (only, apparently) in training data.


printAnnotatedRuleCounts

public static boolean printAnnotatedRuleCounts

printAnnotatedStateCounts

public static boolean printAnnotatedStateCounts

basicCategoryTagsInDependencyGrammar

public static boolean basicCategoryTagsInDependencyGrammar
Whether to use the basic or split tags in the dependency grammar.

Method Detail

outsideFactor

public static boolean outsideFactor()
If true, declare early -- leave this on except maybe with markov on.

Returns:
Whether to do outside factorization in binarization of the grammar

compactGrammar

public static int compactGrammar()

display

public static void display()

printTrainTree

public static void printTrainTree(java.io.PrintWriter pw,
                                  java.lang.String message,
                                  Tree t)


Stanford NLP Group