edu.stanford.nlp.parser.lexparser
Class ChineseTreebankParserParams

java.lang.Object
  extended by edu.stanford.nlp.parser.lexparser.AbstractTreebankParserParams
      extended by edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams
All Implemented Interfaces:
TreebankLangParserParams, TreebankFactory, Serializable

public class ChineseTreebankParserParams
extends AbstractTreebankParserParams

Parameter file for parsing the Penn Chinese Treebank. Includes category enrichments specific to the Penn Chinese Treebank.

Author:
Roger Levy, Christopher Manning, Galen Andrew
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class edu.stanford.nlp.parser.lexparser.AbstractTreebankParserParams
AbstractTreebankParserParams.AnnotatePunctuationFunction, AbstractTreebankParserParams.RemoveGFSubcategoryStripper, AbstractTreebankParserParams.SubcategoryStripper
 
Field Summary
 boolean bikelHeadFinder
           
 boolean charTags
           
 boolean chineseSelectiveTagPA
           
 boolean chineseSplitDouHao
          Chinese: Split the dou hao (a punctuation mark separating members of a list) from other punctuation.
 boolean chineseSplitPunct
          Chinese: split Chinese punctuation several ways, along the lines of English punctuation plus another category for the dou hao.
 boolean chineseSplitPunctLR
          Chinese: split left/right paren and quote punctuation (if chineseSplitPunct is also true).
 int chineseSplitVP
          Chinese VP splitting.
 boolean chineseVerySelectiveTagPA
           
static boolean DEFAULT_USE_GOOD_TURNING_UNKNOWN_WORD_MODEL
          Parameters specific for creating a ChineseLexicon
 boolean discardFrags
           
 boolean dominatesV
          Verbal distance -- mark whether symbol dominates a verb (V*).
 boolean gpaAD
          Grandparent annotate all AD.
 double lengthPenalty
          Parameters for a ChineseCharacterBasedLexicon
 boolean markADgrandchildOfIP
          Chinese: mark ADs that are grandchild of IP.
 boolean markCC
          Mark phrases which are conjunctions.
 boolean markIPadjsubj
           
 boolean markIPconj
          Chinese: mark IPs that are conjuncts.
 boolean markIPsisDEC
          Chinese: mark IPs that are part of prenominal modifiers.
 boolean markIPsisterBA
          Chinese: mark IPs that are sister of BA.
 boolean markIPsisterVVorP
          Chinese: mark IP's that are sister of VV or P.
 boolean markModifiedNP
          Chinese: mark left-modified NPs (rightmost NPs with a left-side mod).
 boolean markMultiNtag
          Chinese: mark nominal tags that are part of multi-nominal rewrites.
 boolean markNPconj
          Chinese: mark NPs that are conjuncts.
 boolean markNPmodNP
          Chinese: mark NP modifiers of NPs.
 boolean markPostverbalP
          Chinese: mark P with a left aunt VV, and PP with a left sister VV.
 boolean markPostverbalPP
           
 boolean markPsisterIP
          Chinese: mark P's that are sister of IP.
 boolean markVPadjunct
          Chinese: mark phrases that are adjuncts of VP (these tend to be locatives/temporals, and have a specific distribution).
 boolean markVVsisterIP
          Chinese: mark VVs that are sister of IP (communication & small-clause-taking verbs).
 boolean mergeNNVV
          Chinese: merge NN and VV.
 boolean paRootDtr
          Chinese: parent annotate daughter of root.
 int penaltyType
          penaltyType should be set as follows: 0 = no length penalty; 1 = quadratic length penalty; 2 = penalty for continuation chars only. TODO: make this an enum.
 boolean segment
           
 String segmenterClass
           
 boolean segmentMarkov
           
 boolean splitBaseNP
          Mark base NPs.
 boolean splitNPTMP
          Whether to retain the -TMP functional tag on various phrasal categories.
 boolean splitPPTMP
           
 boolean splitXPTMP
           
 boolean sunJurafskyHeadFinder
           
 boolean tagWordSize
          Annotate tags for number of characters contained.
 boolean unaryCP
           
 boolean unaryIP
          Chinese: unary category marking
 boolean useCharacterBasedLexicon
           
 boolean useCharBasedUnknownWordModel
           
 boolean useGoodTuringUnknownWordModel
           
 boolean useMaxentDepGrammar
           
 boolean useMaxentLexicon
           
 boolean useSimilarWordMap
           
 boolean useUnknownCharacterModel
           
 
Fields inherited from class edu.stanford.nlp.parser.lexparser.AbstractTreebankParserParams
evalGF, inputEncoding, outputEncoding, tlp
 
Constructor Summary
ChineseTreebankParserParams()
           
 
Method Summary
 TreeTransformer collinizer()
          Returns a ChineseCollinizer
 TreeTransformer collinizerEvalb()
          Returns a ChineseCollinizer that doesn't delete punctuation
 ArrayList<Word> defaultTestSentence()
          Return a default sentence for the language (for testing)
 Extractor<DependencyGrammar> dependencyGrammarExtractor(Options op, Index<String> wordIndex, Index<String> tagIndex)
           
 DiskTreebank diskTreebank()
          Uses a DiskTreebank with a CHTBTokenizer and a BobChrisTreeNormalizer.
 void display()
          Display language-specific settings.
 HeadFinder headFinder()
          Returns a ChineseHeadFinder
 Lexicon lex(Options op, Index<String> wordIndex, Index<String> tagIndex)
          Returns a ChineseLexicon
static void main(String[] args)
          For testing: loads a treebank and prints the trees.
 MemoryTreebank memoryTreebank()
          Uses a MemoryTreebank with a CHTBTokenizer and a BobChrisTreeNormalizer
 double[] MLEDependencyGrammarSmoothingParams()
          Give the parameters for smoothing in the MLEDependencyGrammar.
 int setOptionFlag(String[] args, int i)
          Set language-specific options according to flags.
 String[] sisterSplitters()
          Returns the splitting strings used for selective splits.
 Tree transformTree(Tree t, Tree root)
          transformTree does all language-specific tree transformations.
 TreeReaderFactory treeReaderFactory()
          Returns a factory for reading in trees from the source you want.
 HeadFinder typedDependencyHeadFinder()
          The HeadFinder to use when extracting typed dependencies.
 
Methods inherited from class edu.stanford.nlp.parser.lexparser.AbstractTreebankParserParams
dependencyObjectify, getInputEncoding, getOutputEncoding, isEvalGF, lex, parsevalObjectify, parsevalObjectify, ppAttachmentEval, processHeadWord, pw, pw, setEvalGF, setEvaluateGrammaticalFunctions, setInputEncoding, setOutputEncoding, setupForEval, subcategoryStripper, testMemoryTreebank, treebank, treebankLanguagePack, treeTokenizerFactory, typedDependencyClasser, typedDependencyObjectify, unorderedTypedDependencyObjectify, unorderedUntypedDependencyObjectify, untypedDependencyObjectify
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

charTags

public boolean charTags

useCharacterBasedLexicon

public boolean useCharacterBasedLexicon

useMaxentLexicon

public boolean useMaxentLexicon

useMaxentDepGrammar

public boolean useMaxentDepGrammar

segment

public boolean segment

segmentMarkov

public boolean segmentMarkov

sunJurafskyHeadFinder

public boolean sunJurafskyHeadFinder

bikelHeadFinder

public boolean bikelHeadFinder

discardFrags

public boolean discardFrags

useSimilarWordMap

public boolean useSimilarWordMap

segmenterClass

public String segmenterClass

chineseSplitDouHao

public boolean chineseSplitDouHao
Chinese: Split the dou hao (a punctuation mark separating members of a list) from other punctuation. Good, but subsumed by chineseSplitPunct below.


chineseSplitPunct

public boolean chineseSplitPunct
Chinese: split Chinese punctuation several ways, along the lines of English punctuation plus another category for the dou hao. Good.


chineseSplitPunctLR

public boolean chineseSplitPunctLR
Chinese: split left/right paren and quote punctuation (if chineseSplitPunct is also true). Only very marginal gains, but seems positive.


markVVsisterIP

public boolean markVVsisterIP
Chinese: mark VVs that are sister of IP (communication & small-clause-taking verbs). Good: gives 0.5%.


markPsisterIP

public boolean markPsisterIP
Chinese: mark P's that are sister of IP. Negative effect


markIPsisterVVorP

public boolean markIPsisterVVorP
Chinese: mark IP's that are sister of VV or P. These rarely have punctuation. Small positive effect.


markADgrandchildOfIP

public boolean markADgrandchildOfIP
Chinese: mark ADs that are grandchild of IP.


gpaAD

public boolean gpaAD
Grandparent annotate all AD. Seems slightly negative.


chineseVerySelectiveTagPA

public boolean chineseVerySelectiveTagPA

chineseSelectiveTagPA

public boolean chineseSelectiveTagPA

markIPsisterBA

public boolean markIPsisterBA
Chinese: mark IPs that are sister of BA. These always have overt NP. Very slightly positive.


markVPadjunct

public boolean markVPadjunct
Chinese: mark phrases that are adjuncts of VP (these tend to be locatives/temporals, and have a specific distribution). Necessary even with chineseSplitVP==3 and parent annotation because parent annotation happens with unsplit parent categories. Slightly positive.


markNPmodNP

public boolean markNPmodNP
Chinese: mark NP modifiers of NPs. Quite positive (0.5%)


markModifiedNP

public boolean markModifiedNP
Chinese: mark left-modified NPs (rightmost NPs with a left-side mod). Slightly positive.


markNPconj

public boolean markNPconj
Chinese: mark NPs that are conjuncts. Negative on small set.


markMultiNtag

public boolean markMultiNtag
Chinese: mark nominal tags that are part of multi-nominal rewrites. Doesn't seem any good.


markIPsisDEC

public boolean markIPsisDEC
Chinese: mark IPs that are part of prenominal modifiers. Negative.


markIPconj

public boolean markIPconj
Chinese: mark IPs that are conjuncts, or those that have adjuncts or subjects.


markIPadjsubj

public boolean markIPadjsubj

chineseSplitVP

public int chineseSplitVP
Chinese VP splitting.
0 = none;
1 = mark with -BA a VP that directly dominates a BA;
2 = mark with -BA a VP that directly dominates either a BA or a VP that directly dominates a BA;
3 = split VPs into VP-COMP, VP-CRD, VP-ADJ.
(Negative value.)
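
As a rough illustration of settings 0 to 2, the sketch below computes the label a VP node would receive. It is not the parser's implementation: the Node class, the hasChild helper, and the annotate method are hypothetical, and the reading of setting 2 (a BA one level further down, under a VP daughter) is an assumption. Setting 3 is not covered.

```java
import java.util.Arrays;
import java.util.List;

class SplitVpSketch {
    // Minimal hypothetical tree node; not edu.stanford.nlp.trees.Tree.
    static final class Node {
        final String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }
    }

    // True if n has a direct child with the given label.
    static boolean hasChild(Node n, String label) {
        for (Node c : n.children) {
            if (c.label.equals(label)) {
                return true;
            }
        }
        return false;
    }

    // Returns the label of n under split level 0, 1, or 2 (assumed semantics).
    static String annotate(Node n, int level) {
        if (level < 0 || level > 2) {
            throw new IllegalArgumentException("Only levels 0-2 are sketched");
        }
        if (level == 0 || !n.label.equals("VP")) {
            return n.label;
        }
        boolean ba = hasChild(n, "BA");
        if (!ba && level == 2) {
            // Level 2 also looks for a BA directly under a VP daughter.
            for (Node c : n.children) {
                if (c.label.equals("VP") && hasChild(c, "BA")) {
                    ba = true;
                    break;
                }
            }
        }
        return ba ? n.label + "-BA" : n.label;
    }

    public static void main(String[] args) {
        Node inner = new Node("VP", new Node("BA"), new Node("IP"));
        Node outer = new Node("VP", new Node("ADVP"), inner);
        System.out.println(annotate(inner, 1)); // VP-BA
        System.out.println(annotate(outer, 1)); // VP
        System.out.println(annotate(outer, 2)); // VP-BA
    }
}
```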


mergeNNVV

public boolean mergeNNVV
Chinese: merge NN and VV. A lark.


unaryIP

public boolean unaryIP
Chinese: unary category marking


unaryCP

public boolean unaryCP

paRootDtr

public boolean paRootDtr
Chinese: parent annotate daughter of root. Meant only for selectivesplit=false.


markPostverbalP

public boolean markPostverbalP
Chinese: mark P with a left aunt VV, and PP with a left sister VV. Note that it's necessary to mark both to thread the context-marking. Used to identify post-verbal P's, which are rare.


markPostverbalPP

public boolean markPostverbalPP

splitBaseNP

public boolean splitBaseNP
Mark base NPs. Good.


tagWordSize

public boolean tagWordSize
Annotate tags for number of characters contained.
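
A minimal sketch of what such an annotation might look like follows. The class TagWordSizeSketch and the "-<n>C" suffix are assumptions made for illustration only; the format the parser actually uses is not stated here.

```java
class TagWordSizeSketch {
    // Appends the word's character count to its tag. The "-<n>C" suffix is an
    // illustrative assumption, not the annotation format the parser uses.
    static String annotate(String tag, String word) {
        // Count code points rather than UTF-16 units so that characters
        // outside the Basic Multilingual Plane count once.
        int chars = word.codePointCount(0, word.length());
        return tag + "-" + chars + "C";
    }

    public static void main(String[] args) {
        // A two-character word, written with Unicode escapes.
        System.out.println(annotate("NN", "\u4e2d\u56fd")); // NN-2C
    }
}
```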


markCC

public boolean markCC
Mark phrases which are conjunctions. Appears negative, even with 200K words training data.


splitNPTMP

public boolean splitNPTMP
Whether to retain the -TMP functional tag on various phrasal categories. With 80K words of training data, minutely helpful; with 200K words, the best option gives 0.6%. Doing splitNPTMP and splitPPTMP (but not splitXPTMP) is best.


splitPPTMP

public boolean splitPPTMP

splitXPTMP

public boolean splitXPTMP

dominatesV

public boolean dominatesV
Verbal distance -- mark whether symbol dominates a verb (V*). Seems bad for Chinese.


DEFAULT_USE_GOOD_TURNING_UNKNOWN_WORD_MODEL

public static final boolean DEFAULT_USE_GOOD_TURNING_UNKNOWN_WORD_MODEL
Parameters specific for creating a ChineseLexicon

See Also:
Constant Field Values

useGoodTuringUnknownWordModel

public boolean useGoodTuringUnknownWordModel

useCharBasedUnknownWordModel

public boolean useCharBasedUnknownWordModel

lengthPenalty

public double lengthPenalty
Parameters for a ChineseCharacterBasedLexicon


useUnknownCharacterModel

public boolean useUnknownCharacterModel

penaltyType

public int penaltyType
penaltyType should be set as follows:
0: no length penalty
1: quadratic length penalty
2: penalty for continuation chars only
TODO: make this an enum
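
One way the enum mentioned in the TODO could look is sketched below. The names PenaltyTypeSketch, PenaltyType, and fromCode are hypothetical; the class itself stores penaltyType as a plain int, and only the three codes and their meanings come from the description above.

```java
class PenaltyTypeSketch {
    // Hypothetical enum; the real field is a plain int.
    enum PenaltyType {
        NONE(0),                    // 0: no length penalty
        QUADRATIC(1),               // 1: quadratic length penalty
        CONTINUATION_CHARS_ONLY(2); // 2: penalty for continuation chars only

        final int code;

        PenaltyType(int code) {
            this.code = code;
        }

        // Maps a documented int code to its enum constant.
        static PenaltyType fromCode(int code) {
            for (PenaltyType t : values()) {
                if (t.code == code) {
                    return t;
                }
            }
            throw new IllegalArgumentException("Unknown penaltyType: " + code);
        }
    }

    public static void main(String[] args) {
        System.out.println(PenaltyType.fromCode(1)); // QUADRATIC
    }
}
```

An enum of this kind rejects out-of-range codes at the point of conversion instead of silently carrying an undefined int through the lexicon.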

Constructor Detail

ChineseTreebankParserParams

public ChineseTreebankParserParams()
Method Detail

headFinder

public HeadFinder headFinder()
Returns a ChineseHeadFinder

Specified by:
headFinder in interface TreebankLangParserParams
Specified by:
headFinder in class AbstractTreebankParserParams

typedDependencyHeadFinder

public HeadFinder typedDependencyHeadFinder()
Description copied from class: AbstractTreebankParserParams
The HeadFinder to use when extracting typed dependencies.

Specified by:
typedDependencyHeadFinder in interface TreebankLangParserParams
Specified by:
typedDependencyHeadFinder in class AbstractTreebankParserParams

lex

public Lexicon lex(Options op,
                   Index<String> wordIndex,
                   Index<String> tagIndex)
Returns a ChineseLexicon

Specified by:
lex in interface TreebankLangParserParams
Overrides:
lex in class AbstractTreebankParserParams
Parameters:
op - Options as to how the Lexicon behaves
Returns:
A Lexicon, constructed based on the given option

MLEDependencyGrammarSmoothingParams

public double[] MLEDependencyGrammarSmoothingParams()
Description copied from class: AbstractTreebankParserParams
Give the parameters for smoothing in the MLEDependencyGrammar. Defaults are the ones previously hard coded into MLEDependencyGrammar.

Specified by:
MLEDependencyGrammarSmoothingParams in interface TreebankLangParserParams
Overrides:
MLEDependencyGrammarSmoothingParams in class AbstractTreebankParserParams
Returns:
an array of doubles with smooth_aT_hTWd, smooth_aTW_hTWd, smooth_stop, and interp

treeReaderFactory

public TreeReaderFactory treeReaderFactory()
Description copied from interface: TreebankLangParserParams
Returns a factory for reading in trees from the source you want. It is the responsibility of the returned factory (trf) to deal properly with the character-set encoding of the input and to normalize trees properly.

Returns:
A factory that vends an appropriate TreeReader

diskTreebank

public DiskTreebank diskTreebank()
Uses a DiskTreebank with a CHTBTokenizer and a BobChrisTreeNormalizer.

Specified by:
diskTreebank in interface TreebankLangParserParams
Specified by:
diskTreebank in class AbstractTreebankParserParams

memoryTreebank

public MemoryTreebank memoryTreebank()
Uses a MemoryTreebank with a CHTBTokenizer and a BobChrisTreeNormalizer

Specified by:
memoryTreebank in interface TreebankLangParserParams
Specified by:
memoryTreebank in class AbstractTreebankParserParams

collinizer

public TreeTransformer collinizer()
Returns a ChineseCollinizer

Specified by:
collinizer in interface TreebankLangParserParams
Specified by:
collinizer in class AbstractTreebankParserParams
Returns:
A TreeTransformer that adjusts trees by deleting or equivalence-classing things that are not evaluated in the parser performance evaluation.

collinizerEvalb

public TreeTransformer collinizerEvalb()
Returns a ChineseCollinizer that doesn't delete punctuation

Specified by:
collinizerEvalb in interface TreebankLangParserParams
Specified by:
collinizerEvalb in class AbstractTreebankParserParams
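
The difference between collinizer() and collinizerEvalb() can be pictured with a toy filter over tagged words. This is only a sketch: CollinizerSketch and keptWords are hypothetical names, the input is a flat list rather than a tree, and treating the tag "PU" as the punctuation tag is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

class CollinizerSketch {
    // Each entry of tagged is a {tag, word} pair. When deletePunct is true,
    // entries tagged "PU" (assumed punctuation tag) are dropped.
    static List<String> keptWords(String[][] tagged, boolean deletePunct) {
        List<String> kept = new ArrayList<>();
        for (String[] pair : tagged) {
            boolean isPunct = pair[0].equals("PU");
            if (deletePunct && isPunct) {
                continue;
            }
            kept.add(pair[1]);
        }
        return kept;
    }

    public static void main(String[] args) {
        String[][] sentence = {
            {"NR", "Beijing"}, {"VV", "huanying"}, {"PN", "ni"}, {"PU", "."}
        };
        // Analogous to collinizer(): punctuation removed before evaluation.
        System.out.println(keptWords(sentence, true));
        // Analogous to collinizerEvalb(): punctuation kept.
        System.out.println(keptWords(sentence, false));
    }
}
```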

sisterSplitters

public String[] sisterSplitters()
Description copied from class: AbstractTreebankParserParams
Returns the splitting strings used for selective splits.

Specified by:
sisterSplitters in interface TreebankLangParserParams
Specified by:
sisterSplitters in class AbstractTreebankParserParams
Returns:
An array containing ancestor-annotated Strings: categories should be split according to these ancestor annotations.

transformTree

public Tree transformTree(Tree t,
                          Tree root)
transformTree does all language-specific tree transformations. Any parameterizations should be inside the specific TreebankLangParserParams class.

Specified by:
transformTree in interface TreebankLangParserParams
Specified by:
transformTree in class AbstractTreebankParserParams
Parameters:
t - The input tree (with non-language specific annotation already done, so you need to strip back to basic categories)
root - The root of the current tree (can be null for words)
Returns:
The fully annotated tree node (with daughters still as you want them in the final result)

display

public void display()
Description copied from class: AbstractTreebankParserParams
Display language-specific settings.

Specified by:
display in interface TreebankLangParserParams
Specified by:
display in class AbstractTreebankParserParams

setOptionFlag

public int setOptionFlag(String[] args,
                         int i)
Set language-specific options according to flags. This routine should process the option starting in args[i] (which might potentially be several arguments long if it takes arguments). It should return the index after the last index it consumed in processing. In particular, if it cannot process the current option, the return value should be i.

Specified by:
setOptionFlag in interface TreebankLangParserParams
Overrides:
setOptionFlag in class AbstractTreebankParserParams
Parameters:
args - Array of command line arguments
i - Index in command line arguments to try to process as an option
Returns:
The index of the item after arguments processed as part of this command line option.
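
The index contract described above can be sketched with a self-contained toy class. OptionFlagSketch and its processAll driver are hypothetical and do not reproduce the real implementation; the flag spellings -markVVsisterIP and -chineseSplitVP are assumed here to mirror the field names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in; not the real ChineseTreebankParserParams.
class OptionFlagSketch {
    boolean markVVsisterIP = false;
    int chineseSplitVP = 0;

    // Processes the option starting at args[i]. Returns the index after the
    // last argument consumed, or i itself if the option is not recognized.
    int setOptionFlag(String[] args, int i) {
        if (args[i].equalsIgnoreCase("-markVVsisterIP")) {
            markVVsisterIP = true;
            return i + 1; // flag with no argument
        }
        if (args[i].equalsIgnoreCase("-chineseSplitVP") && i + 1 < args.length) {
            chineseSplitVP = Integer.parseInt(args[i + 1]);
            return i + 2; // flag plus one argument
        }
        return i; // not handled here
    }

    // Driver loop: an unchanged index signals an unhandled option.
    List<String> processAll(String[] args) {
        List<String> unhandled = new ArrayList<>();
        int i = 0;
        while (i < args.length) {
            int next = setOptionFlag(args, i);
            if (next == i) {
                unhandled.add(args[i]);
                next = i + 1; // skip it so the loop always advances
            }
            i = next;
        }
        return unhandled;
    }

    public static void main(String[] argv) {
        OptionFlagSketch p = new OptionFlagSketch();
        String[] args = {"-chineseSplitVP", "3", "-markVVsisterIP", "-unknownFlag"};
        List<String> unhandled = p.processAll(args);
        // Prints: 3 true [-unknownFlag]
        System.out.println(p.chineseSplitVP + " " + p.markVVsisterIP + " " + unhandled);
    }
}
```

Returning i for an unrecognized option lets a caller hand the same argument to another option processor, or report it, without the callee needing to know who else might handle it.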

dependencyGrammarExtractor

public Extractor<DependencyGrammar> dependencyGrammarExtractor(Options op,
                                                               Index<String> wordIndex,
                                                               Index<String> tagIndex)
Specified by:
dependencyGrammarExtractor in interface TreebankLangParserParams
Overrides:
dependencyGrammarExtractor in class AbstractTreebankParserParams

defaultTestSentence

public ArrayList<Word> defaultTestSentence()
Return a default sentence for the language (for testing)

Returns:
A default sentence of the language

main

public static void main(String[] args)
For testing: loads a treebank and prints the trees.



Stanford NLP Group