edu.stanford.nlp.parser.lexparser
Class ChineseTreebankParserParams

java.lang.Object
  |
  +--edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams
All Implemented Interfaces:
Serializable, TreebankLangParserParams

public class ChineseTreebankParserParams
extends Object
implements TreebankLangParserParams, Serializable

Parameter file for parsing the Penn Chinese Treebank. Includes category enrichments specific to the Penn Chinese Treebank.

Author:
Roger Levy
See Also:
Serialized Form

Field Summary
static boolean chineseSelectiveTagPA
           
static boolean chineseSplitDouHao
          Chinese: Split the dou hao (a punctuation mark separating members of a list) from other punctuation.
static boolean chineseSplitPunct
          Chinese: split Chinese punctuation several ways, along the lines of English punctuation plus another category for the dou hao.
static boolean chineseSplitPunctLR
          Chinese: split left and right parentheses and quotation marks (if chineseSplitPunct is also true).
static boolean chineseSplitVP3
          Chinese: split VPs into VP-COMP, VP-CRD, VP-ADJ.
static boolean chineseVerySelectiveTagPA
           
static boolean gpaAD
          Grandparent annotate all AD.
static boolean markADgrandchildOfIP
          Chinese: mark ADs that are grandchild of IP.
static boolean markIPadjsubj
           
static boolean markIPconj
          Chinese: mark IPs that are conjuncts.
static boolean markIPsisDEC
          Chinese: mark IPs that are part of prenominal modifiers.
static boolean markIPsisterBA
          Chinese: mark IPs that are sister of BA.
static boolean markIPsisterVVorP
          Chinese: mark IP's that are sister of VV or P.
static boolean markModifiedNP
          Chinese: mark left-modified NPs (rightmost NPs with a left-side mod).
static boolean markMultiNtag
          Chinese: mark nominal tags that are part of multi-nominal rewrites.
static boolean markNPconj
          Chinese: mark NPs that are conjuncts.
static boolean markNPmodNP
          Chinese: mark NP modifiers of NPs.
static boolean markPostverbalP
          Chinese: mark P with a left aunt VV, and PP with a left sister VV.
static boolean markPostverbalPP
           
static boolean markPsisterIP
          Chinese: mark P's that are sister of IP.
static boolean markVPadjunct
          Chinese: mark phrases that are adjuncts of VP (these tend to be locatives/temporals, and have a specific distribution).
static boolean markVVsisterIP
          Chinese: mark VVs that are sister of IP (communication & small-clause-taking verbs).
static boolean mergeNNVV
          Chinese: merge NN and VV.
static boolean paRootDtr
          Chinese: parent annotate daughter of root.
static int selectiveSplitLevel
          How selectively to split.
static boolean splitBaseNP
          Mark base NPs.
static boolean tagWordSize
          Annotate tags for number of characters contained.
static boolean unaryCP
           
static boolean unaryIP
          Chinese: unary category marking
 
Constructor Summary
ChineseTreebankParserParams()
           
 
Method Summary
 TreeTransformer collinizer()
          the tree transformer used to produce trees for evaluation.
 TreeTransformer collinizerEvalb()
           
 void display()
          display language-specific settings
 HeadFinder headFinder()
           
 edu.stanford.nlp.parser.lexparser.Lexicon lex()
          reads in trees from the source you want.
 MemoryTreebank memoryTreebank()
          returns a MemoryTreebank appropriate to the treebank source
 PrintWriter pw()
          the PrintWriter used to print output.
 PrintWriter pw(OutputStream o)
          the PrintWriter used to print output.
 void setInputEncoding(String encoding)
           
 int setOptionFlag(String[] args, int i)
          Set language-specific options according to flags.
 void setOutputEncoding(String encoding)
           
 String[] sisterSplitters()
          Returns the splitting strings used for selective splits.
 String[] splitters()
          Returns the splitting strings used for selective splits.
 MemoryTreebank testMemoryTreebank()
          returns a MemoryTreebank appropriate to the testing treebank source
 edu.stanford.nlp.parser.lexparser.TreeHeadPair transformTree(Tree t, Tree root, edu.stanford.nlp.parser.lexparser.TreeHeadPair thp)
          transformTree does all language-specific tree transformations.
 TreebankLanguagePack treebankLanguagePack()
          Returns a ChineseTreebankLanguagePack
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

chineseSplitDouHao

public static boolean chineseSplitDouHao
Chinese: Split the dou hao (a punctuation mark separating members of a list) from other punctuation. Good but included below.


chineseSplitPunct

public static boolean chineseSplitPunct
Chinese: split Chinese punctuation several ways, along the lines of English punctuation plus another category for the dou hao. Good.


chineseSplitPunctLR

public static boolean chineseSplitPunctLR
Chinese: split left and right parentheses and quotation marks (if chineseSplitPunct is also true). Only very marginal gains, but seems positive.


markVVsisterIP

public static boolean markVVsisterIP
Chinese: mark VVs that are sister of IP (communication & small-clause-taking verbs). Good: gives 0.5%.
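
The sister-marking idea behind this flag can be illustrated with a small, self-contained sketch. It does not use the parser's own Tree class: the Node type, the method body, and the "-IPSIS" suffix are all hypothetical stand-ins chosen for illustration, and the annotation the parser actually emits may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SisterMarkSketch {
    /** Minimal labeled tree node; a stand-in for the parser's Tree type. */
    static final class Node {
        String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = new ArrayList<>(Arrays.asList(children));
        }
    }

    /** Appends "-IPSIS" (a hypothetical suffix) to every VV that has an IP sister. */
    static void markVVsisterIP(Node parent) {
        // First pass: does this node have an IP daughter?
        boolean hasIP = false;
        for (Node child : parent.children) {
            if (child.label.equals("IP")) {
                hasIP = true;
            }
        }
        // Second pass: relabel VV daughters if so, then recurse into each daughter.
        for (Node child : parent.children) {
            if (hasIP && child.label.equals("VV")) {
                child.label = "VV-IPSIS";
            }
            markVVsisterIP(child);
        }
    }
}
```

In this sketch a VV under a VP that also dominates an IP is relabeled, while a VV with no IP sister keeps its plain tag.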


markPsisterIP

public static boolean markPsisterIP
Chinese: mark P's that are sister of IP. Negative effect.


markIPsisterVVorP

public static boolean markIPsisterVVorP
Chinese: mark IP's that are sister of VV or P. These rarely have punctuation. Small positive effect.


markADgrandchildOfIP

public static boolean markADgrandchildOfIP
Chinese: mark ADs that are grandchild of IP.


gpaAD

public static boolean gpaAD
Grandparent annotate all AD. Seems slightly negative.


chineseVerySelectiveTagPA

public static boolean chineseVerySelectiveTagPA

chineseSelectiveTagPA

public static boolean chineseSelectiveTagPA

markIPsisterBA

public static boolean markIPsisterBA
Chinese: mark IPs that are sister of BA. These always have overt NP. Very slightly positive.


markVPadjunct

public static boolean markVPadjunct
Chinese: mark phrases that are adjuncts of VP (these tend to be locatives/temporals, and have a specific distribution). Necessary even with chineseSplitVP3 and parent annotation because parent annotation happens with unsplit parent categories. Slightly positive.


markNPmodNP

public static boolean markNPmodNP
Chinese: mark NP modifiers of NPs. Quite positive (0.5%)


markModifiedNP

public static boolean markModifiedNP
Chinese: mark left-modified NPs (rightmost NPs with a left-side mod). Slightly positive.


markNPconj

public static boolean markNPconj
Chinese: mark NPs that are conjuncts. Negative on small set.


markMultiNtag

public static boolean markMultiNtag
Chinese: mark nominal tags that are part of multi-nominal rewrites. Doesn't seem any good.


markIPsisDEC

public static boolean markIPsisDEC
Chinese: mark IPs that are part of prenominal modifiers. Negative.


markIPconj

public static boolean markIPconj
Chinese: mark IPs that are conjuncts. Or those that have adjuncts or subjects (see markIPadjsubj).


markIPadjsubj

public static boolean markIPadjsubj

chineseSplitVP3

public static boolean chineseSplitVP3
Chinese: split VPs into VP-COMP, VP-CRD, VP-ADJ. Negative value.


mergeNNVV

public static boolean mergeNNVV
Chinese: merge NN and VV. A lark.


unaryIP

public static boolean unaryIP
Chinese: unary category marking


unaryCP

public static boolean unaryCP

paRootDtr

public static boolean paRootDtr
Chinese: parent annotate daughter of root. Meant only for selectivesplit=false.


markPostverbalP

public static boolean markPostverbalP
Chinese: mark P with a left aunt VV, and PP with a left sister VV. Note that it's necessary to mark both to thread the context-marking. Used to identify post-verbal P's, which are rare.


markPostverbalPP

public static boolean markPostverbalPP

selectiveSplitLevel

public static int selectiveSplitLevel
How selectively to split.


splitBaseNP

public static boolean splitBaseNP
Mark base NPs. Good.


tagWordSize

public static boolean tagWordSize
Annotate tags for number of characters contained.
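
As a rough illustration of what annotating a tag with its word's character count could look like, here is a minimal sketch. The class name and the "-<n>C" suffix format are assumptions made for this example only; the format the parser actually uses is not documented here.

```java
final class TagWordSizeSketch {
    /**
     * Returns the tag with the word's character count appended.
     * The "-<n>C" suffix format is hypothetical.
     */
    static String annotate(String tag, String word) {
        // Count code points rather than UTF-16 units so that characters
        // outside the Basic Multilingual Plane count once each.
        int chars = word.codePointCount(0, word.length());
        return tag + "-" + chars + "C";
    }
}
```

Under these assumptions, a two-character noun tagged NN would come out as NN-2C.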

Constructor Detail

ChineseTreebankParserParams

public ChineseTreebankParserParams()
Method Detail

setInputEncoding

public void setInputEncoding(String encoding)
Specified by:
setInputEncoding in interface TreebankLangParserParams

setOutputEncoding

public void setOutputEncoding(String encoding)
Specified by:
setOutputEncoding in interface TreebankLangParserParams

headFinder

public HeadFinder headFinder()
Specified by:
headFinder in interface TreebankLangParserParams

lex

public edu.stanford.nlp.parser.lexparser.Lexicon lex()
Description copied from interface: TreebankLangParserParams
reads in trees from the source you want. It's the responsibility of tr to deal properly with character-set encoding of the input. It also is the responsibility of tr to properly normalize trees.

Specified by:
lex in interface TreebankLangParserParams

memoryTreebank

public MemoryTreebank memoryTreebank()
Description copied from interface: TreebankLangParserParams
returns a MemoryTreebank appropriate to the treebank source

Specified by:
memoryTreebank in interface TreebankLangParserParams

testMemoryTreebank

public MemoryTreebank testMemoryTreebank()
returns a MemoryTreebank appropriate to the testing treebank source

Specified by:
testMemoryTreebank in interface TreebankLangParserParams

collinizer

public TreeTransformer collinizer()
the tree transformer used to produce trees for evaluation. Will be applied both to the parse output tree and to the gold tree.

Specified by:
collinizer in interface TreebankLangParserParams

collinizerEvalb

public TreeTransformer collinizerEvalb()
Specified by:
collinizerEvalb in interface TreebankLangParserParams

treebankLanguagePack

public TreebankLanguagePack treebankLanguagePack()
Returns a ChineseTreebankLanguagePack

Specified by:
treebankLanguagePack in interface TreebankLangParserParams

pw

public PrintWriter pw()
the PrintWriter used to print output. It's the responsibility of pw to deal properly with character encodings for the relevant treebank.

Specified by:
pw in interface TreebankLangParserParams

pw

public PrintWriter pw(OutputStream o)
the PrintWriter used to print output. It's the responsibility of pw to deal properly with character encodings for the relevant treebank.

Specified by:
pw in interface TreebankLangParserParams
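
One way a pw(OutputStream) implementation might honor a configured output encoding is sketched below using only java.io and java.nio classes. The class name and the explicit encoding parameter are assumptions for illustration; the real class presumably uses whatever was passed to setOutputEncoding, and its default encoding is not stated here.

```java
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;

final class EncodedWriterSketch {
    /**
     * Wraps the stream in a PrintWriter that encodes characters with the
     * named charset and auto-flushes on println.
     */
    static PrintWriter pw(OutputStream o, String encoding) {
        // Charset.forName throws an unchecked exception for unknown names.
        Charset charset = Charset.forName(encoding);
        return new PrintWriter(new OutputStreamWriter(o, charset), true);
    }
}
```

A caller would pass the same charset name that was used to read the treebank (for example "UTF-8"), so that Chinese characters round-trip unchanged.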

splitters

public String[] splitters()
Description copied from interface: TreebankLangParserParams
Returns the splitting strings used for selective splits.

Specified by:
splitters in interface TreebankLangParserParams
Returns:
An array containing ancestor-annotated Strings: categories should be split according to these ancestor annotations.
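
To make the role of the returned array concrete, the following sketch shows how a list of ancestor-annotated strings could drive selective splitting. The class, the "^" separator, and the example categories are assumptions for illustration, not taken from the parser's source.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class SelectiveSplitSketch {
    private final Set<String> splitters;

    SelectiveSplitSketch(String[] splitters) {
        this.splitters = new HashSet<>(Arrays.asList(splitters));
    }

    /**
     * Returns the parent-annotated category if it is listed among the
     * splitters, otherwise the bare category. The "^" separator is assumed.
     */
    String annotate(String cat, String parent) {
        String annotated = cat + "^" + parent;
        return splitters.contains(annotated) ? annotated : cat;
    }
}
```

With a splitter list containing only "NP^VP", an NP under VP would be split off as NP^VP while an NP under IP would stay a plain NP.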

sisterSplitters

public String[] sisterSplitters()
Description copied from interface: TreebankLangParserParams
Returns the splitting strings used for selective splits.

Specified by:
sisterSplitters in interface TreebankLangParserParams
Returns:
An array containing ancestor-annotated Strings: categories should be split according to these ancestor annotations.

transformTree

public edu.stanford.nlp.parser.lexparser.TreeHeadPair transformTree(Tree t,
                                                                    Tree root,
                                                                    edu.stanford.nlp.parser.lexparser.TreeHeadPair thp)
transformTree does all language-specific tree transformations. Any parameterizations should be inside the specific TreebankLangParserParams class.

Specified by:
transformTree in interface TreebankLangParserParams

display

public void display()
Description copied from interface: TreebankLangParserParams
display language-specific settings

Specified by:
display in interface TreebankLangParserParams

setOptionFlag

public int setOptionFlag(String[] args,
                         int i)
Set language-specific options according to flags. This routine should process the option starting in args[i] (which might potentially be several arguments long if it takes arguments). It should return the index after the last index it consumed in processing. In particular, if it cannot process the current option, the return value should be i.

Specified by:
setOptionFlag in interface TreebankLangParserParams
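
The return-value contract described above can be sketched with a stand-in class. The flag names "-splitPunct" and "-splitLevel", the fields, and the parseAll driver are hypothetical and exist only to show how a caller might rely on the returned index; they are not the options this class actually recognizes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in that follows the setOptionFlag contract. */
final class OptionFlagSketch {
    boolean splitPunct = false; // illustrative flag with no argument
    int splitLevel = 0;         // illustrative flag that takes one argument

    /** Returns the index after the last consumed argument, or i if args[i] is not recognized. */
    int setOptionFlag(String[] args, int i) {
        if (args[i].equals("-splitPunct")) {
            splitPunct = true;
            return i + 1; // consumed the flag only
        }
        if (args[i].equals("-splitLevel") && i + 1 < args.length) {
            splitLevel = Integer.parseInt(args[i + 1]);
            return i + 2; // consumed the flag and its value
        }
        return i; // not recognized: consume nothing
    }

    /** Driver loop: applies known options and collects the ones left unconsumed. */
    List<String> parseAll(String[] args) {
        List<String> unhandled = new ArrayList<>();
        int i = 0;
        while (i < args.length) {
            int next = setOptionFlag(args, i);
            if (next == i) {
                // The handler did not advance, so record the argument and skip it.
                unhandled.add(args[i]);
                next = i + 1;
            }
            i = next;
        }
        return unhandled;
    }
}
```

Returning i unchanged for an unknown option lets the caller detect that nothing was consumed and hand the argument to another handler instead of looping forever.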


Stanford NLP Group