edu.stanford.nlp.parser.lexparser
Class EnglishTreebankParserParams.EnglishTrain
java.lang.Object
edu.stanford.nlp.parser.lexparser.EnglishTreebankParserParams.EnglishTrain
- Enclosing class:
- EnglishTreebankParserParams
public static class EnglishTreebankParserParams.EnglishTrain
- extends Object
Field Summary |
static boolean |
correctTags
'Correct' tags to produce verbs in VPs, etc., where possible. |
static boolean |
dominatesC
Verbal distance -- mark whether symbol dominates a conjunction (CC) |
static boolean |
dominatesI
Verbal distance -- mark whether symbol dominates a preposition (IN) |
static int |
dominatesV
Verbal distance -- mark whether symbol dominates a verb (V*, MD). |
static boolean |
gpaRootVP
Grand-parent annotate (root mark) VP below ROOT. |
static boolean |
joinJJ
Join comparative and superlative adjectives with positive. |
static boolean |
joinNounTags
Join proper nouns with common nouns. |
static boolean |
joinPound
Join pound with dollar. |
static int |
makePPTOintoIN
Change TO inside PP to IN. |
static int |
markCC
Mark phrases which are conjunctions. |
static boolean |
markContainedVP
|
static int |
markDitransV
Attempt to record ditransitive verbs. |
static boolean |
markReflexivePRP
Mark reflexive PRP words. |
static boolean |
rightPhrasal
Right edge has a phrasal node. |
static int |
sisterSplitLevel
Set the support * KL cutoff level (1-4) for sister splitting
-- don't use it, as far as we can tell so far |
static int |
splitAux
Make special tags for forms of BE and HAVE (and maybe DO/HELP, etc.). |
static int |
splitBaseNP
Mark base NPs. |
static int |
splitCC
Provide annotation of conjunctions. |
static int |
splitIN
Annotate prepositions into subcategories. |
static boolean |
splitJJCOMP
Put a special tag on 'adjectives with complements'. |
static boolean |
splitMoreLess
Specially mark the comparative/superlative words: less, least,
more, most |
static int |
splitNNP
Mark NNP words as to position in phrase (single, left, right, inside)
or subcategorizes NNP(S) as initials or initial/final in NP. |
static boolean |
splitNOT
Annotates forms of "not" specially as tag "NOT". |
static int |
splitNPADV
Retain NP-ADV annotation. |
static int |
splitNPNNP
Mark NP-NNP. |
static int |
splitNPpercent
Mark phrases that are headed by %. |
static boolean |
splitNPPRP
|
static boolean |
splitNumNP
Mark "numeric NPs". |
static boolean |
splitPercent
Mark the nouns that are percent signs. |
static int |
splitPoss
Give a special tag to NPs which are possessive NPs (end in 's). |
static boolean |
splitPPJJ
A special test for "such" mainly ("such as Fred"). |
static boolean |
splitQuotes
Mark quote marks for single vs. double so that mismatched ones are not produced. |
static boolean |
splitRB
Split modifier (NP, AdjP) adverbs from others. |
static int |
splitSbar
Split SBAR nodes. |
static boolean |
splitSFP
Separate out sentence final punct. |
static int |
splitSGapped
Mark specially S nodes with "gapped" subject (control, raising). |
static int |
splitSTag
Mark S/SINV/SQ nodes according to verbal tag. |
static int |
splitTMP
Retain NP-TMP (or maybe PP-TMP) annotation. |
static boolean |
splitTRJJ
Put a special tag on 'transitive adjectives' with NP complement, like
'due May 15' -- it also catches 'such' in 'such as NP', which may
be good. |
static int |
splitVP
Add (head) tags to VPs. |
static boolean |
splitVPNPAgr
Put enough marking on VP and NP to permit "agreement". |
static boolean |
tagRBGPA
Grand parent annotate RB to try to distinguish sentential ones and
ones in places like NP post modifier (things like 'very' are already
distinguished as their parent is ADJP). |
static boolean |
unaryDT
Mark "Intransitive" DT. |
static boolean |
unaryIN
Mark "Intransitive" IN. |
static boolean |
unaryPRP
"Intransitive" PRP. |
static boolean |
unaryRB
Mark "Intransitive" RB. |
static boolean |
vpSubCat
Pitiful attempt at marking V* preterms with their surface subcat
frames. |
Method Summary |
static void |
display()
|
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail
splitIN
public static int splitIN
- Annotate prepositions into subcategories. Values:
0 = no annotation
1 = IN with a ^S.* parent (putative subordinating
conjunctions) marked differently from others (real prepositions). OK.
2 = Annotate IN prepositions 3 ways: ^S.* parent, ^N.* parent or rest
(generally predicative ADJP, VP). Better than sIN=1. Good.
3 = Annotate prepositions 6 ways: real feature engineering. Great.
4 = Refinement of 3: allows -SC under SINV, WHADVP for -T and no -SCC
if the parent is an NP.
5 = Like 4 but maps TO to IN in a "nominal" (N*, P*, A*) context.
6 = 4, but mark V/A complement and leave noun ones unmarked instead.
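The two simplest settings above can be illustrated with a small standalone sketch. It does not call the parser; the class name, method name, and the "-N" suffix are hypothetical, and "-SC" is borrowed from the description of option 4 on the assumption that it marks subordinating conjunctions. Levels 3 to 6 are not sketched because their rules are not spelled out here.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of splitIN levels 0-2; not the parser's own code.
class SplitInSketch {
    private static final Pattern S_PARENT = Pattern.compile("^S.*");
    private static final Pattern N_PARENT = Pattern.compile("^N.*");

    /** Returns the tag for an IN preterminal given the category of its parent. */
    static String annotateIN(int level, String parentCategory) {
        if (level < 0 || level > 2) {
            throw new IllegalArgumentException("only levels 0-2 are sketched");
        }
        if (level == 0) {
            return "IN"; // no annotation
        }
        if (S_PARENT.matcher(parentCategory).matches()) {
            return "IN-SC"; // putative subordinating conjunction (suffix assumed)
        }
        if (level == 2 && N_PARENT.matcher(parentCategory).matches()) {
            return "IN-N"; // preposition under a nominal parent (suffix assumed)
        }
        return "IN"; // the remaining cases stay unmarked
    }

    public static void main(String[] args) {
        System.out.println(annotateIN(1, "SBAR"));
        System.out.println(annotateIN(2, "NP"));
        System.out.println(annotateIN(2, "VP"));
    }
}
```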
splitQuotes
public static boolean splitQuotes
- Mark quote marks for single vs. double so that mismatched ones are not produced.
splitSFP
public static boolean splitSFP
- Separate out sentence final punct. (. ! ?). Doesn't help.
splitPercent
public static boolean splitPercent
- Mark the nouns that are percent signs. Slightly good.
splitNPpercent
public static int splitNPpercent
- Mark phrases that are headed by %.
A value of 0 = do nothing, 1 = only NP, 2 = NP and ADJP,
3 = NP, ADJP and QP, 4 = any phrase.
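As a reading aid, the mapping from value to affected phrase categories can be written as a small predicate. This is a sketch only: the class and method names are hypothetical, and treating any value outside 1 to 4 as "do nothing" is an assumption.

```java
import java.util.Arrays;

// Hypothetical illustration of which phrase categories each splitNPpercent value covers.
class SplitNpPercentSketch {
    /** True if a phrase of this category headed by % would be marked at the given level. */
    static boolean marksCategory(int level, String category) {
        switch (level) {
            case 1:
                return category.equals("NP");
            case 2:
                return Arrays.asList("NP", "ADJP").contains(category);
            case 3:
                return Arrays.asList("NP", "ADJP", "QP").contains(category);
            case 4:
                return true; // any phrase
            default:
                return false; // 0 (and, by assumption, any other value) does nothing
        }
    }

    public static void main(String[] args) {
        System.out.println(marksCategory(2, "ADJP"));
        System.out.println(marksCategory(2, "QP"));
    }
}
```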
tagRBGPA
public static boolean tagRBGPA
- Grand parent annotate RB to try to distinguish sentential ones and
ones in places like NP post modifier (things like 'very' are already
distinguished as their parent is ADJP).
splitNNP
public static int splitNNP
- Mark NNP words as to position in phrase (single, left, right, inside)
or subcategorizes NNP(S) as initials or initial/final in NP.
joinPound
public static boolean joinPound
- Join pound with dollar.
joinJJ
public static boolean joinJJ
- Join comparative and superlative adjectives with positive.
joinNounTags
public static boolean joinNounTags
- Join proper nouns with common nouns. This isn't to improve
performance, but because Genia doesn't use proper noun tags in
general.
splitPPJJ
public static boolean splitPPJJ
- A special test for "such" mainly ("such as Fred"). A wash, so omit.
splitTRJJ
public static boolean splitTRJJ
- Put a special tag on 'transitive adjectives' with NP complement, like
'due May 15' -- it also catches 'such' in 'such as NP', which may
be good. Matches 658 times in 2-21 training corpus. Wash.
splitJJCOMP
public static boolean splitJJCOMP
- Put a special tag on 'adjectives with complements'. This acts as a
general subcat feature for adjectives.
splitMoreLess
public static boolean splitMoreLess
- Specially mark the comparative/superlative words: less, least,
more, most
unaryDT
public static boolean unaryDT
- Mark "Intransitive" DT. Good.
unaryRB
public static boolean unaryRB
- Mark "Intransitive" RB. Good.
unaryPRP
public static boolean unaryPRP
- "Intransitive" PRP. Wash -- basically a no-op really.
markReflexivePRP
public static boolean markReflexivePRP
- Mark reflexive PRP words.
unaryIN
public static boolean unaryIN
- Mark "Intransitive" IN. Minutely negative.
splitCC
public static int splitCC
- Provide annotation of conjunctions. Gives modest gains (the numbers
show the F1 increase with respect to goodPCFG in June 2005). A value of
1 annotates both "and" and "or" as "CC-C" (+0.29%),
2 annotates "but" and "&" separately (+0.17%),
3 annotates just "and" (equalsIgnoreCase) (+0.11%),
0 annotates nothing (+0.00%).
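The four values can be pictured as a small tag-rewriting function. This is a sketch under assumptions: only the "CC-C" tag for value 1 is stated above, so the "CC-B" and "CC-A" tags for value 2 and the reuse of "CC-C" for value 3 are hypothetical, as is the exact-case match for values 1 and 2.

```java
// Hypothetical illustration of the splitCC values; tag names other than CC-C are assumed.
class SplitCcSketch {
    /** Returns the tag for a CC preterminal over the given word. */
    static String annotateCC(int level, String word) {
        switch (level) {
            case 1:
                if (word.equals("and") || word.equals("or")) {
                    return "CC-C";
                }
                break;
            case 2:
                if (word.equals("but")) {
                    return "CC-B"; // assumed tag name
                }
                if (word.equals("&")) {
                    return "CC-A"; // assumed tag name
                }
                break;
            case 3:
                if (word.equalsIgnoreCase("and")) {
                    return "CC-C"; // assumed to reuse the value 1 tag
                }
                break;
            default:
                break; // 0 annotates nothing
        }
        return "CC";
    }

    public static void main(String[] args) {
        System.out.println(annotateCC(1, "or"));
        System.out.println(annotateCC(3, "And"));
    }
}
```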
splitNOT
public static boolean splitNOT
- Annotates forms of "not" specially as tag "NOT". BAD
splitRB
public static boolean splitRB
- Split modifier (NP, AdjP) adverbs from others.
This does nothing if you're already doing tagPA.
splitAux
public static int splitAux
- Make special tags for forms of BE and HAVE (and maybe DO/HELP, etc.).
A value of 0 is do nothing.
A value of 1 is the basic form. Positive PCFG effect,
but neutral to negative in Factored, and impossible if you use gPA.
A value of 2 adds in "s" = "'s"
and delves further to disambiguate "'s" as BE or HAVE. Theoretically
good, but no practical gains.
A value of 3 adds DO.
A value of 4 adds HELP (which also takes VB form complement) as DO.
A value of 5 adds LET (which also takes VB form complement) as DO.
A value of 6 adds MAKE (which also takes VB form complement) as DO.
A value of 7 adds WATCH, SEE (which also take VB form complement) as DO.
A value of 8 adds come, go, but not inflections (which colloquially
can take a VB form complement) as DO.
A value of 9 adds GET as BE.
Differences are small. You get about 0.3 F1 by doing something; the best
appear to be 2 or 3 for sentence exact and 7 or 8 for LP/LR F1.
vpSubCat
public static boolean vpSubCat
- Pitiful attempt at marking V* preterms with their surface subcat
frames. Bad so far.
markDitransV
public static int markDitransV
- Attempt to record ditransitive verbs. The value 0 means do nothing;
1 records two or more NP or S* arguments, and 2 means to only record
two or more NP arguments (that aren't NP-TMP).
1 gave neutral to bad results.
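The difference between values 1 and 2 can be sketched as a count over a verb's argument categories. The names below are hypothetical, and the sketch assumes that the NP-TMP exclusion applies only at value 2, as the wording above suggests.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the markDitransV values; not the parser's own code.
class DitransSketch {
    /** True if the argument categories would count as ditransitive at the given level. */
    static boolean isDitransitive(int level, List<String> argumentCategories) {
        if (level != 1 && level != 2) {
            return false; // 0 means do nothing
        }
        int count = 0;
        for (String cat : argumentCategories) {
            if (level == 1) {
                // value 1: NP or S* arguments
                if (cat.startsWith("NP") || cat.startsWith("S")) {
                    count++;
                }
            } else {
                // value 2: NP arguments that are not NP-TMP
                if (cat.startsWith("NP") && !cat.startsWith("NP-TMP")) {
                    count++;
                }
            }
        }
        return count >= 2;
    }

    public static void main(String[] args) {
        System.out.println(isDitransitive(1, Arrays.asList("NP", "SBAR")));
        System.out.println(isDitransitive(2, Arrays.asList("NP", "SBAR")));
    }
}
```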
splitVP
public static int splitVP
- Add (head) tags to VPs. An argument of
0 = no head-subcategorization of VPs,
1 = add head tags (anything, as given by HeadFinder),
2 = add head tags, but collapse finite verb tags (VBP, VBD, VBZ, MD)
together,
3 = only annotate verbal tags, and collapse finite verb tags
(annotation is VBF, TO, VBG, VBN, VB, or zero),
4 = only split on categories of VBF, TO, VBG, VBN, VB, and map
cases that are not headed by a verbal category to an appropriate
category based on word suffix (ing, d, t, s, to) or to VB otherwise.
We usually use a value of 3; 2 or 3 is much better than 0.
See also splitVPNPAgr. If it is true, its effects override
any value set for this parameter.
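Values 0 to 3 can be illustrated with a short standalone function. This is a sketch: the class and method names and the "-" separator between category and head tag are assumptions, and value 4 is left out because the suffix-to-category mapping is not given in detail here.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of splitVP values 0-3; not the parser's own code.
class SplitVpSketch {
    private static final List<String> FINITE = Arrays.asList("VBP", "VBD", "VBZ", "MD");
    private static final List<String> NONFINITE = Arrays.asList("TO", "VBG", "VBN", "VB");

    /** Returns the VP category annotated according to the tag of its head. */
    static String annotateVP(int level, String headTag) {
        switch (level) {
            case 0:
                return "VP"; // no head-subcategorization
            case 1:
                return "VP-" + headTag; // any head tag
            case 2:
                // any head tag, but finite verb tags collapse to VBF
                return "VP-" + (FINITE.contains(headTag) ? "VBF" : headTag);
            case 3:
                if (FINITE.contains(headTag)) {
                    return "VP-VBF";
                }
                if (NONFINITE.contains(headTag)) {
                    return "VP-" + headTag;
                }
                return "VP"; // non-verbal head: "zero" annotation
            default:
                throw new IllegalArgumentException("only values 0-3 are sketched");
        }
    }

    public static void main(String[] args) {
        System.out.println(annotateVP(3, "VBZ"));
        System.out.println(annotateVP(3, "NN"));
    }
}
```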
splitVPNPAgr
public static boolean splitVPNPAgr
- Put enough marking on VP and NP to permit "agreement".
splitSTag
public static int splitSTag
- Mark S/SINV/SQ nodes according to verbal tag. Meanings are:
0 = no subcategorization.
1 = mark with head tag
2 = mark only -VBF if VBZ/VBD/VBP/MD tag
3 = as 2 and mark -VBNF if TO/VBG/VBN/VB
4 = as 2 but only mark S not SINV/SQ
5 = as 3 but only mark S not SINV/SQ
Previously seen as bad. Option 4 might be promising now.
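The six settings can be read as two independent switches (whether non-finite heads are marked, and whether SINV/SQ are included) on top of the head-tag option. A sketch under that reading follows; the class and method names are hypothetical, and the "-" separator used for option 1 is an assumption.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the splitSTag options; not the parser's own code.
class SplitSTagSketch {
    private static final List<String> FINITE = Arrays.asList("VBZ", "VBD", "VBP", "MD");
    private static final List<String> NONFINITE = Arrays.asList("TO", "VBG", "VBN", "VB");

    /** Annotates an S, SINV or SQ category according to the tag of its head. */
    static String annotate(int level, String category, String headTag) {
        if (level < 0 || level > 5) {
            throw new IllegalArgumentException("level must be 0-5");
        }
        if (level == 0) {
            return category; // no subcategorization
        }
        if (level == 1) {
            return category + "-" + headTag; // mark with head tag
        }
        boolean onlyS = (level == 4 || level == 5);
        if (onlyS && !category.equals("S")) {
            return category; // SINV and SQ are left alone
        }
        if (FINITE.contains(headTag)) {
            return category + "-VBF";
        }
        boolean markNonFinite = (level == 3 || level == 5);
        if (markNonFinite && NONFINITE.contains(headTag)) {
            return category + "-VBNF";
        }
        return category;
    }

    public static void main(String[] args) {
        System.out.println(annotate(3, "SINV", "VBG"));
        System.out.println(annotate(4, "S", "MD"));
    }
}
```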
markContainedVP
public static boolean markContainedVP
splitNPPRP
public static boolean splitNPPRP
dominatesV
public static int dominatesV
- Verbal distance -- mark whether symbol dominates a verb (V*, MD).
Very good.
dominatesI
public static boolean dominatesI
- Verbal distance -- mark whether symbol dominates a preposition (IN)
dominatesC
public static boolean dominatesC
- Verbal distance -- mark whether symbol dominates a conjunction (CC)
markCC
public static int markCC
- Mark phrases which are conjunctions.
0 = No marking
1 = Any phrase with a CC daughter that isn't first or last. Possibly marginally positive.
2 = As 1 but also a non-marginal CONJP daughter. In principle good, but no gains.
3 = More like Charniak. Not yet implemented. Need to annotate _before_ annotate children!
np or vp with two or more np/vp children, a comma, cc or conjp, and nothing else.
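Options 1 and 2 amount to scanning the non-marginal daughters of a phrase, that is, every daughter except the first and last. The sketch below reads option 2 as option 1 plus the CONJP check, which is an interpretation; the class and method names are hypothetical, and option 3 is not covered since it is described as not yet implemented.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of markCC options 0-2; not the parser's own code.
class MarkCcSketch {
    /** True if a phrase with these daughter categories would be marked as a conjunction. */
    static boolean isConjunctionPhrase(int level, List<String> daughters) {
        if (level < 0 || level > 2) {
            throw new IllegalArgumentException("only options 0-2 are sketched");
        }
        if (level == 0) {
            return false; // no marking
        }
        // Skip the first and last daughters: marginal conjunctions do not count.
        for (int i = 1; i < daughters.size() - 1; i++) {
            String daughter = daughters.get(i);
            if (daughter.equals("CC")) {
                return true;
            }
            if (level == 2 && daughter.equals("CONJP")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isConjunctionPhrase(1, Arrays.asList("NP", "CC", "NP")));
        System.out.println(isConjunctionPhrase(1, Arrays.asList("CC", "NP", "VP")));
    }
}
```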
splitSGapped
public static int splitSGapped
- Mark specially S nodes with "gapped" subject (control, raising).
1 is the basic version. 2 is a better marking of S nodes with a "gapped" subject.
3 seems best on a small training set, but all of these are too similar;
4 can't be differentiated.
5 is done on the tree before empty splitting. (Bad!?)
splitNumNP
public static boolean splitNumNP
- Mark "numeric NPs". Probably bad?
splitPoss
public static int splitPoss
- Give a special tag to NPs which are possessive NPs (end in 's).
A value of 0 means do nothing, 1 means tagging possessive NPs with
"-P", 2 means restructure possessive NPs so that they introduce a
POSSP node that
takes as children the POS and a regularly structured NP.
I.e., recover standard good linguistic practice circa 1985.
This seems a good idea, but is almost a no-op (modulo fine points of
markovization), since the previous NP-P phrase already uniquely
captured what is now a POSSP.
splitBaseNP
public static int splitBaseNP
- Mark base NPs. A value of 0 = no marking, 1 = marking
baseNP (ones which rewrite just as preterminals), and 2 = doing
Collins-style marking, where an extra NP node is inserted above a
baseNP, if it isn't
already in an NP over NP construction, as in Collins 1999.
This option shouldn't really be in EnglishTrain since it's needed
at parsing time. But we don't currently use it....
A value of 1 is good.
splitTMP
public static int splitTMP
- Retain NP-TMP (or maybe PP-TMP) annotation. Good.
The values for this parameter are defined in
NPTmpRetainingTreeNormalizer.
splitSbar
public static int splitSbar
- Split SBAR nodes.
1 = mark 'in order to' purpose clauses; this is actually a small and
inconsistent part of what is marked SBAR-PRP in the treebank, which
is mainly 'because' reason clauses.
2 = mark all infinitive SBAR.
3 = do 1 and 2.
A value of 1 seems minutely positive; 2 and 3 seem negative.
Also get 'in case Sfin', 'In order to', and on one occasion
'in order that'
splitNPADV
public static int splitNPADV
- Retain NP-ADV annotation. 0 means strip "-ADV" annotation. 1 means to
retain it, and to percolate it down to a head tag providing it can
do it through a path of only NP nodes.
splitNPNNP
public static int splitNPNNP
- Mark NP-NNP. 0 is nothing; 1 is only NNP head, 2 is NNP and NNPS
head; 3 is NNP or NNPS anywhere in local NP. All bad!
correctTags
public static boolean correctTags
- 'Correct' tags to produce verbs in VPs, etc. where possible
rightPhrasal
public static boolean rightPhrasal
- Right edge has a phrasal node. Bad?
sisterSplitLevel
public static int sisterSplitLevel
- Set the support * KL cutoff level (1-4) for sister splitting
-- don't use it, as far as we can tell so far
gpaRootVP
public static boolean gpaRootVP
- Grand-parent annotate (root mark) VP below ROOT. Seems negative.
makePPTOintoIN
public static int makePPTOintoIN
- Change TO inside PP to IN.
Method Detail
display
public static void display()
Stanford NLP Group