edu.stanford.nlp.ie
Class NERFeatureFactory
java.lang.Object
edu.stanford.nlp.sequences.FeatureFactory
edu.stanford.nlp.ie.NERFeatureFactory
- All Implemented Interfaces:
- Serializable
public class NERFeatureFactory extends FeatureFactory
Features for Named Entity Recognition. The code here creates the features
by processing Lists of CoreLabels.
See SeqClassifierFlags for where the flags controlling these feature options are set.
To add a new feature extractor, you should do the following:
- Add a variable (boolean, int, String, etc., as appropriate) to
SeqClassifierFlags to mark whether the new extractor is turned on, or to
hold its value, etc. Add it at the bottom of the list of variables
currently in the class (this avoids breaking older serialized files).
Make the default value of the variable false/null/0 (again, for backwards
compatibility).
- Add a clause to the big if/then/else of setProperties(Properties) in
SeqClassifierFlags. Unless it is a macro option, make the option name
the same as the variable name used in step 1.
- Add code to NERFeatureFactory for this feature. First decide which
classes (hidden states) are involved in the feature. If only the current
class, add the feature extractor to the featuresC code; if both the
current and previous class, then featuresCpC, etc. (A minimal sketch of
these steps appears after this list.)
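Below is a minimal sketch of these three steps for a hypothetical useWordSuffix option (the flag name, option name, and feature string are illustrative only, not part of the actual classes):

  // Step 1: in SeqClassifierFlags, add the variable at the bottom of the
  // field list, with a backwards-compatible default.
  public boolean useWordSuffix = false;  // hypothetical new flag

  // Step 2: in SeqClassifierFlags.setProperties(Properties), extend the
  // if/then/else chain; the option name matches the variable name.
  } else if (key.equalsIgnoreCase("useWordSuffix")) {
    useWordSuffix = Boolean.parseBoolean(val);

  // Step 3: in NERFeatureFactory.featuresC (only the current class is
  // involved), emit the feature when the flag is on.
  if (flags.useWordSuffix) {
    String word = cInfo.get(loc).word();
    featuresC.add(word.substring(Math.max(0, word.length() - 3)) + "-SUFFIX");
  }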
Parameters can be defined using a Properties file (specified on the command line with -prop propFile), or directly on the command line. The following properties are recognized:
Property Name | Type | Default Value | Description |
loadClassifier | String | n/a | Path to serialized classifier to load |
loadAuxClassifier | String | n/a | Path to auxiliary classifier to load. |
serializeTo | String | n/a | Path to serialize classifier to |
trainFile | String | n/a | Path of file to use as training data |
testFile | String | n/a | Path of file to use as test data |
useWord | boolean | true | Gives you feature for w |
useBinnedLength | String | null | If non-null, treat as a sequence of comma separated integer bounds, where items above the previous bound up to the next bound are binned Len-range |
useNGrams | boolean | false | Make features from letter n-grams |
lowercaseNGrams | boolean | false | Make features from letter n-grams only lowercase |
dehyphenateNGrams | boolean | false | Remove hyphens before making features from letter n-grams |
conjoinShapeNGrams | boolean | false | Conjoin word shape and n-gram features |
usePrev | boolean | false | Gives you feature for (pw,c), and together with other options enables other previous features, such as (pt,c) [with useTags] |
useNext | boolean | false | Gives you feature for (nw,c), and together with other options enables other next features, such as (nt,c) [with useTags] |
useTags | boolean | false | Gives you features for (t,c), (pt,c) [if usePrev], (nt,c) [if useNext] |
useWordPairs | boolean | false | Gives you features for (pw, w, c) and (w, nw, c) |
useGazettes | boolean | false | |
wordShape | String | none | Either "none" for no wordShape use, or the name of a word shape function recognized by WordShapeClassifier.lookupShaper(String) |
useSequences | boolean | true | |
usePrevSequences | boolean | false | |
useNextSequences | boolean | false | |
useLongSequences | boolean | false | Use plain higher-order state sequences out to minimum of length or maxLeft |
useBoundarySequences | boolean | false | Use extra second order class sequence features when previous is CoNLL boundary, so entity knows it can span boundary. |
useTaggySequences | boolean | false | Use first, second, and third order class and tag sequence interaction features |
useExtraTaggySequences | boolean | false | Add in sequences of tags with just current class features |
useTaggySequencesShapeInteraction | boolean | false | Add in terms that join sequences of 2 or 3 tags with the current shape |
strictlyFirstOrder | boolean | false | As an override to whatever other options are in effect, deletes all features other than C and CpC clique features when building the classifier |
entitySubclassification | String | "IO" | If set, convert the labeling of classes (but not the background) into one of several alternate encodings (IO, IOB1, IOB2, IOE1, IOE2, SBIEO, with a S(ingle), B(eginning), E(nding), I(nside) 4-way classification for each class). By default, we either do no re-encoding, or the CoNLLDocumentIteratorFactory does a lossy encoding as IO. Note that this is all CoNLL-specific, depends on their way of prefix-encoding classes, and is only implemented by the CoNLLDocumentIteratorFactory. |
useGazettePhrases | boolean | false | |
useSum | boolean | false | |
tolerance | double | 1e-4 | Convergence tolerance in optimization |
printFeatures | String | null | print out the features of the classifier to a file based on this name (suffixed "-1" and "-2") |
useSymTags | boolean | false | Gives you features (pt, t, nt, c), (t, nt, c), (pt, t, c) |
useSymWordPairs | boolean | false | Gives you features (pw, nw, c) |
printClassifier | String | null | Style in which to print the classifier. One of: HighWeight, HighMagnitude, Collection, AllWeights, WeightHistogram |
printClassifierParam | int | 100 | A parameter to the printing style, which may give, for example, the number of parameters to print |
intern | boolean | false | If true, (String) intern read in data and classes and feature (pre-)names such as substring features |
intern2 | boolean | false | If true, intern all (final) feature names (if only current word and ngram features are used, these will already have been interned by intern, and this is an unnecessary no-op) |
cacheNGrams | boolean | false | If true, record the NGram features that correspond to a String (under the current option settings) and reuse rather than recalculating if the String is seen again. |
selfTest | boolean | false | |
sloppyGazette | boolean | false | |
cleanGazette | boolean | false | |
noMidNGrams | boolean | false | Do not include character n-gram features for n-grams that contain neither the beginning nor the end of the word |
maxNGramLeng | int | -1 | If this number is positive, n-grams above this size will not be used in the model |
useReverse | boolean | false | |
retainEntitySubclassification | boolean | false | If true, rather than undoing a recoding of entity tag subtypes (such as BIO variants), just leave them in the output. |
useLemmas | boolean | false | Include the lemma of a word as a feature. |
usePrevNextLemmas | boolean | false | Include the previous/next lemma of a word as a feature. |
useLemmaAsWord | boolean | false | Include the lemma of a word as a feature. |
normalizeTerms | boolean | false | If this is true, some words are normalized: day and month names are lowercased (as for normalizeTimex) and some British spellings are mapped to American English spellings (e.g., -our/-or, etc.). |
normalizeTimex | boolean | false | If this is true, capitalization of day and month names is normalized to lowercase |
useNB | boolean | false | |
useTypeSeqs | boolean | false | Use basic zeroth order word shape features. |
useTypeSeqs2 | boolean | false | Add additional first and second order word shape features |
useTypeSeqs3 | boolean | false | Adds one more first order shape sequence |
useDisjunctive | boolean | false | Include features giving disjunctions of words anywhere in the left or right disjunctionWidth words (preserving direction but not position) |
disjunctionWidth | int | 4 | The number of words on each side of the current word that are included in the disjunction features |
useDisjunctiveShapeInteraction | boolean | false | Include features giving disjunctions of words anywhere in the left or right disjunctionWidth words (preserving direction but not position) interacting with the word shape of the current word |
useWideDisjunctive | boolean | false | Include features giving disjunctions of words anywhere in the left or right wideDisjunctionWidth words (preserving direction but not position) |
wideDisjunctionWidth | int | 4 | The number of words on each side of the current word that are included in the disjunction features |
usePosition | boolean | false | Use combination of position in sentence and class as a feature |
useBeginSent | boolean | false | Use combination of initial position in sentence and class (and word shape) as a feature. (Doesn't seem to help.) |
useDisjShape | boolean | false | Include features giving disjunctions of word shapes anywhere in the left or right disjunctionWidth words (preserving direction but not position) |
useClassFeature | boolean | false | Include a feature for the class (as a class marginal) |
useShapeConjunctions | boolean | false | Conjoin shape with tag or position |
useWordTag | boolean | false | Include word and tag pair features |
useLastRealWord | boolean | false | Iff the prev word is of length 3 or less, add an extra feature that combines the word two back and the current word's shape. Weird! |
useNextRealWord | boolean | false | Iff the next word is of length 3 or less, add an extra feature that combines the word after next and the current word's shape. Weird! |
useTitle | boolean | false | Match a word against a list of name titles (Mr, Mrs, etc.) |
useOccurrencePatterns | boolean | false | This is a very engineered feature designed to capture multiple references to names. If the current word isn't capitalized, followed by a non-capitalized word, and preceded by a word with alphabetic characters, it returns NO-OCCURRENCE-PATTERN. Otherwise, if the previous word is a capitalized NNP, then if in the next 150 words you find this PW-W sequence, you get XY-NEXT-OCCURRENCE-XY, else if you find W you get XY-NEXT-OCCURRENCE-Y. Similarly for backwards and XY-PREV-OCCURRENCE-XY and XY-PREV-OCCURRENCE-Y. Else (if the previous word isn't a capitalized NNP), under analogous rules you get one or more of X-NEXT-OCCURRENCE-YX, X-NEXT-OCCURRENCE-XY, X-NEXT-OCCURRENCE-X, X-PREV-OCCURRENCE-YX, X-PREV-OCCURRENCE-XY, X-PREV-OCCURRENCE-X. |
useTypeySequences | boolean | false | Some first order word shape patterns. |
useGenericFeatures | boolean | false | If true, any features you include in the map will be incorporated into the model with values equal to those given in the file; values are treated as strings unless you use the "realValued" option (described below) |
justify | boolean | false | Print out all feature/class pairs and their weight, and then for each input data point, print justification (weights) for active features |
normalize | boolean | false | For the CMMClassifier (only) if this is true then the Scorer normalizes scores as probabilities. |
useHuber | boolean | false | Use a Huber loss prior rather than the default quadratic loss. |
useQuartic | boolean | false | Use a Quartic prior rather than the default quadratic loss. |
sigma | double | 1.0 | |
epsilon | double | 0.01 | Used only as a parameter in the Huber loss: this is the distance from 0 at which the loss changes from quadratic to linear |
beamSize | int | 30 | |
maxLeft | int | 2 | The number of things to the left that have to be cached to run the Viterbi algorithm: the maximum context of class features used. |
dontExtendTaggy | boolean | false | Don't extend the range of useTaggySequences when maxLeft is increased. |
numFolds | int | 1 | The number of folds to use for cross-validation. |
startFold | int | 1 | The starting fold to run. |
numFoldsToRun | int | 1 | The number of folds to run. |
mergeTags | boolean | false | Whether to merge B- and I- tags. |
splitDocuments | boolean | true | Whether or not to split the data into separate documents for training/testing |
maxDocSize | int | 10000 | If this number is greater than 0, attempt to split documents bigger than this value into multiple documents at sentence boundaries during testing; otherwise do nothing. |
Note: flags/properties overwrite left to right. That is, the parameter
setting specified last is the one used.
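For example, a properties file might contain the following (the paths and the particular flag choices here are placeholders, not recommendations):

  trainFile = /path/to/train.tsv
  serializeTo = ner-model.ser.gz
  useWord = true
  useNGrams = true
  maxNGramLeng = 6
  usePrev = true
  useNext = true
  useTags = true
  wordShape = chris2useLC

Since settings are applied left to right, a flag given on the command line after -prop propFile (e.g., -useNGrams false) overrides the value from the file.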
DOCUMENTATION ON FEATURE TEMPLATES
w = word
t = tag
p = position (word index in sentence)
c = class
p = paren (note that p is also used for position above; the enabling flag disambiguates)
g = gazette
a = abbrev
s = shape
r = regent (dependency governor)
h = head word of phrase
n(w) = ngrams from w
g(w) = gazette entries containing w
l(w) = length of w
o(...) = occurrence patterns of words
useReverse reverses the meaning of prev and next everywhere below (it is on in -macro)
"Prolog" booleans: , = AND and ; = OR
Mac: Y = turned on in -macro,
+ = additional positive things relative to -macro for CoNLL NERFeatureFactory
(perhaps none...)
- = Known negative for CoNLL NERFeatureFactory relative to -macro
Bio: + = additional things that are positive for BioCreative
- = things negative relative to -macro
HighMagnitude: There are no (0) to a few (+) to many (+++) high weight
features of this template. (? = not used in goodCoNLL, but usually = 0)
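Reading a template line: pw,w,c (enabled by useWordPairs) is a single feature conjoining the previous word, the current word, and the class, while (pw;p2w;p3w;p4w),c (enabled by useDisjunctive) is a disjunction, generating one feature per previous word out to disjunctionWidth, each paired with the class.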
Feature Mac Bio CRFFlags HighMagnitude
---------------------------------------------------------------------
w,c Y useWord 0 (useWord is almost useless with unlimited ngram features, but helps a fraction in goodCoNLL, if only because of prior fiddling)
p,c usePosition ?
p=0,c useBeginSent ?
p=0,s,c useBeginSent ?
t,c Y useTags ++
pw,c Y usePrev +
pt,c Y usePrev,useTags 0
nw,c Y useNext ++
nt,c Y useNext,useTags 0
pw,w,c Y useWordPairs +
w,nw,c Y useWordPairs +
pt,t,nt,c useSymTags ?
t,nt,c useSymTags ?
pt,t,c useSymTags ?
pw,nw,c useSymWordPairs ?
pc,c Y usePrev,useSequences,usePrevSequences +++
pc,w,c Y usePrev,useSequences,usePrevSequences 0
nc,c useNext,useSequences,useNextSequences ?
w,nc,c useNext,useSequences,useNextSequences ?
pc,nc,c useNext,usePrev,useSequences,usePrevSequences,useNextSequences ?
w,pc,nc,c useNext,usePrev,useSequences,usePrevSequences,useNextSequences ?
(pw;p2w;p3w;p4w),c + useDisjunctive (out to disjunctionWidth now) +++
(nw;n2w;n3w;n4w),c + useDisjunctive (out to disjunctionWidth now) ++++
(pw;p2w;p3w;p4w),s,c + useDisjunctiveShapeInteraction ?
(nw;n2w;n3w;n4w),s,c + useDisjunctiveShapeInteraction ?
(pw;p2w;p3w;p4w),c + useWideDisjunctive (to wideDisjunctionWidth) ?
(nw;n2w;n3w;n4w),c + useWideDisjunctive (to wideDisjunctionWidth) ?
(ps;p2s;p3s;p4s),c useDisjShape (out to disjunctionWidth now) ?
(ns;n2s;n3s;n4s),c useDisjShape (out to disjunctionWidth now) ?
pt,pc,t,c Y useTaggySequences +
p2t,p2c,pt,pc,t,c Y useTaggySequences,maxLeft>=2 +
p3t,p3c,p2t,p2c,pt,pc,t,c Y useTaggySequences,maxLeft>=3,!dontExtendTaggy ?
p2c,pc,c Y useLongSequences ++
p3c,p2c,pc,c Y useLongSequences,maxLeft>=3 ?
p4c,p3c,p2c,pc,c Y useLongSequences,maxLeft>=4 ?
p2c,pc,c,pw=BOUNDARY useBoundarySequences 0 (OK, but!)
p2t,pt,t,c - useExtraTaggySequences ?
p3t,p2t,pt,t,c - useExtraTaggySequences ?
p2t,pt,t,s,p2c,pc,c - useTaggySequencesShapeInteraction ?
p3t,p2t,pt,t,s,p3c,p2c,pc,c useTaggySequencesShapeInteraction ?
s,pc,c Y useTypeySequences ++
ns,pc,c Y useTypeySequences // error for ps? not? 0
ps,pc,s,c Y useTypeySequences 0
// p2s,p2c,ps,pc,s,c Y useTypeySequences,maxLeft>=2 // duplicated a useTypeSeqs2 feature
n(w),c Y useNGrams (noMidNGrams, MaxNGramLeng, lowercaseNGrams, dehyphenateNGrams) +++
n(w),s,c useNGrams,conjoinShapeNGrams ?
g,c + useGazFeatures // test refining this? ?
pg,pc,c + useGazFeatures ?
ng,c + useGazFeatures ?
// pg,g,c useGazFeatures ?
// pg,g,ng,c useGazFeatures ?
// p2g,p2c,pg,pc,g,c useGazFeatures ?
g,w,c useMoreGazFeatures ?
pg,pc,g,c useMoreGazFeatures ?
g,ng,c useMoreGazFeatures ?
g(w),c useGazette,sloppyGazette (contains same word) ?
g(w),[pw,nw,...],c useGazette,cleanGazette (entire entry matches) ?
s,c Y wordShape >= 0 +++
ps,c Y wordShape >= 0,useTypeSeqs +
ns,c Y wordShape >= 0,useTypeSeqs +
pw,s,c Y wordShape >= 0,useTypeSeqs +
s,nw,c Y wordShape >= 0,useTypeSeqs +
ps,s,c Y wordShape >= 0,useTypeSeqs 0
s,ns,c Y wordShape >= 0,useTypeSeqs ++
ps,s,ns,c Y wordShape >= 0,useTypeSeqs ++
pc,ps,s,c Y wordShape >= 0,useTypeSeqs,useTypeSeqs2 0
p2c,p2s,pc,ps,s,c Y wordShape >= 0,useTypeSeqs,useTypeSeqs2,maxLeft>=2 +++
pc,ps,s,ns,c wordShape >= 0,useTypeSeqs,useTypeSeqs3 ?
p2w,s,c if l(pw) <= 3 Y useLastRealWord // weird features, but work 0
n2w,s,c if l(nw) <= 3 Y useNextRealWord ++
o(pw,w,nw),c Y useOccurrencePatterns // don't fully grok but has to do with capitalized name patterns ++
a,c useAbbr;useMinimalAbbr
pa,a,c useAbbr
a,na,c useAbbr
pa,a,na,c useAbbr
pa,pc,a,c useAbbr;useMinimalAbbr
p2a,p2c,pa,pc,a useAbbr
w,a,c useMinimalAbbr
p2a,p2c,a,c useMinimalAbbr
RESTR. w,(pw,pc;p2w,p2c;p3w,p3c;p4w,p4c) + useParenMatching,maxLeft>=n
c - useClassFeature
p,s,c - useShapeConjunctions
t,s,c - useShapeConjunctions
w,t,c + useWordTag ?
w,pt,c + useWordTag ?
w,nt,c + useWordTag ?
r,c useNPGovernor (only for baseNP words)
r,t,c useNPGovernor (only for baseNP words)
h,c useNPHead (only for baseNP words)
h,t,c useNPHead (only for baseNP words)
- Author:
- Dan Klein, Jenny Finkel, Christopher Manning, Shipra Dingare, Huy Nguyen
- See Also:
- Serialized Form
Fields inherited from class edu.stanford.nlp.sequences.FeatureFactory
cliqueC, cliqueCnC, cliqueCp2C, cliqueCp3C, cliqueCp4C, cliqueCp5C, cliqueCpC, cliqueCpCnC, cliqueCpCp2C, cliqueCpCp2Cp3C, cliqueCpCp2Cp3Cp4C, cliqueCpCp2Cp3Cp4Cp5C, flags, knownCliques
Method Summary
void clearSubstringList()
protected Collection<String> featuresC(PaddedList<? extends CoreLabel> cInfo, int loc)
protected Collection<String> featuresCnC(PaddedList<? extends CoreLabel> cInfo, int loc)
protected Collection<String> featuresCp2C(PaddedList<? extends CoreLabel> cInfo, int loc)
protected Collection<String> featuresCp3C(PaddedList<? extends CoreLabel> cInfo, int loc)
protected Collection<String> featuresCp4C(PaddedList<? extends CoreLabel> cInfo, int loc)
protected Collection<String> featuresCp5C(PaddedList<? extends CoreLabel> cInfo, int loc)
protected Collection<String> featuresCpC(PaddedList<? extends CoreLabel> cInfo, int loc)
protected Collection<String> featuresCpCnC(PaddedList<? extends CoreLabel> cInfo, int loc)
protected Collection<String> featuresCpCp2C(PaddedList<? extends CoreLabel> cInfo, int loc)
protected Collection<String> featuresCpCp2Cp3C(PaddedList<? extends CoreLabel> cInfo, int loc)
protected Collection<String> featuresCpCp2Cp3Cp4C(PaddedList<? extends CoreLabel> cInfo, int loc)
Collection getCliqueFeatures(PaddedList<? extends CoreLabel> cInfo, int loc, Clique clique) - Extracts all the features from the input data at a certain index.
void init(SeqClassifierFlags flags)
void initGazette()
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
NERFeatureFactory
public NERFeatureFactory()
init
public void init(SeqClassifierFlags flags)
- Overrides:
init
in class FeatureFactory
getCliqueFeatures
public Collection getCliqueFeatures(PaddedList<? extends CoreLabel> cInfo,
int loc,
Clique clique)
- Extracts all the features from the input data at a certain index.
- Specified by:
getCliqueFeatures
in class FeatureFactory
- Parameters:
cInfo - The complete data set as a List of WordInfo
loc - The index at which to extract features.
clique - The particular clique for which to extract features. It should be a member of the knownCliques list.
- Returns:
- A Collection of the features calculated for the word at the specified position in info.
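A minimal usage sketch (illustrative only; the clique constants inherited from FeatureFactory are protected in some versions, so Clique.valueOf is used here to name the C clique):

  import java.util.Collection;
  import java.util.Collections;
  import java.util.List;
  import edu.stanford.nlp.ie.NERFeatureFactory;
  import edu.stanford.nlp.ling.CoreLabel;
  import edu.stanford.nlp.sequences.Clique;
  import edu.stanford.nlp.sequences.SeqClassifierFlags;
  import edu.stanford.nlp.util.PaddedList;

  SeqClassifierFlags flags = new SeqClassifierFlags();
  flags.useWord = true;                      // enable the basic (w,c) feature
  NERFeatureFactory factory = new NERFeatureFactory();
  factory.init(flags);

  CoreLabel token = new CoreLabel();
  token.setWord("Stanford");
  List<CoreLabel> doc = Collections.singletonList(token);
  PaddedList<CoreLabel> padded = new PaddedList<CoreLabel>(doc, new CoreLabel());

  Clique c = Clique.valueOf(new int[]{0});   // the C clique: current position only
  Collection features = factory.getCliqueFeatures(padded, 0, c);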
clearSubstringList
public void clearSubstringList()
featuresC
protected Collection<String> featuresC(PaddedList<? extends CoreLabel> cInfo,
int loc)
featuresCpC
protected Collection<String> featuresCpC(PaddedList<? extends CoreLabel> cInfo,
int loc)
featuresCp2C
protected Collection<String> featuresCp2C(PaddedList<? extends CoreLabel> cInfo,
int loc)
featuresCp3C
protected Collection<String> featuresCp3C(PaddedList<? extends CoreLabel> cInfo,
int loc)
featuresCp4C
protected Collection<String> featuresCp4C(PaddedList<? extends CoreLabel> cInfo,
int loc)
featuresCp5C
protected Collection<String> featuresCp5C(PaddedList<? extends CoreLabel> cInfo,
int loc)
featuresCpCp2C
protected Collection<String> featuresCpCp2C(PaddedList<? extends CoreLabel> cInfo,
int loc)
featuresCpCp2Cp3C
protected Collection<String> featuresCpCp2Cp3C(PaddedList<? extends CoreLabel> cInfo,
int loc)
featuresCpCp2Cp3Cp4C
protected Collection<String> featuresCpCp2Cp3Cp4C(PaddedList<? extends CoreLabel> cInfo,
int loc)
featuresCnC
protected Collection<String> featuresCnC(PaddedList<? extends CoreLabel> cInfo,
int loc)
featuresCpCnC
protected Collection<String> featuresCpCnC(PaddedList<? extends CoreLabel> cInfo,
int loc)
initGazette
public void initGazette()