edu.stanford.nlp.ie
Class NERFeatureFactory

java.lang.Object
  extended by edu.stanford.nlp.sequences.FeatureFactory
      extended by edu.stanford.nlp.ie.NERFeatureFactory
All Implemented Interfaces:
Serializable

public class NERFeatureFactory
extends FeatureFactory

Features for Named Entity Recognition. The code here creates the features by processing Lists of CoreLabels. Look at SeqClassifierFlags to see where the flags are set and which feature options they control.

To add a new feature extractor, you should do the following (see the sketch after this list):

  1. Add a variable (boolean, int, String, etc., as appropriate) to SeqClassifierFlags to mark whether the new extractor is turned on, its value, etc. Add it at the bottom of the list of variables currently in the class (this avoids problems with older serialized files breaking). Make the default value of the variable false/null/0 (again for backwards compatibility).
  2. Add a clause to the big if/then/else of setProperties(Properties) in SeqClassifierFlags. Unless it is a macro option, make the option name the same as the variable name used in step 1.
  3. Add code to NERFeatureFactory for this feature. First decide which classes (hidden states) are involved in the feature. If only the current class is involved, add the feature extractor to the featuresC code; if both the current and previous class, to featuresCpC; and so on.
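
For example, adding a hypothetical extractor that fires on the last three characters of the current word might look roughly like this (a sketch only: the flag name useSuffix3 and the local names key, val, and featuresC are illustrative, not part of the actual classes):

// Step 1: in SeqClassifierFlags, at the bottom of the field list
// (default false, for backwards compatibility with serialized files)
public boolean useSuffix3 = false;

// Step 2: in SeqClassifierFlags.setProperties(Properties), a clause in the
// big if/then/else; the option name matches the variable name from step 1
} else if (key.equalsIgnoreCase("useSuffix3")) {
  useSuffix3 = Boolean.parseBoolean(val);

// Step 3: in NERFeatureFactory.featuresC(cInfo, loc), since only the
// current class (hidden state) is involved
if (flags.useSuffix3) {
  String word = cInfo.get(loc).word();
  if (word.length() >= 3) {
    featuresC.add(word.substring(word.length() - 3) + "-SUF3");
  }
}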

Parameters can be defined using a Properties file (specified on the command-line with -prop propFile), or directly on the command line. The following properties are recognized:

Each property is listed as: propertyName (Type, default value): description.

loadClassifier (String, n/a): Path to serialized classifier to load.
loadAuxClassifier (String, n/a): Path to auxiliary classifier to load.
serializeTo (String, n/a): Path to serialize classifier to.
trainFile (String, n/a): Path of file to use as training data.
testFile (String, n/a): Path of file to use as test data.
useWord (boolean, true): Gives you feature for w.
useBinnedLength (String, null): If non-null, treat as a sequence of comma-separated integer bounds, where items above the previous bound and up to the next bound are binned Len-range.
useNGrams (boolean, false): Make features from letter n-grams.
lowercaseNGrams (boolean, false): Make features from lowercased letter n-grams only.
dehyphenateNGrams (boolean, false): Remove hyphens before making features from letter n-grams.
conjoinShapeNGrams (boolean, false): Conjoin word shape and n-gram features.
usePrev (boolean, false): Gives you feature for (pw,c) and, together with other options, enables other previous features, such as (pt,c) [with useTags].
useNext (boolean, false): Gives you feature for (nw,c) and, together with other options, enables other next features, such as (nt,c) [with useTags].
useTags (boolean, false): Gives you features for (t,c), (pt,c) [if usePrev], (nt,c) [if useNext].
useWordPairs (boolean, false): Gives you features for (pw,w,c) and (w,nw,c).
useGazettes (boolean, false): If true, use gazette features (defined by other flags).
gazette (String, null): The value can be one or more filenames (names separated by a comma, semicolon, or space). If provided, gazettes are loaded from these files. Each line should be an entity class name, followed by whitespace, followed by an entity (which might be a phrase of several tokens with a single space between words). Giving this property turns on useGazettes, so you normally don't need to specify it (but you can use it to turn off gazettes specified in a properties file).
sloppyGazette (boolean, false): If true, a gazette feature fires when any token of a gazette entry matches.
cleanGazette (boolean, false): If true, a gazette feature fires when all tokens of a gazette entry match.
wordShape (String, none): Either "none" for no word shape use, or the name of a word shape function recognized by WordShapeClassifier.lookupShaper(String).
useSequences (boolean, true)
usePrevSequences (boolean, false)
useNextSequences (boolean, false)
useLongSequences (boolean, false): Use plain higher-order state sequences out to the minimum of length or maxLeft.
useBoundarySequences (boolean, false): Use extra second-order class sequence features when the previous word is a CoNLL boundary, so an entity knows it can span a boundary.
useTaggySequences (boolean, false): Use first-, second-, and third-order class and tag sequence interaction features.
useExtraTaggySequences (boolean, false): Add in sequences of tags with just current class features.
useTaggySequencesShapeInteraction (boolean, false): Add in terms that join sequences of 2 or 3 tags with the current shape.
strictlyFirstOrder (boolean, false): As an override to whatever other options are in effect, deletes all features other than C and CpC clique features when building the classifier.
entitySubclassification (String, "IO"): If set, convert the labeling of classes (but not the background) into one of several alternate encodings (IO, IOB1, IOB2, IOE1, IOE2, SBIEO), with a S(ingle), B(eginning), E(nding), I(nside) 4-way classification for each class. By default, we either do no re-encoding, or the CoNLLDocumentIteratorFactory does a lossy encoding as IO. Note that this is all CoNLL-specific, depends on their way of prefix-encoding classes, and is only implemented by the CoNLLDocumentIteratorFactory.
useSum (boolean, false)
tolerance (double, 1e-4): Convergence tolerance in optimization.
printFeatures (String, null): Print out the features of the classifier to a file based on this name (suffixed "-1" and "-2").
useSymTags (boolean, false): Gives you features (pt,t,nt,c), (t,nt,c), (pt,t,c).
useSymWordPairs (boolean, false): Gives you features (pw,nw,c).
printClassifier (String, null): Style in which to print the classifier. One of: HighWeight, HighMagnitude, Collection, AllWeights, WeightHistogram.
printClassifierParam (int, 100): A parameter to the printing style, which may give, for example, the number of parameters to print.
intern (boolean, false): If true, (String) intern read-in data and classes and feature (pre-)names such as substring features.
intern2 (boolean, false): If true, intern all (final) feature names (if only current-word and n-gram features are used, these will already have been interned by intern, and this is an unnecessary no-op).
cacheNGrams (boolean, false): If true, record the n-gram features that correspond to a String (under the current option settings) and reuse them rather than recalculating if the String is seen again.
selfTest (boolean, false)
noMidNGrams (boolean, false): Do not include character n-gram features for n-grams that contain neither the beginning nor the end of the word.
maxNGramLeng (int, -1): If this number is positive, n-grams above this size will not be used in the model.
useReverse (boolean, false)
retainEntitySubclassification (boolean, false): If true, rather than undoing a recoding of entity tag subtypes (such as BIO variants), just leave them in the output.
useLemmas (boolean, false): Include the lemma of a word as a feature.
usePrevNextLemmas (boolean, false): Include the previous/next lemma of a word as a feature.
useLemmaAsWord (boolean, false): Include the lemma of a word as a feature.
normalizeTerms (boolean, false): If true, some words are normalized: day and month names are lowercased (as for normalizeTimex) and some British spellings are mapped to American English spellings (e.g., -our/-or, etc.).
normalizeTimex (boolean, false): If true, capitalization of day and month names is normalized to lowercase.
useNB (boolean, false)
useTypeSeqs (boolean, false): Use basic zeroth-order word shape features.
useTypeSeqs2 (boolean, false): Add additional first- and second-order word shape features.
useTypeSeqs3 (boolean, false): Adds one more first-order shape sequence.
useDisjunctive (boolean, false): Include features giving disjunctions of words anywhere in the left or right disjunctionWidth words (preserving direction but not position).
disjunctionWidth (int, 4): The number of words on each side of the current word that are included in the disjunction features.
useDisjunctiveShapeInteraction (boolean, false): Include features giving disjunctions of words anywhere in the left or right disjunctionWidth words (preserving direction but not position), interacting with the word shape of the current word.
useWideDisjunctive (boolean, false): Include features giving disjunctions of words anywhere in the left or right wideDisjunctionWidth words (preserving direction but not position).
wideDisjunctionWidth (int, 4): The number of words on each side of the current word that are included in the wide disjunction features.
usePosition (boolean, false): Use the combination of position in sentence and class as a feature.
useBeginSent (boolean, false): Use the combination of initial position in sentence and class (and word shape) as a feature. (Doesn't seem to help.)
useDisjShape (boolean, false): Include features giving disjunctions of word shapes anywhere in the left or right disjunctionWidth words (preserving direction but not position).
useClassFeature (boolean, false): Include a feature for the class (as a class marginal).
useShapeConjunctions (boolean, false): Conjoin shape with tag or position.
useWordTag (boolean, false): Include word and tag pair features.
useLastRealWord (boolean, false): Iff the previous word is of length 3 or less, add an extra feature that combines the word two back and the current word's shape. Weird!
useNextRealWord (boolean, false): Iff the next word is of length 3 or less, add an extra feature that combines the word after next and the current word's shape. Weird!
useTitle (boolean, false): Match a word against a list of name titles (Mr, Mrs, etc.).
useOccurrencePatterns (boolean, false): This is a very engineered feature designed to capture multiple references to names. If the current word isn't capitalized, is followed by a non-capitalized word, and is preceded by a word with alphabetic characters, it returns NO-OCCURRENCE-PATTERN. Otherwise, if the previous word is a capitalized NNP, then if in the next 150 words you find this PW-W sequence, you get XY-NEXT-OCCURRENCE-XY, else if you find W you get XY-NEXT-OCCURRENCE-Y. Similarly for backwards and XY-PREV-OCCURRENCE-XY and XY-PREV-OCCURRENCE-Y. Else (if the previous word isn't a capitalized NNP), under analogous rules you get one or more of X-NEXT-OCCURRENCE-YX, X-NEXT-OCCURRENCE-XY, X-NEXT-OCCURRENCE-X, X-PREV-OCCURRENCE-YX, X-PREV-OCCURRENCE-XY, X-PREV-OCCURRENCE-X.
useTypeySequences (boolean, false): Some first-order word shape patterns.
useGenericFeatures (boolean, false): If true, any features you include in the map will be incorporated into the model with values equal to those given in the file; values are treated as strings unless you use the "realValued" option (described below).
justify (boolean, false): Print out all feature/class pairs and their weights, and then, for each input data point, print the justification (weights) for active features.
normalize (boolean, false): For the CMMClassifier (only): if true, the Scorer normalizes scores as probabilities.
useHuber (boolean, false): Use a Huber loss prior rather than the default quadratic loss.
useQuartic (boolean, false): Use a Quartic prior rather than the default quadratic loss.
sigma (double, 1.0)
epsilon (double, 0.01): Used only as a parameter in the Huber loss: this is the distance from 0 at which the loss changes from quadratic to linear.
beamSize (int, 30)
maxLeft (int, 2): The number of things to the left that have to be cached to run the Viterbi algorithm: the maximum context of class features used.
dontExtendTaggy (boolean, false): Don't extend the range of useTaggySequences when maxLeft is increased.
numFolds (int, 1): The number of folds to use for cross-validation.
startFold (int, 1): The starting fold to run.
numFoldsToRun (int, 1): The number of folds to run.
mergeTags (boolean, false): Whether to merge B- and I- tags.
splitDocuments (boolean, true): Whether or not to split the data into separate documents for training/testing.
maxDocSize (int, 10000): If this number is greater than 0, attempt to split documents bigger than this value into multiple documents at sentence boundaries during testing; otherwise do nothing.
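
For illustration, a gazette file in the format described under the gazette property above might contain lines like the following (made-up entries; the class names just need to match your training labels):

PERSON Jane Austen
LOCATION New York City
ORGANIZATION Stanford University
ORGANIZATION International Business Machines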

Note: flags/properties overwrite left to right. That is, the parameter setting specified last is the one used.
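
For example, a run might combine a property file with command-line overrides (an illustrative sketch: the file names and flag choices are placeholders, and CRFClassifier is one of the classifiers that consumes these flags):

# ner.prop -- illustrative property file
trainFile = /path/to/train.tsv
serializeTo = ner-model.ser.gz
useWord = true
useNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useTags = true

Then, on the command line, the trailing flag overrides the value from ner.prop, because the setting specified last wins:

java edu.stanford.nlp.ie.crf.CRFClassifier -prop ner.prop -useNGrams false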

DOCUMENTATION ON FEATURE TEMPLATES

w = word
t = tag
p = position (word index in sentence)
c = class
p = paren
g = gazette
a = abbrev
s = shape
r = regent (dependency governor)
h = head word of phrase
n(w) = ngrams from w
g(w) = gazette entries containing w
l(w) = length of w
o(...) = occurrence patterns of words

useReverse reverses the meaning of prev and next everywhere below (it is on in -macro).

"Prolog" booleans: , = AND and ; = OR

Mac column:
  Y = turned on in -macro
  + = additional positive things relative to -macro for CoNLL NERFeatureFactory (perhaps none...)
  - = known negative for CoNLL NERFeatureFactory relative to -macro

Bio column:
  + = additional things that are positive for BioCreative
  - = things negative relative to -macro

HighMagnitude column: there are no (0), a few (+), or many (+++) high-weight features of this template (? = not used in goodCoNLL, but usually 0).
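
For example, the row "pw,c  Y  usePrev  +" below reads: the template conjoining the previous word with the current class is turned on in -macro, is enabled by the usePrev flag, and has a few high-weight features.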

Feature                               Mac/Bio  CRFFlags                                                      HighMagnitude
---------------------------------------------------------------------------------------------------------------------------
w,c                                   Y        useWord                                                       0 (useWord is almost useless with unlimited ngram features, but helps a fraction in goodCoNLL, if only because of prior fiddling)
p,c                                            usePosition                                                   ?
p=0,c                                          useBeginSent                                                  ?
p=0,s,c                                        useBeginSent                                                  ?
t,c                                   Y        useTags                                                       ++
pw,c                                  Y        usePrev                                                       +
pt,c                                  Y        usePrev,useTags                                               0
nw,c                                  Y        useNext                                                       ++
nt,c                                  Y        useNext,useTags                                               0
pw,w,c                                Y        useWordPairs                                                  +
w,nw,c                                Y        useWordPairs                                                  +
pt,t,nt,c                                      useSymTags                                                    ?
t,nt,c                                         useSymTags                                                    ?
pt,t,c                                         useSymTags                                                    ?
pw,nw,c                                        useSymWordPairs                                               ?

pc,c                                  Y        usePrev,useSequences,usePrevSequences                         +++
pc,w,c                                Y        usePrev,useSequences,usePrevSequences                         0
nc,c                                           useNext,useSequences,useNextSequences                         ?
w,nc,c                                         useNext,useSequences,useNextSequences                         ?
pc,nc,c                                        useNext,usePrev,useSequences,usePrevSequences,useNextSequences  ?
w,pc,nc,c                                      useNext,usePrev,useSequences,usePrevSequences,useNextSequences  ?

(pw;p2w;p3w;p4w),c                    +        useDisjunctive (out to disjunctionWidth now)                  +++
(nw;n2w;n3w;n4w),c                    +        useDisjunctive (out to disjunctionWidth now)                  ++++
(pw;p2w;p3w;p4w),s,c                  +        useDisjunctiveShapeInteraction                                ?
(nw;n2w;n3w;n4w),s,c                  +        useDisjunctiveShapeInteraction                                ?
(pw;p2w;p3w;p4w),c                    +        useWideDisjunctive (to wideDisjunctionWidth)                  ?
(nw;n2w;n3w;n4w),c                    +        useWideDisjunctive (to wideDisjunctionWidth)                  ?
(ps;p2s;p3s;p4s),c                             useDisjShape (out to disjunctionWidth now)                    ?
(ns;n2s;n3s;n4s),c                             useDisjShape (out to disjunctionWidth now)                    ?

pt,pc,t,c                             Y        useTaggySequences                                             +
p2t,p2c,pt,pc,t,c                     Y        useTaggySequences,maxLeft>=2                                  +
p3t,p3c,p2t,p2c,pt,pc,t,c             Y        useTaggySequences,maxLeft>=3,!dontExtendTaggy                 ?
p2c,pc,c                              Y        useLongSequences                                              ++
p3c,p2c,pc,c                          Y        useLongSequences,maxLeft>=3                                   ?
p4c,p3c,p2c,pc,c                      Y        useLongSequences,maxLeft>=4                                   ?
p2c,pc,c,pw=BOUNDARY                           useBoundarySequences                                          0 (OK, but!)

p2t,pt,t,c                            -        useExtraTaggySequences                                        ?
p3t,p2t,pt,t,c                        -        useExtraTaggySequences                                        ?

p2t,pt,t,s,p2c,pc,c                   -        useTaggySequencesShapeInteraction                             ?
p3t,p2t,pt,t,s,p3c,p2c,pc,c                    useTaggySequencesShapeInteraction                             ?

s,pc,c                                Y        useTypeySequences                                             ++
ns,pc,c                               Y        useTypeySequences  // error for ps? not?                      0
ps,pc,s,c                             Y        useTypeySequences                                             0
// p2s,p2c,ps,pc,s,c                  Y        useTypeySequences,maxLeft>=2  // duplicated a useTypeSeqs2 feature

n(w),c                                Y        useNGrams (noMidNGrams, maxNGramLeng, lowercaseNGrams, dehyphenateNGrams)  +++
n(w),s,c                                       useNGrams,conjoinShapeNGrams                                  ?

g,c                                   +        useGazFeatures  // test refining this?                        ?
pg,pc,c                               +        useGazFeatures                                                ?
ng,c                                  +        useGazFeatures                                                ?
// pg,g,c                                      useGazFeatures                                                ?
// pg,g,ng,c                                   useGazFeatures                                                ?
// p2g,p2c,pg,pc,g,c                           useGazFeatures                                                ?
g,w,c                                          useMoreGazFeatures                                            ?
pg,pc,g,c                                      useMoreGazFeatures                                            ?
g,ng,c                                         useMoreGazFeatures                                            ?

g(w),c                                         useGazette,sloppyGazette (contains same word)                 ?
g(w),[pw,nw,...],c                             useGazette,cleanGazette (entire entry matches)                ?

s,c                                   Y        wordShape >= 0                                                +++
ps,c                                  Y        wordShape >= 0,useTypeSeqs                                    +
ns,c                                  Y        wordShape >= 0,useTypeSeqs                                    +
pw,s,c                                Y        wordShape >= 0,useTypeSeqs                                    +
s,nw,c                                Y        wordShape >= 0,useTypeSeqs                                    +
ps,s,c                                Y        wordShape >= 0,useTypeSeqs                                    0
s,ns,c                                Y        wordShape >= 0,useTypeSeqs                                    ++
ps,s,ns,c                             Y        wordShape >= 0,useTypeSeqs                                    ++
pc,ps,s,c                             Y        wordShape >= 0,useTypeSeqs,useTypeSeqs2                       0
p2c,p2s,pc,ps,s,c                     Y        wordShape >= 0,useTypeSeqs,useTypeSeqs2,maxLeft>=2            +++
pc,ps,s,ns,c                                   wordShape >= 0,useTypeSeqs,useTypeSeqs3                       ?

p2w,s,c if l(pw) <= 3                 Y        useLastRealWord  // weird features, but work                  0
n2w,s,c if l(nw) <= 3                 Y        useNextRealWord                                               ++
o(pw,w,nw),c                          Y        useOccurrencePatterns  // don't fully grok, but has to do with capitalized name patterns  ++

a,c                                            useAbbr;useMinimalAbbr
pa,a,c                                         useAbbr
a,na,c                                         useAbbr
pa,a,na,c                                      useAbbr
pa,pc,a,c                                      useAbbr;useMinimalAbbr
p2a,p2c,pa,pc,a                                useAbbr
w,a,c                                          useMinimalAbbr
p2a,p2c,a,c                                    useMinimalAbbr

RESTR. w,(pw,pc;p2w,p2c;p3w,p3c;p4w,p4c)  +    useParenMatching,maxLeft>=n

c                                     -        useClassFeature

p,s,c                                 -        useShapeConjunctions
t,s,c                                 -        useShapeConjunctions

w,t,c                                 +        useWordTag                                                    ?
w,pt,c                                +        useWordTag                                                    ?
w,nt,c                                +        useWordTag                                                    ?

r,c                                            useNPGovernor (only for baseNP words)
r,t,c                                          useNPGovernor (only for baseNP words)
h,c                                            useNPHead (only for baseNP words)
h,t,c                                          useNPHead (only for baseNP words)

Author:
Dan Klein, Jenny Finkel, Christopher Manning, Shipra Dingare, Huy Nguyen
See Also:
Serialized Form

Field Summary
 
Fields inherited from class edu.stanford.nlp.sequences.FeatureFactory
cliqueC, cliqueCnC, cliqueCp2C, cliqueCp3C, cliqueCp4C, cliqueCp5C, cliqueCpC, cliqueCpCnC, cliqueCpCp2C, cliqueCpCp2Cp3C, cliqueCpCp2Cp3Cp4C, cliqueCpCp2Cp3Cp4Cp5C, flags, knownCliques
 
Constructor Summary
NERFeatureFactory()
           
 
Method Summary
 void clearMemory()
           
protected  Collection<String> featuresC(PaddedList<? extends CoreLabel> cInfo, int loc)
           
protected  Collection<String> featuresCnC(PaddedList<? extends CoreLabel> cInfo, int loc)
           
protected  Collection<String> featuresCp2C(PaddedList<? extends CoreLabel> cInfo, int loc)
           
protected  Collection<String> featuresCp3C(PaddedList<? extends CoreLabel> cInfo, int loc)
           
protected  Collection<String> featuresCp4C(PaddedList<? extends CoreLabel> cInfo, int loc)
           
protected  Collection<String> featuresCp5C(PaddedList<? extends CoreLabel> cInfo, int loc)
           
protected  Collection<String> featuresCpC(PaddedList<? extends CoreLabel> cInfo, int loc)
           
protected  Collection<String> featuresCpCnC(PaddedList<? extends CoreLabel> cInfo, int loc)
           
protected  Collection<String> featuresCpCp2C(PaddedList<? extends CoreLabel> cInfo, int loc)
           
protected  Collection<String> featuresCpCp2Cp3C(PaddedList<? extends CoreLabel> cInfo, int loc)
           
protected  Collection<String> featuresCpCp2Cp3Cp4C(PaddedList<? extends CoreLabel> cInfo, int loc)
           
 Collection<String> getCliqueFeatures(PaddedList<? extends CoreLabel> cInfo, int loc, Clique clique)
          Extracts all the features from the input data at a certain index.
 void init(SeqClassifierFlags flags)
           
 void initGazette()
           
 
Methods inherited from class edu.stanford.nlp.sequences.FeatureFactory
addAllInterningAndSuffixing, getCliques, getCliques
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

NERFeatureFactory

public NERFeatureFactory()
Method Detail

init

public void init(SeqClassifierFlags flags)
Overrides:
init in class FeatureFactory

getCliqueFeatures

public Collection<String> getCliqueFeatures(PaddedList<? extends CoreLabel> cInfo,
                                            int loc,
                                            Clique clique)
Extracts all the features from the input data at a certain index.

Specified by:
getCliqueFeatures in class FeatureFactory
Parameters:
cInfo - The complete data set as a PaddedList of CoreLabels
loc - The index at which to extract features.
clique - The particular clique for which to extract features. It should be a member of the knownCliques list.
Returns:
A Collection of the features calculated for the word at the specified position in cInfo.
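
A minimal calling sketch (assumed setup: generic type parameters on these classes vary across versions, and the two-token input is purely illustrative):

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import edu.stanford.nlp.ie.NERFeatureFactory;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.sequences.FeatureFactory;
import edu.stanford.nlp.sequences.SeqClassifierFlags;
import edu.stanford.nlp.util.PaddedList;

public class NERFeatureDemo {
  public static void main(String[] args) {
    // Build a tiny two-token "sentence" of CoreLabels.
    List<CoreLabel> labels = new ArrayList<>();
    for (String w : new String[] {"Stanford", "University"}) {
      CoreLabel token = new CoreLabel();
      token.setWord(w);
      labels.add(token);
    }
    // Pad with an empty CoreLabel so out-of-range lookups are safe.
    PaddedList<CoreLabel> cInfo = new PaddedList<>(labels, new CoreLabel());

    SeqClassifierFlags flags = new SeqClassifierFlags();
    flags.useWord = true;
    NERFeatureFactory factory = new NERFeatureFactory();
    factory.init(flags);

    // Features for the C clique (current class only) at position 0.
    Collection<String> feats =
        factory.getCliqueFeatures(cInfo, 0, FeatureFactory.cliqueC);
    System.out.println(feats);
  }
}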

clearMemory

public void clearMemory()

featuresC

protected Collection<String> featuresC(PaddedList<? extends CoreLabel> cInfo,
                                       int loc)

featuresCpC

protected Collection<String> featuresCpC(PaddedList<? extends CoreLabel> cInfo,
                                         int loc)

featuresCp2C

protected Collection<String> featuresCp2C(PaddedList<? extends CoreLabel> cInfo,
                                          int loc)

featuresCp3C

protected Collection<String> featuresCp3C(PaddedList<? extends CoreLabel> cInfo,
                                          int loc)

featuresCp4C

protected Collection<String> featuresCp4C(PaddedList<? extends CoreLabel> cInfo,
                                          int loc)

featuresCp5C

protected Collection<String> featuresCp5C(PaddedList<? extends CoreLabel> cInfo,
                                          int loc)

featuresCpCp2C

protected Collection<String> featuresCpCp2C(PaddedList<? extends CoreLabel> cInfo,
                                            int loc)

featuresCpCp2Cp3C

protected Collection<String> featuresCpCp2Cp3C(PaddedList<? extends CoreLabel> cInfo,
                                               int loc)

featuresCpCp2Cp3Cp4C

protected Collection<String> featuresCpCp2Cp3Cp4C(PaddedList<? extends CoreLabel> cInfo,
                                                  int loc)

featuresCnC

protected Collection<String> featuresCnC(PaddedList<? extends CoreLabel> cInfo,
                                         int loc)

featuresCpCnC

protected Collection<String> featuresCpCnC(PaddedList<? extends CoreLabel> cInfo,
                                           int loc)

initGazette

public void initGazette()

