public class NERFeatureFactory<IN extends CoreLabel> extends FeatureFactory<IN>
Look at SeqClassifierFlags to see where the flags are set and which options to use for which flags.
To add a new feature extractor, first decide which classes are involved in the feature. If only the current class is involved, add the extractor to the featuresC code; if both the current and previous class, then featuresCpC, etc.
Parameters can be defined using a Properties file (specified on the command line with -prop propFile), or directly on the command line. The following properties are recognized:
Property Name | Type | Default Value | Description |
loadClassifier | String | n/a | Path to serialized classifier to load |
loadAuxClassifier | String | n/a | Path to auxiliary classifier to load. |
serializeTo | String | n/a | Path to serialize classifier to |
trainFile | String | n/a | Path of file to use as training data |
testFile | String | n/a | Path of file to use as test data |
map | String | see below | This applies at training time or when testing on tab-separated column data. It says what is in each column. It does not apply when running on plain text data. The simplest scenario for training is having words and classes in two columns. word=0,answer=1 is the default if conllNoTags is specified; otherwise word=0,tag=1,answer=2 is the default. But you can add other columns, such as for a part-of-speech tag, presence in a lexicon, etc. That would only be useful at runtime if you have part-of-speech information or the like available and are passing it in with the tokens (that is, you can pass to classify CoreLabel tokens with additional fields stored in them). |
useWord | boolean | true | Gives you feature for w |
useBinnedLength | String | null | If non-null, treat as a sequence of comma-separated integer bounds, where items above the previous bound and up to the next bound are binned as Len-range |
useNGrams | boolean | false | Make features from letter n-grams, i.e., substrings of the word |
lowercaseNGrams | boolean | false | Make letter n-gram features from the lowercased word only |
dehyphenateNGrams | boolean | false | Remove hyphens before making features from letter n-grams |
conjoinShapeNGrams | boolean | false | Conjoin word shape and n-gram features |
useNeighborNGrams | boolean | false | Use letter n-grams for the previous and current words in the CpC clique. This feature helps languages such as Chinese, but not so much for English |
usePrev | boolean | false | Gives you feature for (pw,c), and together with other options enables other previous features, such as (pt,c) [with useTags] |
useNext | boolean | false | Gives you feature for (nw,c), and together with other options enables other next features, such as (nt,c) [with useTags] |
useTags | boolean | false | Gives you features for (t,c), (pt,c) [if usePrev], (nt,c) [if useNext] |
useWordPairs | boolean | false | Gives you features for (pw, w, c) and (w, nw, c) |
useGazettes | boolean | false | If true, use gazette features (defined by other flags) |
gazette | String | null | The value can be one or more filenames (names separated by a comma, semicolon or space). If provided, gazettes are loaded from these files. Each line should be an entity class name, followed by whitespace, followed by an entity (which might be a phrase of several tokens with a single space between words). Giving this property turns on useGazettes, so you normally don't need to specify useGazettes (but can use it to turn off gazettes specified in a properties file). |
sloppyGazette | boolean | false | If true, a gazette feature fires when any token of a gazette entry matches |
cleanGazette | boolean | false | If true, a gazette feature fires when all tokens of a gazette entry match |
wordShape | String | none | Either "none" for no wordShape use, or the name of a word shape function recognized by WordShapeClassifier.lookupShaper(String) |
useSequences | boolean | true | Does not use any class combination features if this is false |
usePrevSequences | boolean | false | Does not use any class combination features using previous classes if this is false |
useNextSequences | boolean | false | Does not use any class combination features using next classes if this is false |
useLongSequences | boolean | false | Use plain higher-order state sequences out to minimum of length or maxLeft |
useBoundarySequences | boolean | false | Use extra second order class sequence features when previous is CoNLL boundary, so entity knows it can span boundary. |
useTaggySequences | boolean | false | Use first, second, and third order class and tag sequence interaction features |
useExtraTaggySequences | boolean | false | Add in sequences of tags with just current class features |
useTaggySequencesShapeInteraction | boolean | false | Add in terms that join sequences of 2 or 3 tags with the current shape |
strictlyFirstOrder | boolean | false | As an override to whatever other options are in effect, deletes all features other than C and CpC clique features when building the classifier |
entitySubclassification | String | "IO" | If set, convert the labeling of classes (but not the background) into one of several alternate encodings (IO, IOB1, IOB2, IOE1, IOE2, SBIEO, with an S(ingle), B(eginning), E(nding), I(nside) 4-way classification for each class). By default, we either do no re-encoding, or the CoNLLDocumentIteratorFactory does a lossy encoding as IO. Note that this is all CoNLL-specific, depends on their way of prefix encoding classes, and is only implemented by the CoNLLDocumentIteratorFactory. |
useSum | boolean | false | |
tolerance | double | 1e-4 | Convergence tolerance in optimization |
printFeatures | String | null | print out all the features generated by the classifier for a dataset to a file based on this name (starting with "features-", suffixed "-1" and "-2" for train and test). This simply prints the feature names, one per line. |
printFeaturesUpto | int | -1 | Print out features for only the first this many datums, if the value is positive. |
useSymTags | boolean | false | Gives you features (pt, t, nt, c), (t, nt, c), (pt, t, c) |
useSymWordPairs | boolean | false | Gives you features (pw, nw, c) |
printClassifier | String | null | Style in which to print the classifier. One of: HighWeight, HighMagnitude, Collection, AllWeights, WeightHistogram |
printClassifierParam | int | 100 | A parameter to the printing style, which may give, for example the number of parameters to print |
intern | boolean | false | If true, (String) intern read in data and classes and feature (pre-)names such as substring features |
intern2 | boolean | false | If true, intern all (final) feature names (if only current word and ngram features are used, these will already have been interned by intern, and this is an unnecessary no-op) |
cacheNGrams | boolean | false | If true, record the NGram features that correspond to a String (under the current option settings) and reuse rather than recalculating if the String is seen again. |
selfTest | boolean | false | |
noMidNGrams | boolean | false | Do not include character n-gram features for n-grams that contain neither the beginning nor the end of the word |
maxNGramLeng | int | -1 | If this number is positive, n-grams above this size will not be used in the model |
useReverse | boolean | false | |
retainEntitySubclassification | boolean | false | If true, rather than undoing a recoding of entity tag subtypes (such as BIO variants), just leave them in the output. |
useLemmas | boolean | false | Include the lemma of a word as a feature. |
usePrevNextLemmas | boolean | false | Include the previous/next lemma of a word as a feature. |
useLemmaAsWord | boolean | false | Include the lemma of a word as a feature. |
normalizeTerms | boolean | false | If this is true, some words are normalized: day and month names are lowercased (as for normalizeTimex) and some British spellings are mapped to American English spellings (e.g., -our/-or, etc.). |
normalizeTimex | boolean | false | If this is true, capitalization of day and month names is normalized to lowercase |
useNB | boolean | false | |
useTypeSeqs | boolean | false | Use basic zeroth order word shape features. |
useTypeSeqs2 | boolean | false | Add additional first and second order word shape features |
useTypeSeqs3 | boolean | false | Adds one more first order shape sequence |
useDisjunctive | boolean | false | Include features giving disjunctions of words anywhere in the left or right disjunctionWidth words (preserving direction but not position) |
disjunctionWidth | int | 4 | The number of words on each side of the current word that are included in the disjunction features |
useDisjunctiveShapeInteraction | boolean | false | Include features giving disjunctions of words anywhere in the left or right disjunctionWidth words (preserving direction but not position) interacting with the word shape of the current word |
useWideDisjunctive | boolean | false | Include features giving disjunctions of words anywhere in the left or right wideDisjunctionWidth words (preserving direction but not position) |
wideDisjunctionWidth | int | 4 | The number of words on each side of the current word that are included in the disjunction features |
usePosition | boolean | false | Use combination of position in sentence and class as a feature |
useBeginSent | boolean | false | Use combination of initial position in sentence and class (and word shape) as a feature. (Doesn't seem to help.) |
useDisjShape | boolean | false | Include features giving disjunctions of word shapes anywhere in the left or right disjunctionWidth words (preserving direction but not position) |
useClassFeature | boolean | false | Include a feature for the class (as a class marginal). Puts a prior on the classes which is equivalent to how often the feature appeared in the training data. |
useShapeConjunctions | boolean | false | Conjoin shape with tag or position |
useWordTag | boolean | false | Include word and tag pair features |
useLastRealWord | boolean | false | Iff the prev word is of length 3 or less, add an extra feature that combines the word two back and the current word's shape. Weird! |
useNextRealWord | boolean | false | Iff the next word is of length 3 or less, add an extra feature that combines the word after next and the current word's shape. Weird! |
useTitle | boolean | false | Match a word against a list of name titles (Mr, Mrs, etc.) |
useDistSim | boolean | false | Load a file of distributional similarity classes (specified by distSimLexicon) and use it for features |
distSimLexicon | String | | The file to be loaded for distsim classes. |
distSimFileFormat | String | alexclark | Files should be formatted as tab separated rows where each row is a word/class pair. alexclark=word first, terrykoo=class first |
useOccurrencePatterns | boolean | false | This is a very engineered feature designed to capture multiple references to names. If the current word isn't capitalized, followed by a non-capitalized word, and preceded by a word with alphabetic characters, it returns NO-OCCURRENCE-PATTERN. Otherwise, if the previous word is a capitalized NNP, then if in the next 150 words you find this PW-W sequence, you get XY-NEXT-OCCURRENCE-XY, else if you find W you get XY-NEXT-OCCURRENCE-Y. Similarly for backwards and XY-PREV-OCCURRENCE-XY and XY-PREV-OCCURRENCE-Y. Else (if the previous word isn't a capitalized NNP), under analogous rules you get one or more of X-NEXT-OCCURRENCE-YX, X-NEXT-OCCURRENCE-XY, X-NEXT-OCCURRENCE-X, X-PREV-OCCURRENCE-YX, X-PREV-OCCURRENCE-XY, X-PREV-OCCURRENCE-X. |
useTypeySequences | boolean | false | Some first order word shape patterns. |
useGenericFeatures | boolean | false | If true, any features you include in the map will be incorporated into the model with values equal to those given in the file; values are treated as strings unless you use the "realValued" option (described below) |
justify | boolean | false | Print out all feature/class pairs and their weight, and then for each input data point, print justification (weights) for active features |
normalize | boolean | false | For the CMMClassifier (only) if this is true then the Scorer normalizes scores as probabilities. |
useHuber | boolean | false | Use a Huber loss prior rather than the default quadratic loss. |
useQuartic | boolean | false | Use a Quartic prior rather than the default quadratic loss. |
sigma | double | 1.0 | |
epsilon | double | 0.01 | Used only as a parameter in the Huber loss: this is the distance from 0 at which the loss changes from quadratic to linear |
beamSize | int | 30 | |
maxLeft | int | 2 | The number of things to the left that have to be cached to run the Viterbi algorithm: the maximum context of class features used. |
maxRight | int | 2 | The number of things to the right that have to be cached to run the Viterbi algorithm: the maximum context of class features used. The maximum possible clique size to use is (maxLeft + maxRight + 1) |
dontExtendTaggy | boolean | false | Don't extend the range of useTaggySequences when maxLeft is increased. |
numFolds | int | 1 | The number of folds to use for cross-validation. CURRENTLY NOT IMPLEMENTED. |
startFold | int | 1 | The starting fold to run. CURRENTLY NOT IMPLEMENTED. |
endFold | int | 1 | The last fold to run. CURRENTLY NOT IMPLEMENTED. |
mergeTags | boolean | false | Whether to merge B- and I- tags. |
splitDocuments | boolean | true | Whether or not to split the data into separate documents for training/testing |
maxDocSize | int | 10000 | If this number is greater than 0, attempt to split documents bigger than this value into multiple documents at sentence boundaries during testing; otherwise do nothing. |
Note: flags/properties overwrite left to right. That is, the parameter setting specified last is the one used.
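As an illustration of the table above, the following sketch loads a small property set with java.util.Properties. The property names come from the table; the file names (train.tsv, ner-model.ser.gz), the chosen values, and the class name NerPropsSketch are placeholders invented for the example. The duplicate-key behavior shown is that of java.util.Properties.load, which happens to mirror the last-setting-wins rule noted above; it is not a statement about how the classifier itself parses its flags.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class NerPropsSketch {

  // Property names are taken from the table above; values and paths are made up.
  static final String EXAMPLE = String.join("\n",
      "trainFile = train.tsv",
      "serializeTo = ner-model.ser.gz",
      "map = word=0,answer=1",
      "useWord = true",
      "useNGrams = true",
      "maxNGramLeng = 4",
      "usePrev = true",
      "useNext = true",
      // Same key given a second time: the later value replaces the earlier one.
      "maxNGramLeng = 6");

  /** Parses properties-file text held in memory. */
  static Properties load(String text) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(text));
    } catch (IOException e) {
      // Not expected for an in-memory reader.
      throw new UncheckedIOException(e);
    }
    return props;
  }

  public static void main(String[] args) {
    Properties props = load(EXAMPLE);
    System.out.println("map = " + props.getProperty("map"));
    System.out.println("maxNGramLeng = " + props.getProperty("maxNGramLeng"));
  }
}
```

In a real run the same text would live in the file passed with -prop propFile.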
DOCUMENTATION ON FEATURE TEMPLATES
w = word
t = tag
p = position (word index in sentence)
c = class
p = paren
g = gazette
a = abbrev
s = shape
r = regent (dependency governor)
h = head word of phrase
n(w) = ngrams from w
g(w) = gazette entries containing w
l(w) = length of w
o(...) = occurrence patterns of words
useReverse reverses meaning of prev, next everywhere below (on in macro)
"Prolog" booleans: , = AND and ; = OR
Mac: Y = turned on in -macro, + = additional positive things relative to -macro for CoNLL NERFeatureFactory (perhaps none...), - = known negative for CoNLL NERFeatureFactory relative to -macro
Bio: + = additional things that are positive for BioCreative, - = things negative relative to -macro
HighMagnitude: There are no (0) to a few (+) to many (+++) high weight features of this template. (? = not used in goodCoNLL, but usually = 0)
Feature Mac Bio CRFFlags HighMagnitude
---------------------------------------------------------------------
w,c Y useWord 0 (useWord is almost useless with unlimited ngram features, but helps a fraction in goodCoNLL, if only because of prior fiddling)
p,c usePosition ?
p=0,c useBeginSent ?
p=0,s,c useBeginSent ?
t,c Y useTags ++
pw,c Y usePrev +
pt,c Y usePrev,useTags 0
nw,c Y useNext ++
nt,c Y useNext,useTags 0
pw,w,c Y useWordPairs +
w,nw,c Y useWordPairs +
pt,t,nt,c useSymTags ?
t,nt,c useSymTags ?
pt,t,c useSymTags ?
pw,nw,c useSymWordPairs ?

pc,c Y usePrev,useSequences,usePrevSequences +++
pc,w,c Y usePrev,useSequences,usePrevSequences 0
nc,c useNext,useSequences,useNextSequences ?
w,nc,c useNext,useSequences,useNextSequences ?
pc,nc,c useNext,usePrev,useSequences,usePrevSequences,useNextSequences ?
w,pc,nc,c useNext,usePrev,useSequences,usePrevSequences,useNextSequences ?

(pw;p2w;p3w;p4w),c + useDisjunctive (out to disjunctionWidth now) +++
(nw;n2w;n3w;n4w),c + useDisjunctive (out to disjunctionWidth now) ++++
(pw;p2w;p3w;p4w),s,c + useDisjunctiveShapeInteraction ?
(nw;n2w;n3w;n4w),s,c + useDisjunctiveShapeInteraction ?
(pw;p2w;p3w;p4w),c + useWideDisjunctive (to wideDisjunctionWidth) ?
(nw;n2w;n3w;n4w),c + useWideDisjunctive (to wideDisjunctionWidth) ?
(ps;p2s;p3s;p4s),c useDisjShape (out to disjunctionWidth now) ?
(ns;n2s;n3s;n4s),c useDisjShape (out to disjunctionWidth now) ?

pt,pc,t,c Y useTaggySequences +
p2t,p2c,pt,pc,t,c Y useTaggySequences,maxLeft>=2 +
p3t,p3c,p2t,p2c,pt,pc,t,c Y useTaggySequences,maxLeft>=3,!dontExtendTaggy ?
p2c,pc,c Y useLongSequences ++
p3c,p2c,pc,c Y useLongSequences,maxLeft>=3 ?
p4c,p3c,p2c,pc,c Y useLongSequences,maxLeft>=4 ?
p2c,pc,c,pw=BOUNDARY useBoundarySequences 0 (OK, but!)

p2t,pt,t,c - useExtraTaggySequences ?
p3t,p2t,pt,t,c - useExtraTaggySequences ?

p2t,pt,t,s,p2c,pc,c - useTaggySequencesShapeInteraction ?
p3t,p2t,pt,t,s,p3c,p2c,pc,c useTaggySequencesShapeInteraction ?

s,pc,c Y useTypeySequences ++
ns,pc,c Y useTypeySequences // error for ps? not? 0
ps,pc,s,c Y useTypeySequences 0
// p2s,p2c,ps,pc,s,c Y useTypeySequences,maxLeft>=2 // duplicated a useTypeSeqs2 feature

n(w),c Y useNGrams (noMidNGrams, MaxNGramLeng, lowercaseNGrams, dehyphenateNGrams) +++
n(w),s,c useNGrams,conjoinShapeNGrams ?

g,c + useGazFeatures // test refining this? ?
pg,pc,c + useGazFeatures ?
ng,c + useGazFeatures ?
// pg,g,c useGazFeatures ?
// pg,g,ng,c useGazFeatures ?
// p2g,p2c,pg,pc,g,c useGazFeatures ?
g,w,c useMoreGazFeatures ?
pg,pc,g,c useMoreGazFeatures ?
g,ng,c useMoreGazFeatures ?

g(w),c useGazette,sloppyGazette (contains same word) ?
g(w),[pw,nw,...],c useGazette,cleanGazette (entire entry matches) ?

s,c Y wordShape >= 0 +++
ps,c Y wordShape >= 0,useTypeSeqs +
ns,c Y wordShape >= 0,useTypeSeqs +
pw,s,c Y wordShape >= 0,useTypeSeqs +
s,nw,c Y wordShape >= 0,useTypeSeqs +
ps,s,c Y wordShape >= 0,useTypeSeqs 0
s,ns,c Y wordShape >= 0,useTypeSeqs ++
ps,s,ns,c Y wordShape >= 0,useTypeSeqs ++
pc,ps,s,c Y wordShape >= 0,useTypeSeqs,useTypeSeqs2 0
p2c,p2s,pc,ps,s,c Y wordShape >= 0,useTypeSeqs,useTypeSeqs2,maxLeft>=2 +++
pc,ps,s,ns,c wordShape >= 0,useTypeSeqs,useTypeSeqs3 ?

p2w,s,c if l(pw) <= 3 Y useLastRealWord // weird features, but work 0
n2w,s,c if l(nw) <= 3 Y useNextRealWord ++
o(pw,w,nw),c Y useOccurrencePatterns // don't fully grok but has to do with capitalized name patterns ++

a,c useAbbr;useMinimalAbbr
pa,a,c useAbbr
a,na,c useAbbr
pa,a,na,c useAbbr
pa,pc,a,c useAbbr;useMinimalAbbr
p2a,p2c,pa,pc,a useAbbr
w,a,c useMinimalAbbr
p2a,p2c,a,c useMinimalAbbr

RESTR. w,(pw,pc;p2w,p2c;p3w,p3c;p4w,p4c) + useParenMatching,maxLeft>=n

c - useClassFeature

p,s,c - useShapeConjunctions
t,s,c - useShapeConjunctions

w,t,c + useWordTag ?
w,pt,c + useWordTag ?
w,nt,c + useWordTag ?

r,c useNPGovernor (only for baseNP words)
r,t,c useNPGovernor (only for baseNP words)
h,c useNPHead (only for baseNP words)
h,t,c useNPHead (only for baseNP words)
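To make the template notation concrete, here is a minimal sketch of how a few of the rows above (w,c; pw,c; nw,c; pw,w,c; w,nw,c; and the disjunctive (pw;p2w;...),c and (nw;n2w;...),c templates) might be turned into feature strings for one position. It is not the NERFeatureFactory implementation: the class TemplateSketch, the feature-name suffixes, and the padding token are assumptions chosen for the example, and the ",c" part of each template is assumed to be supplied by whatever clique the feature is attached to.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class TemplateSketch {

  // Hypothetical token returned for positions outside the sentence.
  static final String PAD = "<PAD>";

  /** Returns the word at index i, or PAD when i falls outside the sentence. */
  static String at(List<String> words, int i) {
    return (i >= 0 && i < words.size()) ? words.get(i) : PAD;
  }

  /** Builds feature strings for position loc; suffixes are illustrative only. */
  static List<String> features(List<String> words, int loc, int disjunctionWidth) {
    Set<String> out = new LinkedHashSet<>();
    String w = at(words, loc);
    String pw = at(words, loc - 1);
    String nw = at(words, loc + 1);

    out.add(w + "-WORD");             // w,c
    out.add(pw + "-PW");              // pw,c
    out.add(nw + "-NW");              // nw,c
    out.add(pw + "-" + w + "-PW-W");  // pw,w,c
    out.add(w + "-" + nw + "-W-NW");  // w,nw,c

    // Disjunctive templates: ';' is OR, so one feature fires per nearby word.
    // The name records the direction (left or right) but not the distance.
    for (int d = 1; d <= disjunctionWidth; d++) {
      out.add(at(words, loc - d) + "-DISJP"); // (pw;p2w;...),c
      out.add(at(words, loc + d) + "-DISJN"); // (nw;n2w;...),c
    }
    return new ArrayList<>(out);
  }

  public static void main(String[] args) {
    List<String> sentence = List.of("Dr", "Jane", "Smith", "visited", "Paris");
    features(sentence, 2, 2).forEach(System.out::println);
  }
}
```

Because the disjunctive feature name omits the offset, "Dr" two words to the left and "Jane" one word to the left both produce a -DISJP feature of the same form, which is what "preserving direction but not position" describes.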
Fields inherited from class FeatureFactory: cliqueC, cliqueCnC, cliqueCp2C, cliqueCp3C, cliqueCp4C, cliqueCp5C, cliqueCpC, cliqueCpCnC, cliqueCpCp2C, cliqueCpCp2Cp3C, cliqueCpCp2Cp3Cp4C, cliqueCpCp2Cp3Cp4Cp5C, flags, knownCliques
Constructor and Description |
---|
NERFeatureFactory() |
Methods inherited from class FeatureFactory: addAllInterningAndSuffixing, getCliques, getCliques, getWord
public void init(SeqClassifierFlags flags)
Overrides: init in class FeatureFactory<IN extends CoreLabel>
public Collection<String> getCliqueFeatures(PaddedList<IN> cInfo, int loc, Clique clique)
Specified by: getCliqueFeatures in class FeatureFactory<IN extends CoreLabel>
Parameters:
cInfo - The complete data set as a List of WordInfo
loc - The index at which to extract features.
clique - The particular clique for which to extract features. It should be a member of the knownCliques list.
Returns:
Collection of the features calculated for the word at the specified position in info.
public void clearMemory()
protected Collection<String> featuresC(PaddedList<IN> cInfo, int loc)
protected Collection<String> featuresCpC(PaddedList<IN> cInfo, int loc)
protected Collection<String> featuresCp2C(PaddedList<IN> cInfo, int loc)
protected Collection<String> featuresCp3C(PaddedList<IN> cInfo, int loc)
protected Collection<String> featuresCp4C(PaddedList<IN> cInfo, int loc)
protected Collection<String> featuresCp5C(PaddedList<IN> cInfo, int loc)
protected Collection<String> featuresCpCp2C(PaddedList<IN> cInfo, int loc)
protected Collection<String> featuresCpCp2Cp3C(PaddedList<IN> cInfo, int loc)
protected Collection<String> featuresCpCp2Cp3Cp4C(PaddedList<IN> cInfo, int loc)
protected Collection<String> featuresCnC(PaddedList<IN> cInfo, int loc)
protected Collection<String> featuresCpCnC(PaddedList<IN> cInfo, int loc)
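Each featuresC, featuresCpC, ... method above produces the features for one clique, that is, one combination of current, previous, and next classes, and getCliqueFeatures selects among them. A minimal stand-alone sketch of that dispatch pattern follows. The enum SketchClique, the class CliqueDispatchSketch, the plain List<String> input, and the feature strings are hypothetical stand-ins; the sketch does not use the library's Clique or PaddedList types and is not the actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

final class CliqueDispatchSketch {

  // Hypothetical stand-in for the cliques: C = current class only,
  // CpC = current class plus previous class.
  enum SketchClique { C, CpC }

  /** Routes a request to the extractor that matches the clique. */
  static Collection<String> getCliqueFeatures(List<String> words, int loc, SketchClique clique) {
    switch (clique) {
      case C:
        return featuresC(words, loc);
      case CpC:
        return featuresCpC(words, loc);
      default:
        throw new IllegalArgumentException("Unknown clique: " + clique);
    }
  }

  /** Features that involve only the current class, e.g. the w,c template. */
  static Collection<String> featuresC(List<String> words, int loc) {
    List<String> features = new ArrayList<>();
    features.add(words.get(loc) + "-WORD");
    return features;
  }

  /** Features that involve the current and previous class, e.g. pc,c and pc,w,c. */
  static Collection<String> featuresCpC(List<String> words, int loc) {
    List<String> features = new ArrayList<>();
    features.add("PSEQ");                     // pc,c
    features.add(words.get(loc) + "-PSEQW");  // pc,w,c
    return features;
  }

  public static void main(String[] args) {
    List<String> sentence = Arrays.asList("Jane", "Smith");
    System.out.println(getCliqueFeatures(sentence, 1, SketchClique.C));
    System.out.println(getCliqueFeatures(sentence, 1, SketchClique.CpC));
  }
}
```

Under this layout, a new feature that looks only at the current class would be added to featuresC, while one that also depends on the previous class would go in featuresCpC.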
public void initGazette()
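The gazette property described earlier expects one entry per line: an entity class name, whitespace, then the entity phrase with single spaces between its tokens. The following sketch parses that format and illustrates the sloppyGazette and cleanGazette matching rules as the property table words them. It is an assumption-laden illustration, not the code behind initGazette(): GazetteSketch, Entry, parseLine, sloppyMatch, and cleanMatch are hypothetical names, and the exact matching logic in the library may differ.

```java
import java.util.Arrays;
import java.util.List;

final class GazetteSketch {

  /** One gazette line: an entity class plus the tokens of the entity phrase. */
  static final class Entry {
    final String entityClass;
    final List<String> tokens;

    Entry(String entityClass, List<String> tokens) {
      this.entityClass = entityClass;
      this.tokens = tokens;
    }
  }

  /** Parses "CLASS token token ..." into an Entry. */
  static Entry parseLine(String line) {
    String[] parts = line.trim().split("\\s+", 2);
    if (parts.length < 2) {
      throw new IllegalArgumentException("Expected 'CLASS entity phrase', got: " + line);
    }
    return new Entry(parts[0], Arrays.asList(parts[1].split(" ")));
  }

  /** Sloppy rule: the word at loc equals any token of the entry. */
  static boolean sloppyMatch(Entry entry, List<String> words, int loc) {
    return entry.tokens.contains(words.get(loc));
  }

  /** Clean rule: the whole entry appears in the sentence and covers position loc. */
  static boolean cleanMatch(Entry entry, List<String> words, int loc) {
    int n = entry.tokens.size();
    // Try every position the current word could occupy inside the entry.
    for (int offset = 0; offset < n; offset++) {
      int start = loc - offset;
      if (start < 0 || start + n > words.size()) {
        continue;
      }
      if (words.subList(start, start + n).equals(entry.tokens)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    Entry entry = parseLine("LOCATION New York City");
    List<String> sentence = Arrays.asList("I", "love", "New", "York", "City", ".");
    System.out.println(entry.entityClass + " " + entry.tokens);
    System.out.println("sloppy: " + sloppyMatch(entry, sentence, 3));
    System.out.println("clean: " + cleanMatch(entry, sentence, 3));
  }
}
```

With the entry "New York City", the sentence "New Jersey" would satisfy the sloppy rule at "New" but not the clean rule, since the full phrase is absent.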