public class ExtractorFramesRare
This class contains feature extractors for the MaxentTagger that are only
applied to rare (low frequency/unknown) words.
The following options are supported:
Word shape features, e.g., transform Foo5 into Xxx#
(the actual mapping differs in detail, but that is the general idea).
Creates an individual feature for each word position from left to right.
If only one argument is given, as in wordshapes(-2), then right is taken as 0.
If left is greater than right, no features are made.
Same as the word shape features above, but works for Unicode characters generally.
Instead of individual word shape features, combines the word shapes
of several positions into one feature.
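As a rough illustration of the word shape options described above, here is a minimal sketch in Java. The class name WordShapeSketch, its methods, the "NA" placeholder, and the exact character mapping are all assumptions made for this example; the tagger's real mapping is more detailed.

```java
import java.util.ArrayList;
import java.util.List;

class WordShapeSketch {

  // Assumed mapping (not the tagger's actual one): uppercase -> 'X',
  // lowercase -> 'x', digit -> '#', any other character is kept as-is.
  static String shape(String word) {
    StringBuilder sb = new StringBuilder(word.length());
    for (int i = 0; i < word.length(); i++) {
      char c = word.charAt(i);
      if (Character.isUpperCase(c)) {
        sb.append('X');
      } else if (Character.isLowerCase(c)) {
        sb.append('x');
      } else if (Character.isDigit(c)) {
        sb.append('#');
      } else {
        sb.append(c);
      }
    }
    return sb.toString();
  }

  // One feature per offset in left..right, relative to the current word.
  // If left > right the loop body never runs, so no features are made.
  static List<String> shapeFeatures(List<String> sentence, int current, int left, int right) {
    List<String> features = new ArrayList<>();
    for (int offset = left; offset <= right; offset++) {
      int pos = current + offset;
      // "NA" is a hypothetical placeholder for positions outside the sentence.
      String s = (pos >= 0 && pos < sentence.size()) ? shape(sentence.get(pos)) : "NA";
      features.add("shape[" + offset + "]=" + s);
    }
    return features;
  }

  // Conjunction variant: the same shapes joined into a single feature string.
  static String shapeConjunction(List<String> sentence, int current, int left, int right) {
    return String.join("|", shapeFeatures(sentence, current, left, right));
  }
}
```

The individual variant yields one feature per position, while the conjunction variant yields a single, sparser feature that only fires when the whole window matches.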
Features for suffixes of the word at the given position. One feature for
each suffix of length 1 ... length.
Features for prefixes of the word at the given position. One feature for
each prefix of length 1 ... length.
Features for a concatenated prefix and suffix. One feature for
each affix length 1 ... length.
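The suffix, prefix, and concatenated prefix/suffix features above can be sketched as follows. This is an illustrative sketch only: the class AffixFeatureSketch, the "|" separator, and the choice to skip affix lengths longer than the word are assumptions, not the tagger's documented behavior.

```java
import java.util.ArrayList;
import java.util.List;

class AffixFeatureSketch {

  // Suffixes of length 1..length, shortest first.
  // Assumption: lengths longer than the word are skipped.
  static List<String> suffixes(String word, int length) {
    List<String> out = new ArrayList<>();
    for (int k = 1; k <= length && k <= word.length(); k++) {
      out.add(word.substring(word.length() - k));
    }
    return out;
  }

  // Prefixes of length 1..length, shortest first.
  static List<String> prefixes(String word, int length) {
    List<String> out = new ArrayList<>();
    for (int k = 1; k <= length && k <= word.length(); k++) {
      out.add(word.substring(0, k));
    }
    return out;
  }

  // For each k in 1..length, the k-character prefix joined to the
  // k-character suffix. The "|" separator is a hypothetical choice.
  static List<String> prefixSuffixes(String word, int length) {
    List<String> out = new ArrayList<>();
    for (int k = 1; k <= length && k <= word.length(); k++) {
      out.add(word.substring(0, k) + "|" + word.substring(word.length() - k));
    }
    return out;
  }
}
```

For example, with the word "running" and length 3, this sketch produces the suffixes g, ng, ing and the prefixes r, ru, run.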
Current word only. Combines character suffixes up to size length with a
binary value for whether the word contains any capital letters.
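One way to picture the capitalization-plus-suffix feature just described is the sketch below. The class CapSuffixSketch and the "suffix|cap=true/false" feature format are hypothetical choices for illustration; only the idea of pairing each suffix with a contains-a-capital flag comes from the description above.

```java
import java.util.ArrayList;
import java.util.List;

class CapSuffixSketch {

  // True if any character of the word is an uppercase letter.
  static boolean hasCapital(String word) {
    for (int i = 0; i < word.length(); i++) {
      if (Character.isUpperCase(word.charAt(i))) {
        return true;
      }
    }
    return false;
  }

  // One feature per suffix of length 1..length (capped at the word length),
  // each paired with the capitalization flag. Format is an assumption.
  static List<String> features(String word, int length) {
    boolean cap = hasCapital(word);
    List<String> out = new ArrayList<>();
    for (int k = 1; k <= length && k <= word.length(); k++) {
      out.add(word.substring(word.length() - k) + "|cap=" + cap);
    }
    return out;
  }
}
```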
Arguments: filename, left, right.
Individual features for each position left ... right.
Compares the word at that position with the dictionary in filename.
Arguments: filename, left, right.
A concatenation of distsim features from left ... right.
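The distsim lookups described above might look roughly like the following sketch. Everything specific here is an assumption: the class DistsimSketch, a dictionary format of one whitespace-separated "word cluster" pair per line, and the "UNK" and "NA" placeholders. The real dictionary format is whatever the file named by filename uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DistsimSketch {

  // Assumed format: each line is "word cluster", separated by whitespace.
  // Lines could come from java.nio.file.Files.readAllLines(path).
  static Map<String, String> parse(List<String> lines) {
    Map<String, String> lexicon = new HashMap<>();
    for (String line : lines) {
      String[] parts = line.trim().split("\\s+");
      if (parts.length >= 2) {
        lexicon.put(parts[0], parts[1]);
      }
    }
    return lexicon;
  }

  // One cluster feature per offset in left..right around the current word.
  // "UNK" (word not in dictionary) and "NA" (outside sentence) are hypothetical.
  static List<String> features(Map<String, String> lexicon, List<String> sentence,
                               int current, int left, int right) {
    List<String> out = new ArrayList<>();
    for (int offset = left; offset <= right; offset++) {
      int pos = current + offset;
      if (pos < 0 || pos >= sentence.size()) {
        out.add("NA");
      } else {
        out.add(lexicon.getOrDefault(sentence.get(pos), "UNK"));
      }
    }
    return out;
  }

  // Conjunction variant: the per-position clusters joined into one feature.
  static String conjunction(Map<String, String> lexicon, List<String> sentence,
                            int current, int left, int right) {
    return String.join("|", features(lexicon, sentence, current, left, right));
  }
}
```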
Also available are the macros "naacl2003unknowns",
"lnaacl2003unknowns", and "naacl2003conjunctions".
naacl2003unknowns and lnaacl2003unknowns include suffix extractors
and extractors for specific word shape features, such as containing
or not containing a digit.
The macro "frenchunknowns" expands to five extractors specific
to French, which test whether the end of the word matches
common suffixes for various POS classes and plural words. Adding
these extractors did not improve accuracy over the regular
naacl2003unknowns extractor macro, though.
Gets an array of rare-word feature Extractors identified by a name.
Note: Names used here must also be known in getExtractorFrames, so that
it can produce appropriate error messages. So if you add a keyword here,
add it there as one to be ignored, too. (In the next iteration, this
class and ExtractorFrames should probably just be combined.)
identifier - Describes a set of extractors for rare word features