This class contains feature extractors for the MaxentTagger that are applied
only to rare (low-frequency or unknown) words.
The following options are supported:
Name | Args | Effect |
wordshapes | left, right | Word shape features, e.g. transform Foo5 into Xxx# (not exactly like that, but that general idea). Creates individual features for each word left ... right. |
unicodeshapes | left, right | Same thing, but works for some Unicode characters, too. |
unicodeshapeconjunction | left, right | Instead of individual word shape features, combines several word shapes into one feature. |
suffix | length, position | Features for suffixes of the word at position. One feature for each suffix of length 1 ... length. |
prefix | length, position | Features for prefixes of the word at position. One feature for each prefix of length 1 ... length. |
prefixsuffix | length | Features for the concatenated prefix and suffix. One feature for each length 1 ... length. |
capitalizationsuffix | length | Current word only. Combines the suffix with a binary value for whether the word contains any capital letters. |
distsim | filename, left, right | Individual features for each position left ... right. Compares the word at that position with the dictionary in filename. |
distsimconjunction | filename, left, right | A concatenation of the distsim features from left ... right. |
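
The word shape rows above can be illustrated with a minimal sketch. This is not the tagger's actual implementation: the class name WordShapeSketch, the `<pad>` marker for positions outside the sentence, and the `|` separator used for the conjunction are all assumptions made for illustration, and the real shape mapping differs in its details.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a hypothetical stand-in, not the MaxentTagger code.
class WordShapeSketch {
    // Assumed marker for window positions that fall outside the sentence.
    static final String PAD = "<pad>";

    /** Maps upper case to X, lower case to x, digits to #; other characters are kept. */
    static String shape(String word) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (Character.isUpperCase(c)) {
                sb.append('X');
            } else if (Character.isLowerCase(c)) {
                sb.append('x');
            } else if (Character.isDigit(c)) {
                sb.append('#');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** One shape per offset in left ... right, relative to the current word. */
    static List<String> windowShapes(List<String> words, int current, int left, int right) {
        List<String> out = new ArrayList<>();
        for (int offset = left; offset <= right; offset++) {
            int pos = current + offset;
            boolean inRange = pos >= 0 && pos < words.size();
            out.add(inRange ? shape(words.get(pos)) : PAD);
        }
        return out;
    }

    /** A single combined feature built from the shapes in the window. */
    static String shapeConjunction(List<String> words, int current, int left, int right) {
        return String.join("|", windowShapes(words, current, left, right));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("See", "Foo5", "now");
        System.out.println(shape("Foo5"));                      // Xxx#
        System.out.println(windowShapes(words, 1, -1, 1));      // [Xxx, Xxx#, xxx]
        System.out.println(shapeConjunction(words, 1, -1, 1));  // Xxx|Xxx#|xxx
    }
}
```

The individual variant yields one feature per position, so each can be weighted separately; the conjunction variant yields a single feature that only fires when the whole window matches.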
Also available are the macros "naacl2003unknowns",
"lnaacl2003unknowns", and "naacl2003conjunctions".
naacl2003unknowns and lnaacl2003unknowns include suffix extractors
and extractors for specific word shape features, such as whether the
word contains a digit.
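
As a rough sketch of the kinds of features described here, the following shows suffix and prefix extraction for lengths 1 ... length, a contains-digit test, and a suffix combined with a capitalization flag. The class name RareWordAffixSketch, the method names, the `|` separator, and the choice to stop at the word's own length for short words are all assumptions for illustration; the tagger's own extractors may behave differently.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hypothetical stand-in, not the MaxentTagger code.
class RareWordAffixSketch {
    /** Suffixes of length 1 ... length (assumed to be capped at the word's length). */
    static List<String> suffixes(String word, int length) {
        List<String> out = new ArrayList<>();
        int max = Math.min(length, word.length());
        for (int n = 1; n <= max; n++) {
            out.add(word.substring(word.length() - n));
        }
        return out;
    }

    /** Prefixes of length 1 ... length (assumed to be capped at the word's length). */
    static List<String> prefixes(String word, int length) {
        List<String> out = new ArrayList<>();
        int max = Math.min(length, word.length());
        for (int n = 1; n <= max; n++) {
            out.add(word.substring(0, n));
        }
        return out;
    }

    /** Binary feature: does the word contain at least one digit? */
    static boolean hasDigit(String word) {
        return word.chars().anyMatch(Character::isDigit);
    }

    /** Binary feature: does the word contain at least one capital letter? */
    static boolean hasCapital(String word) {
        return word.chars().anyMatch(Character::isUpperCase);
    }

    /** Each suffix paired with the capitalization flag of the whole word. */
    static List<String> capitalizationSuffixes(String word, int length) {
        List<String> out = new ArrayList<>();
        boolean cap = hasCapital(word);
        for (String suffix : suffixes(word, length)) {
            out.add(suffix + "|" + cap);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(suffixes("running", 3));            // [g, ng, ing]
        System.out.println(prefixes("running", 3));            // [r, ru, run]
        System.out.println(hasDigit("Foo5"));                  // true
        System.out.println(capitalizationSuffixes("Foo5", 2)); // [5|true, o5|true]
    }
}
```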