This class contains the basic feature extractors used for all words and
tag sequences (and interaction terms) for the MaxentTagger, but not the
feature extractors explicitly targeting generalization for rare or unknown
words.
The following options are supported:
|words||begin, end
||Individual features for words begin ... end.
If only one argument is given, as in words(-2), then end is taken as 0. If
begin is greater than end, no features are made.|
||Individual features for tags begin ... end|
||One feature for the pair of words w1, w2|
|biwords||begin, end
||One feature for each sequential pair of words
from begin to end|
||One feature for the pair of tags t1, t2|
||One feature for each word begin ... end, lowercased|
|order||left, right
||A feature for tags left through 0 and a feature for
tags 0 through right. Lower order left and right features are
also added.
This gets very expensive for higher order terms.|
|wordTag||w, t
||A feature combining word w and tag t.|
|wordTwoTags||w, t1, t2
||A feature combining word w and tags t1, t2.|
|threeTags||t1, t2, t3
||A feature combining tags t1, t2, t3.|
||A feature that looks at the left length words for something that
appears to be a VBN (in English) without looking at the actual tags.
It is zeroth order, as it does not look at the tag predictions.
It is also never used, since it doesn't seem to help.|
||Word shape features, e.g., transforming Foo5 into Xxx#
(not exactly like that, but that is the general idea).
Creates individual features for each word left ... right.
Compare with the feature "wordshapes" in ExtractorFramesRare,
which is only applied to rare words. Fairly English-specific.
Slightly increases accuracy.|
||Same thing, but works for Unicode characters more generally.|
||Instead of individual word shape features, combines several
word shapes into one feature.|
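
The begin ... end rule stated for words in the table above can be sketched in a few lines of Java. This is only an illustration of the argument semantics as described here (end defaults to 0, the range is inclusive, and begin greater than end yields nothing). The class `WordRangeSketch` and its methods are hypothetical names invented for this example and are not part of the tagger.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of the tagger: models the begin/end argument rule. */
final class WordRangeSketch {

  /** One-argument form: end is taken as 0, as for words(-2). */
  static List<Integer> positions(int begin) {
    return positions(begin, 0);
  }

  /** Every offset from begin to end inclusive; empty when begin is greater than end. */
  static List<Integer> positions(int begin, int end) {
    List<Integer> result = new ArrayList<>();
    for (int i = begin; i <= end; i++) {
      result.add(i);
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(positions(-2));     // [-2, -1, 0]
    System.out.println(positions(-1, 1));  // [-1, 0, 1]
    System.out.println(positions(1, -1));  // []
  }
}
```

Each returned offset stands for one individual feature, so words(-1,1) under this reading produces three features: the previous word, the current word, and the next word.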
See ExtractorFramesRare for more options.
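
To make the word shape idea concrete, here is a minimal sketch of a shape function that maps Foo5 to Xxx#. As the table notes, the real shape features are not exactly like this. The class `WordShapeSketch` and the per-character mapping below are assumptions chosen only to reproduce that one example.

```java
/** Hypothetical illustration of a word shape; the tagger's own shape function differs. */
final class WordShapeSketch {

  /** Maps uppercase letters to X, lowercase letters to x, digits to #, and keeps the rest. */
  static String shape(String word) {
    StringBuilder sb = new StringBuilder(word.length());
    for (int i = 0; i < word.length(); i++) {
      char c = word.charAt(i);
      if (Character.isUpperCase(c)) {
        sb.append('X');
      } else if (Character.isLowerCase(c)) {
        sb.append('x');
      } else if (Character.isDigit(c)) {
        sb.append('#');
      } else {
        sb.append(c);  // punctuation and other characters pass through unchanged
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(shape("Foo5"));  // Xxx#
  }
}
```

Because many different words collapse to the same shape, a feature like this lets the model generalize over capitalization and digit patterns instead of memorizing individual tokens.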
There are also macro features:
left3words = words(-1,1),order(2)
left5words = words(-2,2),order(2)
generic = words(-1,1),order(2),biwords(-1,0),wordTag(0,-1)
german = some random stuff
sighan2005 = some other random stuff
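
One way to read the macros listed above is as shorthand that expands into the feature list on the right-hand side. The sketch below assumes plain textual substitution, which may not match how the tagger actually processes them. `ArchMacroSketch`, `splitTopLevel`, and `expand` are hypothetical names, and only the three macros whose expansions are spelled out above are included.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: treats macro features as textual shorthand for feature lists. */
final class ArchMacroSketch {

  static final Map<String, String> MACROS = new LinkedHashMap<>();
  static {
    MACROS.put("left3words", "words(-1,1),order(2)");
    MACROS.put("left5words", "words(-2,2),order(2)");
    MACROS.put("generic", "words(-1,1),order(2),biwords(-1,0),wordTag(0,-1)");
  }

  /** Splits on commas that are outside parentheses, so words(-1,1) stays in one piece. */
  static List<String> splitTopLevel(String arch) {
    List<String> items = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    int depth = 0;
    for (char c : arch.toCharArray()) {
      if (c == '(') depth++;
      if (c == ')') depth--;
      if (c == ',' && depth == 0) {
        items.add(current.toString().trim());
        current.setLength(0);
      } else {
        current.append(c);
      }
    }
    if (current.length() > 0) {
      items.add(current.toString().trim());
    }
    return items;
  }

  /** Replaces each known macro with its expansion and leaves other items untouched. */
  static String expand(String arch) {
    List<String> out = new ArrayList<>();
    for (String item : splitTopLevel(arch)) {
      out.add(MACROS.getOrDefault(item, item));
    }
    return String.join(",", out);
  }

  public static void main(String[] args) {
    System.out.println(expand("left3words,wordTag(0,-1)"));
    // words(-1,1),order(2),wordTag(0,-1)
  }
}
```

Splitting only on top-level commas matters because the feature arguments themselves contain commas.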
The left3words architectures are faster, but slightly less
accurate, than the bidirectional architectures.
'naacl2003unknowns' was our traditional set of unknown word
features, but you can now specify features more flexibly via the
various other supported keywords.