Despite the differences between the two methods, the
classification accuracy of feature sets selected with
χ² and MI does not seem to differ systematically.
In most text classification problems, there are a few strong
indicators and many weak indicators. As long as all strong
indicators and a large number of weak indicators are
selected, accuracy is expected to be good.
Both methods do this.
Figure 13.8
compares MI and χ² feature selection
for the multinomial model. Peak effectiveness is
virtually the same for both methods.
χ² reaches this
peak later, at 300 features, probably because the rare, but highly
significant features it selects initially do not cover all
documents in the class. However, features selected later (in
the range of 100-300) are of better quality than those
selected by MI.
All three methods - MI, χ², and frequency-based - are
greedy methods. They may select features that
contribute no incremental information over previously selected
features. In Figure 13.7, kong is selected
as the seventh term even though it is highly correlated with
previously selected hong and therefore redundant.
Although such redundancy can negatively impact accuracy,
non-greedy methods (see Section 13.7 for references)
are rarely used in text classification due to their
computational cost.
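The greediness is easy to see in a sketch: every candidate term is scored in isolation by some utility function and the k best are kept, so nothing in the procedure checks whether a term adds information beyond the features already selected. The Python sketch below is illustrative, not the chapter's pseudocode; the utility shown is plain within-class frequency (N11), and the toy counts are made up.

```python
def frequency_utility(n00, n01, n10, n11):
    """Utility of a term = its document frequency within the class (N11)."""
    return n11

def select_features(counts, utility, k):
    """Greedy feature selection: score each term in isolation, keep the top k.

    counts maps term -> (N00, N01, N10, N11) contingency counts for one class.
    Because a term's score never depends on the features already selected,
    redundant pairs can both end up in the result.
    """
    return sorted(counts, key=lambda t: utility(*counts[t]), reverse=True)[:k]

# Toy counts, made up for illustration: hong and kong are near-duplicates.
toy = {
    "hong":   (90_000, 50, 400, 550),
    "kong":   (90_010, 52, 395, 543),
    "dollar": (89_000, 300, 1400, 300),
}
print(select_features(toy, frequency_utility, k=2))  # ['hong', 'kong']
```

The same loop works for MI or χ² selection: only the utility function changes, which is why all three selectors share the same blindness to redundancy.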
Exercises.
Consider the following frequencies for the class coffee for four terms in the first 100,000 documents of Reuters-RCV1:

term        N00      N01    N10    N11
brazil      98,012   102    1835   51
council     96,322   133    3525   20
producers   98,524   119    1118   34
roasted     99,824   143    23     10

Select two of these four terms based on (i) χ², (ii) mutual information, (iii) frequency.
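All three selection criteria can be computed directly from the four cell counts. The sketch below applies the expected-mutual-information sum and the closed-form χ² statistic to the table above, assuming the column order N00, N01, N10, N11 shown there (first index = term indicator, second = class indicator); frequency selection simply ranks terms by N11. The function names are illustrative.

```python
import math

# Contingency counts (N00, N01, N10, N11) from the exercise table above.
counts = {
    "brazil":    (98_012, 102, 1835, 51),
    "council":   (96_322, 133, 3525, 20),
    "producers": (98_524, 119, 1118, 34),
    "roasted":   (99_824, 143,   23, 10),
}

def chi_square(n00, n01, n10, n11):
    """Closed-form chi-square statistic of the 2x2 term/class contingency table."""
    n = n00 + n01 + n10 + n11
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return num / den

def mutual_information(n00, n01, n10, n11):
    """Expected mutual information of the term and class indicator variables."""
    n = n00 + n01 + n10 + n11
    total = 0.0
    # Each tuple is (cell count, marginal matching the cell's term indicator,
    # marginal matching the cell's class indicator).
    for n_cell, n_term, n_class in [(n11, n11 + n10, n11 + n01),
                                    (n10, n11 + n10, n10 + n00),
                                    (n01, n01 + n00, n11 + n01),
                                    (n00, n01 + n00, n10 + n00)]:
        if n_cell:  # an empty cell contributes 0 to the sum
            total += (n_cell / n) * math.log2(n * n_cell / (n_term * n_class))
    return total

for term, (n00, n01, n10, n11) in counts.items():
    print(f"{term:10s}  chi2 = {chi_square(n00, n01, n10, n11):8.2f}  "
          f"MI = {mutual_information(n00, n01, n10, n11):.6f}  "
          f"freq(N11) = {n11}")
```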