A common feature selection method is to compute $A(t,c)$ as the expected mutual information (MI) of term $t$ and class $c$. MI measures how much information the presence/absence of a term contributes to making the correct classification decision on $c$. Formally:

$$
I(U;C) = \sum_{e_t \in \{1,0\}} \sum_{e_c \in \{1,0\}} P(U{=}e_t, C{=}e_c) \log_2 \frac{P(U{=}e_t, C{=}e_c)}{P(U{=}e_t)\,P(C{=}e_c)}
\tag{130}
$$

where $U$ is a random variable that takes values $e_t = 1$ (the document contains term $t$) and $e_t = 0$ (the document does not contain $t$), and $C$ is a random variable that takes values $e_c = 1$ (the document is in class $c$) and $e_c = 0$ (the document is not in class $c$).

For MLEs of the probabilities, Equation 130 is equivalent to Equation 131:

$$
I(U;C) = \frac{N_{11}}{N}\log_2\frac{N N_{11}}{N_{1.}N_{.1}}
+ \frac{N_{01}}{N}\log_2\frac{N N_{01}}{N_{0.}N_{.1}}
+ \frac{N_{10}}{N}\log_2\frac{N N_{10}}{N_{1.}N_{.0}}
+ \frac{N_{00}}{N}\log_2\frac{N N_{00}}{N_{0.}N_{.0}}
\tag{131}
$$

where the $N$s are counts of documents that have the values of $e_t$ and $e_c$ indicated by the two subscripts: for example, $N_{10}$ is the number of documents that contain $t$ ($e_t = 1$) and are not in $c$ ($e_c = 0$), $N_{1.} = N_{10} + N_{11}$ is the number of documents that contain $t$ counted independently of class membership, and $N = N_{00} + N_{01} + N_{10} + N_{11}$ is the total number of documents.
Worked example. Consider the class poultry and the term export in Reuters-RCV1. The counts of the number of documents with the four possible combinations of indicator values are as follows:

$$
\begin{array}{l|cc}
 & e_c = e_{\text{poultry}} = 1 & e_c = e_{\text{poultry}} = 0 \\
\hline
e_t = e_{\text{export}} = 1 & N_{11} = 49 & N_{10} = 27{,}652 \\
e_t = e_{\text{export}} = 0 & N_{01} = 141 & N_{00} = 774{,}106 \\
\end{array}
$$
After plugging these values into Equation 131 we get:

$$
\begin{aligned}
I(U;C) &= \frac{49}{801{,}948}\log_2\frac{801{,}948 \cdot 49}{27{,}701 \cdot 190}
+ \frac{141}{801{,}948}\log_2\frac{801{,}948 \cdot 141}{774{,}247 \cdot 190} \\
&\quad + \frac{27{,}652}{801{,}948}\log_2\frac{801{,}948 \cdot 27{,}652}{27{,}701 \cdot 801{,}758}
+ \frac{774{,}106}{801{,}948}\log_2\frac{801{,}948 \cdot 774{,}106}{774{,}247 \cdot 801{,}758} \\
&\approx 0.0001105
\end{aligned}
$$
End worked example.
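To make Equation 131 concrete, here is a minimal Python sketch that computes $I(U;C)$ from the four document counts. The function name is our own choice, as is the convention that zero-count cells contribute nothing (the limit of $x \log x$ at $0$), which is not needed in this particular example:

```python
import math

def mutual_information(n11, n10, n01, n00):
    """Expected mutual information I(U;C) of a term and a class,
    computed from the four document counts of Equation 131.
    Cells with a zero count contribute 0 (the limit of x log x at 0)."""
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00   # marginals: term present / absent
    n_1, n_0 = n11 + n01, n10 + n00   # marginals: in class / not in class
    mi = 0.0
    for n_tc, n_t, n_c in ((n11, n1_, n_1), (n01, n0_, n_1),
                           (n10, n1_, n_0), (n00, n0_, n_0)):
        if n_tc > 0:
            mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi

# Worked example: class "poultry", term "export" in Reuters-RCV1.
print(mutual_information(49, 27652, 141, 774106))  # ~0.0001105
```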
To select $k$ terms for a given class $c$, we use the feature selection algorithm in Figure 13.6: we compute the utility measure as $A(t,c) = I(U_t; C_c)$ and select the $k$ terms with the largest values.
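The following Python sketch is one way to realize the algorithm of Figure 13.6. It reuses the mutual_information helper defined above; the data layout (documents as token lists with a parallel list of labels) and the name select_features are our assumptions for illustration, not notation from the text:

```python
from collections import Counter

def select_features(docs, labels, c, k):
    """Score each vocabulary term with A(t, c) = I(U_t; C_c) and
    return the k terms with the largest mutual information."""
    in_class = [label == c for label in labels]
    n, n_pos = len(docs), sum(in_class)
    # Document frequencies of each term inside and outside class c.
    df_pos, df_neg = Counter(), Counter()
    for tokens, pos in zip(docs, in_class):
        for t in set(tokens):
            (df_pos if pos else df_neg)[t] += 1
    # N11 = df_pos[t], N10 = df_neg[t]; the other two cells follow
    # from the class sizes, since every document falls in one cell.
    scores = {
        t: mutual_information(df_pos[t], df_neg[t],
                              n_pos - df_pos[t], (n - n_pos) - df_neg[t])
        for t in set(df_pos) | set(df_neg)
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

A call such as select_features(docs, labels, "poultry", 10) would return the ten highest-MI terms for the class poultry.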
Mutual information measures how much information - in the information-theoretic sense - a term contains about the class. If a term's distribution is the same in the class as it is in the collection as a whole, then $I(U;C) = 0$. MI reaches its maximum value if the term is a perfect indicator for class membership, that is, if the term is present in a document if and only if the document is in the class.
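Both boundary cases are easy to check numerically with the mutual_information sketch above, on made-up counts of our own:

```python
# Term distribution identical inside and outside the class
# (present in 10% of documents either way): MI is exactly 0.
print(mutual_information(10, 90, 90, 810))   # 0.0

# Perfect indicator (term present iff the document is in the class):
# MI attains its maximum, the entropy H(C) of the class variable.
print(mutual_information(100, 0, 0, 900))    # ~0.469 bits
```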
Figure 13.7 shows terms with high mutual information scores for the six classes in Figure 13.1. The selected terms (e.g., london, uk, british for the class UK) are of obvious utility for making classification decisions for their respective classes. At the bottom of the list for UK we find terms like peripherals and tonight (not shown in the figure) that are clearly not helpful in deciding whether the document is in the class. As you might expect, keeping the informative terms and eliminating the non-informative ones tends to reduce noise and improve the classifier's accuracy.
Such an accuracy increase can be observed in Figure 13.8, which shows $F_1$ as a function of vocabulary size after feature selection for Reuters-RCV1. Comparing $F_1$ at 132,776 features (corresponding to selection of all features) and at 10-100 features, we see that MI feature selection increases $F_1$ by about 0.1 for the multinomial model and by more than 0.2 for the Bernoulli model. For the Bernoulli model, $F_1$ peaks early, at ten features selected. At that point, the Bernoulli model is better than the multinomial model: when basing a classification decision on only a few features, it is more robust to consider binary occurrence only. For the multinomial model (MI feature selection), the peak occurs later, at 100 features, and its effectiveness recovers somewhat at the end when we use all features. The reason is that the multinomial model takes the number of occurrences into account in parameter estimation and classification and therefore better exploits a larger number of features than the Bernoulli model. Regardless of the differences between the two methods, using a carefully selected subset of the features results in better effectiveness than using all features.