Mutual information

A common feature selection method is to compute as
the expected *mutual information* (MI) of term and
class .^{} MI measures how much information the
presence/absence of a term contributes to making the correct
classification decision on . Formally:

where is a random variable that takes values (the document contains term ) and (the document does not contain ), as defined on page 13.4 , and is a random variable that takes values (the document is in class ) and (the document is not in class ). We write and if it is not clear from context which term and class we are referring to.

For
MLEs of the probabilities,
Equation 130 is equivalent to Equation 131:

where the s are counts of documents that have the values of and that are indicated by the two subscripts. For example, is the number of documents that contain () and are not in (). is the number of documents that contain () and we count documents independent of class membership ( ). is the total number of documents. An example of one of the MLE estimates that transform Equation 130 into Equation 131 is .

**Worked example.**
Consider the class poultry and the
term export in Reuters-RCV1. The counts of the
number of documents with the four possible combinations of
indicator values are as follows:

After plugging these values into Equation 131 we get:

**End worked example.**

To select terms for a given class, we use the feature selection algorithm in Figure 13.6 : We compute the utility measure as and select the terms with the largest values.

Mutual information measures how much information - in the information-theoretic sense - a term contains about the class. If a term's distribution is the same in the class as it is in the collection as a whole, then . MI reaches its maximum value if the term is a perfect indicator for class membership, that is, if the term is present in a document if and only if the document is in the class.

Figure 13.7 shows terms with high
mutual information scores for the six classes
in Figure 13.1 .^{} The selected terms (e.g.,
london, uk, british for the class UK)
are of
obvious utility for making classification decisions for their respective classes.
At the bottom of the list for UK we find terms like peripherals and
tonight (not shown in the figure) that are clearly not helpful in deciding whether the
document is in the class. As you might expect, keeping the
informative terms and eliminating the non-informative ones
tends to reduce noise and improve the classifier's accuracy.

Such an accuracy increase can be observed in Figure 13.8 , which shows as a function of vocabulary size after feature selection for Reuters-RCV1.

This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.

2009-04-07