Comparison of feature selection methods

Next: Evaluation of text classification Up: Feature selection Previous: Feature selection for multiple Contents Index

Comparison of feature selection methods

Mutual information and $\chi ^2$ represent rather different feature selection methods. The independence of term $\tcword$ and class

can sometimes be rejected with high confidence even if $\tcword$ carries little information about membership of a document in

. This is particularly true for rare terms. If a term occurs once in a large collection and that one occurrence is in the poultry class, then this is statistically significant. But a single occurrence is not very informative according to the information-theoretic definition of information. Because its criterion is significance, $\chi ^2$ selects more rare terms (which are often less reliable indicators) than mutual information. But the selection criterion of mutual information also does not necessarily select the terms that maximize classification accuracy.

Despite the differences between the two methods, the classification accuracy of feature sets selected with $\chi ^2$ and MI does not seem to differ systematically. In most text classification problems, there are a few strong indicators and many weak indicators. As long as all strong indicators and a large number of weak indicators are selected, accuracy is expected to be good. Both methods do this.

Figure 13.8 compares MI and $\chi ^2$ feature selection for the multinomial model. Peak effectiveness is virtually the same for both methods. $\chi ^2$ reaches this peak later, at 300 features, probably because the rare, but highly significant features it selects initially do not cover all documents in the class. However, features selected later (in the range of 100-300) are of better quality than those selected by MI.

All three methods - MI, $\chi ^2$ and frequency based - are greedy methods. They may select features that contribute no incremental information over previously selected features. In Figure 13.7 , kong is selected as the seventh term even though it is highly correlated with previously selected hong and therefore redundant. Although such redundancy can negatively impact accuracy, non-greedy methods (see Section 13.7 for references) are rarely used in text classification due to their computational cost.

Exercises.

Consider the following frequencies for the class coffee for four terms in the first 100,000 documents of Reuters-RCV1:

term $\observationo_{00}$ $\observationo_{01}$ $\observationo_{10}$ $\observationo_{11}$

brazil 98,012 102 1835 51

council 96,322 133 3525 20

producers 98,524 119 1118 34

roasted 99,824 143 23 10

Select two of these four terms based on (i) $\chi ^2$ , (ii) mutual information, (iii) frequency .

Next: Evaluation of text classification Up: Feature selection Previous: Feature selection for multiple Contents Index

© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.
2009-04-07


term	$\observationo_{00}$	$\observationo_{01}$	$\observationo_{10}$	$\observationo_{11}$
brazil	98,012	102	1835	51
council	96,322	133	3525	20
producers	98,524	119	1118	34
roasted	99,824	143	23	10