
Chi-square feature selection

Another popular feature selection method is $\chi^2$. In statistics, the $\chi^2$ test is applied to test the independence of two events, where two events $A$ and $B$ are defined to be independent if $P(AB) = P(A)P(B)$ or, equivalently, $P(A|B) = P(A)$ and $P(B|A) = P(B)$. In feature selection, the two events are occurrence of the term and occurrence of the class. We then rank terms with respect to the following quantity:

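$$X^2(D,t,c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}}$$ (133)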

where $e_t$ and $e_c$ are defined as in Equation 130. $N$ is the observed frequency in $D$ and $E$ the expected frequency. For example, $E_{11}$ is the expected frequency of $t$ and $c$ occurring together in a document assuming that term and class are independent.
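As a minimal sketch of Equation 133, the function below computes $X^2$ from the four cells of a term/class contingency table. The function name `chi_square` and its argument names are illustrative assumptions, not part of the text; the first index of each count stands for term occurrence ($e_t$) and the second for class membership ($e_c$). Expected counts are derived from the row and column totals under the independence assumption.

```python
def chi_square(n11, n10, n01, n00):
    """X^2 of Equation 133 for one term/class pair (illustrative helper).

    n_{e_t e_c}: first index = term occurs in the document (1) or not (0),
    second index = document is in the class (1) or not (0).
    """
    observed = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
    n = sum(observed.values())                 # total number of documents N
    term_totals = {1: n11 + n10, 0: n01 + n00}   # documents with / without the term
    class_totals = {1: n11 + n01, 0: n10 + n00}  # documents in / not in the class

    x2 = 0.0
    for (e_t, e_c), n_obs in observed.items():
        # Expected count under independence: N * P(e_t) * P(e_c)
        expected = term_totals[e_t] * class_totals[e_c] / n
        x2 += (n_obs - expected) ** 2 / expected
    return x2


print(chi_square(10, 20, 30, 60))  # independent table -> 0.0
print(chi_square(30, 0, 0, 70))    # term and class always co-occur -> 100.0
```

The sketch does not guard against a row or column total of zero, in which case an expected count is zero and the division fails; a real implementation would need to decide how to treat such terms.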

Worked example. We first compute $E_{11}$ for the data in Example 13.5.1:

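$$E_{11} = N \times P(t) \times P(c) = N \times \frac{N_{11}+N_{10}}{N} \times \frac{N_{11}+N_{01}}{N}$$ (134)

$$= N \times \frac{49+27{,}652}{N} \times \frac{49+141}{N} \approx 6.6$$ (135)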

where $N$ is the total number of documents as before.

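We compute the other $E_{e_t e_c}$ in the same way:

                   e_poultry = 1                       e_poultry = 0
e_export = 1       N_11 = 49,  E_11 ≈ 6.6              N_10 = 27,652,  E_10 ≈ 27,694.4
e_export = 0       N_01 = 141, E_01 ≈ 183.4            N_00 = 774,106, E_00 ≈ 774,063.6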

Plugging these values into Equation 133, we get a $X^2$ value of 284:

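$$X^2(D,t,c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}} \approx 284$$ (136)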

End worked example.
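The arithmetic of the worked example can be checked with a short script. This is a sketch under one stated assumption: the four observed counts are taken from Example 13.5.1 (term export, class poultry), which is not reproduced in this section. With those counts the script arrives at $E_{11} \approx 6.6$ and $X^2 \approx 284$, matching the values quoted above.

```python
# Counts assumed from Example 13.5.1 (term "export", class "poultry").
# Key = (e_t, e_c): term occurs in document?, document is in class?
observed = {
    (1, 1): 49,
    (1, 0): 27_652,
    (0, 1): 141,
    (0, 0): 774_106,
}
n = sum(observed.values())  # total number of documents N

# Marginal totals for term occurrence and class membership
term_totals = {e_t: observed[(e_t, 1)] + observed[(e_t, 0)] for e_t in (0, 1)}
class_totals = {e_c: observed[(1, e_c)] + observed[(0, e_c)] for e_c in (0, 1)}

# Expected counts under independence (Equation 134 applied to every cell)
expected = {
    (e_t, e_c): term_totals[e_t] * class_totals[e_c] / n
    for (e_t, e_c) in observed
}

# Equation 133: sum of squared deviations, each scaled by the expected count
x2 = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)

for key in observed:
    print(key, observed[key], round(expected[key], 1))
print("X^2 =", round(x2))  # rounds to 284
```

Almost all of the total comes from the $(1,1)$ cell: 49 observed co-occurrences against roughly 6.6 expected.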

$p$        $\chi^2$ critical value
0.1        2.71
0.05       3.84
0.01       6.63
0.005      7.88
0.001      10.83

Table 13.6: Critical values of the $\chi^2$ distribution with one degree of freedom. For example, if the two events are independent, then $P(X^2 > 6.63) < 0.01$. So for $X^2 > 6.63$ the assumption of independence can be rejected with 99% confidence.

$X^2$ is a measure of how much expected counts $E$ and observed counts $N$ deviate from each other. A high value of $X^2$ indicates that the hypothesis of independence, which implies that expected and observed counts are similar, is incorrect. In our example, $X^2 \approx 284 > 10.83$. Based on Table 13.6, we can reject the hypothesis that poultry and export are independent with only a 0.001 chance of being wrong. Equivalently, we say that the outcome $X^2 \approx 284 > 10.83$ is statistically significant at the 0.001 level. If the two events are dependent, then the occurrence of the term makes the occurrence of the class more likely (or less likely), so it should be helpful as a feature. This is the rationale of $\chi^2$ feature selection.
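The table lookup described above can be sketched as follows. The critical values are copied from Table 13.6; the names `CRITICAL_VALUES` and `strictest_level` are hypothetical and chosen only for this illustration. The function returns the smallest tabulated $p$ at which independence can be rejected, or `None` if the statistic does not exceed even the 0.1 threshold.

```python
# (p, critical value) pairs from Table 13.6, one degree of freedom
CRITICAL_VALUES = [
    (0.1, 2.71),
    (0.05, 3.84),
    (0.01, 6.63),
    (0.005, 7.88),
    (0.001, 10.83),
]


def strictest_level(x2):
    """Smallest tabulated p whose critical value x2 exceeds, else None."""
    rejected = [p for p, critical in CRITICAL_VALUES if x2 > critical]
    return min(rejected) if rejected else None


print(strictest_level(284))  # 0.001, as in the poultry/export example
print(strictest_level(5.0))  # 0.05 (exceeds 3.84 but not 6.63)
print(strictest_level(1.0))  # None: independence cannot be rejected
```

For feature selection only the ranking of terms by $X^2$ matters, so a lookup like this is mainly useful for interpreting individual scores.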

An arithmetically simpler way of computing $X^2$ is the following:

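$$X^2(D,t,c) = \frac{(N_{11}+N_{10}+N_{01}+N_{00}) \times (N_{11}N_{00} - N_{10}N_{01})^2}{(N_{11}+N_{01}) \times (N_{11}+N_{10}) \times (N_{10}+N_{00}) \times (N_{01}+N_{00})}$$ (137)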

This is equivalent to Equation 133 (Exercise 13.8).
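A numerical check is no substitute for the proof asked for in the exercise, but it can illustrate the equivalence. The sketch below implements the closed form of Equation 137 alongside the definition of Equation 133 and compares them on the counts assumed from Example 13.5.1 (not reproduced in this section). Both function names are illustrative.

```python
import math


def chi_square_definition(n11, n10, n01, n00):
    """Equation 133: sum over the four cells of (N - E)^2 / E."""
    n = n11 + n10 + n01 + n00
    cells = [
        (n11, (n11 + n10) * (n11 + n01) / n),
        (n10, (n11 + n10) * (n10 + n00) / n),
        (n01, (n01 + n00) * (n11 + n01) / n),
        (n00, (n01 + n00) * (n10 + n00) / n),
    ]
    return sum((obs - exp) ** 2 / exp for obs, exp in cells)


def chi_square_shortcut(n11, n10, n01, n00):
    """Equation 137: closed form that needs no expected counts."""
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return numerator / denominator


# Counts assumed from Example 13.5.1 (term "export", class "poultry")
counts = (49, 27_652, 141, 774_106)
by_definition = chi_square_definition(*counts)
by_shortcut = chi_square_shortcut(*counts)

print(round(by_definition), round(by_shortcut))  # both round to 284
print(math.isclose(by_definition, by_shortcut, rel_tol=1e-9))  # True
```

With integer counts, Python evaluates the numerator and denominator of the shortcut exactly before the final division, so it avoids the rounding that accumulates when four expected counts are formed separately.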
