
Assessing chi-square as a feature selection method

From a statistical point of view, $\chi^2$ feature selection is problematic. For a test with one degree of freedom, the so-called Yates correction should be used (see Section 13.7), which makes it harder to reach statistical significance. Also, whenever a statistical test is used multiple times, the probability of getting at least one error increases. If 1000 hypotheses are rejected, each with 0.05 error probability, then $0.05 \times 1000 = 50$ calls of the test will be wrong on average. However, in text classification it rarely matters whether a few additional terms are added to the feature set or removed from it. What matters instead is the relative importance of features. As long as $\chi^2$ feature selection only ranks features with respect to their usefulness and is not used to make statements about statistical dependence or independence of variables, we need not be overly concerned that it does not adhere strictly to statistical theory.
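
As a rough illustration of the points above, the following Python sketch computes the $\chi^2$ statistic for a 2x2 term/class table with and without the Yates correction, reproduces the multiple-testing arithmetic, and ranks a few terms by their score. The function names, the counts, and the term names are hypothetical and are not taken from the text. The Yates correction is assumed to take its usual form (subtract 0.5 from each |O - E| before squaring), and the at-least-one-error probability assumes that the tests are independent.

```python
def chi_square_2x2(n11, n10, n01, n00, yates=False):
    """Chi-square statistic for a 2x2 term/class contingency table.

    n11: term present, in class      n10: term present, not in class
    n01: term absent, in class       n00: term absent, not in class
    """
    n = n11 + n10 + n01 + n00
    term_present, term_absent = n11 + n10, n01 + n00
    in_class, not_in_class = n11 + n01, n10 + n00
    observed = [n11, n10, n01, n00]
    # Expected counts under independence: row total * column total / N.
    expected = [
        term_present * in_class / n,
        term_present * not_in_class / n,
        term_absent * in_class / n,
        term_absent * not_in_class / n,
    ]
    stat = 0.0
    for o, e in zip(observed, expected):
        diff = abs(o - e)
        if yates:
            # Assumed form of the Yates correction: shrink |O - E| by 0.5.
            diff = max(diff - 0.5, 0.0)
        stat += diff ** 2 / e
    return stat


def expected_false_rejections(num_tests, alpha):
    """Average number of wrong calls if each test errs with probability alpha."""
    return num_tests * alpha


def prob_at_least_one_error(num_tests, alpha):
    """Chance of one or more errors, assuming the tests are independent."""
    return 1.0 - (1.0 - alpha) ** num_tests


def rank_terms(tables):
    """Order terms by decreasing (uncorrected) chi-square score."""
    scores = {term: chi_square_2x2(*counts) for term, counts in tables.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical counts, chosen small so that the correction is visible.
plain = chi_square_2x2(12, 8, 5, 15)
corrected = chi_square_2x2(12, 8, 5, 15, yates=True)
print(f"uncorrected: {plain:.2f}, Yates-corrected: {corrected:.2f}")

# The arithmetic from the text: 1000 tests at error probability 0.05.
print(expected_false_rejections(1000, 0.05))

# Hypothetical terms; only their relative order is used for selection.
ranking = rank_terms({
    "alpha": (12, 8, 5, 15),
    "beta": (10, 10, 9, 11),
    "gamma": (18, 2, 3, 17),
})
print(ranking)
```

With these made-up counts the uncorrected statistic is about 5.01 and the corrected one about 3.68. The 0.05 critical value for one degree of freedom is 3.84, so the same table is significant without the correction and not significant with it. The ranking step never compares a score against a critical value, which is why the caveats matter less when the scores are only used to order features.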

© 2008 Cambridge University Press