

Frequency-based feature selection

A third feature selection method is frequency-based feature selection, that is, selecting the terms that are most common in the class. Frequency can be defined either as document frequency (the number of documents in class $c$ that contain the term $t$) or as collection frequency (the number of tokens of $t$ that occur in documents in $c$). Document frequency is more appropriate for the Bernoulli model, collection frequency for the multinomial model.
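As a minimal sketch of the two variants (assuming a toy corpus represented as a list of (class label, tokenized document) pairs; the function name and data format are illustrative, not from the book):

from collections import Counter

def frequency_based_features(docs, c, k, use_document_frequency=True):
    """Select the k most frequent terms in class c.

    docs: list of (class_label, list_of_tokens) pairs (illustrative format).
    use_document_frequency=True counts each term once per document
    (document frequency, suited to the Bernoulli model); False counts
    every token occurrence (collection frequency, suited to the
    multinomial model).
    """
    counts = Counter()
    for label, tokens in docs:
        if label != c:
            continue
        if use_document_frequency:
            counts.update(set(tokens))   # at most one count per document
        else:
            counts.update(tokens)        # one count per token occurrence
    return [term for term, _ in counts.most_common(k)]

# Usage on a tiny toy corpus
docs = [
    ("china", "chinese beijing chinese".split()),
    ("china", "chinese chinese shanghai".split()),
    ("uk",    "london chinese embassy".split()),
]
print(frequency_based_features(docs, "china", 2))                                 # by document frequency
print(frequency_based_features(docs, "china", 2, use_document_frequency=False))   # by collection frequency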

Frequency-based feature selection selects some frequent terms that carry no specific information about the class, for example, the days of the week (Monday, Tuesday, ...), which are frequent across classes in newswire text. When many thousands of features are selected, frequency-based feature selection often does well. Thus, if somewhat suboptimal accuracy is acceptable, it can be a good alternative to more complex methods. However, Figure 13.8 is a case where frequency-based feature selection performs much worse than MI and $\chi^2$ and should not be used.

