Next: Feature selection for multiple
Up: Feature selection
Previous: Assessing as a feature
Contents
Index
Frequency-based feature
selection
A third feature selection method is
frequency-based
feature selection , that is, selecting the terms that are
most common in the class.
Frequency can be either defined as document frequency (the number of documents
in the class that contain the term ) or as
collection frequency (the number of
tokens of that occur in documents in
). Document frequency is
more appropriate for the Bernoulli model, collection frequency for the
multinomial model.
Frequency-based feature selection
selects some
frequent terms that have no specific information about the
class, for example, the days of the week (Monday,
Tuesday, ...), which are frequent across classes
in newswire text. When many thousands of features are selected,
then frequency-based feature selection often does
well.
Thus, if somewhat suboptimal accuracy is acceptable, then
frequency-based feature selection can be a good alternative to
more complex methods. However, Figure 13.8 is a case where
frequency-based feature selection performs a lot worse than
MI and and should not be
used.
Next: Feature selection for multiple
Up: Feature selection
Previous: Assessing as a feature
Contents
Index
© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.
2009-04-07