General introductions to statistical classification and machine learning can be found in (Hastie et al., 2001), (Mitchell, 1997), and (Duda et al., 2000), including many important methods (e.g., decision trees and boosting) that we do not cover. A comprehensive review of text classification methods and results is (Sebastiani, 2002). Manning and Schütze (1999, Chapter 16) give an accessible introduction to text classification with coverage of decision trees, perceptrons and maximum entropy models. More information on the superlinear time complexity of learning methods that are more accurate than Naive Bayes can be found in (Perkins et al., 2003) and (Joachims, 2006a).
Maron and Kuhns (1960) described one of the first NB
text classifiers. Lewis (1998) focuses on the
history of NB classification. Bernoulli and
multinomial models and their accuracy for different
collections are discussed by McCallum and Nigam (1998).
Eyheramendy et al. (2003) present additional NB
models. Domingos and Pazzani (1997),
Friedman (1997), and Hand and Yu (2001) analyze why NB performs
well although its probability estimates are poor. The
first paper also discusses NB's optimality when the
independence assumptions are true of the
data. Pavlov et al. (2004) propose a modified
document representation that partially addresses the
inappropriateness of the independence assumptions.
Bennett (2000) attributes the tendency of NB
probability estimates to be close to either 0 or 1 to the
effect of document length. Ng and Jordan (2001) show
that NB is sometimes (although rarely) superior to
discriminative methods because it more quickly reaches its
optimal error rate. The basic NB model presented
in this chapter can be tuned for better effectiveness
(Rennie et al. 2003; Kołcz and Yih 2007). The problem of
concept drift and other reasons why
state-of-the-art classifiers do not always excel in practice
are discussed by Forman (2006) and
Hand (2006).
Early uses of mutual information and χ² for feature
selection in text classification are
Lewis and Ringuette (1994) and Schütze et al. (1995), respectively.
Yang and Pedersen (1997) review feature selection methods and
their impact on classification effectiveness. They find that
pointwise mutual information is not competitive with
other methods. Yang and Pedersen refer to
expected mutual information (Equation 130) as
information gain (see Exercise 13.6).
Snedecor and Cochran (1989) is a good
reference for the χ² test in statistics, including
Yates' correction for continuity for 2×2
tables. Dunning (1993) discusses problems of the χ²
test when counts are small. Nongreedy feature
selection techniques are described by
Hastie et al. (2001). Cohen (1995)
discusses the pitfalls of using multiple significance tests
and methods to avoid them. Forman (2004) evaluates
different methods for feature selection for multiple
classifiers.
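As a concrete illustration of the quantities discussed in this paragraph, the sketch below computes expected mutual information (what Yang and Pedersen call information gain), pointwise mutual information, and the χ² statistic (optionally with Yates' correction) from the four cells of a 2×2 term–class contingency table. It is a minimal rendering of the standard formulas under the usual cell-count conventions, not code from any of the cited papers, and the counts in the usage example are made up.

```python
import math

def expected_mutual_information(n11, n10, n01, n00):
    """Expected mutual information I(U;C) between term occurrence and class
    membership, from a 2x2 contingency table:
      n11 = docs in the class that contain the term
      n10 = docs outside the class that contain the term
      n01 = docs in the class that do not contain the term
      n00 = docs outside the class that do not contain the term"""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    # Each tuple: (cell count, term marginal, class marginal).
    for n_tc, n_t, n_c in [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]:
        if n_tc > 0:
            mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi

def pointwise_mutual_information(n11, n10, n01, n00):
    """Pointwise MI: uses only the 'term present, in class' cell and its
    marginals, which is why rare terms can receive very high scores."""
    n = n11 + n10 + n01 + n00
    return math.log2(n * n11 / ((n11 + n10) * (n11 + n01)))

def chi_square(n11, n10, n01, n00, yates=False):
    """Chi-square statistic for a 2x2 table, optionally with Yates'
    correction for continuity."""
    n = n11 + n10 + n01 + n00
    diff = abs(n11 * n00 - n10 * n01)
    if yates:
        diff = max(diff - n / 2, 0.0)
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return n * diff ** 2 / denom

# Illustrative (made-up) counts for one term and one class.
counts = (49, 27652, 141, 774106)
print(expected_mutual_information(*counts))   # information gain
print(pointwise_mutual_information(*counts))  # pointwise MI
print(chi_square(*counts, yates=True))        # chi-square with Yates' correction
```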
David D. Lewis defines the ModApte split at www.daviddlewis.com/resources/testcollections/reuters21578/readme.txt based on Apté et al. (1994). Lewis (1995) describes utility measures for the evaluation of text classification systems. Yang and Liu (1999) employ significance tests in the evaluation of text classification methods.
Lewis et al. (2004) find that SVMs (Chapter 15) perform better on Reuters-RCV1 than kNN and Rocchio (Chapter 14).