General introductions to statistical classification and machine learning can be found in Hastie et al. (2001), Mitchell (1997), and Duda et al. (2000), which cover many important methods (e.g., decision trees and boosting) that we do not. Sebastiani (2002) is a comprehensive review of text classification methods and results. Manning and Schütze (1999, Chapter 16) give an accessible introduction to text classification with coverage of decision trees, perceptrons, and maximum entropy models. More information on the superlinear time complexity of learning methods that are more accurate than Naive Bayes can be found in Perkins et al. (2003) and Joachims (2006a).
Maron and Kuhns (1960) described one of the first NB text classifiers. Lewis (1998) focuses on the history of NB classification. Bernoulli and multinomial models and their accuracy for different collections are discussed by McCallum and Nigam (1998). Eyheramendy et al. (2003) present additional NB models. Domingos and Pazzani (1997), Friedman (1997), and Hand and Yu (2001) analyze why NB performs well although its probability estimates are poor. Domingos and Pazzani (1997) also discuss NB's optimality when the independence assumptions are true of the data. Pavlov et al. (2004) propose a modified document representation that partially addresses the inappropriateness of the independence assumptions. Bennett (2000) attributes the tendency of NB probability estimates to be close to either 0 or 1 to the effect of document length. Ng and Jordan (2001) show that NB is sometimes (although rarely) superior to discriminative methods because it more quickly reaches its optimal error rate. The basic NB model presented in this chapter can be tuned for better effectiveness (Rennie et al., 2003; Kołcz and Yih, 2007). The problem of concept drift and other reasons why state-of-the-art classifiers do not always excel in practice are discussed by Forman (2006) and Hand (2006).
Early uses of mutual information and χ² for feature selection in text classification are Lewis and Ringuette (1994) and Schütze et al. (1995), respectively. Yang and Pedersen (1997) review feature selection methods and their impact on classification effectiveness. They find that pointwise mutual information is not competitive with other methods. Yang and Pedersen refer to expected mutual information (Equation 130) as information gain (see Exercise 13.6). Snedecor and Cochran (1989) is a good reference for the χ² test in statistics, including Yates' correction for continuity for 2 × 2 tables. Dunning (1993) discusses problems of the χ² test when counts are small. Nongreedy feature selection techniques are described by Hastie et al. (2001). Cohen (1995) discusses the pitfalls of using multiple significance tests and methods to avoid them. Forman (2004) evaluates different methods for feature selection for multiple classifiers.
David D. Lewis defines the ModApte split at www.daviddlewis.com/resources/testcollections/reuters21578/readme.txt, based on Apté et al. (1994). Lewis (1995) describes utility measures for the evaluation of text classification systems. Yang and Liu (1999) employ significance tests in the evaluation of text classification methods.
Lewis et al. (2004) find that SVMs (Chapter 15) perform better on Reuters-RCV1 than kNN and Rocchio (Chapter 14).