References and further reading

As discussed in Chapter 9, Rocchio relevance feedback is due to Rocchio (1971). Joachims (1997) presents a probabilistic analysis of the method. Rocchio classification was widely used as a classification method in the 1990s (Buckley et al., 1994a; 1994b; Voorhees and Harman, 2005). Initially, it was used as a form of routing. Routing merely ranks documents according to relevance to a class without assigning them. Early work on filtering, a true classification approach that makes an assignment decision on each document, was published by Ittner et al. (1995) and Schapire et al. (1998). The sense of routing used here should not be confused with another use of the term: routing can also refer to the electronic distribution of documents to subscribers, the so-called push model of document distribution. In a pull model, each transfer of a document to the user is initiated by the user, for example by means of search or by selecting the document from a list on a news aggregation website.

Some authors restrict the name Rocchio classification to two-class problems and use the terms cluster-based classification (Iwayama and Tokunaga, 1995) and centroid-based classification (Han and Karypis, 2000; Tan and Cheng, 2007) for Rocchio classification with $J>2$.

A more detailed treatment of kNN can be found in Hastie et al. (2001), including methods for tuning the parameter $k$. An example of an approximate fast kNN algorithm is locality-sensitive hashing (Andoni et al., 2006). Kleinberg (1997) presents an approximate $\Theta((M \log^2 M)(M + \log N))$ kNN algorithm (where $M$ is the dimensionality of the space and $N$ the number of data points), but at the cost of exponential storage requirements: $\Theta((N\log M)^{2M})$. Indyk (2004) surveys nearest neighbor methods in high-dimensional spaces. Early work on kNN in text classification was motivated by the availability of massively parallel hardware architectures (Creecy et al., 1992). Yang (1994) uses an inverted index to speed up kNN classification. The optimality result for 1NN (asymptotically at most twice the Bayes error rate) is due to Cover and Hart (1967).
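
To make the inverted-index speedup concrete, here is a minimal Python sketch in the spirit of Yang (1994); the function names, data layout, and the similarity-weighted vote are our illustrative choices, not taken from the paper, and the document vectors are assumed to be length-normalized tf-idf weights so that dot products equal cosine similarities.

    from collections import defaultdict
    from heapq import nlargest

    def build_index(train_docs):
        """train_docs: list of (vector, label) pairs, vector = {term: weight}."""
        index = defaultdict(list)            # term -> [(doc_id, weight), ...]
        for doc_id, (vec, _label) in enumerate(train_docs):
            for term, weight in vec.items():
                index[term].append((doc_id, weight))
        return index

    def knn_classify(test_vec, index, train_docs, k=5):
        # Score only training documents that share at least one term
        # with the test document, instead of all N documents.
        scores = defaultdict(float)
        for term, w in test_vec.items():
            for doc_id, w_train in index.get(term, ()):
                scores[doc_id] += w * w_train
        neighbors = nlargest(k, scores.items(), key=lambda kv: kv[1])
        votes = defaultdict(float)
        for doc_id, sim in neighbors:
            votes[train_docs[doc_id][1]] += sim   # similarity-weighted vote
        return max(votes, key=votes.get) if votes else None

Because term spaces in text classification are sparse, the inner loops touch far fewer than $N$ training documents per test document, which is where the speedup over brute-force kNN comes from.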

The effectiveness of Rocchio classification and kNN is highly dependent on careful parameter tuning (in particular, the parameter $b'$ for Rocchio in Section 14.2 and $k$ for kNN), feature engineering (Chapter 15) and feature selection (Section 13.5). Buckley and Salton (1995), Yang and Kisiel (2003), Schapire et al. (1998) and Moschitti (2003) address these issues for Rocchio; Yang (2001) and Ault and Yang (2002) address them for kNN. Zavrel et al. (2000) compare feature selection methods for kNN.

The bias-variance tradeoff was introduced by Geman et al. (1992). The derivation in Section 14.6 is for $\mbox{MSE}(\gamma)$, but the tradeoff applies to many loss functions (cf. Friedman (1997); Domingos (2000)). Schütze et al. (1995) and Lewis et al. (1996) discuss linear classifiers for text, and Hastie et al. (2001) discuss linear classifiers in general. Readers interested in the algorithms mentioned but not described in this chapter may wish to consult Bishop (2006) for neural networks, Hastie et al. (2001) for linear and logistic regression, and Minsky and Papert (1988) for the perceptron algorithm. Anagnostopoulos et al. (2006) show that an inverted index can be used for highly efficient document classification with any linear classifier, provided that the classifier is still effective when trained on a modest number of features via feature selection.
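
For reference, the outcome of that derivation can be restated compactly in the chapter's notation, where $\Gamma_D$ denotes the classifier learned from training set $D$ and $P(c|d)$ the true class probability (the grouping of terms under labels is our annotation):

    \begin{align*}
    \text{learning-error}(\Gamma)
      &= E_D[\mbox{MSE}(\Gamma_D)] = E_d\,E_D\,[\Gamma_D(d)-P(c|d)]^2 \\
      &= \underbrace{E_d\,[P(c|d)-E_D\,\Gamma_D(d)]^2}_{\mbox{bias}(\Gamma)}
       + \underbrace{E_d\,E_D\,[\Gamma_D(d)-E_D\,\Gamma_D(d)]^2}_{\mbox{variance}(\Gamma)}
    \end{align*}

Low-bias methods like kNN incur their learning error mainly through the second term; low-variance methods like Rocchio and other linear classifiers incur it mainly through the first.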

We have only presented the simplest method for combining two-class classifiers into a one-of classifier. Another important method is the use of error-correcting codes, where a vector of decisions of different two-class classifiers is constructed for each document. A test document's decision vector is then "corrected" based on the distribution of decision vectors in the training set, a procedure that incorporates information from all two-class classifiers and their correlations into the final classification decision (Dietterich and Bakiri, 1995). Ghamrawi and McCallum (2005) also exploit dependencies between classes in any-of classification. Allwein et al. (2000) propose a general framework for combining two-class classifiers.
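
As a concrete illustration of the error-correcting-code scheme, here is a minimal Python sketch; the function names and the Hamming-distance decoder are our illustrative choices, not taken from Dietterich and Bakiri (1995), and any two-class learner from this chapter can be plugged in as train_binary.

    import numpy as np

    def ecoc_train(X, y, codebook, train_binary):
        """codebook: (n_classes, n_bits) 0/1 array, one codeword per class.
        train_binary(X, bits) -> predictor mapping a document to {0, 1}."""
        n_bits = codebook.shape[1]
        # One two-class classifier per codeword bit: documents whose class
        # has bit j set are the positive examples for classifier j.
        return [train_binary(X, codebook[y, j]) for j in range(n_bits)]

    def ecoc_classify(x, classifiers, codebook):
        # Decision vector: one bit per two-class classifier.
        bits = np.array([clf(x) for clf in classifiers])
        # "Correct" the vector by decoding to the nearest codeword
        # in Hamming distance.
        hamming = (codebook != bits).sum(axis=1)
        return int(hamming.argmin())

If the codewords are at minimum Hamming distance $d$ from each other, decoding recovers the intended class even when up to $\lfloor (d-1)/2 \rfloor$ of the two-class decisions are wrong.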
