Table 14.3: Training and test times for kNN classification. |D| is the number of training documents, L_ave their average length in tokens, M_ave their average number of distinct terms, and L_a and M_a the number of tokens and of distinct terms, respectively, in the test document.

kNN with preprocessing of training set
  training: Θ(|D| L_ave)
  testing: Θ(L_a + |D| M_ave M_a) = Θ(|D| M_ave M_a)
kNN without preprocessing of training set
  training: Θ(1)
  testing: Θ(L_a + |D| L_ave M_a) = Θ(|D| L_ave M_a)
Table 14.3 gives the time complexity of kNN. kNN has properties that are quite different from most other classification algorithms. Training a kNN classifier simply consists of determining k and preprocessing documents. In fact, if we preselect a value for k and do not preprocess, then kNN requires no training at all. In practice, we have to perform preprocessing steps like tokenization, and it makes more sense to preprocess training documents once as part of the training phase rather than repeatedly every time we classify a new test document.
Test time is Θ(|D| M_ave M_a) for kNN: it is linear in the size of the training set, since we need to compute the distance of each training document from the test document. Test time is independent of the number of classes J, so kNN has a potential advantage for problems with large J.
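To make the training/test asymmetry concrete, here is a minimal brute-force sketch (not the book's implementation; the sparse dict representation, the helper names cosine, train, and classify_knn, and the assumption that vectors are already tf-idf weighted and length-normalized are illustrative choices):

```python
from collections import Counter

def cosine(u, v):
    # Dot product of two unit-length sparse vectors (dicts term -> weight)
    # equals their cosine similarity; iterate over the shorter vector.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def train(docs, labels):
    # "Training" is just storing the preprocessed documents with their labels.
    return list(zip(docs, labels))

def classify_knn(model, test_doc, k=3):
    # Test time: one similarity computation per training document,
    # i.e. linear in |D| and independent of the number of classes.
    sims = sorted(((cosine(doc, test_doc), label) for doc, label in model),
                  reverse=True)
    top_k = [label for _, label in sims[:k]]
    return Counter(top_k).most_common(1)[0][0]  # majority vote among the k neighbors
```

Note that classify_knn touches every training document once, which is the dominant term in the test complexity above, and that the class labels only enter in the final vote.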
In kNN classification, we do not perform any estimation of parameters as we do in Rocchio classification (centroids) or in Naive Bayes (priors and conditional probabilities). kNN simply memorizes all examples in the training set and then compares the test document to them. For this reason, kNN is also called memory-based learning or instance-based learning. It is usually desirable to have as much training data as possible in machine learning. But in kNN large training sets come with a severe efficiency penalty in classification.
Can kNN testing be made more efficient than Θ(|D| M_ave M_a) or, ignoring the length of documents, more efficient than Θ(|D|)? There are fast kNN algorithms for small dimensionality M (Exercise 14.8). There are also approximations for large M that give error bounds for specific efficiency gains (see Section 14.7). These approximations have not been extensively tested for text classification applications, so it is not clear whether they can achieve much better efficiency than Θ(|D|) without a significant loss of accuracy.
The reader may have noticed the similarity between the problem of finding the k nearest neighbors of a test document and ad hoc retrieval, where we search for the documents with the highest similarity to the query (Section 6.3.2). In fact, the two problems are both nearest neighbor problems and differ only in the relative density of (the vector of) the test document in kNN (tens or hundreds of non-zero entries) versus the sparseness of (the vector of) the query in ad hoc retrieval (usually fewer than ten non-zero entries). We introduced the inverted index for efficient ad hoc retrieval in Section 1.1. Is the inverted index also the solution for efficient kNN?
An inverted index restricts a search to those documents that have at least one term in common with the query. Thus in the context of kNN, the inverted index will be efficient if the test document has no term overlap with a large number of training documents. Whether this is the case depends on the classification problem. If documents are long and no stop list is used, then less time will be saved. But with short documents and a large stop list, an inverted index may well cut the average test time by a factor of 10 or more.
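The candidate restriction can be sketched as follows, reusing the sparse representation from the earlier snippet; build_index and classify_knn_indexed are again illustrative names, and the index here simply maps each term to the positions of the training documents that contain it:

```python
from collections import defaultdict, Counter

def build_index(model):
    # Postings: term -> list of positions of training documents containing it.
    index = defaultdict(list)
    for i, (doc, _) in enumerate(model):
        for term in doc:
            index[term].append(i)
    return index

def classify_knn_indexed(model, index, test_doc, k=3):
    # Accumulate similarity scores only for training documents reached through
    # the postings lists of the test document's terms.
    scores = defaultdict(float)
    for term, weight in test_doc.items():
        for i in index.get(term, ()):
            scores[i] += weight * model[i][0][term]
    if not scores:
        return None  # no training document shares a term with the test document
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return Counter(model[i][1] for i in top_k).most_common(1)[0][0]
```

With a stop list applied during preprocessing, the longest postings lists disappear and the set of candidates scored per test document shrinks accordingly, which is where the savings described above come from.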
The search time in an inverted index is a function of the length of the postings lists of the terms in the query. Postings lists grow sublinearly with the length of the collection since the vocabulary increases according to Heaps' law: if the probability of occurrence of some terms increases, then the probability of occurrence of others must decrease. However, most new terms are infrequent. We therefore take the complexity of inverted index search to be Θ(T), where T is the number of tokens in the collection (as discussed in Section 2.4.2), and, assuming average document length does not change over time, Θ(T) = Θ(|D|).
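The argument can be made explicit with Heaps' law from Chapter 5, M = k T^b with b < 1 (these k and b are the Heaps' law parameters, unrelated to the k of kNN); the constant average document length is the assumption already made above:

```latex
% T: tokens in the collection, M: vocabulary size, L_ave: average document length.
\begin{align*}
  M &= k\,T^{b}, \quad 0 < b < 1
     && \text{(Heaps' law: vocabulary grows sublinearly in $T$)}\\
  \frac{T}{M} &= \frac{T^{1-b}}{k}
     && \text{(upper bound on average postings list length, also sublinear in $T$)}\\
  T &= |\mathbb{D}|\,L_{\text{ave}} \;\Rightarrow\; \Theta(T) = \Theta(|\mathbb{D}|)
     && \text{(if $L_{\text{ave}}$ stays constant over time)}
\end{align*}
```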
As we will see in the next chapter, kNN's effectiveness is close to that of the most accurate learning methods in text classification (Table 15.2). A measure of the quality of a learning method is how close the average error rate of the classifiers it learns for a problem comes to the Bayes error rate, the lowest error rate achievable by any classifier for that problem. kNN is not optimal for problems with a non-zero Bayes error rate - that is, for problems where even the best possible classifier has a non-zero classification error. The error of 1NN is asymptotically (as the training set increases) bounded by twice the Bayes error rate: if the optimal classifier has an error rate of ε, then 1NN has an asymptotic error rate of less than 2ε. This is due to the effect of noise - we already saw one example of noise in the form of noisy features in Section 13.5, but noise can also take other forms, as we will discuss in the next section. Noise affects two components of kNN: the test document and the closest training document. The two sources of noise are additive, so the overall error rate of 1NN is at most twice the optimal error rate. For problems with a Bayes error rate of 0, the error rate of 1NN will approach 0 as the size of the training set increases.
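A sketch of this additive-noise argument for the binary case, under the simplifying assumption that any document's label disagrees with the Bayes-optimal label independently with probability ε (the Bayes error rate), and noting that asymptotically the nearest neighbor lies arbitrarily close to the test document, so both labels are governed by the same ε:

```latex
% 1NN errs exactly when one of the two labels (test document, nearest neighbor)
% is flipped away from the Bayes-optimal label and the other is not;
% if both or neither are flipped, the prediction agrees with the true label.
\begin{align*}
  P(\text{1NN error})
    &= \epsilon(1-\epsilon) + (1-\epsilon)\epsilon \\
    &= 2\epsilon(1-\epsilon) \;\le\; 2\epsilon .
\end{align*}
```

For ε = 0 this recovers the last claim above: the asymptotic error rate of 1NN is 0.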
Exercises.