k nearest neighbor

Unlike Rocchio, *
nearest neighbor* or *kNN
classification* determines the decision boundary
locally. For 1NN we assign each document to the class of its
closest neighbor. For kNN we assign each document to the majority class of its
closest neighbors where is a parameter. The rationale of kNN classification
is that, based on the contiguity hypothesis,
we expect a test document
to have the same label as the
training documents located in the local region surrounding .

Decision boundaries in 1NN are concatenated
segments of the *Voronoi tessellation* as shown in
Figure 14.6 . The Voronoi tessellation of a
set of objects decomposes space into Voronoi cells, where
each object's cell consists of all points that are closer to
the object than to other objects. In our case, the objects
are documents.
The Voronoi tessellation then partitions
the plane into
convex polygons, each containing its
corresponding document (and no other)
as shown in Figure 14.6 , where
a convex polygon is a convex region in
2-dimensional space bounded by
lines.

For general
in kNN,
consider the region in the space for which the set
of nearest neighbors is the same. This again is a convex
polygon and the space is partitioned into convex
polygons , within each of which the
set of nearest neighbors is invariant (Exercise 14.8 ).^{}

1NN is not very robust. The classification decision of each test document relies on the class of a single training document, which may be incorrectly labeled or atypical. kNN for is more robust. It assigns documents to the majority class of their closest neighbors, with ties broken randomly.

There is a probabilistic version of this kNN classification algorithm. We can estimate the probability of membership in class as the proportion of the nearest neighbors in . Figure 14.6 gives an example for . Probability estimates for class membership of the star are , , and . The 3nn estimate ( ) and the 1nn estimate ( ) differ with 3nn preferring the X class and 1nn preferring the circle class .

The parameter in kNN is often chosen based on experience or
knowledge about the classification problem at hand. It is
desirable for to be odd to make ties less likely.
and are common choices, but much larger values between
50 and 100 are also used. An alternative way of setting
the parameter is to select the that gives best results
on a *held-out* portion of the training set.

We can also weight the ``votes'' of the nearest
neighbors by their cosine similarity. In this scheme, a class's
score is computed as:

(143) |

Figure 14.7 summarizes the kNN algorithm.

**Worked example.**
The distances of the test document from the four training
documents in
Table 14.1 are
and
. 's nearest neighbor
is therefore and 1NN assigns to 's class,
.
**End worked example.**

This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.

2009-04-07