Unlike Rocchio, $k$ nearest neighbor or kNN classification determines the decision boundary locally. For 1NN we assign each document to the class of its closest neighbor. For kNN we assign each document to the majority class of its $k$ closest neighbors, where $k$ is a parameter. The rationale of kNN classification is that, based on the contiguity hypothesis, we expect a test document $d$ to have the same label as the training documents located in the local region surrounding $d$.
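As a concrete (if simplified) illustration of this rule, the sketch below classifies a test vector by majority vote among its $k$ closest training vectors. The vectors, the class labels, and the choice of Euclidean distance are assumptions made for this example, not something fixed by the text.

```python
from collections import Counter

def knn_classify(test_vec, train_vecs, train_labels, k=3):
    """Assign test_vec to the majority class of its k closest training vectors."""
    def dist(u, v):
        # Euclidean distance between two equal-length vectors.
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

    # Indices of training documents, sorted by distance to the test document.
    ranked = sorted(range(len(train_vecs)), key=lambda i: dist(test_vec, train_vecs[i]))
    # Majority vote among the k nearest neighbors; k=1 gives the 1NN rule.
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Tiny made-up 2-dimensional "documents": two per class.
train = [(0.0, 1.0), (0.1, 0.9), (1.0, 0.0), (0.9, 0.1)]
labels = ["china", "china", "uk", "uk"]
print(knn_classify((0.2, 0.8), train, labels, k=3))  # -> china
```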
Decision boundaries in 1NN are concatenated segments of the Voronoi tessellation as shown in Figure 14.6. The Voronoi tessellation of a set of objects decomposes space into Voronoi cells, where each object's cell consists of all points that are closer to the object than to other objects. In our case, the objects are documents. The Voronoi tessellation then partitions the plane into $|\mathbb{D}|$ convex polygons, each containing its corresponding document (and no other), as shown in Figure 14.6, where a convex polygon is a convex region in 2-dimensional space bounded by lines.
For general $k$ in kNN, consider the region in the space for which the set of $k$ nearest neighbors is the same. This again is a convex polygon and the space is partitioned into convex polygons, within each of which the set of $k$ nearest neighbors is invariant (Exercise 14.8).
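One way to see the correspondence between 1NN decision regions and Voronoi cells is to label a dense grid of points by their nearest training point: all points sharing the same nearest neighbor form that document's cell. A minimal sketch, assuming NumPy and a handful of invented 2-dimensional points:

```python
import numpy as np

# Hypothetical 2-dimensional "documents" (one point per document).
docs = np.array([[0.2, 0.3], [0.7, 0.8], [0.5, 0.1], [0.9, 0.4]])

# Sample a grid of points in the unit square.
xs, ys = np.meshgrid(np.linspace(0, 1, 200), np.linspace(0, 1, 200))
grid = np.stack([xs.ravel(), ys.ravel()], axis=1)

# For each grid point, find the index of the closest document.
# Points sharing the same index lie in the same Voronoi cell,
# which is exactly the 1NN decision region of that document.
dists = np.linalg.norm(grid[:, None, :] - docs[None, :, :], axis=2)
cell = dists.argmin(axis=1).reshape(xs.shape)

# `cell` now partitions the square into convex regions, one per document.
print(np.unique(cell))  # -> [0 1 2 3]
```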
1NN is not very robust. The classification decision of each test document relies on the class of a single training document, which may be incorrectly labeled or atypical. kNN for $k > 1$ is more robust. It assigns documents to the majority class of their $k$ closest neighbors, with ties broken randomly.
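A toy illustration of this robustness argument: if the single closest neighbor happens to be mislabeled, 1NN copies the mistake, while 3NN can outvote it. The distances and labels below are invented, and ties are broken randomly as described above.

```python
import random
from collections import Counter

def knn_vote(distances_and_labels, k):
    """Majority vote among the k nearest neighbors, breaking ties randomly."""
    nearest = sorted(distances_and_labels)[:k]        # (distance, label) pairs
    counts = Counter(label for _, label in nearest)
    top = max(counts.values())
    winners = [label for label, c in counts.items() if c == top]
    return random.choice(winners)                     # random tie-breaking

# The closest training document (distance 0.1) carries a wrong label "B";
# the next two neighbors are correctly labeled "A".
neighbors = [(0.1, "B"), (0.2, "A"), (0.3, "A"), (0.9, "B")]
print(knn_vote(neighbors, k=1))  # -> "B": 1NN inherits the mislabeling
print(knn_vote(neighbors, k=3))  # -> "A": 3NN outvotes the single bad neighbor
```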
There is a probabilistic version of this kNN classification algorithm. We can estimate the probability of membership in class $c$ as the proportion of the $k$ nearest neighbors in $c$. Figure 14.6 gives an example for $k = 3$. Probability estimates for class membership of the star are $\hat{P}(\text{circle class}\mid\text{star}) = 1/3$, $\hat{P}(\text{X class}\mid\text{star}) = 2/3$, and $\hat{P}(\text{diamond class}\mid\text{star}) = 0$. The 3nn estimate ($\hat{P}(\text{circle class}\mid\text{star}) = 1/3$) and the 1nn estimate ($\hat{P}(\text{circle class}\mid\text{star}) = 1$) differ, with 3nn preferring the X class and 1nn preferring the circle class.
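A minimal sketch of this probabilistic estimate, assuming the neighbor labels are already sorted by distance; the labels mirror a configuration like the star example, with one circle and two X's among the three nearest neighbors.

```python
from collections import Counter

def knn_class_probabilities(neighbor_labels, k):
    """Estimate P(c|d) as the fraction of the k nearest neighbors belonging to class c."""
    counts = Counter(neighbor_labels[:k])   # labels assumed sorted by distance, closest first
    return {c: n / k for c, n in counts.items()}

# Nearest neighbors of the test point, closest first: a circle, then two X's.
labels_by_distance = ["circle", "X", "X"]
print(knn_class_probabilities(labels_by_distance, k=1))  # {'circle': 1.0}
print(knn_class_probabilities(labels_by_distance, k=3))  # {'circle': 0.33..., 'X': 0.66...}
```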
The parameter $k$ in kNN is often chosen based on experience or knowledge about the classification problem at hand. It is desirable for $k$ to be odd to make ties less likely. $k = 3$ and $k = 5$ are common choices, but much larger values between 50 and 100 are also used. An alternative way of setting the parameter is to select the $k$ that gives best results on a held-out portion of the training set.
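Choosing $k$ on held-out data amounts to a simple search: classify the held-out documents with each candidate $k$ and keep the value with the highest accuracy. A sketch under those assumptions (the candidate values and the unweighted-vote classifier are illustrative choices):

```python
from collections import Counter

def knn_classify(test_vec, train_vecs, train_labels, k):
    dist = lambda u, v: sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
    ranked = sorted(range(len(train_vecs)), key=lambda i: dist(test_vec, train_vecs[i]))
    return Counter(train_labels[i] for i in ranked[:k]).most_common(1)[0][0]

def choose_k(train_vecs, train_labels, heldout_vecs, heldout_labels, candidates=(1, 3, 5, 7)):
    """Return the candidate k with the highest accuracy on the held-out documents."""
    def accuracy(k):
        hits = sum(knn_classify(v, train_vecs, train_labels, k) == y
                   for v, y in zip(heldout_vecs, heldout_labels))
        return hits / len(heldout_labels)
    return max(candidates, key=accuracy)
```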
We can also weight the ``votes'' of the $k$ nearest neighbors by their cosine similarity. In this scheme, a class's score is computed as:

$$\mathrm{score}(c,d) = \sum_{d' \in S_k(d)} I_c(d')\,\cos\bigl(\vec{v}(d'), \vec{v}(d)\bigr) \qquad (143)$$

where $S_k(d)$ is the set of $d$'s $k$ nearest neighbors and $I_c(d') = 1$ iff $d'$ is in class $c$ and $0$ otherwise. We then assign the document to the class with the highest score.
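A direct transcription of Equation (143) might look as follows; the cosine function and the convention of ranking neighbors by cosine similarity are assumptions of this sketch, but the scoring rule is the one above: each of the $k$ nearest neighbors adds its cosine similarity to the score of its class, and the document is assigned to the highest-scoring class.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def weighted_knn_classify(test_vec, train_vecs, train_labels, k):
    """score(c, d) = sum of cos(v(d'), v(d)) over the k nearest neighbors d' in class c."""
    # Rank training documents by cosine similarity (most similar first) and keep k.
    nearest = sorted(range(len(train_vecs)),
                     key=lambda i: cosine(test_vec, train_vecs[i]),
                     reverse=True)[:k]
    scores = defaultdict(float)
    for i in nearest:
        scores[train_labels[i]] += cosine(test_vec, train_vecs[i])
    return max(scores, key=scores.get)   # class with the highest score
```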
Figure 14.7 summarizes the kNN algorithm.
Worked example.
The distances of the test document $\vec{d}_5$ from the four training documents in Table 14.1 are $|\vec{d}_1 - \vec{d}_5| = |\vec{d}_2 - \vec{d}_5| = |\vec{d}_3 - \vec{d}_5| \approx 1.41$ and $|\vec{d}_4 - \vec{d}_5| = 0.0$. $d_5$'s nearest neighbor is therefore $d_4$, and 1NN assigns $d_5$ to $d_4$'s class, $\overline{c}$.
End worked example.
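As a rough sanity check of this worked example, the sketch below recomputes the distances under one possible reconstruction of the Table 14.1 vectors (length-normalized tf-idf over the terms Beijing, Shanghai, Macao, Japan, and Tokyo, with the term Chinese dropping out because its idf is 0). The numeric vectors and class names are assumptions of this illustration, not copied from the table.

```python
import math

# Assumed length-normalized tf-idf vectors over (Beijing, Shanghai, Macao, Japan, Tokyo);
# a reconstruction for illustration only, not copied from Table 14.1.
s = 1 / math.sqrt(2)
training = {
    "d1": ([1, 0, 0, 0, 0], "c"),        # class c
    "d2": ([0, 1, 0, 0, 0], "c"),
    "d3": ([0, 0, 1, 0, 0], "c"),
    "d4": ([0, 0, 0, s, s], "not-c"),    # class not-c
}
d5 = [0, 0, 0, s, s]                      # test document

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Distances: d1, d2, d3 are at distance ~1.41 from d5, d4 at distance 0.0.
distances = {name: euclidean(vec, d5) for name, (vec, _) in training.items()}
print(distances)

# 1NN assigns d5 to the class of its single closest neighbor, d4.
nearest = min(distances, key=distances.get)
print(nearest, training[nearest][1])      # -> d4 not-c
```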