Unlike Rocchio, nearest neighbor or kNN classification determines the decision boundary locally. For 1NN we assign each document to the class of its closest neighbor. For kNN we assign each document to the majority class of its $k$ closest neighbors, where $k$ is a parameter. The rationale of kNN classification is that, based on the contiguity hypothesis, we expect a test document $d$ to have the same label as the training documents located in the local region surrounding $d$.
Decision boundaries in 1NN are concatenated segments of the Voronoi tessellation as shown in Figure 14.6. The Voronoi tessellation of a set of objects decomposes space into Voronoi cells, where each object's cell consists of all points that are closer to the object than to other objects. In our case, the objects are documents. The Voronoi tessellation then partitions the plane into as many convex polygons as there are documents, each containing its corresponding document (and no other) as shown in Figure 14.6, where a convex polygon is a convex region in 2-dimensional space bounded by lines.
For general $k$ in kNN, consider the region in the space for which the set of $k$ nearest neighbors is the same. This again is a convex polygon and the space is partitioned into convex polygons, within each of which the set of $k$ nearest neighbors is invariant (Exercise 14.8).
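For readers who want to see a concrete tessellation, the following Python sketch computes the Voronoi cells of a few hypothetical 2-dimensional document vectors; the language, the scipy library, and the sample points are choices of this illustration, not of the text. Finding a test point's nearest document is the same as finding the Voronoi cell it falls into.

    # Minimal sketch: Voronoi tessellation of hypothetical 2-D document vectors.
    import numpy as np
    from scipy.spatial import Voronoi

    docs = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 2.0], [3.0, 2.5]])  # hypothetical documents
    vor = Voronoi(docs)
    print(vor.vertices)       # corners of the Voronoi cells
    print(vor.point_region)   # which cell belongs to which document

    # 1NN classification of a test point is equivalent to locating its cell:
    test = np.array([1.1, 0.4])
    nearest = int(np.argmin(np.linalg.norm(docs - test, axis=1)))
    print("test point lies in the Voronoi cell of document", nearest)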
1NN is not very robust. The classification decision of each test document relies on the class of a single training document, which may be incorrectly labeled or atypical. kNN for $k > 1$ is more robust. It assigns documents to the majority class of their $k$ closest neighbors, with ties broken randomly.
There is a probabilistic version of this kNN classification algorithm. We can estimate the probability of membership of a document $d$ in class $c$ as the proportion of the $k$ nearest neighbors in $c$. Figure 14.6 gives an example for $k=3$. Probability estimates for class membership of the star are $\hat{P}(\text{circle class}|\text{star}) = 1/3$, $\hat{P}(\text{X class}|\text{star}) = 2/3$, and $\hat{P}(\text{diamond class}|\text{star}) = 0$. The 3NN estimate ($\hat{P}(\text{circle class}|\text{star}) = 1/3$) and the 1NN estimate ($\hat{P}(\text{circle class}|\text{star}) = 1$) differ, with 3NN preferring the X class and 1NN preferring the circle class.
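The estimate is simply a ratio of neighbor counts. A minimal Python sketch makes the computation explicit; the function name and the 2-dimensional points below are illustrative stand-ins loosely patterned after Figure 14.6, not data from the text.

    # Minimal sketch: P(c|d) estimated as the fraction of the k nearest
    # training documents that belong to class c. Toy data is hypothetical.
    import numpy as np
    from collections import Counter

    def knn_class_probabilities(X_train, y_train, x_test, k):
        """Return {class: fraction of the k nearest neighbors in that class}."""
        distances = np.linalg.norm(X_train - x_test, axis=1)  # Euclidean distances
        nearest = np.argsort(distances)[:k]                   # indices of the k nearest docs
        counts = Counter(y_train[i] for i in nearest)
        return {c: counts.get(c, 0) / k for c in set(y_train)}

    X_train = np.array([[1.0, 1.0], [1.2, 0.9], [0.0, 0.2], [3.0, 3.0]])
    y_train = ["X", "X", "circle", "diamond"]
    x_test  = np.array([0.3, 0.4])
    print(knn_class_probabilities(X_train, y_train, x_test, k=3))
    print(knn_class_probabilities(X_train, y_train, x_test, k=1))

With these hypothetical points, the $k=3$ estimates prefer the X class while the $k=1$ estimate prefers the circle class, mirroring the contrast described above.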
The parameter $k$ in kNN is often chosen based on experience or knowledge about the classification problem at hand. It is desirable for $k$ to be odd to make ties less likely. $k = 3$ and $k = 5$ are common choices, but much larger values between 50 and 100 are also used. An alternative way of setting the parameter is to select the $k$ that gives the best results on a held-out portion of the training set.
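The held-out procedure can be sketched in a few lines of Python. The helper names (knn_predict, select_k) and the candidate set of odd values are this illustration's own assumptions.

    # Minimal sketch: pick the k with the best accuracy on a held-out split.
    import numpy as np

    def knn_predict(X_train, y_train, x, k):
        nearest = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
        labels, counts = np.unique([y_train[i] for i in nearest], return_counts=True)
        return labels[np.argmax(counts)]       # majority class (ties broken arbitrarily)

    def select_k(X_train, y_train, X_heldout, y_heldout, candidates=(1, 3, 5, 51, 101)):
        def accuracy(k):
            preds = [knn_predict(X_train, y_train, x, k) for x in X_heldout]
            return np.mean([p == y for p, y in zip(preds, y_heldout)])
        return max(candidates, key=accuracy)   # k with the highest held-out accuracy

    # Hypothetical tiny split purely for illustration:
    X_tr = np.array([[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1]])
    y_tr = ["a", "a", "b", "b"]
    X_ho = np.array([[0.0, 0.8], [0.8, 0.0]])
    y_ho = ["a", "b"]
    print(select_k(X_tr, y_tr, X_ho, y_ho, candidates=(1, 3)))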
We can also weight the ``votes'' of the $k$ nearest
neighbors by their cosine similarity. In this scheme, a class's
score is computed as:
\[ \text{score}(c, d) = \sum_{d' \in S_k(d)} I_c(d') \cos(\vec{v}(d'), \vec{v}(d)) \qquad (143) \]
where $S_k(d)$ is the set of $d$'s $k$ nearest neighbors and $I_c(d') = 1$ iff $d'$ is in class $c$ and $0$ otherwise. We then assign the document to the class with the highest score.
Figure 14.7 summarizes the kNN algorithm.
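A compact Python sketch of kNN classification with the cosine-weighted voting of Equation (143) follows; the function names are illustrative, not taken from Figure 14.7, and document vectors are assumed to be dense arrays.

    # Minimal sketch of Equation (143): each of the k nearest neighbors d'
    # adds cos(v(d'), v(d)) to the score of its own class; the document is
    # assigned to the class with the highest score.
    import numpy as np

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def knn_weighted_classify(X_train, y_train, x, k):
        sims = np.array([cosine(d_prime, x) for d_prime in X_train])
        s_k = np.argsort(-sims)[:k]            # S_k(d): the k most similar training docs
        scores = {}
        for i in s_k:
            # I_c(d') = 1 only for d''s own class, so each neighbor contributes
            # its cosine similarity to exactly one class score.
            scores[y_train[i]] = scores.get(y_train[i], 0.0) + sims[i]
        return max(scores, key=scores.get)     # class with the highest score

Within the $k$-neighborhood, more similar neighbors receive larger votes than less similar ones, rather than all counting equally.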
Worked example. The distances of the test document $\vec{d}_5$ from the four training documents in Table 14.1 are $|\vec{d}_1 - \vec{d}_5| = |\vec{d}_2 - \vec{d}_5| = |\vec{d}_3 - \vec{d}_5| \approx 1.41$ and $|\vec{d}_4 - \vec{d}_5| = 0.0$. $d_5$'s nearest neighbor is therefore $d_4$, and 1NN assigns $d_5$ to $d_4$'s class, $\overline{c}$. End worked example.
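As a check on the mechanics only (the vectors below are hypothetical unit-length stand-ins, not the actual tf-idf weights of Table 14.1), the same 1NN computation can be run as follows.

    # Minimal sketch of the 1NN step in the worked example: Euclidean distances
    # from the test vector to each training vector, then assignment to the
    # class of the single closest training document. Data is hypothetical.
    import numpy as np

    train_vectors = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]])
    train_classes = ["c", "c", "not-c", "not-c"]
    test_vector   = np.array([0.7, 0.71])

    distances = np.linalg.norm(train_vectors - test_vector, axis=1)
    nearest = int(np.argmin(distances))        # index of the closest training document
    print(distances.round(2), "->", train_classes[nearest])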