The underlying theory.

We want to find a query vector, denoted $\vec{q}$, that maximizes similarity with relevant documents while minimizing similarity with nonrelevant documents. If $C_r$ is the set of relevant documents and $C_{nr}$ is the set of nonrelevant documents, then we wish to find:

$\begin{displaymath} \vec{q}_{opt} = \argmax_{\vec{q}} [\mbox{sim}(\vec{q}, C_r) - \mbox{sim}(\vec{q}, C_{nr})], \end{displaymath}$

(47)

where $\mbox{sim}$ is defined as in Equation 24. Under cosine similarity, the optimal query vector $\vec{q}_{opt}$ for separating the relevant and nonrelevant documents is:

$\begin{displaymath} \vec{q}_{opt} = \frac{1}{\vert C_r\vert}\sum_{\vec{d}_j \in C_r} \vec{d}_j - \frac{1}{\vert C_{nr}\vert}\sum_{\vec{d}_j \in C_{nr}} \vec{d}_j \end{displaymath}$

(48)

That is, the optimal query is the vector difference between the centroids of the relevant and nonrelevant documents; see Figure 9.3. However, this observation is not terribly useful in practice, precisely because the full set of relevant documents is not known: it is what we want to find.
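
As a rough illustration of the centroid difference, the following Python sketch computes $\vec{q}_{opt}$ for a toy collection in which the relevant and nonrelevant sets are assumed to be fully known (which, as noted, is not the case in a real retrieval setting). The function names `centroid`, `optimal_query` and `cosine`, the three-term vocabulary, and the document vectors are all hypothetical and chosen only for this example.

```python
from math import sqrt


def centroid(docs):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(docs)
    return [sum(column) / n for column in zip(*docs)]


def optimal_query(relevant, nonrelevant):
    """Centroid of the relevant documents minus centroid of the nonrelevant ones."""
    mu_r = centroid(relevant)
    mu_nr = centroid(nonrelevant)
    return [r - nr for r, nr in zip(mu_r, mu_nr)]


def cosine(u, v):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy term-weight vectors over a hypothetical three-term vocabulary.
relevant = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
nonrelevant = [[0.0, 1.0, 1.0], [0.0, 1.0, 0.0]]

q_opt = optimal_query(relevant, nonrelevant)
print(q_opt)  # [1.0, -0.5, 0.0]

# Compare how similar the resulting query is to each group.
print([round(cosine(q_opt, d), 3) for d in relevant])
print([round(cosine(q_opt, d), 3) for d in nonrelevant])
```

On this toy data the resulting query has positive cosine similarity with both relevant documents and negative cosine similarity with both nonrelevant ones. Note that the difference of centroids can contain negative term weights, as it does here for the second term.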

© 2008 Cambridge University Press
2009-04-07