
The underlying theory.

We want to find a query vector, denoted as $\vec{q}$, that maximizes similarity with relevant documents while minimizing similarity with nonrelevant documents. If $C_r$ is the set of relevant documents and $C_{nr}$ is the set of nonrelevant documents, then we wish to find:
\begin{displaymath}
\vec{q}_{opt} = \argmax_{\vec{q}} [\mbox{sim}(\vec{q}, C_r) - \mbox{sim}(\vec{q}, C_{nr})],
\end{displaymath} (47)

where $\mbox{sim}$ is defined as in Equation 24. Under cosine similarity, the optimal query vector $\vec{q}_{opt}$ for separating the relevant and nonrelevant documents is:
\begin{displaymath}
\vec{q}_{opt} = \frac{1}{\vert C_r\vert}\sum_{\vec{d}_j \in C_r} \vec{d}_j - \frac{1}{\vert C_{nr}\vert}\sum_{\vec{d}_j \in C_{nr}} \vec{d}_j
\end{displaymath} (48)

That is, the optimal query is the vector difference between the centroids of the relevant and nonrelevant documents; see Figure 9.3. However, this observation is of limited practical use, precisely because the full set of relevant documents is not known: it is what we want to find.
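
The centroid difference in Equation 48 can be illustrated with a small sketch. It is only an illustration under stated assumptions: the document vectors are invented toy data, every document is scaled to unit length, and $\mbox{sim}(\vec{q}, C)$ is taken to be the mean cosine similarity between $\vec{q}$ and the documents in $C$. The helper names (`normalize`, `centroid`, `optimal_query`, `mean_cosine`, `objective`) are hypothetical and do not come from the text.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit Euclidean length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def centroid(docs):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(docs) for col in zip(*docs)]


def optimal_query(relevant, nonrelevant):
    """Centroid of the relevant set minus centroid of the nonrelevant set."""
    return [r - n for r, n in zip(centroid(relevant), centroid(nonrelevant))]


def mean_cosine(query, docs):
    """Average cosine similarity between a query and unit-length documents."""
    q = normalize(query)
    return sum(sum(a * b for a, b in zip(q, d)) for d in docs) / len(docs)


def objective(query, relevant, nonrelevant):
    """Quantity maximized in Equation 47, assuming sim(q, C) is mean cosine."""
    return mean_cosine(query, relevant) - mean_cosine(query, nonrelevant)


# Hypothetical toy collection: each vector holds three term weights.
C_r = [normalize(d) for d in ([3.0, 1.0, 0.0], [2.0, 2.0, 0.0])]
C_nr = [normalize(d) for d in ([0.0, 1.0, 3.0], [0.0, 0.0, 2.0])]

q_opt = optimal_query(C_r, C_nr)
best = objective(q_opt, C_r, C_nr)

# Compare against randomly drawn query directions; none should score higher.
rng = random.Random(0)
random_best = max(
    objective([rng.gauss(0.0, 1.0) for _ in range(3)], C_r, C_nr)
    for _ in range(1000)
)
print(q_opt, best, random_best)
```

Under these assumptions the objective reduces to the dot product of the normalized query with the centroid difference, so no other query direction can score higher than `q_opt`. The sketch has access to both `C_r` and `C_nr` in full, which is exactly the information a real retrieval system lacks.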

© 2008 Cambridge University Press