The #rocchio71### algorithm.

Next: Probabilistic relevance feedback Up: The Rocchio algorithm for Previous: The underlying theory. Contents Index

The Rocchio (1971) algorithm.

$\includegraphics[totalheight=2.5in]{Rocchio2.eps}$ An application of Rocchio's algorithm.Some documents have been labeled as relevant and nonrelevant and the initial query vector is moved in response to this feedback.

This was the relevance feedback mechanism introduced in and popularized by Salton's SMART system around 1970. In a real IR query context, we have a user query and partial knowledge of known relevant and nonrelevant documents. The algorithm proposes using the modified query $\vec{q}_m$ :

$\begin{displaymath} \vec{q}_m = \alpha \vec{q}_0 + \beta\frac{1}{\vert D_r\vert}... ...c{1}{\vert D_{nr}\vert}\sum_{\vec{d}_j \in D_{nr}} \vec{d}_j\end{displaymath}$

(49)

where

is the original query vector,

and $D_{nr}$ are the set of known relevant and nonrelevant documents respectively, and $\alpha$ , $\beta$ , and $\gamma$ are weights attached to each term. These control the balance between trusting the judged document set versus the query: if we have a lot of judged documents, we would like a higher $\beta$ and $\gamma$ . Starting from

, the new query moves you some distance toward the centroid of the relevant documents and some distance away from the centroid of the nonrelevant documents. This new query can be used for retrieval in the standard vector space model (see Section 6.3 ). We can easily leave the positive quadrant of the vector space by subtracting off a nonrelevant document's vector. In the Rocchio algorithm, negative term weights are ignored. That is, the term weight is set to 0. Figure 9.4 shows the effect of applying relevance feedback.

Relevance feedback can improve both recall and precision. But, in practice, it has been shown to be most useful for increasing recall in situations where recall is important. This is partly because the technique expands the query, but it is also partly an effect of the use case: when they want high recall, users can be expected to take time to review results and to iterate on the search. Positive feedback also turns out to be much more valuable than negative feedback, and so most IR systems set $\gamma < \beta$ . Reasonable values might be $\alpha = 1$ , $\beta = 0.75$ , and $\gamma = 0.15$ . In fact, many systems, such as the image search system in Figure 9.1 , allow only positive feedback, which is equivalent to setting $\gamma = 0$ . Another alternative is to use only the marked nonrelevant document which received the highest ranking from the IR system as negative feedback (here, $\vert D_{nr}\vert=1$ in Equation 49). While many of the experimental results comparing various relevance feedback variants are rather inconclusive, some studies have suggested that this variant, called Ide dec-hi is the most effective or at least the most consistent performer.

Next: Probabilistic relevance feedback Up: The Rocchio algorithm for Previous: The underlying theory. Contents Index

© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.
2009-04-07