next up previous contents index
Next: An appraisal and some Up: The Binary Independence Model Previous: Probability estimates in practice   Contents   Index

Probabilistic approaches to relevance feedback

We can use (pseudo-)relevance feedback, perhaps in an iterative process of estimation, to get a more accurate estimate of $p_t$. The probabilistic approach to relevance feedback works as follows:

  1. Guess initial estimates of $p_t$ and $u_t$. This can be done using the probability estimates of the previous section. For instance, we can assume that $p_t$ is constant over all $x_t$ in the query, in particular, perhaps taking $p_t = \frac{1}{2}$.
  2. Use the current estimates of $p_t$ and $u_t$ to determine a best guess at the set of relevant documents $R =\{d: R_{d,q} = 1\}$. Use this model to retrieve a set of candidate relevant documents, which we present to the user.

  3. We interact with the user to refine the model of $R$. We do this by learning from the user relevance judgments for some subset of documents $V$. Based on relevance judgments, $V$ is partitioned into two subsets: $VR = \{d \in V, R_{d,q} = 1\} \subset R$ and $VNR = \{ d \in V, R_{d,q} = 0 \}$, which is disjoint from $R$.

  4. We reestimate $p_t$ and $u_t$ on the basis of known relevant and nonrelevant documents. If the sets $VR$ and $VNR$ are large enough, we may be able to estimate these quantities directly from these documents as maximum likelihood estimates:
p_t = \vert VR_t\vert/\vert VR\vert
\end{displaymath} (77)

    (where $VR_t$ is the set of documents in $VR$ containing $x_t$). In practice, we usually need to smooth these estimates. We can do this by adding $\frac{1}{2}$ to both the count $\vert VR_t\vert$ and to the number of relevant documents not containing the term, giving:
p_t = \frac{\vert VR_t\vert + \frac{1}{2}}{\vert VR\vert + 1}
\end{displaymath} (78)

    However, the set of documents judged by the user ($V$) is usually very small, and so the resulting statistical estimate is quite unreliable (noisy), even if the estimate is smoothed. So it is often better to combine the new information with the original guess in a process of Bayesian updating . In this case we have:
p_t^{(k+1)} = \frac{\vert VR_t\vert + \kappa p_t^{(k)}}{\vert VR\vert + \kappa}
\end{displaymath} (79)

    Here $p_t^{(k)}$ is the $k^{th}$ estimate for $p_t$ in an iterative updating process and is used as a Bayesian prior in the next iteration with a weighting of $\kappa$. Relating this equation back to Equation 59 requires a bit more probability theory than we have presented here (we need to use a beta distribution prior, conjugate to the Bernoulli random variable $X_t$). But the form of the resulting equation is quite straightforward: rather than uniformly distributing pseudocounts, we now distribute a total of $\kappa$ pseudocounts according to the previous estimate, which acts as the prior distribution. In the absence of other evidence (and assuming that the user is perhaps indicating roughly 5 relevant or nonrelevant documents) then a value of around $\kappa =5$ is perhaps appropriate. That is, the prior is strongly weighted so that the estimate does not change too much from the evidence provided by a very small number of documents.

  5. Repeat the above process from step 2, generating a succession of approximations to $R$ and hence $p_t$, until the user is satisfied.

It is also straightforward to derive a pseudo-relevance feedback version of this algorithm, where we simply pretend that $VR = V$. More briefly:

  1. Assume initial estimates for $p_t$ and $u_t$ as above.
  2. Determine a guess for the size of the relevant document set. If unsure, a conservative (too small) guess is likely to be best. This motivates use of a fixed size set $V$ of highest ranked documents.

  3. Improve our guesses for $p_t$ and $u_t$. We choose from the methods of and 79 for re-estimating $p_t$, except now based on the set $V$ instead of $VR$. If we let $V_t$ be the subset of documents in $V$ containing $x_t$ and use add $\frac{1}{2}$ smoothing , we get:
p_t = \frac{\vert V_t\vert+\frac{1}{2}}{\vert V\vert+1}
\end{displaymath} (80)

    and if we assume that documents that are not retrieved are nonrelevant then we can update our $u_t$ estimates as:
u_t = \frac{\docf_t - \vert V_t\vert + \frac{1}{2}}{N - \vert V\vert + 1}
\end{displaymath} (81)

  4. Go to step 2 until the ranking of the returned results converges.

Once we have a real estimate for $p_t$ then the $c_t$ weights used in the $RSV$ value look almost like a tf-idf value. For instance, using Equation 73, Equation 76, and Equation 80, we have:

c_t = \log \left[\frac{p_t}{1-p_t}\cdot\frac{1-u_t}{u_t}\rig...
...vert V\vert - \vert V_t\vert + 1}\cdot\frac{N}{\docf_t}\right]
\end{displaymath} (82)

But things aren't quite the same: $p_t/(1-p_t)$ measures the (estimated) proportion of relevant documents that the term $t$ occurs in, not term frequency. Moreover, if we apply log identities:
c_t = \log\frac{\vert V_t\vert+\frac{1}{2}}{\vert V\vert - \vert V_t\vert + 1} + \log\frac{N}{\docf_t}
\end{displaymath} (83)

we see that we are now adding the two log scaled components rather than multiplying them.


next up previous contents index
Next: An appraisal and some Up: The Binary Independence Model Previous: Probability estimates in practice   Contents   Index
© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.