

Using query likelihood language models in IR

Language modeling is a quite general formal approach to IR, with many variant realizations. The original and basic method for using language models in IR is the query likelihood model. In it, we construct from each document $d$ in the collection a language model $M_d$. Our goal is to rank documents by $P(d\vert q)$, where the probability of a document is interpreted as the likelihood that it is relevant to the query. Using Bayes' rule (as introduced in the chapter on probabilistic IR), we have:

\begin{displaymath}
P(d\vert q) = P(q\vert d)P(d)/P(q)
\end{displaymath}

$P(q)$ is the same for all documents and so can be ignored. The prior probability of a document, $P(d)$, is often treated as uniform across all $d$, in which case it too can be ignored; alternatively, we could implement a genuine prior incorporating criteria such as authority, length, genre, recency, and how many people have previously read the document. Given these simplifications, we rank results simply by $P(q\vert d)$, the probability of the query $q$ under the language model derived from $d$. The language modeling approach thus attempts to model the query generation process: documents are ranked by the probability that the query would be observed as a random sample from the respective document model.

The most common way to do this is with a multinomial unigram language model, which is equivalent to a multinomial Naive Bayes model (Section 13.3), where the documents are the classes, each treated in the estimation as a separate ``language''. Under this model, we have:

\begin{displaymath}P(q\vert M_d) = K_q \prod_{t \in V} P(t\vert M_d)^{\termf_{t,q}}
\end{displaymath}

where, again, $K_q = L_q!/(\termf_{t_1,q}!\termf_{t_2,q}!\cdots
\termf_{t_M,q}!)$ is the multinomial coefficient for the query $q$, which we will henceforth ignore, since it is a constant for a particular query.
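As an illustrative sketch (not from the book), the query likelihood under a maximum likelihood unigram model can be computed directly, with the constant multinomial coefficient $K_q$ dropped; the function name and toy document are assumptions for illustration:

```python
from collections import Counter

def query_likelihood(query_terms, doc_terms):
    """P(q | M_d) under an MLE unigram model, ignoring the
    constant multinomial coefficient K_q."""
    tf = Counter(doc_terms)      # term frequencies tf_{t,d}
    L_d = len(doc_terms)         # document length
    p = 1.0
    for t in query_terms:
        p *= tf[t] / L_d         # MLE estimate: P(t | M_d) = tf_{t,d} / L_d
    return p

doc = "click go the shears boys click click click".split()
# P(click | M_d) = 4/8, P(shears | M_d) = 1/8
print(query_likelihood(["click", "shears"], doc))  # 0.0625
```

Note that any query term absent from the document drives the product to zero; smoothing, discussed in the next section, addresses this.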

For retrieval based on a language model (henceforth LM), we treat the generation of queries as a random process. The approach is to:

  1. Infer an LM for each document.
  2. Estimate $P(q\vert M_{d_i})$, the probability of generating the query according to each of these document models.
  3. Rank the documents according to these probabilities.
The intuition of the basic model is that the user has a prototype document in mind and generates a query based on words that appear in this document. Often, users have a reasonable idea of terms that are likely to occur in documents of interest, and they will choose query terms that distinguish these documents from others in the collection. Collection statistics are an integral part of the language model, rather than being used heuristically as in many other approaches.
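The three steps above can be sketched end to end in Python; the function name and the two-document toy collection are assumptions for illustration, and log probabilities are used to avoid underflow:

```python
import math
from collections import Counter

def rank_by_query_likelihood(query, docs):
    """Rank documents by log P(q | M_d) under MLE unigram models.
    A document missing any query term scores -inf (unsmoothed)."""
    scores = {}
    for doc_id, text in docs.items():
        terms = text.split()
        tf, L_d = Counter(terms), len(terms)   # step 1: infer M_d
        score = 0.0
        for t in query.split():                # step 2: estimate P(q | M_d)
            score += math.log(tf[t] / L_d) if tf[t] else float("-inf")
        scores[doc_id] = score
    # step 3: rank by descending query likelihood
    return sorted(scores, key=scores.get, reverse=True)

docs = {
    "d1": "click go the shears boys click click click",
    "d2": "the shears are sharp",
}
print(rank_by_query_likelihood("click shears", docs))  # ['d1', 'd2']
```

Here d2 scores $-\infty$ because it contains no occurrence of "click"; smoothing the document model, covered next, is what makes such documents comparable.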


© 2008 Cambridge University Press