     Next: Estimating the query generation Up: The query likelihood model Previous: The query likelihood model   Contents   Index

## Using query likelihood language models in IR

Language modeling is a quite general formal approach to IR, with many variant realizations. The original and basic method for using language models in IR is the query likelihood model . In it, we construct from each document in the collection a language model . Our goal is to rank documents by , where the probability of a document is interpreted as the likelihood that it is relevant to the query. Using Bayes rule (as introduced in probirsec), we have: (98) is the same for all documents, and so can be ignored. The prior probability of a document is often treated as uniform across all and so it can also be ignored, but we could implement a genuine prior which could include criteria like authority, length, genre, newness, and number of previous people who have read the document. But, given these simplifications, we return results ranked by simply , the probability of the query under the language model derived from . The Language Modeling approach thus attempts to model the query generation process: Documents are ranked by the probability that a query would be observed as a random sample from the respective document model.

The most common way to do this is using the multinomial unigram language model, which is equivalent to a multinomial Naive Bayes model (page 13.3 ), where the documents are the classes, each treated in the estimation as a separate language''. Under this model, we have that: (99)

where, again is the multinomial coefficient for the query , which we will henceforth ignore, since it is a constant for a particular query.

For retrieval based on a language model (henceforth LM ), we treat the generation of queries as a random process. The approach is to

1. Infer a LM for each document.
2. Estimate , the probability of generating the query according to each of these document models.
3. Rank the documents according to these probabilities.
The intuition of the basic model is that the user has a prototype document in mind, and generates a query based on words that appear in this document. Often, users have a reasonable idea of terms that are likely to occur in documents of interest and they will choose query terms that distinguish these documents from others in the collection. Collection statistics are an integral part of the language model, rather than being used heuristically as in many other approaches.     Next: Estimating the query generation Up: The query likelihood model Previous: The query likelihood model   Contents   Index