next up previous contents index
Next: Deriving a ranking function Up: Probabilistic information retrieval Previous: The PRP with retrieval   Contents   Index

The Binary Independence Model

The Binary Independence Model (BIM) we present in this section is the model that has traditionally been used with the PRP. It introduces some simple assumptions, which make estimating the probability function $P(R\vert d,q)$ practical. Here, ``binary'' is equivalent to Boolean: documents and queries are both represented as binary term incidence vectors. That is, a document $d$ is represented by the vector $\vec{x} = (x_1, \ldots, x_M)$ where $x_t = 1$ if term $t$ is present in document $d$ and $x_t = 0$ if $t$ is not present in $d$. With this representation, many possible documents have the same vector representation. Similarly, we represent $q$ by the incidence vector $\vec{q}$ (the distinction between $q$ and $\vec{q}$ is less central since commonly $q$ is in the form of a set of words). ``Independence'' means that terms are modeled as occurring in documents independently. The model recognizes no association between terms. This assumption is far from correct, but it nevertheless often gives satisfactory results in practice; it is the ``naive'' assumption of Naive Bayes models, discussed further in Section 13.4 (page [*]). Indeed, the Binary Independence Model is exactly the same as the multivariate Bernoulli Naive Bayes model presented in Section 13.3 (page [*]). In a sense this assumption is equivalent to an assumption of the vector space model, where each term is a dimension that is orthogonal to all other terms.

We will first present a model which assumes that the user has a single step information need. As discussed in Chapter 9 , seeing a range of results might let the user refine their information need. Fortunately, as mentioned there, it is straightforward to extend the Binary Independence Model so as to provide a framework for relevance feedback, and we present this model in Section 11.3.4 .

To make a probabilistic retrieval strategy precise, we need to estimate how terms in documents contribute to relevance, specifically, we wish to know how term frequency, document frequency, document length, and other statistics that we can compute influence judgments about document relevance, and how they can be reasonably combined to estimate the probability of document relevance. We then order documents by decreasing estimated probability of relevance.

We assume here that the relevance of each document is independent of the relevance of other documents. As we noted in Section 8.5.1 (page [*]), this is incorrect: the assumption is especially harmful in practice if it allows a system to return duplicate or near duplicate documents. Under the BIM, we model the probability $P(R\vert d,q)$ that a document is relevant via the probability in terms of term incidence vectors $P(R\vert\vec{x}, \vec{q})$. Then, using Bayes rule, we have:

$\displaystyle P(R=1\vert\vec{x}, \vec{q})$ $\textstyle =$ $\displaystyle \frac{P(\vec{x}\vert R=1, \vec{q})P(R=1\vert\vec{q})}{P(\vec{x}\vert\vec{q})}$ (63)
$\displaystyle P(R=0\vert\vec{x}, \vec{q})$ $\textstyle =$ $\displaystyle \frac{P(\vec{x}\vert R=0, \vec{q})P(R=0\vert\vec{q})}{P(\vec{x}\vert\vec{q})}$ (64)

Here, $P(\vec{x}\vert R=1,\vec{q})$ and $P(\vec{x}\vert R=0,\vec{q})$ are the probability that if a relevant or nonrelevant, respectively, document is retrieved, then that document's representation is $\vec{x}$. You should think of this quantity as defined with respect to a space of possible documents in a domain. How do we compute all these probabilities? We never know the exact probabilities, and so we have to use estimates: Statistics about the actual document collection are used to estimate these probabilities. $P(R=1\vert\vec{q})$ and $P(R=0\vert\vec{q})$ indicate the prior probability of retrieving a relevant or nonrelevant document respectively for a query $\vec{q}$. Again, if we knew the percentage of relevant documents in the collection, then we could use this number to estimate $P(R=1\vert\vec{q})$ and $P(R=0\vert\vec{q})$. Since a document is either relevant or nonrelevant to a query, we must have that:
P(R=1\vert\vec{x}, \vec{q}) + P(R=0\vert\vec{x},\vec{q}) = 1
\end{displaymath} (65)

next up previous contents index
Next: Deriving a ranking function Up: Probabilistic information retrieval Previous: The PRP with retrieval   Contents   Index
© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.