next up previous contents index
Next: Positional index size. Up: Positional postings and phrase Previous: Biword indexes   Contents   Index


Positional indexes

For the reasons given, a biword index is not the standard solution. Rather, a positional index is most commonly employed. Here, for each term in the vocabulary, we store postings of the form docID: $\langle$position1, position2, ...$\rangle$, as shown in Figure 2.11 , where each position is a token index in the document. Each posting will also usually record the term frequency, for reasons discussed in Chapter 6 .

\begin{figure}
% latex2html id marker 2653
\raggedright
\term{to}, 993427:\\
\h...
...77, and
occurs 6 times in document 1 at positions 7, 18, 33, etc.}
\end{figure}

To process a phrase query, you still need to access the inverted index entries for each distinct term. As before, you would start with the least frequent term and then work to further restrict the list of possible candidates. In the merge operation, the same general technique is used as before, but rather than simply checking that both terms are in a document, you also need to check that their positions of appearance in the document are compatible with the phrase query being evaluated. This requires working out offsets between the words.

Worked example. Satisfying phrase queries.phrasequery Suppose the postings lists for to and be are as in Figure 2.11 , and the query is ``to be or not to be''. The postings lists to access are: to, be, or, not. We will examine intersecting the postings lists for to and be. We first look for documents that contain both terms. Then, we look for places in the lists where there is an occurrence of be with a token index one higher than a position of to, and then we look for another occurrence of each word with token index 4 higher than the first occurrence. In the above lists, the pattern of occurrences that is a possible match is:

to: $\langle$...; 4:$\langle$...,429,433$\rangle$; ...$\rangle$
be: $\langle$...; 4:$\langle$...,430,434$\rangle$; ...$\rangle$
End worked example.

\begin{figure}
% latex2html id marker 2691
\par
\begin{algorithm}{PositionalInte...
...of triples giving docID and the term position
in $p_1$\ and $p_2$.}
\end{figure}

The same general method is applied for within $k$ word proximity searches, of the sort we saw in westlaw:

employment /3 place
Here, /$k$ means ``within $k$ words of (on either side)''. Clearly, positional indexes can be used for such queries; biword indexes cannot. We show in Figure 2.12 an algorithm for satisfying within $k$ word proximity searches; it is further discussed in Exercise 2.4.3 .



Subsections
next up previous contents index
Next: Positional index size. Up: Positional postings and phrase Previous: Biword indexes   Contents   Index
© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.
2009-04-07