next up previous contents index
Next: Positional postings and phrase Up: The term vocabulary and Previous: Stemming and lemmatization   Contents   Index


Faster postings list intersection via skip pointers

In the remainder of this chapter, we will discuss extensions to postings list data structures and ways to increase the efficiency of using postings lists. Recall the basic postings list intersection operation from Section 1.3 (page [*]): we walk through the two postings lists simultaneously, in time linear in the total number of postings entries. If the list lengths are $m$ and $n$, the intersection takes $O(m+n)$ operations. Can we do better than this? That is, empirically, can we usually process postings list intersection in sublinear time? We can, if the index isn't changing too fast.

One way to do this is to use a skip list by augmenting postings lists with skip pointers (at indexing time), as shown in Figure 2.9 . Skip pointers are effectively shortcuts that allow us to avoid processing parts of the postings list that will not figure in the search results. The two questions are then where to place skip pointers and how to do efficient merging using skip pointers.

\includegraphics[width=3.5in]{SkipLists.eps}

Postings lists with skip pointers.The postings intersection can use a skip pointer when the end point is still less than the item on the other list.

Figure 2.10: Postings lists intersection with skip pointers.
\begin{figure}\begin{algorithm}{IntersectWithSkips}{p_1, p_2}
answer \= \langle\...
...{IF}\end{IF}\end{IF}\end{WHILE} \\
\RETURN{answer}
\end{algorithm}
\end{figure}

Consider first efficient merging, with Figure 2.9 as an example. Suppose we've stepped through the lists in the figure until we have matched \framebox{8} on each list and moved it to the results list. We advance both pointers, giving us \framebox{16} on the upper list and \framebox{41} on the lower list. The smallest item is then the element \framebox{16} on the top list. Rather than simply advancing the upper pointer, we first check the skip list pointer and note that 28 is also less than 41. Hence we can follow the skip list pointer, and then we advance the upper pointer to \framebox{28}. We thus avoid stepping to \framebox{19} and \framebox{23} on the upper list. A number of variant versions of postings list intersection with skip pointers is possible depending on when exactly you check the skip pointer. One version is shown in Figure 2.10 . Skip pointers will only be available for the original postings lists. For an intermediate result in a complex query, the call $\mbox{\emph{hasSkip}}(p)$ will always return false. Finally, note that the presence of skip pointers only helps for AND queries, not for OR queries.

Where do we place skips? There is a tradeoff. More skips means shorter skip spans, and that we are more likely to skip. But it also means lots of comparisons to skip pointers, and lots of space storing skip pointers. Fewer skips means few pointer comparisons, but then long skip spans which means that there will be fewer opportunities to skip. A simple heuristic for placing skips, which has been found to work well in practice, is that for a postings list of length $P$, use $\sqrt{P}$ evenly-spaced skip pointers. This heuristic can be improved upon; it ignores any details of the distribution of query terms.

Building effective skip pointers is easy if an index is relatively static; it is harder if a postings list keeps changing because of updates. A malicious deletion strategy can render skip lists ineffective.

Choosing the optimal encoding for an inverted index is an ever-changing game for the system builder, because it is strongly dependent on underlying computer technologies and their relative speeds and sizes. Traditionally, CPUs were slow, and so highly compressed techniques were not optimal. Now CPUs are fast and disk is slow, so reducing disk postings list size dominates. However, if you're running a search engine with everything in memory then the equation changes again. We discuss the impact of hardware parameters on index construction time in Section 4.1 (page [*]) and the impact of index size on system speed in Chapter 5 .

Exercises.


next up previous contents index
Next: Positional postings and phrase Up: The term vocabulary and Previous: Stemming and lemmatization   Contents   Index
© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.
2009-04-07