Chapter 1 introduced the dictionary and the inverted index as the central data structures in information retrieval (IR). In this chapter, we employ a number of compression techniques for dictionary and inverted index that are essential for efficient IR systems.
One benefit of compression is immediately clear. We need less disk space. As we will see, compression ratios of 1:4 are easy to achieve, potentially cutting the cost of storing the index by 75%.
There are two
more subtle benefits of compression. The first is increased use of
caching. Search systems use some parts of the dictionary and
the index much more than others. For example, if we cache
the postings list of a frequently used query term , then
the computations necessary for responding to the one-term
query
can be entirely done in memory.
With compression, we can fit a lot more information into
main memory. Instead of having to expend a disk seek when
processing a query with
, we instead access its postings
list in memory and decompress it. As we will see below,
there are simple and efficient decompression methods,
so that the penalty of having to
decompress the postings list is small. As a result, we are
able to decrease the response time of the IR system
substantially. Because memory is a more expensive resource
than disk space, increased speed owing to caching - rather
than decreased space requirements - is often the prime
motivator for compression.
The second more subtle advantage of compression is faster transfer of data from disk to memory. Efficient decompression algorithms run so fast on modern hardware that the total time of transferring a compressed chunk of data from disk and then decompressing it is usually less than transferring the same chunk of data in uncompressed form. For instance, we can reduce input/output (I/O) time by loading a much smaller compressed postings list, even when you add on the cost of decompression. So, in most cases, the retrieval system runs faster on compressed postings lists than on uncompressed postings lists.
If the main goal of compression is to conserve disk space, then the speed of compression algorithms is of no concern. But for improved cache utilization and faster disk-to-memory transfer, decompression speeds must be high. The compression algorithms we discuss in this chapter are highly efficient and can therefore serve all three purposes of index compression.
In this chapter, we
define a posting as a docID in a postings list. For
example, the postings list (6; 20, 45, 100), where 6 is the
termID of the list's term, contains three postings.
As discussed in Section 2.4.2 (page ),
postings in
most search systems also contain frequency and position
information; but
we will only consider simple docID
postings here. See
Section 5.4 for references on compressing
frequencies and positions.
This chapter first gives a statistical characterization of
the distribution of the entities we want to compress -
terms and postings in large collections
(Section 5.1 ). We then look at compression of
the dictionary, using the dictionary-as-a-string method and
blocked storage (Section 5.2 ).
Section 5.3 describes two techniques for
compressing the postings file, variable byte encoding and
encoding.