Table 5.5: Some examples of unary and γ codes. Unary codes are only shown for the smaller numbers. Commas in γ codes are for readability only and are not part of the actual codes.

number | unary code | length | offset | γ code
0 | 0 | | |
1 | 10 | 0 | | 0
2 | 110 | 10 | 0 | 10,0
3 | 1110 | 10 | 1 | 10,1
4 | 11110 | 110 | 00 | 110,00
9 | 1111111110 | 1110 | 001 | 1110,001
13 | | 1110 | 101 | 1110,101
24 | | 11110 | 1000 | 11110,1000
511 | | 111111110 | 11111111 | 111111110,11111111
1025 | | 11111111110 | 0000000001 | 11111111110,0000000001
VB codes use an adaptive number of bytes
depending on the size of the gap. Bit-level codes adapt the
length of the code on the finer-grained bit level.
The simplest bit-level code is unary code. The unary
code of n is a string of n
1s followed by a 0 (see the
first two columns of Table 5.5).
Obviously,
this is not a very efficient code, but it will come in handy
in a moment.
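As a minimal sketch of this definition (in Python; the function name is ours, not from the text):

```python
def unary_encode(n: int) -> str:
    """Unary code of n: a string of n 1s followed by a terminating 0."""
    return "1" * n + "0"

print(unary_encode(3))   # 1110
print(unary_encode(0))   # 0
```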
How efficient can a code be in principle? Assuming the
2^n gaps G with 1 ≤ G ≤ 2^n are all equally
likely, the optimal encoding uses n bits for each
G. So some gaps (G = 2^n in this case) cannot be encoded with
fewer than log₂ G bits. Our goal is to get as close to
this lower bound as possible.
A method that is within a factor of optimal
is γ encoding.
γ codes implement variable-length encoding by splitting the
representation of a gap G
into a pair of length and offset.
Offset is G in binary, but with the leading 1
removed.
For example, for 13 (binary 1101) offset is 101.
Length encodes the length of offset
in unary code.
For 13,
the length of offset is 3 bits, which is 1110 in
unary. The γ code of 13 is therefore 1110101, the
concatenation of length 1110 and offset 101.
The right-hand column of Table 5.5 gives additional
examples of γ codes.
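The following Python sketch implements this length/offset construction (the function name and the bit-string representation are our own choices for illustration):

```python
def gamma_encode(gap: int) -> str:
    """Gamma code of a gap G >= 1: unary code of the offset length, then the offset."""
    assert gap >= 1
    offset = bin(gap)[2:][1:]         # G in binary with the leading 1 removed, e.g. 13 -> '101'
    length = "1" * len(offset) + "0"  # length of offset in unary code, e.g. '1110'
    return length + offset

print(gamma_encode(13))    # 1110101 (length 1110, offset 101)
print(gamma_encode(1))     # 0 (offset is empty; length is the unary code of 0)
print(gamma_encode(1025))  # 111111111100000000001, as in Table 5.5
```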
A γ code is decoded by first reading the unary code
up to the 0 that terminates it, for example, the four bits 1110
when decoding 1110101. Now we know how long the
offset is: 3 bits. The offset 101 can then be read correctly and
the 1 that was chopped off in encoding is prepended: 101 →
1101 = 13.
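A corresponding decoder sketch, reading a concatenation of γ codes from a string of bits (again, names and the string representation are illustrative only):

```python
def gamma_decode_stream(bits: str) -> list:
    """Decode a concatenation of gamma codes into the list of gaps they encode."""
    gaps, i = [], 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":            # read the unary part ...
            length += 1
            i += 1
        i += 1                           # ... up to and including its terminating 0
        offset = bits[i:i + length]      # the next 'length' bits are the offset
        i += length
        gaps.append(int("1" + offset, 2))  # prepend the 1 that was chopped off in encoding
    return gaps

print(gamma_decode_stream("1110101"))          # [13]
print(gamma_decode_stream("1110101" + "100"))  # [13, 2] - no delimiter is needed between codes
```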
The length of offset is ⌊log₂ G⌋
bits and the length of length
is ⌊log₂ G⌋ + 1 bits,
so the length of the entire γ code is
2 × ⌊log₂ G⌋ + 1 bits.
γ codes are
always of odd length and they are within a factor of
2 of what we claimed to be
the optimal encoding length log₂ G.
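A quick numerical check of this length formula (our own sketch, not from the text):

```python
import math

def gamma_code_length(G: int) -> int:
    offset_bits = G.bit_length() - 1         # floor(log2 G) bits of offset
    return (offset_bits + 1) + offset_bits   # unary length part plus offset part

for G in [1, 2, 4, 9, 13, 24, 511, 1025]:
    assert gamma_code_length(G) == 2 * math.floor(math.log2(G)) + 1
```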
We derived this optimum
from the assumption
that the 2^n gaps
between 1 and 2^n
are equiprobable.
But this need not be the case. In general, we do not know the
probability distribution over gaps a priori.
The characteristic of a discrete probability distribution P
that determines its coding properties (including whether a code is
optimal) is its entropy H(P), which is defined as follows:

$H(P) = -\sum_{x \in X} P(x) \log_2 P(x)$ (4)

where X is the set of all possible numbers we need to be able to
encode (and therefore $\sum_{x \in X} P(x) = 1.0$).
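For illustration, entropy can be computed directly from this definition; the distributions below are made-up examples, not data from the text:

```python
import math

def entropy(P: dict) -> float:
    """H(P) = -sum over x in X of P(x) * log2 P(x), with P given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in P.values() if p > 0)

print(entropy({1: 0.5, 2: 0.5}))                      # 1.0 bit: two equiprobable gap sizes
print(entropy({1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}))  # 2.0 bits: four equiprobable gap sizes
print(entropy({1: 0.9, 2: 0.1}))                      # ~0.47 bits: a highly skewed distribution
```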
It can be shown
that the lower bound for the expected length E(L) of a code L
is H(P)
if certain conditions hold (see the references). It can
further be shown that for 1 < H(P) < ∞,
γ encoding is within a factor of 3 of this optimal encoding,
approaching 2 for large H(P):

$\frac{E(L_\gamma)}{H(P)} \le 2 + \frac{1}{H(P)} \le 3$ (5)

What is remarkable about this bound is that it holds for any
distribution P. A code that is within such a factor of optimal for an
arbitrary distribution is called universal; γ codes are universal in
this sense.
In addition to universality,
γ codes have two other properties that are useful for index
compression. First, they are
prefix free, namely, no γ
code is the prefix of another. This means that
there is always a unique decoding of a sequence of γ
codes - and we do not need delimiters between them,
which would decrease the efficiency of the code. The second
property is that
γ codes are
parameter free. For many other efficient codes, we
have to fit the parameters of a model (e.g., the
binomial distribution) to
the distribution of gaps in the index. This complicates the
implementation of compression and decompression.
For instance, the
parameters need to be stored and retrieved. And in dynamic indexing, the distribution of
gaps can change, so that the original parameters are no longer
appropriate. These problems are avoided with a
parameter-free code.
How much compression of the inverted index do γ codes
achieve? To answer this question we use Zipf's law, the term
distribution model introduced in Section 5.1.2.
According to Zipf's law, the collection frequency cf_i
is proportional to the
inverse of the rank i, that is, there is a constant c
such that:
$\mathrm{cf}_i = \frac{c}{i}$ (6)

We can choose a different constant c so that the fractions c/i are
relative frequencies and sum to 1:

$1 = \sum_{i=1}^{M} \frac{c}{i} = c \sum_{i=1}^{M} \frac{1}{i} = c \, H_M$ (7)

$c = \frac{1}{H_M}$ (8)

where M is the number of distinct terms and H_M is the Mth harmonic
number. Reuters-RCV1 has M = 400,000 distinct terms, and since
H_M ≈ ln M, it follows that

$c = \frac{1}{H_M} \approx \frac{1}{\ln M} = \frac{1}{\ln 400{,}000} \approx \frac{1}{13}$ (9)

Thus the ith term has a relative frequency of roughly 1/(13 i), and
the expected number of occurrences of term i in a document of average
length L ≈ 200 is:

$L \, \frac{c}{i} \approx \frac{200 \times \frac{1}{13}}{i} \approx \frac{15}{i}$ (10)
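To make the arithmetic concrete, here is a small Python sketch of these estimates; M and the average document length of about 200 tokens are the Reuters-RCV1 figures used in the text, while the variable names are ours:

```python
import math

M = 400_000                                   # number of distinct terms in Reuters-RCV1
H_M = sum(1.0 / i for i in range(1, M + 1))   # Mth harmonic number
c = 1.0 / H_M
print(c, 1.0 / math.log(M))                   # ~0.074 and ~0.078, i.e. roughly 1/13

L = 200                                       # average document length in tokens
for i in (1, 10, 100):
    print(i, L * c / i)                       # expected occurrences of term i per document, ~15/i
```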
Now we have derived term statistics that characterize the distribution of terms in the collection and, by extension, the distribution of gaps in the postings lists. From these statistics, we can calculate the space requirements for an inverted index compressed with γ encoding. We first stratify the vocabulary into blocks of Lc = 15 terms each: term i occurs about 15/i times per document, so the terms of the first block occur in essentially every document and have gaps of size 1, and the terms of the jth block have about N/j gaps of size j, where N is the number of documents (assuming, as in the figure, that all gaps of a given term have the same size).

Encoding the N/j gaps of size j
with γ codes, the number of
bits needed for the postings list
of a term in the
jth block (corresponding to one row in
the figure) is:

bits-per-row = $\frac{N}{j} \times (2 \times \lfloor \log_2 j \rfloor + 1) \approx \frac{2 N \log_2 j}{j}$

Encoding all Lc terms of a block takes Lc times as many bits, and there
are M/Lc blocks, so the postings file as a whole will take up about

$\sum_{j=1}^{M/Lc} \frac{2 N \, Lc \, \log_2 j}{j}$

bits. For Reuters-RCV1, M/Lc ≈ 400,000/15 ≈ 27,000
and

$\sum_{j=1}^{27{,}000} \frac{2 \times 800{,}000 \times 15 \, \log_2 j}{j} \approx 224 \ \text{MB},$

so the model predicts a compressed postings file of about 224 MB for the 960 MB collection.
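The sketch below evaluates this sum numerically; N (the roughly 800,000 documents of Reuters-RCV1), M, and the block size are the values used above, and the result is close to the estimate in the text:

```python
import math

N = 800_000            # number of documents in Reuters-RCV1
M = 400_000            # number of distinct terms
Lc = 15                # terms per block (L * c = 200 * 1/13 ~ 15)
blocks = M // Lc       # ~27,000 blocks

# each of the Lc terms in block j contributes about N/j gaps of size j,
# and a gamma-coded gap of size j costs roughly 2 * log2(j) bits
bits = sum(2 * N * Lc * math.log2(j) / j for j in range(2, blocks + 1))
print(bits / 8 / 1e6)  # roughly 224 MB
```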
When we run γ compression on Reuters-RCV1,
the actual size of the compressed index is even
lower: 101 MB, a bit more than one tenth of the size of the
collection. The reason for the discrepancy between
predicted and actual value is
that (i) Zipf's law is not a very good
approximation of the actual distribution of term frequencies
for Reuters-RCV1 and (ii) gaps are not uniform. The Zipf model predicts an index size of 251 MB
for the unrounded numbers from Table 4.2 . If
term frequencies are generated from the Zipf model and a
compressed index is created for these artificial terms, then
the compressed size is 254 MB. So to the extent that
the assumptions about the distribution of term frequencies
are accurate,
the predictions of the model are correct.
Table 5.6: Index and dictionary compression for Reuters-RCV1.

data structure | size in MB
dictionary, fixed-width | 11.2
dictionary, term pointers into string | 7.6
∼, with blocking, k = 4 | 7.1
∼, with blocking & front coding | 5.9
collection (text, xml markup etc) | 3600.0
collection (text) | 960.0
term incidence matrix | 40,000.0
postings, uncompressed (32-bit words) | 400.0
postings, uncompressed (20 bits) | 250.0
postings, variable byte encoded | 116.0
postings, γ encoded | 101.0
Table 5.6 summarizes
the compression techniques covered in this chapter.
The
term incidence matrix
(Figure 1.1)
for Reuters-RCV1 has size 400,000 × 800,000 = 3.2 × 10^11
bits or 40 GB.
The numbers for the collection (3600 MB and 960 MB) are for
the encoding of RCV1 on CD, which uses one byte per
character, not Unicode.
γ codes achieve great compression ratios - about
15% better than variable byte codes for Reuters-RCV1. But they
are expensive to decode. This is because many
bit-level operations - shifts and masks - are necessary to
decode a sequence of γ codes as the boundaries between
codes will usually be somewhere in the middle of a machine
word. As a result, query processing is more expensive for
γ codes than for variable byte codes.
Whether we choose variable byte or γ encoding
depends on the characteristics of an application, for example,
on the relative
weights we give to conserving disk space versus maximizing query
response time.
The compression ratio for the index in
Table 5.6 is about 25%: 400 MB
(uncompressed, each posting stored as a 32-bit word) versus 101 MB
(γ) and
116 MB (VB). This shows that both γ
and VB codes meet
the objectives we stated at the beginning of the chapter.
Index compression substantially improves time and space
efficiency of indexes by reducing the amount of disk space needed,
increasing the amount of information that can be kept in
the cache, and speeding up data transfers from disk to memory.
Exercises.
γ-encoded gap sequence 1 | 1110110111111001011111111110100011111001
γ-encoded gap sequence 2 | 11111010000111111000100011111110010000011111010101