| docIDs | 824 | 829 | 215406 | |
| gaps | 5 | 214577 | ||
| VB code | 00000110 10111000 | 10000101 | 00001101 00001100 10110001 |
Variable byte (VB) encoding. Gaps are encoded using an integral number of bytes. The first bit, the continuation bit, of each byte indicates whether the code ends with this byte (1) or not (0).
Variable byte (VB) encoding
uses
an integral number of bytes to encode a gap. The last 7 bits of a byte are ``payload''
and encode part of the gap. The first bit of the byte is a
continuation bit . It is set to
1 for the last byte
of the encoded gap and to 0 otherwise. To decode a variable
byte code, we read a sequence of bytes with continuation
bit 0 terminated by a byte with continuation bit 1. We
then extract and concatenate the 7-bit
parts.
Figure 5.8 gives pseudocode for VB
encoding and decoding and
Table 5.4 an example of a VB-encoded
postings list.
With variable byte compression, the size of the compressed index for Reuters-RCV1 is 116 MB as we verified in an experiment. This is a more than 50% reduction of the size of the uncompressed index (see Table 5.6 ).
The idea of VB encoding can also be applied to larger or smaller units than bytes: 32-bit words, 16-bit words, and 4-bit words or nibbles . Larger words further decrease the amount of bit manipulation necessary at the cost of less effective (or no) compression. Word sizes smaller than bytes get even better compression ratios at the cost of more bit manipulation. In general, bytes offer a good compromise between compression ratio and speed of decompression.
For
most information retrieval systems variable byte codes
offer an excellent tradeoff between time and space.
They are also simple to implement - most of the
alternatives referred to in Section 5.4
are more complex. But if disk space is
a scarce resource, we can achieve better compression
ratios by using bit-level encodings, in particular two
closely related encodings:
codes, which we will turn to next, and
codes (Exercise 5.5 ).