Consider the postings list
with a corresponding list of gaps
.
Assume that the length of the postings list is stored separately, so the system knows when a postings list is complete.
Using variable byte encoding:
(i) What is the largest gap you can encode in 1 byte?
(ii) What is the largest gap you can encode in 2 bytes?
(iii) How many bytes will the above postings list require under this encoding? (Count only space for encoding the sequence of numbers.)
As we pointed out on page 1 ,
a gap cannot be of length 0. Similarly,
the stuff left to encode after shifting in variable byte
encoding cannot be
0. Based on these observations:
(i) Suggest a modification to variable byte encoding that
allows you to encode slightly larger gaps in the same amount
of space.
(ii) What is the largest gap you can encode in 1 byte?
(iii) What is the largest gap you can encode in 2 bytes?
(iv) How many bytes will the postings list in Exercise 5.5
require under this encoding?
(Count only space for encoding the sequence of numbers.)
From the following sequence of coded gaps,
reconstruct first the gap sequence and then the postings
sequence: 1110001110101011111101101111011.
-codes are relatively
inefficient for large numbers (e.g., 1025 in
Table 5.5 ) as they encode the length of the
offset in inefficient unary code. -codes
differ from -codes in that they encode the first
part of the code (length) in -code instead of
unary code. The encoding of offset is the
same. For example, the -code of 7 is 10,0,11 (again,
we add commas for readability). 10,0 is the -code
for length (2 in this case) and the encoding of offset
(11) is unchanged. (i) Compute the -codes for the other
numbers in Table 5.5 . For what range of numbers
is the -code shorter than the -code?
(ii) code beats variable byte code in
Table 5.6 because the index contains stop words and thus
many small gaps. Show that variable byte code is more
compact if larger gaps dominate. (iii) Compare the
compression ratios of code and variable byte code
for a distribution of gaps dominated by large gaps.
We have defined unary codes as being ``10'': sequences of 1s
terminated by a 0. Interchanging the roles of 0s and 1s
yields an equivalent ``01'' unary code. When this 01 unary
code is used, the construction of a
-code can be stated as follows: 1. Write
down in binary using
bits. 2. Prepend 0s.
(i) Encode the numbers in Table 5.5 in this alternative
-code.
(ii)
Show that this method produces
a well-defined alternative
-code in the sense that it has the same length and can
be uniquely decoded.
Unary code is not a universal code in the sense
defined above. However, there exists a distribution over gaps
for which unary code is optimal. Which distribution is this?
Give some examples of terms that violate the
assumption that gaps all have the same size (which we made
when estimating the space requirements of a encoded
index). What are
general characteristics of these terms?
Consider a term whose postings list has size ,
say, . Compare the size of the
-compressed gap-encoded postings list if the distribution of the
term is uniform (i.e., all gaps have the same size) vs. its size when the distribution is not uniform. Which
compressed postings list is smaller?
Work out the sum
in Equation 12
and show it adds up to about
251 MB. Use the numbers in Table 4.2 , but do not round , and the number of vocabulary blocks.
Go through the above calculation of index size and
explicitly state all the approximations that were made to
arrive at Equation 11.
For a collection of your choosing determine the number
of documents and terms and the average length of a
document. (i) How large is the inverted index predicted to be by
Equation 11? (ii) Implement an indexer that
creates a -compressed inverted index for the
collection. How large is the actual index? (iii) Implement an
indexer that uses variable byte encoding. How large is the
variable byte encoded index?
Table 5.7:
Two gap sequences to be merged in blocked sort-based indexing.
To be able to hold as many postings as possible in
main memory, it is a good idea to compress intermediate
index files during index construction. (i) This makes
merging runs in blocked sort-based indexing more complicated. As an
example, work out the encoded merged sequence of
the gaps in Table 5.7 . (ii) Index construction is
more space-efficient when using compression. Would you also
expect it to be faster?
(i) Show that the size of the vocabulary is finite
according to Zipf's law and infinite according to Heaps'
law. (ii) Can we derive Heaps' law from Zipf's law?