The five steps in constructing an
index for Reuters-RCV1 in blocked sort-based indexing.
Line numbers refer to Figure 4.2 .
Total index construction time in
blocked sort-based indexing is broken down in Table 4.3.
Fill out the time column of the table for Reuters-RCV1
assuming a system with the parameters given in Table 4.1 .
Table 4.4:
Collection statistics for a large collection.
symbol
statistic
value
# documents
1,000,000,000
# tokens per document
1000
# distinct terms
44,000,000
Repeat Exercise 4.8 for the larger
collection in Table 4.4 . Choose a block size
that is realistic for current technology
(remember that a block
should easily fit into main memory).
How many blocks do you need?
Assume that we have a collection of modest size
whose index can be constructed with the simple in-memory
indexing algorithm
in Figure 1.4 (page ). For this collection,
compare memory, disk and time requirements of
the simple algorithm in Figure 1.4 and
blocked sort-based indexing.
Assume that machines in MapReduce have 100 GB of disk
space each. Assume further that the postings list of the
term the has a size of 200 GB. Then the MapReduce
algorithm as described cannot be run to construct the index.
How would you modify
MapReduce so that it can handle this case?
For optimal load
balancing, the inverters in MapReduce must get segmented
postings files of similar sizes. For a new collection, the
distribution of key-value pairs may not be known in
advance. How would you solve this problem?
Apply MapReduce to the problem of counting how
often each term occurs in a set of files. Specify map and
reduce operations for this task. Write down an example
along the lines of Figure 4.6 .
We claimed above (page 4.5 ) that an
auxiliary index can impair the quality of collection statistics.
An example is the
term weighting method idf ,
which is defined as
where is the total number of documents and is the
number of documents that term occurs in
idf. Show that
even a small auxiliary index can cause significant error in idf
when it is computed on the main index only. Consider a
rare term that suddenly occurs frequently (e.g.,
Flossie as in Tropical Storm Flossie).