term-partitioned index. Most large search engines prefer a
document-partitioned index (which can be easily generated
from a term-partitioned index).
We discuss this topic further in Section 20.3.
The distributed index construction method we describe in this section is an application of MapReduce , a general architecture for distributed computing. MapReduce is designed for large computer clusters. The point of a cluster is to solve large computing problems on cheap commodity machines or nodes that are built from standard parts (processor, memory, disk) as opposed to on a supercomputer with specialized hardware. Although hundreds or thousands of machines are available in such clusters, individual machines can fail at any time. One requirement for robust distributed indexing is, therefore, that we divide the work up into chunks that we can easily assign and - in case of failure - reassign. A master node directs the process of assigning and reassigning tasks to individual worker nodes.
The map and reduce phases of MapReduce split
up the computing job into chunks that
standard machines can process in a short time. The various
steps of MapReduce are shown in
Figure 4.5, and an example on a collection consisting of two
documents is shown in Figure 4.6. First, the input data, in
our case a collection of web pages, are split into splits,
where the size of each split is chosen to ensure that the work
can be distributed evenly (chunks should not be too large) and
efficiently (the total number of chunks we need to manage
should not be too large); 16 or 64 MB are good split sizes in
distributed indexing. Splits are
not preassigned to machines, but are instead assigned by the
master node on an ongoing basis: As a machine finishes
processing one split, it is assigned the next one. If a
machine dies or becomes a laggard due to hardware problems,
the split it is working on is simply reassigned to another
machine.
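To make this assignment discipline concrete, here is a minimal sketch of the master's loop. The names cluster, next_idle_machine, and process_split are hypothetical placeholders for whatever worker interface the cluster provides, and a real master dispatches many splits to many workers concurrently rather than one at a time.

```python
# Minimal sketch of the master's split-assignment loop (hypothetical worker
# interface; a real master dispatches splits to many workers concurrently).
from queue import Queue

def run_map_phase(splits, cluster):
    pending = Queue()
    for split in splits:
        pending.put(split)                     # splits are not preassigned
    while not pending.empty():
        split = pending.get()
        worker = cluster.next_idle_machine()   # assigned on an ongoing basis
        ok = worker.process_split(split)       # run the map task on the worker
        if not ok:                             # worker died or became a laggard:
            pending.put(split)                 # simply reassign the split
```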
In general, MapReduce breaks a large computing problem into
smaller parts by recasting it in terms of manipulation of
key-value pairs. For indexing, a key-value pair has
the form (termID, docID).
In distributed indexing, the mapping
from terms to termIDs is also distributed and therefore more
complex than in single-machine indexing.
A simple solution is to maintain a (perhaps precomputed) mapping for frequent
terms that is copied to all nodes and to use terms directly
(instead of termIDs) for infrequent terms.
We do not address
this problem here and assume that all nodes share a
consistent term-to-termID mapping.
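To make the key-value formulation concrete, the following sketch illustrates the simple scheme just described. The dictionary contents and function names are purely illustrative: frequent terms get a precomputed termID that every node knows, and infrequent terms are used directly as keys.

```python
# Sketch of the simple keying scheme described above (illustrative only).
# Frequent terms use a precomputed termID copied to every node; infrequent
# terms are used directly, so no global ID assignment is needed at map time.
FREQUENT_TERM_IDS = {"the": 1, "of": 2, "and": 3}   # shared by all nodes

def key_for(term: str):
    # termID for frequent terms, the term string itself otherwise
    return FREQUENT_TERM_IDS.get(term, term)

def emit_pair(term: str, docID: str):
    # key-value pair of the form (termID, docID)
    return (key_for(term), docID)
```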
The map phase of MapReduce consists of
mapping splits of the input data to key-value pairs. This
is the same parsing task we also encountered in
BSBI and SPIMI,
and we therefore call the
machines that execute the map phase parsers. Each parser
writes its output to local intermediate files, the segment
files (shown as a-f, g-p, and q-z in Figure 4.5).
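A minimal sketch of a parser's map task follows, under the simplifying assumptions that a split is an iterable of (docID, text) pairs and that terms are used directly as keys, as Figure 4.6 does for readability. The two-document collection in the usage example is illustrative, not the exact documents of Figure 4.6.

```python
# Sketch of a parser's map task (simplified). A split is assumed to be an
# iterable of (docID, text) pairs; terms are used directly as keys here,
# as Figure 4.6 does for readability.
def map_split(split):
    for docID, text in split:
        for term in text.lower().split():      # the usual parsing/tokenization step
            yield (term, docID)                # emit a (term, docID) key-value pair

# Tiny illustrative two-document collection:
pairs = list(map_split([("d1", "caesar conquered"), ("d2", "caesar died")]))
# -> [('caesar', 'd1'), ('conquered', 'd1'), ('caesar', 'd2'), ('died', 'd2')]
```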
For the reduce phase , we want all values for a given key to
be stored close together, so that they can be read and
processed quickly. This is achieved by partitioning the keys
into term partitions and having the parsers write key-value
pairs for each term partition into a separate segment file. In
Figure 4.5, the term partitions are according to first letter:
a-f, g-p, and q-z, so there are three partitions. (We chose these
key ranges for ease of exposition. In general, key ranges
need not correspond to contiguous terms or termIDs.) The
term partitions are defined by the
person who operates the indexing system
(Exercise 4.6). The parsers then
write corresponding segment files, one for each
term partition. Each term partition thus corresponds to
as many segment files as there are parsers. For
instance,
Figure 4.5 shows three a-f segment files of the a-f
partition, corresponding to the three parsers shown in the figure.
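The sketch below shows how a parser might route its (term, docID) pairs into one local segment file per term partition. The file-naming scheme and the first-letter rule are illustrative only; in general, key ranges or hashes chosen by the operator would be used.

```python
# Sketch of routing (term, docID) pairs into one local segment file per term
# partition (file names and the first-letter rule are illustrative).
def partition_of(term: str) -> str:
    c = term[0].lower()
    if "a" <= c <= "f":
        return "a-f"
    if "g" <= c <= "p":
        return "g-p"
    return "q-z"

def write_segment_files(pairs, parser_id: int):
    files = {}
    for term, docID in pairs:
        part = partition_of(term)
        if part not in files:
            files[part] = open(f"segment.{part}.parser{parser_id}", "w")
        files[part].write(f"{term}\t{docID}\n")   # one key-value pair per line
    for f in files.values():
        f.close()
```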
Collecting all values (here: docIDs) for a given key (here:
termID) into one list is the task of the
inverters in the reduce phase. The master assigns each term partition to a
different inverter - and, as in the case of parsers,
reassigns term partitions in case of failing or slow inverters.
Each term partition (corresponding to one segment file on
each parser) is processed by one inverter.
We assume here that segment files are of a size that a
single machine can handle (Exercise 4.6). Finally, the
list of values is sorted for each key and written to the
final sorted postings list ("postings" in the figure).
(Note that postings in Figure 4.6 include term frequencies, whereas
each posting in the other sections of this chapter is simply
a
docID without term frequency information.)
The data flow is shown for a-f in Figure 4.5.
This completes the construction of the inverted index.
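A minimal sketch of one inverter's reduce task is given below, assuming the tab-separated segment-file format used in the parser sketch above; the function and path names are placeholders.

```python
# Sketch of an inverter's reduce task for one term partition. It reads that
# partition's segment file from every parser, collects the docIDs for each
# term, sorts them, and writes the final postings lists.
from collections import defaultdict

def invert_partition(segment_paths, postings_path):
    postings = defaultdict(list)
    for path in segment_paths:                  # one segment file per parser
        with open(path) as f:
            for line in f:
                term, docID = line.rstrip("\n").split("\t")
                postings[term].append(docID)
    with open(postings_path, "w") as out:
        for term in sorted(postings):           # terms in sorted order
            docIDs = ",".join(sorted(postings[term]))   # sorted postings list
            out.write(f"{term}\t{docIDs}\n")
```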
Parsers and inverters are not separate sets of machines. The master identifies idle machines and assigns tasks to them. The same machine can be a parser in the map phase and an inverter in the reduce phase. And there are often other jobs that run in parallel with index construction, so in between being a parser and an inverter a machine might do some crawling or another unrelated task.
To minimize
write times before inverters reduce the data, each parser writes
its segment files to its local disk. In the reduce phase, the master communicates
to an inverter the locations of
the relevant segment files (e.g., of the segment files
of the a-f partition).
Each segment file
only requires one sequential read because all data
relevant to a particular inverter were written to a single
segment file by the parser. This setup minimizes
the amount of network traffic needed during indexing.
Figure 4.6: Map and reduce functions in MapReduce. In general, the map function produces a list of key-value pairs. All values for a key are collected into one list in the reduce phase. This list is then processed further. The instantiations of the two functions and an example are shown for index construction. Because the map phase processes documents in a distributed fashion, termID-docID pairs need not be ordered correctly initially as in this example. The example shows terms instead of termIDs for better readability. We abbreviate Caesar as C and conquered as c'ed.
Figure 4.6 shows the general schema of the MapReduce functions. Input and output are often lists of key-value pairs themselves, so that several MapReduce jobs can run in sequence. In fact, this was the design of the Google indexing system in 2004. What we describe in this section corresponds to only one of five to ten MapReduce operations in that indexing system. Another MapReduce operation transforms the term-partitioned index we just created into a document-partitioned one.
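Written out schematically (a paraphrase of the schema in Figure 4.6, not its exact notation), the two functions and their instantiation for indexing look roughly as follows.

```python
# Schematic form of the two MapReduce functions (paraphrasing Figure 4.6).
#
# General schema:
#   map:    input                     -> list of (key, value) pairs
#   reduce: (key, list of values)     -> output
#
# Instantiation for index construction:
#   map:    collection of documents   -> list of (termID, docID) pairs
#   reduce: (termID, list of docIDs)  -> postings list
def index_map(docID, text):
    return [(term, docID) for term in text.split()]

def index_reduce(termID, docIDs):
    return sorted(docIDs)
```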
MapReduce offers a robust and conceptually simple framework for implementing index construction in a distributed environment. By providing a semiautomatic method for splitting index construction into smaller tasks, it can scale to almost arbitrarily large collections, given computer clusters of sufficient size.
Exercises.