[Figure 5.6: Search of the uncompressed dictionary (a) and a dictionary compressed by blocking with $k = 4$ (b).]
By increasing the block size $k$, we get better compression.
However, there is a tradeoff between compression and the speed of term lookup. For the eight-term dictionary in
Figure 5.6, steps in binary search are shown as
double lines and steps in list search as simple lines. We
search for terms in the uncompressed dictionary by binary
search (a). In the compressed dictionary, we first locate
the term's block by binary search and then its position
within the list by linear search through the block (b).
Searching the uncompressed dictionary in (a) takes on average $(0+1+2+3+2+1+2+2)/8 \approx 1.6$ steps, assuming each term is equally likely to come up in a query. For
example, finding the two terms, aid and
box, takes three and two steps, respectively. With blocks
of size $k = 4$ in (b), we need $(0+1+2+3+4+1+2+3)/8 = 2$ steps on average, $\approx 25\%$ more. For example, finding
den takes one binary search step and two steps
through the block. By increasing $k$, we can get the size of the compressed dictionary arbitrarily close to the minimum of $400{,}000 \times (4+4+1+8) = 6.8$ MB, but term lookup becomes prohibitively slow for large values of $k$.
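The two-stage lookup is easy to see in code. The sketch below is ours, not the book's reference implementation; names such as `build_blocked_dict` are illustrative, and per-term document frequencies and postings pointers are omitted to keep the focus on the string layout. It packs length-prefixed terms into a single string with one pointer per block, then answers a lookup by binary search over block pointers followed by a linear scan of at most $k$ terms:

```python
K = 4  # block size

def build_blocked_dict(sorted_terms, k=K):
    """Pack sorted terms into a single string; each term is stored as a
    one-byte length followed by its characters (terms < 256 chars), and
    only the first term of each block gets a pointer."""
    buf, block_ptrs, pos = [], [], 0
    for i, term in enumerate(sorted_terms):
        if i % k == 0:
            block_ptrs.append(pos)
        entry = chr(len(term)) + term
        buf.append(entry)
        pos += len(entry)
    return "".join(buf), block_ptrs

def read_term(s, ptr):
    """Decode the length-prefixed term starting at position ptr."""
    n = ord(s[ptr])
    return s[ptr + 1 : ptr + 1 + n], ptr + 1 + n

def lookup(s, block_ptrs, query, k=K):
    """Stage 1: binary search over block pointers for the rightmost block
    whose first term is <= query. Stage 2: linear scan through the (at
    most k) terms inside that block."""
    lo, hi = 0, len(block_ptrs) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if read_term(s, block_ptrs[mid])[0] <= query:
            lo = mid
        else:
            hi = mid - 1
    ptr = block_ptrs[lo]
    for _ in range(k):
        if ptr >= len(s):
            break
        term, ptr = read_term(s, ptr)
        if term == query:
            return True
    return False

terms = ["aid", "box", "den", "ex", "job", "ox", "pit", "win"]
s, ptrs = build_blocked_dict(terms)
print(lookup(s, ptrs, "den"), lookup(s, ptrs, "cat"))  # True False
```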
One source of redundancy in the dictionary we have not exploited yet is the fact that consecutive entries in an alphabetically sorted list share common prefixes. This observation leads to front coding (Figure 5.7). A common prefix is identified for a subsequence of the term list and then referred to with a special character. In the case of Reuters, front coding saves another 1.2 MB, as we found in an experiment.
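The notation of Figure 5.7 translates directly into a small encoder/decoder. The following sketch is an illustration under our own helper names, using `*` to mark the end of the shared prefix in the first term and `◊` as the prefix placeholder for the remaining terms, as in the figure:

```python
import os
import re

MARK = "\u25ca"   # the '◊' placeholder for the shared prefix

def front_code_block(terms):
    """Encode one block of sorted terms: the first term is written in full
    (preceded by its length) with '*' marking the end of the shared prefix;
    each later term keeps only its suffix, preceded by the suffix length
    and the '◊' marker."""
    prefix = os.path.commonprefix(terms)
    parts = [f"{len(terms[0])}{prefix}*{terms[0][len(prefix):]}"]
    for t in terms[1:]:
        suffix = t[len(prefix):]
        parts.append(f"{len(suffix)}{MARK}{suffix}")
    return "".join(parts)

def decode_block(enc):
    """Invert front_code_block."""
    m = re.match(r"\d+", enc)
    n, i = int(m.group()), m.end()
    head = enc[i : i + n + 1]             # first term plus the '*' marker
    prefix, _, rest = head.partition("*")
    terms, i = [prefix + rest], i + n + 1
    while i < len(enc):
        m = re.match(rf"(\d+){MARK}", enc[i:])
        k, i = int(m.group(1)), i + m.end()
        terms.append(prefix + enc[i : i + k])
        i += k
    return terms

block = ["automata", "automate", "automatic", "automation"]
enc = front_code_block(block)
print(enc)                    # 8automat*a1◊e2◊ic3◊ion
assert decode_block(enc) == block
```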
Other schemes with even greater compression rely on minimal perfect hashing, that is, a hash function that maps $M$ terms onto $[1, \ldots, M]$ without collisions. However, we cannot adapt perfect hashes incrementally because each new term causes a collision and therefore requires the creation of a new perfect hash function. Therefore, they cannot be used in a dynamic environment.
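To see the defining property and the rebuild problem concretely, here is a toy sketch. It is not a practical construction (real minimal perfect hashing algorithms are far more sophisticated): it simply brute-forces a seed under which $M$ terms map onto $M$ slots without collisions, then shows that an $(M+1)$st term must collide:

```python
import hashlib

def h(term, seed):
    """Deterministic seeded hash; a stand-in for the hash family a real
    minimal perfect hashing construction would search over."""
    digest = hashlib.sha1(f"{seed}:{term}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def find_perfect_seed(terms, max_tries=1_000_000):
    """Brute-force a seed under which the M terms map onto 0..M-1 with
    no collisions, i.e. a minimal perfect hash. Feasible only for tiny M."""
    m = len(terms)
    for seed in range(max_tries):
        if len({h(t, seed) % m for t in terms}) == m:
            return seed
    raise ValueError("no perfect seed found")

terms = ["aid", "box", "den", "ex", "job", "ox", "pit", "win"]
seed = find_perfect_seed(terms)
print({t: h(t, seed) % len(terms) for t in terms})  # 8 terms, 8 distinct slots

# A ninth term must collide (pigeonhole: 9 terms, 8 slots), so the whole
# hash function has to be rebuilt -- the dynamic-environment problem above.
slots = {h(t, seed) % 8 for t in terms + ["zoo"]}
print(len(slots) < 9)  # True: at least one collision
```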
Even with the best compression scheme, it may not be feasible to store the entire dictionary in main memory for very large text collections and for hardware with limited memory. If we have to partition the dictionary onto pages that are stored on disk, then we can index the first term of each page using a B-tree. For processing most queries, the search system has to go to disk anyway to fetch the postings. One additional seek for retrieving the term's dictionary page from disk is a significant but tolerable increase in the time it takes to process a query.
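As a sketch of this design, the class below keeps only the first term of each page in memory; a sorted list with binary search stands in for the B-tree, the page size is an arbitrary illustrative value, and `_fetch_page` simulates the single additional disk seek:

```python
import bisect

PAGE_SIZE = 4  # terms per page; real systems size pages to disk blocks

class PagedDictionary:
    """Only the first term of each on-disk page is kept in memory (the
    role the B-tree plays in the text); every lookup then costs at most
    one page fetch."""

    def __init__(self, sorted_terms):
        self.pages = [sorted_terms[i : i + PAGE_SIZE]
                      for i in range(0, len(sorted_terms), PAGE_SIZE)]
        self.first_terms = [p[0] for p in self.pages]  # in-memory index

    def _fetch_page(self, page_no):
        # Placeholder for the single additional disk seek described above.
        return self.pages[page_no]

    def lookup(self, term):
        # Find the last page whose first term is <= the query term.
        i = bisect.bisect_right(self.first_terms, term) - 1
        if i < 0:
            return False
        return term in self._fetch_page(i)

d = PagedDictionary(["aid", "box", "den", "ex", "job", "ox", "pit", "win"])
print(d.lookup("ox"))    # True  (one simulated page fetch)
print(d.lookup("cat"))   # False
```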
Table 5.2: Dictionary compression for Reuters-RCV1.

| data structure | size in MB |
| --- | --- |
| dictionary, fixed-width | 11.2 |
| dictionary, term pointers into string | 7.6 |
| ∼, with blocking, $k = 4$ | 7.1 |
| ∼, with blocking & front coding | 5.9 |
Exercises.