By increasing the block size k, we get better compression. However, there is a tradeoff between compression and the speed of term lookup. For the eight-term dictionary in Figure 5.6, steps in binary search are shown as double lines and steps in list search as simple lines. We search for terms in the uncompressed dictionary by binary search (a). In the compressed dictionary, we first locate the term's block by binary search and then its position within the list by linear search through the block (b). Searching the uncompressed dictionary in (a) takes on average (0+1+2+3+2+1+2+2)/8 ≈ 1.6 steps, assuming each term is equally likely to come up in a query. For example, finding the two terms, aid and box, takes three and two steps, respectively. With blocks of size k = 4 in (b), we need (0+1+2+3+4+1+2+3)/8 = 2 steps on average, about 25% more. For example, finding den takes one binary search step and two steps through the block. By increasing k, we can get the size of the compressed dictionary arbitrarily close to the minimum of 400,000 × (4 + 4 + 1 + 8) bytes = 6.8 MB, but term lookup becomes prohibitively slow for large values of k.
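To make the lookup procedure concrete, here is a minimal Python sketch of a blocked dictionary-as-a-string, under simplifying assumptions: one length character per term, one stored offset per block, and a lookup that only reports presence rather than returning a frequency and postings pointer. The class name BlockedDictionary and its layout are illustrative, not the exact structure used in the experiments.

```python
import bisect


class BlockedDictionary:
    """Illustrative blocked dictionary-as-a-string (simplified layout).

    Terms are concatenated into one string, each prefixed by its length.
    Only one offset is kept per block of k terms; the first term of each
    block is also kept separately to drive the binary search.
    """

    def __init__(self, sorted_terms, k=4):
        self.k = k
        self.block_offsets = []   # offset of each block in the term string
        self.block_leads = []     # first term of each block (for binary search)
        parts = []
        offset = 0
        for i, term in enumerate(sorted_terms):
            if i % k == 0:
                self.block_offsets.append(offset)
                self.block_leads.append(term)
            # Length-prefixed term (length stored as a single character here).
            parts.append(chr(len(term)) + term)
            offset += 1 + len(term)
        self.string = "".join(parts)

    def lookup(self, term):
        """Return True if term is in the dictionary."""
        # Binary search over the first terms of the blocks ...
        b = bisect.bisect_right(self.block_leads, term) - 1
        if b < 0:
            return False
        # ... then linear search through the (at most k) terms of that block.
        pos = self.block_offsets[b]
        for _ in range(self.k):
            if pos >= len(self.string):
                return False
            length = ord(self.string[pos])
            candidate = self.string[pos + 1 : pos + 1 + length]
            if candidate == term:
                return True
            pos += 1 + length
        return False


terms = ["aid", "box", "den", "ex", "job", "ox", "pit", "win"]
d = BlockedDictionary(terms, k=4)
print(d.lookup("den"), d.lookup("dog"))   # True False
```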
One source of redundancy in the dictionary we have not exploited yet is the fact that consecutive entries in an alphabetically sorted list share common prefixes. This observation leads to front coding (Figure 5.7). A common prefix is identified for a subsequence of the term list and then referred to with a special character. In the case of Reuters, front coding saves another 1.2 MB, as we found in an experiment.
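A minimal sketch of the idea follows, assuming a simplified encoding that stores, for each term after the first, the length of the prefix it shares with its predecessor together with the remaining suffix; it does not reproduce the exact byte layout of Figure 5.7, and the function names are illustrative.

```python
def front_encode(sorted_terms):
    """Front-code a sorted list of terms: each term after the first is stored
    as (length of prefix shared with the previous term, remaining suffix)."""
    encoded = []
    prev = ""
    for term in sorted_terms:
        # Length of the common prefix with the previous term.
        p = 0
        while p < min(len(prev), len(term)) and prev[p] == term[p]:
            p += 1
        encoded.append((p, term[p:]))
        prev = term
    return encoded


def front_decode(encoded):
    """Invert front_encode."""
    terms = []
    prev = ""
    for p, suffix in encoded:
        term = prev[:p] + suffix
        terms.append(term)
        prev = term
    return terms


block = ["automata", "automate", "automatic", "automation"]
enc = front_encode(block)
print(enc)                           # [(0, 'automata'), (7, 'e'), (7, 'ic'), (8, 'on')]
print(front_decode(enc) == block)    # True
```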
Other schemes with even greater compression rely on minimal perfect hashing, that is, a hash function that maps M terms onto [1, ..., M] without collisions. However, we cannot adapt perfect hashes incrementally because each new term causes a collision and therefore requires the creation of a new perfect hash function. Therefore, they cannot be used in a dynamic environment.
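To see why incremental updates fail, here is a toy brute-force construction that searches for a hash seed mapping M terms onto 0, ..., M-1 without collisions. Real minimal perfect hashing algorithms are far more sophisticated; this sketch only illustrates that adding a single term invalidates the function and forces a complete rebuild.

```python
import hashlib


def _h(term, seed, m):
    # Simple seeded hash into [0, m-1] (illustrative only).
    digest = hashlib.sha256(f"{seed}:{term}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % m


def build_minimal_perfect_hash(terms, max_seed=1_000_000):
    """Brute-force search for a seed that maps the M terms onto 0..M-1
    without collisions. Feasible only for very small term sets."""
    m = len(terms)
    for seed in range(max_seed):
        slots = {_h(t, seed, m) for t in terms}
        if len(slots) == m:          # bijection onto 0..M-1 found
            return seed
    raise RuntimeError("no perfect seed found within the search limit")


terms = ["aid", "box", "den", "ex"]
seed = build_minimal_perfect_hash(terms)
print({t: _h(t, seed, len(terms)) for t in terms})

# Adding a single new term almost certainly breaks the bijection,
# so the whole function has to be rebuilt from scratch.
terms.append("job")
seed = build_minimal_perfect_hash(terms)
print({t: _h(t, seed, len(terms)) for t in terms})
```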
Even with the best compression scheme, it may not be feasible to store the entire dictionary in main memory for very large text collections and for hardware with limited memory. If we have to partition the dictionary onto pages that are stored on disk, then we can index the first term of each page using a B-tree. For processing most queries, the search system has to go to disk anyway to fetch the postings. One additional seek for retrieving the term's dictionary page from disk is a significant but tolerable increase in the time it takes to process a query.
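The following sketch simulates such a paged dictionary: an in-memory sorted list of each page's first term stands in for the B-tree, and a counter stands in for the extra disk seek. All names and parameters (e.g., page_size) are illustrative assumptions.

```python
import bisect


class PagedDictionary:
    """Sketch of a dictionary partitioned into fixed-size pages 'on disk'.
    An in-memory index of each page's first term stands in for the B-tree;
    looking up a term costs one (simulated) page fetch."""

    def __init__(self, sorted_terms, page_size=128):
        self.pages = [sorted_terms[i:i + page_size]
                      for i in range(0, len(sorted_terms), page_size)]
        self.first_terms = [page[0] for page in self.pages]  # kept in memory
        self.page_fetches = 0

    def _fetch_page(self, page_no):
        # In a real system this is the extra disk seek; here we just count it.
        self.page_fetches += 1
        return self.pages[page_no]

    def lookup(self, term):
        p = bisect.bisect_right(self.first_terms, term) - 1
        if p < 0:
            return False
        page = self._fetch_page(p)
        i = bisect.bisect_left(page, term)
        return i < len(page) and page[i] == term


words = sorted(["aid", "box", "den", "ex", "job", "ox", "pit", "win"])
d = PagedDictionary(words, page_size=3)
print(d.lookup("ox"), d.lookup("cat"), d.page_fetches)   # True False 2
```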
data structure | size in MB
--- | ---
dictionary, fixed-width | 11.2
dictionary, term pointers into string | 7.6
∼, with blocking, k = 4 | 7.1
∼, with blocking & front coding | 5.9
Exercises.