next up previous contents index
Next: Dictionary as a string Up: Index compression Previous: Zipf's law: Modeling the   Contents   Index


Dictionary compression

This section presents a series of dictionary data structures that achieve increasingly higher compression ratios. The dictionary is small compared with the postings file as suggested by Table 5.1 . So why compress it if it is responsible for only a small percentage of the overall space requirements of the IR system?

One of the primary factors in determining the response time of an IR system is the number of disk seeks necessary to process a query. If parts of the dictionary are on disk, then many more disk seeks are necessary in query evaluation. Thus, the main goal of compressing the dictionary is to fit it in main memory, or at least a large portion of it, to support high query throughput. Although dictionaries of very large collections fit into the memory of a standard desktop machine, this is not true of many other application scenarios. For example, an enterprise search server for a large corporation may have to index a multiterabyte collection with a comparatively large vocabulary because of the presence of documents in many different languages. We also want to be able to design search systems for limited hardware such as mobile phones and onboard computers. Other reasons for wanting to conserve memory are fast startup time and having to share resources with other applications. The search system on your PC must get along with the memory-hogging word processing suite you are using at the same time.

Figure 5.3: Storing the dictionary as an array of fixed-width entries.
\begin{figure}\begin{tabular}{@{}l\vert lp{2cm}p{2cm}\vert}\cline{2-4}
&term & d...
...0} bytes & 4 bytes &
\multicolumn{1}{l}{4 bytes}
\\
\end{tabular}
\end{figure}



Subsections
next up previous contents index
Next: Dictionary as a string Up: Index compression Previous: Zipf's law: Modeling the   Contents   Index
© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.
2009-04-07