next up previous contents index
Next: Blocked storage Up: Dictionary compression Previous: Dictionary compression   Contents   Index

Dictionary-as-a-string

The simplest data structure for the dictionary is to sort the vocabulary lexicographically and store it in an array of fixed-width entries as shown in Figure 5.3 . Assuming a Unicode representation, we allocate $2\times 20$ We allocate 20 bytes for the term itself (since few terms have more than 20 characters in English), 4 bytes for its document frequency and 4 bytes for the pointer to its postings list. 4-byte pointers resolve a 4 GB address space. For large collections like the web, we need to allocate more bytes per pointer. We look up terms in the array by binary search (see Chapter 3 ). For Reuters-RCV1, we need $M \times (2 \times
20+4+4) = 400{,}000 \times 48 = 19.2 \mbox{ MB}$ $M \times (20+4+4) = 400{,}000 \times 28 = 11.2 \mbox{ MB}$ for storing the dictionary in this scheme.

\begin{figure}
% latex2html id marker 6160
\psset{unit=1.2cm}
\begin{pspicture}(...
...example
are \term{systile}, \term{syzygetic}
and \term{syzygial}.
}
\end{figure}

Using fixed-width entries for terms is clearly wasteful. The average length of a term in English is about 8 characters icompresstb1, so on average we are wasting 12 characters (or 24 bytes) in the fixed-width scheme. Also, we have no way of storing terms with more than 20 characters like hydrochlorofluorocarbons and supercalifragilisticexpialidocious. We can overcome these shortcomings by storing the dictionary terms as one long string of characters, as shown in Figure 5.4 . The pointer to the next term is also used to demarcate the end of the current term. As before, we locate terms in the data structure by way of binary search in the (now smaller) table. This scheme saves us 60% compared with fixed-width storage - 24 bytes on average of the 40 bytes 12 bytes on average of the 20 bytes we allocated for terms before. However, we now also need to store term pointers. The term pointers resolve $400{,}000 \times 8 = 3.2 \times 10^6$ positions, so they need to be $ \log_2 3.2 \times 10^6 \approx 22$ bits or 3 bytes long.

In this new scheme, we need $400{,}000 \times (4+4+3+\unicode{2\times}{} 8) = \unicode{10.8}{7.6} \ \mbox{MB}$ for the Reuters-RCV1 dictionary: 4 bytes each for frequency and postings pointer, 3 bytes for the term pointer, and $\unicode{2\times}{} 8$ bytes on average for the term. So we have reduced the space requirements by one third from 19.211.2 MB to 10.87.6 MB.

\begin{figure}
% latex2html id marker 6276
\psset{unit=1.2cm}
\begin{pspicture}(...
...that
indicates how many bytes
to skip to reach subsequent terms.
}
\end{figure}


next up previous contents index
Next: Blocked storage Up: Dictionary compression Previous: Dictionary compression   Contents   Index
© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.
2008-06-01