Using fixed-width entries for terms is clearly wasteful. The
average length of a term in English is about 8
characters icompresstb1, so on average we are wasting 12 characters
(or
24 bytes)
in
the fixed-width scheme. Also, we have no way of storing terms with
more than 20 characters like
hydrochlorofluorocarbons
and
supercalifragilisticexpialidocious.
We can
overcome these
shortcomings
by storing the dictionary terms as one long string
of characters, as shown in Figure 5.4 . The
pointer to the next term is also used to demarcate the end
of the current term. As before, we locate terms in the data
structure by way of binary search in the (now smaller) table. This scheme saves us 60%
compared with fixed-width storage -
24 bytes on average of
the 40 bytes
12 bytes on average of
the 20 bytes
we allocated for terms before. However, we now
also need to store term pointers. The term
pointers resolve
positions, so they
need to be
bits or 3 bytes long.
In this new scheme,
we need
for
the Reuters-RCV1 dictionary:
4 bytes each for frequency and postings
pointer, 3 bytes for the term pointer, and
bytes on
average for the term.
So we have reduced the space requirements
by one third from 19.211.2 MB to 10.87.6 MB.