

Statistical properties of terms in information retrieval


                (distinct) terms               non-positional postings        tokens (= number of position entries in postings)
                number       $\Delta\%$  T%    number       $\Delta\%$  T%    number       $\Delta\%$  T%
unfiltered      484,494                        109,971,179                    197,879,290
no numbers      473,723      -2          -2    100,680,242  -8          -8    179,158,204  -9          -9
case folding    391,523      -17         -19   96,969,056   -3          -12   179,158,204  -0          -9
30 stop words   391,493      -0          -19   83,390,443   -14         -24   121,857,825  -31         -38
150 stop words  391,373      -0          -19   67,001,847   -30         -39   94,516,599   -47         -52
stemming        322,383      -17         -33   63,812,300   -4          -42   94,516,599   -0          -52

Table 5.1: The effect of preprocessing on the number of terms, non-positional postings, and tokens for RCV1. ``$\Delta\%$'' indicates the reduction in size from the previous line, except that ``30 stop words'' and ``150 stop words'' both use ``case folding'' as their reference line. ``T%'' is the cumulative (``total'') reduction from unfiltered. We performed stemming with the Porter stemmer (Chapter 2, Section 2.2.4).

As in the last chapter, we will use Reuters-RCV1 as our model collection (see Table 4.2 in Section 4.2). We give some term and postings statistics for the collection in Table 5.1. ``$\Delta\%$'' indicates the reduction in size from the previous line, except that the two stop word lines are both measured against ``case folding''. ``T%'' is the cumulative reduction from unfiltered.
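The two percentages can be recomputed from the raw counts. The sketch below does this for the term counts only; the names `ROWS` and `reductions` are illustrative and not from the book. It assumes the printed values are truncated toward zero, which reproduces the term column. Cells in the other columns may differ by one point depending on whether a value was rounded or truncated.

```python
# Term counts (column 2 of the table), each paired with the reference row
# used for its delta. Both stop word rows refer back to "case folding".
ROWS = [
    ("unfiltered", 484_494, None),
    ("no numbers", 473_723, "unfiltered"),
    ("case folding", 391_523, "no numbers"),
    ("30 stop words", 391_493, "case folding"),
    ("150 stop words", 391_373, "case folding"),
    ("stemming", 322_383, "150 stop words"),
]


def reductions(rows):
    """Map each row name to (delta_percent, total_percent) as floats."""
    counts = {name: count for name, count, _ in rows}
    baseline = rows[0][1]  # the unfiltered count
    result = {}
    for name, count, reference in rows[1:]:
        previous = counts[reference]
        delta = 100.0 * (count - previous) / previous
        total = 100.0 * (count - baseline) / baseline
        result[name] = (delta, total)
    return result


if __name__ == "__main__":
    for name, (delta, total) in reductions(ROWS).items():
        # int() truncates toward zero (an assumption about how the table was printed).
        print(f"{name:15s} {int(delta):4d} {int(total):4d}")
```

Swapping in the postings or token counts from the other columns gives the corresponding percentages.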

The table shows the number of terms for different levels of preprocessing (column 2). The number of terms is the main factor in determining the size of the dictionary. The number of non-positional postings (column 3) is an indicator of the expected size of the non-positional index of the collection. The expected size of a positional index is related to the number of positions it must encode (column 4).
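To make the three statistics concrete, here is a minimal sketch that counts them on a toy collection. The function `collection_stats` and the two sample documents are hypothetical. Tokenization is plain whitespace splitting, and the stop list is assumed to be lowercase.

```python
def collection_stats(docs, fold_case=False, stop_words=frozenset()):
    """Return (distinct terms, non-positional postings, tokens).

    docs is a list of documents, each a list of token strings.
    """
    vocabulary = set()
    postings = 0
    tokens = 0
    for doc in docs:
        kept = [t.lower() if fold_case else t for t in doc]
        kept = [t for t in kept if t not in stop_words]
        tokens += len(kept)          # one position entry per remaining token
        distinct = set(kept)
        postings += len(distinct)    # one posting per (term, document) pair
        vocabulary |= distinct
    return len(vocabulary), postings, tokens


docs = [
    "The cat saw the dog".split(),
    "the Dog saw The bird".split(),
]

print(collection_stats(docs))                                      # unfiltered
print(collection_stats(docs, fold_case=True))                      # case folding
print(collection_stats(docs, fold_case=True, stop_words={"the"}))  # plus stop list
```

On this toy input, case folding lowers the term and postings counts but leaves the token count unchanged, which mirrors the ``$-$0'' entry for tokens in the case folding line of the table.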

In general, the statistics in Table 5.1 show that preprocessing substantially affects the size of the dictionary and the number of non-positional postings. Stemming and case folding reduce the number of (distinct) terms by 17% each and the number of non-positional postings by 4% and 3%, respectively. The treatment of the most frequent words is also important. The rule of 30 states that the 30 most common words account for 30% of the tokens in written text (31% in the table). Eliminating the 150 most common words from indexing (as stop words; cf. Section 2.2.2) will cut 25-30% of the non-positional postings. But, while a stop list of 150 words reduces the number of postings by a quarter or more, this size reduction does not carry over to the size of the compressed index. As we will see later in this chapter, the postings lists of frequent words require only a few bits per posting after compression.
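A claim like the rule of 30 can be checked on any tokenized text by measuring what fraction of all tokens the $k$ most frequent words account for. The sketch below is one way to do so; `top_k_token_share` is an illustrative name, and the sample sentence is a toy input far too small to show the 30% figure.

```python
from collections import Counter


def top_k_token_share(tokens, k):
    """Fraction of all tokens that belong to the k most frequent words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(k))
    return covered / total


sample = "the cat and the dog and the bird".split()
print(top_k_token_share(sample, 1))  # share held by the single most frequent word
print(top_k_token_share(sample, 2))  # share held by the two most frequent words
```

Run on a case-folded token stream of a large collection with `k=30`, the result would be the quantity that the table reports as the token reduction for the 30 stop words line.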

The deltas in the table are in a range typical of large collections. Note, however, that the percentage reductions can be very different for some text collections. For example, for a collection of web pages with a high proportion of French text, a lemmatizer for French would reduce vocabulary size much more than the Porter stemmer does for an English-only collection since French is a morphologically richer language than English.

The compression techniques we describe in the remainder of this chapter are lossless, that is, all information is preserved. Better compression ratios can be achieved with lossy compression, which discards some information. Case folding, stemming and stop word elimination are forms of lossy compression. Similarly, the vector space model (Chapter 6) and dimensionality reduction techniques like latent semantic indexing (Chapter 18) create compact data structures from which we cannot fully restore the original collection. Lossy compression makes sense when the ``lost'' information is unlikely ever to be used by the search system. For example, web search is characterized by a large number of documents, short queries, and users who only look at the first few pages of results. As a consequence, we can discard postings of documents that would only be used for hits far down the list. Thus, there are retrieval scenarios where lossy methods can be used for compression without any reduction in effectiveness.

Before introducing techniques for compressing the dictionary, we want to estimate the number of terms $M$ in a collection. It is sometimes said that languages have a vocabulary of a certain size. The second edition of the Oxford English Dictionary (OED) defines more than 600,000 words. But the vocabulary of most large collections is much larger than that of the OED. The OED does not include most names of people, locations, products and scientific entities like genes. These names need to be included in the inverted index so that users can search for them.



© 2008 Cambridge University Press
2008-06-01