
Heaps' law: Estimating the number of terms

\includegraphics[width=9cm]{art/heapsmall.eps} Figure 5.1: Heaps' law. Vocabulary size $M$ as a function of collection size $T$ (number of tokens) for Reuters-RCV1. For these data, the dashed line $\log_{10} M = 0.49 \times \log_{10} T + 1.64$ is the best least-squares fit. Thus, $k = 10^{1.64} \approx 44$ and $b = 0.49$.

A better way of getting a handle on $M$ is Heaps' law, which estimates vocabulary size as a function of collection size:

\begin{displaymath}
M = k T^b
\end{displaymath} (1)

where $T$ is the number of tokens in the collection. Typical values for the parameters $k$ and $b$ are $30 \leq k \leq 100$ and $b
\approx 0.5$. The motivation for Heaps' law is that the simplest possible relationship between collection size and vocabulary size is linear in log-log space, and this assumption of linearity is usually borne out in practice, as shown in Figure 5.1 for Reuters-RCV1. In this case, the fit is excellent for $T > 10^5 = 100{,}000$, with parameter values $b = 0.49$ and $k = 44$. For example, for the first 1,000,020 tokens Heaps' law predicts 38,323 terms:
\begin{displaymath}
44 \times 1{,}000{,}020^{0.49} \approx 38{,}323.
\end{displaymath} (2)

The actual number is 38,365 terms, very close to the prediction.
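This prediction is simple to reproduce; a minimal sketch, using the fitted Reuters-RCV1 parameter values $k = 44$ and $b = 0.49$ from above:

```python
def heaps_vocabulary_size(T, k=44, b=0.49):
    """Predict vocabulary size M = k * T**b via Heaps' law.

    Default k and b are the least-squares fit for Reuters-RCV1."""
    return k * T ** b

# Prediction for the first 1,000,020 tokens of Reuters-RCV1;
# the observed vocabulary size at that point is 38,365 terms.
print(round(heaps_vocabulary_size(1_000_020)))
```
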

The parameter $k$ is quite variable because vocabulary growth depends a lot on the nature of the collection and how it is processed. Case-folding and stemming reduce the growth rate of the vocabulary, whereas including numbers and spelling errors increases it. Regardless of the values of the parameters for a particular collection, Heaps' law suggests that (i) the dictionary size continues to increase with more documents in the collection, rather than a maximum vocabulary size being reached, and (ii) the size of the dictionary is quite large for large collections. These two hypotheses have been empirically shown to be true of large text collections (Section 5.4 ). So dictionary compression is important for an effective information retrieval system.
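Because Heaps' law is linear in log-log space, the parameters for any collection can be estimated with an ordinary least-squares line fit on $(\log_{10} T, \log_{10} M)$ pairs, as was done for Figure 5.1. A minimal sketch, using the closed-form simple-regression formulas (the data points here are synthetic, generated from $k = 44$ and $b = 0.49$, so the fit recovers those values rather than the Reuters-RCV1 measurements):

```python
import math

def fit_heaps(points):
    """Least-squares fit of log10 M = b * log10 T + log10 k.

    points: iterable of (T, M) pairs; returns (k, b)."""
    xs = [math.log10(t) for t, m in points]
    ys = [math.log10(m) for t, m in points]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Closed-form simple linear regression: slope = cov(x, y) / var(x)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    k = 10 ** (my - b * mx)   # intercept in log space is log10 k
    return k, b

# Synthetic (T, M) pairs drawn exactly from M = 44 * T**0.49:
data = [(T, 44 * T ** 0.49) for T in (10**5, 10**6, 10**7, 10**8)]
k, b = fit_heaps(data)
print(k, b)  # recovers k close to 44 and b close to 0.49
```

On real collection data the points scatter around the line, and the same fit yields the best-fitting $k$ and $b$ rather than an exact recovery.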


© 2008 Cambridge University Press