Heaps' law: Estimating the number of terms

Next: Zipf's law: Modeling the Up: Statistical properties of terms Previous: Statistical properties of terms Contents Index

Heaps' law: Estimating the number of terms

$\includegraphics[width=9cm]{art/heapsmall.eps}$ Heaps' law.Vocabulary size

as a function of collection size

(number of tokens) for Reuters-RCV1. For these data, the dashed line $\log_{10} M = 0.49*\log_{10} T + 1.64$ is the best least-squares fit. Thus, $k = 10^{1.64} \approx 44$ and

A better way of getting a handle on is Heaps' law , which estimates vocabulary size as a function of collection size:

$\begin{displaymath} M = k T^b \end{displaymath}$

(1)

where

is the number of tokens in the collection. Typical values for the parameters

and

are: $30 \leq k \leq 100$ and $b \approx 0.5$ . The motivation for Heaps' law is that the simplest possible relationship between collection size and vocabulary size is linear in log-log space and the assumption of linearity is usually born out in practice as shown in Figure 5.1 for Reuters-RCV1. In this case, the fit is excellent for $T>10^5=100{,}000$ , for the parameter values

and

. For example, for the first 1,000,020 tokens Heaps' law predicts 38,323 terms:

$\begin{displaymath} 44 \times 1{,}000{,}020^{0.49} \approx 38{,}323. \end{displaymath}$

(2)

The actual number is 38,365 terms, very close to the prediction.

The parameter is quite variable because vocabulary growth depends a lot on the nature of the collection and how it is processed. Case-folding and stemming reduce the growth rate of the vocabulary, whereas including numbers and spelling errors increase it. Regardless of the values of the parameters for a particular collection, Heaps' law suggests that (i) the dictionary size continues to increase with more documents in the collection, rather than a maximum vocabulary size being reached, and (ii) the size of the dictionary is quite large for large collections. These two hypotheses have been empirically shown to be true of large text collections (Section 5.4 ). So dictionary compression is important for an effective information retrieval system.

Next: Zipf's law: Modeling the Up: Statistical properties of terms Previous: Statistical properties of terms Contents Index

© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.
2009-04-07