**Tom Griffiths
gruffydd@psych.stanford.edu**

Estimation of sparse multinomial distributions is an important component of many statistical learning tasks, and is particularly relevant to natural language processing. From their origins in [#!shannon51!#], statistical approaches to natural language typically require knowledge of the probability of a symbol or word conditioned upon some context. These probabilities are usually estimated from limited data, and require some form of smoothing before they can be used in a particular task. Consequently, a number of approaches to smoothing multinomial distributions have been suggested, typically combining statistical notions with heuristic strategies [#!cheng98!#]. In this paper, I present a generalization of a probabilistically correct approach to parameter estimation for sparse multinomial distributions. This new approach takes advantage of specific knowledge that we might have about a particular domain, such as natural language.

Perhaps the simplest and most elegant method of estimating a multinomial distribution is a generalization of an approach originally taken by Laplace. In the following discussion, I follow the presentation in [#!friedmans98!#]. For a language containing distinct symbols, a multinomial distribution is specified by a parameter vector , where is the probability of an observation being symbol . Consequently, we have the constraints that and . The task of multinomial estimation is to take a data set and produce a vector that results in a good approximation to the distribution that produced . In this case, a data set consists of independent observations drawn from the distribution to be estimated, which can be summarised by the statistics specifying the number of times the th symbol occurs in the data. also specifies the set of symbols that occur at least once in the training data.

Stated in this way, the task of multinomial estimation can be framed as
one of predicting the next observation based on the data. Specifically,
we wish to calculate . The Bayesian estimate for this
probability is given by

where is given by the multinomial distribution corresponding to . The posterior probability can be obtained via Bayes rule

where is the prior probability of a given .

The above approach to estimating the parameters of a multinomial
distribution was first exploited by Laplace [#!laplace95!#], who took a
uniform prior
over to give the famous ``law of succession'':
. A more general approach is to
assume a Dirichlet prior over , which is conjugate to the
multinomial distribution and gives

The simple Bayesian method outlined above is appropriate for the general task of multinomial estimation, but generally provides poor results when used for smoothing of sparse multinomial distributions. This is primarily a consequence of the erroneous assumption that all categories should be considered as possible values for . In actuality, sparse multinomial distributions are characterized by the fact that only a few symbols actually occur. In such cases, applying the above method will give too much probability to symbols that never occur and consequently give a poor estimate of the true distribution.

In order to extend the Bayesian approach to sparse multinomial
distributions, several authors have used the notion of maintaining
uncertainty over the vocabulary from which observations are produced as
well as their probabilities. In [#!ristad95!#], Ristad uses assumptions
about the probability of strings based upon different vocabularies to give
the estimate

where is the size of the smallest vocabulary that can accommodate the data, corresponding to the cardinality of . Ristad shows that this method performs well in a number of different contexts, consistently making better predictions than Laplace's law and its cognates.

A more explicitly Bayesian approach is taken by Friedman and Singer in
[#!friedmans98!#], who also point out that Ristad's method can be
considered a special case of the framework that they present. Friedman and
Singer's approach considers the vocabulary
to be a
random variable, allowing them to write

where is given by assuming a Dirichlet () prior over the symbols in . Thus we have

Friedman and Singer go on to define a hierarchical prior over , such that all vocabularies of cardinality are given the same probability, namely , where is the probability that the size of the vocabulary () is . It follows that if , . If , it is necessary to estimate the proportion of that contain for a given . The simplified result is

where

with . Friedman and Singer also point out that this distribution remains well-defined in the case where the alphabet is unbounded, and show that their method gives good performance in a simple character-prediction task.

Friedman and Singer's approach assumes a prior that gives equal probability to all vocabularies of a given cardinality. This assumption aids in obtaining an efficient specification for . In many real-world tasks, we have at least some knowledge about the structure of the task that we might like to build into our methods for parameter estimation. One example of such knowledge might be the expectation that the symbols used by a sparse multinomial distribution will come from one of a few restricted vocabularies which we can pre-specify. For example, in predicting the next character in a file, our predictions might be facilitated by considering the fact that most files either use a vocabulary consisting of just the ASCII printing characters (such as text files), or all possible characters (such as object files). In such a case, giving equal prior probability to all vocabularies of a given cardinality may have negative consequences.

In the case where our knowledge about a domain leads us to specify some
known set of vocabularies
, we can write

where specifies whether we are considering the distribution over implied by our knowledge of the domain, which will be restricted to , or the knowledge-free distribution over used by Friedman and Singer. Consequently,

(3) |

which is a mixture between two distributions - the first being that produced by Friedman and Singer's approach, and the second being produced by the same approach applied to a restricted set of hypotheses. From [#!friedmans98!#] (Equation 10) it follows that

and, from the properties of Dirichlet priors,

Finally, is given in equation . Thus we need only define the priors , the parameter , and to fully specify the distribution for a given . While and should be set differently for specific domains, I use uniform priors in each of the empirical investigations discussed below.

The intuition behind this approach is that it adds to Friedman and Singer's method a second process that attempts to classify the target distribution as one of a number of known distributions, and uses the posterior probability of these distributions for full Bayesian smoothing. However, rather than using the potentially fragile approach of classifying the data based upon the distribution over symbols, the model attempts to classify the data in terms of a set of known vocabularies. Applying standard Bayesian multinomial estimation within each of these vocabularies gives sufficient flexibility for the method to capture a range of distributions, while allowing prior knowledge to play an important role in informing the results.

Text compressiom provides an effective test of methods for multinomial estimation. One approach to adaptive coding involves specifying a method for calculating a distribution over the probability of the next byte in a file based upon the preceding bytes [#!clearywb90!#]. The extent to which the file can be compressed will depend upon the quality of the resulting predictions. This method was explicitly used by Ristad to assess his approach to multinomial estimation, and implicitly used by Friedman and Singer.

To illustrate the utility of including prior knowledge in multinomial
estimation, I will follow Ristad in examining the performance of these
various methods on the Calgary text compression corpus
[#!clearywb90!#]. This corpus consists of 19 files of several different
types, each containing some subset of the 256 possible characters in some
order (
). The files include BibTeX source (`bib`), formatted English text
(`book1`, `book2`,
`paper1`, `paper2`, `paper3`, `paper4`, `paper5`,
`paper6`), geological data (`geo`), newsgroup postings (` news`), a bit-mapped monochrome picture (`pic`), programs in three
different languages (`progc`, `progl`, `progp`)
and a terminal transcript (`trans`). The task was to repeatedly
estimate the multinomial distribution from which characters in the file
were drawn based upon the first characters, and use this
distribution to predict the st character. Performance was measured
in terms of the length of the resulting file, where the contribution the
st character makes to the length is given by
. The
final file length is thus the accumulation of a single term in the cross
entropy of the distribution a method produces for predicting each
successive character, and provides a good measure of the quality of the
estimator for a range of values of .

Text compression is a domain in which files are likely to fall into one of a small number of categories, so giving extra weight to specific vocabularies can be of great utility. For the predictions of the extended Bayesian approach outlined above, the ``known'' vocabularies corresponded to the characters that occurred in a separate set of files each containing between 0.5 and 2 megabytes of BibTeX source, English text, C code, LISP code, and newsgroup postings. The resulting vocabularies identified 100, 92, 100, 157, and 102 specific characters as belonging to documents of a particular type. Finally, a vocabulary was added to cover files that use all 256 characters. Together, these six vocabularies specified . was set to , as was done by Friedman and Singer in their experiments with characters. The text compression results are shown in Table 1.

file |
size |
||||||||

bib |
111261 | 81 | 72330 | 79 | 70 |
122 | 92 |
269 | 174 |

book1 |
768771 | 82 | 435043 | 160 | 151 | 137 | 116 |
352 | 219 |

book2 |
610856 | 96 | 365952 | 96 | 87 |
167 | 124 |
329 | 212 |

geo |
102400 | 256 | 72274 | 173 | 172 | 279 | 165 | 165 | 161 |

news |
377109 | 98 | 244633 | 96 | 86 |
159 | 116 |
304 | 201 |

obj1 |
21504 | 256 | 15989 | 132 | 136 | 284 | 129 | 129 | 126 |

obj2 |
246814 | 256 | 193144 | 197 | 190 | 333 | 190 | 189 | 182 |

paper1 |
53161 | 95 | 33113 | 75 | 66 |
137 | 100 |
236 | 156 |

paper2 |
82199 | 91 | 47280 | 74 | 65 |
133 | 105 |
259 | 167 |

paper3 |
46526 | 84 | 27132 | 69 | 61 |
118 | 92 |
238 | 154 |

paper4 |
13286 | 80 | 7806 | 60 | 51 |
104 | 79 |
190 | 126 |

paper5 |
11954 | 91 | 7376 | 61 | 53 |
119 | 83 |
181 | 122 |

paper6 |
38105 | 93 | 23861 | 72 | 63 |
131 | 95 |
223 | 149 |

pic |
513216 | 159 | 77636 | 138 | 149 | 325 | 216 | 323 | 205 |

progc |
39611 | 92 | 25743 | 73 | 65 |
131 | 91 |
222 | 150 |

progl |
71646 | 87 | 42720 | 59 |
65 | 150 | 97 | 253 | 164 |

progp |
49379 | 89 | 30052 | 72 | 64 |
131 | 94 |
236 | 155 |

trans |
93695 | 99 | 64800 | 135 | 127 | 145 | 105 |
252 | 169 |

Table 1: Prediction results on the Calgary corpus for six parameter estimation methods. Results are given in whole bytes relative to the empirical entropy which is taken from [#!ristad95!#]. is the predictions of the full Bayesian model outlined above. uses only the vocabularies in , while corresponds to the raw predictions of Friedman and Singer's method. is Ristad's law of natural succession, while and are Laplace's law and the Jeffreys-Perks law, following from Equation with abd respectively. and were taken from [#!ristad95!#], where Ristad also tested several other estimation methods. The results in italics are the best of those considered by Ristad in his paper, while the results in bold are the best overall.

It is instructive to examine the kinds of files on which the Bayesian
method outlined above () outperformed the other methods for
multinomial estimation. These files were `bib`, `book1`, `news`, `paper1`, `paper2`, `paper3`, `paper4`, `paper5`, `paper6`, `progc`, `progl`, and `progp`. In these
cases, only outperformed . These files all use a
restricted vocabulary of characters corresponding to those used in English
text, together with a small number of formatting characters. The high
performance in these cases was a result of the fact that three of the
vocabularies in
contain such characters - the Bibtex
source, C code, and newsgroup postings. The vocabulary corresponding to
plain English text was too restrictive for most of these files, and was
hardly used, while the vocabulary for Lisp programs was too large to be of
utility. Without this additional structure, Friedman and Singer's method
(
) tended to perform worse than Ristad's simpler
method (), although better than the basic Dirichlet smoothing methods
( and ) when the files used restricted vocabularies.

The results for `book1` illustrate an important weakness of the
approach outlined above. Here the file lengths for and
are higher than those for
and , despite the fact that the file in question features a
restricted English-based vocabulary. The reason for this is that the file
also contains two unusual characters that were not encountered in the data
used to construct any of the specific hypotheses. Upon
encountering these characters defaulted to the only
that contained those characters: the unrestricted vocabulary of
all 256 characters. From that point, the distribution for
corresponded to Equation with , and the
resulting smoothing was worse than for
and .

The kind of behaviour that was demonstrated in the compression of `book1` is undesirable - we don't want to reject one of our candidate
vocabularies on the basis of one or two symbols that are inconsistent with
that vocabulary. We can improve the robustness of our model by adding in a
noise process such that, regardless of vocabulary, any character in
can occur with some small probability. A direct mixture of
with a uniform noise process results in an intractable sum
over all of the ways of assigning observations to the smoothed
vocabulary and the noise process, but we can obtain a simple closed form
solution if we assume that the noise process and vocabulary are
mutually exclusive. Assuming that any symbol in occurs with
probability , we have

where . It follows that

which allows us to specify and as above.

The definition of will determine the upper bound on the probability mass assigned to the noise distribution. Specifically, this bound will be given by . This gives us a simple heuristic for setting : if the probability mass that we want to assign to noise is , then we take .

While text compression serves as an illustrative domain for the comparison of different parameter estimation techniques, there are a number of other contexts in which we might wish to estimate multinomial distributions about which we have good prior knowledge. One such context is statistical natural language processing, in which accurate multinomial estimation is the topic of much research [#!cheng98!#]. Typically, such multinomial distributions are over large vocabularies of words. Here, the notion of smoothing a multinomial estimate based upon classification of the vocabulary involved has a direct connection to the ideas driving the text classification literature (eg. [#!yang99!#]): in different contexts, words will occur with different probabilities. In particular, different vocabularies will be used, and having a good set of candidate vocabularies may facilitate smoothing. If it is possible to classify a document as using a particular vocabulary, then we can perform smooth the results we obtain appropriately.

A dataset containing a total of approximately 20,000 articles drawn from
20 different Usenet newsgroups was used to examine this idea. This dataset
was first used for text classification in [#!lang95!#], and has since
been a benchmark for text classification algorithms. Ten of these
newsgroups (`rec.autos`, `rec.sport.baseball`, `sci.crypt`,
`sci.med`, `talk.politics.misc`, `talk.religion.misc`, `misc.forsale`, `comp.sys.mac.hardware`, `comp.os.ms-windows.misc`,
`comp.graphics`) were used to estimate a set of vocabularies
. These vocabularies were then applied in forming multinomial
estimates for further data drawn from these ten newsgroups and ten others
(`alt.atheism`, `sci.space`, `rec.motorcycles`, `talk.politics.guns`, `comp.sys.ibm.pc.hardware`, `rec.sport.hockey`,
`talk.politics.mideast`, `sci.electronics`,
`comp.windows.x`).

The actual dataset used was `20news-18827`, which consists of the 20
newsgroup data with headers and duplicates removed. The dataset was
preprocessed to remove all punctuation and capitalization, as well as
converting every number to a single symbol. The articles in each of the 20
newsgroups were then divided into three sets. The first set contained the
first 500 articles, and this was used to build the candidate vocabularies
for the ten newsgroups described above. The second set
contained articles 501-700, and was used as training data for multinomial
estimation for all 20 newsgroups. The third set contained articles
701-900, and was used as testing data for all 20 newsgroups. A dictionary
was built up by running over the 13,000 articles resulting from this
division, and all words that occurred only once in this entire reduced
dataset were mapped to an ``unknown'' word. The resulting dictionary
contained words.

For the generalized Bayesian model discussed above, featured one vocabulary that contained all words in the dictionary, and 10 vocabularies estimated from each of the 10 newsgroups mentioned above. These 10 vocabularies were produced by thresholding word frequency in the 500 articles considered, with the threshold ranging from 1 to 10 instances. Each newsgroup thus provided a hierarchy of vocabularies representing a range of degrees of specificity.

Five methods of multinomial estimation were considered. Since the candidate vocabularies are simultaneously too general and too specific to give high probability to any set of observations, tends to make no contribution to . For this reason I evaluated and separately. I also considered Ristad's law (), Laplace's law () and the Jeffreys-Perks law (). For , the maximum probability mass assigned to noise was , while both and used , to facilitate comparison with the Jeffreys-Perks law.

Testing for each newsgroup consisted of taking words from the 200 articles assigned for training purposes, estimating a distribution using each method, and then computing the cross-entropy between that distribution and an empirical estimate of the true distribution. The cross , where is the true distribution and is the distribution produced by the estimation method. The estimate of corresponded to the maximum likelihood estimate formed from the word frequencies in all 200 articles assigned for testing purposes. The testing procedure was conducted with just 100 words, and then in increments of 450 words until 10000 words had been seen in total. The results are shown in Figure 1.

As can be seen in Figure 1, consistently outperforms the other
methods, even on newsgroups that did not contribute to
. Performance was worst for `sci.space`, `rec.sport.hockey`, and `talk.politics.mideast`. These newsgroups are
those that showed the least correspondence to those constituting
. Figure 2 shows the vocabularies from
that
had highest posterior probability at each point in the training
process. `sci.space` moves between `comp.graphics` and `talk.politics.misc`, although neither of these seem to be
appropriate. `rec.sport.hockey` defaults to the vocabulary containing
all words once the number of unknown words is easier to account for by
this vocabulary than by a noise process combined with the most general
vocabulary in `rec.sport.baseball`.

This defaulting behavior is an important aspect of : at the point where the data are best accounted for by smoothing on the whole dictionary, the model will use an unrestricted vocabulary. The resulting multinomial estimates will be the same as applying Equation with the appropriate setting of . In the present experiment, the parameter was set so that the default estimate would correspond to . This allows a direct evaluation of how much is being gained by considering restricted vocabularies.

The importance of defaulting is illustrated in Figure 3. After 10,000
words, is beginning to perform worse than on `sci.space`. However, if we continue to examine the predictions made by
as more words are added, we see that the model swiftly defaults
to using as soon as the penalty for the unknown words exceeds the
gains of the restricted vocabulary. This is valuable, since at this point
sufficient data have accumulated that gives a good estimate of the
target distribution. The intelligent choice of
is
important to these results. Experiments conducted with 100 vocabularies
generated at random showed very rapid defaulting - the random
vocabularies tended to be strongly inconsistent with the actual data.

The experiment presented above involved estimating only a single multinomial distribution over words. For tasks like the estimation of transition probabilities, it becomes necessary to maintain multiple such estimates. In these cases, the memory demands of an estimation method become an important concern. Implementing requires more memory than doing simple Bayesian smoothing, however the amount of memory required scales lineary with . Since standard Bayesian smoothing is equivalent to the case where , the resulting cost is not too extreme in most situations.

Efficient implementation of requires storing a list of the words that belong in each of the vocabularies, and a vector of the posterior probabilities of each . can then be evaluated for any given word by taking a weighted average of the probabilities assigned to that word by applying standard Bayesian smoothing (Lidstone's law) within that vocabulary. The time required to compute a probability will thus also increase linearly with , but when the number of candidate vocabularies is small the algorithm will remain efficient.

In this paper, I have presented a novel approach to Bayesian smoothing of sparse multinomial distributions. This approach follows the idea of maintaining uncertainty over restricted vocabularies by allowing the vocabularies themselves to be specified on the basis of domain-specific knowledge. I have argued that this approach has its most valuable applications in statistical natural language processing, where data is sparse but domain knowledge is extensive. The main utility of this approach is that if a set of basis vocabularies that span a wide range of contexts can be found, it may be possible to achieve rapid and accurate smoothing of multinomial distributions over words by classifying documents according to their vocabularies.