Estimation of sparse multinomial distributions is an important component of many statistical learning tasks, and is particularly relevant to natural language processing. From their origins in [shannon51], statistical approaches to natural language typically require knowledge of the probability of a symbol or word conditioned upon some context. These probabilities are usually estimated from limited data, and require some form of smoothing before they can be used in a particular task. Consequently, a number of approaches to smoothing multinomial distributions have been suggested, typically combining statistical notions with heuristic strategies [cheng98]. In this paper, I present a generalization of a probabilistically correct approach to parameter estimation for sparse multinomial distributions. This new approach takes advantage of specific knowledge that we might have about a particular domain, such as natural language.
Perhaps the simplest and most elegant method of estimating a multinomial distribution is a generalization of an approach originally taken by Laplace. In the following discussion, I follow the presentation in [friedmans98]. For a language containing $k$ distinct symbols, a multinomial distribution is specified by a parameter vector $\theta = (\theta_1, \ldots, \theta_k)$, where $\theta_i$ is the probability of an observation being symbol $i$. Consequently, we have the constraints that $\theta_i \geq 0$ and $\sum_{i=1}^{k} \theta_i = 1$. The task of multinomial estimation is to take a data set $D$ and produce a vector $\hat{\theta}$ that results in a good approximation to the distribution that produced $D$. In this case, a data set $D$ consists of $N$ independent observations $x_1, \ldots, x_N$ drawn from the distribution to be estimated, which can be summarised by the statistics $n_1, \ldots, n_k$, where $n_i$ specifies the number of times the $i$th symbol occurs in the data. $D$ also specifies the set of symbols $\Sigma^o$ that occur at least once in the training data; let $q = |\Sigma^o|$ denote the number of distinct symbols observed.
Stated in this way, the task of multinomial estimation can be framed as one of predicting the next observation based on the data. Specifically, we wish to calculate $P(x_{N+1} = i \mid D)$. The Bayesian estimate for this probability is given by
\[
P(x_{N+1} = i \mid D) = \int \theta_i \, p(\theta \mid D) \, d\theta,
\]
where $p(\theta \mid D) \propto P(D \mid \theta)\, p(\theta)$ is the posterior distribution over the parameters.
The above approach to estimating the parameters of a multinomial distribution was first exploited by Laplace [laplace95], who took a uniform prior over $\theta$ to give the famous ``law of succession'':
\[
P(x_{N+1} = i \mid D) = \frac{n_i + 1}{N + k}.
\]
A more general approach is to assume a Dirichlet prior over $\theta$, which is conjugate to the multinomial distribution and, for a symmetric Dirichlet with hyperparameter $\alpha$, gives
\[
P(x_{N+1} = i \mid D) = \frac{n_i + \alpha}{N + k\alpha}.
\]
This family corresponds to Lidstone's law: $\alpha = 1$ recovers Laplace's law, while $\alpha = 1/2$ gives the Jeffreys-Perks law.
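As a concrete illustration, the symmetric-Dirichlet estimator can be written in a few lines of Python. This is a minimal sketch; the function name is mine, chosen for exposition:

```python
import numpy as np

def dirichlet_predict(counts, alpha=1.0):
    """Posterior predictive P(x_{N+1} = i | D) under a symmetric
    Dirichlet(alpha) prior: (n_i + alpha) / (N + k * alpha)."""
    counts = np.asarray(counts, dtype=float)
    N, k = counts.sum(), counts.size
    return (counts + alpha) / (N + k * alpha)

# Laplace's law (alpha = 1) for a 4-symbol alphabet with counts (3, 1, 0, 0):
print(dirichlet_predict([3, 1, 0, 0]))  # [0.5, 0.25, 0.125, 0.125]
```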
The simple Bayesian method outlined above is appropriate for the general task of multinomial estimation, but generally provides poor results when used for smoothing sparse multinomial distributions. This is primarily a consequence of the erroneous assumption that all $k$ symbols should be considered as possible values for $x_{N+1}$. In actuality, sparse multinomial distributions are characterized by the fact that only a few symbols actually occur. In such cases, applying the above method will give too much probability to symbols that never occur, and consequently give a poor estimate of the true distribution.
In order to extend the Bayesian approach to sparse multinomial distributions, several authors have used the notion of maintaining uncertainty over the vocabulary from which observations are produced as well as over their probabilities. In [ristad95], Ristad uses assumptions about the probability of strings based upon different vocabularies to give his ``natural law of succession'':
\[
P(x_{N+1} = i \mid D) =
\begin{cases}
\dfrac{n_i + 1}{N + k} & q = k \\[2ex]
\dfrac{(n_i + 1)(N + 1 - q)}{N^2 + N + 2q} & q < k,\ n_i > 0 \\[2ex]
\dfrac{q(q + 1)}{(k - q)(N^2 + N + 2q)} & q < k,\ n_i = 0.
\end{cases}
\]
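The same law as a Python sketch, following the case analysis above (the helper name is illustrative):

```python
import numpy as np

def ristad_predict(counts):
    """Ristad's natural law of succession."""
    counts = np.asarray(counts, dtype=float)
    N, k = counts.sum(), counts.size
    q = np.count_nonzero(counts)      # number of distinct observed symbols
    if q == k:                        # every symbol observed: Laplace's law
        return (counts + 1.0) / (N + k)
    denom = N**2 + N + 2.0 * q
    return np.where(counts > 0,
                    (counts + 1.0) * (N + 1.0 - q) / denom,
                    q * (q + 1.0) / ((k - q) * denom))
```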
A more explicitly Bayesian approach is taken by Friedman and Singer in [friedmans98], who also point out that Ristad's method can be considered a special case of the framework that they present. Friedman and Singer's approach considers the vocabulary $V \subseteq \Sigma$ to be a random variable, allowing them to write
\[
P(x_{N+1} = i \mid D) = \sum_{V} P(x_{N+1} = i \mid V, D)\, P(V \mid D),
\]
where $P(x_{N+1} = i \mid V, D)$ is obtained by Dirichlet smoothing within $V$, and $P(V \mid D) \propto P(D \mid V)\, P(V)$.
Friedman and Singer's approach assumes a prior that gives equal probability to all vocabularies of a given cardinality. This assumption aids in obtaining an efficient form for the sum over vocabularies. In many real-world tasks, we have at least some knowledge about the structure of the task that we might like to build into our methods for parameter estimation. One example of such knowledge might be the expectation that the symbols used by a sparse multinomial distribution will come from one of a few restricted vocabularies which we can pre-specify. For example, in predicting the next character in a file, our predictions might be facilitated by considering the fact that most files either use a vocabulary consisting of just the ASCII printing characters (such as text files), or all possible characters (such as object files). In such a case, giving equal prior probability to all vocabularies of a given cardinality may have negative consequences.
In the case where our knowledge about a domain leads us to specify some known set of vocabularies $\mathcal{V} = \{V_1, \ldots, V_m\}$, we can write
\[
P_{\mathcal{V}}(x_{N+1} = i \mid D) = \sum_{j=1}^{m} P(x_{N+1} = i \mid V_j, D)\, P(V_j \mid D),
\]
where $P(V_j \mid D) \propto P(D \mid V_j)\, P(V_j)$ and, as before, $P(x_{N+1} = i \mid V_j, D)$ is obtained by Dirichlet smoothing within $V_j$.
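A minimal Python sketch of this mixture, assuming symmetric Dirichlet smoothing within each vocabulary and a uniform prior over $\mathcal{V}$; the marginal likelihood $P(D \mid V_j)$ follows from Dirichlet-multinomial conjugacy, and the function names are mine:

```python
import numpy as np
from scipy.special import gammaln

def log_marginal(counts, vocab_mask, alpha=0.5):
    """log P(D | V) under a symmetric Dirichlet(alpha) restricted to V.
    vocab_mask is a boolean array of length k. Returns -inf if the data
    contain symbols outside V."""
    counts = np.asarray(counts, dtype=float)
    if np.any(counts[~vocab_mask] > 0):
        return -np.inf
    n, s = counts[vocab_mask], vocab_mask.sum()
    return (gammaln(s * alpha) - gammaln(n.sum() + s * alpha)
            + np.sum(gammaln(n + alpha) - gammaln(alpha)))

def predict_PV(counts, vocab_masks, alpha=0.5):
    """P_V(x_{N+1} = i | D): within-vocabulary Dirichlet estimates,
    weighted by the posterior over the candidate vocabularies. Assumes
    at least one vocabulary contains every observed symbol (the noise
    process introduced later relaxes this)."""
    counts = np.asarray(counts, dtype=float)
    log_post = np.array([log_marginal(counts, v, alpha) for v in vocab_masks])
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    p = np.zeros_like(counts)
    for w, v in zip(post, vocab_masks):
        p += w * np.where(v, counts + alpha, 0.0) / (counts[v].sum() + v.sum() * alpha)
    return p
```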
The intuition behind this approach is that it adds to Friedman and Singer's method a second process that attempts to classify the target distribution as one of a number of known distributions, and uses the posterior probability of these distributions for full Bayesian smoothing. However, rather than using the potentially fragile approach of classifying the data based upon the distribution over symbols, the model attempts to classify the data in terms of a set of known vocabularies. Applying standard Bayesian multinomial estimation within each of these vocabularies gives sufficient flexibility for the method to capture a range of distributions, while allowing prior knowledge to play an important role in informing the results.
Text compression provides an effective test of methods for multinomial estimation. One approach to adaptive coding involves specifying a method for calculating a probability distribution over the next byte in a file based upon the preceding bytes [clearywb90]. The extent to which the file can be compressed will depend upon the quality of the resulting predictions. This method was explicitly used by Ristad to assess his approach to multinomial estimation, and implicitly used by Friedman and Singer.
To illustrate the utility of including prior knowledge in multinomial estimation, I will follow Ristad in examining the performance of these various methods on the Calgary text compression corpus [clearywb90]. This corpus consists of 19 files of several different types, each containing some subset of the 256 possible characters in some order ($k = 256$). The files include BibTeX source (bib), formatted English text (book1, book2, paper1, paper2, paper3, paper4, paper5, paper6), geological data (geo), newsgroup postings (news), a bit-mapped monochrome picture (pic), programs in three different languages (progc, progl, progp), and a terminal transcript (trans). The task was to repeatedly estimate the multinomial distribution from which characters in the file were drawn based upon the first $n$ characters, and use this distribution to predict the $(n+1)$st character. Performance was measured in terms of the length of the resulting file, where the contribution the $(n+1)$st character makes to the length is given by $-\log_2 P(x_{n+1} \mid x_1, \ldots, x_n)$. The final file length thus accumulates, for each successive character, a single term of the cross entropy under the distribution a method produces, and provides a good measure of the quality of the estimator for a range of values of $N$.
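This evaluation loop is easy to express in Python, reusing the estimators sketched above. Reading an actual Calgary file is shown commented out, since file locations are an assumption:

```python
import numpy as np

def adaptive_code_length(symbols, predictor, k=256):
    """Total code length, in bits, when each symbol is coded under the
    predictive distribution estimated from the preceding symbols."""
    counts = np.zeros(k)
    bits = 0.0
    for s in symbols:
        p = predictor(counts)   # e.g. ristad_predict or dirichlet_predict
        bits += -np.log2(p[s])  # -log2 P(x_{n+1} | x_1, ..., x_n)
        counts[s] += 1
    return bits

# data = np.frombuffer(open('bib', 'rb').read(), dtype=np.uint8)
# print(adaptive_code_length(data, ristad_predict))
```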
Text compression is a domain in which files are likely to fall into one of a small number of categories, so giving extra weight to specific vocabularies can be of great utility. For the predictions of the extended Bayesian approach outlined above, the ``known'' vocabularies corresponded to the characters that occurred in a separate set of files, each containing between 0.5 and 2 megabytes of BibTeX source, English text, C code, LISP code, and newsgroup postings. The resulting vocabularies identified 100, 92, 100, 157, and 102 specific characters as belonging to documents of a particular type. Finally, a vocabulary was added to cover files that use all 256 characters. Together, these six vocabularies specified $\mathcal{V}$. The hyperparameter $\alpha$ was set to the same value used by Friedman and Singer in their experiments with characters. The text compression results are shown in Table 1.
It is instructive to examine the kinds of files on which the Bayesian method outlined above ($P_{\mathcal{V}}$) outperformed the other methods for multinomial estimation. These files were bib, book1, news, paper1, paper2, paper3, paper4, paper5, paper6, progc, progl, and progp. In these cases, only $P_{\mathcal{V}}$ outperformed Ristad's law ($P_R$). These files all use a restricted vocabulary of characters corresponding to those used in English text, together with a small number of formatting characters. The high performance in these cases was a result of the fact that three of the vocabularies in $\mathcal{V}$ contain such characters: the BibTeX source, C code, and newsgroup postings. The vocabulary corresponding to plain English text was too restrictive for most of these files, and was hardly used, while the vocabulary for LISP programs was too large to be of utility. Without this additional structure, Friedman and Singer's method ($P_{FS}$) tended to perform worse than Ristad's simpler method, although better than the basic Dirichlet smoothing methods, Laplace's law ($P_L$) and the Jeffreys-Perks law ($P_{JP}$), when the files used restricted vocabularies.
The results for book1 illustrate an important weakness of the approach outlined above. Here the file lengths for $P_{\mathcal{V}}$ are higher than those for $P_{FS}$ and $P_R$, despite the fact that the file in question features a restricted English-based vocabulary. The reason for this is that the file also contains two unusual characters that were not encountered in the data used to construct any of the specific hypotheses. Upon encountering these characters, $P_{\mathcal{V}}$ defaulted to the only $V_j$ that contained them: the unrestricted vocabulary of all 256 characters. From that point, the distribution for $P_{\mathcal{V}}$ corresponded to simple Dirichlet smoothing over the full 256-character alphabet, and the resulting smoothing was worse than that of $P_{FS}$ and $P_R$.
The kind of behaviour that was demonstrated in the compression of book1 is undesirable: we don't want to reject one of our candidate vocabularies on the basis of one or two symbols that are inconsistent with that vocabulary. We can improve the robustness of our model by adding in a noise process such that, regardless of vocabulary, any character in $\Sigma$ can occur with some small probability. A direct mixture of $P_{\mathcal{V}}$ with a uniform noise process results in an intractable sum over all of the ways of assigning observations to the smoothed vocabulary and the noise process, but we can obtain a simple closed form solution if we assume that the noise process and vocabulary are mutually exclusive. Assuming that any symbol outside the vocabulary $V_j$ occurs with probability $\epsilon$, we have
\[
P(x_{N+1} = i \mid V_j, D) =
\begin{cases}
\bigl(1 - (k - |V_j|)\epsilon\bigr) \dfrac{n_i + \alpha}{N_j + |V_j|\alpha} & i \in V_j \\[2ex]
\epsilon & i \notin V_j,
\end{cases}
\]
where $N_j = \sum_{i \in V_j} n_i$ is the number of observations falling within $V_j$.
The definition of $\epsilon$ will determine the upper bound on the probability mass assigned to the noise distribution. Specifically, this bound will be given by $k\epsilon$. This gives us a simple heuristic for setting $\epsilon$: if the probability mass that we want to assign to noise is $\nu$, then we take $\epsilon = \nu / k$.
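A sketch of the noise-augmented model, extending predict_PV above. The closed form for $P(D \mid V_j)$ under the mutually exclusive noise process is my reconstruction of the assumptions just stated, and the parameter names are illustrative:

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_noisy(counts, vocab_mask, alpha=0.5, eps=1e-4):
    """log P(D | V_j) when symbols outside V_j are generated by a
    mutually exclusive noise process with per-symbol probability eps."""
    counts = np.asarray(counts, dtype=float)
    k = counts.size
    n_in, s = counts[vocab_mask], vocab_mask.sum()
    n_out = counts[~vocab_mask].sum()
    scale = 1.0 - (k - s) * eps           # mass left for the vocabulary
    return (n_out * np.log(eps) + n_in.sum() * np.log(scale)
            + gammaln(s * alpha) - gammaln(n_in.sum() + s * alpha)
            + np.sum(gammaln(n_in + alpha) - gammaln(alpha)))

def predict_noisy(counts, vocab_masks, alpha=0.5, nu=0.05):
    """As predict_PV, but no vocabulary is ever ruled out entirely:
    symbols outside V_j each receive probability eps = nu / k."""
    counts = np.asarray(counts, dtype=float)
    k = counts.size
    eps = nu / k                          # heuristic: noise mass bounded by nu
    log_post = np.array([log_marginal_noisy(counts, v, alpha, eps)
                         for v in vocab_masks])
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    p = np.zeros(k)
    for w, v in zip(post, vocab_masks):
        s = v.sum()
        scale = 1.0 - (k - s) * eps
        inside = scale * (counts + alpha) / (counts[v].sum() + s * alpha)
        p += w * np.where(v, inside, eps)
    return p
```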
While text compression serves as an illustrative domain for the comparison of different parameter estimation techniques, there are a number of other contexts in which we might wish to estimate multinomial distributions about which we have good prior knowledge. One such context is statistical natural language processing, in which accurate multinomial estimation is the topic of much research [cheng98]. Typically, such multinomial distributions are over large vocabularies of words. Here, the notion of smoothing a multinomial estimate based upon classification of the vocabulary involved has a direct connection to the ideas driving the text classification literature (e.g. [yang99]): in different contexts, words will occur with different probabilities. In particular, different vocabularies will be used, and having a good set of candidate vocabularies may facilitate smoothing. If it is possible to classify a document as using a particular vocabulary, then we can smooth the results we obtain appropriately.
A dataset containing a total of approximately 20,000 articles drawn from 20 different Usenet newsgroups was used to examine this idea. This dataset was first used for text classification in [lang95], and has since been a benchmark for text classification algorithms. Ten of these newsgroups (rec.autos, rec.sport.baseball, sci.crypt, sci.med, talk.politics.misc, talk.religion.misc, misc.forsale, comp.sys.mac.hardware, comp.os.ms-windows.misc, comp.graphics) were used to estimate a set of vocabularies $\mathcal{V}$. These vocabularies were then applied in forming multinomial estimates for further data drawn from these ten newsgroups and the ten others (alt.atheism, sci.space, rec.motorcycles, talk.politics.guns, comp.sys.ibm.pc.hardware, rec.sport.hockey, talk.politics.mideast, sci.electronics, comp.windows.x, soc.religion.christian).
The actual dataset used was 20news-18827, which consists of the 20 newsgroup data with headers and duplicates removed. The dataset was preprocessed to remove all punctuation and capitalization, and to convert every number to a single symbol. The articles in each of the 20 newsgroups were then divided into three sets. The first set contained the first 500 articles, and was used to build the candidate vocabularies for the ten newsgroups described above. The second set contained articles 501-700, and was used as training data for multinomial estimation for all 20 newsgroups. The third set contained articles 701-900, and was used as testing data for all 20 newsgroups. A dictionary was built by running over the 13,000 articles resulting from this division, and all words that occurred only once in this entire reduced dataset were mapped to a single ``unknown'' word. The resulting dictionary contained $k$ words, defining the symbol set over which the multinomial distributions were estimated.
For the generalized Bayesian model discussed above, $\mathcal{V}$ featured one vocabulary that contained all words in the dictionary, together with 10 vocabularies estimated from each of the 10 newsgroups mentioned above. These 10 vocabularies were produced by thresholding word frequency in the 500 articles considered, with the threshold ranging from 1 to 10 instances. Each newsgroup thus provided a hierarchy of vocabularies representing a range of degrees of specificity.
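The thresholding scheme is simple to implement; a sketch, treating each article as a list of word tokens:

```python
from collections import Counter

def build_vocabularies(articles, thresholds=range(1, 11)):
    """Build a nested family of vocabularies from one newsgroup's
    articles: vocabulary t contains every word seen at least t times."""
    freq = Counter(word for article in articles for word in article)
    return [{w for w, c in freq.items() if c >= t} for t in thresholds]
```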
Five methods of multinomial estimation were considered. Since the vocabularies of fixed cardinality considered by Friedman and Singer's method are simultaneously too general and too specific to give high probability to any set of observations, $P_{FS}$ tends to make no contribution to $P_{\mathcal{V}}$ when the two are combined. For this reason I evaluated $P_{\mathcal{V}}$ and $P_{FS}$ separately. I also considered Ristad's law ($P_R$), Laplace's law ($P_L$) and the Jeffreys-Perks law ($P_{JP}$). For $P_{\mathcal{V}}$, the maximum probability mass assigned to noise, $\nu$, was set to a small fixed value, while both $P_{\mathcal{V}}$ and $P_{FS}$ used $\alpha = 1/2$, to facilitate comparison with the Jeffreys-Perks law.
Testing for each newsgroup consisted of taking words from the 200 articles assigned for training purposes, estimating a distribution using each method, and then computing the cross entropy between that distribution and an empirical estimate of the true distribution. The cross entropy is
\[
H(p, \hat{p}) = -\sum_i p_i \log_2 \hat{p}_i,
\]
where $p$ is the true distribution and $\hat{p}$ is the distribution produced by the estimation method. The estimate of $p$ corresponded to the maximum likelihood estimate formed from the word frequencies in all 200 articles assigned for testing purposes. The testing procedure was conducted with just 100 words, and then in increments of 450 words until 10,000 words had been seen in total. The results are shown in Figure 1.
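This evaluation is a few lines of Python on top of the estimators above; the checkpoint values follow the schedule described in the text:

```python
import numpy as np

def cross_entropy(p_true, p_hat):
    """H(p, p_hat) = -sum_i p_i log2 p_hat_i over symbols with p_i > 0."""
    mask = p_true > 0
    return -np.sum(p_true[mask] * np.log2(p_hat[mask]))

def evaluate(train_words, test_counts, estimator,
             checkpoints=range(100, 10001, 450)):
    """Estimate a distribution from the first n training words and score
    it against the ML estimate from the test articles, for each n."""
    p_true = test_counts / test_counts.sum()   # ML estimate of p
    k = test_counts.size
    return [cross_entropy(p_true,
                          estimator(np.bincount(train_words[:n], minlength=k)))
            for n in checkpoints]
```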
As can be seen in Figure 1, $P_{\mathcal{V}}$ consistently outperforms the other methods, even on newsgroups that did not contribute to $\mathcal{V}$. Performance was worst for sci.space, rec.sport.hockey, and talk.politics.mideast. These newsgroups are those that showed the least correspondence to those constituting $\mathcal{V}$. Figure 2 shows the vocabularies from $\mathcal{V}$ that had the highest posterior probability at each point in the training process. sci.space moves between comp.graphics and talk.politics.misc, although neither of these seems to be appropriate. rec.sport.hockey defaults to the vocabulary containing all words once the number of unknown words becomes easier to account for by this vocabulary than by a noise process combined with the most general vocabulary from rec.sport.baseball.
This defaulting behavior is an important aspect of $P_{\mathcal{V}}$: at the point where the data are best accounted for by smoothing on the whole dictionary, the model will use an unrestricted vocabulary. The resulting multinomial estimates will be the same as applying simple Dirichlet smoothing with the appropriate setting of $\alpha$. In the present experiment, $\alpha$ was set so that the default estimate would correspond to the Jeffreys-Perks law ($P_{JP}$). This allows a direct evaluation of how much is being gained by considering restricted vocabularies.
The importance of defaulting is illustrated in Figure 3. After 10,000 words, $P_{\mathcal{V}}$ is beginning to perform worse than $P_{JP}$ on sci.space. However, if we continue to examine the predictions made by $P_{\mathcal{V}}$ as more words are added, we see that the model swiftly defaults to the unrestricted vocabulary as soon as the penalty for the unknown words exceeds the gains of the restricted vocabulary. This is valuable, since at this point sufficient data have accumulated that $P_{JP}$ gives a good estimate of the target distribution. The intelligent choice of $\mathcal{V}$ is important to these results: experiments conducted with 100 vocabularies generated at random showed very rapid defaulting, as the random vocabularies tended to be strongly inconsistent with the actual data.
The experiment presented above involved estimating only a single multinomial distribution over words. For tasks like the estimation of transition probabilities, it becomes necessary to maintain multiple such estimates. In these cases, the memory demands of an estimation method become an important concern. Implementing $P_{\mathcal{V}}$ requires more memory than simple Bayesian smoothing; however, the amount of memory required scales linearly with the number of vocabularies $m$. Since standard Bayesian smoothing is equivalent to the case where $m = 1$, the resulting cost is not too extreme in most situations.
Efficient implementation of $P_{\mathcal{V}}$ requires storing a list of the words that belong to each of the vocabularies, and a vector of the posterior probabilities of each $V_j$. $P_{\mathcal{V}}(x_{N+1} = i \mid D)$ can then be evaluated for any given word by taking a weighted average of the probabilities assigned to that word by applying standard Bayesian smoothing (Lidstone's law) within each vocabulary. The time required to compute a probability will thus also increase linearly with $m$, but when the number of candidate vocabularies is small the algorithm will remain efficient.
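A sketch of this incremental implementation, under the simplifying assumption of no noise process (a vocabulary that excludes an observed symbol is ruled out; the noise variant would assign it probability $\epsilon$ instead). The class and method names are mine:

```python
import numpy as np

class IncrementalPV:
    """Maintain P_V online: one count vector plus an (unnormalized) log
    posterior over the m candidate vocabularies, updated per observation."""

    def __init__(self, vocab_masks, alpha=0.5):
        self.masks = [np.asarray(v, dtype=bool) for v in vocab_masks]
        self.alpha = alpha
        self.counts = np.zeros(self.masks[0].size)
        self.log_post = np.zeros(len(self.masks))  # uniform prior over V_j

    def _lidstone(self, i, v):
        """Lidstone's law for symbol i within vocabulary v."""
        return (self.counts[i] + self.alpha) / (
            self.counts[v].sum() + v.sum() * self.alpha)

    def predict(self, i):
        """Posterior-weighted average of within-vocabulary estimates."""
        post = np.exp(self.log_post - self.log_post.max())
        post /= post.sum()
        return sum(w * self._lidstone(i, v)
                   for w, v in zip(post, self.masks) if v[i])

    def update(self, i):
        """Observe symbol i: each vocabulary's log posterior grows by the
        log probability it assigned to i."""
        for j, v in enumerate(self.masks):
            self.log_post[j] += (np.log(self._lidstone(i, v))
                                 if v[i] else -np.inf)
        self.counts[i] += 1
```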
In this paper, I have presented a novel approach to Bayesian smoothing of sparse multinomial distributions. This approach follows the idea of maintaining uncertainty over restricted vocabularies, extending it by allowing the vocabularies themselves to be specified on the basis of domain-specific knowledge. I have argued that this approach has its most valuable applications in statistical natural language processing, where data are sparse but domain knowledge is extensive. The main utility of this approach is that if a set of basis vocabularies that span a wide range of contexts can be found, it may be possible to achieve rapid and accurate smoothing of multinomial distributions over words by classifying documents according to their vocabularies.