Naive Bayes text classification

In the multinomial NB model, the probability of a document $d$ being in class $c$ is computed as

$$P(c \mid d) \propto P(c) \prod_{1 \le k \le n_d} P(t_k \mid c) \qquad (113)$$

where $P(t_k \mid c)$ is the conditional probability of term $t_k$ occurring in a document of class $c$, and $\langle t_1, t_2, \ldots, t_{n_d} \rangle$ are the tokens of $d$ that are part of the vocabulary used for classification.

In text classification, our goal is to find the *best* class for the document. The best class in NB classification is the most likely or *maximum a posteriori* (*MAP*) class $c_{map}$:

$$c_{map} = \arg\max_{c \in \mathbb{C}} \hat{P}(c \mid d) = \arg\max_{c \in \mathbb{C}} \hat{P}(c) \prod_{1 \le k \le n_d} \hat{P}(t_k \mid c) \qquad (114)$$

We write $\hat{P}$ rather than $P$ because the true values of the parameters $P(c)$ and $P(t_k \mid c)$ are not known and must be estimated from the training set, as described below.

In Equation 114, many conditional probabilities are multiplied, one for each position $1 \le k \le n_d$.
This can result in a floating point underflow. It is therefore better to perform the computation by adding logarithms of probabilities instead of multiplying probabilities. The class with the highest log probability score is still the most probable because $\log(xy) = \log x + \log y$ and the logarithm function is monotonic. Hence, the maximization that is actually done in most implementations of NB is:

$$c_{map} = \arg\max_{c \in \mathbb{C}} \Big[ \log \hat{P}(c) + \sum_{1 \le k \le n_d} \log \hat{P}(t_k \mid c) \Big] \qquad (115)$$

Equation 115 has a simple interpretation. Each conditional parameter $\log \hat{P}(t_k \mid c)$ is a weight that indicates how good an indicator $t_k$ is for $c$. Similarly, the prior $\log \hat{P}(c)$ is a weight that indicates the relative frequency of $c$. More frequent classes are more likely to be the correct class than infrequent classes. The sum of log prior and term weights is then a measure of how much evidence there is for the document being in the class, and Equation 115 selects the class for which we have the most evidence.
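To make the log-space scoring of Equation 115 concrete, here is a minimal Python sketch. The function name, the dictionary-based parameter representation, and the decision to skip out-of-vocabulary tokens are illustrative assumptions, not the book's algorithm from Figure 13.2.

```python
def nb_log_score(doc_tokens, log_prior, log_cond_prob):
    """Pick the class maximizing log P(c) + sum_k log P(t_k|c) (Equation 115).

    log_prior: dict class -> log prior probability
    log_cond_prob: dict class -> dict term -> log conditional probability
    """
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for t in doc_tokens:
            # Tokens outside the vocabulary carry no weight here.
            if t in log_cond_prob[c]:
                score += log_cond_prob[c][t]
        scores[c] = score
    # Adding logs avoids the floating point underflow of multiplying many small probabilities.
    return max(scores, key=scores.get)
```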

We will initially work with this intuitive interpretation of the multinomial NB model and defer a formal derivation to Section 13.4.

How do we estimate the parameters $\hat{P}(c)$ and $\hat{P}(t_k \mid c)$? We first try the *maximum likelihood estimate* (MLE), which is simply the relative frequency and corresponds to the most likely value of each parameter given the training data. For the priors this estimate is:

$$\hat{P}(c) = \frac{N_c}{N} \qquad (116)$$

where $N_c$ is the number of documents in class $c$ and $N$ is the total number of documents.

We estimate the conditional probability $\hat{P}(t \mid c)$ as the relative frequency of term $t$ in documents belonging to class $c$:

$$\hat{P}(t \mid c) = \frac{T_{ct}}{\sum_{t' \in V} T_{ct'}} \qquad (117)$$

where $T_{ct}$ is the number of occurrences of $t$ in training documents from class $c$, including multiple occurrences of a term in a document.

The problem with the MLE estimate is that it is zero for a term-class combination that did not occur in the training data. If the term WTO occurred only in China documents in the training data, then the MLE estimates for the other classes, for example UK, will be zero:

$$\hat{P}(\text{WTO} \mid \text{UK}) = \frac{T_{\text{UK},\text{WTO}}}{\sum_{t' \in V} T_{\text{UK},t'}} = 0 \qquad (118)$$

A single zero of this kind makes the entire product in Equation 114 zero for UK, no matter how strongly the document's other terms indicate that class.

To eliminate zeros, we use *add-one* or *Laplace smoothing*, which simply adds one to each count (cf. Section 11.3.2):

$$\hat{P}(t \mid c) = \frac{T_{ct} + 1}{\sum_{t' \in V} (T_{ct'} + 1)} = \frac{T_{ct} + 1}{\big(\sum_{t' \in V} T_{ct'}\big) + B} \qquad (119)$$

where $B = |V|$ is the number of terms in the vocabulary.

We have now introduced all the elements we need for training and applying an NB classifier. The complete algorithm is described in Figure 13.2.
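Figure 13.2 itself is not reproduced here, so the following is a rough Python sketch of the training and application steps described above (MLE priors, add-one smoothed conditional probabilities, log-space scoring). The function names and the representation of the training set as (token list, class) pairs are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(training_docs):
    """training_docs: list of (tokens, class) pairs. Returns (vocabulary, prior, cond_prob)."""
    vocabulary = {t for tokens, _ in training_docs for t in tokens}
    n = len(training_docs)                                 # N: total number of documents
    docs_in_class = Counter(c for _, c in training_docs)   # N_c per class
    term_counts = defaultdict(Counter)                     # term_counts[c][t] = T_ct
    for tokens, c in training_docs:
        term_counts[c].update(tokens)

    prior = {c: docs_in_class[c] / n for c in docs_in_class}   # Equation 116
    b = len(vocabulary)                                    # B = |V| in Equation 119
    cond_prob = {}
    for c in docs_in_class:
        total = sum(term_counts[c].values())
        # Add-one (Laplace) smoothing, Equation 119
        cond_prob[c] = {t: (term_counts[c][t] + 1) / (total + b) for t in vocabulary}
    return vocabulary, prior, cond_prob

def apply_multinomial_nb(vocabulary, prior, cond_prob, doc_tokens):
    """Return the class maximizing the log-space score of Equation 115."""
    scores = {}
    for c in prior:
        scores[c] = math.log(prior[c])
        for t in doc_tokens:
            if t in vocabulary:
                scores[c] += math.log(cond_prob[c][t])
    return max(scores, key=scores.get)
```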

Table 13.1: Data for parameter estimation examples.

|              | docID | words in document                   | in $c$ = China? |
|--------------|-------|-------------------------------------|-----------------|
| training set | 1     | Chinese Beijing Chinese             | yes             |
|              | 2     | Chinese Chinese Shanghai            | yes             |
|              | 3     | Chinese Macao                       | yes             |
|              | 4     | Tokyo Japan Chinese                 | no              |
| test set     | 5     | Chinese Chinese Chinese Tokyo Japan | ?               |

**Worked example.**
For the example in Table 13.1, the multinomial parameters we need to classify the test document are the priors $\hat{P}(c) = 3/4$ and $\hat{P}(\overline{c}) = 1/4$ and the following conditional probabilities:

$$\hat{P}(\text{Chinese} \mid c) = (5+1)/(8+6) = 6/14 = 3/7$$
$$\hat{P}(\text{Tokyo} \mid c) = \hat{P}(\text{Japan} \mid c) = (0+1)/(8+6) = 1/14$$
$$\hat{P}(\text{Chinese} \mid \overline{c}) = (1+1)/(3+6) = 2/9$$
$$\hat{P}(\text{Tokyo} \mid \overline{c}) = \hat{P}(\text{Japan} \mid \overline{c}) = (1+1)/(3+6) = 2/9$$

The denominators are $(8+6)$ and $(3+6)$ because the lengths of the concatenated class texts $\textit{text}_c$ and $\textit{text}_{\overline{c}}$ are 8 and 3, respectively, and because the constant $B$ in Equation 119 is 6 as the vocabulary consists of six terms.

We then get:

$$\hat{P}(c \mid d_5) \propto 3/4 \cdot (3/7)^3 \cdot 1/14 \cdot 1/14 \approx 0.0003$$
$$\hat{P}(\overline{c} \mid d_5) \propto 1/4 \cdot (2/9)^3 \cdot 2/9 \cdot 2/9 \approx 0.0001$$

Thus, the classifier assigns the test document to $c$ = China. The reason for this classification decision is that the three occurrences of the positive indicator Chinese in $d_5$ outweigh the occurrences of the two negative indicators Japan and Tokyo.
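As an illustrative check, the sketch given after Figure 13.2 above reproduces these numbers for the data of Table 13.1 (the class labels "yes"/"no" stand in for $c$ and $\overline{c}$):

```python
# Data from Table 13.1; "yes"/"no" stand in for the classes c (China) and not-c.
training_docs = [
    ("Chinese Beijing Chinese".split(), "yes"),
    ("Chinese Chinese Shanghai".split(), "yes"),
    ("Chinese Macao".split(), "yes"),
    ("Tokyo Japan Chinese".split(), "no"),
]
test_doc = "Chinese Chinese Chinese Tokyo Japan".split()

vocabulary, prior, cond_prob = train_multinomial_nb(training_docs)
print(prior["yes"], prior["no"])        # 0.75, 0.25
print(cond_prob["yes"]["Chinese"])      # 6/14 = 3/7
print(cond_prob["no"]["Chinese"])       # 2/9
print(apply_multinomial_nb(vocabulary, prior, cond_prob, test_doc))  # "yes", i.e. China
```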

What is the time complexity of NB? The complexity of computing the parameters is $\Theta(|\mathbb{C}| |V|)$ because the set of parameters consists of $|\mathbb{C}| |V|$ conditional probabilities and $|\mathbb{C}|$ priors. The preprocessing necessary for computing the parameters (extracting the vocabulary, counting terms, etc.) can be done in one pass through the training data. The time complexity of this component is therefore $\Theta(|\mathbb{D}| L_{ave})$, where $|\mathbb{D}|$ is the number of documents and $L_{ave}$ is the average length of a document.

We use $\Theta(|\mathbb{D}| L_{ave})$ as a notation for $\Theta(T)$ here, where $T$ is the length of the training collection. This is nonstandard; $\Theta(\cdot)$ is not defined for an average. We prefer expressing the time complexity in terms of $|\mathbb{D}|$ and $L_{ave}$ because these are the primary statistics used to characterize training collections.

The time complexity of APPLYMULTINOMIALNB in Figure 13.2 is $\Theta(|\mathbb{C}| L_a)$, where $L_a$ and $M_a$ are the numbers of tokens and types, respectively, in the test document. APPLYMULTINOMIALNB can be modified to run in $\Theta(L_a + |\mathbb{C}| M_a)$ (Exercise 13.6). Finally, assuming that the length of test documents is bounded, $\Theta(L_a + |\mathbb{C}| M_a) = \Theta(|\mathbb{C}| M_a)$ because $L_a < b\,|\mathbb{C}| M_a$ for a fixed constant $b$.

Table 13.2 summarizes the time complexities. In general, we have $|\mathbb{C}| |V| < |\mathbb{D}| L_{ave}$, so both training and testing complexity are linear in the time it takes to scan the data. Because we have to look at the data at least once, NB can be said to have optimal time complexity. Its efficiency is one reason why NB is a popular text classification method.
