

A variant of the multinomial model

An alternative formalization of the multinomial model represents each document $d$ as an $M$-dimensional vector of counts $\langle \termf_{t_1,d},\ldots,\termf_{t_M,d} \rangle$, where $\termf_{t_i,d}$ is the term frequency of $t_i$ in $d$. $P(d\vert\tcjclass)$ is then computed as follows (cf. Equation 99, Section 12.2.1):
\begin{displaymath}
P(d\vert\tcjclass) = P(\langle \termf_{t_1,d},\ldots,\termf_{t_M,d} \rangle\vert\tcjclass) = \prod_{1 \leq i \leq M} P(X=t_i\vert c)^{\termf_{t_i,d}}
\end{displaymath} (129)

Note that we have omitted the multinomial factor, which depends only on the term frequencies in $d$ and not on the class; see Equation 99 (page 99).

Equation 129 is equivalent to the sequence model in Equation 113 for two reasons. First, $P(X=t_i\vert c)^{\termf_{t_i,d}}=1$ for terms that do not occur in $d$ ($\termf_{t_i,d}=0$), so these terms do not change the product. Second, a term that occurs $\termf_{t_i,d} \geq 1$ times contributes $\termf_{t_i,d}$ factors in both Equation 113 and Equation 129.
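
The equivalence can be checked numerically. The following is a minimal sketch, not code from the book: the three-term vocabulary, the probability values in `cond_prob`, and the function names are all hypothetical and chosen only for illustration. It computes the class-conditional document probability once with one factor per token position (the sequence view) and once with one factor per vocabulary term raised to its term frequency (the count-vector view).

```python
import math
from collections import Counter

# Hypothetical values of P(X = t | c) for a toy vocabulary (illustrative only).
cond_prob = {"london": 0.5, "england": 0.3, "ontario": 0.2}

def sequence_model(tokens, cond_prob):
    """Multiply one factor per token position in the document."""
    return math.prod(cond_prob[t] for t in tokens)

def count_vector_model(tokens, cond_prob):
    """Multiply one factor per vocabulary term, raised to its term frequency."""
    tf = Counter(tokens)  # terms absent from the document get count 0
    # p ** 0 == 1, so absent terms leave the product unchanged.
    return math.prod(p ** tf[t] for t, p in cond_prob.items())

doc = ["london", "ontario", "london", "england"]
p_seq = sequence_model(doc, cond_prob)
p_cnt = count_vector_model(doc, cond_prob)
print(math.isclose(p_seq, p_cnt))
```

Both functions multiply the same factors, only grouped differently, so the two results agree up to floating-point rounding. In practice one would sum log probabilities instead of multiplying, to avoid underflow on long documents.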


Table 13.5: A set of documents for which the Naive Bayes independence assumptions are problematic.
(1) He moved from London, Ontario, to London, England.
(2) He moved from London, England, to London, Ontario.
(3) He moved from England to London, Ontario.
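
One way to see why these documents are problematic is to compare their count vectors. The sketch below is an illustration under stated assumptions, not part of the book: the tokenization (lowercasing and keeping alphabetic runs) and the helper name `term_frequencies` are hypothetical choices. It shows that documents (1) and (2), which describe moves in opposite directions, have identical term-frequency vectors, while document (3) contains the same set of terms but with different counts.

```python
import re
from collections import Counter

docs = {
    1: "He moved from London, Ontario, to London, England.",
    2: "He moved from London, England, to London, Ontario.",
    3: "He moved from England to London, Ontario.",
}

def term_frequencies(text):
    """Assumed tokenization: lowercase the text and keep alphabetic runs."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

tf = {doc_id: term_frequencies(text) for doc_id, text in docs.items()}

same_counts_1_2 = tf[1] == tf[2]           # identical count vectors
same_counts_1_3 = tf[1] == tf[3]           # "london" occurs twice vs. once
same_terms_1_3 = set(tf[1]) == set(tf[3])  # identical sets of terms
print(same_counts_1_2, same_counts_1_3, same_terms_1_3)
```

Because the count vector discards word order, any model that sees only these counts must assign documents (1) and (2) the same probability under every class.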


Exercises.


© 2008 Cambridge University Press
2008-06-01