
Finite automata and language models

Figure 12.1: A simple finite automaton and some of the strings in the language it generates. An arrow indicates the start state of the automaton and a double circle indicates a (possible) finishing state.

What do we mean by a document model generating a query? A traditional generative model of a language, of the kind familiar from formal language theory, can be used either to recognize or to generate strings. For example, the finite automaton shown in Figure 12.1 can generate strings that include the examples shown. The full set of strings that can be generated is called the language of the automaton.
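To make the generative view concrete, here is a minimal Python sketch (not from the book) that treats a finite automaton as a string generator. The two-state loop over the words I and wish is our own illustrative construction and need not match the automaton drawn in Figure 12.1 exactly; the point is simply that walking the transitions emits words, and the set of strings obtainable this way is the language of the automaton.

import random

# Transition table for a toy automaton: state -> list of (emitted word, next state).
# ACCEPTING lists the states in which generation is allowed to stop.
TRANSITIONS = {
    "q0": [("I", "q1")],
    "q1": [("wish", "q0")],
}
ACCEPTING = {"q0"}

def generate():
    """Walk the automaton from the start state q0, emitting one word per transition."""
    state, words = "q0", []
    while True:
        # In a finishing state we may stop (here, with probability 0.5) once
        # at least one word has been emitted.
        if state in ACCEPTING and words and random.random() < 0.5:
            return " ".join(words)
        word, state = random.choice(TRANSITIONS[state])
        words.append(word)

print(generate())   # e.g. "I wish" or "I wish I wish I wish"

This toy automaton can generate I wish, I wish I wish, and so on, but it always begins with I, so a string such as wish I wish is not in its language.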

Figure 12.2: A one-state finite automaton that acts as a unigram language model. We show a partial specification of the state emission probabilities.

If instead each node has a probability distribution over generating different terms, we have a language model. The notion of a language model is inherently probabilistic. A language model is a function that puts a probability measure over strings drawn from some vocabulary. That is, for a language model $M$ over an alphabet $\Sigma$:

\begin{displaymath}
\sum_{s \in \Sigma^{*}} P(s) = 1
\end{displaymath} (90)

One simple kind of language model is equivalent to a probabilistic finite automaton consisting of just a single node with a single probability distribution over producing different terms, so that $\sum_{t \in V} P(t) = 1$, as shown in Figure 12.2. After generating each word, we decide whether to stop or to loop around and then produce another word, and so the model also requires a probability of stopping in the finishing state. Such a model places a probability distribution over any sequence of words. By construction, it also provides a model for generating text according to its distribution.
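The following Python sketch (our own illustration, with made-up emission probabilities in the spirit of Figure 12.2 rather than the book's exact values) generates text from such a one-state model: a single distribution over the vocabulary, plus a fixed probability of continuing after each word.

import random

# Illustrative emission probabilities for the single state; "<other>" stands in
# for the rest of the vocabulary so that the distribution sums to 1.
EMISSION = {"the": 0.2, "a": 0.1, "frog": 0.01, "toad": 0.01,
            "said": 0.03, "likes": 0.02, "that": 0.04, "<other>": 0.59}
P_CONTINUE = 0.8   # so the STOP probability after each word is 0.2

def generate():
    """Emit words from the single state until the STOP transition is taken."""
    terms, weights = zip(*EMISSION.items())
    words = []
    while True:
        words.append(random.choices(terms, weights=weights)[0])
        if random.random() >= P_CONTINUE:   # stop with probability 1 - P_CONTINUE
            return " ".join(words)

print(generate())

Note that under this construction (which always emits at least one word) the emission probabilities sum to 1 and a string of $n$ words receives the continue/stop factor $0.8^{n-1} \times 0.2$, so the total probability mass over all strings is $\sum_{n \ge 1} 0.8^{n-1} \times 0.2 = 1$ and the model satisfies Equation 90.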

Worked example. To find the probability of a word sequence, we just multiply the probabilities which the model gives to each word in the sequence, together with the probability of continuing or stopping after producing each word. For example,

$\displaystyle P(\mbox{frog said that toad likes frog})$ $\textstyle =$ $\displaystyle (0.01 \times 0.03 \times 0.04 \times 0.01 \times 0.02 \times 0.01)$ (91)
    $\displaystyle {} \times\ (0.8 \times 0.8 \times 0.8 \times 0.8 \times 0.8 \times 0.2)$ (92)
  $\textstyle \approx$ $\displaystyle 0.000000000001573$ (93)

As you can see, the probability of a particular string/document is usually a very small number! Here we stopped after generating frog the second time. The first line of numbers gives the term emission probabilities, and the second line gives the probabilities of continuing or stopping after generating each word. An explicit stop probability is needed for a finite automaton to be a well-formed language model according to Equation 90. Nevertheless, most of the time we will omit the STOP and $(1-\mbox{\textsc{stop}})$ probabilities (as do most other authors). To compare two models for a data set, we can calculate their likelihood ratio, which results from simply dividing the probability of the data according to one model by the probability of the data according to the other model. Provided that the stop probability is fixed, its inclusion will not alter the likelihood ratio that results from comparing the likelihoods of two language models generating a string. Hence, it will not alter the ranking of documents. Nevertheless, formally, the numbers will no longer truly be probabilities, but only proportional to probabilities. See Exercise 12.1.3. End worked example.
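The arithmetic of this worked example can be checked with a few lines of Python; this is just a transcription of the calculation above, with the per-term probabilities hard-coded.

def sequence_probability(emission_probs, p_continue=0.8):
    """Multiply term emission probabilities with the continue/stop probabilities:
    continue after every word except the last, after which we stop."""
    p = 1.0
    n = len(emission_probs)
    for i, p_term in enumerate(emission_probs):
        p *= p_term
        p *= p_continue if i < n - 1 else (1.0 - p_continue)
    return p

# frog said that toad likes frog
print(sequence_probability([0.01, 0.03, 0.04, 0.01, 0.02, 0.01]))
# -> roughly 1.573e-12, i.e. 0.000000000001573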

Figure 12.3: Partial specification of two unigram language models.
\begin{figure}\begin{tabular}{\vert ll\vert ll\vert}
\hline
\multicolumn{2}{\vert c\vert}{Model $M_1$} & \multicolumn{2}{c\vert}{Model $M_2$} \\ \hline
frog & 0.01 & frog & 0.0002 \\
toad & 0.01 & toad & 0.0001 \\
said & 0.03 & said & 0.03 \\
likes & 0.02 & likes & 0.04 \\
that & 0.04 & that & 0.04 \\
dog & 0.005 & dog & 0.01 \\
\ldots & \ldots & \ldots & \ldots \\ \hline
\end{tabular}
\end{figure}

Worked example. Suppose, now, that we have two language models $M_1$ and $M_2$, shown partially in Figure 12.3. Each gives a probability estimate to a sequence of terms, as already illustrated in the first worked example. The language model that gives the higher probability to the sequence of terms is more likely to have generated the term sequence. This time, we will omit STOP probabilities from our calculations. For the sequence shown, we get:
\begin{example}
\begin{tabular}[t]{llllllll}
$s$ & frog & said & that & toad & likes & that & dog \\
$M_1$ & 0.01 & 0.03 & 0.04 & 0.01 & 0.02 & 0.04 & 0.005 \\
$M_2$ & 0.0002 & 0.03 & 0.04 & 0.0001 & 0.04 & 0.04 & 0.01 \\
\multicolumn{8}{l}{$P(s\vert M_1) = 0.00000000000048$} \\
\multicolumn{8}{l}{$P(s\vert M_2) = 0.000000000000000384$}
\end{tabular}\end{example}
and we see that $P(s\vert M_1) > P(s\vert M_2)$. We present the formulas here in terms of products of probabilities, but, as is common in probabilistic applications, in practice it is usually best to work with sums of log probabilities (cf. Section 13.2). End worked example.
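As an illustration of working with log probabilities, the short Python sketch below recomputes this comparison; the dictionaries hold only the term probabilities needed for the sequence, taken from the worked example, and STOP probabilities are again omitted.

import math

M1 = {"frog": 0.01, "said": 0.03, "that": 0.04, "toad": 0.01,
      "likes": 0.02, "dog": 0.005}
M2 = {"frog": 0.0002, "said": 0.03, "that": 0.04, "toad": 0.0001,
      "likes": 0.04, "dog": 0.01}

def log_prob(terms, model):
    """Sum of log term probabilities under a unigram model (STOP omitted)."""
    return sum(math.log(model[t]) for t in terms)

s = "frog said that toad likes that dog".split()
print(log_prob(s, M1), log_prob(s, M2))    # log P(s|M1) > log P(s|M2)
print(math.exp(log_prob(s, M1)))           # ~ 4.8e-13, as in the example
print(math.exp(log_prob(s, M2)))           # ~ 3.84e-16

Summing logs avoids the numerical underflow that multiplying many small probabilities can cause, while preserving the ranking of the models.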

