What do we mean by a document model generating a query? A traditional
*generative model* of a language, of the kind familiar from formal
language theory, can be used either to recognize or to generate strings.
For example, the finite automaton shown in Figure 12.1 can generate
strings that include the examples shown. The full set of strings that
can be generated is called the *language* of the automaton.

If instead each node has a probability distribution over generating
different terms, we have a language model. The notion of a language model is inherently probabilistic: a *language model* is a function that puts a
probability measure over strings drawn from some vocabulary. That is, for
a language model $M$ over an alphabet $\Sigma$:

$$\sum_{s \in \Sigma^*} P(s) = 1$$
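Such a model can be sketched in a few lines. The sketch below checks that a one-state model with an explicit STOP probability really does put a total probability mass of 1 over the set of all strings; the two-term vocabulary and all probability values are illustrative assumptions, not values from the book's figures.

```python
# Sketch: a one-state language model assigns each string a probability;
# with an explicit STOP probability, the masses of all strings sum to 1.
# Vocabulary and probabilities below are illustrative assumptions.
P_TERM = {"a": 0.7, "b": 0.3}   # term emission probabilities (sum to 1)
P_STOP = 0.2                    # probability of stopping after each step

def total_mass(max_len):
    """Sum of P(s) over all strings s with at most max_len terms."""
    total = 0.0
    for n in range(max_len + 1):
        # All length-n strings together have mass ((1 - stop) * 1)^n * stop,
        # since the emission probabilities sum to 1 at each step.
        total += ((1 - P_STOP) * sum(P_TERM.values())) ** n * P_STOP
    return total

print(total_mass(50))   # approaches 1.0 as max_len grows
```

The sum over string lengths is a geometric series, which is why the mass converges to 1 only because a nonzero STOP probability is included.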

**Worked example.** To find the
probability of a word sequence, we just multiply the probabilities
which the model gives to each word in the sequence, together with the
probability of continuing or stopping after producing each word. For example,

(91) $P(\textit{frog said that toad likes frog})$

(92) $\quad = (0.01 \times 0.03 \times 0.04 \times 0.01 \times 0.02 \times 0.01)$

(93) $\quad\ \times\ (0.8 \times 0.8 \times 0.8 \times 0.8 \times 0.8 \times 0.8 \times 0.2) \approx 1.26 \times 10^{-12}$

Here the first parenthesized product contains the term emission probabilities, and the second the probabilities of continuing after each word and finally stopping. As you can see, the probability of a particular string/document is usually a very small number! Here we stopped after generating *frog* the second time.
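The multiply-and-stop recipe above can be sketched directly. The emission, continue, and stop probabilities below are illustrative assumptions rather than values reproduced from the book's figures.

```python
# Sketch of the worked example's recipe: multiply each term's emission
# probability together with the continue/stop probabilities.
# All numeric values are illustrative assumptions.
P_TERM = {"frog": 0.01, "said": 0.03, "that": 0.04,
          "toad": 0.01, "likes": 0.02}
P_CONTINUE, P_STOP = 0.8, 0.2

def sequence_probability(terms):
    """P(terms, then STOP) under the one-state model."""
    p = 1.0
    for t in terms:
        p *= P_CONTINUE * P_TERM[t]   # continue, then emit this term
    return p * P_STOP                 # finally, stop

p = sequence_probability("frog said that toad likes frog".split())
print(p)   # a very small number, on the order of 1e-12
```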

**Worked example.** Suppose, now, that we have two language models $M_1$ and $M_2$, shown
partially in Figure 12.3. Each gives a probability estimate to a
sequence of terms, as illustrated in the previous worked example.
The language model that
gives the higher probability to the sequence of terms is more likely to
have generated the term sequence. This time, we will omit
STOP probabilities from our calculations.
For the sequence shown, we get:

| $s$ | frog | said | that | toad | likes | that | dog |
|---|---|---|---|---|---|---|---|
| $M_1$ | 0.01 | 0.03 | 0.04 | 0.01 | 0.02 | 0.04 | 0.005 |
| $M_2$ | 0.0002 | 0.03 | 0.04 | 0.0001 | 0.04 | 0.04 | 0.01 |

$P(s|M_1) = 4.8 \times 10^{-13}$ and $P(s|M_2) = 3.84 \times 10^{-16}$,
and we see that
$P(s|M_1) > P(s|M_2)$.
We present the formulas here in terms of products of probabilities,
but, as is common in probabilistic applications, in practice it is
usually best to work with sums of log probabilities (cf. Section 13.2).
**End worked example.**
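The log-probability point above can be sketched as follows. Since the logarithm is monotone, summing log probabilities picks the same winning model as multiplying raw probabilities, while avoiding floating-point underflow on long sequences; the two models' term probabilities here are illustrative assumptions.

```python
import math

# Sketch: compare two language models by summing log probabilities
# rather than multiplying raw probabilities. Term probabilities are
# illustrative assumptions, not values from the book's figures.
M1 = {"frog": 0.01, "said": 0.03, "that": 0.04,
      "toad": 0.01, "likes": 0.02, "dog": 0.005}
M2 = {"frog": 0.0002, "said": 0.03, "that": 0.04,
      "toad": 0.0001, "likes": 0.04, "dog": 0.01}

def log_prob(model, terms):
    """Sum of log P(t) -- monotone in the product of the P(t)."""
    return sum(math.log(model[t]) for t in terms)

query = "frog said that toad likes that dog".split()
print(log_prob(M1, query) > log_prob(M2, query))   # prints True
```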
