Foundations of Statistical Natural Language Processing
Christopher D. Manning and Hinrich Schütze
Chapter 6: Statistical estimation: n-gram models over sparse data
Links referred to in the text
- The CMU-Cambridge Statistical Language Modeling toolkit and its documentation. The best freely available tool for building language models.
- The Austen text files that were used to build the sample language models were obtained from Project Gutenberg (perhaps try the Sailor's Project Gutenberg site mirror).
- To remove punctuation from the text files, we used the following Unix sed script: sed.strip. It specifies a number of global substitutions in terms of very simple regular expressions. If sed is not available, it would be very easy to write the same thing in Perl, or one could just do the substitutions in a text editor.
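The sed script itself is linked above; as a rough illustration of the same idea (this is not the actual sed.strip, and the particular substitutions are an assumption), the equivalent global substitutions could be written in a few lines of Python:

```python
import re

def strip_punctuation(text):
    # Replace punctuation with spaces, keeping word-internal apostrophes;
    # the exact character classes here are illustrative, not sed.strip's.
    text = re.sub(r"[^\w\s']", " ", text)
    # Collapse the runs of whitespace left behind by the substitutions.
    return re.sub(r"\s+", " ", text).strip()

print(strip_punctuation('"No," said Anne; "it was not so."'))
```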
- Here are the resulting 'clean' text files that we used: training data (a concatenation of various novels) and test data (a cleaned-up Persuasion).
- The Good-Turing estimates for Austen in Table 6.8 were calculated using Gale and Sampson's (1995) Simple Good-Turing technique, via Sampson's C program SGT.c, available from his website. The frequency of frequency data that was used as input is available in this file. (To do exercise 6.6, one approach is to use a language modelling toolkit to generate raw n-grams, a Perl program to do counts over those n-grams, and then to feed those counts into SGT.c for Good-Turing estimation.)
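The counting stage of that pipeline can be sketched briefly. The helper names and the toy sentence below are invented for illustration; the (r, N_r) table is the frequency-of-frequency format that SGT.c takes as input, and the unsmoothed Turing estimate shown is what Gale and Sampson's technique computes from smoothed N_r values:

```python
from collections import Counter

def ngrams(tokens, n):
    # Raw n-grams of a token sequence (hypothetical helper).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def freq_of_freq(counts):
    # Map each count r to N_r, the number of n-gram types occurring r times.
    return Counter(counts.values())

def turing_estimate(r, Nr):
    # Unsmoothed Turing estimate r* = (r + 1) N_{r+1} / N_r.
    # SGT.c instead smooths the N_r values first (Gale & Sampson 1995).
    return (r + 1) * Nr.get(r + 1, 0) / Nr[r]

tokens = "the cat sat on the mat the cat ran".split()
bigram_counts = Counter(ngrams(tokens, 2))
Nr = freq_of_freq(bigram_counts)   # here: six bigrams seen once, one seen twice
print(turing_estimate(1, Nr))      # adjusted count for once-seen bigrams
```

In practice the raw n-grams would come from the toolkit above rather than from Python; the point is the shape of the (r, N_r) table fed to SGT.c.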
- This file gives examples of some of the commands we used in calculations in the chapter, using standard Unix commands and programs from the CMU-Cambridge Statistical Language Modeling toolkit: recipes.txt.
- Gertjan van Noord's table of language identification systems available on the WWW.
Other links of interest
- The SRI Language Modeling toolkit by Andreas Stolcke is another good system for building language models, freely available for research purposes.