Foundations of Statistical Natural Language Processing
Christopher D. Manning and Hinrich Schütze
Chapter 6: Statistical estimation: n-gram models over sparse data
Links referred to in the text
- The CMU-Cambridge Statistical Language Modeling toolkit and its documentation. The best freely available tool for building language models.
- The Austen text files that were used to build the sample language models were obtained from Project Gutenberg (perhaps try the Sailor's Project Gutenberg site mirror).
- To remove punctuation from the text files, we used the following Unix sed script: sed.strip. It specifies a number of global substitutions in terms of very simple regular expressions. If sed is not available, it would be very easy to write the same thing in Perl, or one could just do the substitutions in a text editor.
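The sed script itself is linked above; as a rough illustration of the same idea (this is not the actual sed.strip, and the particular substitutions are an assumption), the equivalent global substitutions could be written in a few lines of Python:

```python
import re

def strip_punctuation(text):
    # Replace punctuation with spaces, keeping word-internal apostrophes;
    # the exact character classes here are illustrative, not sed.strip's.
    text = re.sub(r"[^\w\s']", " ", text)
    # Collapse the runs of whitespace left behind by the substitutions.
    return re.sub(r"\s+", " ", text).strip()

print(strip_punctuation('"No," said Anne; "it was not so."'))
```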
- Here are the resulting 'clean' text files that we used: training data (a concatenation of various novels) and test data (a cleaned-up Persuasion).
- The Good-Turing estimates for Austen in Table 6.8 were calculated using Gale and Sampson's (1995) Simple Good-Turing technique, via Sampson's C program SGT.c, available from his website. The frequency of frequency data that was used as input is available in this file. (To do exercise 6.6, one approach is to use a language modelling toolkit to generate raw n-grams, a Perl program to do counts over those n-grams, and then to feed those counts into SGT.c for Good-Turing estimation.)
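The counting stage of that pipeline can be sketched briefly. The helper names and the toy sentence below are invented for illustration; the (r, N_r) table is the frequency-of-frequency format that SGT.c takes as input, and the unsmoothed Turing estimate shown is what Gale and Sampson's technique computes from smoothed N_r values:

```python
from collections import Counter

def ngrams(tokens, n):
    # Raw n-grams of a token sequence (hypothetical helper).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def freq_of_freq(counts):
    # Map each count r to N_r, the number of n-gram types occurring r times.
    return Counter(counts.values())

def turing_estimate(r, Nr):
    # Unsmoothed Turing estimate r* = (r + 1) N_{r+1} / N_r.
    # SGT.c instead smooths the N_r values first (Gale & Sampson 1995).
    return (r + 1) * Nr.get(r + 1, 0) / Nr[r]

tokens = "the cat sat on the mat the cat ran".split()
bigram_counts = Counter(ngrams(tokens, 2))
Nr = freq_of_freq(bigram_counts)   # here: six bigrams seen once, one seen twice
print(turing_estimate(1, Nr))      # adjusted count for once-seen bigrams
```

In practice the raw n-grams would come from the toolkit above rather than from Python; the point is the shape of the (r, N_r) table fed to SGT.c.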
- This file gives examples of some of the commands we used in calculations in the chapter, using standard Unix commands and programs from the CMU-Cambridge Statistical Language Modeling toolkit: recipes.txt.
- Gertjan van Noord's table of language identification systems available on the WWW.
Other links of interest
- The SRI Language Modeling toolkit by Andreas Stolcke is another good system for building language models, freely available for research purposes.