Foundations of Statistical Natural Language Processing

Christopher D. Manning and Hinrich Schütze

Chapter 4: Corpus-Based Work

Links referred to in the text

Sources of corpora
Particular corpora referred to
- The Brown corpus is available from ICAME (a tagged version of the Brown corpus is also available from the LDC as part of the Penn Treebank)
- The Penn Treebank is available from the LDC (LDC Catalog entry)
- British National Corpus (BNC) or BNC online
- ICE-GB Corpus: the British component of the International Corpus of English
Markup schemes
AMALGAM
Tokenizing rules, sentence divider
CMU-Cambridge Statistical Language Modeling toolkit

Other links
