Foundations of Statistical Natural Language Processing
Christopher D. Manning and Hinrich Schütze
Chapter 4: Corpus-Based Work
Links referred to in the text
Sources of corpora
The Linguistic Data Consortium (LDC)
European Language Resources Association (ELRA)
International Computer Archive of Modern English (ICAME)
Oxford Text Archive (OTA)
Child Language Data Exchange System (CHILDES)
Manning's StatNLP page has general resources for online text
Info on Corpus Availability (SIL)
Particular corpora referred to
The Brown corpus is available from ICAME (a tagged version of the Brown corpus is also available from the LDC as part of the Penn Treebank)
The Penn Treebank is available from the LDC (LDC Catalog entry)
British National Corpus (BNC) or BNC online
ICE-GB Corpus, the British component of the International Corpus of English.
Markup schemes
SGML/XML Web Page, Summer Institute of Linguistics
Text Encoding Initiative, Home Page
Text Encoding Initiative, SIL page
The Corpus Encoding Standard
AMALGAM
Tokenizing rules, sentence divider
CMU-Cambridge Statistical Language Modeling toolkit
Other links
The Consortium for Lexical Research (NMSU)
CorpusToolbox
Alembic Workbench
German morphological analyzer
DFKI Natural Language Software Registry
MULTEXT-East home page
Department of Artificial Intelligence, Edinburgh
Dan Melamed's NLP software links
List of Resources at Cambridge
Joakim Nivre's StatNLP resources
Negra: German Newspaper Corpus
Kita lab list of NLP tools, part of the Kita lab speech and language resources