info@lingsoft.fi. There is an online demo.
Information on available probabilistic parsers can be found on the FSNLP: probabilistic parsing links page.
veronis@univ-aix.fr. Some material, including a multilingual text editor, is downloadable.
MULTEXT EAST has parallel versions
of Orwell's 1984 available free (upon registration) for a number
of Central European languages.
silberz@ladl.jussieu.fr
The PennTools page collects information on a variety of NLP systems, many of which are available externally.
ldc@ldc.upenn.edu. Provides the largest range of
corpora on CD-ROM. Cost ranges from cheap (e.g., ACL-DCI disk) to pricey.
CDs can be purchased individually; institutions can become members and
receive discounts on CDs. There's an
LDC Online service for
searches over the web (mainly intended for members, but there are samplers
available).
help to fileserv@nora.hd.uib.no, by ftp to nora.hd.uib.no.
Also, manuals for these corpora.
lexical@nmsu.edu. Focuses more on language
processing tools and lexicons, but does have some corpora. As of Feb 1996,
most of their materials can be obtained by anonymous ftp to clr.nmsu.edu.
Their catalog is available as a PostScript file.
info@ota.ahds.ac.uk.
Most materials are available on the web or by anonymous ftp to ota.ox.ac.uk.
Some require negotiations with the providers.
English language corpora available from the sites above are not repeated here.
Name | Language | Size | Availability | Comments |
---|---|---|---|---|
Penn Treebank | US English | 2 million + words | Available (distributed by LDC) | 1 million WSJ, 1 million speech, surface syntax (1970s TG) |
BLLIP WSJ corpus | US English | 30 million words | Available (distributed by LDC) | WSJ newswire. Automatically parsed, not hand checked. Same structure as Penn Treebank, except for some additional coreference marking |
ICE-GB | UK English | 1 million words (83,394 sentences) | Available; c. 500 pounds | British part of ICE, the International Corpus of English project. Tagged and parsed for function. Half spoken material. |
Bulgarian Treebank | Bulgarian | n/a | POS-tagged texts and dependency analyses are available (some are free on the web, others via a license agreement) | A Bulgarian HPSG treebank, still under construction. |
Penn Chinese Treebank | Chinese | 100,000 words | Available (LDC) | Based on Xinhua news articles. 1980s-style GB syntax. |
The Prague Dependency Treebank 1.0 | Czech | 500,000 words | Free on completion of license agreement (available through LDC). | Analyzed at the levels of parts of speech and syntactic functions (and, in the future, semantic roles) in a dependency framework. Text from newspapers and weekly magazines. |
Danish Dependency Treebank 1.0 | Danish | 100,000 words | Available free under the GPL. | Built on a portion of the Parole corpus. |
Alpino Dependency Treebank | Dutch | 150,000 words | Freely downloadable | Assorted subcorpora. By far the largest is the full cdbl (newspaper) part of the Eindhoven corpus. |
NEGRA Corpus | German | 20,000 sentences | Available free of charge to academics on completion of license agreement. | Saarland University Syntactically Annotated Corpus of German Newspaper Texts. Tagged, and with syntactic structures. |
TIGER corpus | German | 700,000 words | Available free of charge for research purposes on completion of license agreement. | German newspaper text (Frankfurter Rundschau). Semi-automatically parsed. They also have a good treebank search tool, TIGERSearch. |
Icelandic Parsed Historical Corpus (IcePaHC) | Icelandic | 1,000,000 words | Free download (LGPL) | Texts from 1150 through 2008! |
TUT: Turin University Treebank | Italian | 2,400 sentences | Free download. | Morphological analysis and dependency analysis. Penn Treebank translation. Civil law and newspaper texts. |
Floresta Sintá(c)tica | Portuguese | 168,000 words hand-corrected; 1,000,000 words automatically parsed | Hand corrected part is free web download; automatically parsed part available through email contact | Text from CETEMPúblico corpus. Phrase structure and dependency representations. Available in several formats, including Penn Treebank format. |
Talbanken05 | Swedish | 300,000 words | Free download | Resurrects and modernizes an early treebank from the 1970s. |
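
Several of the treebanks above are distributed in, or can be exported to, Penn Treebank style bracketed format. As a rough illustration of what that format looks like to a program, here is a minimal Python sketch that reads one bracketed tree into nested tuples and recovers its words. The function names `parse_bracketed` and `leaves` and the sample sentence are made up for this example and do not come from any of the listed distributions; real treebank files also contain traces, function tags, and other details this sketch ignores.

```python
import re


def parse_bracketed(text):
    """Parse one bracketed tree, e.g. "(S (NP (DT The) (NN cat)) ...)",
    into nested (label, children) tuples. Words are plain strings."""
    # Tokens are "(", ")", or any run of characters without spaces/parens.
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1  # consume "("
        # Some files wrap each tree in an extra unlabeled bracket: "( (S ...))".
        if tokens[pos] in ("(", ")"):
            label = ""
        else:
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("unexpected material after the first tree")
    return tree


def leaves(tree):
    """Return the words of a parsed tree in left-to-right order."""
    _label, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


if __name__ == "__main__":
    sample = "(S (NP (DT The) (NN cat)) (VP (VBD sat)))"  # invented example
    print(leaves(parse_bracketed(sample)))  # ['The', 'cat', 'sat']
```

For real work, an established toolkit's treebank reader is a safer choice than a hand-rolled parser like this one.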
CSTBank: Cross-document Structure Theory: marking sentence functional relationships across related documents.
There are now quite large collections of online literature, available in various languages (though the majority are in English, of course). Below are pointers to some of the main collections:
The following dictionaries all list surface subcategorization frames (each with a different annotation scheme). They are also all available in electronic form from the publishers (not free).
Not exactly dictionaries, but other popular sources are:
See also COMLEX and CELEX available from the LDC.
U.S. names with frequency information are available from the Census Bureau.
Mailing lists that have information on these topics include:
subscribe corpora
listserv@uib.no. Or, if you want to subscribe with a different
email address, send:
subscribe corpora email-address
empiricists-request@unagi.cis.upenn.edu.
Home pages with something useful on them.
Still under construction...
http://nlp.stanford.edu/links/statnlp.html