A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb, or adjective, to each word (and other token). Computational applications generally use more fine-grained POS tags, such as 'noun-plural'. This software is a Java implementation of the log-linear part-of-speech taggers described in:
Kristina Toutanova and Christopher D. Manning. 2000. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), pp. 63-70.
Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003, pp. 252-259.
The tagger was originally written by Kristina Toutanova. Since that time, Dan Klein, Christopher Manning, William Morgan, Anna Rafferty, and Michel Galley have improved its speed, performance, usability, and support for other languages.
The system requires Java 1.5 or later. About 60 MB of memory is
required to run a trained tagger (i.e., you may need to give java an
option like java -mx60m). Training a tagger needs considerably more
memory. The amount depends on the complexity of the model, but at
least 1 GB is recommended.
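As a rough illustration of the memory figure above, the following sketch checks the JVM's heap limit before a model is loaded. It is not part of the tagger distribution: the class name `TaggerHeapCheck` and the method `hasEnoughHeap` are hypothetical, and only the 60 MB figure comes from this page. Note that `Runtime.maxMemory()` can report somewhat less than the value passed with `-mx` on some JVMs, so treat the result as a guide rather than a hard test.

```java
public class TaggerHeapCheck {
    // 60 MB is the figure quoted on this page for running a trained tagger.
    public static final long REQUIRED_MB = 60;

    /** Returns true if a heap limit of maxHeapBytes covers requiredMb megabytes. */
    public static boolean hasEnoughHeap(long maxHeapBytes, long requiredMb) {
        return maxHeapBytes >= requiredMb * 1024L * 1024L;
    }

    public static void main(String[] args) {
        // Upper bound on the heap this JVM will try to use.
        long maxHeap = Runtime.getRuntime().maxMemory();
        long maxHeapMb = maxHeap / (1024L * 1024L);
        if (hasEnoughHeap(maxHeap, REQUIRED_MB)) {
            System.out.println("Heap limit looks sufficient: about " + maxHeapMb + " MB");
        } else {
            System.out.println("Heap limit is about " + maxHeapMb
                    + " MB; consider an option like java -mx" + REQUIRED_MB + "m");
        }
    }
}
```

Running this with and without a `-mx` option shows how the option changes the limit the JVM reports.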
Several downloads are available. The basic download contains two trained tagger models for English. The full download contains three trained English tagger models, an Arabic tagger model, a Chinese tagger model, and a German tagger model. Both versions include the same source and other required files. The tagger can be retrained on any language based on POS-annotated training text. Each of the trained taggers can also be downloaded individually.
Part-of-speech name abbreviations: The two English taggers use the Penn Treebank tag set. Here are some links to documentation of the Penn Treebank English POS tag set: 1993 Computational Linguistics article in PDF, AMALGAM page, Aoife Cahill's list. See the included README-Models.txt in the models directory for more information about the tagsets for the other languages.
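To give a feel for the fine-grained Penn Treebank tags (for example, NN for a singular noun versus NNS for a plural one), here is a toy sketch that tags tokens by dictionary lookup. It only illustrates the input and output of tagging; it is not how the Stanford tagger works, which uses a trained log-linear model. The class `ToyTagger`, its tiny lexicon, the `/` separator, and the `UNK` placeholder for unknown words are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ToyTagger {
    // Hypothetical hand-written lexicon using Penn Treebank tags.
    // A real tagger learns its model from POS-annotated training text.
    private static final Map<String, String> LEXICON = new HashMap<String, String>();
    static {
        LEXICON.put("the", "DT");    // determiner
        LEXICON.put("dog", "NN");    // noun, singular
        LEXICON.put("dogs", "NNS");  // noun, plural
        LEXICON.put("bark", "VBP");  // verb, non-3rd-person singular present
        LEXICON.put("barks", "VBZ"); // verb, 3rd-person singular present
    }

    /** Splits on whitespace and attaches a tag to each token as token/TAG. */
    public static List<String> tag(String sentence) {
        List<String> tagged = new ArrayList<String>();
        String trimmed = sentence.trim();
        if (trimmed.length() == 0) {
            return tagged; // nothing to tag
        }
        for (String token : trimmed.split("\\s+")) {
            String tag = LEXICON.get(token.toLowerCase());
            // "UNK" is a placeholder for this sketch, not a Penn Treebank tag.
            tagged.add(token + "/" + (tag == null ? "UNK" : tag));
        }
        return tagged;
    }

    public static void main(String[] args) {
        System.out.println(tag("The dogs bark")); // prints [The/DT, dogs/NNS, bark/VBP]
    }
}
```

A lookup table cannot resolve ambiguity (for instance, "bark" as a noun versus a verb), which is the kind of decision a trained tagger makes from context.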
For documentation, first take a look at the included README.txt and the javadocs,
particularly the javadoc for MaxentTagger. For a worked walkthrough,
Matthew Jockers has kindly produced
an example and tutorial for running the tagger.
There is also a brief FAQ.
Additional questions, feedback, and bug reports or bug fixes can be sent to our mailing lists.
The tagger is licensed under the GNU GPL. (Note that this is the full GPL, which allows its use for research purposes, free software projects, etc., but does not allow its incorporation into any type of distributed proprietary software, even in part or in translation.) Source is included. The package includes components for command-line invocation and a Java API. Commercial licensing of the POS tagger is also available.
We have three mailing lists for the Stanford POS Tagger, all of which are shared
with other JavaNLP tools (with the exception of the parser). Each address is
at @lists.stanford.edu:
java-nlp-user This is the best list to post to in order
to ask questions, make announcements, or for discussion among JavaNLP
users. You have to subscribe to be able to use it.
Join the list via this webpage or by emailing
java-nlp-user-join@lists.stanford.edu. (Leave the
subject and message body empty.) You can also
look at
the list archives.
java-nlp-announce This list will be used only to announce
new versions of Stanford JavaNLP tools. So it will be very low volume (expect 1-3
messages a year). Join the list via this webpage or by emailing
java-nlp-announce-join@lists.stanford.edu. (Leave the
subject and message body empty.)
java-nlp-support This list goes only to the software
maintainers. It's a good address for licensing questions, etc. For
general use and support questions, please join and use
java-nlp-user.
You cannot join java-nlp-support, but you can mail questions to
java-nlp-support@lists.stanford.edu.
Download basic English Stanford Tagger version 1.6
Download full Stanford Tagger version 1.6
Download Stanford Arabic tagger data (compatible with Stanford Tagger v1.6)
Download Stanford Chinese tagger data (compatible with Stanford Tagger v1.6)
Download Stanford German tagger data (compatible with Stanford Tagger v1.6)
The basic download is a 22 MB gzipped tar file (mainly consisting of included model files); the full download is a 42 MB gzipped tar file. If you unpack the tar file, you should have everything needed. This software provides a GUI demo, a command-line interface and an API. A simple script is included to invoke the tagger on a Unix system. For another system, an appropriate java command can be given, as described in the included README.txt. The only difference between the basic and full downloads is that the basic version contains only English trained models for a smaller download. The individual tagger downloads contain the taggers not included in the basic version but included in the full version; all require the current version of the Stanford Tagger (v1.6) to run. Biomedical taggers for English biomedical text will be available shortly.
| Version | Date | Description |
|---|---|---|
| 1.6 | 2008-09-28 | A fraction better, a fraction faster, more flexible model specification, and quite a few fewer bugs. |
| 1.5.1 | 2008-06-06 | Tagger properties are now saved with the tagger, making taggers more portable; tagger can be trained off of treebank data or tagged text; fixes classpath bugs in 2 June 2008 patch; new foreign language taggers released on 7 July 2008 and packaged with 1.5.1. |
| 1.5 | 2008-05-19 | Added international taggers, support for reading from and writing to XML, better support for changing the encoding, distributional similarity options, and many more small changes; patched on 2 June 2008 to fix a bug with tagging pre-tokenized text. |
| 1.0 | 2006-01-10 | First cleaned-up release after Kristina graduated. |
| 0.1 | 2004-08-16 | First release. |
Site design by Bill MacCartney