A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb, or adjective, to each word (and other tokens), although computational applications generally use more fine-grained POS tags like 'noun-plural'. This software is a Java implementation of the log-linear part-of-speech taggers described in these papers (if citing just one paper, cite the 2003 one):
Kristina Toutanova and Christopher D. Manning. 2000. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), pp. 63-70.
Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003, pp. 252-259.
The tagger was originally written by Kristina Toutanova. Since that time, Dan Klein, Christopher Manning, William Morgan, Anna Rafferty, Michel Galley, and John Bauer have improved its speed, performance, usability, and support for other languages.
The system requires Java 8+ to be installed. Depending on whether you're running 32 or 64 bit Java and the complexity of the tagger model, you'll need somewhere between 60 and 200 MB of memory to run a trained tagger (i.e., you may need to give Java an option like java -mx200m). Plenty of memory is needed to train a tagger. It again depends on the complexity of the model but at least 1GB is usually needed, often more.
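As a rough illustration of the memory advice above, the following sketch reports the maximum heap available to the running JVM and compares it with the 200 MB upper figure quoted in the text. The class name HeapCheck and its threshold constant are hypothetical names chosen for this example and are not part of the tagger distribution. The value the JVM reports can differ slightly from the -mx setting, so treat the result as a hint only.

```java
public class HeapCheck {

    // Upper figure quoted in the text for running a trained tagger (an illustrative threshold).
    public static final long SUGGESTED_HEAP_BYTES = 200L * 1024 * 1024;

    /** Returns true if the given maximum heap size is at least the suggested 200 MB. */
    public static boolean hasEnoughHeap(long maxHeapBytes) {
        return maxHeapBytes >= SUGGESTED_HEAP_BYTES;
    }

    public static void main(String[] args) {
        // Maximum amount of memory the JVM will attempt to use, in bytes.
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + (maxHeap / (1024 * 1024)) + " MB");
        if (!hasEnoughHeap(maxHeap)) {
            System.out.println("Heap may be too small; consider an option like java -mx200m");
        }
    }
}
```

Running it with and without an -mx option shows how the setting changes the reported heap.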
Current downloads contain three trained tagger models for English, two each for Chinese and Arabic, and one each for French, German, and Spanish. The tagger can be retrained on any language, given POS-annotated training text for the language.
Part-of-speech name abbreviations: The English taggers use the Penn Treebank tag set. Here are some links to documentation of the Penn Treebank English POS tag set: 1993 Computational Linguistics article in PDF, Chameleon Metadata list (which includes recent additions to the set). The French, German, and Spanish models all use the UD (v2) tagset. See the included README-Models.txt in the models directory for more information about the tagset for each language.
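To give a feel for the fine-grained Penn Treebank tags, here is a small sketch listing a handful of them with their standard descriptions. It is only a sample, not the full tag set. The class PennTagExamples and the prefix-based coarseClass helper are hypothetical, written for this illustration, and are not part of the tagger.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PennTagExamples {

    /** A few Penn Treebank tags with their descriptions (a sample, not the full set). */
    public static Map<String, String> sampleTags() {
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("NN", "noun, singular or mass");
        tags.put("NNS", "noun, plural");
        tags.put("VB", "verb, base form");
        tags.put("VBD", "verb, past tense");
        tags.put("JJ", "adjective");
        return tags;
    }

    /** Collapses a fine-grained tag to a coarse class by prefix (illustrative simplification). */
    public static String coarseClass(String tag) {
        if (tag.startsWith("NN")) return "noun";
        if (tag.startsWith("VB")) return "verb";
        if (tag.startsWith("JJ")) return "adjective";
        return "other";
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : sampleTags().entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue() + "\t" + coarseClass(e.getKey()));
        }
    }
}
```

The prefix trick only covers the three classes shown; see the tag set documentation linked above for the complete list.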
The tagger code is dual licensed (in a similar manner to MySQL, etc.). The tagger is licensed under the GNU General Public License (v2 or later), which allows many free uses. Source is included. The package includes components for command-line invocation, running as a server, and a Java API. For distributors of proprietary software, commercial licensing is available. If you don't need a commercial license, but would like to support maintenance of these tools, we welcome gift funding.
For documentation, first take a look at the included README.txt.
Galal Aly wrote a tagging tutorial focused on usage in Java with Eclipse.
For more details, look at our included javadocs, particularly the javadoc for MaxentTagger.
There is a FAQ.
Matthew Jockers kindly produced an example and tutorial for running the tagger. This particularly concentrates on command-line usage with XML and (Mac OS X) xGrid.
Have a support question? Ask us on Stack Overflow using the tag stanford-nlp.
Feedback and bug reports / fixes can be sent to our mailing lists.
Tag text from a file text.txt, producing tab-separated-column output:
java -cp "*" edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/english-left3words-distsim.tagger -textFile text.txt -outputFormat tsv -outputFile text.tag
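The output file from the command above can be read back into a program. Here is a minimal sketch, assuming each non-blank line of text.tag holds a token and its tag separated by a tab, and that blank lines separate sentences; check your own output before relying on this layout. TaggedTsvReader and TaggedToken are hypothetical names for this example, not part of the tagger's API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TaggedTsvReader {

    /** A token paired with its POS tag (hypothetical helper type). */
    public static final class TaggedToken {
        public final String word;
        public final String tag;

        TaggedToken(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    /** Parses lines of the assumed form word<TAB>tag, skipping blank lines. */
    public static List<TaggedToken> parse(List<String> lines) {
        List<TaggedToken> tokens = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // assumed sentence separator
            }
            String[] cols = line.split("\t");
            if (cols.length < 2) {
                throw new IllegalArgumentException("Expected word<TAB>tag but got: " + line);
            }
            tokens.add(new TaggedToken(cols[0], cols[1]));
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // Read the file named on the command line, or fall back to a small inline sample.
        List<String> lines = args.length > 0
                ? Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8)
                : Arrays.asList("The\tDT", "dogs\tNNS", "barked\tVBD", "");
        for (TaggedToken t : parse(lines)) {
            System.out.println(t.word + " -> " + t.tag);
        }
    }
}
```

Pass the path to text.tag as the first argument to parse a real output file.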
We have 3 mailing lists for the Stanford POS Tagger, all of which are shared with other JavaNLP tools (with the exclusion of the parser). Each address is at @lists.stanford.edu:

java-nlp-user: This is the best list to post to in order to send feature requests, make announcements, or for discussion among JavaNLP users. (Please ask support questions on Stack Overflow using the stanford-nlp tag.) You have to subscribe to be able to use this list. Join the list via this webpage or by emailing java-nlp-user-join@lists.stanford.edu. (Leave the subject and message body empty.) You can also look at the list archives.

java-nlp-announce: This list will be used only to announce new versions of Stanford JavaNLP tools. So it will be very low volume (expect 1-3 messages a year). Join the list via this webpage or by emailing java-nlp-announce-join@lists.stanford.edu. (Leave the subject and message body empty.)

java-nlp-support: This list goes only to the software maintainers. It's a good address for licensing questions, etc. For general use and support questions, you're better off joining and using java-nlp-user. You cannot join java-nlp-support, but you can mail questions to java-nlp-support@lists.stanford.edu.
The full download is a 75 MB zipped file including models for English, Arabic, Chinese, French, Spanish, and German. Once you unpack the archive, you should have everything needed. This software provides a GUI demo, a command-line interface, and an API. Simple scripts are included to invoke the tagger. For more information on use, see the included README.txt.
Version | Date | Description | Download |
---|---|---|---|
4.2.0 | 2020-11-17 | Add currency data for English models | Full |
4.1.0 | 2020-08-06 | Missing tagger extractor class added, Spanish tokenization improvements | Full |
4.0.0 | 2020-04-19 | Model tokenization updated to UDv2.0 | Full |
3.9.2 | 2018-10-16 | New English models, better currency symbol handling | English / Full |
3.9.1 | 2018-02-27 | new French UD model | English / Full |
3.8.0 | 2017-06-09 | new Spanish and French UD models | English / Full |
3.7.0 | 2016-10-31 | Update for compatibility, German UD model | English / Full |
3.6.0 | 2015-12-09 | Updated for compatibility | English / Full |
3.5.2 | 2015-04-20 | Updated for compatibility | English / Full |
3.5.1 | 2015-01-29 | General bugfixes | English / Full |
3.5.0 | 2014-10-26 | Upgrade to Java 8 | English / Full |
3.4.1 | 2014-08-27 | Add Spanish model | English / Full |
3.4 | 2014-06-16 | French model uses CC tagset | English / Full |
3.3.1 | 2014-01-04 | Bugfix release | English / Full |
3.3.0 | 2013-11-12 | imperatives included in English model | English / Full |
3.2.0 | 2013-06-20 | improved speed & size of all models | English / Full |
3.1.5 | 2013-04-04 | ctb7 model, -nthreads option, improved speed | English / Full |
3.1.4 | 2012-11-11 | Improved Chinese model | English / Full |
3.1.3 | 2012-07-09 | Minor bug fixes | English / Full |
3.1.2 | 2012-05-22 | Included some "tech" words in the latest model | English / Full |
3.1.1 | 2012-03-09 | Caseless models added for English | English / Full |
3.1.0 | 2012-01-06 | French tagger added, tagging speed improved | English / Full |
3.0.4 | 2011-09-14 | Compatible with other recent Stanford releases. | English / Full |
3.0.3 | 2011-06-19 | Compatible with other recent Stanford releases. | English / Full |
3.0.2 | 2011-05-15 | Addition of TSV input format. | English / Full |
3.0.1 | 2011-04-20 | Faster Arabic and German models. Compatible with other recent Stanford releases. | English / Full |
3.0 | 2010-05-21 | Tagger is now re-entrant. New tagger objects are loaded with tagger = new MaxentTagger(path) and then used with tagger.tagMethod... | English / Full |
2.0 | 2009-12-24 | An order of magnitude faster, slightly more accurate best model, more options for training and deployment. | English / Full |
1.6 | 2008-09-28 | A fraction better, a fraction faster, more flexible model specification, and quite a few less bugs. | English / Full |
1.5.1 | 2008-06-06 | Tagger properties are now saved with the tagger, making taggers more portable; tagger can be trained off of treebank data or tagged text; fixes classpath bugs in 2 June 2008 patch; new foreign language taggers released on 7 July 2008 and packaged with 1.5.1. | English / Full / Updated models |
1.5 | 2008-05-21 | Added taggers for several languages, support for reading from and writing to XML, better support for changing the encoding, distributional similarity options, and many more small changes; patched on 2 June 2008 to fix a bug with tagging pre-tokenized text. | English / Full |
1.0 | 2006-01-10 | First cleaned-up release after Kristina graduated. | Old School |
0.1 | 2004-08-16 | First release. | |