Stanford POS tagger FAQ

Questions

  1. How can I unpack the gzipped tar file?
  2. Why do I get Exception in thread "main" java.lang.NoClassDefFoundError: edu/stanford/nlp/tagger/maxent/MaxentTagger?
  3. How can I lemmatize (reduce to a base, dictionary form) words that have been tagged with the POS tagger?
  4. Why am I running out of memory tagging a lot of data (using the 2008-10-26 version)?

Questions with answers

  1. How can I unpack the gzipped tar file?

    On Unix/Linux (or the Mac OS X command line), use GNU tar if you aren't already. (On Linux, you are almost certainly using GNU tar.) For reasons we don't understand, the archive does not seem to unpack with classic Unix tar. If you have not gunzipped the file in advance, make sure you specify the -z option: tar -xzf stanford-ner-2008-05-07.tar.gz.

    On Windows, it unpacks fine with most common tools, such as WinZip or 7-Zip. The latter is open source. (As of Sep 2007, WinRAR doesn't work: it apparently does not handle tar files correctly.)

    On the Mac, just double-click it to unpack. The default unarchiver (BOMArchiveHelper) works fine.

    If it won't unpack, then normally either the download is corrupted (try downloading it again) or there is a configuration error on your system, which we can't help with.
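
    One way to tell a corrupted download from a local configuration problem is to read the whole gzip stream and see whether it decompresses cleanly. The following is only a minimal sketch using the standard java.util.zip classes; the class name GzipCheck and the method isReadableGzip are made up for this illustration and are not part of the tagger distribution.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of the Stanford distribution.
public class GzipCheck {

    /** Returns true if the file at path can be read to the end as gzip data. */
    static boolean isReadableGzip(String path) {
        byte[] buf = new byte[8192];
        try (InputStream in = new GZIPInputStream(new FileInputStream(path))) {
            // Decompress everything; a truncated or damaged file throws an IOException.
            while (in.read(buf) != -1) {
                // discard the data, we only care whether reading succeeds
            }
            return true;
        } catch (IOException e) {
            // Covers a missing file, a non-gzip file, truncation and checksum errors.
            return false;
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java GzipCheck <file.tar.gz>");
            return;
        }
        System.out.println(isReadableGzip(args[0])
                ? "gzip stream reads cleanly"
                : "gzip stream is unreadable; try downloading again");
    }
}
```

    If the stream reads cleanly but tar still fails, the problem is more likely the tar program or the local setup than the download.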

  2. Why do I get Exception in thread "main" java.lang.NoClassDefFoundError: edu/stanford/nlp/tagger/maxent/MaxentTagger?

    This means your Java CLASSPATH isn't set correctly, so the tagger (in stanford-tagger.jar) isn't being found. See the examples in the README.txt file for how to set the classpath with the -cp or -classpath option.
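
    To confirm whether the classpath is the problem, you can run a small check with the same -cp setting you use for the tagger. This is only a sketch: the class name ClasspathCheck and the method isOnClasspath are made up for this illustration, and only standard Java calls are used.

```java
// Hypothetical diagnostic, not part of the Stanford distribution.
public class ClasspathCheck {

    /** Returns true if the named class can be located on the current classpath. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String tagger = "edu.stanford.nlp.tagger.maxent.MaxentTagger";
        // Show what the JVM actually received as its classpath.
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
        System.out.println(tagger + (isOnClasspath(tagger) ? " was found" : " was NOT found"));
    }
}
```

    If the class is reported as not found, the printed java.class.path value shows which jar files and directories the JVM actually searched.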

    See, e.g., http://en.wikipedia.org/wiki/Classpath_(Java) for general discussion of the Java classpath.

  3. How can I lemmatize (reduce to a base, dictionary form) words that have been tagged with the POS tagger?

    For English (only), you can do this using the included Morphology class. However, unlike for the Stanford parser, there is at present no support for doing this automatically using options of the command-line version of the tagger. You'd have to do it using code you write.

  4. Why am I running out of memory tagging a lot of data (using the 2008-10-26 version)?

    You're probably using the tagString() method. Unfortunately, in this version its memory use grows as more data is tagged. That method may well not be what you want anyway: it assumes that the input is already tokenized according to the conventions of the tagger training corpus. For the English models we use, which are derived from the Penn Treebank, this means things like separating off contractions of "be" and "n't", rendering parentheses as -LRB- and -RRB-, etc. If the input is not tokenized this way, accuracy will suffer.
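
    To make the conventions concrete, here is a rough sketch of the kind of normalization meant. It is not the tokenizer used to train the models: the class PtbStyleSketch and its normalize method are made up for this illustration, and they cover only the cases named above (parentheses, "n't", and the clitics 's, 're and 'm), not commas, periods, quotes or other punctuation.

```java
// Hypothetical illustration, not part of the Stanford distribution.
public class PtbStyleSketch {

    /** Applies a few Penn Treebank-style conventions to whitespace-separated text. */
    static String normalize(String text) {
        String s = text;
        // Render parentheses as the -LRB- / -RRB- tokens.
        s = s.replace("(", " -LRB- ").replace(")", " -RRB- ");
        // Split off "n't": "don't" becomes "do n't", "can't" becomes "ca n't".
        s = s.replaceAll("(?i)(\\w)(n't)\\b", "$1 $2");
        // Split off 's, 're and 'm: "it's" becomes "it 's".
        s = s.replaceAll("(?i)(\\w)('s|'re|'m)\\b", "$1 $2");
        // Collapse the extra spaces introduced above.
        return s.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(normalize("I don't think it's (really) done"));
        // prints: I do n't think it 's -LRB- really -RRB- done
    }
}
```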

    For no very good reason, in the 2008-09-28 distribution the tagString method is set up to do tagging using a beam search, whereas the main method of MaxentTagger and the tagSentence(Sentence) method called in TaggerDemo.java call a different Viterbi search routine to do the part-of-speech tagging. There seem to be problems with the former, so you should use tagSentence(). If you have Strings whose tokenization you are happy with, you can switch to tagSentence() easily: rather than calling tagString(String), you could use the line:

    String taggedLine = MaxentTagger.tagSentence(Sentence.toSentence(line.split("\\s+"))).toString(false);

You can discuss other topics with Stanford POS tagger developers and users by joining the java-nlp-user mailing list (via a webpage). Or you can send other questions and feedback to java-nlp-support@lists.stanford.edu.