edu.stanford.nlp.tagger.maxent
Class MaxentTagger

java.lang.Object
  extended by edu.stanford.nlp.tagger.maxent.MaxentTagger
All Implemented Interfaces:
SentenceProcessor, Function, ListProcessor, java.io.Serializable

public class MaxentTagger
extends java.lang.Object
implements Function, SentenceProcessor, ListProcessor

A class that lets end users part-of-speech tag text with an already trained and saved maxent tagger. You can tag text through the Java API or from the command line. The two taggers included in this distribution are:

Using the Java API

A MaxentTagger can be made with a constructor taking as argument the location of parameter files for a trained tagger.
MaxentTagger tagger = new MaxentTagger("bidirectional/wsj3t0-18.holder");
Alternatively, the no-argument constructor can be used. The tagger then loads its parameters on first use from a default location, DEFAULT_NLP_GROUP_MODEL_PATH, which is set in the source file to a path that works on the Stanford NLP machines.
MaxentTagger tagger = new MaxentTagger();
To tag a string of words and get back a string of tagged words:
String taggedString = MaxentTagger.tagString("Here's a tagged string.");
To tag a Sentence and get back a Sentence of TaggedWords:
Sentence taggedSentence = MaxentTagger.tagSentence(sentence);
Sentence taggedSentence = (Sentence) tagger.apply(sentence);  // alternative, via the Function interface
To tag a List of Sentences and get back a List of tagged Sentences:
List taggedList = tagger.process(sentences);

Here is an example of using the static tagString method. The tagger can be initialized with the static init method, which takes the path of a trained model and loads it immediately (this takes a long time). Subsequent calls to init are no-ops, so it is safe to call init repeatedly. If init is never called, the default trained model is used. Each call to tagString then takes an untagged String and returns a tagged String, or null if there was a serious problem in the tagging machinery.

Example:
MaxentTagger.init("stanford-tagger/bidirectional/wsj0-18.holder");
String taggedString = MaxentTagger.tagString("Here's a tagged string.");
String taggedString2 = MaxentTagger.tagString("This is your life.");

The output is

Here's/JJ a/DT tagged/VBD string./NNP

and

This/DT is/VBZ your/PRP$ life./NN respectively.

Note that, as the output above shows, the tagger just splits on whitespace when tagging, but it expects input tokenized as in the Penn Treebank. Without a prior tokenization step the tagger will perform poorly: in the example, punctuation stays attached to the preceding word, as in "string." and "life.". You can tokenize text programmatically by calling the tokenizeText method.
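
The difference between whitespace splitting and Treebank-style tokenization can be illustrated with a small, self-contained sketch. The class WhitespaceVsTreebankSketch and its methods are hypothetical names introduced here and are not part of the tagger. The roughTreebankSplit method covers only two Penn Treebank conventions (separating one trailing punctuation mark and the 's clitic), so it is an illustration of the problem, not a replacement for tokenizeText.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of edu.stanford.nlp.
class WhitespaceVsTreebankSketch {

    // What a whitespace-only splitter sees: punctuation stays glued to words.
    static List<String> whitespaceSplit(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Rough approximation of two Penn Treebank conventions. Assumption: only a
    // single trailing . , ! ? and the 's clitic are handled; a real Treebank
    // tokenizer applies many more rules.
    static List<String> roughTreebankSplit(String text) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : whitespaceSplit(text)) {
            String trailing = "";
            // Peel one trailing punctuation mark off the chunk, if present.
            if (chunk.length() > 1
                    && ".,!?".indexOf(chunk.charAt(chunk.length() - 1)) >= 0) {
                trailing = chunk.substring(chunk.length() - 1);
                chunk = chunk.substring(0, chunk.length() - 1);
            }
            // Split the 's clitic into its own token.
            if (chunk.length() > 2 && chunk.endsWith("'s")) {
                tokens.add(chunk.substring(0, chunk.length() - 2));
                tokens.add("'s");
            } else {
                tokens.add(chunk);
            }
            if (!trailing.isEmpty()) {
                tokens.add(trailing);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Here's a tagged string.";
        System.out.println(whitespaceSplit(text));    // [Here's, a, tagged, string.]
        System.out.println(roughTreebankSplit(text)); // [Here, 's, a, tagged, string, .]
    }
}
```

With whitespace splitting the tagger receives "string." as a single unknown-looking token, which is consistent with the odd NNP tag in the sample output above.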

Tagging from the command line

Usage:
java edu.stanford.nlp.tagger.maxent.MaxentTagger -model <modelPrefix> -file <testfile> 
You can use the same properties file as for training (cf. Train) by passing it in with the "-props" argument. Besides "model" and "file", the only important arguments for tagging are "tokenize" and "tokenizeFactory". See the properties file documentation for details.

See Also:
Serialized Form

Field Summary
static java.lang.String DEFAULT_NLP_GROUP_MODEL_PATH
           
 
Constructor Summary
MaxentTagger()
          Non-static version of the tagger for the Function interface.
MaxentTagger(java.lang.String modelFile)
          Constructor from a fileName.
 
Method Summary
 java.lang.Object apply(java.lang.Object in)
          Expects a sentence and returns a tagged sentence
static void init(java.lang.String modelFile)
          Initializer that loads the dictionary.
static void main(java.lang.String[] args)
          Command-line tagger that takes input from stdin or a file.
 java.util.List process(java.util.List sentences)
          Tags the Words in each Sentence in the given List with their grammatical part-of-speech.
 Sentence processSentence(Sentence sentence)
          Returns a new Sentence that is a copy of the given sentence with all the words tagged with their part-of-speech.
static Sentence tagSentence(Sentence sentence)
          Returns a new Sentence that is a copy of the given sentence with all the words tagged with their part-of-speech.
static java.lang.String tagString(java.lang.String toTag)
          Tags the input string and returns the tagged version.
static java.util.List tokenizeText(java.io.Reader r)
          Reads data from r, tokenizes it with the default tokenizer (the Penn Treebank tokenizer), and returns a List of Sentence objects, which can then be fed into tagSentence.
static java.util.List tokenizeText(java.io.Reader r, TokenizerFactory tokenizerFactory)
          Reads data from r, tokenizes it with the given tokenizer, and returns a List of Sentence objects, which can then be fed into tagSentence.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_NLP_GROUP_MODEL_PATH

public static final java.lang.String DEFAULT_NLP_GROUP_MODEL_PATH
See Also:
Constant Field Values
Constructor Detail

MaxentTagger

public MaxentTagger()
Non-static version of the tagger for the Function interface. The tagger is not initialized in the constructor; it is initialized from DEFAULT_NLP_GROUP_MODEL_PATH the first time it is used.


MaxentTagger

public MaxentTagger(java.lang.String modelFile)
             throws java.lang.Exception
Constructor from a fileName. The modelFile is both a filename and a prefix from which other filenames are built and their data loaded. The tagger data is loaded when the constructor is called (this can be slow). Since some of the data for the tagger is static, two different taggers cannot exist at the same time.

Throws:
java.lang.Exception
Method Detail

init

public static void init(java.lang.String modelFile)
                 throws java.lang.Exception
Initializer that loads the dictionary. This maintains a flag as to whether initialization has been done previously, and if so, running this is a no-op.

Parameters:
modelFile - Filename of the trained model, for example /u/nlp/data/pos-tagger/wsj3t0-18-left3words/train-wsj-0-18.holder
Throws:
java.lang.Exception
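
The initialize-once behaviour described for init can be sketched with a static flag. This is a minimal illustration under stated assumptions: the class InitOnceSketch, its fields, and the loadCount counter are hypothetical, the synchronized modifier is an added precaution, and this documentation does not show how MaxentTagger itself tracks initialization.

```java
// Hypothetical illustration class; not part of edu.stanford.nlp.
class InitOnceSketch {

    // Static state, shared by every caller in the JVM.
    private static boolean initialized = false;
    private static String loadedModel = null;
    private static int loadCount = 0;

    // The first call records the model; every later call is a no-op,
    // even if a different model file is passed in.
    static synchronized void init(String modelFile) {
        if (initialized) {
            return;
        }
        loadedModel = modelFile; // stands in for the slow model load
        loadCount++;
        initialized = true;
    }

    static synchronized String loadedModel() {
        return loadedModel;
    }

    static synchronized int loadCount() {
        return loadCount;
    }

    public static void main(String[] args) {
        init("first.holder");   // hypothetical file names
        init("second.holder");  // ignored: already initialized
        System.out.println(loadedModel() + " loaded " + loadCount() + " time(s)");
    }
}
```

One consequence of this pattern is that a second call with a different model file is silently ignored, which fits the constructor note that two different taggers cannot exist at the same time.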

tagString

public static java.lang.String tagString(java.lang.String toTag)
                                  throws java.lang.Exception
Tags the input string and returns the tagged version. The tagger expects input consisting of space-separated tokens, tokenized according to the conventions of the Penn Treebank.

Parameters:
toTag - The untagged input String
Returns:
The same string with tags inserted in the form word/tag
Throws:
java.lang.Exception
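
Because tagString returns plain text in word/tag form, callers often need to split the result back into words and tags. The following is a minimal sketch, not part of the tagger: the class TaggedStringParser and its parse method are hypothetical, and it assumes that the last "/" in each token separates the word from the tag (so a word that itself contains a slash still parses). A null input, which the class description says can signal a serious tagging problem, yields an empty list here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of edu.stanford.nlp.
class TaggedStringParser {

    // Splits a tagged string such as "This/DT is/VBZ" into {word, tag} pairs.
    static List<String[]> parse(String tagged) {
        List<String[]> pairs = new ArrayList<>();
        if (tagged == null || tagged.trim().isEmpty()) {
            return pairs; // nothing to parse
        }
        for (String token : tagged.trim().split("\\s+")) {
            // Assumption: the last slash separates the word from its tag.
            int slash = token.lastIndexOf('/');
            if (slash < 0) {
                pairs.add(new String[] {token, ""}); // token without a tag
            } else {
                pairs.add(new String[] {
                    token.substring(0, slash), token.substring(slash + 1)});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] pair : parse("This/DT is/VBZ your/PRP$ life./NN")) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```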

apply

public java.lang.Object apply(java.lang.Object in)
Expects a sentence and returns a tagged sentence

Specified by:
apply in interface Function
Parameters:
in - This needs to be a Sentence
Returns:
A Sentence of TaggedWord

process

public java.util.List process(java.util.List sentences)
Tags the Words in each Sentence in the given List with their grammatical part-of-speech. The returned List contains Sentences consisting of TaggedWords.

NOTE: The input document must contain sentences as its elements, not words. To turn a Document of words into a Document of sentences, run it through WordToSentenceProcessor.

Specified by:
process in interface ListProcessor
Parameters:
sentences - A List of Sentence
Returns:
A List of Sentence
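
The note above requires a list whose elements are sentences, not words. The sketch below shows the general idea of grouping a flat list of tokens into sentences. It is a simplified stand-in written for illustration: the class SentenceGroupingSketch is hypothetical, it uses plain Strings rather than the library's Word and Sentence types, and it assumes that a sentence ends at a ".", "!" or "?" token. It does not reproduce the behaviour of WordToSentenceProcessor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of edu.stanford.nlp.
class SentenceGroupingSketch {

    // Groups already-tokenized words into sentences. Assumption: a sentence
    // ends at a ".", "!" or "?" token; trailing words form a final sentence.
    static List<List<String>> groupIntoSentences(List<String> words) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String word : words) {
            current.add(word);
            if (word.equals(".") || word.equals("!") || word.equals("?")) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // unterminated final sentence
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
            "This", "is", "your", "life", ".", "Here", "'s", "a", "string", ".");
        System.out.println(groupIntoSentences(words));
    }
}
```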

processSentence

public Sentence processSentence(Sentence sentence)
Returns a new Sentence that is a copy of the given sentence with all the words tagged with their part-of-speech. Convenience method for when you only want to tag a single Sentence instead of a Document of sentences.

Specified by:
processSentence in interface SentenceProcessor
Parameters:
sentence - A sentence. Classes implementing this interface can assume that the sentence passed in is not null.

tagSentence

public static Sentence tagSentence(Sentence sentence)
Returns a new Sentence that is a copy of the given sentence with all the words tagged with their part-of-speech. Convenience method for when you only want to tag a single Sentence instead of a Document of sentences.


tokenizeText

public static java.util.List tokenizeText(java.io.Reader r)
Reads data from r, tokenizes it with the default tokenizer (the Penn Treebank tokenizer), and returns a List of Sentence objects, which can then be fed into tagSentence.


tokenizeText

public static java.util.List tokenizeText(java.io.Reader r,
                                          TokenizerFactory tokenizerFactory)
Reads data from r, tokenizes it with the given tokenizer, and returns a List of Sentence objects, which can then be fed into tagSentence.


main

public static void main(java.lang.String[] args)
                 throws java.lang.Exception
Command-line tagger that takes input from stdin or a file.

Parameters:
args - Either no arguments, or a model file to read the tagger from and a file to tag. If the text is pre-tokenized in Penn Treebank format, use the -tokenized option. Usage:
java edu.stanford.nlp.tagger.maxent.MaxentTagger [-model modelName] [-file fileName] [-tokenize]
The fileName or other input is assumed to be one sentence per line unless the option -tokenize is specified.
Throws:
java.lang.Exception