edu.stanford.nlp.tagger.maxent
Class MaxentTagger

java.lang.Object
  extended by edu.stanford.nlp.tagger.maxent.MaxentTagger
All Implemented Interfaces:
SentenceProcessor, ListProcessor<Sentence,Sentence>, Function<Sentence,Sentence>

public class MaxentTagger
extends Object
implements Function<Sentence,Sentence>, SentenceProcessor, ListProcessor<Sentence,Sentence>

A class that lets end users part-of-speech tag text using an already trained and saved maxent tagger. You can tag text through the Java API or from the command line. The two taggers included in this distribution are:

Using the Java API

A MaxentTagger can be made with a constructor taking as argument the location of parameter files for a trained tagger:
MaxentTagger tagger = new MaxentTagger("bidirectional/wsj3t0-18.holder");

Alternatively, a constructor with no arguments can be used, which reads the parameters from a default location (which has to be set in the source file, and is set to a path that works on the Stanford NLP machines):
MaxentTagger tagger = new MaxentTagger();

To tag a string of words and get a string of tagged words:
String taggedString = tagger.tagString("Here's a tagged string.");

To tag a Sentence and get a tagged Sentence, use either of:
Sentence taggedSentence = tagger.tagSentence(sentence);
Sentence taggedSentence = tagger.apply(sentence);

To tag a list of sentences and get back a list of tagged sentences:
List taggedList = tagger.process(sentences);

Here is an example of using the static tagString method:
MaxentTagger.init("stanford-tagger/bidirectional/wsj0-18.holder");
String taggedString = MaxentTagger.tagString("Here's a tagged string.");
String taggedString2 = MaxentTagger.tagString("This is your life.");

Output:
Here's/JJ a/DT tagged/VBD string./NNP
This/DT is/VBZ your/PRP$ life./NN
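
The word/tag output shown above is plain text, so it can be split back into (word, tag) pairs with ordinary string handling. The following is a minimal sketch, assuming whitespace-separated tokens and '/' as the delimiter; TaggedTokenParser and parse are hypothetical names for illustration and are not part of the tagger API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of MaxentTagger.
class TaggedTokenParser {

    /** Splits whitespace-separated "word<delimiter>tag" tokens into {word, tag} pairs. */
    static List<String[]> parse(String tagged, char delimiter) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : tagged.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields no pairs
            }
            // Cut at the LAST delimiter so that words containing the
            // delimiter themselves (for example "1/2") stay intact.
            int cut = token.lastIndexOf(delimiter);
            if (cut < 0) {
                pairs.add(new String[] {token, ""}); // token carries no tag
            } else {
                pairs.add(new String[] {token.substring(0, cut), token.substring(cut + 1)});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] pair : parse("This/DT is/VBZ your/PRP$ life./NN", '/')) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

Splitting at the last delimiter rather than the first is a design choice of this sketch; it assumes tags never contain the delimiter character.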

The MaxentTagger is initialized using a static call to init, which takes the path to a trained model as an argument. The trained model is loaded immediately (which may take a while); subsequent calls to init have no effect. If no path is passed to init, the default trained model is used. tagString can then be called with an untagged String and returns a tagged String (serious errors could result in a null return value).

Note that the tagger assumes input has not yet been tokenized and tokenizes it using a default English tokenizer. If your input has already been tokenized, use the flag "-tokenized".

Using the command line

Tagging, testing, and training can all also be done via the command line.

Training from the command line

To train a model from the command line, first generate a property file:
java edu.stanford.nlp.tagger.maxent.MaxentTagger -genprops 
This gets you a default properties file with descriptions of each parameter you can set in your trained model. You can modify the properties file or use the default options. To train, run:
java edu.stanford.nlp.tagger.maxent.MaxentTagger -props myPropertiesFile.props 
with the appropriate properties file specified; any argument you give in the properties file can also be specified on the command line. You must have specified a model using -model, either in the properties file or on the command line, as well as a file containing tagged words using -trainFile.
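
As an alternative to editing the generated file by hand, the two required training settings could be written programmatically. This is a minimal sketch, assuming the tagger reads the standard java.util.Properties text format; the property names model and trainFile come from this documentation, while TrainingPropsSketch, its methods, and the example paths are hypothetical.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper for illustration only; not part of MaxentTagger.
class TrainingPropsSketch {

    /** Collects the two settings that training requires: model and trainFile. */
    static Properties trainingProps(String modelPath, String trainFilePath) {
        Properties props = new Properties();
        props.setProperty("model", modelPath);         // where the trained model is saved
        props.setProperty("trainFile", trainFilePath); // file containing tagged words
        return props;
    }

    /** Renders the properties as key=value text (assumed file format). */
    static String toText(Properties props) {
        StringWriter out = new StringWriter();
        try {
            props.store(out, "training properties (sketch)");
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory writer
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Example paths are placeholders, not files shipped with the tagger.
        System.out.print(toText(trainingProps("my-model.tagger", "train.txt")));
    }
}
```

Saving the printed text as myPropertiesFile.props would give a file to pass with -props, under the format assumption stated above.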

Tagging and Testing from the command line

Usage: For tagging:
java edu.stanford.nlp.tagger.maxent.MaxentTagger -model <modelFile> -textFile <textfile> 
For testing:
java edu.stanford.nlp.tagger.maxent.MaxentTagger -model <modelFile> -testFile <testfile> 
You can use the same properties file as for training if you pass it in with the "-props" argument. The most important arguments for tagging (besides "model" and "file") are "tokenize" and "tokenizerFactory". See below for more details.
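
If the tagger needs to be launched from another Java program, the usage lines above can be assembled as argument lists. This is a minimal sketch that only builds the commands; TaggerCommandSketch and its methods are hypothetical names, and it assumes the tagger classes are already on the classpath of the java command, as the usage lines imply.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of MaxentTagger.
class TaggerCommandSketch {
    static final String MAIN_CLASS = "edu.stanford.nlp.tagger.maxent.MaxentTagger";

    /** Mirrors: java ...MaxentTagger -model <modelFile> -textFile <textfile> */
    static List<String> taggingCommand(String modelFile, String textFile) {
        return Arrays.asList("java", MAIN_CLASS, "-model", modelFile, "-textFile", textFile);
    }

    /** Mirrors: java ...MaxentTagger -model <modelFile> -testFile <testfile> */
    static List<String> testingCommand(String modelFile, String testFile) {
        return Arrays.asList("java", MAIN_CLASS, "-model", modelFile, "-testFile", testFile);
    }

    public static void main(String[] args) {
        // File names are placeholders chosen for this example.
        System.out.println(String.join(" ", taggingCommand("my-model.tagger", "input.txt")));
    }
}
```

A list built this way could be handed to new ProcessBuilder(command).inheritIO().start() to run the tagger as a child process.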

Parameters can be defined using a Properties file (specified on the command line with -props propFile), or directly on the command line. The following properties are recognized:

Property Name | Type | Default Value | Relevant Phase(s) | Description
model | String | N/A | All | Path and filename where you would like to save the model (training) or where the model should be loaded from (testing, tagging).
trainFile | String | N/A | Train | Path to the file holding the training data; specifying this option puts the tagger in training mode. Only one of 'trainFile', 'testFile', 'textFile', and 'convertToSingleFile' may be specified.
testFile | String | N/A | Test | Path to the file holding the test data; specifying this option puts the tagger in testing mode. Only one of 'trainFile', 'testFile', 'textFile', and 'convertToSingleFile' may be specified.
textFile | String | N/A | Tag | Path to the file holding the text to tag; specifying this option puts the tagger in tagging mode. Only one of 'trainFile', 'testFile', 'textFile', and 'convertToSingleFile' may be specified.
convertToSingleFile | String | N/A | N/A | Provided only for backwards compatibility, this option allows you to convert a tagger trained using a previous version of the tagger to the new single-file format. The path should be for the new file, and 'model' should be the path to the old tagger.
genprops | boolean | N/A | N/A | Use this option to output a default properties file, containing information about each of the possible configuration options.
delimiter | char | / | All | Delimiter character that separates word and part of speech tags. For training and testing, this is the delimiter used in the train/test files. For tagging, this is the character that will be inserted between words and tags in the output.
encoding | String | UTF-8 | All | Encoding of the read files (training, testing) and the output text files.
tokenize | boolean | true | Tag, Test | Whether or not the file has been tokenized (so that white space separates all and only those things that should be tagged as separate words).
tokenizerFactory | String | edu.stanford.nlp.process.PTBTokenizer | Tag, Test | Fully qualified classname of the tokenizer to use. edu.stanford.nlp.process.PTBTokenizer does basic English tokenization.
arch | String | generic | Train | Architecture of the model: this determines what features are used to build your model. Options are 'left3words', 'left3wordswordshapes', 'left5wordswordshapes', 'bidirectional', 'bidirectionalwordshapes', 'bidirectionalwordshapes5words', 'generic', '[1357]words', or 'sighan2005' (Chinese). The left3words architectures are faster, but slightly less accurate, than the bidirectional architectures. The word shapes options increase accuracy.
lang | String | english | Train | Language from which the part of speech tags are drawn. This option determines which tags are considered closed-class (only a fixed set of words can be tagged with a closed-class tag, such as prepositions). Defined languages are 'english' (Penn tagset), 'polish' (very rudimentary), 'chinese', 'arabic', 'german', and 'medline'.
openClassTags | String | N/A | Train | Space-separated list of tags that should be considered open-class. All tags encountered that are not in this list are considered closed-class. E.g. format: "NN VB"
closedClassTags | String | N/A | Train | Space-separated list of tags that should be considered closed-class. All tags encountered that are not in this list are considered open-class.
learnClosedClassTags | boolean | false | Train | If true, induce which tags are closed-class by counting as closed-class tags all those tags which have fewer unique word tokens than closedClassTagThreshold.
closedClassTagThreshold | int | int | Train | Number of unique word tokens that a tag may have and still be considered closed-class; relevant only if learnClosedClassTags is true.
sgml | boolean | false | Tag, Test | Very basic tagging of the contents of all sgml fields; for more complex mark-up, consider using the xmlInput option.
xmlInput | String | | Tag, Test | Give a space-separated list of tags in an XML file whose content you would like tagged. Any internal tags that appear in the content of fields you would like tagged will be discarded; the rest of the XML will be preserved and the original text of specified fields will be replaced with the tagged text.
xmlOutput | String | N/A | Tag | If a path is given, the tagged data will be written out to the given file in XML. If true, each word will be written out within a word tag, with the part of speech as an attribute. If the original input was XML, this will just appear in the field where the text originally came from. Otherwise, word tags will be surrounded by sentence tags as well. E.g., <sentence id="0"><word id="0" pos="NN">computer</word></sentence>
search | String | cg | Train | Specify the search method to be used in the optimization method for training. Options are 'cg' (conjugate gradient) or 'iis' (improved iterative scaling).
sigmaSquared | double | 0.5 | Train | Sigma-squared smoothing/regularization parameter to be used for conjugate gradient search. Default usually works reasonably well.
iterations | int | 100 | Train | Number of iterations to be used for improved iterative scaling.
rareWordThresh | int | 5 | Train | Words that appear fewer than this number of times during training are considered rare words and use extra rare word features.
minFeatureThreshold | int | 5 | Train | Features whose history appears fewer than this number of times are discarded.
curWordMinFeatureThreshold | int | 2 | Train | Words that occur more than this number of times will generate features with all of the tags they've been seen with.
rareWordMinFeatureThresh | int | 10 | Train | Features of rare words whose histories occur fewer than this number of times are discarded.
veryCommonWordThresh | int | 250 | Train | Words that occur more than this number of times form an equivalence class by themselves. Ignored unless you are using ambiguity classes.
useDistSim | boolean | false | Train | Whether to use distributional similarity classes for features. Provides improved performance on words not seen in training. Requires specifying a file with distributional similarity information; see the included English distributional similarity file (egw.bnc.200.pruned) for file format if you wish to make your own.
distSimPath | String | N/A | All | Path to the distributional similarity file. This distribution includes an English distributional similarity file.
useChineseDict | boolean | false | Train | Whether to use features built from Chinese dictionaries as part of the tagger; see ASBCDict class for more information.
chineseDictionaryPath | String | N/A | All | Base path to the Chinese dictionaries (only relevant if useChineseDict=true).
debug | boolean | boolean | All | Whether to write debugging information (words, top words, unknown words). Useful for error analysis.
debugPrefix | String | N/A | All | Prefix for where to write out the debugging information (relevant only if debug=true).
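
The property list states that only one of trainFile, testFile, textFile, and convertToSingleFile may be specified. A configuration could be checked against that rule before the tagger is invoked, roughly as follows. This is a minimal sketch and not the tagger's own validation logic; TaggerModeSketch and selectedMode are hypothetical names, and treating "none set" as acceptable is an assumption of this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.Properties;

// Hypothetical helper for illustration only; not part of MaxentTagger.
class TaggerModeSketch {
    // Properties that each select a mode, as listed in this documentation.
    static final List<String> MODE_PROPERTIES =
            Arrays.asList("trainFile", "testFile", "textFile", "convertToSingleFile");

    /**
     * Returns the one mode property that is set, or empty if none is set.
     * Throws IllegalArgumentException if more than one is set.
     */
    static Optional<String> selectedMode(Properties props) {
        List<String> present = new ArrayList<>();
        for (String name : MODE_PROPERTIES) {
            if (props.getProperty(name) != null) {
                present.add(name);
            }
        }
        if (present.size() > 1) {
            throw new IllegalArgumentException(
                    "Only one of " + MODE_PROPERTIES + " may be specified, found " + present);
        }
        return present.isEmpty() ? Optional.empty() : Optional.of(present.get(0));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("model", "my-model.tagger"); // placeholder path
        props.setProperty("textFile", "input.txt");    // placeholder path
        System.out.println(selectedMode(props).orElse("(no mode property set)"));
    }
}
```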

Author:
Kristina Toutanova, Miler Lee, Joseph Smarr, Anna Rafferty

Field Summary
static String DEFAULT_DISTRIBUTION_PATH
           
static String DEFAULT_NLP_GROUP_MODEL_PATH
           
 
Constructor Summary
MaxentTagger()
          Non-static version of the tagger for the Function interface.
MaxentTagger(String modelFile)
          Constructor from a fileName.
 
Method Summary
 Sentence<TaggedWord> apply(Sentence in)
          Expects a sentence and returns a tagged sentence
static void init(String modelFile)
          Initializer that loads the dictionary.
static void init(String modelFile, TaggerConfig config)
          Initializer that loads the dictionary.
static void main(String[] args)
          Command-line tagger that takes input from stdin or a file.
 List<Sentence> process(List<Sentence> sentences)
          Tags the Words in each Sentence in the given List with their grammatical part-of-speech.
 Sentence processSentence(Sentence sentence)
          Returns a new Sentence that is a copy of the given sentence with all the words tagged with their part-of-speech.
static Sentence<TaggedWord> tagSentence(Sentence<? extends HasWord> sentence)
          Returns a new Sentence that is a copy of the given sentence with all the words tagged with their part-of-speech.
static String tagString(String toTag)
          Tags the input string and returns the tagged version.
static Sentence tagStringTokenized(String toTag)
          Tags the input string and returns the tagged version as a Sentence. The tagger wants input that is space-separated tokens, tokenized according to the conventions of the Penn Treebank.
static List tokenizeText(Reader r)
          Reads data from r, tokenizes it with the default (Penn Treebank) tokenizer, and returns a List of Sentence objects, which can then be fed into tagSentence.
static List tokenizeText(Reader r, TokenizerFactory tokenizerFactory)
          Reads data from r, tokenizes it with the given tokenizer, and returns a List of Sentence objects, which can then be fed into tagSentence.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_NLP_GROUP_MODEL_PATH

public static final String DEFAULT_NLP_GROUP_MODEL_PATH
See Also:
Constant Field Values

DEFAULT_DISTRIBUTION_PATH

public static final String DEFAULT_DISTRIBUTION_PATH
See Also:
Constant Field Values
Constructor Detail

MaxentTagger

public MaxentTagger()
Non-static version of the tagger for the Function interface. A tagger is not initialized in the constructor, but will be initialized from DEFAULT_NLP_GROUP_MODEL_PATH the first time it is used.


MaxentTagger

public MaxentTagger(String modelFile)
             throws Exception
Constructor from a fileName. The modelFile is both a filename and a prefix from which other filenames are built and their data loaded. The tagger data is loaded when the constructor is called (this can be slow). Since some of the data for the tagger is static, two different taggers cannot exist at the same time.

Throws:
Exception
Method Detail

init

public static void init(String modelFile)
                 throws Exception
Initializer that loads the dictionary. This maintains a flag as to whether initialization has been done previously, and if so, running this is a no-op.

Parameters:
modelFile - Filename of the trained model, for example /u/nlp/data/pos-tagger/wsj3t0-18-left3words/train-wsj-0-18.holder
Throws:
Exception

init

public static void init(String modelFile,
                        TaggerConfig config)
                 throws Exception
Initializer that loads the dictionary. This maintains a flag as to whether initialization has been done previously, and if so, running this is a no-op.

Parameters:
modelFile - Filename of the trained model, for example /u/nlp/data/pos-tagger/wsj3t0-18-left3words/train-wsj-0-18.holder
Throws:
Exception

tagString

public static String tagString(String toTag)
                        throws Exception
Tags the input string and returns the tagged version. The tagger wants input that is space separated tokens, tokenized according to the conventions of the Penn Treebank.

Parameters:
toTag - The untagged input String
Returns:
The same string with tags inserted in the form word/tag
Throws:
Exception

tagStringTokenized

public static Sentence tagStringTokenized(String toTag)
                                   throws Exception
Tags the input string and returns the tagged version as a Sentence. The tagger wants input that is space-separated tokens, tokenized according to the conventions of the Penn Treebank.

Parameters:
toTag - The untagged input String
Returns:
The tagged input as a Sentence
Throws:
Exception

apply

public Sentence<TaggedWord> apply(Sentence in)
Expects a sentence and returns a tagged sentence

Specified by:
apply in interface Function<Sentence,Sentence>
Parameters:
in - This needs to be a Sentence
Returns:
A Sentence of TaggedWord

process

public List<Sentence> process(List<Sentence> sentences)
Tags the Words in each Sentence in the given List with their grammatical part-of-speech. The returned List contains Sentences consisting of TaggedWords.

NOTE: The input document must contain sentences as its elements, not words. To turn a Document of words into a Document of sentences, run it through WordToSentenceProcessor.

Specified by:
process in interface ListProcessor<Sentence,Sentence>
Parameters:
sentences - A List of Sentence
Returns:
A List of Sentence of TaggedWord (final generification cannot be listed due to lack of complete generification of super classes)

processSentence

public Sentence processSentence(Sentence sentence)
Returns a new Sentence that is a copy of the given sentence with all the words tagged with their part-of-speech. Convenience method when you only want to tag a single Sentence instead of a Document of sentences.

Specified by:
processSentence in interface SentenceProcessor
Parameters:
sentence - A sentence. Classes implementing this interface can assume that the sentence passed in is not null.

tagSentence

public static Sentence<TaggedWord> tagSentence(Sentence<? extends HasWord> sentence)
Returns a new Sentence that is a copy of the given sentence with all the words tagged with their part-of-speech. Convenience method when you only want to tag a single Sentence instead of a Document of sentences.


tokenizeText

public static List tokenizeText(Reader r)
Reads data from r, tokenizes it with the default (Penn Treebank) tokenizer, and returns a List of Sentence objects, which can then be fed into tagSentence.


tokenizeText

public static List tokenizeText(Reader r,
                                TokenizerFactory tokenizerFactory)
Reads data from r, tokenizes it with the given tokenizer, and returns a List of Sentence objects, which can then be fed into tagSentence.


main

public static void main(String[] args)
                 throws Exception
Command-line tagger that takes input from stdin or a file. See class documentation for usage.

Throws:
Exception


Stanford NLP Group