public class MaxentTagger extends Tagger implements ListProcessor<List<? extends HasWord>,List<TaggedWord>>, Serializable
edu/stanford/nlp/models/pos-tagger/english-left3words/english-bidirectional-distsim.tagger
Its accuracy was 97.32% on Penn Treebank WSJ secs. 22-24.
edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger
This tagger runs a lot faster, and is recommended for general use.
Its accuracy was 96.92% on Penn Treebank WSJ secs. 22-24.
MaxentTagger tagger = new MaxentTagger("models/left3words-wsj-0-18.tagger");
MaxentTagger tagger = new MaxentTagger(DEFAULT_NLP_GROUP_MODEL_PATH);
List<TaggedWord> taggedSentence = tagger.tagSentence(List<? extends HasWord> sentence)
List<TaggedWord> taggedSentence = tagger.apply(List<? extends HasWord> sentence)
List taggedList = tagger.process(List sentences)
String taggedString = tagger.tagString("Here's a tagged string.")
String taggedString = tagger.tagTokenizedString("Here 's a tagged string .")
The tagString method uses the default tokenizer (PTBTokenizer). To control tokenization, call tokenizeText(Reader, TokenizerFactory) and then call process() on the result.
java edu.stanford.nlp.tagger.maxent.MaxentTagger -genprops
This gets you a default properties file with descriptions of each parameter you can set in your trained model. You can modify the properties file, or use the default options. To train, run:
java -mx1g edu.stanford.nlp.tagger.maxent.MaxentTagger -props myPropertiesFile.props
with the appropriate properties file specified. Any argument you give in the properties file can also be specified on the command line. You must have specified a model using -model, either in the properties file or on the command line, as well as a file containing tagged words using -trainFile. Useful flags for controlling the amount of output are -verbose, which prints extra debugging information, and -verboseResults, which prints full information about intermediate results. -verbose defaults to false and -verboseResults defaults to true.
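Because any property can be given either in the properties file or as a command-line flag, the two forms carry the same information. The following is a minimal sketch of that correspondence in plain Java. The names `FlagSketch` and `toProperties` and the file names `my.tagger` and `train.txt` are hypothetical, and treating a flag with no value as `true` is an assumption of this sketch, not documented tagger behavior.

```java
import java.util.Properties;

// Hypothetical illustration: maps "-name value" command-line flags onto
// java.util.Properties entries, mirroring what a properties file would hold.
class FlagSketch {

  static Properties toProperties(String[] args) {
    Properties props = new Properties();
    for (int i = 0; i < args.length; i++) {
      String arg = args[i];
      if (!arg.startsWith("-") || arg.length() < 2) {
        continue; // not a flag; this sketch simply skips stray values
      }
      String name = arg.substring(1);
      // Assumption: a flag followed by another flag (or by nothing) is boolean.
      boolean hasValue = i + 1 < args.length && !args[i + 1].startsWith("-");
      props.setProperty(name, hasValue ? args[++i] : "true");
    }
    return props;
  }

  public static void main(String[] args) {
    String[] example = {"-model", "my.tagger", "-trainFile", "train.txt", "-verbose"};
    Properties props = toProperties(example);
    // Print the entries in the name = value form a properties file would use.
    props.stringPropertyNames().stream().sorted()
        .forEach(name -> System.out.println(name + " = " + props.getProperty(name)));
  }
}
```

The sketch does not model which source wins when a property appears both in the file and on the command line.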
java edu.stanford.nlp.tagger.maxent.MaxentTagger -model <modelFile> -textFile <textfile>
For testing (evaluating against tagged text):
java edu.stanford.nlp.tagger.maxent.MaxentTagger -model <modelFile> -testFile <testfile>
You can use the same properties file as for training if you pass it in with the "-props" argument. The most important arguments for tagging (besides "model" and "file") are "tokenize" and "tokenizerFactory". See below for more details.
Parameters can be defined using a Properties file (specified on the command line with -props propFile), or directly on the command line (by preceding their name with a minus sign, "-", to turn them into a flag). The following properties are recognized:
Property Name | Type | Default Value | Relevant Phase(s) | Description |
---|---|---|---|---|
model | String | N/A | All | Path and filename where you would like to save the model (training) or where the model should be loaded from (testing, tagging). |
trainFile | String | N/A | Train | Path to the file holding the training data; specifying this option puts the tagger in training mode. Only one of 'trainFile','testFile','textFile', and 'dump' may be specified. There are three formats possible. The first is a text file of tagged data. Each line is considered a separate sentence. In each sentence, words are separated by whitespace. Each word must have a tag, which is separated using the specified tagSeparator. This format, called TEXT, is the default format. The second format is a file of Penn Treebank formatted tree files. Trees are loaded one at a time and the tagged words in a tree are used as a training sentence. To specify this format, preface the filename with "format=TREES,". The final possible format is TSV files (tab-separated columns). To specify a TSV file, set trainFile to "format=TSV,wordColumn=x,tagColumn=y,filename". Column numbers are indexed from 0, and sentences are separated with blank lines. The default wordColumn is 0 and default tagColumn is 1. A file can be in a different encoding than the tagger's default encoding by prefacing the filename with "encoding=ENC". You can specify the tagSeparator character in a TEXT file by prefacing the filename with "tagSeparator=c". Tree files can be fed through TreeTransformers and TreeNormalizers. To specify a transformer, preface the filename with "treeTransformer=CLASSNAME". To specify a normalizer, preface the filename with "treeNormalizer=CLASSNAME". You can also filter trees using a Filter<Tree>, which can be specified with "treeFilter=CLASSNAME". A specific range of trees to be used can be specified with treeRange=X-Y. Multiple parts of the range can be separated by : as opposed to the normal separator of ,. For example, one could use the argument "-treeRange=25-50:75-100". You can specify a TreeReaderFactory by prefacing the filename with "trf=CLASSNAME". Multiple files can be specified by making a semicolon-separated list of files. Each file can have its own format specifiers as above. Note that none of , ; or = can be in filenames. |
testFile | String | N/A | Test | Path to the file holding the test data; specifying this option puts the tagger in testing mode. Only one of 'trainFile','testFile','textFile', and 'dump' may be specified. The same format as trainFile applies, but only one file can be specified. |
textFile | String | N/A | Tag | Path to the file holding the text to tag; specifying this option puts the tagger in tagging mode. Only one of 'trainFile','testFile','textFile', and 'dump' may be specified. No file reading options may be specified for textFile. |
dump | String | N/A | Dump | Path to the file holding the model to dump; specifying this option puts the tagger in dumping mode. Only one of 'trainFile','testFile','textFile', and 'dump' may be specified. |
genprops | boolean | N/A | N/A | Use this option to output a default properties file, containing information about each of the possible configuration options. |
tagSeparator | char | / | All | Separator character that separates word and part of speech tags, such as out/IN or out_IN. For training and testing, this is the separator used in the train/test files. For tagging, this is the character that will be inserted between words and tags in the output. |
encoding | String | UTF-8 | All | Encoding of the read files (training, testing) and the output text files. |
tokenize | boolean | true | Tag,Test | Whether or not the file needs to be tokenized. If this is false, the tagger assumes that white space separates words if and only if they should be tagged as separate tokens, and that the input is strictly one sentence per line. |
tokenizerFactory | String | edu.stanford.nlp.process.PTBTokenizer | Tag,Test | Fully qualified class name of the tokenizer to use. edu.stanford.nlp.process.PTBTokenizer does basic English tokenization. |
tokenizerOptions | String | | Tag,Test | Known options for the particular tokenizer used. A comma-separated list. For PTBTokenizer, options of interest include americanize=false and asciiQuotes (for German). Note that any choice of tokenizer options that conflicts with the tokenization used in the tagger training data will likely degrade tagger performance. |
arch | String | generic | Train | Architecture of the model, as a comma-separated list of options, some with a parenthesized integer argument written k here: this determines what features are used to build your model. See ExtractorFrames and ExtractorFramesRare for more information. |
wordFunction | String | (none) | Train | A function to apply to the text before training or testing. Must inherit from edu.stanford.nlp.util.Function<String, String>. Can be blank. |
lang | String | english | Train | Language from which the part of speech tags are drawn. This option determines which tags are considered closed-class (only fixed set of words can be tagged with a closed-class tag, such as prepositions). Defined languages are 'english' (Penn tagset), 'polish' (very rudimentary), 'french', 'chinese', 'arabic', 'german', and 'medline'. |
openClassTags | String | N/A | Train | Space separated list of tags that should be considered open-class. All tags encountered that are not in this list are considered closed-class. E.g. format: "NN VB" |
closedClassTags | String | N/A | Train | Space separated list of tags that should be considered closed-class. All tags encountered that are not in this list are considered open-class. |
learnClosedClassTags | boolean | false | Train | If true, induce which tags are closed-class by counting as closed-class tags all those tags which have fewer unique word tokens than closedClassTagThreshold. |
closedClassTagThreshold | int | 40 | Train | Number of unique word tokens that a tag may have and still be considered closed-class; relevant only if learnClosedClassTags is true. |
sgml | boolean | false | Tag, Test | Very basic tagging of the contents of all sgml fields; for more complex mark-up, consider using the xmlInput option. |
xmlInput | String | | Tag, Test | Give a space separated list of tags in an XML file whose content you would like tagged. Any internal tags that appear in the content of fields you would like tagged will be discarded; the rest of the XML will be preserved and the original text of specified fields will be replaced with the tagged text. |
outputFile | String | "" | Tag | Path to write output to. If blank, stdout is used. |
outputFormat | String | "" | Tag | Output format. One of: slashTags (default), xml, or tsv |
outputFormatOptions | String | "" | Tag | Output format options. |
tagInside | String | "" | Tag | Tags inside elements that match the regular expression given in the String. |
search | String | cg | Train | Specify the search method to be used in the optimization method for training. Options are 'cg' (conjugate gradient), 'iis' (improved iterative scaling), or 'qn' (quasi-newton). |
sigmaSquared | double | 0.5 | Train | Sigma-squared smoothing/regularization parameter to be used for conjugate gradient search. Default usually works reasonably well. |
iterations | int | 100 | Train | Number of iterations to be used for improved iterative scaling. |
rareWordThresh | int | 5 | Train | Words that appear fewer than this number of times during training are considered rare words and use extra rare word features. |
minFeatureThreshold | int | 5 | Train | Features whose history appears fewer than this number of times are discarded. |
curWordMinFeatureThreshold | int | 2 | Train | Words that occur more than this number of times will generate features with all of the tags they've been seen with. |
rareWordMinFeatureThresh | int | 10 | Train | Features of rare words whose histories occur fewer than this number of times are discarded. |
veryCommonWordThresh | int | 250 | Train | Words that occur more than this number of times form an equivalence class by themselves. Ignored unless you are using ambiguity classes. |
debug | boolean | false | All | Whether to write debugging information (words, top words, unknown words, confusion matrix). Useful for error analysis. |
debugPrefix | String | N/A | All | File (path) prefix for where to write out the debugging information (relevant only if debug=true). |
nthreads | int | 1 | Test,Tag | Number of threads to use when processing text. |
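The trainFile syntax in the table above (semicolon-separated files, each with comma-separated key=value specifiers followed by the filename) can be illustrated with a short parser. This is a minimal sketch in plain Java, not the tagger's own parsing code: the class name `TrainFileSpecSketch`, the `filename` map key, and the example file names `train.tsv` and `wsj.mrg` are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the trainFile syntax described in the table:
// files are separated by ';', specifiers by ',', and the last item is the filename.
class TrainFileSpecSketch {

  static List<Map<String, String>> parse(String spec) {
    List<Map<String, String>> files = new ArrayList<>();
    for (String fileSpec : spec.split(";")) {
      String[] pieces = fileSpec.trim().split(",");
      Map<String, String> options = new LinkedHashMap<>();
      options.put("format", "TEXT"); // TEXT is the documented default format
      // Every piece except the last must be a key=value specifier.
      for (int i = 0; i < pieces.length - 1; i++) {
        String[] keyValue = pieces[i].trim().split("=", 2);
        if (keyValue.length != 2) {
          throw new IllegalArgumentException("Expected key=value but got: " + pieces[i]);
        }
        options.put(keyValue[0], keyValue[1]);
      }
      // The final piece is the filename, which may not contain ',', ';' or '='.
      options.put("filename", pieces[pieces.length - 1].trim());
      files.add(options);
    }
    return files;
  }

  public static void main(String[] args) {
    String spec = "format=TSV,wordColumn=1,tagColumn=2,train.tsv;"
        + "format=TREES,treeRange=25-50:75-100,wsj.mrg";
    for (Map<String, String> file : parse(spec)) {
      System.out.println(file);
    }
  }
}
```

Because treeRange uses : between its parts, splitting on , leaves the range value intact, which is why the table describes : as the separator there.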
Modifier and Type | Field and Description |
---|---|
static String | BASE_TAGGER_HOME - The directory from which to get taggers when using DEFAULT_NLP_GROUP_MODEL_PATH. |
static String | DEFAULT_DISTRIBUTION_PATH |
static String | DEFAULT_JAR_PATH |
static String | DEFAULT_NLP_GROUP_MODEL_PATH |
static String | TAGGER_HOME |
Constructor and Description |
---|
MaxentTagger() |
MaxentTagger(String modelFile) - Constructor for a tagger, loading a model stored in a particular file, classpath resource, or URL. |
MaxentTagger(String modelFile, Properties config) - Constructor for a tagger using a model stored in a particular file, with options taken from the supplied TaggerConfig. |
MaxentTagger(String modelFile, Properties config, boolean printLoading) - Initializer that loads the tagger. |
MaxentTagger(TaggerConfig config) |
Modifier and Type | Method and Description |
---|---|
int | addTag(String tag) - Will return the index of a tag, adding it if it doesn't already exist. |
List<TaggedWord> | apply(List<? extends HasWord> in) - Expects a sentence and returns a tagged sentence. |
protected TokenizerFactory<? extends HasWord> | chooseTokenizerFactory() - Figures out what tokenizer factory might be described by the config. |
protected static TokenizerFactory<? extends HasWord> | chooseTokenizerFactory(boolean tokenize, String tokenizerFactory, String tokenizerOptions, boolean invertible) |
protected void | dumpModel(PrintStream out) |
String | getTag(int index) |
int | getTagIndex(String tag) - Will return the index of a tag if known, -1 if not already known. |
static void | lemmatize(List<CoreLabel> sentence, Morphology morpha) - Adds lemmas to the given list of CoreLabels, using the given Morphology object. |
static void | main(String[] args) - Command-line tagger interface. |
int | numTags() |
void | outputTaggedSentence(List<? extends HasWord> sentence, boolean outputLemmas, PlainTextDocumentReaderAndWriter.OutputStyle outputStyle, boolean outputVerbosity, int numSentences, String separator, Writer writer) |
List<List<TaggedWord>> | process(List<? extends List<? extends HasWord>> sentences) - Tags the Words in each Sentence in the given List with their grammatical part-of-speech. |
protected void | readModelAndInit(Properties config, DataInputStream rf, boolean printLoading) - This reads the complete tagger from a single model file, and inits the tagger using a combination of the properties passed in and parameters from the file. |
protected void | readModelAndInit(Properties config, String modelFileOrUrl, boolean printLoading) - This reads the complete tagger from a single model stored in a file, at a URL, or as a resource in a jar file, and inits the tagger using a combination of the properties passed in and parameters from the file. |
void | runTagger(BufferedReader reader, BufferedWriter writer, String tagInside, PlainTextDocumentReaderAndWriter.OutputStyle outputStyle) - This method runs the tagger on the provided reader and writer. |
<X extends HasWord> void | runTagger(Iterable<List<X>> document, BufferedWriter writer, PlainTextDocumentReaderAndWriter.OutputStyle outputStyle) |
void | runTaggerSGML(BufferedReader reader, BufferedWriter writer, PlainTextDocumentReaderAndWriter.OutputStyle outputStyle) |
void | runTaggerStdin(BufferedReader reader, BufferedWriter writer, PlainTextDocumentReaderAndWriter.OutputStyle outputStyle) |
protected void | saveModel(DataOutputStream file) |
protected void | saveModel(String filename) |
void | tagAndOutputSentence(List<? extends HasWord> sentence, boolean outputLemmas, Morphology morpha, PlainTextDocumentReaderAndWriter.OutputStyle outputStyle, boolean outputVerbosity, int numSentences, String separator, Writer writer) |
void | tagCoreLabels(List<CoreLabel> sentence) - Takes a sentence composed of CoreLabels and adds the tags to the CoreLabels, modifying the input sentence. |
void | tagCoreLabels(List<CoreLabel> sentence, boolean reuseTags) - Takes a sentence composed of CoreLabels and adds the tags to the CoreLabels, modifying the input sentence. |
List<? extends HasWord> | tagCoreLabelsOrHasWords(List<? extends HasWord> sentence, Morphology morpha, boolean outputLemmas) |
void | tagFromXML(InputStream input, Writer writer, String... xmlTags) - Uses an XML transformer to turn an input stream into a bunch of output. |
void | tagFromXML(Reader input, Writer writer, String... xmlTags) |
List<TaggedWord> | tagSentence(List<? extends HasWord> sentence) - Returns a new Sentence that is a copy of the given sentence with all the words tagged with their part-of-speech. |
List<TaggedWord> | tagSentence(List<? extends HasWord> sentence, boolean reuseTags) - Returns a new Sentence that is a copy of the given sentence with all the words tagged with their part-of-speech. |
Set<String> | tagSet() |
String | tagString(String toTag) - Tags the input string and returns the tagged version. |
String | tagTokenizedString(String toTag) - Tags the tokenized input string and returns the tagged version. |
static List<List<HasWord>> | tokenizeText(Reader r) - Reads data from r, tokenizes it with the default (Penn Treebank) tokenizer, and returns a List of Sentence objects, which can then be fed into tagSentence. |
static List<List<HasWord>> | tokenizeText(Reader r, TokenizerFactory<? extends HasWord> tokenizerFactory) - Reads data from r, tokenizes it with the given tokenizer, and returns a List of Lists of (extends) HasWord objects, which can then be fed into tagSentence. |
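Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface java.util.function.Function: andThen, compose, identity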
public static final String BASE_TAGGER_HOME
public static final String TAGGER_HOME
public static final String DEFAULT_NLP_GROUP_MODEL_PATH
public static final String DEFAULT_JAR_PATH
public static final String DEFAULT_DISTRIBUTION_PATH
public MaxentTagger()
public MaxentTagger(TaggerConfig config)
public MaxentTagger(String modelFile)
modelFile - Filename, classpath resource, or URL for the trained model
RuntimeIOException - if I/O errors or serialization errors
public MaxentTagger(String modelFile, Properties config)
modelFile - Filename, classpath resource, or URL for the trained model
config - The configuration for the tagger
RuntimeIOException - if I/O errors or serialization errors
public MaxentTagger(String modelFile, Properties config, boolean printLoading)
modelFile - Where to initialize the tagger from. Most commonly, this is the filename of the trained model, for example, /u/nlp/data/pos-tagger/wsj3t0-18-left3words/left3words-wsj-0-18.tagger. However, if it starts with "https?://" it will be interpreted as a URL. One can also load models directly from the classpath, as in loading from edu/stanford/nlp/models/pos-tagger/wsj3t0-18-bidirectional/bidirectional-distsim-wsj-0-18.tagger
config - TaggerConfig based on command-line arguments
printLoading - Whether to print a message saying what model file is being loaded and how long it took when finished.
RuntimeIOException - if I/O errors or serialization errors
public int addTag(String tag)
public int getTagIndex(String tag)
public int numTags()
public String getTag(int index)
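Taken together, addTag, getTagIndex, numTags, and getTag behave like a small two-way index over tag strings. The following is a minimal sketch of the contract stated in the method summary (addTag returns the index of a tag, adding it if absent; getTagIndex returns -1 for an unknown tag). `TagIndexSketch` is a hypothetical stand-in, not the tagger's actual data structure, and numbering tags from 0 in insertion order is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in that mimics the documented contract of the tag methods.
class TagIndexSketch {
  private final List<String> tags = new ArrayList<>();
  private final Map<String, Integer> indices = new HashMap<>();

  // Returns the index of the tag, adding it first if it is not yet known.
  int addTag(String tag) {
    Integer existing = indices.get(tag);
    if (existing != null) {
      return existing;
    }
    tags.add(tag);
    indices.put(tag, tags.size() - 1);
    return tags.size() - 1;
  }

  // Returns the index of the tag if known, -1 otherwise; never adds anything.
  int getTagIndex(String tag) {
    return indices.getOrDefault(tag, -1);
  }

  String getTag(int index) {
    return tags.get(index);
  }

  int numTags() {
    return tags.size();
  }

  public static void main(String[] args) {
    TagIndexSketch index = new TagIndexSketch();
    int nn = index.addTag("NN");
    int vb = index.addTag("VB");
    System.out.println(index.getTag(nn) + " " + index.getTag(vb));
    System.out.println("NN again: " + index.addTag("NN"));
    System.out.println("unknown: " + index.getTagIndex("JJ"));
    System.out.println("numTags: " + index.numTags());
  }
}
```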
protected TokenizerFactory<? extends HasWord> chooseTokenizerFactory()
protected static TokenizerFactory<? extends HasWord> chooseTokenizerFactory(boolean tokenize, String tokenizerFactory, String tokenizerOptions, boolean invertible)
protected void saveModel(String filename)
protected void saveModel(DataOutputStream file) throws IOException
IOException
protected void readModelAndInit(Properties config, String modelFileOrUrl, boolean printLoading)
Note for the future: This assumes that the TaggerConfig in the file has already been read and used. This work is done inside the constructor of TaggerConfig. It might be better to refactor things so that it is all done inside this method, but for the moment it seemed better to leave working code alone [cdm 2008].
config - The tagger config
modelFileOrUrl - The name of the model file. This routine opens and closes it.
printLoading - Whether to print a message saying what model file is being loaded and how long it took when finished.
RuntimeIOException - if I/O errors or serialization errors
protected void readModelAndInit(Properties config, DataInputStream rf, boolean printLoading)
Note for the future: This assumes that the TaggerConfig in the file has already been read and used. It might be better to refactor things so that it is all done inside this method, but for the moment it seemed better to leave working code alone [cdm 2008].
config - The tagger config
rf - DataInputStream to read from. It's the caller's job to open and close this stream.
printLoading - Whether to print a message saying what model file is being loaded and how long it took when finished.
RuntimeIOException - if I/O errors or serialization errors
protected void dumpModel(PrintStream out)
public String tagTokenizedString(String toTag)
toTag - The untagged input String
public String tagString(String toTag)
toTag - The untagged input String
public List<TaggedWord> apply(List<? extends HasWord> in)
apply in interface java.util.function.Function<List<? extends HasWord>,List<TaggedWord>>
apply in class Tagger
in - This needs to be a sentence (List of words)
public List<List<TaggedWord>> process(List<? extends List<? extends HasWord>> sentences)
NOTE: The input document must contain sentences as its elements, not words. To turn a Document of words into a Document of sentences, run it through WordToSentenceProcessor.
process in interface ListProcessor<List<? extends HasWord>,List<TaggedWord>>
sentences - A List of Sentence
public List<TaggedWord> tagSentence(List<? extends HasWord> sentence)
sentence - sentence to tag
public List<TaggedWord> tagSentence(List<? extends HasWord> sentence, boolean reuseTags)
sentence - sentence to tag
reuseTags - whether or not to reuse the given tag
public void tagCoreLabels(List<CoreLabel> sentence)
public void tagCoreLabels(List<CoreLabel> sentence, boolean reuseTags)
public static void lemmatize(List<CoreLabel> sentence, Morphology morpha)
public static List<List<HasWord>> tokenizeText(Reader r)
r - Reader where untokenized text is read
public static List<List<HasWord>> tokenizeText(Reader r, TokenizerFactory<? extends HasWord> tokenizerFactory)
r - Reader where untokenized text is read
tokenizerFactory - Tokenizer. This can be null, in which case the default English tokenizer (PTBTokenizerFactory) is used.
public void tagFromXML(InputStream input, Writer writer, String... xmlTags)
public void runTaggerStdin(BufferedReader reader, BufferedWriter writer, PlainTextDocumentReaderAndWriter.OutputStyle outputStyle) throws IOException
IOException
public void runTaggerSGML(BufferedReader reader, BufferedWriter writer, PlainTextDocumentReaderAndWriter.OutputStyle outputStyle) throws IOException
IOException
public <X extends HasWord> void runTagger(Iterable<List<X>> document, BufferedWriter writer, PlainTextDocumentReaderAndWriter.OutputStyle outputStyle) throws IOException
IOException
public void runTagger(BufferedReader reader, BufferedWriter writer, String tagInside, PlainTextDocumentReaderAndWriter.OutputStyle outputStyle) throws IOException
This method runs the tagger on the provided reader and writer. It takes input from the given reader, applies the tagger to it one sentence at a time (determined using documentPreprocessor), and writes the output to the given writer. The document is broken into sentences using the sentence processor determined in the tagger's TaggerConfig.
tagInside makes the tagger run in XML mode: if set to non-empty, instead of processing the document as one large text blob, it considers each region in between the given tag to be a separate text blob.
IOException
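The idea of treating each region between matching tags as its own text blob can be sketched with a regular expression. This is only an illustration in plain Java of what "separate text blobs" means, not how the tagger itself processes XML. The class `TagInsideSketch` and method `regions` are hypothetical, and the sketch assumes the matched elements are not nested inside elements of the same name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration: collect the text inside every element whose
// name matches a regular expression, one blob per element.
class TagInsideSketch {

  static List<String> regions(String document, String tagRegex) {
    // Named groups keep the pattern valid even if tagRegex has groups of its own.
    Pattern element = Pattern.compile(
        "<(?<name>" + tagRegex + ")(?:\\s[^>]*)?>(?<body>.*?)</\\k<name>\\s*>",
        Pattern.DOTALL);
    List<String> blobs = new ArrayList<>();
    Matcher matcher = element.matcher(document);
    while (matcher.find()) {
      blobs.add(matcher.group("body"));
    }
    return blobs;
  }

  public static void main(String[] args) {
    String doc = "<doc><p>First blob.</p><h1>Skip</h1><p>Second blob.</p></doc>";
    // Each <p> element becomes its own blob; the <h1> element is ignored.
    for (String blob : regions(doc, "p")) {
      System.out.println(blob);
    }
  }
}
```

Each returned blob would then be sentence-split and tagged independently, while text outside the matching elements is left alone.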
public List<? extends HasWord> tagCoreLabelsOrHasWords(List<? extends HasWord> sentence, Morphology morpha, boolean outputLemmas)
public void tagAndOutputSentence(List<? extends HasWord> sentence, boolean outputLemmas, Morphology morpha, PlainTextDocumentReaderAndWriter.OutputStyle outputStyle, boolean outputVerbosity, int numSentences, String separator, Writer writer)
public void outputTaggedSentence(List<? extends HasWord> sentence, boolean outputLemmas, PlainTextDocumentReaderAndWriter.OutputStyle outputStyle, boolean outputVerbosity, int numSentences, String separator, Writer writer)
public static void main(String[] args) throws Exception
args - Command-line arguments
IOException - If any file problems
Exception