java.lang.Object
  edu.stanford.nlp.tagger.maxent.MaxentTagger
public class MaxentTagger
The main class for users to run, train, and test the part of speech tagger. You can tag things through the Java API or from the command line. The two taggers included in this distribution are:
MaxentTagger tagger = new MaxentTagger("models/left3words-wsj-0-18.tagger");
MaxentTagger tagger = new MaxentTagger();
String taggedString = maxentTagger.tagString("Here's a tagged string.");
Sentence taggedSentence = maxentTagger.tagSentence(sentence);
Sentence taggedSentence = maxentTagger.apply(sentence);
List taggedList = maxentTagger.process(sentences);
MaxentTagger.init("models/left3words-wsj-0-18.tagger");
String taggedString = MaxentTagger.tagString("Here's a tagged string.");
String taggedString2 = MaxentTagger.tagString("This is your life.");
Here's/JJ a/DT tagged/VBD string./NNP
This/DT is/VBZ your/PRP$ life./NN
The MaxentTagger is initialized using a static call to init, which takes the path to a trained model as an argument. The trained model is loaded immediately (which may take a while); subsequent calls to init will have no effect. If no path is passed to init, the default trained model is used. tagString can then be called with an untagged String; a tagged String is returned (serious errors could result in a null return value).
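The sample output above joins each word to its tag with the delimiter (/ by default), and tagString may return null after a serious error. The following is a minimal sketch of consuming such output, assuming tokens are separated by whitespace. TaggedOutputParser and WordTag are hypothetical names used for illustration; they are not part of the tagger API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of the tagger API) that splits tagged output into word/tag pairs. */
final class TaggedOutputParser {

    /** One token of tagged output: the word and its part-of-speech tag. */
    static final class WordTag {
        final String word;
        final String tag;

        WordTag(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    /**
     * Parses whitespace-separated word+delimiter+tag tokens.
     * Returns an empty list for null or blank input, because tagString
     * is documented to return null on serious errors.
     */
    static List<WordTag> parse(String tagged, char delimiter) {
        List<WordTag> result = new ArrayList<>();
        if (tagged == null || tagged.trim().isEmpty()) {
            return result;
        }
        for (String token : tagged.trim().split("\\s+")) {
            // Split on the LAST delimiter so that words containing it (e.g. "1/2") stay intact.
            int idx = token.lastIndexOf(delimiter);
            if (idx <= 0 || idx == token.length() - 1) {
                throw new IllegalArgumentException("Not a word" + delimiter + "tag token: " + token);
            }
            result.add(new WordTag(token.substring(0, idx), token.substring(idx + 1)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Uses the sample output shown in the class description.
        for (WordTag wt : parse("Here's/JJ a/DT tagged/VBD string./NNP", '/')) {
            System.out.println(wt.word + "\t" + wt.tag);
        }
    }
}
```

Splitting on the last delimiter rather than the first is a design choice for this sketch: it keeps words that themselves contain the delimiter in one piece.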
java edu.stanford.nlp.tagger.maxent.MaxentTagger -genprops
This gets you a default properties file with descriptions of each parameter you can set in your trained model. You can modify the properties file, or use the default options. To train, run:
java -mx1g edu.stanford.nlp.tagger.maxent.MaxentTagger -props myPropertiesFile.props
with the appropriate properties file specified; any argument you give in the properties file can also be specified on the command line. You must have specified a model using -model, either in the properties file or on the command line, as well as a file containing tagged words using -trainFile.
java edu.stanford.nlp.tagger.maxent.MaxentTagger -model <modelFile> -textFile <textfile>
For testing (evaluating against tagged text):
java edu.stanford.nlp.tagger.maxent.MaxentTagger -model <modelFile> -testFile <testfile>
You can use the same properties file as for training if you pass it in with the "-props" argument. The most important arguments for tagging (besides "model" and "file") are "tokenize" and "tokenizerFactory". See below for more details. Note that the tagger assumes input has not yet been tokenized and tokenizes it using a default English tokenizer. If your input has already been tokenized, use the flag "-tokenized".
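When the tagger is driven from another Java program, the command lines above can be assembled as argument lists. This is only a sketch: TaggerCommandLines is a hypothetical helper, input.txt is a placeholder file name, and the code builds the commands without running them. It uses only the flags shown above (-props, -model, -textFile, -testFile).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that assembles the command lines shown above; it does not run them. */
final class TaggerCommandLines {
    static final String MAIN_CLASS = "edu.stanford.nlp.tagger.maxent.MaxentTagger";

    /** Training: java -mx1g MAIN_CLASS -props propsFile */
    static List<String> trainCommand(String propsFile) {
        // -mx1g is the JVM heap option used in the training example above.
        return new ArrayList<>(Arrays.asList("java", "-mx1g", MAIN_CLASS, "-props", propsFile));
    }

    /** Tagging: java MAIN_CLASS -model modelFile -textFile textFile */
    static List<String> tagCommand(String modelFile, String textFile) {
        return new ArrayList<>(Arrays.asList("java", MAIN_CLASS, "-model", modelFile, "-textFile", textFile));
    }

    /** Testing: java MAIN_CLASS -model modelFile -testFile testFile */
    static List<String> testCommand(String modelFile, String testFile) {
        return new ArrayList<>(Arrays.asList("java", MAIN_CLASS, "-model", modelFile, "-testFile", testFile));
    }

    public static void main(String[] args) {
        List<String> cmd = tagCommand("models/left3words-wsj-0-18.tagger", "input.txt");
        System.out.println(String.join(" ", cmd));
        // To launch it, the tagger classes must be on the classpath:
        // new ProcessBuilder(cmd).inheritIO().start();
    }
}
```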
Parameters can be defined using a Properties file (specified on the command line with -prop propFile), or directly on the command line (by preceding their name with a minus sign ("-") to turn them into a flag). The following properties are recognized:
Property Name | Type | Default Value | Relevant Phase(s) | Description |
---|---|---|---|---|
model | String | N/A | All | Path and filename where you would like to save the model (training) or where the model should be loaded from (testing, tagging). |
trainFile | String | N/A | Train | Path to the file holding the training data; specifying this option puts the tagger in training mode. Only one of 'trainFile','testFile','textFile', and 'convertToSingleFile' may be specified. |
testFile | String | N/A | Test | Path to the file holding the test data; specifying this option puts the tagger in testing mode. Only one of 'trainFile','testFile','textFile', and 'convertToSingleFile' may be specified. |
textFile | String | N/A | Tag | Path to the file holding the text to tag; specifying this option puts the tagger in tagging mode. Only one of 'trainFile','testFile','textFile', and 'convertToSingleFile' may be specified. |
convertToSingleFile | String | N/A | N/A | Provided only for backwards compatibility, this option allows you to convert a tagger trained using a previous version of the tagger to the new single-file format. The value of this flag should be the path for the new model file, 'model' should be the path prefix to the old tagger (up to but not including the ".holder"), and you should supply the properties configuration for the old tagger with -props (before these two arguments). |
genprops | boolean | N/A | N/A | Use this option to output a default properties file, containing information about each of the possible configuration options. |
delimiter | char | / | All | Delimiter character that separates word and part of speech tags. For training and testing, this is the delimiter used in the train/test files. For tagging, this is the character that will be inserted between words and tags in the output. |
encoding | String | UTF-8 | All | Encoding of the read files (training, testing) and the output text files. |
tokenize | boolean | true | Tag,Test | Whether or not the file has been tokenized (so that white space separates all and only those things that should be tagged as separate words). |
tokenizerFactory | String | edu.stanford.nlp.process.PTBTokenizer | Tag,Test | Fully qualified classname of the tokenizer to use. edu.stanford.nlp.process.PTBTokenizer does basic English tokenization. |
arch | String | generic | Train | Architecture of the model, as a comma-separated list of options, some with a parenthesized integer argument written k here: this determines what features are used to build your model. Options are 'left3words', 'left5words', 'bidirectional', 'bidirectional5words', 'generic', 'sighan2005' (Chinese), 'german', 'words(k)', 'naacl2003unknowns', 'naacl2003conjunctions', wordshapes(k), motleyUnknown, suffix(k), prefix(k), prefixsuffix(k), capitalizationsuffix(k), distsim(s), chinesedictionaryfeatures(s), lctagfeatures, unicodeshapes(k). The left3words architectures are faster, but slightly less accurate, than the bidirectional architectures. 'naacl2003unknowns' was our traditional set of unknown word features, but you can now specify features more flexibly via the various other supported keywords. The 'shapes' options map words to equivalence classes, which slightly increases accuracy. |
lang | String | english | Train | Language from which the part of speech tags are drawn. This option determines which tags are considered closed-class (only a fixed set of words can be tagged with a closed-class tag, such as prepositions). Defined languages are 'english' (Penn tagset), 'polish' (very rudimentary), 'chinese', 'arabic', 'german', and 'medline'. |
openClassTags | String | N/A | Train | Space separated list of tags that should be considered open-class. All tags encountered that are not in this list are considered closed-class. E.g. format: "NN VB" |
closedClassTags | String | N/A | Train | Space separated list of tags that should be considered closed-class. All tags encountered that are not in this list are considered open-class. |
learnClosedClassTags | boolean | false | Train | If true, induce which tags are closed-class by counting as closed-class tags all those tags which have fewer unique word tokens than closedClassTagThreshold. |
closedClassTagThreshold | int | int | Train | Number of unique word tokens that a tag may have and still be considered closed-class; relevant only if learnClosedClassTags is true. |
sgml | boolean | false | Tag, Test | Very basic tagging of the contents of all sgml fields; for more complex mark-up, consider using the xmlInput option. |
xmlInput | String | | Tag, Test | Give a space separated list of tags in an XML file whose content you would like tagged. Any internal tags that appear in the content of fields you would like tagged will be discarded; the rest of the XML will be preserved and the original text of specified fields will be replaced with the tagged text. |
xmlOutput | String | "" | Tag | If a path is given, the tagged data will be written out to the given file in XML. If non-empty, each word will be written out within a word tag, with the part of speech as an attribute. If the original input was XML, this will just appear in the field where the text originally came from. Otherwise, word tags will be surrounded by sentence tags as well. E.g., <sentence id="0"><word id="0" pos="NN">computer</word></sentence> |
tagInside | String | "" | Tag | Tags inside elements that match the regular expression given in the String. |
search | String | cg | Train | Specify the search method to be used in the optimization method for training. Options are 'cg' (conjugate gradient) or 'iis' (improved iterative scaling). |
sigmaSquared | double | 0.5 | Train | Sigma-squared smoothing/regularization parameter to be used for conjugate gradient search. Default usually works reasonably well. |
iterations | int | 100 | Train | Number of iterations to be used for improved iterative scaling. |
rareWordThresh | int | 5 | Train | Words that appear fewer than this number of times during training are considered rare words and use extra rare word features. |
minFeatureThreshold | int | 5 | Train | Features whose history appears fewer than this number of times are discarded. |
curWordMinFeatureThreshold | int | 2 | Train | Words that occur more than this number of times will generate features with all of the tags they've been seen with. |
rareWordMinFeatureThresh | int | 10 | Train | Features of rare words whose histories occur fewer than this number of times are discarded. |
veryCommonWordThresh | int | 250 | Train | Words that occur more than this number of times form an equivalence class by themselves. Ignored unless you are using ambiguity classes. |
debug | boolean | boolean | All | Whether to write debugging information (words, top words, unknown words). Useful for error analysis. |
debugPrefix | String | N/A | All | Prefix for where to write out the debugging information (relevant only if debug=true). |
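The table states that only one of trainFile, testFile, textFile, and convertToSingleFile may be specified, and that each property can also be passed as a flag by prefixing its name with "-". A minimal sketch of both rules using java.util.Properties follows. TaggerProps is a hypothetical helper, the file names are placeholders, and the check mirrors the rule as documented rather than the tagger's own validation code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.TreeSet;

/** Hypothetical helper illustrating the property rules in the table above. */
final class TaggerProps {
    /** Options that select the mode; the table says only one may be specified. */
    static final List<String> MODE_KEYS =
            Arrays.asList("trainFile", "testFile", "textFile", "convertToSingleFile");

    /** Returns the mode-selecting options that are present in props. */
    static List<String> modesSet(Properties props) {
        List<String> present = new ArrayList<>();
        for (String key : MODE_KEYS) {
            if (props.getProperty(key) != null) {
                present.add(key);
            }
        }
        return present;
    }

    /** Throws IllegalArgumentException if more than one mode option is set. */
    static void validate(Properties props) {
        List<String> modes = modesSet(props);
        if (modes.size() > 1) {
            throw new IllegalArgumentException(
                    "Only one of " + MODE_KEYS + " may be specified, found " + modes);
        }
    }

    /** Turns each property into a "-name", "value" pair, sorted by name for a stable order. */
    static List<String> toFlags(Properties props) {
        List<String> flags = new ArrayList<>();
        for (String name : new TreeSet<>(props.stringPropertyNames())) {
            flags.add("-" + name);
            flags.add(props.getProperty(name));
        }
        return flags;
    }

    public static void main(String[] args) {
        Properties train = new Properties();
        train.setProperty("model", "my-model.tagger"); // placeholder output path
        train.setProperty("trainFile", "train.txt");   // placeholder training data
        train.setProperty("arch", "left3words");
        train.setProperty("lang", "english");
        validate(train);
        System.out.println(String.join(" ", toFlags(train)));
    }
}
```

The same Properties object could be written to disk with Properties.store and passed to the tagger with -props instead of being expanded into flags.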
Field Summary | |
---|---|
static String | DEFAULT_DISTRIBUTION_PATH |
static String | DEFAULT_NLP_GROUP_MODEL_PATH |
Constructor Summary | |
---|---|
MaxentTagger() | Non-static version of the tagger for the Function interface. |
MaxentTagger(String modelFile) | Constructor from a fileName. |
Method Summary | | |
---|---|---|
Sentence<TaggedWord> | apply(Sentence<? extends HasWord> in) | Expects a sentence and returns a tagged sentence. |
static void | init(String modelFile) | Initializer that loads the dictionary. |
static void | init(String modelFile, TaggerConfig config) | Initializer that loads the dictionary. |
static void | main(String[] args) | Command-line tagger that takes input from stdin or a file. |
List<Sentence<TaggedWord>> | process(List<Sentence<? extends HasWord>> sentences) | Tags the Words in each Sentence in the given List with their grammatical part-of-speech. |
Sentence<TaggedWord> | processSentence(Sentence sentence) | Returns a new Sentence that is a copy of the given sentence with all the words tagged with their part-of-speech. |
static void | tagFromXML(TaggerConfig config) | |
static Sentence<TaggedWord> | tagSentence(Sentence<? extends HasWord> sentence) | Returns a new Sentence that is a copy of the given sentence with all the words tagged with their part-of-speech. |
static String | tagString(String toTag) | Tags the input string and returns the tagged version. |
static Sentence<TaggedWord> | tagStringTokenized(String toTag) | Tags the input string and returns the tagged version. |
static List<Sentence<? extends HasWord>> | tokenizeText(Reader r) | Reads data from r, tokenizes it with the default (Penn Treebank) tokenizer, and returns a List of Sentence objects, which can then be fed into tagSentence. |
static List<Sentence<? extends HasWord>> | tokenizeText(Reader r, TokenizerFactory tokenizerFactory) | Reads data from r, tokenizes it with the given tokenizer, and returns a List of Lists of (extends) HasWord objects, which can then be fed into tagSentence. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final String DEFAULT_NLP_GROUP_MODEL_PATH
public static final String DEFAULT_DISTRIBUTION_PATH
Constructor Detail |
---|
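public MaxentTagger()
Non-static version of the tagger for the Function interface. Loads the model from the default location, DEFAULT_NLP_GROUP_MODEL_PATH.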
public MaxentTagger(String modelFile) throws Exception
modelFile is both a filename and a prefix from which other filenames are built and their data loaded. The tagger data is loaded when the constructor is called (this can be slow). Since some of the data for the tagger is static, two different taggers cannot exist at the same time.
Throws:
Exception
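The class description says that calls to init after the first have no effect, and the constructor notes say that some tagger data is static. A toy illustration of what load-once static state implies for callers is sketched below. StaticModelHolder is hypothetical and only models the documented behavior; it is not how MaxentTagger is implemented.

```java
/** Toy model of load-once static state; a hypothetical class, not the tagger's implementation. */
final class StaticModelHolder {
    // Shared by every caller in the JVM, which is why only one model can be active at a time.
    private static String loadedModelPath;

    /**
     * Records the model path on the first call only.
     * Returns true if this call performed the load, false if a model was already loaded.
     */
    static synchronized boolean init(String modelPath) {
        if (loadedModelPath != null) {
            return false; // already initialized: this call has no effect
        }
        loadedModelPath = modelPath; // a real loader would read the model file here
        return true;
    }

    /** Returns the path recorded by the first successful init, or null if none. */
    static synchronized String loadedModel() {
        return loadedModelPath;
    }

    public static void main(String[] args) {
        System.out.println(init("first.tagger"));  // placeholder path
        System.out.println(init("second.tagger")); // ignored: a model is already loaded
        System.out.println(loadedModel());
    }
}
```

Under this model, a program that needs a different trained model has to choose it before the first load, since later requests are silently ignored.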
Method Detail |
---|
public static void init(String modelFile) throws Exception
Parameters:
modelFile - Filename of the trained model, for example /u/nlp/data/pos-tagger/wsj3t0-18-left3words/train-wsj-0-18.holder
Throws:
Exception
public static void init(String modelFile, TaggerConfig config) throws Exception
Parameters:
modelFile - Filename of the trained model, for example /u/nlp/data/pos-tagger/wsj3t0-18-left3words/train-wsj-0-18.holder
config - TaggerConfig based on command-line arguments
Throws:
Exception - If there is an IO problem
public static String tagString(String toTag) throws Exception
Parameters:
toTag - The untagged input String
Throws:
Exception - If there are IO errors or class initialization problems
public static Sentence<TaggedWord> tagStringTokenized(String toTag) throws Exception
Parameters:
toTag - The untagged input String
Throws:
Exception - If there are IO errors or class initialization problems
public Sentence<TaggedWord> apply(Sentence<? extends HasWord> in)
Specified by:
apply in interface Function<Sentence<? extends HasWord>,Sentence<TaggedWord>>
Parameters:
in - This needs to be a Sentence
public List<Sentence<TaggedWord>> process(List<Sentence<? extends HasWord>> sentences)
NOTE: The input document must contain sentences as its elements, not words. To turn a Document of words into a Document of sentences, run it through WordToSentenceProcessor.
Specified by:
process in interface ListProcessor<Sentence<? extends HasWord>,Sentence<TaggedWord>>
Parameters:
sentences - A List of Sentence
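The note on process requires a list of sentences rather than a flat list of words. To make the difference concrete, here is a simplified stand-in for that grouping step. SimpleSentenceSplitter is a hypothetical class that ends a sentence at ".", "?" or "!"; it is not the library's WordToSentenceProcessor, whose rules may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Simplified, hypothetical stand-in for grouping a flat word list into sentences. */
final class SimpleSentenceSplitter {
    // Assumption for this sketch: these tokens always end a sentence.
    private static final Set<String> SENTENCE_END = new HashSet<>(Arrays.asList(".", "?", "!"));

    /** Groups already-tokenized words into sentences, closing a sentence after each end token. */
    static List<List<String>> split(List<String> words) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String word : words) {
            current.add(word);
            if (SENTENCE_END.contains(word)) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // trailing words without final punctuation
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("This", "is", "your", "life", ".", "Here", "it", "is");
        System.out.println(split(words));
    }
}
```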
public Sentence<TaggedWord> processSentence(Sentence sentence)
Specified by:
processSentence in interface SentenceProcessor
Parameters:
sentence - A sentence. Classes implementing this interface can assume that the sentence passed in is not null.
public static Sentence<TaggedWord> tagSentence(Sentence<? extends HasWord> sentence)
public static List<Sentence<? extends HasWord>> tokenizeText(Reader r)
Parameters:
r - Reader to read text from
public static List<Sentence<? extends HasWord>> tokenizeText(Reader r, TokenizerFactory tokenizerFactory)
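For input that is already tokenized, so that white space separates all and only the things to be tagged as separate words, the tokenization step reduces to splitting on white space. The sketch below reads such text from a Reader under the added assumption of one sentence per line. WhitespaceLineTokenizer is hypothetical and returns plain lists of strings, not the library's Sentence objects.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical reader for pre-tokenized text: one sentence per line, tokens split on white space. */
final class WhitespaceLineTokenizer {

    /** Reads r to the end (closing it) and returns one token list per non-blank line. */
    static List<List<String>> tokenize(Reader r) {
        List<List<String>> sentences = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(r)) {
            String line;
            while ((line = in.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    sentences.add(Arrays.asList(trimmed.split("\\s+")));
                }
            }
        } catch (IOException e) {
            // Rethrown unchecked to keep the sketch's signature simple.
            throw new UncheckedIOException(e);
        }
        return sentences;
    }

    public static void main(String[] args) {
        Reader input = new StringReader("This is your life .\nHere 's a tagged string .\n");
        System.out.println(tokenize(input));
    }
}
```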
public static void tagFromXML(TaggerConfig config)
public static void main(String[] args) throws IOException
Parameters:
args - Command-line arguments
Throws:
IOException - If any file problems