java.lang.Object
  edu.stanford.nlp.tagger.maxent.MaxentTagger
public class MaxentTagger
A class for end users to part-of-speech tag text using an already trained and saved maxent tagger. You can tag text through the Java API or from the command line. Two taggers are included in this distribution.
To construct a tagger from a trained model file:
MaxentTagger tagger = new MaxentTagger("bidirectional/wsj3t0-18.holder");
To construct a tagger that uses the default model:
MaxentTagger tagger = new MaxentTagger();
To tag a String and get back a tagged String:
String taggedString = MaxentTagger.tagString("Here's a tagged string.");
To tag a Sentence and get back a tagged Sentence:
Sentence taggedSentence = MaxentTagger.tagSentence(sentence);
Sentence taggedSentence = (Sentence) tagger.apply(sentence);
To tag a List of Sentences and get back a List of tagged Sentences:
List taggedList = tagger.process(sentences);
Initialize the tagger by calling init with a trained model, which is
loaded immediately (this takes a long time). Subsequent attempts to
initialize the tagger are no-ops, so it is safe to call init repeatedly.
Otherwise, the default trained model is used.
Subsequent calls to tagString take an untagged String and return a
tagged String, unless there was a serious problem in the tagging
machinery, in which case null is returned.
Example:
MaxentTagger.init("stanford-tagger/bidirectional/wsj0-18.holder");
String taggedString = MaxentTagger.tagString("Here's a tagged string.");
String taggedString2 = MaxentTagger.tagString("This is your life.");
taggedString and taggedString2 then contain
Here's/JJ a/DT tagged/VBD string./NNP
and
This/DT is/VBZ your/PRP$ life./NN
respectively.
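Callers usually need the words and tags separately. The following is a minimal sketch of splitting the word/TAG output format shown above; TaggedTokenParser and TaggedToken are hypothetical names introduced here for illustration and are not part of the tagger API. It assumes items are separated by whitespace and that the tag follows the last slash in each item. It also treats a null input as an empty result, since tagString is documented to return null on a serious failure.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the tagger API. */
final class TaggedTokenParser {

    /** One word/tag pair taken from a tagged string. */
    static final class TaggedToken {
        final String word;
        final String tag;

        TaggedToken(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    /** Splits "word/TAG word/TAG ..." into pairs; null or blank input gives an empty list. */
    static List<TaggedToken> parse(String tagged) {
        List<TaggedToken> tokens = new ArrayList<>();
        if (tagged == null || tagged.trim().isEmpty()) {
            return tokens;
        }
        for (String item : tagged.trim().split("\\s+")) {
            // Use the last slash so that a word containing '/' keeps its own slashes.
            int slash = item.lastIndexOf('/');
            if (slash < 0) {
                tokens.add(new TaggedToken(item, "")); // no tag present
            } else {
                tokens.add(new TaggedToken(item.substring(0, slash), item.substring(slash + 1)));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (TaggedToken t : parse("This/DT is/VBZ your/PRP$ life./NN")) {
            System.out.println(t.word + "\t" + t.tag);
        }
    }
}
```

Note that the last token in the example keeps its period ("life."), because the tagged string was produced from whitespace-split input.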
Note that, when called in precisely the above fashion, the tagger just
splits on whitespace, although it expects input tokenized as in the
Penn Treebank. Unless a prior tokenization step is done, the tagger
will perform poorly. You can tokenize text programmatically by calling
the tokenizeText method.
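To see why whitespace splitting is not enough, compare "string." with the Penn Treebank convention, which separates punctuation and clitics into their own tokens ("string", "." and "Here", "'s"). The sketch below is a deliberately simplified stand-in, not the tokenizer behind tokenizeText: SimplePreTokenizer is a hypothetical name, and the two regular expressions cover only trailing punctuation and a handful of English clitics. It will mishandle abbreviations such as "U.S." and many other cases a real tokenizer covers.

```java
/** Hypothetical, simplified pre-tokenizer for illustration only. */
final class SimplePreTokenizer {

    /** Returns the text with punctuation and common clitics split off by spaces. */
    static String tokenize(String text) {
        return text
                // Detach punctuation that is followed by whitespace or the end of the text.
                .replaceAll("([.,!?;:])(?=\\s|$)", " $1")
                // Detach n't and clitics such as 's, 're, 've, 'll, 'd, 'm.
                .replaceAll("(?i)(n't|'(?:s|re|ve|ll|d|m))\\b", " $1")
                // Collapse any runs of whitespace introduced above.
                .replaceAll("\\s+", " ")
                .trim();
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Here's a tagged string."));
    }
}
```

The result is still a single whitespace-separated string, so it can be passed to a method that splits on whitespace.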
To tag from the command line:
java edu.stanford.nlp.tagger.maxent.MaxentTagger -model <modelPrefix> -file <testfile>
You can use the same properties file as for training (cf. Train) if
you pass it in with the "-props" argument. The only important
arguments for tagging (besides "model" and "file") are "tokenize"
and "tokenizeFactory". See the properties file documentation for
details.
Field Summary | |
---|---|
static java.lang.String | DEFAULT_NLP_GROUP_MODEL_PATH |
Constructor Summary | |
---|---|
MaxentTagger() | Non-static version of the tagger for the Function interface. |
MaxentTagger(java.lang.String modelFile) | Constructor from a fileName. |
Method Summary | | |
---|---|---|
java.lang.Object | apply(java.lang.Object in) | Expects a sentence and returns a tagged sentence. |
static void | init(java.lang.String modelFile) | Initializer that loads the dictionary. |
static void | main(java.lang.String[] args) | Command-line tagger that takes input from stdin or a file. |
java.util.List | process(java.util.List sentences) | Tags the Words in each Sentence in the given List with their grammatical part-of-speech. |
Sentence | processSentence(Sentence sentence) | Returns a new Sentence that is a copy of the given sentence with all the words tagged with their part-of-speech. |
static Sentence | tagSentence(Sentence sentence) | Returns a new Sentence that is a copy of the given sentence with all the words tagged with their part-of-speech. |
static java.lang.String | tagString(java.lang.String toTag) | Tags the input string and returns the tagged version. |
static java.util.List | tokenizeText(java.io.Reader r) | Reads data from r, tokenizes it with the default (Penn Treebank) tokenizer, and returns a List of Sentence objects, which can then be fed into tagSentence. |
static java.util.List | tokenizeText(java.io.Reader r, TokenizerFactory tokenizerFactory) | Reads data from r, tokenizes it with the given tokenizer, and returns a List of Sentence objects, which can then be fed into tagSentence. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final java.lang.String DEFAULT_NLP_GROUP_MODEL_PATH
Constructor Detail |
---|
public MaxentTagger()
Non-static version of the tagger for the Function interface. The model is read from DEFAULT_NLP_GROUP_MODEL_PATH.
public MaxentTagger(java.lang.String modelFile) throws java.lang.Exception
Constructor from a fileName. modelFile is both a filename and a prefix
from which other filenames are built and their data loaded. The tagger
data is loaded when the constructor is called (this can be slow). Since
some of the data for the tagger is static, two different taggers cannot
exist at the same time.
Throws: java.lang.Exception
Method Detail |
---|
public static void init(java.lang.String modelFile) throws java.lang.Exception
Initializer that loads the dictionary.
Parameters: modelFile - Filename of the trained model, for example /u/nlp/data/pos-tagger/wsj3t0-18-left3words/train-wsj-0-18.holder
Throws: java.lang.Exception
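The class description states that repeated calls to init are no-ops once a model has been loaded. The sketch below illustrates that kind of one-time static initialization in general; it is not the tagger's source code. InitOnce, loadedModel and loadCount are hypothetical names, and the "loading" is reduced to remembering the path. One consequence of such a guard, if the tagger behaves as described, is that a second call with a different model path does not switch models.

```java
/** Hypothetical illustration of a one-time static initializer; not the tagger's code. */
final class InitOnce {

    private static boolean initialized = false;
    private static String loadedModel = null;
    private static int loadCount = 0;

    /** Loads the model on the first call; every later call returns immediately. */
    static synchronized void init(String modelFile) {
        if (initialized) {
            return; // no-op, even if modelFile differs from the first call
        }
        loadedModel = modelFile; // stand-in for the slow model loading step
        loadCount++;
        initialized = true;
    }

    static synchronized String loadedModel() {
        return loadedModel;
    }

    static synchronized int loadCount() {
        return loadCount;
    }

    public static void main(String[] args) {
        init("modelA");
        init("modelB"); // ignored: the first model stays loaded
        System.out.println(loadedModel() + " loaded " + loadCount() + " time(s)");
    }
}
```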
public static java.lang.String tagString(java.lang.String toTag) throws java.lang.Exception
Tags the input string and returns the tagged version.
Parameters: toTag - The untagged input String
Throws: java.lang.Exception
public java.lang.Object apply(java.lang.Object in)
Expects a sentence and returns a tagged sentence.
Specified by: apply in interface Function
Parameters: in - This needs to be a Sentence
public java.util.List process(java.util.List sentences)
Tags the Words in each Sentence in the given List with their grammatical part-of-speech.
NOTE: The input document must contain sentences as its elements,
not words. To turn a Document of words into a Document of sentences, run
it through WordToSentenceProcessor.
Specified by: process in interface ListProcessor
Parameters: sentences - A List of Sentence
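The note about sentences versus words describes a change of shape: a flat list of words has to become a list of sentences before it is passed to process. The sketch below shows that reshaping with plain strings. SentenceGrouper is a hypothetical stand-in, not WordToSentenceProcessor, and it assumes the words are already tokenized and that a sentence ends at ".", "?" or "!". It makes no claim about how the real class finds boundaries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of grouping a flat word list into sentences. */
final class SentenceGrouper {

    /** Groups tokens into sentences, closing a sentence at ".", "?" or "!". */
    static List<List<String>> group(List<String> words) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String word : words) {
            current.add(word);
            if (word.equals(".") || word.equals("?") || word.equals("!")) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // trailing words without final punctuation
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "This", "is", "your", "life", ".",
                "Here", "'s", "a", "tagged", "string", ".");
        System.out.println(group(words));
    }
}
```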
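public Sentence processSentence(Sentence sentence)
Returns a new Sentence that is a copy of the given sentence with all the words tagged with their part-of-speech.
Specified by: processSentence in interface SentenceProcessor
Parameters: sentence - A sentence. Classes implementing this interface can assume that the sentence passed in is not null.
public static Sentence tagSentence(Sentence sentence)
Returns a new Sentence that is a copy of the given sentence with all the words tagged with their part-of-speech.
public static java.util.List tokenizeText(java.io.Reader r)
Reads data from r, tokenizes it with the default (Penn Treebank) tokenizer, and returns a List of Sentence objects, which can then be fed into tagSentence.
public static java.util.List tokenizeText(java.io.Reader r, TokenizerFactory tokenizerFactory)
Reads data from r, tokenizes it with the given tokenizer, and returns a List of Sentence objects, which can then be fed into tagSentence.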
public static void main(java.lang.String[] args) throws java.lang.Exception
Command-line tagger that takes input from stdin or a file.
Parameters: args - There can be no arguments, or a file to read a tagger
from and a file to tag can be supplied.
If the text is pre-tokenized in Penn Treebank format,
use the -tokenized option:
java edu.stanford.nlp.tagger.maxent.MaxentTagger [-model modelName] [-file fileName] [-tokenize]
Throws: java.lang.Exception