java.lang.Object
  edu.stanford.nlp.ie.AbstractSequenceClassifier<IN>
public abstract class AbstractSequenceClassifier<IN extends CoreMap>
This class provides common functionality for (probabilistic) sequence models. It is a superclass of our CMM and CRF sequence classifiers, and is even used in the (deterministic) NumberSequenceClassifier. See implementing classes for more information.
A full implementation should implement these 5 abstract methods:
List<CoreLabel> classify(List<CoreLabel> document);
void train(Collection<List<CoreLabel>> docs);
void printProbsDocument(List<CoreLabel> document);
void serializeClassifier(String serializePath);
void loadClassifier(ObjectInputStream in, Properties props) throws IOException, ClassCastException, ClassNotFoundException;
but a runtime (or rule-based) implementation can usefully implement just the first.
Field Summary | |
---|---|
Index<String> |
classIndex
|
FeatureFactory<IN> |
featureFactory
|
SeqClassifierFlags |
flags
|
protected Set<String> |
knownLCWords
|
protected IN |
pad
|
protected int |
windowSize
|
Constructor Summary | |
---|---|
AbstractSequenceClassifier(Properties props)
Construct a SeqClassifierFlags object based on the passed in properties, and then call the other constructor. |
|
AbstractSequenceClassifier(SeqClassifierFlags flags)
Initialize the featureFactory and other variables based on the passed in flags. |
Method Summary | |
---|---|
String |
apply(String in)
Maps a String input to an XML-formatted rendition of applying NER to the String. |
String |
backgroundSymbol()
Returns the background class for the classifier. |
abstract List<IN> |
classify(List<IN> document)
Classify a List of something that extends CoreMap. |
List<List<IN>> |
classify(String str)
Classify the tokens in a String. |
void |
classifyAndWriteAnswers(Collection<File> testFiles,
DocumentReaderAndWriter<IN> readerWriter)
|
void |
classifyAndWriteAnswers(ObjectBank<List<IN>> documents,
PrintWriter printWriter,
DocumentReaderAndWriter<IN> readerWriter)
|
void |
classifyAndWriteAnswers(String testFile,
DocumentReaderAndWriter<IN> readerWriter)
Load a test file, run the classifier on it, and then print the answers to stdout (with timing to stderr). |
void |
classifyAndWriteAnswers(String testFile,
OutputStream outStream,
DocumentReaderAndWriter<IN> readerWriter)
If the flag outputEncoding is defined, the output is written in that
character encoding, otherwise in the system default character encoding. |
void |
classifyAndWriteAnswers(String baseDir,
String filePattern,
DocumentReaderAndWriter<IN> readerWriter)
|
void |
classifyAndWriteAnswersKBest(ObjectBank<List<IN>> documents,
int k,
PrintWriter printWriter,
DocumentReaderAndWriter<IN> readerAndWriter)
Run the classifier on the documents in an ObjectBank, and print the answers to a given PrintWriter (with timing to stderr). |
void |
classifyAndWriteAnswersKBest(String testFile,
int k,
DocumentReaderAndWriter<IN> readerAndWriter)
Load a test file, run the classifier on it, and then print the answers to stdout (with timing to stderr). |
void |
classifyAndWriteViterbiSearchGraph(String testFile,
String searchGraphPrefix,
DocumentReaderAndWriter<IN> readerAndWriter)
Load a test file, run the classifier on it, and then write a Viterbi search graph for each sequence. |
List<List<IN>> |
classifyFile(String filename)
Classify the contents of a file. |
Counter<List<IN>> |
classifyKBest(List<IN> doc,
Class<? extends CoreAnnotation<String>> answerField,
int k)
|
List<List<IN>> |
classifyRaw(String str,
DocumentReaderAndWriter<IN> readerAndWriter)
Classify the tokens in a String. |
List<IN> |
classifySentence(List<? extends HasWord> sentence)
Classify a List of IN. |
List<IN> |
classifySentenceWithGlobalInformation(List<? extends HasWord> tokenSequence,
CoreMap doc,
CoreMap sentence)
Classify a List of IN using whatever additional information is passed in doc and sentence. |
void |
classifyStdin(DocumentReaderAndWriter<IN> readerWriter)
|
List<Triple<String,Integer,Integer>> |
classifyToCharacterOffsets(String sentences)
Classify the contents of a String to classified character offset
spans. |
String |
classifyToString(String sentences)
Classify the contents of a String to a tagged word/class String. |
String |
classifyToString(String sentences,
String outputFormat,
boolean preserveSpacing)
Classify the contents of a String to one of several String
representations that shows the classes. |
abstract List<IN> |
classifyWithGlobalInformation(List<IN> tokenSequence,
CoreMap document,
CoreMap sentence)
Classify a List of something that extends CoreMap using as
additional information whatever is stored in the document and sentence. |
String |
classifyWithInlineXML(String sentences)
Classify the contents of a String. |
static boolean |
countResults(List<? extends CoreMap> doc,
Counter<String> entityTP,
Counter<String> entityFP,
Counter<String> entityFN)
Count the successes and failures of the model on the given document. |
static boolean |
countResultsIOB(List<? extends CoreMap> doc,
Counter<String> entityTP,
Counter<String> entityFP,
Counter<String> entityFN)
|
Sampler<List<IN>> |
getSampler(List<IN> input)
|
SequenceModel |
getSequenceModel(List<IN> doc)
|
DFSA<String,Integer> |
getViterbiSearchGraph(List<IN> doc,
Class<? extends CoreAnnotation<String>> answerField)
|
Set<String> |
labels()
|
void |
loadClassifier(File file)
|
void |
loadClassifier(File file,
Properties props)
Loads a classifier from the file specified. |
void |
loadClassifier(InputStream in)
Load a classifier from the specified InputStream. |
void |
loadClassifier(InputStream in,
Properties props)
Load a classifier from the specified InputStream. |
abstract void |
loadClassifier(ObjectInputStream in,
Properties props)
Load a classifier from the specified input stream. |
void |
loadClassifier(String loadPath)
Loads a classifier from the file specified by loadPath. |
void |
loadClassifier(String loadPath,
Properties props)
Loads a classifier from the file specified by loadPath. |
void |
loadClassifierNoExceptions(File file)
|
void |
loadClassifierNoExceptions(File file,
Properties props)
|
void |
loadClassifierNoExceptions(InputStream in,
Properties props)
Loads a classifier from the given input stream. |
void |
loadClassifierNoExceptions(String loadPath)
|
void |
loadClassifierNoExceptions(String loadPath,
Properties props)
|
void |
loadJarClassifier(String modelName,
Properties props)
This function will load a classifier that is stored inside a jar file (if it is so stored). |
ObjectBank<List<IN>> |
makeObjectBankFromFile(String filename,
DocumentReaderAndWriter<IN> readerAndWriter)
|
ObjectBank<List<IN>> |
makeObjectBankFromFiles(Collection<File> files,
DocumentReaderAndWriter<IN> readerAndWriter)
|
ObjectBank<List<IN>> |
makeObjectBankFromFiles(String[] trainFileList,
DocumentReaderAndWriter<IN> readerAndWriter)
|
ObjectBank<List<IN>> |
makeObjectBankFromFiles(String baseDir,
String filePattern,
DocumentReaderAndWriter<IN> readerAndWriter)
|
ObjectBank<List<IN>> |
makeObjectBankFromReader(BufferedReader in,
DocumentReaderAndWriter<IN> readerAndWriter)
Set up an ObjectBank that will allow one to iterate over a collection of documents obtained from the passed in Reader. |
ObjectBank<List<IN>> |
makeObjectBankFromString(String string,
DocumentReaderAndWriter<IN> readerAndWriter)
Reads a String into an ObjectBank object. |
DocumentReaderAndWriter<IN> |
makeReaderAndWriter()
|
protected void |
printFeatureLists(IN wi,
Collection<List<String>> features)
Print the String features generated from a token. |
protected void |
printFeatures(IN wi,
Collection<String> features)
Print the String features generated from an IN. |
void |
printProbs(String filename,
DocumentReaderAndWriter<IN> readerAndWriter)
Takes the file, reads it in, and prints out the likelihood of each possible label at each point. |
abstract void |
printProbsDocument(List<IN> document)
|
void |
printProbsDocuments(ObjectBank<List<IN>> documents)
Takes a List of documents and prints the likelihood of each
possible label at each point. |
static void |
printResults(Counter<String> entityTP,
Counter<String> entityFP,
Counter<String> entityFN)
Given counters of true positives, false positives, and false negatives, prints out precision, recall, and f1 for each key. |
protected void |
reinit()
This method should be called after there have been changes to the flags (SeqClassifierFlags) variable, such as after deserializing a classifier. |
List<String> |
segmentString(String sentence)
Only use this method if a Chinese word segmenter has been loaded. |
List<String> |
segmentString(String sentence,
DocumentReaderAndWriter<IN> readerAndWriter)
|
abstract void |
serializeClassifier(String serializePath)
Serialize a sequence classifier to a file on the given path. |
static int |
tallyOneEntity(List<? extends CoreMap> doc,
int index,
Class<? extends CoreAnnotation<String>> source,
Class<? extends CoreAnnotation<String>> target,
Counter<String> positive,
Counter<String> negative)
|
void |
train()
Train the classifier based on values in flags. |
void |
train(Collection<List<IN>> docs)
Trains a classifier from a Collection of sequences. |
abstract void |
train(Collection<List<IN>> docs,
DocumentReaderAndWriter<IN> readerAndWriter)
Trains a classifier from a Collection of sequences. |
void |
train(String filename)
|
void |
train(String[] trainFileList,
DocumentReaderAndWriter<IN> readerAndWriter)
|
void |
train(String filename,
DocumentReaderAndWriter<IN> readerAndWriter)
|
void |
train(String baseTrainDir,
String trainFiles,
DocumentReaderAndWriter<IN> readerAndWriter)
|
void |
writeAnswers(List<IN> doc,
PrintWriter printWriter,
DocumentReaderAndWriter<IN> readerAndWriter)
Write the classifications of the Sequence classifier out to a writer in a format determined by the DocumentReaderAndWriter used. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public SeqClassifierFlags flags
public Index<String> classIndex
public FeatureFactory<IN> featureFactory
protected IN pad
protected int windowSize
protected Set<String> knownLCWords
Constructor Detail |
---|
public AbstractSequenceClassifier(Properties props)
props
- See SeqClassifierFlags for known properties.
public AbstractSequenceClassifier(SeqClassifierFlags flags)
flags
- A specification of the AbstractSequenceClassifier to construct.
Method Detail |
---|
protected final void reinit()
Implementation note: At the moment this method doesn't set windowSize or featureFactory, since they are being serialized separately in the file, but we should probably stop serializing them and just reinitialize them from the flags?
public DocumentReaderAndWriter<IN> makeReaderAndWriter()
public String backgroundSymbol()
public Set<String> labels()
public List<IN> classifySentence(List<? extends HasWord> sentence)
sentence
- The List of IN to be classified.
Returns the classified List of IN, with each answer stored in the CoreAnnotations.AnswerAnnotation field.
public List<IN> classifySentenceWithGlobalInformation(List<? extends HasWord> tokenSequence, CoreMap doc, CoreMap sentence)
tokenSequence
- The List of IN to be classified.
public SequenceModel getSequenceModel(List<IN> doc)
public Sampler<List<IN>> getSampler(List<IN> input)
public Counter<List<IN>> classifyKBest(List<IN> doc, Class<? extends CoreAnnotation<String>> answerField, int k)
public DFSA<String,Integer> getViterbiSearchGraph(List<IN> doc, Class<? extends CoreAnnotation<String>> answerField)
public List<List<IN>> classify(String str)
str
- A String with tokens in one or more sentences of text to be classified.
Returns a List of classified sentences (each a List of something that extends CoreMap).
public List<List<IN>> classifyRaw(String str, DocumentReaderAndWriter<IN> readerAndWriter)
str
- A String with tokens in one or more sentences of text to be classified.
Returns a List of classified sentences (each a List of something that extends CoreMap).
public List<List<IN>> classifyFile(String filename)
filename
- Contains the sentence(s) to be classified.
Returns a List of classified List of IN.
public String apply(String in)
Specified by: apply in interface Function<String,String>
in
- The function's argument
public String classifyToString(String sentences, String outputFormat, boolean preserveSpacing)
Classify the contents of a String to one of several String representations that shows the classes. Plain text or XML input is expected and the PlainTextDocumentReaderAndWriter is used. The classifier will tokenize the text and treat each sentence as a separate document. The output can be specified to be in a choice of three formats: slashTags (e.g., Bill/PERSON Smith/PERSON died/O ./O), inlineXML (e.g., <PERSON>Bill Smith</PERSON> went to <LOCATION>Paris</LOCATION> .), or xml, for stand-off XML (e.g., <wi num="0" entity="PERSON">Sue</wi> <wi num="1" entity="O">shouted</wi> ). There is also a binary choice as to whether the spacing between tokens of the original is preserved or whether the (tagged) tokens are printed with a single space (for inlineXML or slashTags) or a single newline (for xml) between each one.
Fine points: The slashTags and xml formats show tokens as transformed by any normalization processes inside the tokenizer, while inlineXML shows the tokens exactly as they appeared in the source text. When a period counts as both part of an abbreviation and as an end of sentence marker, it is included twice in the output String for slashTags or xml, but only once for inlineXML, where it is not counted as part of the abbreviation (or any named entity it is part of). For slashTags with preserveSpacing=true, there will be two successive periods such as "Jr.." The tokenized (preserveSpacing=false) output will have a space or a newline after the last token.
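To make the slashTags and inlineXML formats concrete, here is a minimal sketch that renders already-tokenized, already-labeled input in both styles. It does not call the classifier or the tokenizer: the class name OutputFormatSketch and its two methods are hypothetical, the token and label lists are hand-written, and the sketch ignores preserveSpacing and the trailing whitespace mentioned above.

```java
import java.util.List;

public class OutputFormatSketch {
    /** Render each token as token/LABEL, separated by single spaces (slashTags style). */
    public static String slashTags(List<String> tokens, List<String> labels) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(tokens.get(i)).append('/').append(labels.get(i));
        }
        return sb.toString();
    }

    /** Wrap each maximal run of one non-background label in an XML element (inlineXML style). */
    public static String inlineXml(List<String> tokens, List<String> labels, String background) {
        StringBuilder sb = new StringBuilder();
        String open = background; // label of the currently open element, or background if none
        for (int i = 0; i < tokens.size(); i++) {
            String label = labels.get(i);
            if (!label.equals(open)) {
                // Label changed: close the old element, then open a new one if needed.
                if (!open.equals(background)) sb.append("</").append(open).append('>');
                if (i > 0) sb.append(' ');
                if (!label.equals(background)) sb.append('<').append(label).append('>');
                open = label;
            } else if (i > 0) {
                sb.append(' ');
            }
            sb.append(tokens.get(i));
        }
        if (!open.equals(background)) sb.append("</").append(open).append('>');
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("Bill", "Smith", "went", "to", "Paris", ".");
        List<String> labels = List.of("PERSON", "PERSON", "O", "O", "LOCATION", "O");
        System.out.println(slashTags(tokens, labels));
        System.out.println(inlineXml(tokens, labels, "O"));
    }
}
```

In this sketch the background label is passed in explicitly; with a real classifier it would presumably come from backgroundSymbol().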
sentences
- The String to be classified. It will be tokenized and divided into documents according to (heuristically determined) sentence boundaries.
outputFormat
- The format to put the output in: one of "slashTags", "xml", or "inlineXML"
preserveSpacing
- Whether to preserve the input spacing between tokens, which may sometimes be none (true) or whether to tokenize the text and print it with one space between each token (false)
Returns a String annotated with classification information.
public String classifyWithInlineXML(String sentences)
Classify the contents of a String. Plain text or XML is expected and the PlainTextDocumentReaderAndWriter is used. The classifier will treat each sentence as a separate document. Output is in inline XML format (e.g. <PERSON>Bill Smith</PERSON> went to <LOCATION>Paris</LOCATION> .)
sentences
- The string to be classified
Returns a String annotated with classification information.
public String classifyToString(String sentences)
Classify the contents of a String to a tagged word/class String. The PlainTextDocumentReaderAndWriter is used. Output looks like: My/O name/O is/O Bill/PERSON Smith/PERSON ./O
sentences
- The String to be classified
public List<Triple<String,Integer,Integer>> classifyToCharacterOffsets(String sentences)
Classify the contents of a String to classified character offset spans. Plain text or XML input text is expected and the PlainTextDocumentReaderAndWriter is used. Output is a (possibly empty, but not null) List of Triples. Each Triple is an entity name, followed by beginning and ending character offsets in the original String. Character offsets can be thought of as fenceposts between the characters, or, like certain methods in the Java String class, as character positions, numbered starting from 0, with the end index pointing to the position AFTER the entity ends. That is, end - start is the length of the entity in characters.
Fine points: Token offsets are true wrt the source text, even though the tokenizer may internally normalize certain tokens to String representations of different lengths (e.g., " becoming `` or ''). When a period counts as both part of an abbreviation and as an end of sentence marker, and that abbreviation is part of a named entity, the reported entity string excludes the period.
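The fencepost convention can be illustrated with plain String operations. The sketch below is only an illustration of the offset arithmetic: Span is a hypothetical stand-in for Triple<String,Integer,Integer>, and the offsets are written by hand rather than produced by a classifier.

```java
public class OffsetSketch {
    /** Hypothetical stand-in for a Triple: entity type plus [start, end) character offsets. */
    public static final class Span {
        public final String type;
        public final int start;
        public final int end; // exclusive: points just AFTER the last character of the entity

        public Span(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** Recover the entity text; the offsets follow the same convention as String.substring. */
    public static String text(String source, Span span) {
        return source.substring(span.start, span.end);
    }

    public static void main(String[] args) {
        String source = "Bill Smith went to Paris.";
        Span[] spans = { new Span("PERSON", 0, 10), new Span("LOCATION", 19, 24) };
        for (Span s : spans) {
            // end - start is the length of the entity in characters
            System.out.println(s.type + " [" + s.start + "," + s.end + ") len="
                    + (s.end - s.start) + " text=" + text(source, s));
        }
    }
}
```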
sentences
- The string to be classified
Returns a List of Triples, each of which gives an entity type and the beginning and ending character offsets.
public List<String> segmentString(String sentence)
sentence
- The string to be classified
public List<String> segmentString(String sentence, DocumentReaderAndWriter<IN> readerAndWriter)
public abstract List<IN> classify(List<IN> document)
Classify a List of something that extends CoreMap. The classifications are added in place to the items of the document, which is also returned by this method.
document
- A List of something that extends CoreMap.
Returns the same List, but with the elements annotated with their answers (stored under the CoreAnnotations.AnswerAnnotation key).
public abstract List<IN> classifyWithGlobalInformation(List<IN> tokenSequence, CoreMap document, CoreMap sentence)
Classify a List of something that extends CoreMap using as additional information whatever is stored in the document and sentence. This is needed for SUTime (NumberSequenceClassifier), which requires the document date to resolve relative dates.
tokenSequence
-
document
-
sentence
-
public void train()
public void train(String filename)
public void train(String filename, DocumentReaderAndWriter<IN> readerAndWriter)
public void train(String baseTrainDir, String trainFiles, DocumentReaderAndWriter<IN> readerAndWriter)
public void train(String[] trainFileList, DocumentReaderAndWriter<IN> readerAndWriter)
public void train(Collection<List<IN>> docs)
docs
- An ObjectBank or a collection of sequences of IN
public abstract void train(Collection<List<IN>> docs, DocumentReaderAndWriter<IN> readerAndWriter)
docs
- An ObjectBank or a collection of sequences of IN
readerAndWriter
- A DocumentReaderAndWriter to use when loading test files
public ObjectBank<List<IN>> makeObjectBankFromString(String string, DocumentReaderAndWriter<IN> readerAndWriter)
string
- The String which will be the content of the ObjectBank
public ObjectBank<List<IN>> makeObjectBankFromFile(String filename, DocumentReaderAndWriter<IN> readerAndWriter)
public ObjectBank<List<IN>> makeObjectBankFromFiles(String[] trainFileList, DocumentReaderAndWriter<IN> readerAndWriter)
public ObjectBank<List<IN>> makeObjectBankFromFiles(String baseDir, String filePattern, DocumentReaderAndWriter<IN> readerAndWriter)
public ObjectBank<List<IN>> makeObjectBankFromFiles(Collection<File> files, DocumentReaderAndWriter<IN> readerAndWriter)
public ObjectBank<List<IN>> makeObjectBankFromReader(BufferedReader in, DocumentReaderAndWriter<IN> readerAndWriter)
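Set up an ObjectBank that will allow one to iterate over a collection of documents obtained from the passed in Reader. Reading uses the method specified in flags.documentReader, and for some reader choices, the column mapping given in flags.map.
in
- Input data
addNEWLCWords
- Do we add new lowercase words from this data to the word shape classifier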
public void printProbs(String filename, DocumentReaderAndWriter<IN> readerAndWriter)
filename
- The path to the specified file
public void printProbsDocuments(ObjectBank<List<IN>> documents)
Takes a List of documents and prints the likelihood of each possible label at each point.
documents
- A List of List of something that extends CoreMap.
public void classifyStdin(DocumentReaderAndWriter<IN> readerWriter) throws IOException
IOException
public abstract void printProbsDocument(List<IN> document)
public void classifyAndWriteAnswers(String testFile, DocumentReaderAndWriter<IN> readerWriter) throws IOException
testFile
- The file to test on.
IOException
public void classifyAndWriteAnswers(String testFile, OutputStream outStream, DocumentReaderAndWriter<IN> readerWriter) throws IOException
If the flag outputEncoding is defined, the output is written in that character encoding, otherwise in the system default character encoding.
IOException
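The encoding choice described for the OutputStream variant can be sketched with the standard java.io classes. This is an assumption-laden illustration, not the class's actual code: EncodingSketch, writerFor, and encodedLength are hypothetical names, and a null outputEncoding stands in for "flag not defined".

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;

public class EncodingSketch {
    /** Wrap outStream in a PrintWriter using the named encoding, or the platform default if null. */
    public static PrintWriter writerFor(OutputStream outStream, String outputEncoding) {
        Charset charset = (outputEncoding == null)
                ? Charset.defaultCharset()
                : Charset.forName(outputEncoding);
        return new PrintWriter(new OutputStreamWriter(outStream, charset), true);
    }

    /** Number of bytes produced when text is written through writerFor with the given encoding. */
    public static int encodedLength(String text, String outputEncoding) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        PrintWriter writer = writerFor(bytes, outputEncoding);
        writer.print(text);
        writer.flush();
        return bytes.size();
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // four characters, the last one non-ASCII
        System.out.println("UTF-8 bytes:      " + encodedLength(text, "UTF-8"));
        System.out.println("ISO-8859-1 bytes: " + encodedLength(text, "ISO-8859-1"));
    }
}
```

The same text occupies a different number of bytes under each encoding, which is why the flag matters when the output is consumed by another tool.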
public void classifyAndWriteAnswers(String baseDir, String filePattern, DocumentReaderAndWriter<IN> readerWriter) throws IOException
IOException
public void classifyAndWriteAnswers(Collection<File> testFiles, DocumentReaderAndWriter<IN> readerWriter) throws IOException
IOException
public void classifyAndWriteAnswers(ObjectBank<List<IN>> documents, PrintWriter printWriter, DocumentReaderAndWriter<IN> readerWriter) throws IOException
IOException
public void classifyAndWriteAnswersKBest(String testFile, int k, DocumentReaderAndWriter<IN> readerAndWriter) throws IOException
testFile
- The filename to test on.
IOException
public void classifyAndWriteAnswersKBest(ObjectBank<List<IN>> documents, int k, PrintWriter printWriter, DocumentReaderAndWriter<IN> readerAndWriter) throws IOException
documents
- The ObjectBank to test on.
IOException
public void classifyAndWriteViterbiSearchGraph(String testFile, String searchGraphPrefix, DocumentReaderAndWriter<IN> readerAndWriter) throws IOException
testFile
- The file to test on.
IOException
public void writeAnswers(List<IN> doc, PrintWriter printWriter, DocumentReaderAndWriter<IN> readerAndWriter) throws IOException
doc
- Documents to write out
printWriter
- Writer to use for output
IOException
- If there is an IO problem
public static boolean countResultsIOB(List<? extends CoreMap> doc, Counter<String> entityTP, Counter<String> entityFP, Counter<String> entityFN)
public static int tallyOneEntity(List<? extends CoreMap> doc, int index, Class<? extends CoreAnnotation<String>> source, Class<? extends CoreAnnotation<String>> target, Counter<String> positive, Counter<String> negative)
public static boolean countResults(List<? extends CoreMap> doc, Counter<String> entityTP, Counter<String> entityFP, Counter<String> entityFN)
public static void printResults(Counter<String> entityTP, Counter<String> entityFP, Counter<String> entityFN)
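The method summary describes printResults as printing precision, recall, and f1 for each key, given counters of true positives, false positives, and false negatives. The arithmetic can be sketched as follows. This sketch uses plain Map<String,Integer> instead of Counter<String>, the class PrfSketch is hypothetical, and returning 0.0 when a denominator is zero is an assumption of the sketch rather than documented behavior.

```java
import java.util.Map;
import java.util.TreeSet;

public class PrfSketch {
    /** Precision = TP / (TP + FP); 0.0 when nothing was predicted (sketch assumption). */
    public static double precision(double tp, double fp) {
        return tp + fp == 0.0 ? 0.0 : tp / (tp + fp);
    }

    /** Recall = TP / (TP + FN); 0.0 when there were no gold entities (sketch assumption). */
    public static double recall(double tp, double fn) {
        return tp + fn == 0.0 ? 0.0 : tp / (tp + fn);
    }

    /** F1 is the harmonic mean of precision and recall. */
    public static double f1(double p, double r) {
        return p + r == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }

    /** Print one line of scores per entity type seen in any of the three maps. */
    public static void printResults(Map<String, Integer> tp, Map<String, Integer> fp,
                                    Map<String, Integer> fn) {
        TreeSet<String> keys = new TreeSet<>(tp.keySet());
        keys.addAll(fp.keySet());
        keys.addAll(fn.keySet());
        for (String key : keys) {
            double p = precision(tp.getOrDefault(key, 0), fp.getOrDefault(key, 0));
            double r = recall(tp.getOrDefault(key, 0), fn.getOrDefault(key, 0));
            System.out.printf("%s\tP=%.4f\tR=%.4f\tF1=%.4f%n", key, p, r, f1(p, r));
        }
    }

    public static void main(String[] args) {
        printResults(Map.of("PERSON", 8), Map.of("PERSON", 2), Map.of("PERSON", 2, "LOCATION", 1));
    }
}
```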
public abstract void serializeClassifier(String serializePath)
serializePath
- The path/filename to write the classifier to.
public void loadClassifierNoExceptions(InputStream in, Properties props)
in
- The InputStream to read from
public void loadClassifier(InputStream in) throws IOException, ClassCastException, ClassNotFoundException
in
- The InputStream to load the serialized classifier from
IOException
- If there are problems accessing the input stream
ClassCastException
- If there are problems interpreting the serialized data
ClassNotFoundException
- If there are problems interpreting the serialized data
public void loadClassifier(InputStream in, Properties props) throws IOException, ClassCastException, ClassNotFoundException
in
- The InputStream to load the serialized classifier from
props
- This Properties object will be used to update the SeqClassifierFlags which are read from the serialized classifier
IOException
- If there are problems accessing the input stream
ClassCastException
- If there are problems interpreting the serialized data
ClassNotFoundException
- If there are problems interpreting the serialized data
public abstract void loadClassifier(ObjectInputStream in, Properties props) throws IOException, ClassCastException, ClassNotFoundException
in
- The InputStream to load the serialized classifier from
props
- This Properties object will be used to update the SeqClassifierFlags which are read from the serialized classifier
IOException
- If there are problems accessing the input stream
ClassCastException
- If there are problems interpreting the serialized data
ClassNotFoundException
- If there are problems interpreting the serialized data
public void loadClassifier(String loadPath) throws ClassCastException, IOException, ClassNotFoundException
ClassCastException
IOException
ClassNotFoundException
public void loadClassifier(String loadPath, Properties props) throws ClassCastException, IOException, ClassNotFoundException
ClassCastException
IOException
ClassNotFoundException
public void loadClassifierNoExceptions(String loadPath)
public void loadClassifierNoExceptions(String loadPath, Properties props)
public void loadClassifier(File file) throws ClassCastException, IOException, ClassNotFoundException
ClassCastException
IOException
ClassNotFoundException
public void loadClassifier(File file, Properties props) throws ClassCastException, IOException, ClassNotFoundException
file
- Loads a classifier from this file.
props
- Properties in this object will be used to overwrite those specified in the serialized classifier
IOException
- If there are problems accessing the input stream
ClassCastException
- If there are problems interpreting the serialized data
ClassNotFoundException
- If there are problems interpreting the serialized data
public void loadClassifierNoExceptions(File file)
public void loadClassifierNoExceptions(File file, Properties props)
public void loadJarClassifier(String modelName, Properties props)
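This function will load a classifier that is stored inside a jar file (if it is so stored). The path in the jar file (/classifiers/) is coded in this class. If the classifier is not stored in the jar file or this is not run from inside a jar file, then this function will throw a RuntimeException.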
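The modelName rule for this method (a name ending in .gz is treated as gzip compressed) can be sketched with java.util.zip. The sketch below is not this class's loading code: GzipNameSketch and its methods are hypothetical, the stream is supplied by the caller instead of being looked up under /classifiers/ in a jar, and the file names are made up.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipNameSketch {
    /** Decompress on the fly only if the model name ends in .gz; otherwise pass the stream through. */
    public static InputStream openModel(String modelName, InputStream raw) {
        try {
            return modelName.endsWith(".gz") ? new GZIPInputStream(raw) : raw;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper for the demo: gzip-compress a byte array in memory. */
    public static byte[] gzip(byte[] data) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(data);
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper for the demo: read a whole stream as UTF-8 text. */
    public static String readAll(InputStream in) {
        try {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] payload = "serialized model bytes".getBytes(StandardCharsets.UTF_8);
        // Hypothetical model names; only the suffix matters here.
        System.out.println(readAll(openModel("example.ser.gz", new ByteArrayInputStream(gzip(payload)))));
        System.out.println(readAll(openModel("example.ser", new ByteArrayInputStream(payload))));
    }
}
```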
modelName
- The name of the model file. Iff it ends in .gz, then it is assumed to be gzip compressed.
props
- A Properties object which can override certain properties in the serialized file, such as the DocumentReaderAndWriter. You can pass in null to override nothing.
protected void printFeatures(IN wi, Collection<String> features)
protected void printFeatureLists(IN wi, Collection<List<String>> features)