java.lang.Object
  edu.stanford.nlp.ie.AbstractSequenceClassifier
public abstract class AbstractSequenceClassifier
This class provides common functionality for (probabilistic) sequence models. It is a superclass of our CMM and CRF sequence classifiers, and is even used in the (deterministic) NumberSequenceClassifier. See implementing classes for more information.
Field Summary | |
---|---|
Index<String> | classIndex |
FeatureFactory | featureFactory |
SeqClassifierFlags | flags |
static String | JAR_CLASSIFIER_PATH |
protected Set<String> | knownLCWords |
protected CoreLabel | pad |
protected DocumentReaderAndWriter | readerAndWriter |
int | windowSize |
Constructor Summary | |
---|---|
AbstractSequenceClassifier(Properties props) | Construct a SeqClassifierFlags object based on the passed-in properties, and then call the other constructor. |
AbstractSequenceClassifier(SeqClassifierFlags flags) | Initialize the featureFactory and other variables based on the passed-in flags. |
Method Summary | |
---|---|
String | apply(String in): Maps a String input to an XML-formatted rendition of applying NER to the String. |
String | backgroundSymbol() |
abstract List<CoreLabel> | classify(List<CoreLabel> document): Classify a List of CoreLabels. |
List<List<CoreLabel>> | classify(String str): Classify the tokens in a String. |
void | classifyAndWriteAnswers(Collection<File> testFiles) |
void | classifyAndWriteAnswers(String testFile): Load a test file, run the classifier on it, and then print the answers to stdout (with timing to stderr). |
void | classifyAndWriteAnswers(String baseDir, String filePattern) |
void | classifyAndWriteAnswersKBest(String testFile, int k): Load a test file, run the classifier on it, and then print the answers to stdout (with timing to stderr). |
void | classifyAndWriteViterbiSearchGraph(String testFile, String searchGraphPrefix): Load a test file, run the classifier on it, and then write a Viterbi search graph for each sequence. |
List<List<CoreLabel>> | classifyFile(String filename): Classify the contents of a file. |
Counter<List<CoreLabel>> | classifyKBest(List<CoreLabel> doc, Class<? extends CoreAnnotation<String>> answerField, int k) |
List<CoreLabel> | classifySentence(List<? extends HasWord> sentence): Classify a Sentence. |
List<Triple<String,Integer,Integer>> | classifyToCharacterOffsets(String sentences): Classify the contents of a String. |
String | classifyToString(String sentences): Classify the contents of a String to a tagged word/class String. |
String | classifyToString(String sentences, String outputFormat, boolean preserveSpacing): Classify the contents of a String. |
List<CoreLabel> | classifyWithCasing(List<CoreLabel> sentence): Classify a List of CoreLabels using a TrueCasingDocumentReader. |
String | classifyWithInlineXML(String sentences): Classify the contents of a String. |
Sampler<List<CoreLabel>> | getSampler(List<? extends CoreLabel> input) |
SequenceModel | getSequenceModel(List<? extends CoreLabel> doc) |
DFSA<String,Integer> | getViterbiSearchGraph(List<CoreLabel> doc, Class<? extends CoreAnnotation<String>> answerField) |
Set<String> | labels() |
void | loadClassifier(File file) |
void | loadClassifier(File file, Properties props): Loads a classifier from the file specified. |
void | loadClassifier(InputStream in): Load a classifier from the specified InputStream. |
void | loadClassifier(InputStream in, Properties props): Load a classifier from the specified InputStream. |
abstract void | loadClassifier(ObjectInputStream in, Properties props): Load a classifier from the specified input stream. |
void | loadClassifier(String loadPath): Loads a classifier from the file specified by loadPath. |
void | loadClassifierNoExceptions(File file) |
void | loadClassifierNoExceptions(File file, Properties props) |
void | loadClassifierNoExceptions(InputStream in): Loads a classifier from the given input stream. |
void | loadClassifierNoExceptions(String loadPath) |
void | loadClassifierNoExceptions(String loadPath, Properties props) |
void | loadJarClassifier(String modelName, Properties props): This function will load a classifier that is stored inside a jar file (if it is so stored). |
ObjectBank<List<CoreLabel>> | makeObjectBankFromFile(String filename) |
ObjectBank<List<CoreLabel>> | makeObjectBankFromFiles(Collection<File> files) |
ObjectBank<List<CoreLabel>> | makeObjectBankFromFiles(String[] trainFileList) |
ObjectBank<List<CoreLabel>> | makeObjectBankFromFiles(String baseDir, String filePattern) |
protected ObjectBank<List<CoreLabel>> | makeObjectBankFromReader(BufferedReader in): Set up an ObjectBank that will allow one to iterate over a collection of documents obtained from the passed-in Reader. |
ObjectBank<List<CoreLabel>> | makeObjectBankFromString(String string): Reads a String into an ObjectBank object. |
void | printProbs(String filename): Takes the file, reads it in, and prints out the likelihood of each possible label at each point. |
abstract void | printProbsDocument(List<CoreLabel> document) |
void | printProbsDocuments(ObjectBank<List<CoreLabel>> documents): Takes a List of documents and prints the likelihood of each possible label at each point. |
protected void | reinit(): This method should be called after there have been changes to the flags (SeqClassifierFlags) variable, such as after deserializing a classifier. |
List<String> | segmentString(String sentence): Use only if a Chinese word segmenter has been loaded. |
abstract void | serializeClassifier(String serializePath): Serialize a sequence classifier to a file on the given path. |
void | train(): Train the classifier based on values in flags. |
abstract void | train(ObjectBank<List<CoreLabel>> docs) |
void | train(String filename) |
void | train(String[] trainFileList) |
void | train(String baseTrainDir, String trainFiles) |
void | writeAnswers(List<CoreLabel> doc): Write the classifications of the Sequence classifier out to stdout in a format determined by the DocumentReaderAndWriter used. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final String JAR_CLASSIFIER_PATH
public SeqClassifierFlags flags
public Index<String> classIndex
protected DocumentReaderAndWriter readerAndWriter
public FeatureFactory featureFactory
protected CoreLabel pad
public int windowSize
protected Set<String> knownLCWords
Constructor Detail |
---|
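public AbstractSequenceClassifier(Properties props)
Construct a SeqClassifierFlags object based on the passed-in properties, and then call the other constructor.
Parameters:
props - See SeqClassifierFlags for known properties.
public AbstractSequenceClassifier(SeqClassifierFlags flags)
Initialize the featureFactory and other variables based on the passed-in flags.
Parameters:
flags - A specification of the AbstractSequenceClassifier to construct.
Method Detail |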
---|
protected final void reinit()
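This method should be called after there have been changes to the flags (SeqClassifierFlags) variable, such as after deserializing a classifier.
Implementation note: At the moment this method doesn't set windowSize or featureFactory, since they are being serialized separately in the file, but we should probably stop serializing them and just reinitialize them from the flags.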
public String backgroundSymbol()
public Set<String> labels()
public List<CoreLabel> classifySentence(List<? extends HasWord> sentence)
Classify a Sentence.
Parameters:
sentence - The Sentence to be classified.
Returns:
The classified Sentence, where the classifier output for each token is stored in its "answer" field.
public SequenceModel getSequenceModel(List<? extends CoreLabel> doc)
public Sampler<List<CoreLabel>> getSampler(List<? extends CoreLabel> input)
public Counter<List<CoreLabel>> classifyKBest(List<CoreLabel> doc, Class<? extends CoreAnnotation<String>> answerField, int k)
public DFSA<String,Integer> getViterbiSearchGraph(List<CoreLabel> doc, Class<? extends CoreAnnotation<String>> answerField)
public List<CoreLabel> classifyWithCasing(List<CoreLabel> sentence)
Classify a List of CoreLabels using a TrueCasingDocumentReader.
Parameters:
sentence - A list of CoreLabels to be classified.
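public List<List<CoreLabel>> classify(String str)
Classify the tokens in a String.
Parameters:
str - A String with tokens in one or more sentences of text to be classified.
Returns:
A List of classified sentences (each a List of CoreLabels).
public List<List<CoreLabel>> classifyFile(String filename)
Classify the contents of a file.
Parameters:
filename - Contains the sentence(s) to be classified.
Returns:
A List of classified Sentences.
public String apply(String in)
Maps a String input to an XML-formatted rendition of applying NER to the String.
Specified by:
apply in interface Function<String,String>
Parameters:
in - The function's argument.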
public String classifyToString(String sentences, String outputFormat, boolean preserveSpacing)
Classify the contents of a String. Plain text or XML input is
expected and the PlainTextDocumentReaderAndWriter is used.
The classifier will tokenize the text and treat each sentence as a
separate document.
The output can be specified to be in a choice of three formats: slashTags
(e.g., Bill/PERSON Smith/PERSON died/O ./O), inlineXML
(e.g., <PERSON>Bill Smith</PERSON>
went to <LOCATION>Paris</LOCATION> .), or xml, for stand-off
XML (e.g., <wi num="0" entity="PERSON">Sue</wi>
<wi num="1" entity="O">shouted</wi> ).
There is also a binary choice as to whether the spacing between tokens
of the original is preserved or whether the (tagged) tokens are printed
with a single space (for inlineXML or slashTags) or a single newline
(for xml) between each one.
Fine points: The slashTags and xml formats show tokens as transformed by any normalization processes inside the tokenizer, while inlineXML shows the tokens exactly as they appeared in the source text. When a period counts as both part of an abbreviation and as an end of sentence marker, it is included twice in the output String for slashTags or xml, but only once for inlineXML, where it is not counted as part of the abbreviation (or any named entity it is part of). For slashTags with preserveSpacing=true, there will be two successive periods such as "Jr.." The tokenized (preserveSpacing=false) output will have a space or a newline after the last token.
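As a rough illustration of consuming the slashTags format described above, the following sketch splits tokenized output (preserveSpacing=false) into word/label pairs. The class SlashTagsParser and its TaggedToken type are hypothetical helpers written for this example, not part of the library, and the sketch assumes that labels never contain a slash, so the last slash in each item separates word from label.

```java
import java.util.ArrayList;
import java.util.List;

public class SlashTagsParser {
    /** Hypothetical holder for one token and the label after its last slash. */
    static final class TaggedToken {
        final String word;
        final String label;

        TaggedToken(String word, String label) {
            this.word = word;
            this.label = label;
        }
    }

    /** Splits whitespace-separated word/LABEL items into pairs. */
    static List<TaggedToken> parse(String slashTags) {
        List<TaggedToken> tokens = new ArrayList<>();
        for (String item : slashTags.trim().split("\\s+")) {
            if (item.isEmpty()) {
                continue; // empty input yields a single empty item
            }
            int slash = item.lastIndexOf('/');
            if (slash < 0) {
                tokens.add(new TaggedToken(item, "")); // no label present
            } else {
                tokens.add(new TaggedToken(item.substring(0, slash), item.substring(slash + 1)));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Sample string taken from the format description above.
        for (TaggedToken t : parse("Bill/PERSON Smith/PERSON died/O ./O")) {
            System.out.println(t.word + "\t" + t.label);
        }
    }
}
```

Splitting on the last slash keeps tokens that themselves contain a slash (such as a fraction) intact. Output produced with preserveSpacing=true keeps the original spacing, so whitespace splitting may not line up with token boundaries there.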
Parameters:
sentences - The String to be classified. It will be tokenized and divided into documents according to (heuristically determined) sentence boundaries.
outputFormat - The format to put the output in: one of "slashTags", "xml", or "inlineXML".
preserveSpacing - Whether to preserve the input spacing between tokens, which may sometimes be none (true), or whether to tokenize the text and print it with one space between each token (false).
Returns:
A String annotated with classification information.
public String classifyWithInlineXML(String sentences)
Classify the contents of a String. Plain text or XML is
expected and the PlainTextDocumentReaderAndWriter is used.
The classifier will treat each sentence as a separate document.
Output is in inline XML format (e.g. <PERSON>Bill Smith</PERSON>
went to <LOCATION>Paris</LOCATION> .)
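One way to pull the entities back out of inline XML output is a regular expression with a backreference on the tag name. This is only a sketch: InlineXmlEntities is a hypothetical helper, and it assumes plain-text input (so the only tags present are entity tags) and no nesting of entity tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InlineXmlEntities {
    // <LABEL>text</LABEL>, where \1 forces the closing tag to match the opening one.
    private static final Pattern ENTITY =
            Pattern.compile("<([A-Za-z_][A-Za-z0-9_]*)>(.*?)</\\1>", Pattern.DOTALL);

    /** Returns {label, text} pairs in order of appearance. */
    static List<String[]> extract(String inlineXml) {
        List<String[]> entities = new ArrayList<>();
        Matcher m = ENTITY.matcher(inlineXml);
        while (m.find()) {
            entities.add(new String[] { m.group(1), m.group(2) });
        }
        return entities;
    }

    public static void main(String[] args) {
        // Sample string taken from the method description above.
        String tagged = "<PERSON>Bill Smith</PERSON> went to <LOCATION>Paris</LOCATION> .";
        for (String[] e : extract(tagged)) {
            System.out.println(e[0] + ": " + e[1]);
        }
    }
}
```

If the input itself was XML, its own elements would match this pattern as well, so a real consumer would need to filter by the label set reported by labels().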
Parameters:
sentences - The String to be classified.
Returns:
A String annotated with classification information.
public String classifyToString(String sentences)
Classify the contents of a String to a tagged word/class String. The PlainTextDocumentReaderAndWriter
is used. Output
looks like: My/O name/O is/O Bill/PERSON Smith/PERSON ./O
Parameters:
sentences - The String to be classified.
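Tagged output like the sample above is often post-processed by merging consecutive tokens that share a non-background label. The sketch below shows one way to do that; SlashTagsEntities is a hypothetical helper, and the background label is passed in explicitly (here "O", as in the sample; the classifier reports its own value through backgroundSymbol()).

```java
import java.util.ArrayList;
import java.util.List;

public class SlashTagsEntities {
    /** Merges runs of identically labelled, non-background tokens into "LABEL: words" strings. */
    static List<String> entities(String slashTags, String background) {
        List<String> found = new ArrayList<>();
        StringBuilder words = new StringBuilder();
        String current = background;
        for (String item : slashTags.trim().split("\\s+")) {
            int slash = item.lastIndexOf('/');
            if (slash < 0) {
                continue; // skip items without a label (including empty input)
            }
            String word = item.substring(0, slash);
            String label = item.substring(slash + 1);
            if (!label.equals(current)) {
                flush(found, current, words, background); // label changed: close the open run
                current = label;
            }
            if (!label.equals(background)) {
                if (words.length() > 0) {
                    words.append(' ');
                }
                words.append(word);
            }
        }
        flush(found, current, words, background);
        return found;
    }

    private static void flush(List<String> found, String label, StringBuilder words, String background) {
        if (!label.equals(background) && words.length() > 0) {
            found.add(label + ": " + words);
        }
        words.setLength(0);
    }

    public static void main(String[] args) {
        System.out.println(entities("My/O name/O is/O Bill/PERSON Smith/PERSON ./O", "O"));
    }
}
```

Because this format carries no begin/inside markers, two distinct entities of the same type that happen to be adjacent are merged into one by this approach.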
public List<Triple<String,Integer,Integer>> classifyToCharacterOffsets(String sentences)
Classify the contents of a String. Plain text or XML input text
is expected and the PlainTextDocumentReaderAndWriter is used.
Output is a (possibly empty, but not null) List of Triples.
Each Triple is an entity name, followed by beginning and ending
character offsets in the original String.
Character offsets can be thought of as fenceposts between the characters,
or, like certain methods in the Java String class, as character positions,
numbered starting from 0, with the end index pointing to the position
AFTER the entity ends. That is, end - start is the length of the entity
in characters.
Fine points: Token offsets are true wrt the source text, even though the tokenizer may internally normalize certain tokens to String representations of different lengths (e.g., " becoming `` or ''). When a period counts as both part of an abbreviation and as an end of sentence marker, and that abbreviation is part of a named entity, the reported entity string excludes the period.
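The fencepost convention above is the same one used by String.substring(begin, end), so an entity's text can be recovered directly from the source String. In the sketch below, Span is a hypothetical stand-in for Triple<String,Integer,Integer>, and the offsets are written by hand for illustration rather than produced by a classifier.

```java
import java.util.Arrays;
import java.util.List;

public class OffsetDemo {
    /** Hypothetical stand-in for Triple<String,Integer,Integer>: type, begin, end. */
    static final class Span {
        final String type;
        final int begin;
        final int end; // exclusive: points just past the last character

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Recovers the entity text; end - begin is its length in characters. */
    static String text(String source, Span span) {
        return source.substring(span.begin, span.end);
    }

    public static void main(String[] args) {
        String source = "Bill Smith went to Paris.";
        // Hand-written offsets for illustration only.
        List<Span> spans = Arrays.asList(
                new Span("PERSON", 0, 10),
                new Span("LOCATION", 19, 24));
        for (Span s : spans) {
            System.out.println(s.type + " [" + s.begin + "," + s.end + "): " + text(source, s));
        }
    }
}
```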
Parameters:
sentences - The String to be classified.
Returns:
A List of Triples, each of which gives an entity type and the beginning and ending character offsets.
public List<String> segmentString(String sentence)
Use only if a Chinese word segmenter has been loaded.
Parameters:
sentence - The String to be classified.
public abstract List<CoreLabel> classify(List<CoreLabel> document)
Classify a List of CoreLabels.
Parameters:
document - A List of CoreLabels.
Returns:
The same List, but with the elements annotated with their answers (with setAnswer()).
public void train()
public void train(String filename)
public void train(String baseTrainDir, String trainFiles)
public void train(String[] trainFileList)
public abstract void train(ObjectBank<List<CoreLabel>> docs)
public ObjectBank<List<CoreLabel>> makeObjectBankFromString(String string)
Reads a String into an ObjectBank object.
Parameters:
string - The String which will be the content of the ObjectBank (assuming that no file of this name exists!).
public ObjectBank<List<CoreLabel>> makeObjectBankFromFile(String filename)
public ObjectBank<List<CoreLabel>> makeObjectBankFromFiles(String[] trainFileList)
public ObjectBank<List<CoreLabel>> makeObjectBankFromFiles(String baseDir, String filePattern)
public ObjectBank<List<CoreLabel>> makeObjectBankFromFiles(Collection<File> files)
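protected ObjectBank<List<CoreLabel>> makeObjectBankFromReader(BufferedReader in)
Set up an ObjectBank that will allow one to iterate over a collection of documents obtained from the passed-in Reader. Reading uses the method specified in flags.documentReader and, for some reader choices, the column mapping given in flags.map.
Parameters:
in - Input data.
addNEWLCWords - Do we add new lowercase words from this data to the word shape classifier.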
public void printProbs(String filename)
Takes the file, reads it in, and prints out the likelihood of each possible label at each point.
Parameters:
filename - The path to the specified file.
public void printProbsDocuments(ObjectBank<List<CoreLabel>> documents)
Takes a List of documents and prints the likelihood of each possible label at each point.
Parameters:
documents - A List of Lists of CoreLabels.
public abstract void printProbsDocument(List<CoreLabel> document)
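public void classifyAndWriteAnswers(String testFile) throws Exception
Load a test file, run the classifier on it, and then print the answers to stdout (with timing to stderr).
Parameters:
testFile - The file to test on.
Throws:
Exception
public void classifyAndWriteAnswers(String baseDir, String filePattern) throws Exception
Throws:
Exception
public void classifyAndWriteAnswers(Collection<File> testFiles) throws Exception
Throws:
Exception
public void classifyAndWriteAnswersKBest(String testFile, int k) throws Exception
Load a test file, run the classifier on it, and then print the answers to stdout (with timing to stderr).
Parameters:
testFile - The file to test on.
Throws:
Exception
public void classifyAndWriteViterbiSearchGraph(String testFile, String searchGraphPrefix) throws Exception
Load a test file, run the classifier on it, and then write a Viterbi search graph for each sequence.
Parameters:
testFile - The file to test on.
Throws:
Exception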
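public void writeAnswers(List<CoreLabel> doc) throws Exception
Write the classifications of the Sequence classifier out to stdout in a format determined by the DocumentReaderAndWriter used. If
outputEncoding is defined, the output
is written in that character encoding, otherwise in the system default
character encoding.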
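The encoding fallback described above can be sketched with standard JDK classes. This is not the library's implementation: EncodedOutput and its methods are hypothetical, and the sketch treats a null or empty encoding name as "not defined".

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;

public class EncodedOutput {
    /** Uses the named charset when one is given, otherwise the system default. */
    static PrintWriter writerFor(OutputStream out, String outputEncoding) {
        Charset charset = (outputEncoding == null || outputEncoding.isEmpty())
                ? Charset.defaultCharset()
                : Charset.forName(outputEncoding);
        return new PrintWriter(new OutputStreamWriter(out, charset));
    }

    /** Writes text through writerFor and reports how many bytes were produced. */
    static int encodedLength(String text, String outputEncoding) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        PrintWriter writer = writerFor(bytes, outputEncoding);
        writer.print(text);
        writer.flush();
        return bytes.size();
    }

    public static void main(String[] args) {
        String line = "caf\u00e9/O"; // six characters, one of them non-ASCII
        System.out.println(encodedLength(line, "UTF-8"));      // 7 bytes
        System.out.println(encodedLength(line, "ISO-8859-1")); // 6 bytes
    }
}
```

The byte counts differ because the accented character takes two bytes in UTF-8 and one in ISO-8859-1, which is why the chosen output encoding matters when the answers are piped to another tool.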
Parameters:
doc - Documents to write out.
Throws:
Exception - If an IO problem occurs.
public abstract void serializeClassifier(String serializePath)
Serialize a sequence classifier to a file on the given path.
Parameters:
serializePath - The path/filename to write the classifier to.
public void loadClassifierNoExceptions(InputStream in)
Loads a classifier from the given input stream.
Parameters:
in - The InputStream to read from.
public void loadClassifier(InputStream in) throws IOException, ClassCastException, ClassNotFoundException
Load a classifier from the specified InputStream.
Parameters:
in - The InputStream to load the serialized classifier from.
Throws:
IOException - If there are problems accessing the input stream.
ClassCastException - If there are problems interpreting the serialized data.
ClassNotFoundException - If there are problems interpreting the serialized data.
public void loadClassifier(InputStream in, Properties props) throws IOException, ClassCastException, ClassNotFoundException
Load a classifier from the specified InputStream.
Parameters:
in - The InputStream to load the serialized classifier from.
props - This Properties object will be used to update the SeqClassifierFlags which are read from the serialized classifier.
Throws:
IOException - If there are problems accessing the input stream.
ClassCastException - If there are problems interpreting the serialized data.
ClassNotFoundException - If there are problems interpreting the serialized data.
public abstract void loadClassifier(ObjectInputStream in, Properties props) throws IOException, ClassCastException, ClassNotFoundException
Load a classifier from the specified input stream.
Parameters:
in - The InputStream to load the serialized classifier from.
props - This Properties object will be used to update the SeqClassifierFlags which are read from the serialized classifier.
Throws:
IOException - If there are problems accessing the input stream.
ClassCastException - If there are problems interpreting the serialized data.
ClassNotFoundException - If there are problems interpreting the serialized data.
public void loadClassifier(String loadPath) throws ClassCastException, IOException, ClassNotFoundException
Loads a classifier from the file specified by loadPath.
Throws:
ClassCastException
IOException
ClassNotFoundException
public void loadClassifierNoExceptions(String loadPath)
public void loadClassifierNoExceptions(String loadPath, Properties props)
public void loadClassifier(File file) throws ClassCastException, IOException, ClassNotFoundException
Throws:
ClassCastException
IOException
ClassNotFoundException
public void loadClassifier(File file, Properties props) throws ClassCastException, IOException, ClassNotFoundException
Loads a classifier from the file specified.
Parameters:
file - Loads a classifier from this file.
props - Properties in this object will be used to overwrite those specified in the serialized classifier.
Throws:
IOException - If there are problems accessing the input stream.
ClassCastException - If there are problems interpreting the serialized data.
ClassNotFoundException - If there are problems interpreting the serialized data.
public void loadClassifierNoExceptions(File file)
public void loadClassifierNoExceptions(File file, Properties props)
public void loadJarClassifier(String modelName, Properties props)
This function will load a classifier that is stored inside a jar file (if it is so stored). The path inside the jar file (/classifiers/) is
coded in this class. If the classifier is not stored in the jar file
or this is not run from inside a jar file, then this function will
throw a RuntimeException.
Parameters:
modelName - The name of the model file. Iff it ends in .gz, then it is assumed to be gzip compressed.
props - A Properties object which can override certain properties in the serialized file, such as the DocumentReaderAndWriter. You can pass in null to override nothing.
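The ".gz" rule for modelName can be illustrated with plain JDK streams. The sketch below is not the library's loader: ModelStreams, open and roundTrip are hypothetical names, the model names are placeholders, and an in-memory byte array stands in for the jar resource.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ModelStreams {
    /** Wraps the raw stream in a GZIPInputStream only when the name ends in ".gz". */
    static InputStream open(String modelName, InputStream raw) throws IOException {
        InputStream buffered = new BufferedInputStream(raw);
        return modelName.endsWith(".gz") ? new GZIPInputStream(buffered) : buffered;
    }

    /** Stores a payload under the given name (gzipped for ".gz" names) and reads it back. */
    static String roundTrip(String modelName, String payload) {
        try {
            byte[] stored = payload.getBytes(StandardCharsets.UTF_8);
            if (modelName.endsWith(".gz")) {
                ByteArrayOutputStream compressed = new ByteArrayOutputStream();
                try (GZIPOutputStream gz = new GZIPOutputStream(compressed)) {
                    gz.write(stored);
                }
                stored = compressed.toByteArray();
            }
            ByteArrayOutputStream result = new ByteArrayOutputStream();
            try (InputStream in = open(modelName, new ByteArrayInputStream(stored))) {
                byte[] buffer = new byte[4096];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    result.write(buffer, 0, n);
                }
            }
            return new String(result.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Both calls print the same payload; only the first goes through gzip.
        System.out.println(roundTrip("example-model.ser.gz", "placeholder model bytes"));
        System.out.println(roundTrip("example-model.ser", "placeholder model bytes"));
    }
}
```

Deciding by file name alone is simple but means a compressed model saved without the .gz suffix would be read as raw bytes and fail during deserialization.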