edu.stanford.nlp.ie
Class AbstractSequenceClassifier

java.lang.Object
  extended by edu.stanford.nlp.ie.AbstractSequenceClassifier
All Implemented Interfaces:
Function<String,String>
Direct Known Subclasses:
CRFClassifier

public abstract class AbstractSequenceClassifier
extends Object
implements Function<String,String>

This class provides common functionality for (probabilistic) sequence models. It is a superclass of our CMM and CRF sequence classifiers, and is even used in the (deterministic) NumberSequenceClassifier. See implementing classes for more information.

Author:
Jenny Finkel, Dan Klein, Christopher Manning, Dan Cer

Field Summary
 Index<String> classIndex
           
 FeatureFactory featureFactory
           
 SeqClassifierFlags flags
           
static String JAR_CLASSIFIER_PATH
           
protected  Set<String> knownLCWords
           
protected  CoreLabel pad
           
protected  DocumentReaderAndWriter readerAndWriter
           
 int windowSize
           
 
Constructor Summary
AbstractSequenceClassifier(Properties props)
          Construct a SeqClassifierFlags object based on the passed in properties, and then call the other constructor.
AbstractSequenceClassifier(SeqClassifierFlags flags)
          Initialize the featureFactory and other variables based on the passed-in flags.
 
Method Summary
 String apply(String in)
          Maps a String input to an XML-formatted rendition of applying NER to the String.
 String backgroundSymbol()
           
abstract  List<CoreLabel> classify(List<CoreLabel> document)
          Classify a List of CoreLabels.
 List<List<CoreLabel>> classify(String str)
          Classify the tokens in a String.
 void classifyAndWriteAnswers(Collection<File> testFiles)
           
 void classifyAndWriteAnswers(String testFile)
          Load a test file, run the classifier on it, and then print the answers to stdout (with timing to stderr).
 void classifyAndWriteAnswers(String baseDir, String filePattern)
           
 void classifyAndWriteAnswersKBest(String testFile, int k)
          Load a test file, run the classifier on it, and then print the answers to stdout (with timing to stderr).
 void classifyAndWriteViterbiSearchGraph(String testFile, String searchGraphPrefix)
          Load a test file, run the classifier on it, and then write a Viterbi search graph for each sequence.
 List<List<CoreLabel>> classifyFile(String filename)
          Classify the contents of a file.
 Counter<List<CoreLabel>> classifyKBest(List<CoreLabel> doc, Class<? extends CoreAnnotation<String>> answerField, int k)
           
 List<CoreLabel> classifySentence(List<? extends HasWord> sentence)
          Classify a Sentence.
 List<Triple<String,Integer,Integer>> classifyToCharacterOffsets(String sentences)
          Classify the contents of a String.
 String classifyToString(String sentences)
          Classify the contents of a String to a tagged word/class String.
 String classifyToString(String sentences, String outputFormat, boolean preserveSpacing)
          Classify the contents of a String.
 List<CoreLabel> classifyWithCasing(List<CoreLabel> sentence)
          Classify a List of CoreLabels using a TrueCasingDocumentReader.
 String classifyWithInlineXML(String sentences)
          Classify the contents of a String.
 Sampler<List<CoreLabel>> getSampler(List<? extends CoreLabel> input)
           
 SequenceModel getSequenceModel(List<? extends CoreLabel> doc)
           
 DFSA<String,Integer> getViterbiSearchGraph(List<CoreLabel> doc, Class<? extends CoreAnnotation<String>> answerField)
           
 Set<String> labels()
           
 void loadClassifier(File file)
           
 void loadClassifier(File file, Properties props)
          Loads a classifier from the file specified.
 void loadClassifier(InputStream in)
          Load a classifier from the specified InputStream.
 void loadClassifier(InputStream in, Properties props)
          Load a classifier from the specified InputStream.
abstract  void loadClassifier(ObjectInputStream in, Properties props)
          Load a classifier from the specified input stream.
 void loadClassifier(String loadPath)
          Loads a classifier from the file specified by loadPath.
 void loadClassifierNoExceptions(File file)
           
 void loadClassifierNoExceptions(File file, Properties props)
           
 void loadClassifierNoExceptions(InputStream in)
          Loads a classifier from the given input stream.
 void loadClassifierNoExceptions(String loadPath)
           
 void loadClassifierNoExceptions(String loadPath, Properties props)
           
 void loadJarClassifier(String modelName, Properties props)
          This function will load a classifier that is stored inside a jar file (if it is so stored).
 ObjectBank<List<CoreLabel>> makeObjectBankFromFile(String filename)
           
 ObjectBank<List<CoreLabel>> makeObjectBankFromFiles(Collection<File> files)
           
 ObjectBank<List<CoreLabel>> makeObjectBankFromFiles(String[] trainFileList)
           
 ObjectBank<List<CoreLabel>> makeObjectBankFromFiles(String baseDir, String filePattern)
           
protected  ObjectBank<List<CoreLabel>> makeObjectBankFromReader(BufferedReader in)
          Set up an ObjectBank that will allow one to iterate over a collection of documents obtained from the passed in Reader.
 ObjectBank<List<CoreLabel>> makeObjectBankFromString(String string)
          Reads a String into an ObjectBank object.
 void printProbs(String filename)
          Takes the file, reads it in, and prints out the likelihood of each possible label at each point.
abstract  void printProbsDocument(List<CoreLabel> document)
           
 void printProbsDocuments(ObjectBank<List<CoreLabel>> documents)
          Takes a List of documents and prints the likelihood of each possible label at each point.
protected  void reinit()
          This method should be called after there have been changes to the flags (SeqClassifierFlags) variable, such as after deserializing a classifier.
 List<String> segmentString(String sentence)
          Segment a String into words. Use only if a Chinese word segmenter has been loaded.
abstract  void serializeClassifier(String serializePath)
          Serialize a sequence classifier to a file on the given path.
 void train()
          Train the classifier based on values in flags.
abstract  void train(ObjectBank<List<CoreLabel>> docs)
           
 void train(String filename)
           
 void train(String[] trainFileList)
           
 void train(String baseTrainDir, String trainFiles)
           
 void writeAnswers(List<CoreLabel> doc)
          Write the classifications of the Sequence classifier out to stdout in a format determined by the DocumentReaderAndWriter used.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

JAR_CLASSIFIER_PATH

public static final String JAR_CLASSIFIER_PATH
See Also:
Constant Field Values

flags

public SeqClassifierFlags flags

classIndex

public Index<String> classIndex

readerAndWriter

protected DocumentReaderAndWriter readerAndWriter

featureFactory

public FeatureFactory featureFactory

pad

protected CoreLabel pad

windowSize

public int windowSize

knownLCWords

protected Set<String> knownLCWords
Constructor Detail

AbstractSequenceClassifier

public AbstractSequenceClassifier(Properties props)
Construct a SeqClassifierFlags object based on the passed in properties, and then call the other constructor.

Parameters:
props - See SeqClassifierFlags for known properties.

AbstractSequenceClassifier

public AbstractSequenceClassifier(SeqClassifierFlags flags)
Initialize the featureFactory and other variables based on the passed-in flags.

Parameters:
flags - A specification of the AbstractSequenceClassifier to construct.
Method Detail

reinit

protected final void reinit()
This method should be called after there have been changes to the flags (SeqClassifierFlags) variable, such as after deserializing a classifier. It is called inside the loadClassifier methods. It assumes that the flags variable and the pad variable exist, but reinitializes things like the pad variable, featureFactory and readerAndWriter based on the flags.

Implementation note: At the moment this method doesn't set windowSize or featureFactory, since they are being serialized separately in the file, but we should probably stop serializing them and just reinitialize them from the flags.


backgroundSymbol

public String backgroundSymbol()

labels

public Set<String> labels()

classifySentence

public List<CoreLabel> classifySentence(List<? extends HasWord> sentence)
Classify a Sentence.

Parameters:
sentence - The Sentence to be classified.
Returns:
The classified Sentence, where the classifier output for each token is stored in its "answer" field.

getSequenceModel

public SequenceModel getSequenceModel(List<? extends CoreLabel> doc)

getSampler

public Sampler<List<CoreLabel>> getSampler(List<? extends CoreLabel> input)

classifyKBest

public Counter<List<CoreLabel>> classifyKBest(List<CoreLabel> doc,
                                              Class<? extends CoreAnnotation<String>> answerField,
                                              int k)

getViterbiSearchGraph

public DFSA<String,Integer> getViterbiSearchGraph(List<CoreLabel> doc,
                                                  Class<? extends CoreAnnotation<String>> answerField)

classifyWithCasing

public List<CoreLabel> classifyWithCasing(List<CoreLabel> sentence)
Classify a List of CoreLabels using a TrueCasingDocumentReader. Note: This was fairly quickly added to build a Truecaser. It may be revised or disappear.

Parameters:
sentence - A list of CoreLabels to be classified
Returns:
The classified list.

classify

public List<List<CoreLabel>> classify(String str)
Classify the tokens in a String. Each sentence becomes a separate document.

Parameters:
str - A String with tokens in one or more sentences of text to be classified.
Returns:
List of classified sentences (each a List of CoreLabels).

classifyFile

public List<List<CoreLabel>> classifyFile(String filename)
Classify the contents of a file.

Parameters:
filename - Contains the sentence(s) to be classified.
Returns:
List of classified Sentences.

apply

public String apply(String in)
Maps a String input to an XML-formatted rendition of applying NER to the String. Implements the Function interface. Calls classifyWithInlineXML(String) [q.v.].

Specified by:
apply in interface Function<String,String>
Parameters:
in - The function's argument
Returns:
The function's evaluated value

classifyToString

public String classifyToString(String sentences,
                               String outputFormat,
                               boolean preserveSpacing)
Classify the contents of a String. Plain text or XML input is expected and the PlainTextDocumentReaderAndWriter is used. The classifier will tokenize the text and treat each sentence as a separate document. The output can be specified to be in a choice of three formats: slashTags (e.g., Bill/PERSON Smith/PERSON died/O ./O), inlineXML (e.g., <PERSON>Bill Smith</PERSON> went to <LOCATION>Paris</LOCATION> .), or xml, for stand-off XML (e.g., <wi num="0" entity="PERSON">Sue</wi> <wi num="1" entity="O">shouted</wi> ). There is also a binary choice as to whether the spacing between tokens of the original is preserved or whether the (tagged) tokens are printed with a single space (for inlineXML or slashTags) or a single newline (for xml) between each one.

Fine points: The slashTags and xml formats show tokens as transformed by any normalization processes inside the tokenizer, while inlineXML shows the tokens exactly as they appeared in the source text. When a period counts as both part of an abbreviation and as an end of sentence marker, it is included twice in the output String for slashTags or xml, but only once for inlineXML, where it is not counted as part of the abbreviation (or any named entity it is part of). For slashTags with preserveSpacing=true, there will be two successive periods such as "Jr.." The tokenized (preserveSpacing=false) output will have a space or a newline after the last token.

Parameters:
sentences - The String to be classified. It will be tokenized and divided into documents according to (heuristically determined) sentence boundaries.
outputFormat - The format to put the output in: one of "slashTags", "xml", or "inlineXML"
preserveSpacing - Whether to preserve the input spacing between tokens, which may sometimes be none (true) or whether to tokenize the text and print it with one space between each token (false)
Returns:
A String annotated with classification information.
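
The slashTags format described above can be read back into token/label pairs with a few lines of plain Java. The following is only a sketch: SlashTagsParser and its parse method are hypothetical names, not part of this class, and the sketch assumes output produced with preserveSpacing=false so that tagged tokens are separated by whitespace.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of the Stanford API) for reading slashTags output. */
final class SlashTagsParser {

    /** Splits whitespace-separated token/LABEL items into {token, label} pairs. */
    static List<String[]> parse(String tagged) {
        List<String[]> pairs = new ArrayList<>();
        for (String item : tagged.trim().split("\\s+")) {
            if (item.isEmpty()) {
                continue; // splitting an empty input yields a single empty item
            }
            // Split on the LAST slash so that tokens containing '/' stay intact.
            int slash = item.lastIndexOf('/');
            if (slash < 0) {
                pairs.add(new String[] {item, ""}); // no label present
            } else {
                pairs.add(new String[] {item.substring(0, slash), item.substring(slash + 1)});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Example string taken from the format description above.
        for (String[] pair : parse("Bill/PERSON Smith/PERSON died/O ./O")) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

Splitting on the last slash is a design choice for tokens such as dates or URLs; it would misread a label that itself contains a slash.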

classifyWithInlineXML

public String classifyWithInlineXML(String sentences)
Classify the contents of a String. Plain text or XML input is expected and the PlainTextDocumentReaderAndWriter is used. The classifier will treat each sentence as a separate document. Output is in inline XML format (e.g., <PERSON>Bill Smith</PERSON> went to <LOCATION>Paris</LOCATION> .).

Parameters:
sentences - The string to be classified
Returns:
A String annotated with classification information.
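
As an illustration of consuming the inline XML format, the sketch below pulls (entity type, text) pairs out of such a String with a regular expression. InlineXmlEntities is a hypothetical helper, not part of this class. It assumes the input text was plain text: if the source was itself XML, its own tags would match the pattern too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of the Stanford API) for reading inline XML output. */
final class InlineXmlEntities {

    // An opening tag, the shortest possible content, and the matching closing tag.
    // The backreference \1 forces the closing tag name to equal the opening one.
    private static final Pattern ENTITY =
            Pattern.compile("<([A-Za-z_]+)>(.*?)</\\1>", Pattern.DOTALL);

    /** Returns {type, text} pairs in order of appearance. */
    static List<String[]> extract(String inlineXml) {
        List<String[]> entities = new ArrayList<>();
        Matcher m = ENTITY.matcher(inlineXml);
        while (m.find()) {
            entities.add(new String[] {m.group(1), m.group(2)});
        }
        return entities;
    }

    public static void main(String[] args) {
        // Example string taken from the description above.
        String out = "<PERSON>Bill Smith</PERSON> went to <LOCATION>Paris</LOCATION> .";
        for (String[] e : extract(out)) {
            System.out.println(e[0] + "\t" + e[1]);
        }
    }
}
```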

classifyToString

public String classifyToString(String sentences)
Classify the contents of a String to a tagged word/class String. Plain text or XML input is expected and the PlainTextDocumentReaderAndWriter is used. Output looks like: My/O name/O is/O Bill/PERSON Smith/PERSON ./O

Parameters:
sentences - The String to be classified
Returns:
A String annotated with classification information.

classifyToCharacterOffsets

public List<Triple<String,Integer,Integer>> classifyToCharacterOffsets(String sentences)
Classify the contents of a String. Plain text or XML input text is expected and the PlainTextDocumentReaderAndWriter is used. Output is a (possibly empty, but not null) List of Triples. Each Triple is an entity name, followed by beginning and ending character offsets in the original String. Character offsets can be thought of as fenceposts between the characters, or, like certain methods in the Java String class, as character positions, numbered starting from 0, with the end index pointing to the position AFTER the entity ends. That is, end - start is the length of the entity in characters.

Fine points: Token offsets are true wrt the source text, even though the tokenizer may internally normalize certain tokens to String representations of different lengths (e.g., " becoming `` or ''). When a period counts as both part of an abbreviation and as an end of sentence marker, and that abbreviation is part of a named entity, the reported entity string excludes the period.

Parameters:
sentences - The string to be classified
Returns:
A List of Triples, each of which gives an entity type and the beginning and ending character offsets.
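
Because the end offset is exclusive, the offsets can be passed straight to String.substring. The sketch below shows this with a small stand-in class, Span, in place of Triple<String,Integer,Integer>. Both Span and OffsetDemo are hypothetical, and the offsets in main are written by hand for illustration rather than produced by a classifier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical demo of end-exclusive character offsets; not part of the Stanford API. */
final class OffsetDemo {

    /** Stand-in for one (entity type, begin, end) triple. */
    static final class Span {
        final String type;
        final int begin; // index of the first character of the entity
        final int end;   // index just AFTER the last character of the entity

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Recovers each entity's surface string from the original text. */
    static List<String> surfaceForms(String text, List<Span> spans) {
        List<String> out = new ArrayList<>();
        for (Span s : spans) {
            // substring uses the same convention: begin inclusive, end exclusive,
            // so the result always has length s.end - s.begin.
            out.add(s.type + ":" + text.substring(s.begin, s.end));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Bill Smith went to Paris .";
        List<Span> spans = Arrays.asList(
                new Span("PERSON", 0, 10),     // hand-written offsets
                new Span("LOCATION", 19, 24));
        System.out.println(surfaceForms(text, spans));
    }
}
```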

segmentString

public List<String> segmentString(String sentence)
Segment a String into words. Use only if a Chinese word segmenter has been loaded.

Parameters:
sentence - The string to be segmented
Returns:
List of words

classify

public abstract List<CoreLabel> classify(List<CoreLabel> document)
Classify a List of CoreLabels.

Parameters:
document - A List of CoreLabels.
Returns:
the same List, but with the elements annotated with their answers (with setAnswer()).

train

public void train()
Train the classifier based on values in flags. It will use the first of these variables that is defined: trainFiles (and baseTrainDir), trainFileList, trainFile.
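
The precedence rule above can be pictured as a simple first-match chain. This is a sketch only: TrainSourceChooser is a hypothetical class, and treating "defined" as "non-null and non-empty" is an assumption, not something this documentation states.

```java
/** Hypothetical illustration of the "first defined wins" rule; not part of the Stanford API. */
final class TrainSourceChooser {

    /**
     * Returns the name of the flag that would be used, checking trainFiles,
     * then trainFileList, then trainFile. Returns null if none is set.
     */
    static String choose(String trainFiles, String trainFileList, String trainFile) {
        if (isSet(trainFiles)) {
            return "trainFiles"; // used together with baseTrainDir
        }
        if (isSet(trainFileList)) {
            return "trainFileList";
        }
        if (isSet(trainFile)) {
            return "trainFile";
        }
        return null;
    }

    // Assumption: a flag counts as defined when it is non-null and non-empty.
    private static boolean isSet(String value) {
        return value != null && !value.isEmpty();
    }
}
```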


train

public void train(String filename)

train

public void train(String baseTrainDir,
                  String trainFiles)

train

public void train(String[] trainFileList)

train

public abstract void train(ObjectBank<List<CoreLabel>> docs)

makeObjectBankFromString

public ObjectBank<List<CoreLabel>> makeObjectBankFromString(String string)
Reads a String into an ObjectBank object. Note that the current implementation of ReaderIteratorFactory first tries to interpret each string as a filename, so this method yields unwanted results if it is applied to a string that is also the name of an existing file. It does at least print a warning.

Parameters:
string - The String which will be the content of the ObjectBank (ASSUMING THAT NO FILE OF THIS NAME EXISTS!)
Returns:
The ObjectBank
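
A caller who wants to rule out the filename collision before passing text to this method could check the file system first. The sketch below is one possible guard. StringOrFileGuard is a hypothetical helper, and whether ReaderIteratorFactory performs exactly this existence check is an assumption.

```java
import java.io.File;

/** Hypothetical pre-check for text that might collide with a file name. */
final class StringOrFileGuard {

    /** True if the given text, read as a path, names something that exists on disk. */
    static boolean namesExistingFile(String text) {
        return new File(text).exists();
    }

    /** Returns the text unchanged, or fails loudly if it could be mistaken for a file. */
    static String requireNotAFile(String text) {
        if (namesExistingFile(text)) {
            throw new IllegalArgumentException(
                    "Text matches an existing file name and may be read as a file: " + text);
        }
        return text;
    }
}
```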

makeObjectBankFromFile

public ObjectBank<List<CoreLabel>> makeObjectBankFromFile(String filename)

makeObjectBankFromFiles

public ObjectBank<List<CoreLabel>> makeObjectBankFromFiles(String[] trainFileList)

makeObjectBankFromFiles

public ObjectBank<List<CoreLabel>> makeObjectBankFromFiles(String baseDir,
                                                           String filePattern)

makeObjectBankFromFiles

public ObjectBank<List<CoreLabel>> makeObjectBankFromFiles(Collection<File> files)

makeObjectBankFromReader

protected ObjectBank<List<CoreLabel>> makeObjectBankFromReader(BufferedReader in)
Set up an ObjectBank that will allow one to iterate over a collection of documents obtained from the passed in Reader. Each document will be represented as a list of CoreLabel. If the ObjectBank iterator() is called until hasNext() returns false, then the Reader will be read till end of file, but no reading is done at the time of this call. Reading is done using the reading method specified in flags.documentReader, and for some reader choices, the column mapping given in flags.map.

Parameters:
in - Input data
addNEWLCWords - Whether to add new lowercase words from this data to the word shape classifier
Returns:
The list of documents
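
The lazy behaviour described above (nothing is read when the bank is created; the Reader is consumed only as the iterator advances) can be sketched in plain Java. LazyDocuments is a hypothetical class, documents are represented as lists of lines rather than CoreLabels, and the blank-line document separator is an assumption, since the real format depends on flags.documentReader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical lazy view of blank-line-separated documents; single use only. */
final class LazyDocuments implements Iterable<List<String>> {
    private final BufferedReader in;

    LazyDocuments(BufferedReader in) {
        this.in = in; // nothing is read at construction time
    }

    @Override
    public Iterator<List<String>> iterator() {
        return new Iterator<List<String>>() {
            private List<String> next; // look-ahead document, or null
            private boolean done;      // true once end of input was reached

            private void advance() {
                if (next != null || done) {
                    return;
                }
                List<String> doc = new ArrayList<>();
                try {
                    String line;
                    while ((line = in.readLine()) != null) {
                        if (line.trim().isEmpty()) {
                            if (!doc.isEmpty()) {
                                break; // a blank line closes the current document
                            }
                        } else {
                            doc.add(line);
                        }
                    }
                    if (line == null) {
                        done = true;
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                if (!doc.isEmpty()) {
                    next = doc;
                }
            }

            @Override
            public boolean hasNext() {
                advance();
                return next != null;
            }

            @Override
            public List<String> next() {
                advance();
                if (next == null) {
                    throw new NoSuchElementException();
                }
                List<String> result = next;
                next = null;
                return result;
            }
        };
    }
}
```

Because the underlying Reader is consumed as documents are produced, this sketch supports only one pass over the data.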

printProbs

public void printProbs(String filename)
Takes the file, reads it in, and prints out the likelihood of each possible label at each point.

Parameters:
filename - The path to the specified file

printProbsDocuments

public void printProbsDocuments(ObjectBank<List<CoreLabel>> documents)
Takes a List of documents and prints the likelihood of each possible label at each point.

Parameters:
documents - A List of List of CoreLabels.

printProbsDocument

public abstract void printProbsDocument(List<CoreLabel> document)

classifyAndWriteAnswers

public void classifyAndWriteAnswers(String testFile)
                             throws Exception
Load a test file, run the classifier on it, and then print the answers to stdout (with timing to stderr). This uses the value of flags.documentReader to determine testFile format.

Parameters:
testFile - The file to test on.
Throws:
Exception

classifyAndWriteAnswers

public void classifyAndWriteAnswers(String baseDir,
                                    String filePattern)
                             throws Exception
Throws:
Exception

classifyAndWriteAnswers

public void classifyAndWriteAnswers(Collection<File> testFiles)
                             throws Exception
Throws:
Exception

classifyAndWriteAnswersKBest

public void classifyAndWriteAnswersKBest(String testFile,
                                         int k)
                                  throws Exception
Load a test file, run the classifier on it, and then print the answers to stdout (with timing to stderr). This uses the value of flags.documentReader to determine testFile format.

Parameters:
testFile - The file to test on.
Throws:
Exception

classifyAndWriteViterbiSearchGraph

public void classifyAndWriteViterbiSearchGraph(String testFile,
                                               String searchGraphPrefix)
                                        throws Exception
Load a test file, run the classifier on it, and then write a Viterbi search graph for each sequence.

Parameters:
testFile - The file to test on.
Throws:
Exception

writeAnswers

public void writeAnswers(List<CoreLabel> doc)
                  throws Exception
Write the classifications of the Sequence classifier out to stdout in a format determined by the DocumentReaderAndWriter used. If the flag outputEncoding is defined, the output is written in that character encoding, otherwise in the system default character encoding.

Parameters:
doc - Documents to write out
Throws:
Exception - If an I/O problem occurs

serializeClassifier

public abstract void serializeClassifier(String serializePath)
Serialize a sequence classifier to a file on the given path.

Parameters:
serializePath - The path/filename to write the classifier to.

loadClassifierNoExceptions

public void loadClassifierNoExceptions(InputStream in)
Loads a classifier from the given input stream. The JVM shuts down (System.exit(1)) if there is an exception. This does not close the InputStream.

Parameters:
in - The InputStream to read from

loadClassifier

public void loadClassifier(InputStream in)
                    throws IOException,
                           ClassCastException,
                           ClassNotFoundException
Load a classifier from the specified InputStream. No extra properties are supplied. This does not close the InputStream.

Parameters:
in - The InputStream to load the serialized classifier from
Throws:
IOException - If there are problems accessing the input stream
ClassCastException - If there are problems interpreting the serialized data
ClassNotFoundException - If there are problems interpreting the serialized data

loadClassifier

public void loadClassifier(InputStream in,
                           Properties props)
                    throws IOException,
                           ClassCastException,
                           ClassNotFoundException
Load a classifier from the specified InputStream. The classifier is reinitialized from the flags serialized in the classifier. This does not close the InputStream.

Parameters:
in - The InputStream to load the serialized classifier from
props - This Properties object will be used to update the SeqClassifierFlags which are read from the serialized classifier
Throws:
IOException - If there are problems accessing the input stream
ClassCastException - If there are problems interpreting the serialized data
ClassNotFoundException - If there are problems interpreting the serialized data

loadClassifier

public abstract void loadClassifier(ObjectInputStream in,
                                    Properties props)
                             throws IOException,
                                    ClassCastException,
                                    ClassNotFoundException
Load a classifier from the specified input stream. The classifier is reinitialized from the flags serialized in the classifier.

Parameters:
in - The InputStream to load the serialized classifier from
props - This Properties object will be used to update the SeqClassifierFlags which are read from the serialized classifier
Throws:
IOException - If there are problems accessing the input stream
ClassCastException - If there are problems interpreting the serialized data
ClassNotFoundException - If there are problems interpreting the serialized data

loadClassifier

public void loadClassifier(String loadPath)
                    throws ClassCastException,
                           IOException,
                           ClassNotFoundException
Loads a classifier from the file specified by loadPath. If loadPath ends in .gz, uses a GZIPInputStream, else uses a regular FileInputStream.

Throws:
ClassCastException
IOException
ClassNotFoundException
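
The suffix-based choice between a GZIPInputStream and a plain FileInputStream can be sketched with the standard library alone. ModelStreams is a hypothetical helper, not this class's implementation, and the added buffering is a choice made for the sketch.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper that opens a model file, decompressing when the name ends in .gz. */
final class ModelStreams {

    static InputStream open(String loadPath) throws IOException {
        InputStream raw = new BufferedInputStream(new FileInputStream(loadPath));
        try {
            // GZIPInputStream reads the gzip header in its constructor,
            // so a file that is not valid gzip fails here.
            return loadPath.endsWith(".gz") ? new GZIPInputStream(raw) : raw;
        } catch (IOException e) {
            raw.close(); // do not leak the file handle when the header check fails
            throw e;
        }
    }
}
```

The caller owns the returned stream and should close it, for example with try-with-resources.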

loadClassifierNoExceptions

public void loadClassifierNoExceptions(String loadPath)

loadClassifierNoExceptions

public void loadClassifierNoExceptions(String loadPath,
                                       Properties props)

loadClassifier

public void loadClassifier(File file)
                    throws ClassCastException,
                           IOException,
                           ClassNotFoundException
Throws:
ClassCastException
IOException
ClassNotFoundException

loadClassifier

public void loadClassifier(File file,
                           Properties props)
                    throws ClassCastException,
                           IOException,
                           ClassNotFoundException
Loads a classifier from the file specified. If the file's name ends in .gz, uses a GZIPInputStream, else uses a regular FileInputStream. This method closes the File when done.

Parameters:
file - The file to load the classifier from.
props - Properties in this object will be used to overwrite those specified in the serialized classifier
Throws:
IOException - If there are problems accessing the input stream
ClassCastException - If there are problems interpreting the serialized data
ClassNotFoundException - If there are problems interpreting the serialized data

loadClassifierNoExceptions

public void loadClassifierNoExceptions(File file)

loadClassifierNoExceptions

public void loadClassifierNoExceptions(File file,
                                       Properties props)

loadJarClassifier

public void loadJarClassifier(String modelName,
                              Properties props)
This function will load a classifier that is stored inside a jar file (if it is so stored). The classifier should be specified as its full filename, but the path in the jar file (/classifiers/) is coded in this class. If the classifier is not stored in the jar file or this is not run from inside a jar file, then this function will throw a RuntimeException.

Parameters:
modelName - The name of the model file. Iff it ends in .gz, then it is assumed to be gzip compressed.
props - A Properties object which can override certain properties in the serialized file, such as the DocumentReaderAndWriter. You can pass in null to override nothing.
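
The general pattern of loading a resource from a fixed directory on the classpath, and failing with a RuntimeException when it is absent, can be sketched as follows. JarModelStreams and CLASSIFIER_DIR are hypothetical names. The "/classifiers/" prefix is taken from the description above, but whether it equals the value of JAR_CLASSIFIER_PATH is an assumption.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper that opens a model stored as a classpath (jar) resource. */
final class JarModelStreams {

    // Directory inside the jar, as named in the description above.
    static final String CLASSIFIER_DIR = "/classifiers/";

    static InputStream open(String modelName) {
        InputStream in = JarModelStreams.class.getResourceAsStream(CLASSIFIER_DIR + modelName);
        if (in == null) {
            // getResourceAsStream returns null when the resource is not on the classpath.
            throw new RuntimeException("Model not found on classpath: " + CLASSIFIER_DIR + modelName);
        }
        try {
            // Names ending in .gz are treated as gzip compressed.
            return modelName.endsWith(".gz") ? new GZIPInputStream(in) : in;
        } catch (IOException e) {
            try {
                in.close();
            } catch (IOException ignored) {
                // best effort: the original failure is the one worth reporting
            }
            throw new RuntimeException("Could not read model: " + modelName, e);
        }
    }
}
```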


Stanford NLP Group