edu.stanford.nlp.ie.ner
Class CMMClassifier<IN extends CoreLabel>

java.lang.Object
  extended by edu.stanford.nlp.ie.AbstractSequenceClassifier<IN>
      extended by edu.stanford.nlp.ie.ner.CMMClassifier<IN>
All Implemented Interfaces:
DocumentProcessor, ListProcessor<Object,WordTag>, Function<String,String>

public class CMMClassifier<IN extends CoreLabel>
extends AbstractSequenceClassifier<IN>
implements DocumentProcessor, ListProcessor<Object,WordTag>

Performs sequence classification using a Conditional Markov Model (CMM). It could be used for other purposes, but the provided features are aimed at Named Entity Recognition. The code supports different document encodings, but with the standard ColumnDocumentReader, input files are expected to contain one word per line, with columns giving things like the word, POS, chunk, and class.
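As a rough illustration of the one-word-per-line column layout described above, the following sketch parses such text with plain Java. It is not the library's ColumnDocumentReader. The column order (word, POS, chunk, class), the whitespace separator, the blank line between sentences, and the names ColumnFormatSketch and Token are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not the library's ColumnDocumentReader.
final class ColumnFormatSketch {

  /** One input row: word, part-of-speech tag, chunk tag, and entity class. */
  static final class Token {
    final String word;
    final String pos;
    final String chunk;
    final String answer;

    Token(String word, String pos, String chunk, String answer) {
      this.word = word;
      this.pos = pos;
      this.chunk = chunk;
      this.answer = answer;
    }

    @Override
    public String toString() {
      return word + "/" + pos + "/" + chunk + "/" + answer;
    }
  }

  /**
   * Splits column-formatted text into sentences. Assumes four whitespace-separated
   * columns per line and a blank line between sentences (assumptions of this sketch).
   */
  static List<List<Token>> parse(String text) {
    List<List<Token>> sentences = new ArrayList<>();
    List<Token> current = new ArrayList<>();
    for (String line : text.split("\n", -1)) {
      String trimmed = line.trim();
      if (trimmed.isEmpty()) {
        // A blank line closes the sentence collected so far, if any.
        if (!current.isEmpty()) {
          sentences.add(current);
          current = new ArrayList<>();
        }
        continue;
      }
      String[] cols = trimmed.split("\\s+");
      if (cols.length != 4) {
        throw new IllegalArgumentException(
            "Expected 4 columns but found " + cols.length + ": " + line);
      }
      current.add(new Token(cols[0], cols[1], cols[2], cols[3]));
    }
    if (!current.isEmpty()) {
      sentences.add(current);
    }
    return sentences;
  }

  public static void main(String[] args) {
    String sample = String.join("\n",
        "John NNP I-NP PERSON",
        "lives VBZ I-VP O",
        "",
        "Paris NNP I-NP LOCATION");
    for (List<Token> sentence : parse(sample)) {
      System.out.println(sentence);
    }
  }
}
```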

Typical usage

For running a trained model with a provided serialized classifier:

java -server -mx1000m edu.stanford.nlp.ie.ner.CMMClassifier -loadClassifier conll.ner.gz -textFile samplesentences.txt

When specifying all parameters in a properties file (train, test, or runtime):

java -mx1000m edu.stanford.nlp.ie.ner.CMMClassifier -prop propFile

To train and test a model from the command line:

java -mx1000m edu.stanford.nlp.ie.ner.CMMClassifier -trainFile trainFile -testFile testFile -goodCoNLL > output

Features are defined by a FeatureFactory. The default FeatureFactory is NERFeatureFactory, and you should look there for feature templates. Features are specified either in a Properties file (the recommended method) or on the command line. The features are read into a SeqClassifierFlags object, which users need not know much about unless they wish to add new features.
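The sketch below shows one way properties-file text might be turned into a java.util.Properties object. The keys trainFile and testFile mirror the command-line flags shown above. The feature flags useWord and usePrev are assumed examples whose names should be confirmed against NERFeatureFactory. The class name PropFileSketch and the file contents are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper; only java.util.Properties itself is a real API here.
final class PropFileSketch {

  /** Parses properties-file text into a java.util.Properties object. */
  static Properties load(String text) {
    Properties props = new Properties();
    try (StringReader reader = new StringReader(text)) {
      props.load(reader);
    } catch (IOException e) {
      throw new IllegalStateException("Could not parse properties text", e);
    }
    return props;
  }

  public static void main(String[] args) {
    String text = String.join("\n",
        "# trainFile and testFile mirror the command-line flags.",
        "trainFile = trainFile",
        "testFile = testFile",
        "# Assumed feature flags; confirm the names in NERFeatureFactory.",
        "useWord = true",
        "usePrev = true");
    Properties props = load(text);
    System.out.println("trainFile -> " + props.getProperty("trainFile"));
    System.out.println("useWord   -> " + props.getProperty("useWord"));
  }
}
```

A Properties object built this way has the type that the CMMClassifier(Properties props) constructor accepts.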

CMMClassifier may also be used programmatically. To create a new instance, pass a Properties object to the constructor. Alternatively, obtain a CMMClassifier by deserializing one via getClassifier(String). You may then tag sentences using the assorted classify methods (for example, classify or classifySentence).

Author:
Dan Klein, Jenny Finkel, Christopher Manning, Shipra Dingare, Huy Nguyen, Sarah Spikes (sdspikes@cs.stanford.edu) - cleanup and filling in types

Field Summary
static String DEFAULT_CLASSIFIER
          Default place to look in Jar file for classifier.
 
Fields inherited from class edu.stanford.nlp.ie.AbstractSequenceClassifier
classIndex, featureFactory, flags, knownLCWords, pad, windowSize
 
Constructor Summary
protected CMMClassifier()
           
  CMMClassifier(Properties props)
           
 
Method Summary
 void adapt(ObjectBank<List<IN>> featureLabels, Dataset<String,String> trainDataset)
           
 void adapt(String filename, Dataset<String,String> trainDataset, DocumentReaderAndWriter<IN> readerWriter)
           
 List<IN> classify(List<IN> document)
          Classify a List of CoreLabels.
 List<IN> classifyWithGlobalInformation(List<IN> tokenSeq, CoreMap doc, CoreMap sent)
          Classify a List of something that extends CoreMap using as additional information whatever is stored in the document and sentence.
protected  String classOf(List<IN> lineInfos, int pos)
          Returns the most likely class for the word at the given position.
 Dataset<String,String> getBiasedDataset(ObjectBank<List<IN>> data, Index<String> featureIndex, Index<String> classIndex)
           
static CMMClassifier getClassifier(File file)
           
static CMMClassifier getClassifier(InputStream in)
           
static CMMClassifier getClassifier(String loadPath)
           
static CMMClassifier getClassifierNoExceptions(File file)
           
static CMMClassifier getClassifierNoExceptions(InputStream in)
           
static CMMClassifier getClassifierNoExceptions(String loadPath)
           
 Dataset<String,String> getDataset(Collection<List<IN>> data)
          Build a Dataset from some data.
 Dataset<String,String> getDataset(Collection<List<IN>> data, Index<String> featureIndex, Index<String> classIndex)
          Build a Dataset from some data.
 Dataset<String,String> getDataset(Dataset<String,String> oldData, Index<String> goodFeatures)
          Build a Dataset from some data.
 Dataset<String,String> getDataset(ObjectBank<List<IN>> data, Dataset<String,String> origDataset)
          Build a Dataset from some data.
static CMMClassifier getDefaultClassifier()
          Used to obtain the default classifier which is stored inside a jar file.
 Index<String> getFeaturesAboveThreshhold(Dataset<String,String> dataset, double thresh)
           
 SequenceModel getSequenceModel(List<IN> document)
           
 Set<String> getTags()
          Returns the Set of entities recognized by this Classifier.
 void loadClassifier(ObjectInputStream ois, Properties props)
          Load a classifier from the given Stream.
 void loadDefaultClassifier()
          Used to load the default supplied classifier.
 double loglikelihood(List<IN> lineInfos)
          Returns the log conditional likelihood of the given dataset.
static void main(String[] args)
          Command-line version of the classifier.
<T extends CoreLabel> Datum<String,String> makeDatum(List<IN> info, int loc, FeatureFactory featureFactory)
          Make an individual Datum out of the data list info, focused at position loc.
 void printProbsDocument(List<IN> document)
          Takes a List of CoreLabels and prints the likelihood of each possible label at each point.
 List<WordTag> process(List list)
          Assigns NER labels to the words in the given List.
 Document<?,?,WordTag> processDocument(Document in)
          Assigns NER labels to the words in the given Document.
 void retrain(ObjectBank<List<IN>> doc)
           
 void retrain(ObjectBank<List<IN>> featureLabels, Index<String> featureIndex, Index<String> labelIndex)
           
 Counter<String> scoresOf(List<IN> lineInfos, int pos)
           
 void serializeClassifier(String serializePath)
          Serialize a sequence classifier to a file on the given path.
 void train(Collection<List<IN>> wordInfos, DocumentReaderAndWriter<IN> readerAndWriter)
          Trains a classifier from a Collection of sequences.
 void trainSemiSup()
           
 double weight(String feature, String label)
           
 double[][] weights()
           
 
Methods inherited from class edu.stanford.nlp.ie.AbstractSequenceClassifier
apply, backgroundSymbol, classify, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswersKBest, classifyAndWriteAnswersKBest, classifyAndWriteViterbiSearchGraph, classifyFile, classifyKBest, classifyRaw, classifySentence, classifySentenceWithGlobalInformation, classifyStdin, classifyStdin, classifyToCharacterOffsets, classifyToString, classifyToString, classifyWithInlineXML, countResults, countResultsIOB, defaultReaderAndWriter, getSampler, getViterbiSearchGraph, labels, loadClassifier, loadClassifier, loadClassifier, loadClassifier, loadClassifier, loadClassifier, loadClassifierNoExceptions, loadClassifierNoExceptions, loadClassifierNoExceptions, loadClassifierNoExceptions, loadClassifierNoExceptions, loadJarClassifier, makeObjectBankFromFile, makeObjectBankFromFiles, makeObjectBankFromFiles, makeObjectBankFromFiles, makeObjectBankFromReader, makeObjectBankFromString, makePlainTextReaderAndWriter, makeReaderAndWriter, plainTextReaderAndWriter, printFeatureLists, printFeatures, printProbs, printProbsDocuments, printResults, reinit, segmentString, segmentString, tallyOneEntity, train, train, train, train, train, train, writeAnswers
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_CLASSIFIER

public static final String DEFAULT_CLASSIFIER
Default place to look in Jar file for classifier.

See Also:
Constant Field Values
Constructor Detail

CMMClassifier

protected CMMClassifier()

CMMClassifier

public CMMClassifier(Properties props)
Method Detail

getTags

public Set<String> getTags()
Returns the Set of entities recognized by this Classifier.

Returns:
The Set of entities recognized by this Classifier.

classify

public List<IN> classify(List<IN> document)
Classify a List of CoreLabels.

Specified by:
classify in class AbstractSequenceClassifier<IN extends CoreLabel>
Parameters:
document - A List of CoreLabels to be classified.
Returns:
The same List, but with the elements annotated with their answers (stored under the CoreAnnotations.AnswerAnnotation key).

classOf

protected String classOf(List<IN> lineInfos,
                         int pos)
Returns the most likely class for the word at the given position.
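To make "most likely class" concrete, here is a minimal sketch that picks the highest-scoring label from a map of per-label scores. It does not show how the library computes those scores. A plain Map stands in for the library's Counter<String>, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; a plain Map stands in for a Counter<String> of label scores.
final class ArgmaxLabelSketch {

  /** Returns the label with the highest score; ties keep whichever label the map iterates first. */
  static String mostLikelyLabel(Map<String, Double> scores) {
    if (scores.isEmpty()) {
      throw new IllegalArgumentException("No labels to choose from");
    }
    String best = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (Map.Entry<String, Double> entry : scores.entrySet()) {
      if (best == null || entry.getValue() > bestScore) {
        best = entry.getKey();
        bestScore = entry.getValue();
      }
    }
    return best;
  }

  public static void main(String[] args) {
    Map<String, Double> scores = new LinkedHashMap<>();
    scores.put("O", -0.2);
    scores.put("PERSON", 1.5);
    scores.put("LOCATION", 0.3);
    System.out.println(mostLikelyLabel(scores));
  }
}
```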


loglikelihood

public double loglikelihood(List<IN> lineInfos)
Returns the log conditional likelihood of the given dataset.

Returns:
The log conditional likelihood of the given dataset.
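In a conditional Markov model, the conditional likelihood of a label sequence is commonly factored into per-position terms, each conditioned on the previous label and the observation. The sketch below sums the logs of those terms for a toy model. It is not the library's implementation. The LocalScorer interface, the softmax normalization, and the start label are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical toy model; not the library's implementation of loglikelihood.
final class CmmLogLikelihoodSketch {

  /** Unnormalized score of a label given the previous label and the current word. */
  interface LocalScorer {
    double score(String prevLabel, String word, String label);
  }

  /** log P(label | prevLabel, word), computed as a numerically stable log-softmax. */
  static double logProb(LocalScorer scorer, List<String> labels,
                        String prevLabel, String word, String label) {
    double max = Double.NEGATIVE_INFINITY;
    for (String candidate : labels) {
      max = Math.max(max, scorer.score(prevLabel, word, candidate));
    }
    double sumExp = 0.0;
    for (String candidate : labels) {
      sumExp += Math.exp(scorer.score(prevLabel, word, candidate) - max);
    }
    return scorer.score(prevLabel, word, label) - max - Math.log(sumExp);
  }

  /** Sums the per-position log conditional probabilities of the gold labels. */
  static double logLikelihood(LocalScorer scorer, List<String> labels,
                              List<String> words, List<String> gold, String startLabel) {
    if (words.size() != gold.size()) {
      throw new IllegalArgumentException("words and gold labels must have the same length");
    }
    double total = 0.0;
    String prev = startLabel;
    for (int i = 0; i < words.size(); i++) {
      total += logProb(scorer, labels, prev, words.get(i), gold.get(i));
      prev = gold.get(i); // condition the next position on the gold label
    }
    return total;
  }

  public static void main(String[] args) {
    List<String> labels = Arrays.asList("O", "PERSON");
    // Toy scorer: capitalized words lean towards PERSON, everything else towards O.
    LocalScorer scorer = (prev, word, label) -> {
      boolean capitalized = Character.isUpperCase(word.charAt(0));
      return capitalized == label.equals("PERSON") ? 1.0 : 0.0;
    };
    double ll = logLikelihood(scorer, labels,
        Arrays.asList("John", "sleeps"), Arrays.asList("PERSON", "O"), "O");
    System.out.println("log conditional likelihood = " + ll);
  }
}
```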

getSequenceModel

public SequenceModel getSequenceModel(List<IN> document)
Overrides:
getSequenceModel in class AbstractSequenceClassifier<IN extends CoreLabel>

adapt

public void adapt(String filename,
                  Dataset<String,String> trainDataset,
                  DocumentReaderAndWriter<IN> readerWriter)
Parameters:
filename - adaptation file
trainDataset - original dataset (used in training)

adapt

public void adapt(ObjectBank<List<IN>> featureLabels,
                  Dataset<String,String> trainDataset)
Parameters:
featureLabels - adaptation docs
trainDataset - original dataset (used in training)

retrain

public void retrain(ObjectBank<List<IN>> featureLabels,
                    Index<String> featureIndex,
                    Index<String> labelIndex)
Parameters:
featureLabels - retrain docs
featureIndex - featureIndex of original dataset (used in training)
labelIndex - labelIndex of original dataset (used in training)

retrain

public void retrain(ObjectBank<List<IN>> doc)

train

public void train(Collection<List<IN>> wordInfos,
                  DocumentReaderAndWriter<IN> readerAndWriter)
Description copied from class: AbstractSequenceClassifier
Trains a classifier from a Collection of sequences. Note that the Collection can be (and usually is) an ObjectBank.

Specified by:
train in class AbstractSequenceClassifier<IN extends CoreLabel>
Parameters:
wordInfos - An ObjectBank or a Collection of sequences of IN
readerAndWriter - A DocumentReaderAndWriter to use when loading test files

getFeaturesAboveThreshhold

public Index<String> getFeaturesAboveThreshhold(Dataset<String,String> dataset,
                                                double thresh)

getDataset

public Dataset<String,String> getDataset(Collection<List<IN>> data)
Build a Dataset from some data. Used for training a classifier.

Parameters:
data - This variable is a list of lists of CoreLabel. That is, it is a collection of documents, each of which is represented as a sequence of CoreLabel objects.
Returns:
The Dataset which is an efficient encoding of the information in a List of Datums

getDataset

public Dataset<String,String> getDataset(Collection<List<IN>> data,
                                         Index<String> featureIndex,
                                         Index<String> classIndex)
Build a Dataset from some data. Used for training a classifier. By passing in an extra featureIndex and classIndex, you can get a Dataset based on the featureIndex and classIndex of an existing Dataset.

Parameters:
data - This variable is a list of lists of CoreLabel. That is, it is a collection of documents, each of which is represented as a sequence of CoreLabel objects.
classIndex - Supply this, together with featureIndex, if you want a Dataset based on the featureIndex and classIndex of an existing Dataset.
Returns:
The Dataset which is an efficient encoding of the information in a List of Datums

getBiasedDataset

public Dataset<String,String> getBiasedDataset(ObjectBank<List<IN>> data,
                                               Index<String> featureIndex,
                                               Index<String> classIndex)

getDataset

public Dataset<String,String> getDataset(ObjectBank<List<IN>> data,
                                         Dataset<String,String> origDataset)
Build a Dataset from some data. Used for training a classifier. By passing in an extra origDataset, you can get a Dataset based on featureIndex and classIndex in an existing origDataset.

Parameters:
data - This variable is a list of lists of CoreLabel. That is, it is a collection of documents, each of which is represented as a sequence of CoreLabel objects.
origDataset - An existing Dataset whose featureIndex and classIndex the new Dataset should be based on.
Returns:
The Dataset which is an efficient encoding of the information in a List of Datums

getDataset

public Dataset<String,String> getDataset(Dataset<String,String> oldData,
                                         Index<String> goodFeatures)
Build a Dataset from some data.

Parameters:
oldData - This Dataset represents data from which we wish to remove some features, specifically those features not in the Index goodFeatures.
goodFeatures - An Index of features we wish to retain.
Returns:
A new Dataset where each datapoint contains only features which were in goodFeatures.
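The filtering described above can be sketched with plain collections. Here a list of feature-name lists stands in for the Dataset and a Set stands in for the Index. Both substitutions, along with the names FeatureFilterSketch and retainGoodFeatures, are assumptions made for illustration and do not reflect the library's data structures.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; plain collections stand in for Dataset and Index.
final class FeatureFilterSketch {

  /** Returns a new list in which each datapoint keeps only features found in goodFeatures. */
  static List<List<String>> retainGoodFeatures(List<List<String>> datapoints,
                                               Set<String> goodFeatures) {
    List<List<String>> filtered = new ArrayList<>();
    for (List<String> features : datapoints) {
      List<String> kept = new ArrayList<>();
      for (String feature : features) {
        if (goodFeatures.contains(feature)) {
          kept.add(feature);
        }
      }
      filtered.add(kept); // the input lists are left unmodified
    }
    return filtered;
  }

  public static void main(String[] args) {
    List<List<String>> data = new ArrayList<>();
    data.add(List.of("word=John", "shape=Xx", "prev=O"));
    data.add(List.of("word=lives", "rare-feature"));
    Set<String> good = new HashSet<>(List.of("word=John", "prev=O", "word=lives"));
    System.out.println(retainGoodFeatures(data, good));
  }
}
```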

serializeClassifier

public void serializeClassifier(String serializePath)
Description copied from class: AbstractSequenceClassifier
Serialize a sequence classifier to a file on the given path.

Specified by:
serializeClassifier in class AbstractSequenceClassifier<IN extends CoreLabel>
Parameters:
serializePath - The path/filename to write the classifier to.

loadDefaultClassifier

public void loadDefaultClassifier()
Used to load the default supplied classifier. THIS FUNCTION WILL ONLY WORK IF RUN INSIDE A JAR FILE.


getDefaultClassifier

public static CMMClassifier getDefaultClassifier()
Used to obtain the default classifier which is stored inside a jar file. THIS FUNCTION WILL ONLY WORK IF RUN INSIDE A JAR FILE.

Returns:
A Default CMMClassifier from a jar file

loadClassifier

public void loadClassifier(ObjectInputStream ois,
                           Properties props)
                    throws ClassCastException,
                           IOException,
                           ClassNotFoundException
Load a classifier from the given Stream. Implementation note: This method does not close the Stream that it reads from.

Specified by:
loadClassifier in class AbstractSequenceClassifier<IN extends CoreLabel>
Parameters:
ois - The ObjectInputStream to load the serialized classifier from
props - This Properties object will be used to update the SeqClassifierFlags which are read from the serialized classifier
Throws:
IOException - If there are problems accessing the input stream
ClassCastException - If there are problems interpreting the serialized data
ClassNotFoundException - If there are problems interpreting the serialized data
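Because the stream is not closed by this method, the caller remains responsible for closing it. The sketch below illustrates that ownership pattern using only java.io classes: a serialized String stands in for a serialized classifier, and readWithoutClosing is a hypothetical stand-in for a loader that leaves its stream open.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical illustration of stream ownership; no library classes are used.
final class StreamOwnershipSketch {

  /** Reads one object and deliberately leaves the stream open for the caller to close. */
  static Object readWithoutClosing(ObjectInputStream ois)
      throws IOException, ClassNotFoundException {
    return ois.readObject();
  }

  /** Serializes a value to bytes (a stand-in for a serialized classifier file). */
  static byte[] toBytes(Serializable value) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
      oos.writeObject(value);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** The caller owns the stream, so try-with-resources closes it once loading is done. */
  static Object loadAndClose(byte[] data) {
    try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(data))) {
      return readWithoutClosing(ois);
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException("Could not deserialize", e);
    }
  }

  public static void main(String[] args) {
    Object restored = loadAndClose(toBytes("stand-in for a classifier"));
    System.out.println(restored);
  }
}
```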

getClassifierNoExceptions

public static CMMClassifier getClassifierNoExceptions(File file)

getClassifier

public static CMMClassifier getClassifier(File file)
                                   throws IOException,
                                          ClassCastException,
                                          ClassNotFoundException
Throws:
IOException
ClassCastException
ClassNotFoundException

getClassifierNoExceptions

public static CMMClassifier getClassifierNoExceptions(String loadPath)

getClassifier

public static CMMClassifier getClassifier(String loadPath)
                                   throws IOException,
                                          ClassCastException,
                                          ClassNotFoundException
Throws:
IOException
ClassCastException
ClassNotFoundException

getClassifierNoExceptions

public static CMMClassifier getClassifierNoExceptions(InputStream in)

getClassifier

public static CMMClassifier getClassifier(InputStream in)
                                   throws IOException,
                                          ClassCastException,
                                          ClassNotFoundException
Throws:
IOException
ClassCastException
ClassNotFoundException

makeDatum

public <T extends CoreLabel> Datum<String,String> makeDatum(List<IN> info,
                                                            int loc,
                                                            FeatureFactory featureFactory)
Make an individual Datum out of the data list info, focused at position loc.

Parameters:
info - A List of IN objects (CoreLabel or a subclass)
loc - The position in the info list to focus feature creation on
featureFactory - The factory that constructs features out of the item
Returns:
A Datum (BasicDatum) representing this data instance

trainSemiSup

public void trainSemiSup()

scoresOf

public Counter<String> scoresOf(List<IN> lineInfos,
                                int pos)

printProbsDocument

public void printProbsDocument(List<IN> document)
Takes a List of CoreLabels and prints the likelihood of each possible label at each point. TODO: Finish or delete this method!

Specified by:
printProbsDocument in class AbstractSequenceClassifier<IN extends CoreLabel>
Parameters:
document - A List of CoreLabels.

main

public static void main(String[] args)
                 throws Exception
Command-line version of the classifier. See the class comments for examples of use, and SeqClassifierFlags for more information on supported flags.

Throws:
Exception

processDocument

public Document<?,?,WordTag> processDocument(Document in)
Assigns NER labels to the words in the given Document. Implements the DocumentProcessor interface. Outputs a new document with the same meta-data as the old one, but whose contents are a List of WordTags, where the tags are the NER labels assigned to the word.

Specified by:
processDocument in interface DocumentProcessor
See Also:
FunctionProcessor

process

public List<WordTag> process(List list)
Assigns NER labels to the words in the given List. Implements the ListProcessor interface. Checks the input for instances of HasWord and HasTag, or uses the toString() method if the HasWord check fails. Outputs a list of WordTags, where the tag is the NER label assigned to the word.

Specified by:
process in interface ListProcessor<Object,WordTag>
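The HasWord check with a toString() fallback described above might look roughly like the following. The nested HasWord interface is a local stand-in for the library's interface of the same name, and WordExtractionSketch and wordsOf are hypothetical names. The sketch covers only word extraction, not tagging.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; covers only the word-extraction step, not NER tagging.
final class WordExtractionSketch {

  /** Local stand-in for the library's HasWord interface. */
  interface HasWord {
    String word();
  }

  /** Uses word() when an item implements HasWord, otherwise falls back to its string form. */
  static List<String> wordsOf(List<?> items) {
    List<String> words = new ArrayList<>();
    for (Object item : items) {
      if (item instanceof HasWord) {
        words.add(((HasWord) item).word());
      } else {
        words.add(String.valueOf(item));
      }
    }
    return words;
  }

  public static void main(String[] args) {
    HasWord token = () -> "Stanford";
    List<Object> mixed = Arrays.asList(token, "University", 42);
    System.out.println(wordsOf(mixed));
  }
}
```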

weight

public double weight(String feature,
                     String label)

weights

public double[][] weights()

classifyWithGlobalInformation

public List<IN> classifyWithGlobalInformation(List<IN> tokenSeq,
                                              CoreMap doc,
                                              CoreMap sent)
Description copied from class: AbstractSequenceClassifier
Classify a List of something that extends CoreMap using as additional information whatever is stored in the document and sentence. This is needed for SUTime (NumberSequenceClassifier), which requires the document date to resolve relative dates.

Specified by:
classifyWithGlobalInformation in class AbstractSequenceClassifier<IN extends CoreLabel>
Returns:
Classified version of the input tokenSequence


Stanford NLP Group