edu.stanford.nlp.ie.crf
Class CRFClassifier

java.lang.Object
  extended by edu.stanford.nlp.ie.AbstractSequenceClassifier
      extended by edu.stanford.nlp.ie.crf.CRFClassifier
All Implemented Interfaces:
Function<String,String>

public class CRFClassifier
extends AbstractSequenceClassifier

Class for sequence classification using a Conditional Random Field model. The code supports several document formats, but when the standard ColumnDocumentReaderAndWriter is used for training or testing models, input files are expected to have one token per line, with columns indicating things like the word, POS, chunk, and answer class. The default for ColumnDocumentReaderAndWriter training data is 3-column input, with the columns containing a word, its POS, and its gold class, but a different column layout can be specified via the map property.
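
As an illustration of the column format, the following sketch splits one line of 3-column data using a map specification. It is not the code of ColumnDocumentReaderAndWriter. The class ColumnLineSketch, its methods, the map syntax "word=0,tag=1,answer=2", the field names, and whitespace as the column separator are all assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: names the fields of one column-format line using a map string. */
final class ColumnLineSketch {

    /** Parses a specification such as "word=0,tag=1,answer=2" into name -> column index. */
    static Map<String, Integer> parseMap(String spec) {
        Map<String, Integer> columns = new LinkedHashMap<>();
        for (String entry : spec.split(",")) {
            String[] kv = entry.trim().split("=");
            columns.put(kv[0].trim(), Integer.parseInt(kv[1].trim()));
        }
        return columns;
    }

    /** Splits a whitespace-separated token line and labels each field by column name. */
    static Map<String, String> parseLine(String line, Map<String, Integer> columns) {
        String[] fields = line.trim().split("\\s+");
        Map<String, String> token = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : columns.entrySet()) {
            token.put(e.getKey(), fields[e.getValue()]);
        }
        return token;
    }

    public static void main(String[] args) {
        // Assumed default layout: word, POS tag, gold answer class.
        Map<String, Integer> columns = parseMap("word=0,tag=1,answer=2");
        System.out.println(parseLine("Stanford NNP ORGANIZATION", columns));
    }
}
```

Changing the map string (for example, to add a chunk column) changes which column each field name is read from, which is the role the map property plays for the real reader.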

When run on a file with -textFile, the file is assumed to be plain English text (or perhaps simple HTML/XML), and a reasonable attempt is made at English tokenization by PlainTextDocumentReaderAndWriter.

Typical command-line usage

For running a trained model with a provided serialized classifier on a text file:

java -mx500m edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier conll.ner.gz -textFile samplesentences.txt

When specifying all parameters in a properties file (train, test, or runtime):

java -mx1g edu.stanford.nlp.ie.crf.CRFClassifier -prop propFile

To train and test a simple NER model from the command line:

java -mx1000m edu.stanford.nlp.ie.crf.CRFClassifier -trainFile trainFile -testFile testFile -macro > output

Features are defined by a FeatureFactory. NERFeatureFactory is used by default, and you should look there for feature templates and for the properties or flags that cause particular features to be used when training an NER classifier. There is also an edu.stanford.nlp.wordseg.SighanFeatureFactory, and various successors such as edu.stanford.nlp.wordseg.ChineseSegmenterFeatureFactory, which are used for Chinese word segmentation. Features are specified either by a Properties file (which is the recommended method) or by flags on the command line. The flags are read into a SeqClassifierFlags object, which the user need not be concerned with unless wishing to add new features.
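
The two ways of specifying settings can be pictured with plain java.util.Properties. The sketch below holds the trainFile and testFile settings from the command-line example, renders them as text that could be saved to a file and passed with -prop, and renders them as flags. The class PropsAndFlagsSketch is hypothetical, and the "-key value" correspondence between flags and properties is an assumption suggested by the usage examples, not a description of how the flags are actually parsed.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.TreeSet;

/** Illustrative only: one group of settings shown as a properties file and as flags. */
final class PropsAndFlagsSketch {

    /** Settings taken from the train-and-test command line in this document. */
    static Properties exampleProperties() {
        Properties props = new Properties();
        props.setProperty("trainFile", "trainFile");
        props.setProperty("testFile", "testFile");
        return props;
    }

    /** Renders the settings in properties-file syntax (key=value lines). */
    static String toPropertiesText(Properties props) {
        StringWriter out = new StringWriter();
        try {
            props.store(out, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    /** Renders the same settings as "-key value" flags, sorted by key (assumed convention). */
    static List<String> toFlags(Properties props) {
        List<String> flags = new ArrayList<>();
        for (String key : new TreeSet<>(props.stringPropertyNames())) {
            flags.add("-" + key);
            flags.add(props.getProperty(key));
        }
        return flags;
    }

    public static void main(String[] args) {
        Properties props = exampleProperties();
        System.out.print(toPropertiesText(props));
        System.out.println(String.join(" ", toFlags(props)));
    }
}
```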

CRFClassifier may also be used programmatically. When creating a new instance, you must specify a Properties object. You may then call train methods to train a classifier, or load a classifier. The other way to get a CRFClassifier is to deserialize one via one of the static getClassifier methods, such as getClassifier(String), which return a deserialized classifier. You may then tag (classify the items of) documents using either the assorted classify() methods in this class or the assorted classify methods in AbstractSequenceClassifier. Probabilities assigned by the CRF can be interrogated using either the printProbsDocument() or getCliqueTrees() methods.
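
A minimal sketch of the deserialize-then-tag path follows. It calls the classifier through reflection only so that the example compiles and runs without the Stanford NLP jar on the classpath; in ordinary code you would call CRFClassifier.getClassifier(path) and the inherited classifyToString method directly. The class CrfReflectionSketch is hypothetical, the model path is the one from the command-line example, and the single-String signature of classifyToString is an assumption.

```java
import java.lang.reflect.Method;
import java.util.Optional;

/** Illustrative only: loads a serialized classifier and tags a String via reflection. */
final class CrfReflectionSketch {

    static final String CRF_CLASS = "edu.stanford.nlp.ie.crf.CRFClassifier";

    /** Returns the tagged text, or empty when the Stanford classes are not available. */
    static Optional<String> tag(String modelPath, String text) {
        Class<?> crfClass;
        try {
            crfClass = Class.forName(CRF_CLASS);
        } catch (ClassNotFoundException e) {
            return Optional.empty();
        }
        try {
            // Direct equivalent: CRFClassifier.getClassifier(modelPath)
            Method getClassifier = crfClass.getMethod("getClassifier", String.class);
            Object classifier = getClassifier.invoke(null, modelPath);
            // Direct equivalent: classifier.classifyToString(text)
            // (assumed signature: one String in, one String out)
            Method classifyToString = crfClass.getMethod("classifyToString", String.class);
            return Optional.of((String) classifyToString.invoke(classifier, text));
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not load or run the classifier", e);
        }
    }

    public static void main(String[] args) {
        Optional<String> tagged = tag("conll.ner.gz", "Stanford University is in California.");
        System.out.println(tagged.orElse("Stanford NLP classes not found on the classpath"));
    }
}
```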

Author:
Jenny Finkel

Nested Class Summary
static class CRFClassifier.TestSequenceModel
           
 
Field Summary
static String DEFAULT_CLASSIFIER
          Name of default serialized classifier resource to look for in a jar file.
 
Fields inherited from class edu.stanford.nlp.ie.AbstractSequenceClassifier
classIndex, featureFactory, flags, JAR_CLASSIFIER_PATH, knownLCWords, pad, readerAndWriter, windowSize
 
Constructor Summary
protected CRFClassifier()
           
  CRFClassifier(Properties props)
           
 
Method Summary
protected  void addProcessedData(List<List<CRFDatum>> processedData, int[][][][] data, int[][] labels, int offset)
          Adds the List of Lists of CRFDatums to the data and labels arrays, treating each datum as if it were its own document.
protected static Index<CRFLabel> allLabels(int window, Index classIndex)
           
 List<CoreLabel> classify(List<CoreLabel> document)
          Classify a List of CoreLabels.
 List<CoreLabel> classifyGibbs(List<CoreLabel> document)
           
 List<CoreLabel> classifyMaxEnt(List<CoreLabel> document)
          Do standard sequence inference, using either Viterbi or Beam inference depending on the value of flags.inferenceType.
 Pair<int[][][][],int[][]> documentsToDataAndLabels(ObjectBank<List<CoreLabel>> documents)
          Convert an ObjectBank to arrays of data features and labels.
 Pair<int[][][],int[]> documentToDataAndLabels(List<? extends CoreLabel> document)
          Convert a document List into arrays storing the data features and labels.
 void dropFeaturesBelowThreshold(double threshold)
           
protected  List<CRFDatum> extractDatumSequence(int[][][] allData, int beginPosition, int endPosition, List<CoreLabel> labeledWordInfos)
          Creates a new CRFDatum from the preprocessed allData format, given the document number, position number, and a List of Object labels.
static CRFClassifier getClassifier(File file)
          Loads a CRF classifier from a filepath, and returns it.
static CRFClassifier getClassifier(InputStream in)
          Loads a CRF classifier from an InputStream, and returns it.
static CRFClassifier getClassifier(String loadPath)
           
static CRFClassifier getClassifierNoExceptions(String loadPath)
           
 List<CRFCliqueTree> getCliqueTrees(String filename)
          Want to make arbitrary probability queries? Then this is the method for you.
static CRFClassifier getDefaultClassifier()
          Used to get the default supplied classifier inside the jar file.
static CRFClassifier getJarClassifier(String resourceName, Properties props)
          Used to load a classifier stored as a resource inside a jar file.
protected  Minimizer getMinimizer()
           
protected  Minimizer getMinimizer(int featurePruneIteration)
           
 SequenceModel getSequenceModel(List<? extends CoreLabel> doc)
           
 void loadClassifier(ObjectInputStream ois, Properties props)
          Loads a classifier from the specified InputStream.
 void loadDefaultClassifier()
          This is used to load the default supplied classifier stored within the jar file.
protected static List loadProcessedData(String filename)
           
 void loadTextClassifier(String text, Properties props)
           
static void main(String[] args)
          The main method.
 CRFDatum<Serializable,CRFLabel> makeDatum(List<? extends CoreLabel> info, int loc, FeatureFactory featureFactory)
          Makes a CRFDatum by producing features and a label from input data at a specific position, using the provided factory.
 void printFirstOrderProbs(String filename)
          Takes the file, reads it in, and prints out the likelihood of each possible label at each point.
 void printFirstOrderProbsDocument(List<CoreLabel> document)
          Takes a List of CoreLabels and prints the likelihood of each possible label at each point.
 void printFirstOrderProbsDocuments(ObjectBank<List<CoreLabel>> documents)
          Takes a List of documents and prints the likelihood of each possible label at each point.
 void printLabelInformation(String testFile)
           
 void printLabelValue(List<CoreLabel> document)
           
 void printProbsDocument(List<CoreLabel> document)
          Takes a List of CoreLabels and prints the likelihood of each possible label at each point.
protected static void saveProcessedData(List datums, String filename)
           
 void serializeClassifier(String serializePath)
          Serialize a sequence classifier to a file on the given path.
 void serializeTextClassifier(String serializePath)
          Serialize the model to a human readable format.
 void train(ObjectBank<List<CoreLabel>> docs)
          Train a classifier from documents.
 
Methods inherited from class edu.stanford.nlp.ie.AbstractSequenceClassifier
apply, backgroundSymbol, classify, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswersKBest, classifyAndWriteViterbiSearchGraph, classifyFile, classifyKBest, classifySentence, classifyToCharacterOffsets, classifyToString, classifyToString, classifyWithCasing, classifyWithInlineXML, getSampler, getViterbiSearchGraph, labels, loadClassifier, loadClassifier, loadClassifier, loadClassifier, loadClassifier, loadClassifierNoExceptions, loadClassifierNoExceptions, loadClassifierNoExceptions, loadClassifierNoExceptions, loadClassifierNoExceptions, loadJarClassifier, makeObjectBankFromFile, makeObjectBankFromFiles, makeObjectBankFromFiles, makeObjectBankFromFiles, makeObjectBankFromReader, makeObjectBankFromString, printProbs, printProbsDocuments, reinit, segmentString, train, train, train, train, writeAnswers
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_CLASSIFIER

public static final String DEFAULT_CLASSIFIER
Name of default serialized classifier resource to look for in a jar file.

See Also:
Constant Field Values
Constructor Detail

CRFClassifier

protected CRFClassifier()

CRFClassifier

public CRFClassifier(Properties props)
Method Detail

dropFeaturesBelowThreshold

public void dropFeaturesBelowThreshold(double threshold)

documentToDataAndLabels

public Pair<int[][][],int[]> documentToDataAndLabels(List<? extends CoreLabel> document)
Convert a document List into arrays storing the data features and labels.

Parameters:
document - Training documents
Returns:
A Pair, where the first element is an int[][][] representing the data and the second element is an int[] representing the labels

printLabelInformation

public void printLabelInformation(String testFile)
                           throws Exception
Throws:
Exception

printLabelValue

public void printLabelValue(List<CoreLabel> document)

documentsToDataAndLabels

public Pair<int[][][][],int[][]> documentsToDataAndLabels(ObjectBank<List<CoreLabel>> documents)
Convert an ObjectBank to arrays of data features and labels.

Parameters:
documents -
Returns:
A Pair, where the first element is an int[][][][] representing the data and the second element is an int[][] representing the labels.

allLabels

protected static Index<CRFLabel> allLabels(int window,
                                           Index classIndex)

makeDatum

public CRFDatum<Serializable,CRFLabel> makeDatum(List<? extends CoreLabel> info,
                                                 int loc,
                                                 FeatureFactory featureFactory)
Makes a CRFDatum by producing features and a label from input data at a specific position, using the provided factory.

Parameters:
info - The input data
loc - The position to build a datum at
featureFactory - The FeatureFactory to use to extract features
Returns:
The constructed CRFDatum

classify

public List<CoreLabel> classify(List<CoreLabel> document)
Description copied from class: AbstractSequenceClassifier
Classify a List of CoreLabels.

Specified by:
classify in class AbstractSequenceClassifier
Parameters:
document - A List of CoreLabels.
Returns:
the same List, but with the elements annotated with their answers (with setAnswer()).

getSequenceModel

public SequenceModel getSequenceModel(List<? extends CoreLabel> doc)
Overrides:
getSequenceModel in class AbstractSequenceClassifier

classifyMaxEnt

public List<CoreLabel> classifyMaxEnt(List<CoreLabel> document)
Do standard sequence inference, using either Viterbi or Beam inference depending on the value of flags.inferenceType.

Parameters:
document - Document to classify. Classification happens in place. This document is modified.
Returns:
The classified document
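
For readers unfamiliar with Viterbi inference, the following is a textbook first-order Viterbi decoder over plain score arrays. It is a sketch of the general technique, not the implementation used by this class; the class ViterbiSketch and the emission/transition array layout are assumptions made for the example.

```java
import java.util.Arrays;

/** Illustrative only: first-order Viterbi decoding over log-score arrays. */
final class ViterbiSketch {

    /**
     * Returns the highest-scoring label sequence.
     *
     * @param emission   emission[t][y] is the log-score of label y at position t
     * @param transition transition[p][y] is the log-score of moving from label p to label y
     */
    static int[] decode(double[][] emission, double[][] transition) {
        int length = emission.length;
        if (length == 0) {
            return new int[0];
        }
        int numLabels = emission[0].length;
        // best[t][y]: score of the best path that ends with label y at position t
        double[][] best = new double[length][numLabels];
        int[][] backPointer = new int[length][numLabels];

        System.arraycopy(emission[0], 0, best[0], 0, numLabels);
        for (int t = 1; t < length; t++) {
            for (int y = 0; y < numLabels; y++) {
                double bestScore = Double.NEGATIVE_INFINITY;
                int bestPrev = 0;
                for (int p = 0; p < numLabels; p++) {
                    double score = best[t - 1][p] + transition[p][y];
                    if (score > bestScore) {
                        bestScore = score;
                        bestPrev = p;
                    }
                }
                best[t][y] = bestScore + emission[t][y];
                backPointer[t][y] = bestPrev;
            }
        }

        // Pick the best final label, then follow the back-pointers to the start.
        int last = 0;
        for (int y = 1; y < numLabels; y++) {
            if (best[length - 1][y] > best[length - 1][last]) {
                last = y;
            }
        }
        int[] path = new int[length];
        path[length - 1] = last;
        for (int t = length - 1; t > 0; t--) {
            path[t - 1] = backPointer[t][path[t]];
        }
        return path;
    }

    public static void main(String[] args) {
        // Emissions alone prefer labels 0, 1, 0; the transition penalties make staying on 0 better.
        double[][] emission = { {0, -1}, {-1, 0}, {0, -1} };
        double[][] transition = { {0, -10}, {-10, 0} };
        System.out.println(Arrays.toString(decode(emission, transition)));
    }
}
```

Beam inference differs in that it keeps only a limited number of partial paths at each position instead of the best path for every label, trading exactness for speed.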

classifyGibbs

public List<CoreLabel> classifyGibbs(List<CoreLabel> document)
                              throws ClassNotFoundException,
                                     SecurityException,
                                     NoSuchMethodException,
                                     IllegalArgumentException,
                                     InstantiationException,
                                     IllegalAccessException,
                                     InvocationTargetException
Throws:
ClassNotFoundException
SecurityException
NoSuchMethodException
IllegalArgumentException
InstantiationException
IllegalAccessException
InvocationTargetException

printProbsDocument

public void printProbsDocument(List<CoreLabel> document)
Takes a List of CoreLabels and prints the likelihood of each possible label at each point.

Specified by:
printProbsDocument in class AbstractSequenceClassifier
Parameters:
document - A List of CoreLabels.

printFirstOrderProbs

public void printFirstOrderProbs(String filename)
Takes the file, reads it in, and prints out the likelihood of each possible label at each point. This gives a simple way to examine the probability distributions of the CRF. See getCliqueTrees() for more.

Parameters:
filename - The path to the specified file

printFirstOrderProbsDocuments

public void printFirstOrderProbsDocuments(ObjectBank<List<CoreLabel>> documents)
Takes a List of documents and prints the likelihood of each possible label at each point.

Parameters:
documents - A List of List of CoreLabels.

getCliqueTrees

public List<CRFCliqueTree> getCliqueTrees(String filename)
Want to make arbitrary probability queries? Then this is the method for you. Given a filename, it reads the file in, breaks it into documents, and makes a CRFCliqueTree for each document. You can then ask the clique tree for marginals and conditional probabilities of almost anything you want.


printFirstOrderProbsDocument

public void printFirstOrderProbsDocument(List<CoreLabel> document)
Takes a List of CoreLabels and prints the likelihood of each possible label at each point.

Parameters:
document - A List of CoreLabels.

train

public void train(ObjectBank<List<CoreLabel>> docs)
Train a classifier from documents.

Specified by:
train in class AbstractSequenceClassifier
Parameters:
docs - An ObjectBank representation of documents.

getMinimizer

protected Minimizer getMinimizer()

getMinimizer

protected Minimizer getMinimizer(int featurePruneIteration)

extractDatumSequence

protected List<CRFDatum> extractDatumSequence(int[][][] allData,
                                              int beginPosition,
                                              int endPosition,
                                              List<CoreLabel> labeledWordInfos)
Creates a new CRFDatum from the preprocessed allData format, given the document number, position number, and a List of Object labels.

Parameters:
allData -
beginPosition -
endPosition -
labeledWordInfos -
Returns:
A new CRFDatum

addProcessedData

protected void addProcessedData(List<List<CRFDatum>> processedData,
                                int[][][][] data,
                                int[][] labels,
                                int offset)
Adds the List of Lists of CRFDatums to the data and labels arrays, treating each datum as if it were its own document. Adds context labels in addition to the target label for each datum, meaning that for a particular document, the number of labels will be windowSize-1 greater than the number of datums. For example, with a windowSize of 2, a document of 5 datums has 6 labels.

Parameters:
processedData - a List of Lists of CRFDatums
data -
labels -
offset -

saveProcessedData

protected static void saveProcessedData(List datums,
                                        String filename)

loadProcessedData

protected static List loadProcessedData(String filename)

loadTextClassifier

public void loadTextClassifier(String text,
                               Properties props)
                        throws ClassCastException,
                               IOException,
                               ClassNotFoundException,
                               InstantiationException,
                               IllegalAccessException
Throws:
ClassCastException
IOException
ClassNotFoundException
InstantiationException
IllegalAccessException

serializeTextClassifier

public void serializeTextClassifier(String serializePath)
Serialize the model to a human-readable format. This is not yet complete, although it should now work for the Chinese segmenter. TODO: check what serializeClassifier writes and add back any other necessary serialization.

Parameters:
serializePath - File to write text format of classifier to.

serializeClassifier

public void serializeClassifier(String serializePath)
Serialize a sequence classifier to a file on the given path.

Specified by:
serializeClassifier in class AbstractSequenceClassifier
Parameters:
serializePath - The path/filename to write the classifier to.

loadClassifier

public void loadClassifier(ObjectInputStream ois,
                           Properties props)
                    throws ClassCastException,
                           IOException,
                           ClassNotFoundException
Loads a classifier from the specified InputStream. This version works quietly (unless VERBOSE is true). If props is non-null then any properties it specifies override those in the serialized file. However, only some properties are sensible to change (you shouldn't change how features are defined).

Note: This method does not close the ObjectInputStream. (But earlier versions of the code used to, so beware....)

Specified by:
loadClassifier in class AbstractSequenceClassifier
Parameters:
ois - The InputStream to load the serialized classifier from
props - This Properties object will be used to update the SeqClassifierFlags which are read from the serialized classifier
Throws:
ClassCastException - If there are problems interpreting the serialized data
IOException - If there are problems accessing the input stream
ClassNotFoundException - If there are problems interpreting the serialized data

loadDefaultClassifier

public void loadDefaultClassifier()
This is used to load the default supplied classifier stored within the jar file. THIS FUNCTION WILL ONLY WORK IF THE CODE WAS LOADED FROM A JAR FILE WHICH HAS A SERIALIZED CLASSIFIER STORED INSIDE IT.


getDefaultClassifier

public static CRFClassifier getDefaultClassifier()
Used to get the default supplied classifier inside the jar file. THIS FUNCTION WILL ONLY WORK IF THE CODE WAS LOADED FROM A JAR FILE WHICH HAS A SERIALIZED CLASSIFIER STORED INSIDE IT.

Returns:
The default CRFClassifier in the jar file (if there is one)

getJarClassifier

public static CRFClassifier getJarClassifier(String resourceName,
                                             Properties props)
Used to load a classifier stored as a resource inside a jar file. THIS FUNCTION WILL ONLY WORK IF THE CODE WAS LOADED FROM A JAR FILE WHICH HAS A SERIALIZED CLASSIFIER STORED INSIDE IT.

Parameters:
resourceName - Name of classifier resource inside the jar file.
Returns:
A CRFClassifier stored in the jar file

getClassifier

public static CRFClassifier getClassifier(File file)
                                   throws IOException,
                                          ClassCastException,
                                          ClassNotFoundException
Loads a CRF classifier from a filepath, and returns it.

Parameters:
file - File to load classifier from
Returns:
The CRF classifier
Throws:
IOException - If there are problems accessing the input stream
ClassCastException - If there are problems interpreting the serialized data
ClassNotFoundException - If there are problems interpreting the serialized data

getClassifier

public static CRFClassifier getClassifier(InputStream in)
                                   throws IOException,
                                          ClassCastException,
                                          ClassNotFoundException
Loads a CRF classifier from an InputStream, and returns it. This method does not buffer the InputStream, so you should have buffered it before calling this method.

Parameters:
in - InputStream to load classifier from
Returns:
The CRF classifier
Throws:
IOException - If there are problems accessing the input stream
ClassCastException - If there are problems interpreting the serialized data
ClassNotFoundException - If there are problems interpreting the serialized data
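
Because this method does not buffer the stream, the caller is expected to wrap it first. The sketch below shows one way to open a model file with buffering using only the JDK. The class BufferedModelStreamSketch is hypothetical, and decompressing files whose names end in ".gz" before handing the stream over is an assumption about what the caller needs to do, not something this documentation states.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

/** Illustrative only: opens a model file as a buffered stream. */
final class BufferedModelStreamSketch {

    /** Opens the file with buffering, decompressing when the name ends in ".gz". */
    static InputStream openBuffered(Path file) throws IOException {
        InputStream raw = Files.newInputStream(file);
        try {
            if (file.getFileName().toString().endsWith(".gz")) {
                return new BufferedInputStream(new GZIPInputStream(raw));
            }
            return new BufferedInputStream(raw);
        } catch (IOException e) {
            raw.close(); // do not leak the file handle if the gzip header is invalid
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args.length > 0 ? args[0] : "conll.ner.gz");
        if (!Files.exists(file)) {
            System.out.println("No such file: " + file);
            return;
        }
        try (InputStream in = openBuffered(file)) {
            // With the Stanford NLP jar on the classpath, the stream could now be
            // passed to CRFClassifier.getClassifier(in).
            System.out.println("Opened buffered stream for " + file);
        }
    }
}
```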

getClassifierNoExceptions

public static CRFClassifier getClassifierNoExceptions(String loadPath)

getClassifier

public static CRFClassifier getClassifier(String loadPath)
                                   throws IOException,
                                          ClassCastException,
                                          ClassNotFoundException
Throws:
IOException
ClassCastException
ClassNotFoundException

main

public static void main(String[] args)
                 throws Exception
The main method. See the class documentation.

Throws:
Exception


Stanford NLP Group