edu.stanford.nlp.ie.crf
Class CRFClassifier

java.lang.Object
  extended by edu.stanford.nlp.ie.AbstractSequenceClassifier
      extended by edu.stanford.nlp.ie.crf.CRFClassifier
All Implemented Interfaces:
Function, Serializable

public class CRFClassifier
extends AbstractSequenceClassifier

Performs sequence classification using a conditional random field (CRF) model. The code supports different document encodings, but when the standard ColumnDocumentReaderAndWriter is used for training or testing models, input files are expected to contain one word per line, with columns holding values such as the word, POS tag, chunk, and class. When run on a file with -textFile, the file is assumed to be plain English text (or perhaps HTML/XML), and PlainTextDocumentReaderAndWriter makes a reasonable attempt at tokenization.
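
As a rough illustration of the one-word-per-line column format, the sketch below splits each line into its columns. The column order (word, POS, chunk, class), whitespace as the separator, and the skipping of blank lines are assumptions made for this example. ColumnFormatSketch is a hypothetical helper, not part of edu.stanford.nlp; the actual reading is done by ColumnDocumentReaderAndWriter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not part of edu.stanford.nlp. */
final class ColumnFormatSketch {

    /** Splits each non-blank line into its whitespace-separated columns. */
    static List<String[]> parse(List<String> lines) {
        List<String[]> tokens = new ArrayList<String[]>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // assumption: blank lines carry no token
            }
            tokens.add(trimmed.split("\\s+"));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Assumed column order: word, POS, chunk, class.
        List<String> lines = Arrays.asList(
            "Stanford NNP B-NP ORG",
            "is VBZ B-VP O",
            "",
            "nice JJ B-ADJP O");
        for (String[] columns : parse(lines)) {
            // Print the word (first column) and its class (last column).
            System.out.println(columns[0] + " -> " + columns[columns.length - 1]);
        }
    }
}
```

Treating the last column as the class keeps the sketch independent of how many intermediate columns a given file carries.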

Typical usage

For running a trained model with a provided serialized classifier on a text file:

java -server -mx500m edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier conll.ner.gz -textFile samplesentences.txt

When specifying all parameters in a properties file (train, test, or runtime):

java -server -mx1000m edu.stanford.nlp.ie.crf.CRFClassifier -prop propFile

To train and test a model from the command line:

java -mx1000m edu.stanford.nlp.ie.crf.CRFClassifier -trainFile trainFile -testFile testFile -macro > output

Features are defined by a FeatureFactory. NERFeatureFactory is used by default; see that class for the feature templates and for the properties or flags that enable particular features when training an NER classifier. There is also a ChineseFeatureFactory, which is used for Chinese word segmentation. Features are specified either in a Properties file (the recommended method) or on the command line. The features are read into a SeqClassifierFlags object, which users need not concern themselves with unless they wish to add new features.
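
The relationship between a properties file and command-line flags can be sketched with java.util.Properties alone. The example assumes that each property key corresponds to a flag of the same name (as -trainFile and -testFile in the usage lines suggest, though this page does not state it outright). PropsToArgsSketch and toArgs are hypothetical names, not part of the Stanford API.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.TreeSet;

/** Hypothetical helper for illustration; not part of edu.stanford.nlp. */
final class PropsToArgsSketch {

    /** Turns each property into a "-key value" pair, sorted by key. */
    static List<String> toArgs(Properties props) {
        List<String> args = new ArrayList<String>();
        // Sort the keys so the resulting order is deterministic.
        for (String key : new TreeSet<String>(props.stringPropertyNames())) {
            args.add("-" + key);
            args.add(props.getProperty(key));
        }
        return args;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the contents of a properties file.
        String fileContents = "trainFile = trainFile\ntestFile = testFile\n";
        Properties props = new Properties();
        props.load(new StringReader(fileContents));
        System.out.println(toArgs(props));
    }
}
```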

CRFClassifier may also be used programmatically. When creating a new instance, you must supply a Properties object (see CRFClassifier(Properties)). The other way to obtain a CRFClassifier is getClassifier(String), which returns a deserialized classifier. You may then tag sentences using the assorted test or testSentence methods.

Author:
Jenny Finkel
See Also:
Serialized Form

Nested Class Summary
static class CRFClassifier.TestSequenceModel
           
 
Field Summary
static String DEFAULT_CLASSIFIER
           
 
Fields inherited from class edu.stanford.nlp.ie.AbstractSequenceClassifier
classIndex, featureFactory, flags, JAR_CLASSIFIER_PATH, knownLCWords, pad, readerAndWriter, windowSize
 
Constructor Summary
protected CRFClassifier()
           
  CRFClassifier(Properties props)
           
 
Method Summary
protected  void addProcessedData(List processedData, int[][][][] data, int[][] labels, int offset)
          Adds the List of Lists of CRFDatums to the data and labels arrays, treating each datum as if it were its own document.
protected  Index allLabels(int window, Index classIndex)
           
 Pair<int[][][][],int[][]> documentsToDataAndLabels(ObjectBank<List<FeatureLabel>> documents)
          Convert an ObjectBank to arrays of data features and labels.
 Pair<int[][][],int[]> documentToDataAndLabels(List<FeatureLabel> document)
          Convert a document List into arrays storing the data features and labels.
 void dropFeaturesBelowThreshold(double threshold)
           
protected  List<CRFDatum> extractDatumSequence(int[][][] allData, int beginPosition, int endPosition, List labeledWordInfos)
          Creates a List of CRFDatums from the preprocessed allData format, given the begin and end positions and a List of Object labels.
static CRFClassifier getClassifier(File file)
           
static CRFClassifier getClassifier(InputStream in)
           
static CRFClassifier getClassifier(String loadPath)
           
static CRFClassifier getClassifierNoExceptions(File file)
           
static CRFClassifier getClassifierNoExceptions(InputStream in)
           
static CRFClassifier getClassifierNoExceptions(String loadPath)
           
static CRFClassifier getDefaultClassifier()
          Used to get the default supplied classifier.
static CRFClassifier getJarClassifier(String resourceName, Properties props)
          Used to load a classifier stored as a resource inside a jar file.
 SequenceModel getSequenceModel(List<FeatureLabel> doc)
           
 void loadClassifier(InputStream in, Properties props)
          Loads a classifier from the specified InputStream.
 void loadDefaultClassifier()
          This is used to load the default supplied classifier stored within the jar file.
protected  List loadProcessedData(String filename)
           
static void main(String[] args)
          The main method.
 CRFDatum makeDatum(List<FeatureLabel> info, int loc)
           
 CRFDatum makeDatum(List<FeatureLabel> info, int loc, FeatureFactory featureFactory)
           
 void printFirstOrderProbs(String filename)
          Takes the file, reads it in, and prints out the likelihood of each possible label at each point.
 void printFirstOrderProbsDocument(List<FeatureLabel> document)
          Takes a List of FeatureLabels and prints the likelihood of each possible label at each point.
 void printFirstOrderProbsDocuments(ObjectBank<List<FeatureLabel>> documents)
          Takes a List of documents and prints the likelihood of each possible label at each point.
 void printProbsDocument(List<FeatureLabel> document)
          Takes a List of FeatureLabels and prints the likelihood of each possible label at each point.
protected  void saveProcessedData(List datums, String filename)
           
 void serializeClassifier(String serializePath)
           
 List<FeatureLabel> test(List<FeatureLabel> document)
          Classify a List of FeatureLabels.
 List<FeatureLabel> testGibbs(List<FeatureLabel> document)
           
 List<FeatureLabel> testMaxEnt(List<FeatureLabel> document)
           
 void train(ObjectBank<List<FeatureLabel>> docs)
          Train a classifier.
 
Methods inherited from class edu.stanford.nlp.ie.AbstractSequenceClassifier
apply, backgroundSymbol, getSampler, init, init, labels, loadClassifier, loadClassifier, loadClassifier, loadClassifier, loadClassifierNoExceptions, loadClassifierNoExceptions, loadClassifierNoExceptions, loadClassifierNoExceptions, loadClassifierNoExceptions, loadJarClassifier, makeObjectBank, makeObjectBank, makeObjectBank, printProbs, printProbsDocuments, reinit, segmentString, testAndWriteAnswers, testAndWriteAnswersKBest, testFile, testKBest, testSentence, testSentences, testSentenceWithCasing, testString, testStringInlineXML, testStringXML, train, train, writeAnswers
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_CLASSIFIER

public static final String DEFAULT_CLASSIFIER
See Also:
Constant Field Values
Constructor Detail

CRFClassifier

protected CRFClassifier()

CRFClassifier

public CRFClassifier(Properties props)
Method Detail

dropFeaturesBelowThreshold

public void dropFeaturesBelowThreshold(double threshold)

documentToDataAndLabels

public Pair<int[][][],int[]> documentToDataAndLabels(List<FeatureLabel> document)
Convert a document List into arrays storing the data features and labels.

Parameters:
document - A List of FeatureLabels.
Returns:
A Pair, where the first element is an int[][][] representing the data and the second element is an int[] representing the labels.

documentsToDataAndLabels

public Pair<int[][][][],int[][]> documentsToDataAndLabels(ObjectBank<List<FeatureLabel>> documents)
Convert an ObjectBank to arrays of data features and labels.

Parameters:
documents - An ObjectBank of documents, each a List of FeatureLabels.
Returns:
A Pair, where the first element is an int[][][][] representing the data and the second element is an int[][] representing the labels.

allLabels

protected Index allLabels(int window,
                          Index classIndex)

makeDatum

public CRFDatum makeDatum(List<FeatureLabel> info,
                          int loc)

makeDatum

public CRFDatum makeDatum(List<FeatureLabel> info,
                          int loc,
                          FeatureFactory featureFactory)

test

public List<FeatureLabel> test(List<FeatureLabel> document)
Description copied from class: AbstractSequenceClassifier
Classify a List of FeatureLabels.

Specified by:
test in class AbstractSequenceClassifier
Parameters:
document - A List of FeatureLabels.
Returns:
the same List, but with the elements annotated with their answers (with setAnswer()).

getSequenceModel

public SequenceModel getSequenceModel(List<FeatureLabel> doc)
Overrides:
getSequenceModel in class AbstractSequenceClassifier

testMaxEnt

public List<FeatureLabel> testMaxEnt(List<FeatureLabel> document)

testGibbs

public List<FeatureLabel> testGibbs(List<FeatureLabel> document)
                             throws ClassNotFoundException,
                                    SecurityException,
                                    NoSuchMethodException,
                                    IllegalArgumentException,
                                    InstantiationException,
                                    IllegalAccessException,
                                    InvocationTargetException
Throws:
ClassNotFoundException
SecurityException
NoSuchMethodException
IllegalArgumentException
InstantiationException
IllegalAccessException
InvocationTargetException

printProbsDocument

public void printProbsDocument(List<FeatureLabel> document)
Takes a List of FeatureLabels and prints the likelihood of each possible label at each point.

Specified by:
printProbsDocument in class AbstractSequenceClassifier
Parameters:
document - A List of FeatureLabels.

printFirstOrderProbs

public void printFirstOrderProbs(String filename)
Takes the file, reads it in, and prints out the likelihood of each possible label at each point.

Parameters:
filename - The path to the specified file

printFirstOrderProbsDocuments

public void printFirstOrderProbsDocuments(ObjectBank<List<FeatureLabel>> documents)
Takes a List of documents and prints the likelihood of each possible label at each point.

Parameters:
documents - A List of List of FeatureLabels.

printFirstOrderProbsDocument

public void printFirstOrderProbsDocument(List<FeatureLabel> document)
Takes a List of FeatureLabels and prints the likelihood of each possible label at each point.

Parameters:
document - A List of FeatureLabels.

train

public void train(ObjectBank<List<FeatureLabel>> docs)
Train a classifier.

Specified by:
train in class AbstractSequenceClassifier

extractDatumSequence

protected List<CRFDatum> extractDatumSequence(int[][][] allData,
                                              int beginPosition,
                                              int endPosition,
                                              List labeledWordInfos)
Creates a List of CRFDatums from the preprocessed allData format, given the begin and end positions and a List of Object labels.

Parameters:
allData -
beginPosition -
endPosition -
labeledWordInfos -
Returns:
A List of new CRFDatums

addProcessedData

protected void addProcessedData(List processedData,
                                int[][][][] data,
                                int[][] labels,
                                int offset)
Adds the List of Lists of CRFDatums to the data and labels arrays, treating each datum as if it were its own document. Adds context labels in addition to the target label for each datum, meaning that for a particular document, the number of labels will be windowSize-1 greater than the number of datums.
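
The label-count arithmetic can be made concrete with a small sketch. It assumes that the windowSize-1 context labels are placed before the target labels and filled with a single background label index; the page does not state where the context labels go or what value they take, so both choices are illustrative. ContextLabelSketch is a hypothetical helper, not part of the Stanford API.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of edu.stanford.nlp. */
final class ContextLabelSketch {

    /**
     * Returns a label array that holds windowSize-1 context labels
     * followed by one target label per datum.
     */
    static int[] withContext(int[] targetLabels, int windowSize, int backgroundLabel) {
        if (windowSize < 1) {
            throw new IllegalArgumentException("windowSize must be at least 1");
        }
        int contextLength = windowSize - 1;
        int[] labels = new int[targetLabels.length + contextLength];
        // Assumption: context positions are filled with the background label.
        Arrays.fill(labels, 0, contextLength, backgroundLabel);
        System.arraycopy(targetLabels, 0, labels, contextLength, targetLabels.length);
        return labels;
    }

    public static void main(String[] args) {
        int[] targets = {2, 1, 0};   // one label per datum
        int[] labels = withContext(targets, 3, 9);
        // With 3 datums and windowSize 3, the array holds 3 + (3 - 1) = 5 labels.
        System.out.println(Arrays.toString(labels));
    }
}
```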

Parameters:
processedData - a List of Lists of CRFDatums
data -
labels -
offset -

saveProcessedData

protected void saveProcessedData(List datums,
                                 String filename)

loadProcessedData

protected List loadProcessedData(String filename)

serializeClassifier

public void serializeClassifier(String serializePath)
Specified by:
serializeClassifier in class AbstractSequenceClassifier

loadClassifier

public void loadClassifier(InputStream in,
                           Properties props)
                    throws ClassCastException,
                           IOException,
                           ClassNotFoundException
Loads a classifier from the specified InputStream. This version works quietly (unless VERBOSE is true). If props is non-null then any properties it specifies override those in the serialized file. However, only some properties are sensible to change (you shouldn't change how features are defined).

Specified by:
loadClassifier in class AbstractSequenceClassifier
Parameters:
in - The InputStream to load the serialized classifier from
props - This Properties object will be used to update the SeqClassifierFlags which are read from the serialized classifier
Throws:
ClassCastException
IOException
ClassNotFoundException
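
The override behaviour described for props can be pictured with plain java.util.Properties. This is a minimal sketch of "values in props replace values read from the serialized file", not the actual update of SeqClassifierFlags. The class PropertyOverrideSketch and the keys exampleFlagA and exampleFlagB are hypothetical.

```java
import java.util.Properties;

/** Hypothetical helper for illustration; not part of edu.stanford.nlp. */
final class PropertyOverrideSketch {

    /** Copies the serialized settings, then applies any overrides on top. */
    static Properties merge(Properties serialized, Properties overrides) {
        Properties merged = new Properties();
        merged.putAll(serialized);
        if (overrides != null) {
            // Keys present in overrides replace the serialized values.
            merged.putAll(overrides);
        }
        return merged;
    }

    public static void main(String[] args) {
        Properties serialized = new Properties();
        serialized.setProperty("exampleFlagA", "true");   // hypothetical key
        serialized.setProperty("exampleFlagB", "10");     // hypothetical key

        Properties overrides = new Properties();
        overrides.setProperty("exampleFlagB", "20");

        Properties merged = merge(serialized, overrides);
        System.out.println(merged.getProperty("exampleFlagA"));
        System.out.println(merged.getProperty("exampleFlagB"));
    }
}
```

Passing null for the overrides leaves the serialized settings untouched, which mirrors the "if props is non-null" condition in the description.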

loadDefaultClassifier

public void loadDefaultClassifier()
This is used to load the default supplied classifier stored within the jar file. **THIS FUNCTION WILL ONLY WORK IF RUN INSIDE A JAR FILE**


getDefaultClassifier

public static CRFClassifier getDefaultClassifier()
Used to get the default supplied classifier. **THIS FUNCTION WILL ONLY WORK IF RUN INSIDE A JAR FILE**


getJarClassifier

public static CRFClassifier getJarClassifier(String resourceName,
                                             Properties props)
Used to load a classifier stored as a resource inside a jar file. **THIS FUNCTION WILL ONLY WORK IF RUN INSIDE A JAR FILE**
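
One plausible reading of the jar restriction is that the classifier is looked up as a classpath resource. Whether CRFClassifier uses this mechanism is an assumption; the sketch only shows how a standard resource lookup reports a missing resource, so that a caller can check before relying on it. JarResourceSketch and the resource path are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper for illustration; not part of edu.stanford.nlp. */
final class JarResourceSketch {

    /** Returns true if the named resource can be opened from the classpath. */
    static boolean isAvailable(String resourceName) {
        // getResourceAsStream returns null when the resource is not found.
        InputStream in = JarResourceSketch.class.getResourceAsStream(resourceName);
        if (in == null) {
            return false;
        }
        try {
            in.close();
        } catch (IOException ignored) {
            // Closing failed, but the resource was found.
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical resource path; replace with the real resource name.
        String name = "/hypothetical/classifier.ser.gz";
        System.out.println(name + " available: " + isAvailable(name));
    }
}
```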


getClassifierNoExceptions

public static CRFClassifier getClassifierNoExceptions(File file)

getClassifier

public static CRFClassifier getClassifier(File file)
                                   throws IOException,
                                          ClassCastException,
                                          ClassNotFoundException
Throws:
IOException
ClassCastException
ClassNotFoundException

getClassifierNoExceptions

public static CRFClassifier getClassifierNoExceptions(String loadPath)

getClassifier

public static CRFClassifier getClassifier(String loadPath)
                                   throws IOException,
                                          ClassCastException,
                                          ClassNotFoundException
Throws:
IOException
ClassCastException
ClassNotFoundException

getClassifierNoExceptions

public static CRFClassifier getClassifierNoExceptions(InputStream in)

getClassifier

public static CRFClassifier getClassifier(InputStream in)
                                   throws IOException,
                                          ClassCastException,
                                          ClassNotFoundException
Throws:
IOException
ClassCastException
ClassNotFoundException
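
This page does not say what the getClassifierNoExceptions variants do when loading fails. A common pattern for such method pairs, shown here purely as an assumption, is to wrap the checked exceptions of the throwing variant in an unchecked one. NoExceptionsSketch and loadUnchecked are hypothetical names; the Callable stands in for a call such as getClassifier(String).

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration; not part of edu.stanford.nlp. */
final class NoExceptionsSketch {

    /** Runs the loader and rethrows any checked exception as unchecked. */
    static <T> T loadUnchecked(Callable<T> loader) {
        try {
            return loader.call();
        } catch (RuntimeException e) {
            throw e; // already unchecked, pass through unchanged
        } catch (Exception e) {
            throw new RuntimeException("Could not load classifier", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in loader that fails the way a missing file might.
        Callable<String> failing = new Callable<String>() {
            public String call() throws IOException {
                throw new IOException("file not found");
            }
        };
        try {
            loadUnchecked(failing);
        } catch (RuntimeException e) {
            System.out.println("wrapped cause: " + e.getCause());
        }
    }
}
```

The trade-off is that callers no longer have to declare IOException, ClassCastException, or ClassNotFoundException, but they also lose the compiler's reminder to handle a failed load.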

main

public static void main(String[] args)
                 throws Exception
The main method. See the class documentation.

Throws:
Exception


Stanford NLP Group