public class CRFClassifier<IN extends CoreMap> extends AbstractSequenceClassifier<IN>

When using ColumnDocumentReaderAndWriter for training or testing models, input files are expected to be one token per line, with the columns indicating things like the word, POS, chunk, and answer class. The default for ColumnDocumentReaderAndWriter training data is 3-column input, with the columns containing a word, its POS, and its gold class, but this can be specified via the map property.

When run on a file with -textFile, the file is assumed to be plain English text (or perhaps simple HTML/XML), and a reasonable attempt at English tokenization is made by PlainTextDocumentReaderAndWriter. The class used to read the text can be changed with -plainTextDocumentReaderAndWriter. Extra options can be supplied to the tokenizer using the -tokenizeOptions flag.

To read from stdin, use the -readStdin flag. The same reader/writer is used as for -textFile.
Typical command-line usage

For running a trained model with a provided serialized classifier on a text file:

    java -mx500m edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier conll.ner.gz -textFile samplesentences.txt

When specifying all parameters in a properties file (train, test, or runtime):

    java -mx1g edu.stanford.nlp.ie.crf.CRFClassifier -prop propFile

To train and test a simple NER model from the command line:

    java -mx1000m edu.stanford.nlp.ie.crf.CRFClassifier -trainFile trainFile -testFile testFile -macro > output

To train with multiple files:

    java -mx1000m edu.stanford.nlp.ie.crf.CRFClassifier -trainFileList file1,file2,... -testFile testFile -macro > output

To test on multiple files, use the -testFiles option with a comma-separated list.
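The same flags can be collected in the properties file passed with -prop. Below is a hedged sketch of such a file: the data and model paths are placeholders, and the feature flags shown are common NERFeatureFactory options (check NERFeatureFactory for the authoritative list).

```properties
# Data and model locations (placeholder paths)
trainFile = train.tsv
serializeTo = ner-model.ser.gz
# Column map: first column is the word, second the gold answer class
map = word=0,answer=1

# Common NERFeatureFactory feature flags (illustrative selection)
useClassFeature = true
useWord = true
useNGrams = true
noMidNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
usePrevSequences = true
useTypeSeqs = true
useTypeSeqs2 = true
useTypeySequences = true
wordShape = chris2useLC
```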
Features are defined by a FeatureFactory. NERFeatureFactory is used by default, and you should look there for feature templates and properties or flags that will cause certain features to be used when training an NER classifier. There are also various feature factories for Chinese word segmentation, such as ChineseSegmenterFeatureFactory. Features are specified either by a Properties file (which is the recommended method) or by flags on the command line. The flags are read into a SeqClassifierFlags object, which the user need not be concerned with unless wishing to add new features.

CRFClassifier may also be used programmatically. When creating a new instance, you must specify a Properties object. You may then call train methods to train a classifier, or load a classifier. The other way to get a CRFClassifier is to deserialize one via the static getClassifier(String) methods, which return a deserialized classifier. You may then tag (classify the items of) documents using the assorted classify() methods in AbstractSequenceClassifier.
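The programmatic path just described can be sketched as follows. The model path is a placeholder, and the Stanford CoreNLP jar (plus a serialized model) must be available; classifyToString is one of the inherited AbstractSequenceClassifier methods.

```java
import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class CRFDemo {
  public static void main(String[] args) throws Exception {
    // Deserialize a trained model via the static factory (placeholder path)
    AbstractSequenceClassifier<CoreLabel> classifier =
        CRFClassifier.getClassifier("english.all.3class.distsim.crf.ser.gz");
    // classifyToString tags raw text, inherited from AbstractSequenceClassifier
    String tagged =
        classifier.classifyToString("Jim bought 300 shares of Acme Corp. in 2006.");
    System.out.println(tagged);
  }
}
```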
Probabilities assigned by the CRF can be interrogated using either the printProbsDocument() or getCliqueTrees() methods.

Modifier and Type | Field and Description |
---|---|
static String |
DEFAULT_CLASSIFIER
Name of default serialized classifier resource to look for in a jar file.
|
Fields inherited from class AbstractSequenceClassifier:
classIndex, CUT_LABEL, featureFactories, flags, knownLCWords, pad, windowSize
Modifier | Constructor and Description |
---|---|
protected |
CRFClassifier() |
|
CRFClassifier(CRFClassifier<IN> crf)
Makes a copy of the crf classifier
|
|
CRFClassifier(Properties props) |
|
CRFClassifier(SeqClassifierFlags flags) |
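The Properties-based constructor above pairs with the train methods for programmatic training. A hedged sketch (property names as in the overview; file paths are placeholders):

```java
import java.util.Properties;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class TrainDemo {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.setProperty("trainFile", "train.tsv");          // placeholder path
    props.setProperty("map", "word=0,answer=1");          // 2-column input
    props.setProperty("serializeTo", "ner-model.ser.gz"); // placeholder path

    // Construct from the Properties object, then train and serialize
    CRFClassifier<CoreLabel> crf = new CRFClassifier<>(props);
    crf.train();  // no-arg train() reads trainFile from the flags
    crf.serializeClassifier("ner-model.ser.gz");
  }
}
```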
Modifier and Type | Method and Description |
---|---|
protected void |
addProcessedData(List<List<CRFDatum<Collection<String>,String>>> processedData,
int[][][][] data,
int[][] labels,
double[][][][] featureVals,
int offset)
Adds the List of Lists of CRFDatums to the data and labels arrays, treating
each datum as if it were its own document.
|
protected static Index<CRFLabel> |
allLabels(int window,
Index<String> classIndex) |
List<IN> |
classify(List<IN> document)
|
List<IN> |
classifyGibbs(List<IN> document) |
List<IN> |
classifyGibbs(List<IN> document,
Triple<int[][][],int[],double[][][]> documentDataAndLabels) |
List<IN> |
classifyMaxEnt(List<IN> document)
Do standard sequence inference, using either Viterbi or Beam inference
depending on the value of flags.inferenceType. |
List<IN> |
classifyWithGlobalInformation(List<IN> tokenSeq,
CoreMap doc,
CoreMap sent)
|
void |
combine(CRFClassifier<IN> crf,
double weight)
Combines weighted crf with this crf
|
Triple<int[][][][],int[][],double[][][][]> |
documentsToDataAndLabels(Collection<List<IN>> documents)
Convert an ObjectBank to arrays of data features and labels.
|
List<Triple<int[][][],int[],double[][][]>> |
documentsToDataAndLabelsList(Collection<List<IN>> documents)
Convert an ObjectBank to corresponding collection of data features and
labels.
|
Triple<int[][][],int[],double[][][]> |
documentToDataAndLabels(List<IN> document)
Convert a document List into arrays storing the data features and labels.
|
void |
dropFeaturesBelowThreshold(double threshold) |
void |
dumpFeatures(Collection<List<IN>> docs)
Does nothing by default.
|
protected List<CRFDatum<? extends Collection<String>,? extends CharSequence>> |
extractDatumSequence(int[][][] allData,
int beginPosition,
int endPosition,
List<IN> labeledWordInfos)
Creates a new CRFDatum from the preprocessed allData format, given the
document number, position number, and a List of Object labels.
|
static <INN extends CoreMap> CRFClassifier<INN> |
getClassifier(File file)
Loads a CRF classifier from a filepath, and returns it.
|
static <INN extends CoreMap> CRFClassifier<INN> |
getClassifier(InputStream in)
Loads a CRF classifier from an InputStream, and returns it.
|
static CRFClassifier<CoreLabel> |
getClassifier(String loadPath) |
static <INN extends CoreMap> CRFClassifier<INN> |
getClassifier(String loadPath,
Properties props) |
static <INN extends CoreMap> CRFClassifier<INN> |
getClassifierNoExceptions(String loadPath) |
protected CliquePotentialFunction |
getCliquePotentialFunctionForTest() |
CRFCliqueTree<String> |
getCliqueTree(List<IN> document) |
CRFCliqueTree<String> |
getCliqueTree(Triple<int[][][],int[],double[][][]> p) |
List<CRFCliqueTree<String>> |
getCliqueTrees(String filename,
DocumentReaderAndWriter<IN> readerAndWriter)
Want to make arbitrary probability queries? Then this is the method for
you.
|
static <INN extends CoreMap> CRFClassifier<INN> |
getDefaultClassifier()
Used to get the default supplied classifier inside the jar file.
|
static <INN extends CoreMap> CRFClassifier<INN> |
getDefaultClassifier(Properties props)
Used to get the default supplied classifier inside the jar file.
|
static <INN extends CoreMap> CRFClassifier<INN> |
getJarClassifier(String resourceName,
Properties props)
Used to load a classifier stored as a resource inside a jar file.
|
Minimizer<DiffFunction> |
getMinimizer() |
Minimizer<DiffFunction> |
getMinimizer(int featurePruneIteration,
Evaluator[] evaluators) |
int |
getNumWeights()
Returns the total number of weights associated with this classifier.
|
protected CRFLogConditionalObjectiveFunction |
getObjectiveFunction(int[][][][] data,
int[][] labels) |
SequenceModel |
getSequenceModel(List<IN> doc) |
protected Collection<List<IN>> |
loadAuxiliaryData(Collection<List<IN>> docs,
DocumentReaderAndWriter<IN> readerAndWriter)
Load auxiliary data to be used in constructing features and labels
Intended to be overridden by subclasses
|
void |
loadClassifier(ObjectInputStream ois,
Properties props)
Loads a classifier from the specified InputStream.
|
static Index<String> |
loadClassIndexFromFile(String serializePath) |
void |
loadDefaultClassifier()
This is used to load the default supplied classifier stored within the jar
file.
|
void |
loadDefaultClassifier(Properties props)
This is used to load the default supplied classifier stored within the jar
file.
|
static Index<String> |
loadFeatureIndexFromFile(String serializePath) |
protected static List<List<CRFDatum<Collection<String>,String>>> |
loadProcessedData(String filename) |
void |
loadTagIndex() |
protected void |
loadTextClassifier(BufferedReader br) |
void |
loadTextClassifier(String text,
Properties props) |
static double[][] |
loadWeightsFromFile(String serializePath) |
static void |
main(String[] args)
The main method.
|
protected void |
makeAnswerArraysAndTagIndex(Collection<List<IN>> ob)
This routine builds the labelIndices, which give the empirically legal label sequences (of length (order) at most windowSize), and the classIndex, which indexes known answer classes. |
CRFDatum<List<String>,CRFLabel> |
makeDatum(List<IN> info,
int loc,
List<FeatureFactory<IN>> featureFactories)
Makes a CRFDatum by producing features and a label from input data at a
specific position, using the provided factory.
|
void |
printFactorTable(String filename,
DocumentReaderAndWriter<IN> readerAndWriter)
Takes the file, reads it in, and prints out the factor table at each position.
|
void |
printFactorTableDocument(List<IN> document)
|
void |
printFactorTableDocuments(ObjectBank<List<IN>> documents)
Takes a List of documents and prints the factor table at each point. |
protected void |
printFeatures() |
void |
printFirstOrderProbs(String filename,
DocumentReaderAndWriter<IN> readerAndWriter)
Takes the file, reads it in, and prints out the likelihood of each possible
label at each point.
|
void |
printFirstOrderProbsDocument(List<IN> document)
|
void |
printFirstOrderProbsDocuments(ObjectBank<List<IN>> documents)
Takes a List of documents and prints the likelihood of each possible label at each point. |
void |
printLabelInformation(String testFile,
DocumentReaderAndWriter<IN> readerAndWriter) |
void |
printLabelValue(List<IN> document) |
void |
printProbsDocument(List<IN> document)
|
protected void |
pruneNodeFeatureIndices(int totalNumOfFeatureSlices,
int numOfFeatureSlices) |
protected static void |
saveProcessedData(List datums,
String filename) |
void |
scaleWeights(double scale)
Scales the weights of this CRFClassifier by the specified weight.
|
void |
serializeClassifier(ObjectOutputStream oos)
Serialize the classifier to the given ObjectOutputStream.
|
void |
serializeClassifier(String serializePath)
Serialize a sequence classifier to a file on the given path.
|
void |
serializeClassIndex(String serializePath) |
void |
serializeFeatureIndex(String serializePath) |
protected void |
serializeTextClassifier(PrintWriter pw) |
void |
serializeTextClassifier(String serializePath)
Serialize the model to a human readable format.
|
void |
serializeWeights(String serializePath) |
double[][] |
to2D(double[] weights,
List<Index<CRFLabel>> labelIndices,
int[] map) |
Map<String,Counter<String>> |
topWeights() |
void |
train(Collection<List<IN>> objectBankWrapper,
DocumentReaderAndWriter<IN> readerAndWriter)
Trains a classifier from a Collection of sequences.
|
protected double[] |
trainWeights(int[][][][] data,
int[][] labels,
Evaluator[] evaluators,
int pruneFeatureItr,
double[][][][] featureVals) |
void |
updateWeightsForTest(double[] x) |
void |
writeWeights(PrintStream p) |
Methods inherited from class AbstractSequenceClassifier:
apply, backgroundSymbol, classify, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswersKBest, classifyAndWriteAnswersKBest, classifyAndWriteViterbiSearchGraph, classifyFile, classifyFilesAndWriteAnswers, classifyFilesAndWriteAnswers, classifyKBest, classifyRaw, classifySentence, classifySentenceWithGlobalInformation, classifyStdin, classifyStdin, classifyToCharacterOffsets, classifyToString, classifyToString, classifyWithInlineXML, countResults, countResults, countResultsIOB, countResultsIOB2, countResultsSegmenter, defaultReaderAndWriter, finalizeClassification, getSampler, getViterbiSearchGraph, labels, loadClassifier, loadClassifier, loadClassifier, loadClassifier, loadClassifier, loadClassifier, loadClassifierNoExceptions, loadClassifierNoExceptions, loadClassifierNoExceptions, loadClassifierNoExceptions, loadClassifierNoExceptions, loadJarClassifier, makeObjectBankFromFile, makeObjectBankFromFile, makeObjectBankFromFiles, makeObjectBankFromFiles, makeObjectBankFromFiles, makeObjectBankFromReader, makeObjectBankFromString, makePlainTextReaderAndWriter, makeReaderAndWriter, plainTextReaderAndWriter, printFeatureLists, printFeatures, printProbs, printProbsDocuments, printResults, reinit, segmentString, segmentString, tallyOneEntityIOB, train, train, train, train, train, train, windowSize, writeAnswers
Methods inherited from class java.lang.Object:
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface java.util.function.Function:
andThen, compose, identity
public static final String DEFAULT_CLASSIFIER
protected CRFClassifier()
public CRFClassifier(Properties props)
public CRFClassifier(SeqClassifierFlags flags)
public CRFClassifier(CRFClassifier<IN> crf)
public int getNumWeights()
public void scaleWeights(double scale)
scale - The scale to multiply by

public void combine(CRFClassifier<IN> crf, double weight)
Combines weighted crf with this crf.
crf -
weight -

public void dropFeaturesBelowThreshold(double threshold)
public Triple<int[][][],int[],double[][][]> documentToDataAndLabels(List<IN> document)
document - Testing documents

public void printLabelInformation(String testFile, DocumentReaderAndWriter<IN> readerAndWriter) throws Exception
Throws: Exception
public Triple<int[][][][],int[][],double[][][][]> documentsToDataAndLabels(Collection<List<IN>> documents)
public List<Triple<int[][][],int[],double[][][]>> documentsToDataAndLabelsList(Collection<List<IN>> documents)
protected void printFeatures()
protected void makeAnswerArraysAndTagIndex(Collection<List<IN>> ob)
This routine builds the labelIndices, which give the empirically legal label sequences (of length (order) at most windowSize), and the classIndex, which indexes known answer classes.
ob - The training data: read from an ObjectBank, each item in it is a List<CoreLabel>.

public CRFDatum<List<String>,CRFLabel> makeDatum(List<IN> info, int loc, List<FeatureFactory<IN>> featureFactories)
info - The input data
loc - The position to build a datum at
featureFactories - The FeatureFactories to use to extract features

public void dumpFeatures(Collection<List<IN>> docs)
Does nothing by default.
Overrides: dumpFeatures in class AbstractSequenceClassifier<IN extends CoreMap>
public List<IN> classify(List<IN> document)
Classify a List of something that extends CoreMap. The classifications are added in place to the items of the document, which is also returned by this method.
Specified by: classify in class AbstractSequenceClassifier<IN extends CoreMap>
document - A List of something that extends CoreMap.
Returns: The same List, but with the elements annotated with their answers (stored under the CoreAnnotations.AnswerAnnotation key).

public SequenceModel getSequenceModel(List<IN> doc)
Specified by: getSequenceModel in class AbstractSequenceClassifier<IN extends CoreMap>
protected CliquePotentialFunction getCliquePotentialFunctionForTest()
public void updateWeightsForTest(double[] x)
public List<IN> classifyMaxEnt(List<IN> document)
Do standard sequence inference, using either Viterbi or Beam inference depending on the value of flags.inferenceType.
document - Document to classify. Classification happens in place. This document is modified.

public List<IN> classifyGibbs(List<IN> document) throws ClassNotFoundException, SecurityException, NoSuchMethodException, IllegalArgumentException, InstantiationException, IllegalAccessException, InvocationTargetException
public List<IN> classifyGibbs(List<IN> document, Triple<int[][][],int[],double[][][]> documentDataAndLabels) throws ClassNotFoundException, SecurityException, NoSuchMethodException, IllegalArgumentException, InstantiationException, IllegalAccessException, InvocationTargetException
public void printProbsDocument(List<IN> document)
Takes a List of something that extends CoreMap and prints the likelihood of each possible label at each point.
Specified by: printProbsDocument in class AbstractSequenceClassifier<IN extends CoreMap>
document - A List of something that extends CoreMap.

public void printFirstOrderProbs(String filename, DocumentReaderAndWriter<IN> readerAndWriter)
Takes the file, reads it in, and prints out the likelihood of each possible label at each point. See getCliqueTrees() for more.
filename - The path to the specified file

public void printFirstOrderProbsDocuments(ObjectBank<List<IN>> documents)
Takes a List of documents and prints the likelihood of each possible label at each point.

public void printFactorTable(String filename, DocumentReaderAndWriter<IN> readerAndWriter)
Takes the file, reads it in, and prints out the factor table at each position.
filename - The path to the specified file

public void printFactorTableDocuments(ObjectBank<List<IN>> documents)
Takes a List of documents and prints the factor table at each point.

public List<CRFCliqueTree<String>> getCliqueTrees(String filename, DocumentReaderAndWriter<IN> readerAndWriter)
Want to make arbitrary probability queries? Then this is the method for you.
public CRFCliqueTree<String> getCliqueTree(Triple<int[][][],int[],double[][][]> p)
public CRFCliqueTree<String> getCliqueTree(List<IN> document)
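As the overview notes, the clique-tree methods back arbitrary probability queries. The sketch below assumes (as suggested by printProbsDocument) that CRFCliqueTree exposes a per-position marginal via prob(position, labelIndex); verify that signature against the CRFCliqueTree API before relying on it.

```java
import java.util.List;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ie.crf.CRFCliqueTree;
import edu.stanford.nlp.ling.CoreLabel;

public class MarginalDemo {
  // Prints, for each token position, the marginal probability of the label
  // indexed 0. The classifier and document are assumed to be set up elsewhere.
  static void showMarginals(CRFClassifier<CoreLabel> classifier,
                            List<CoreLabel> document) {
    CRFCliqueTree<String> tree = classifier.getCliqueTree(document);
    for (int i = 0; i < document.size(); i++) {
      // Assumption: prob(position, labelIndex) is the node marginal
      double p = tree.prob(i, 0);
      System.out.printf("position %d: p(label 0) = %.4f%n", i, p);
    }
  }
}
```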
protected Collection<List<IN>> loadAuxiliaryData(Collection<List<IN>> docs, DocumentReaderAndWriter<IN> readerAndWriter)
public void train(Collection<List<IN>> objectBankWrapper, DocumentReaderAndWriter<IN> readerAndWriter)
Trains a classifier from a Collection of sequences.
Specified by: train in class AbstractSequenceClassifier<IN extends CoreMap>
objectBankWrapper - An ObjectBank or a collection of sequences of IN
readerAndWriter - A DocumentReaderAndWriter to use when loading test files

protected void pruneNodeFeatureIndices(int totalNumOfFeatureSlices, int numOfFeatureSlices)
protected CRFLogConditionalObjectiveFunction getObjectiveFunction(int[][][][] data, int[][] labels)
protected double[] trainWeights(int[][][][] data, int[][] labels, Evaluator[] evaluators, int pruneFeatureItr, double[][][][] featureVals)
public Minimizer<DiffFunction> getMinimizer()
public Minimizer<DiffFunction> getMinimizer(int featurePruneIteration, Evaluator[] evaluators)
protected List<CRFDatum<? extends Collection<String>,? extends CharSequence>> extractDatumSequence(int[][][] allData, int beginPosition, int endPosition, List<IN> labeledWordInfos)
protected void addProcessedData(List<List<CRFDatum<Collection<String>,String>>> processedData, int[][][][] data, int[][] labels, double[][][][] featureVals, int offset)
processedData - a List of Lists of CRFDatums

protected static List<List<CRFDatum<Collection<String>,String>>> loadProcessedData(String filename)

protected void loadTextClassifier(BufferedReader br) throws Exception
Throws: Exception

public void loadTextClassifier(String text, Properties props) throws ClassCastException, IOException, ClassNotFoundException, InstantiationException, IllegalAccessException

protected void serializeTextClassifier(PrintWriter pw) throws Exception
Throws: Exception

public void serializeTextClassifier(String serializePath)
Serialize the model to a human readable format.
serializePath - File to write text format of classifier to.

public void serializeClassIndex(String serializePath)
public void serializeWeights(String serializePath)
public static double[][] loadWeightsFromFile(String serializePath)
public void serializeFeatureIndex(String serializePath)
public void serializeClassifier(String serializePath)
Serialize a sequence classifier to a file on the given path.
Specified by: serializeClassifier in class AbstractSequenceClassifier<IN extends CoreMap>
serializePath - The path/filename to write the classifier to.

public void serializeClassifier(ObjectOutputStream oos)
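A serialize-then-reload round trip with the methods documented here might look like the following sketch (the model path is a placeholder):

```java
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class RoundTrip {
  // Writes the classifier to disk and reads it back via the static factory.
  static CRFClassifier<CoreLabel> reload(CRFClassifier<CoreLabel> crf)
      throws Exception {
    crf.serializeClassifier("model.ser.gz");             // placeholder path
    return CRFClassifier.getClassifier("model.ser.gz");  // deserialize
  }
}
```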
public void loadClassifier(ObjectInputStream ois, Properties props) throws ClassCastException, IOException, ClassNotFoundException
Note: This method does not close the ObjectInputStream. (But earlier versions of the code used to, so beware....)
Specified by: loadClassifier in class AbstractSequenceClassifier<IN extends CoreMap>
ois - The InputStream to load the serialized classifier from
props - This Properties object will be used to update the SeqClassifierFlags which are read from the serialized classifier
Throws:
ClassCastException - If there are problems interpreting the serialized data
IOException - If there are problems accessing the input stream
ClassNotFoundException - If there are problems interpreting the serialized data

public void loadDefaultClassifier()
public void loadTagIndex()
public void loadDefaultClassifier(Properties props)
public static <INN extends CoreMap> CRFClassifier<INN> getDefaultClassifier()
public static <INN extends CoreMap> CRFClassifier<INN> getDefaultClassifier(Properties props)
public static <INN extends CoreMap> CRFClassifier<INN> getJarClassifier(String resourceName, Properties props)
resourceName - Name of classifier resource inside the jar file.

public static <INN extends CoreMap> CRFClassifier<INN> getClassifier(File file) throws IOException, ClassCastException, ClassNotFoundException
file - File to load classifier from
Throws:
IOException - If there are problems accessing the input stream
ClassCastException - If there are problems interpreting the serialized data
ClassNotFoundException - If there are problems interpreting the serialized data

public static <INN extends CoreMap> CRFClassifier<INN> getClassifier(InputStream in) throws IOException, ClassCastException, ClassNotFoundException
in - InputStream to load classifier from
Throws:
IOException - If there are problems accessing the input stream
ClassCastException - If there are problems interpreting the serialized data
ClassNotFoundException - If there are problems interpreting the serialized data

public static <INN extends CoreMap> CRFClassifier<INN> getClassifierNoExceptions(String loadPath)
public static CRFClassifier<CoreLabel> getClassifier(String loadPath) throws IOException, ClassCastException, ClassNotFoundException
public static <INN extends CoreMap> CRFClassifier<INN> getClassifier(String loadPath, Properties props) throws IOException, ClassCastException, ClassNotFoundException
public static void main(String[] args) throws Exception
Throws: Exception
public List<IN> classifyWithGlobalInformation(List<IN> tokenSeq, CoreMap doc, CoreMap sent)
Classify a List of something that extends CoreMap, using as additional information whatever is stored in the document and sentence. This is needed for SUTime (NumberSequenceClassifier), which requires the document date to resolve relative dates.
Specified by: classifyWithGlobalInformation in class AbstractSequenceClassifier<IN extends CoreMap>
public void writeWeights(PrintStream p)