java.lang.Object
  edu.stanford.nlp.ie.AbstractSequenceClassifier<IN>
    edu.stanford.nlp.ie.crf.CRFClassifier<IN>
public class CRFClassifier<IN extends CoreMap>
Class for sequence classification using a Conditional Random Field (CRF) model.
The code supports several document formats, but when the standard
ColumnDocumentReaderAndWriter is used for training or testing models, input
files are expected to contain one token per line, with the columns indicating
things like the word, POS, chunk, and answer class. The default for
ColumnDocumentReaderAndWriter training data is 3-column input, with the columns
containing a word, its POS, and its gold class, but this can be changed via the
map property.
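As a rough illustration of the default three-column layout, the sketch below splits token-per-line text into word, POS, and gold-class fields. It uses only the Java standard library and is not the ColumnDocumentReaderAndWriter implementation. The class name, the Token holder, the whitespace delimiter, and the blank line as document separator are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: not part of Stanford NLP. */
final class ColumnFormatSketch {

    /** One input row: word, part-of-speech tag, and gold answer class. */
    static final class Token {
        final String word;
        final String pos;
        final String answer;

        Token(String word, String pos, String answer) {
            this.word = word;
            this.pos = pos;
            this.answer = answer;
        }
    }

    /**
     * Parses token-per-line text into documents.
     * Assumption: columns are whitespace-separated and a blank line ends a document.
     */
    static List<List<Token>> parse(String text) {
        List<List<Token>> documents = new ArrayList<>();
        List<Token> current = new ArrayList<>();
        for (String line : text.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                // A blank line closes the current document, if it has any tokens.
                if (!current.isEmpty()) {
                    documents.add(current);
                    current = new ArrayList<>();
                }
                continue;
            }
            String[] cols = trimmed.split("\\s+");
            if (cols.length != 3) {
                throw new IllegalArgumentException(
                    "Expected 3 columns but got " + cols.length + ": " + line);
            }
            current.add(new Token(cols[0], cols[1], cols[2]));
        }
        if (!current.isEmpty()) {
            documents.add(current);
        }
        return documents;
    }

    public static void main(String[] args) {
        String sample = "John NNP PERSON\nlives VBZ O\nin IN O\nParis NNP LOCATION\n\n"
            + "He PRP O\nsmiled VBD O\n";
        for (List<Token> doc : parse(sample)) {
            for (Token t : doc) {
                System.out.println(t.word + "\t" + t.pos + "\t" + t.answer);
            }
            System.out.println();
        }
    }
}
```

A different column order or count would need a different mapping from column index to field, which is the role the map property plays for the real reader.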
When a file is supplied with the -textFile flag, it is assumed to be plain
English text (or perhaps simple HTML/XML), and a reasonable attempt is made at
English tokenization by PlainTextDocumentReaderAndWriter. Extra options can be
supplied to the tokenizer using the -tokenizeOptions flag.
Typical command-line usage
For running a trained model with a provided serialized classifier on a text file:
java -mx500m edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier conll.ner.gz -textFile samplesentences.txt
When specifying all parameters in a properties file (train, test, or runtime):
java -mx1g edu.stanford.nlp.ie.crf.CRFClassifier -prop propFile
To train and test a simple NER model from the command line:
java -mx1000m edu.stanford.nlp.ie.crf.CRFClassifier -trainFile trainFile -testFile testFile -macro > output
To train with multiple files:
java -mx1000m edu.stanford.nlp.ie.crf.CRFClassifier -trainFileList file1,file2,... -testFile testFile -macro > output
To test on multiple files, use the -testFiles option and a comma-separated list.
Features are defined by a FeatureFactory. NERFeatureFactory is used by default,
and you should look there for feature templates and for the properties or flags
that cause particular features to be used when training an NER classifier.
There is also an edu.stanford.nlp.wordseg.SighanFeatureFactory, and various
successors such as edu.stanford.nlp.wordseg.ChineseSegmenterFeatureFactory,
which are used for Chinese word segmentation. Features are specified either by
a Properties file (the recommended method) or by flags on the command line. The
flags are read into a SeqClassifierFlags object, which the user need not be
concerned with unless adding new features.
CRFClassifier may also be used programmatically. When creating a new instance,
you must specify a Properties object. You may then call train methods to train
a classifier, or load a classifier. The other way to get a CRFClassifier is to
deserialize one via the static getClassifier(String) methods, which return a
deserialized classifier. You may then tag (classify the items of) documents
using either the assorted classify() methods of this class or the assorted
classify methods in AbstractSequenceClassifier.
Probabilities assigned by the CRF can be interrogated using either the
printProbsDocument() or getCliqueTrees() methods.
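Programmatic use starts from a Properties object. The sketch below builds one with java.util.Properties only. It assumes that property keys mirror the command-line flag names shown earlier (trainFile, testFile); the value syntax for map and the file names are likewise assumptions for illustration. The CRFClassifier call is left as a comment because it needs the Stanford NLP jar on the classpath, and CoreLabel as the token type is an assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Illustrative only: builds the Properties a CRFClassifier constructor would take. */
final class CrfPropertiesSketch {

    /** Parses properties-file text into a Properties object. */
    static Properties fromText(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            // A StringReader should not fail, but load() declares IOException.
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        String propFile =
              "# Assumed keys: they mirror the -trainFile and -testFile flags.\n"
            + "trainFile = train.tsv\n"
            + "testFile = test.tsv\n"
            + "# Assumed syntax for the column map: name=columnIndex pairs.\n"
            + "map = word=0,tag=1,answer=2\n";
        Properties props = fromText(propFile);
        props.list(System.out);

        // With the Stanford NLP jar available, the documented constructor
        // CRFClassifier(Properties props) could then be called (not executed here):
        //   CRFClassifier<CoreLabel> crf = new CRFClassifier<>(props);
    }
}
```

Keeping the settings in a properties file and loading them this way matches the recommendation above to prefer a Properties file over command-line flags.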
Nested Class Summary | |
---|---|
static class |
CRFClassifier.TestSequenceModel
|
Field Summary | |
---|---|
static String |
DEFAULT_CLASSIFIER
Name of default serialized classifier resource to look for in a jar file. |
Fields inherited from class edu.stanford.nlp.ie.AbstractSequenceClassifier |
---|
classIndex, featureFactory, flags, knownLCWords, pad, windowSize |
Constructor Summary | |
---|---|
protected |
CRFClassifier()
|
|
CRFClassifier(CRFClassifier<IN> crf)
Makes a copy of the crf classifier |
|
CRFClassifier(Properties props)
|
|
CRFClassifier(SeqClassifierFlags flags)
|
Method Summary | ||
---|---|---|
protected void |
addProcessedData(List<List<CRFDatum<Collection<String>,String>>> processedData,
int[][][][] data,
int[][] labels,
int offset)
Adds the List of Lists of CRFDatums to the data and labels arrays, treating each datum as if it were its own document. |
|
protected static Index<CRFLabel> |
allLabels(int window,
Index classIndex)
|
|
List<IN> |
classify(List<IN> document)
Classify a List of something that extends CoreMap. |
|
void |
classifyAndWriteAnswers(Collection<List<IN>> documents,
List<Pair<int[][][],int[]>> documentDataAndLabels,
PrintWriter printWriter,
DocumentReaderAndWriter<IN> readerAndWriter)
|
|
List<IN> |
classifyGibbs(List<IN> document)
|
|
List<IN> |
classifyGibbs(List<IN> document,
Pair<int[][][],int[]> documentDataAndLabels)
|
|
List<IN> |
classifyGibbsUsingPrior(List<IN> sentence,
SequenceModel[] priorModels,
SequenceListener[] priorListeners,
double[] modelWts)
|
|
List<IN> |
classifyGibbsUsingPrior(List<IN> sentence,
SequenceModel priorModel,
SequenceListener priorListener,
double model1Wt,
double model2Wt)
|
|
List<IN> |
classifyMaxEnt(List<IN> document)
Do standard sequence inference, using either Viterbi or Beam inference depending on the value of flags.inferenceType. |
|
List<IN> |
classifyWithGlobalInformation(List<IN> tokenSeq,
CoreMap doc,
CoreMap sent)
Classify a List of something that extends CoreMap using as
additional information whatever is stored in the document and sentence. |
|
void |
combine(CRFClassifier<IN> crf,
double weight)
Combines weighted crf with this crf |
|
Pair<int[][][][],int[][]> |
documentsToDataAndLabels(Collection<List<IN>> documents)
Convert an ObjectBank to arrays of data features and labels. |
|
List<Pair<int[][][],int[]>> |
documentsToDataAndLabelsList(Collection<List<IN>> documents)
Convert an ObjectBank to corresponding collection of data features and labels. |
|
Pair<int[][][],int[]> |
documentToDataAndLabels(List<IN> document)
Convert a document List into arrays storing the data features and labels. |
|
void |
dropFeaturesBelowThreshold(double threshold)
|
|
protected List<CRFDatum> |
extractDatumSequence(int[][][] allData,
int beginPosition,
int endPosition,
List<IN> labeledWordInfos)
Creates a new CRFDatum from the preprocessed allData format, given the document number, position number, and a List of Object labels. |
|
static <IN extends CoreMap> CRFClassifier<IN> |
getClassifier(File file)
Loads a CRF classifier from a filepath, and returns it. |
|
static CRFClassifier |
getClassifier(InputStream in)
Loads a CRF classifier from an InputStream, and returns it. |
|
static CRFClassifier |
getClassifier(String loadPath)
|
|
static CRFClassifier |
getClassifier(String loadPath,
Properties props)
|
|
static CRFClassifier |
getClassifierNoExceptions(String loadPath)
|
|
CRFCliqueTree |
getCliqueTree(List<IN> document)
|
|
List<CRFCliqueTree> |
getCliqueTrees(String filename,
DocumentReaderAndWriter<IN> readerAndWriter)
Want to make arbitrary probability queries? Then this is the method for you. |
|
static <IN extends CoreMap> CRFClassifier<IN> |
getDefaultClassifier()
Used to get the default supplied classifier inside the jar file. |
|
static <IN extends CoreMap> CRFClassifier<IN> |
getDefaultClassifier(Properties props)
Used to get the default supplied classifier inside the jar file. |
|
static <IN extends CoreMap> CRFClassifier<IN> |
getJarClassifier(String resourceName,
Properties props)
Used to load a classifier stored as a resource inside a jar file. |
|
protected Minimizer |
getMinimizer()
|
|
protected Minimizer |
getMinimizer(int featurePruneIteration,
Evaluator[] evaluators)
|
|
int |
getNumWeights()
Returns the total number of weights associated with this classifier. |
|
SequenceModel |
getSequenceModel(List<IN> doc)
|
|
SequenceModel |
getSequenceModel(List<IN> doc,
Pair<int[][][],int[]> documentDataAndLabels)
|
|
void |
loadClassifier(ObjectInputStream ois,
Properties props)
Loads a classifier from the specified InputStream. |
|
void |
loadDefaultClassifier()
This is used to load the default supplied classifier stored within the jar file. |
|
void |
loadDefaultClassifier(Properties props)
This is used to load the default supplied classifier stored within the jar file. |
|
protected static List |
loadProcessedData(String filename)
|
|
void |
loadTextClassifier(String text,
Properties props)
|
|
static void |
main(String[] args)
The main method. |
|
protected void |
makeAnswerArraysAndTagIndex(Collection<List<IN>> ob)
This routine builds the labelIndices, which give the
empirically legal label sequences (of length (order) at most
windowSize), and the classIndex, which indexes
known answer classes. |
|
CRFDatum<List<String>,CRFLabel> |
makeDatum(List<IN> info,
int loc,
FeatureFactory<IN> featureFactory)
Makes a CRFDatum by producing features and a label from input data at a specific position, using the provided factory. |
|
protected void |
printFeatures()
|
|
void |
printFirstOrderProbs(String filename,
DocumentReaderAndWriter<IN> readerAndWriter)
Takes the file, reads it in, and prints out the likelihood of each possible label at each point. |
|
void |
printFirstOrderProbsDocument(List<IN> document)
Takes a List of something that extends CoreMap and prints
the likelihood of each possible label at each point. |
|
void |
printFirstOrderProbsDocuments(ObjectBank<List<IN>> documents)
Takes a List of documents and prints the likelihood of each
possible label at each point. |
|
void |
printLabelInformation(String testFile,
DocumentReaderAndWriter<IN> readerAndWriter)
|
|
void |
printLabelValue(List<IN> document)
|
|
void |
printProbsDocument(List<IN> document)
Takes a List of something that extends CoreMap and prints
the likelihood of each possible label at each point. |
|
protected static void |
saveProcessedData(List datums,
String filename)
|
|
void |
scaleWeights(double scale)
Scales the weights of this CRFClassifier by the specified scale factor. |
|
void |
serializeClassifier(String serializePath)
Serialize a sequence classifier to a file on the given path. |
|
void |
serializeTextClassifier(String serializePath)
Serialize the model to a human readable format. |
|
void |
train(Collection<List<IN>> docs,
DocumentReaderAndWriter<IN> readerAndWriter)
Train a classifier from documents. |
|
void |
writeWeights(PrintStream p)
|
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final String DEFAULT_CLASSIFIER
Constructor Detail |
---|
protected CRFClassifier()
public CRFClassifier(Properties props)
public CRFClassifier(SeqClassifierFlags flags)
public CRFClassifier(CRFClassifier<IN> crf)
Method Detail |
---|
public int getNumWeights()
public void scaleWeights(double scale)
Parameters:
scale -
public void combine(CRFClassifier<IN> crf, double weight)
Parameters:
crf -
weight -
public void dropFeaturesBelowThreshold(double threshold)
public Pair<int[][][],int[]> documentToDataAndLabels(List<IN> document)
Parameters:
document - Training documents
public void printLabelInformation(String testFile, DocumentReaderAndWriter<IN> readerAndWriter) throws Exception
Throws:
Exception
public void printLabelValue(List<IN> document)
public Pair<int[][][][],int[][]> documentsToDataAndLabels(Collection<List<IN>> documents)
public List<Pair<int[][][],int[]>> documentsToDataAndLabelsList(Collection<List<IN>> documents)
protected void printFeatures()
protected void makeAnswerArraysAndTagIndex(Collection<List<IN>> ob)
This routine builds the labelIndices, which give the empirically legal label
sequences (of length (order) at most windowSize), and the classIndex, which
indexes known answer classes.
Parameters:
ob - The training data: read from an ObjectBank, each item in it is a List
protected static Index<CRFLabel> allLabels(int window, Index classIndex)
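To make the idea behind makeAnswerArraysAndTagIndex concrete ("empirically legal label sequences" of length at most windowSize), the following sketch collects every label n-gram, for n up to a window size, that actually occurs in some training label sequences. It is a standalone illustration and not the library's implementation. The class and method names, and the use of plain strings for labels, are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: not part of Stanford NLP. */
final class LegalLabelSequencesSketch {

    /**
     * Returns, for each length n in 1..windowSize, the set of label n-grams
     * observed in the training sequences. Entry 0 of the result holds the
     * unigrams, which play the role of a class index in this sketch.
     */
    static List<Set<List<String>>> observedLabelNGrams(
            List<List<String>> labelSequences, int windowSize) {
        List<Set<List<String>>> byLength = new ArrayList<>();
        for (int n = 1; n <= windowSize; n++) {
            Set<List<String>> seen = new LinkedHashSet<>();
            for (List<String> seq : labelSequences) {
                for (int start = 0; start + n <= seq.size(); start++) {
                    // Copy the window so the stored n-gram does not alias the input list.
                    seen.add(new ArrayList<>(seq.subList(start, start + n)));
                }
            }
            byLength.add(seen);
        }
        return byLength;
    }

    public static void main(String[] args) {
        List<List<String>> training = Arrays.asList(
            Arrays.asList("O", "PERSON", "PERSON", "O"),
            Arrays.asList("O", "LOCATION", "O"));
        List<Set<List<String>>> legal = observedLabelNGrams(training, 2);
        System.out.println("Unigrams: " + legal.get(0));
        System.out.println("Bigrams:  " + legal.get(1));
    }
}
```

In this toy data the bigram (PERSON, LOCATION) never occurs, so it would not count as empirically legal even though both labels are known classes.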
public CRFDatum<List<String>,CRFLabel> makeDatum(List<IN> info, int loc, FeatureFactory<IN> featureFactory)
Parameters:
info - The input data
loc - The position to build a datum at
featureFactory - The FeatureFactory to use to extract features
public List<IN> classify(List<IN> document)
Description copied from class: AbstractSequenceClassifier
Classify a List of something that extends CoreMap.
The classifications are added in place to the items of the document,
which is also returned by this method.
classify in class AbstractSequenceClassifier<IN extends CoreMap>
Parameters:
document - A List of something that extends CoreMap.
Returns:
The same List, but with the elements annotated with their
answers (stored under the CoreAnnotations.AnswerAnnotation key).
public void classifyAndWriteAnswers(Collection<List<IN>> documents, List<Pair<int[][][],int[]>> documentDataAndLabels, PrintWriter printWriter, DocumentReaderAndWriter<IN> readerAndWriter) throws IOException
Throws:
IOException
public SequenceModel getSequenceModel(List<IN> doc)
getSequenceModel in class AbstractSequenceClassifier<IN extends CoreMap>
public SequenceModel getSequenceModel(List<IN> doc, Pair<int[][][],int[]> documentDataAndLabels)
public List<IN> classifyMaxEnt(List<IN> document)
Do standard sequence inference, using either Viterbi or Beam inference
depending on the value of flags.inferenceType.
Parameters:
document - Document to classify. Classification happens in place. This
document is modified.
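The method summary describes classifyMaxEnt as running Viterbi or beam inference. As a reminder of what first-order Viterbi decoding does, here is a compact decoder over log-scores. It is a generic textbook sketch and not the code path classifyMaxEnt uses. The score arrays (emit, trans) and all names are hypothetical.

```java
import java.util.Arrays;

/** Illustrative only: generic first-order Viterbi, not Stanford NLP code. */
final class ViterbiSketch {

    /**
     * Finds the highest-scoring label sequence.
     * emit[t][y]  : log-score of label y at position t
     * trans[p][y] : log-score of moving from label p to label y
     */
    static int[] decode(double[][] emit, double[][] trans) {
        int length = emit.length;
        if (length == 0) {
            return new int[0];
        }
        int numLabels = emit[0].length;
        double[][] best = new double[length][numLabels];
        int[][] backPointer = new int[length][numLabels];

        // Position 0 has no predecessor, so its scores are the emission scores.
        System.arraycopy(emit[0], 0, best[0], 0, numLabels);
        for (int t = 1; t < length; t++) {
            for (int y = 0; y < numLabels; y++) {
                double bestScore = Double.NEGATIVE_INFINITY;
                int bestPrev = 0;
                for (int p = 0; p < numLabels; p++) {
                    double score = best[t - 1][p] + trans[p][y];
                    if (score > bestScore) {
                        bestScore = score;
                        bestPrev = p;
                    }
                }
                best[t][y] = bestScore + emit[t][y];
                backPointer[t][y] = bestPrev;
            }
        }

        // Pick the best final label, then follow back-pointers to the start.
        int last = 0;
        for (int y = 1; y < numLabels; y++) {
            if (best[length - 1][y] > best[length - 1][last]) {
                last = y;
            }
        }
        int[] path = new int[length];
        path[length - 1] = last;
        for (int t = length - 1; t > 0; t--) {
            path[t - 1] = backPointer[t][path[t]];
        }
        return path;
    }

    public static void main(String[] args) {
        double[][] emit = { {0.0, -2.0}, {-1.0, -1.2}, {-2.0, 0.0} };
        double[][] trans = { {-0.5, -1.5}, {-1.5, -0.5} };
        System.out.println(Arrays.toString(decode(emit, trans)));
    }
}
```

Viterbi is exact for a chain of this order; beam inference trades that guarantee for speed by keeping only a limited number of partial sequences at each position.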
public List<IN> classifyGibbs(List<IN> document) throws ClassNotFoundException, SecurityException, NoSuchMethodException, IllegalArgumentException, InstantiationException, IllegalAccessException, InvocationTargetException
Throws:
ClassNotFoundException
SecurityException
NoSuchMethodException
IllegalArgumentException
InstantiationException
IllegalAccessException
InvocationTargetException
public List<IN> classifyGibbs(List<IN> document, Pair<int[][][],int[]> documentDataAndLabels) throws ClassNotFoundException, SecurityException, NoSuchMethodException, IllegalArgumentException, InstantiationException, IllegalAccessException, InvocationTargetException
Throws:
ClassNotFoundException
SecurityException
NoSuchMethodException
IllegalArgumentException
InstantiationException
IllegalAccessException
InvocationTargetException
public List<IN> classifyGibbsUsingPrior(List<IN> sentence, SequenceModel[] priorModels, SequenceListener[] priorListeners, double[] modelWts) throws ClassNotFoundException, SecurityException, NoSuchMethodException, IllegalArgumentException, InstantiationException, IllegalAccessException, InvocationTargetException
Parameters:
sentence -
priorModels - an array of prior models
priorListeners - an array of prior listeners
modelWts - an array of model weights. IMPORTANT: this includes the weight of
the CRF classifier as well, at position 0, and is therefore longer than the
priorListeners/priorModels arrays by 1.
Throws:
ClassNotFoundException
SecurityException
NoSuchMethodException
IllegalArgumentException
InstantiationException
IllegalAccessException
InvocationTargetException
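The length rule for modelWts in the array-based classifyGibbsUsingPrior is easy to get wrong, so a small guard like the one below may be useful before making the call. This is a hypothetical helper, not part of the library. Plain Object arrays stand in for SequenceModel[] and SequenceListener[] so that the sketch compiles without the Stanford NLP jar, and the requirement that the two prior arrays have equal length is an assumption.

```java
/** Illustrative only: a hypothetical pre-call check, not Stanford NLP code. */
final class PriorWeightsCheck {

    /**
     * Verifies that modelWts has exactly one more entry than the prior arrays.
     * Per the documentation, position 0 holds the CRF's own weight.
     * Assumption: the remaining entries line up with the prior models in order.
     */
    static void check(Object[] priorModels, Object[] priorListeners, double[] modelWts) {
        if (priorModels.length != priorListeners.length) {
            throw new IllegalArgumentException(
                "priorModels has " + priorModels.length
                + " entries but priorListeners has " + priorListeners.length);
        }
        if (modelWts.length != priorModels.length + 1) {
            throw new IllegalArgumentException(
                "modelWts must have " + (priorModels.length + 1)
                + " entries (CRF weight first), but has " + modelWts.length);
        }
    }

    public static void main(String[] args) {
        Object[] priorModels = new Object[2];
        Object[] priorListeners = new Object[2];
        double[] modelWts = {1.0, 0.5, 0.5}; // CRF weight, then one weight per prior
        check(priorModels, priorListeners, modelWts);
        System.out.println("Weights array has the expected length.");
    }
}
```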
public List<IN> classifyGibbsUsingPrior(List<IN> sentence, SequenceModel priorModel, SequenceListener priorListener, double model1Wt, double model2Wt) throws ClassNotFoundException, SecurityException, NoSuchMethodException, IllegalArgumentException, InstantiationException, IllegalAccessException, InvocationTargetException
Throws:
ClassNotFoundException
SecurityException
NoSuchMethodException
IllegalArgumentException
InstantiationException
IllegalAccessException
InvocationTargetException
public void printProbsDocument(List<IN> document)
Takes a List of something that extends CoreMap and prints
the likelihood of each possible label at each point.
printProbsDocument in class AbstractSequenceClassifier<IN extends CoreMap>
Parameters:
document - A List of something that extends CoreMap.
public void printFirstOrderProbs(String filename, DocumentReaderAndWriter<IN> readerAndWriter)
Takes the file, reads it in, and prints out the likelihood of each possible
label at each point. See getCliqueTrees() for more.
Parameters:
filename - The path to the specified file
public void printFirstOrderProbsDocuments(ObjectBank<List<IN>> documents)
Takes a List of documents and prints the likelihood of each
possible label at each point.
Parameters:
documents - A List of List of INs.
public List<CRFCliqueTree> getCliqueTrees(String filename, DocumentReaderAndWriter<IN> readerAndWriter)
public CRFCliqueTree getCliqueTree(List<IN> document)
public void printFirstOrderProbsDocument(List<IN> document)
Takes a List of something that extends CoreMap and prints
the likelihood of each possible label at each point.
Parameters:
document - A List of something that extends CoreMap.
public void train(Collection<List<IN>> docs, DocumentReaderAndWriter<IN> readerAndWriter)
train in class AbstractSequenceClassifier<IN extends CoreMap>
Parameters:
docs - An ObjectBank representation of documents. Changed this type from
ObjectBank to Collection for generality (mihai)
readerAndWriter - A DocumentReaderAndWriter to use when loading test files
protected Minimizer getMinimizer()
protected Minimizer getMinimizer(int featurePruneIteration, Evaluator[] evaluators)
protected List<CRFDatum> extractDatumSequence(int[][][] allData, int beginPosition, int endPosition, List<IN> labeledWordInfos)
protected void addProcessedData(List<List<CRFDatum<Collection<String>,String>>> processedData, int[][][][] data, int[][] labels, int offset)
Parameters:
processedData - a List of Lists of CRFDatums
protected static void saveProcessedData(List datums, String filename)
protected static List loadProcessedData(String filename)
public void loadTextClassifier(String text, Properties props) throws ClassCastException, IOException, ClassNotFoundException, InstantiationException, IllegalAccessException
Throws:
ClassCastException
IOException
ClassNotFoundException
InstantiationException
IllegalAccessException
public void serializeTextClassifier(String serializePath)
Parameters:
serializePath - File to write text format of classifier to.
public void serializeClassifier(String serializePath)
serializeClassifier in class AbstractSequenceClassifier<IN extends CoreMap>
Parameters:
serializePath - The path/filename to write the classifier to.
public void loadClassifier(ObjectInputStream ois, Properties props) throws ClassCastException, IOException, ClassNotFoundException
Loads a classifier from the specified InputStream.
Note: This method does not close the ObjectInputStream. (Earlier versions of the code did, so beware.)
loadClassifier in class AbstractSequenceClassifier<IN extends CoreMap>
Parameters:
ois - The InputStream to load the serialized classifier from
props - This Properties object will be used to update the
SeqClassifierFlags which are read from the serialized classifier
Throws:
ClassCastException - If there are problems interpreting the serialized data
IOException - If there are problems accessing the input stream
ClassNotFoundException - If there are problems interpreting the serialized data
public void loadDefaultClassifier()
public void loadDefaultClassifier(Properties props)
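Because loadClassifier(ObjectInputStream, Properties) is documented as leaving the ObjectInputStream open, the caller is responsible for closing it. The sketch below shows that ownership pattern with try-with-resources, using only the standard library. Here readModel is a hypothetical stand-in for a loader that, like loadClassifier, reads from the stream without closing it, and a serialized String stands in for a real model.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

/** Illustrative only: caller-owned stream handling, not Stanford NLP code. */
final class StreamOwnershipSketch {

    /** Hypothetical loader: reads one object and deliberately leaves the stream open. */
    static Object readModel(ObjectInputStream ois) throws IOException, ClassNotFoundException {
        return ois.readObject();
    }

    /** Serializes a placeholder "model" to bytes so the example needs no files. */
    static byte[] serialize(Object model) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
            oos.writeObject(model);
        }
        return bytes.toByteArray();
    }

    /** The caller opens the stream, so the caller closes it via try-with-resources. */
    static Object loadAndClose(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return readModel(ois);
        }
    }

    public static void main(String[] args) throws Exception {
        Object model = loadAndClose(serialize("placeholder model"));
        System.out.println(model);
    }
}
```

With the real classifier, the same shape would apply: open the stream in a try-with-resources block, pass it to loadClassifier, and let the block close it.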
public static <IN extends CoreMap> CRFClassifier<IN> getDefaultClassifier()
public static <IN extends CoreMap> CRFClassifier<IN> getDefaultClassifier(Properties props)
public static <IN extends CoreMap> CRFClassifier<IN> getJarClassifier(String resourceName, Properties props)
Parameters:
resourceName - Name of classifier resource inside the jar file.
public static <IN extends CoreMap> CRFClassifier<IN> getClassifier(File file) throws IOException, ClassCastException, ClassNotFoundException
Parameters:
file - File to load classifier from
Throws:
IOException - If there are problems accessing the input stream
ClassCastException - If there are problems interpreting the serialized data
ClassNotFoundException - If there are problems interpreting the serialized data
public static CRFClassifier getClassifier(InputStream in) throws IOException, ClassCastException, ClassNotFoundException
Parameters:
in - InputStream to load classifier from
Throws:
IOException - If there are problems accessing the input stream
ClassCastException - If there are problems interpreting the serialized data
ClassNotFoundException - If there are problems interpreting the serialized data
public static CRFClassifier getClassifierNoExceptions(String loadPath)
public static CRFClassifier getClassifier(String loadPath) throws IOException, ClassCastException, ClassNotFoundException
Throws:
IOException
ClassCastException
ClassNotFoundException
public static CRFClassifier getClassifier(String loadPath, Properties props) throws IOException, ClassCastException, ClassNotFoundException
Throws:
IOException
ClassCastException
ClassNotFoundException
public static void main(String[] args) throws Exception
Throws:
Exception
public List<IN> classifyWithGlobalInformation(List<IN> tokenSeq, CoreMap doc, CoreMap sent)
Description copied from class: AbstractSequenceClassifier
Classify a List of something that extends CoreMap using as
additional information whatever is stored in the document and sentence.
This is needed for SUTime (NumberSequenceClassifier), which requires
the document date to resolve relative dates.
classifyWithGlobalInformation in class AbstractSequenceClassifier<IN extends CoreMap>
public void writeWeights(PrintStream p)