java.lang.Object
  edu.stanford.nlp.ie.AbstractSequenceClassifier
    edu.stanford.nlp.ie.crf.CRFClassifier
public class CRFClassifier
Class for sequence classification using a Conditional Random Field model.
The code has functionality for different document formats, but when using the standard ColumnDocumentReaderAndWriter for training or testing models, input files are expected to be one token per line, with the columns indicating things like the word, POS, chunk, and answer class. The default for ColumnDocumentReaderAndWriter training data is 3-column input, with the columns containing a word, its POS, and its gold class, but this can be overridden via the map property.
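As a rough illustration of the column layout described above, the sketch below parses one-token-per-line text into named fields using a column map. It is not the ColumnDocumentReaderAndWriter implementation: the class and method names are hypothetical, the map syntax `word=0,tag=1,answer=2` and the field names in it are assumptions, the sample labels are made up, and blank lines are simply skipped here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the library: reads column-formatted tokens.
class ColumnFormatSketch {

    /** Parses a spec such as "word=0,tag=1,answer=2" (assumed syntax) into field name -> column index. */
    static Map<String, Integer> parseMap(String spec) {
        Map<String, Integer> columns = new LinkedHashMap<>();
        for (String entry : spec.split(",")) {
            String[] keyValue = entry.trim().split("=");
            columns.put(keyValue[0].trim(), Integer.parseInt(keyValue[1].trim()));
        }
        return columns;
    }

    /** Turns each non-blank line into a field name -> value map, one map per token. */
    static List<Map<String, String>> readTokens(String text, Map<String, Integer> columns) {
        List<Map<String, String>> tokens = new ArrayList<>();
        for (String line : text.split("\n")) {
            if (line.trim().isEmpty()) {
                continue; // simplification: blank lines are ignored in this sketch
            }
            String[] cells = line.trim().split("\\s+");
            Map<String, String> token = new LinkedHashMap<>();
            for (Map.Entry<String, Integer> column : columns.entrySet()) {
                token.put(column.getKey(), cells[column.getValue()]);
            }
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Three columns per line: word, POS, gold class (illustrative data).
        String data = "Stanford NNP ORGANIZATION\nis VBZ O\nin IN O\nCalifornia NNP LOCATION\n";
        Map<String, Integer> columns = parseMap("word=0,tag=1,answer=2");
        for (Map<String, String> token : readTokens(data, columns)) {
            System.out.println(token);
        }
    }
}
```

Changing the map string is all that is needed to describe a different column order in this sketch.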
When a file is supplied with -textFile, the file is assumed to be plain English text (or perhaps simple HTML/XML), and a reasonable attempt is made at English tokenization by PlainTextDocumentReaderAndWriter.
Typical command-line usage
For running a trained model with a provided serialized classifier on a text file:
java -mx500m edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier conll.ner.gz -textFile samplesentences.txt
When specifying all parameters in a properties file (train, test, or runtime):
java -mx1g edu.stanford.nlp.ie.crf.CRFClassifier -prop propFile
To train and test a simple NER model from the command line:
java -mx1000m edu.stanford.nlp.ie.crf.CRFClassifier -trainFile trainFile -testFile testFile -macro > output
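If the classifier has to be launched from another Java program rather than from a shell, the first command above can be assembled as an argument list. The sketch below only builds and prints the command; the class and method names are hypothetical, and actually starting the process assumes the classifier is on the classpath of the child JVM.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the library: assembles the tagging command line.
class CrfCommandSketch {

    /** Builds the argument list for running a serialized classifier on a text file. */
    static List<String> taggingCommand(String heap, String model, String textFile) {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.add("-mx" + heap);
        command.add("edu.stanford.nlp.ie.crf.CRFClassifier");
        command.add("-loadClassifier");
        command.add(model);
        command.add("-textFile");
        command.add(textFile);
        return command;
    }

    public static void main(String[] args) {
        List<String> command = taggingCommand("500m", "conll.ner.gz", "samplesentences.txt");
        System.out.println(String.join(" ", command));
        // To launch it, one option is:
        // new ProcessBuilder(command).inheritIO().start().waitFor();
    }
}
```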
Features are defined by a FeatureFactory. NERFeatureFactory is used by default, and you should look there for feature templates and for the properties or flags that cause particular features to be used when training an NER classifier. There is also an edu.stanford.nlp.wordseg.SighanFeatureFactory, along with various successors such as edu.stanford.nlp.wordseg.ChineseSegmenterFeatureFactory, which are used for Chinese word segmentation.
Features are specified either by a Properties file (the recommended method) or by flags on the command line. The flags are read into a SeqClassifierFlags object, which the user need not be concerned with unless wishing to add new features.
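A properties file of this kind can be produced with the standard java.util.Properties class, as in the minimal sketch below. The keys trainFile, testFile, and map mirror options mentioned on this page; the feature flags useWord, usePrev, and useNext and the map value are assumptions to check against NERFeatureFactory, and the class name is hypothetical.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.util.Properties;

// Hypothetical helper, not part of the library: builds an illustrative property set.
class CrfPropertiesSketch {

    /** Collects training options into a Properties object. */
    static Properties trainingProperties(String trainFile, String testFile) {
        Properties props = new Properties();
        props.setProperty("trainFile", trainFile);
        props.setProperty("testFile", testFile);
        // Assumed syntax for the column map: word, POS tag, gold class.
        props.setProperty("map", "word=0,tag=1,answer=2");
        // Assumed feature flag names; verify them in NERFeatureFactory before relying on them.
        props.setProperty("useWord", "true");
        props.setProperty("usePrev", "true");
        props.setProperty("useNext", "true");
        return props;
    }

    public static void main(String[] args) throws IOException {
        Properties props = trainingProperties("trainFile", "testFile");
        StringWriter out = new StringWriter();
        // store() writes key=value lines that could be saved as the file passed to -prop.
        props.store(out, "CRFClassifier properties (illustrative)");
        System.out.print(out);
    }
}
```

The same Properties object is the kind of argument the CRFClassifier(Properties props) constructor expects.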
CRFClassifier may also be used programmatically. When creating a new instance, you must specify a Properties object. You may then call train methods to train a classifier, or load a classifier. The other way to get a CRFClassifier is to deserialize one via the static getClassifier(String) methods, which return a deserialized classifier. You may then tag (classify the items of) documents using either classify() or the assorted classify methods in AbstractSequenceClassifier.
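With the library on the compile classpath, deserializing is normally a direct call to CRFClassifier.getClassifier(loadPath). The sketch below instead looks up that static method reflectively, so that it compiles and runs whether or not the jar is present; the class name CrfReflectiveLoadSketch and the method tryLoad are hypothetical, and the model path is just the file name used in the command-line examples.

```java
import java.lang.reflect.Method;
import java.util.Optional;

// Hypothetical helper: loads a classifier only if the library is available at runtime.
class CrfReflectiveLoadSketch {

    static final String CLASS_NAME = "edu.stanford.nlp.ie.crf.CRFClassifier";

    /** Returns the deserialized classifier, or empty if the class or the model cannot be loaded. */
    static Optional<Object> tryLoad(String loadPath) {
        try {
            Class<?> crfClass = Class.forName(CLASS_NAME);
            // Static factory documented on this page: getClassifier(String loadPath).
            Method getClassifier = crfClass.getMethod("getClassifier", String.class);
            return Optional.ofNullable(getClassifier.invoke(null, loadPath));
        } catch (ReflectiveOperationException e) {
            // Covers a missing class or method, and checked exceptions thrown while loading.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String loadPath = args.length > 0 ? args[0] : "conll.ner.gz";
        Optional<Object> classifier = tryLoad(loadPath);
        System.out.println(classifier.isPresent() ? "classifier loaded" : "classifier not available");
    }
}
```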
Probabilities assigned by the CRF can be interrogated using either the printProbsDocument() or getCliqueTrees() methods.
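To make "the likelihood of each possible label at each point" concrete, the sketch below computes per-position label marginals for a toy linear chain with the forward-backward recursions. It is a generic textbook illustration, not the code behind printProbsDocument() or getCliqueTrees(); the potentials are made up, the names are hypothetical, and there is no scaling against underflow, so it suits short sequences only.

```java
import java.util.Arrays;

// Hypothetical illustration: label marginals on a toy first-order chain.
class ChainMarginalsSketch {

    /**
     * @param node node[t][y] is the non-negative potential of label y at position t
     * @param edge edge[a][b] is the non-negative potential of label a followed by label b
     * @return marginals[t][y], the probability that position t carries label y
     */
    static double[][] marginals(double[][] node, double[][] edge) {
        int n = node.length; // assumes at least one position
        int k = edge.length;
        double[][] forward = new double[n][k];
        double[][] backward = new double[n][k];
        for (int y = 0; y < k; y++) {
            forward[0][y] = node[0][y];
            backward[n - 1][y] = 1.0;
        }
        // Forward pass: total weight of all prefixes ending in label y at position t.
        for (int t = 1; t < n; t++) {
            for (int y = 0; y < k; y++) {
                double sum = 0.0;
                for (int prev = 0; prev < k; prev++) {
                    sum += forward[t - 1][prev] * edge[prev][y];
                }
                forward[t][y] = sum * node[t][y];
            }
        }
        // Backward pass: total weight of all suffixes that follow label y at position t.
        for (int t = n - 2; t >= 0; t--) {
            for (int y = 0; y < k; y++) {
                double sum = 0.0;
                for (int next = 0; next < k; next++) {
                    sum += edge[y][next] * node[t + 1][next] * backward[t + 1][next];
                }
                backward[t][y] = sum;
            }
        }
        // Normalizer: total weight of all label sequences.
        double z = 0.0;
        for (int y = 0; y < k; y++) {
            z += forward[n - 1][y];
        }
        double[][] result = new double[n][k];
        for (int t = 0; t < n; t++) {
            for (int y = 0; y < k; y++) {
                result[t][y] = forward[t][y] * backward[t][y] / z;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] node = {{1, 2}, {3, 1}, {1, 1}};
        double[][] edge = {{2, 1}, {1, 3}};
        for (double[] row : marginals(node, edge)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```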
Nested Class Summary | |
---|---|
static class | CRFClassifier.TestSequenceModel |
Field Summary | |
---|---|
static String | DEFAULT_CLASSIFIER - Name of default serialized classifier resource to look for in a jar file. |
Fields inherited from class edu.stanford.nlp.ie.AbstractSequenceClassifier |
---|
classIndex, featureFactory, flags, JAR_CLASSIFIER_PATH, knownLCWords, pad, readerAndWriter, windowSize |
Constructor Summary | |
---|---|
protected | CRFClassifier() |
public | CRFClassifier(Properties props) |
Method Summary | |
---|---|
protected void | addProcessedData(List<List<CRFDatum>> processedData, int[][][][] data, int[][] labels, int offset) - Adds the List of Lists of CRFDatums to the data and labels arrays, treating each datum as if it were its own document. |
protected static Index<CRFLabel> | allLabels(int window, Index classIndex) |
List<CoreLabel> | classify(List<CoreLabel> document) - Classify a List of CoreLabels. |
List<CoreLabel> | classifyGibbs(List<CoreLabel> document) |
List<CoreLabel> | classifyMaxEnt(List<CoreLabel> document) - Do standard sequence inference, using either Viterbi or Beam inference depending on the value of flags.inferenceType. |
Pair<int[][][][],int[][]> | documentsToDataAndLabels(ObjectBank<List<CoreLabel>> documents) - Convert an ObjectBank to arrays of data features and labels. |
Pair<int[][][],int[]> | documentToDataAndLabels(List<? extends CoreLabel> document) - Convert a document List into arrays storing the data features and labels. |
void | dropFeaturesBelowThreshold(double threshold) |
protected List<CRFDatum> | extractDatumSequence(int[][][] allData, int beginPosition, int endPosition, List<CoreLabel> labeledWordInfos) - Creates a new CRFDatum from the preprocessed allData format, given the document number, position number, and a List of Object labels. |
static CRFClassifier | getClassifier(File file) - Loads a CRF classifier from a filepath, and returns it. |
static CRFClassifier | getClassifier(InputStream in) - Loads a CRF classifier from an InputStream, and returns it. |
static CRFClassifier | getClassifier(String loadPath) |
static CRFClassifier | getClassifierNoExceptions(String loadPath) |
List<CRFCliqueTree> | getCliqueTrees(String filename) - Want to make arbitrary probability queries? Then this is the method for you. |
static CRFClassifier | getDefaultClassifier() - Used to get the default supplied classifier inside the jar file. |
static CRFClassifier | getJarClassifier(String resourceName, Properties props) - Used to load a classifier stored as a resource inside a jar file. |
protected Minimizer | getMinimizer() |
protected Minimizer | getMinimizer(int featurePruneIteration) |
SequenceModel | getSequenceModel(List<? extends CoreLabel> doc) |
void | loadClassifier(ObjectInputStream ois, Properties props) - Loads a classifier from the specified InputStream. |
void | loadDefaultClassifier() - This is used to load the default supplied classifier stored within the jar file. |
protected static List | loadProcessedData(String filename) |
void | loadTextClassifier(String text, Properties props) |
static void | main(String[] args) - The main method. |
CRFDatum<Serializable,CRFLabel> | makeDatum(List<? extends CoreLabel> info, int loc, FeatureFactory featureFactory) - Makes a CRFDatum by producing features and a label from input data at a specific position, using the provided factory. |
void | printFirstOrderProbs(String filename) - Takes the file, reads it in, and prints out the likelihood of each possible label at each point. |
void | printFirstOrderProbsDocument(List<CoreLabel> document) - Takes a List of CoreLabels and prints the likelihood of each possible label at each point. |
void | printFirstOrderProbsDocuments(ObjectBank<List<CoreLabel>> documents) - Takes a List of documents and prints the likelihood of each possible label at each point. |
void | printLabelInformation(String testFile) |
void | printLabelValue(List<CoreLabel> document) |
void | printProbsDocument(List<CoreLabel> document) - Takes a List of CoreLabels and prints the likelihood of each possible label at each point. |
protected static void | saveProcessedData(List datums, String filename) |
void | serializeClassifier(String serializePath) - Serialize a sequence classifier to a file on the given path. |
void | serializeTextClassifier(String serializePath) - Serialize the model to a human readable format. |
void | train(ObjectBank<List<CoreLabel>> docs) - Train a classifier from documents. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final String DEFAULT_CLASSIFIER
Constructor Detail |
---|
protected CRFClassifier()
public CRFClassifier(Properties props)
Method Detail |
---|
public void dropFeaturesBelowThreshold(double threshold)
public Pair<int[][][],int[]> documentToDataAndLabels(List<? extends CoreLabel> document)
Parameters:
document - Training documents
public void printLabelInformation(String testFile) throws Exception
Throws:
Exception
public void printLabelValue(List<CoreLabel> document)
public Pair<int[][][][],int[][]> documentsToDataAndLabels(ObjectBank<List<CoreLabel>> documents)
Parameters:
documents -
protected static Index<CRFLabel> allLabels(int window, Index classIndex)
public CRFDatum<Serializable,CRFLabel> makeDatum(List<? extends CoreLabel> info, int loc, FeatureFactory featureFactory)
Parameters:
info - The input data
loc - The position to build a datum at
featureFactory - The FeatureFactory to use to extract features
public List<CoreLabel> classify(List<CoreLabel> document)
Description copied from class AbstractSequenceClassifier: Classify a List of CoreLabels.
classify in class AbstractSequenceClassifier
Parameters:
document - A List of CoreLabels.
Returns:
The same List, but with the elements annotated with their answers (with setAnswer()).
public SequenceModel getSequenceModel(List<? extends CoreLabel> doc)
getSequenceModel in class AbstractSequenceClassifier
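public List<CoreLabel> classifyMaxEnt(List<CoreLabel> document)
Do standard sequence inference, using either Viterbi or Beam inference depending on the value of flags.inferenceType.
Parameters:
document - Document to classify. Classification happens in place. This document is modified.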
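For readers unfamiliar with Viterbi inference, the sketch below decodes the highest-scoring label sequence for a toy first-order chain. It is a generic textbook version under stated assumptions, not the inference code this class uses: scores are additive (log-space), the model has only per-position and label-pair scores, and all names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration: Viterbi decoding on a toy first-order chain.
class ViterbiSketch {

    /**
     * @param node node[t][y] is the score of label y at position t
     * @param edge edge[a][b] is the score of label a followed by label b
     * @return the label sequence maximizing the sum of node and edge scores
     */
    static int[] decode(double[][] node, double[][] edge) {
        int n = node.length; // assumes at least one position
        int k = edge.length;
        double[][] best = new double[n][k]; // best[t][y]: best score of any prefix ending in y
        int[][] back = new int[n][k];       // back[t][y]: previous label on that best prefix
        System.arraycopy(node[0], 0, best[0], 0, k);
        for (int t = 1; t < n; t++) {
            for (int y = 0; y < k; y++) {
                int argMax = 0;
                double max = Double.NEGATIVE_INFINITY;
                for (int prev = 0; prev < k; prev++) {
                    double score = best[t - 1][prev] + edge[prev][y];
                    if (score > max) {
                        max = score;
                        argMax = prev;
                    }
                }
                best[t][y] = max + node[t][y];
                back[t][y] = argMax;
            }
        }
        // Pick the best final label, then follow the back-pointers to the start.
        int last = 0;
        for (int y = 1; y < k; y++) {
            if (best[n - 1][y] > best[n - 1][last]) {
                last = y;
            }
        }
        int[] path = new int[n];
        path[n - 1] = last;
        for (int t = n - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }

    public static void main(String[] args) {
        double[][] node = {{3, 0}, {0, 1}, {0, 1}};
        double[][] stickyEdge = {{0, -5}, {-5, 0}}; // switching labels is penalized
        System.out.println(Arrays.toString(decode(node, stickyEdge)));
    }
}
```

Beam inference, the other option named above, would keep only a limited number of candidate prefixes per position instead of all of them.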
public List<CoreLabel> classifyGibbs(List<CoreLabel> document) throws ClassNotFoundException, SecurityException, NoSuchMethodException, IllegalArgumentException, InstantiationException, IllegalAccessException, InvocationTargetException
Throws:
ClassNotFoundException
SecurityException
NoSuchMethodException
IllegalArgumentException
InstantiationException
IllegalAccessException
InvocationTargetException
public void printProbsDocument(List<CoreLabel> document)
Takes a List of CoreLabels and prints the likelihood of each possible label at each point.
printProbsDocument in class AbstractSequenceClassifier
Parameters:
document - A List of CoreLabels.
public void printFirstOrderProbs(String filename)
Takes the file, reads it in, and prints out the likelihood of each possible label at each point. See getCliqueTrees() for more.
Parameters:
filename - The path to the specified file
public void printFirstOrderProbsDocuments(ObjectBank<List<CoreLabel>> documents)
Takes a List of documents and prints the likelihood of each possible label at each point.
Parameters:
documents - A List of List of CoreLabels.
public List<CRFCliqueTree> getCliqueTrees(String filename)
public void printFirstOrderProbsDocument(List<CoreLabel> document)
Takes a List of CoreLabels and prints the likelihood of each possible label at each point.
Parameters:
document - A List of CoreLabels.
public void train(ObjectBank<List<CoreLabel>> docs)
train in class AbstractSequenceClassifier
Parameters:
docs - An ObjectBank representation of documents.
protected Minimizer getMinimizer()
protected Minimizer getMinimizer(int featurePruneIteration)
protected List<CRFDatum> extractDatumSequence(int[][][] allData, int beginPosition, int endPosition, List<CoreLabel> labeledWordInfos)
Parameters:
allData -
beginPosition -
endPosition -
labeledWordInfos -
protected void addProcessedData(List<List<CRFDatum>> processedData, int[][][][] data, int[][] labels, int offset)
Parameters:
processedData - a List of Lists of CRFDatums
data -
labels -
offset -
protected static void saveProcessedData(List datums, String filename)
protected static List loadProcessedData(String filename)
public void loadTextClassifier(String text, Properties props) throws ClassCastException, IOException, ClassNotFoundException, InstantiationException, IllegalAccessException
Throws:
ClassCastException
IOException
ClassNotFoundException
InstantiationException
IllegalAccessException
public void serializeTextClassifier(String serializePath)
Parameters:
serializePath - File to write text format of classifier to.
public void serializeClassifier(String serializePath)
serializeClassifier in class AbstractSequenceClassifier
Parameters:
serializePath - The path/filename to write the classifier to.
public void loadClassifier(ObjectInputStream ois, Properties props) throws ClassCastException, IOException, ClassNotFoundException
Loads a classifier from the specified InputStream.
Note: This method does not close the ObjectInputStream. (Earlier versions of the code did close it, so beware.)
loadClassifier in class AbstractSequenceClassifier
Parameters:
ois - The InputStream to load the serialized classifier from
props - This Properties object will be used to update the SeqClassifierFlags which are read from the serialized classifier
Throws:
ClassCastException - If there are problems interpreting the serialized data
IOException - If there are problems accessing the input stream
ClassNotFoundException - If there are problems interpreting the serialized data
public void loadDefaultClassifier()
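Because loadClassifier(ObjectInputStream, Properties) is documented as leaving the stream open, the caller is responsible for closing it. The sketch below shows that ownership pattern with try-with-resources, using a stand-in reader instead of the real method so that it runs with the JDK alone; all class and method names here are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical illustration: the caller, not the loader, closes the stream.
class CallerClosesStreamSketch {

    /** Stand-in for a loader that reads from the stream but deliberately leaves it open. */
    static Object readWithoutClosing(ObjectInputStream ois) throws IOException, ClassNotFoundException {
        return ois.readObject();
    }

    /** Opens the stream, delegates the read, and closes the stream when the block exits. */
    static Object load(byte[] serialized) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(serialized))) {
            return readWithoutClosing(ois);
        }
    }

    /** Serializes a value to bytes so the example has something to read back. */
    static byte[] serialize(Serializable value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
            oos.writeObject(value);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException {
        Object restored = load(serialize("stand-in for a serialized classifier"));
        System.out.println(restored);
    }
}
```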
public static CRFClassifier getDefaultClassifier()
public static CRFClassifier getJarClassifier(String resourceName, Properties props)
Parameters:
resourceName - Name of classifier resource inside the jar file.
public static CRFClassifier getClassifier(File file) throws IOException, ClassCastException, ClassNotFoundException
Parameters:
file - File to load classifier from
Throws:
IOException - If there are problems accessing the input stream
ClassCastException - If there are problems interpreting the serialized data
ClassNotFoundException - If there are problems interpreting the serialized data
public static CRFClassifier getClassifier(InputStream in) throws IOException, ClassCastException, ClassNotFoundException
Parameters:
in - InputStream to load classifier from
Throws:
IOException - If there are problems accessing the input stream
ClassCastException - If there are problems interpreting the serialized data
ClassNotFoundException - If there are problems interpreting the serialized data
public static CRFClassifier getClassifierNoExceptions(String loadPath)
public static CRFClassifier getClassifier(String loadPath) throws IOException, ClassCastException, ClassNotFoundException
Throws:
IOException
ClassCastException
ClassNotFoundException
public static void main(String[] args) throws Exception
Throws:
Exception