java.lang.Object
  edu.stanford.nlp.ie.AbstractSequenceClassifier
    edu.stanford.nlp.ie.crf.CRFClassifier
public class CRFClassifier
Does sequence classification using a Conditional Random Field model.
The code has functionality for different document formats, but when
using the standard ColumnDocumentReaderAndWriter for training or testing
models, input files are expected to be one token per line, with the
columns indicating things like the word, POS, chunk, and class. The
default for ColumnDocumentReaderAndWriter training data is 3-column input,
with the columns containing a word, its POS, and its gold class, but
this can be specified via the map property.
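As an illustration of the default three-column layout described above, here is a minimal, standard-library-only sketch that reads one token per line into (word, POS, gold class) rows. It is not the code of ColumnDocumentReaderAndWriter: the class name `ColumnFormatSketch`, the whitespace column separator, the blank-line document boundary, and the sample labels are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: parses one-token-per-line, three-column text. */
final class ColumnFormatSketch {

    /** One input row: a word, its POS tag, and its gold class. */
    static final class Token {
        final String word;
        final String pos;
        final String goldClass;

        Token(String word, String pos, String goldClass) {
            this.word = word;
            this.pos = pos;
            this.goldClass = goldClass;
        }
    }

    /**
     * Splits text into documents of tokens. Assumption: columns are separated
     * by whitespace and a blank line ends a document.
     */
    static List<List<Token>> parse(String text) {
        List<List<Token>> documents = new ArrayList<>();
        List<Token> current = new ArrayList<>();
        for (String line : text.split("\\R", -1)) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                // Blank line: close the current document, if it has tokens.
                if (!current.isEmpty()) {
                    documents.add(current);
                    current = new ArrayList<>();
                }
                continue;
            }
            String[] cols = trimmed.split("\\s+");
            if (cols.length != 3) {
                throw new IllegalArgumentException("expected 3 columns: " + line);
            }
            current.add(new Token(cols[0], cols[1], cols[2]));
        }
        if (!current.isEmpty()) {
            documents.add(current);
        }
        return documents;
    }

    public static void main(String[] args) {
        // Made-up sample data in word / POS / gold-class order.
        String sample = "Stanford NNP ORGANIZATION\nis VBZ O\n\nParis NNP LOCATION\n";
        for (List<Token> doc : parse(sample)) {
            for (Token t : doc) {
                System.out.println(t.word + "\t" + t.pos + "\t" + t.goldClass);
            }
            System.out.println();
        }
    }
}
```

A different column order or count would be handled in the real classifier through the map property; the sketch hard-codes three columns only to keep the example short.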
When -textFile is specified, the file is assumed to be plain English text
(or perhaps simple HTML/XML), and a reasonable attempt is made at English
tokenization by PlainTextDocumentReaderAndWriter.
Typical command-line usage
For running a trained model with a provided serialized classifier on a text file:
java -server -mx500m edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier conll.ner.gz -textFile samplesentences.txt
When specifying all parameters in a properties file (train, test, or runtime):
java -server -mx1g edu.stanford.nlp.ie.crf.CRFClassifier -prop propFile
To train and test a model from the command line:
java -mx1000m edu.stanford.nlp.ie.crf.CRFClassifier -trainFile trainFile -testFile testFile -macro > output
Features are defined by a FeatureFactory. NERFeatureFactory is used by
default, and you should look there for feature templates and properties or
flags that will cause certain features to be used when training an NER
classifier. There is also an edu.stanford.nlp.wordseg.SighanFeatureFactory,
and various successors such as
edu.stanford.nlp.wordseg.ChineseSegmenterFeatureFactory, which are used for
Chinese word segmentation.
Features are specified either by a Properties file (which is the
recommended method) or by flags on the command line. The flags are read
into a SeqClassifierFlags object, which the user need not be concerned
with unless wishing to add new features.
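To show how the two configuration styles relate, the sketch below loads properties-file text with java.util.Properties and renders each entry as a "-key value" pair. It assumes that each property key corresponds to a same-named command-line flag, which the usage lines above suggest (for example -trainFile and -testFile) but this page does not state outright. The class name `PropsToFlagsSketch` and the file names train.tsv and test.tsv are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.TreeSet;

/** Illustrative only: shows a properties file and flags as two views of one map. */
final class PropsToFlagsSketch {

    /** Parses properties-file text using the standard Properties format. */
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            // A StringReader should not fail, but load() declares IOException.
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Renders every entry as "-key" followed by its value, in sorted key order. */
    static List<String> toFlags(Properties props) {
        List<String> flags = new ArrayList<>();
        for (String key : new TreeSet<>(props.stringPropertyNames())) {
            flags.add("-" + key);
            flags.add(props.getProperty(key));
        }
        return flags;
    }

    public static void main(String[] args) {
        // Hypothetical file names; the keys are the flag names used on this page.
        Properties props = load("trainFile = train.tsv\ntestFile = test.tsv\n");
        System.out.println(String.join(" ", toFlags(props)));
    }
}
```

Keeping the settings in a properties file, as recommended above, makes a training run easier to reproduce than a long flag list typed by hand.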
CRFClassifier may also be used programmatically. When creating a new
instance, you must specify a Properties object. The other way to get a
CRFClassifier is to deserialize one via getClassifier(String), which
returns a deserialized classifier. You may then tag (classify the items
of) documents using the assorted test or testSentence methods.
Probabilities assigned by the CRF can be interrogated using either the
printProbsDocument() or getCliqueTrees() methods.
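To make concrete what "the likelihood of each possible label at each point" means, here is a minimal sketch of the forward-backward computation of per-position label marginals on a toy first-order linear chain. It is not CRFClassifier's implementation, which works with CRFCliqueTree objects. The class name `ChainMarginalsSketch`, the score arrays, and the restriction to a first-order chain are assumptions made for the example.

```java
/** Illustrative only: label marginals on a toy first-order linear chain. */
final class ChainMarginalsSketch {

    /**
     * @param node  node[t][y] is the log-score of label y at position t
     * @param trans trans[p][y] is the log-score of moving from label p to label y
     * @return m[t][y], the probability of label y at position t
     */
    static double[][] marginals(double[][] node, double[][] trans) {
        int n = node.length;
        if (n == 0) {
            return new double[0][];
        }
        int k = trans.length;
        double[][] alpha = new double[n][k];
        double[][] beta = new double[n][k];

        // Forward pass: alpha[t][y] sums the weight of all prefixes ending in y.
        for (int y = 0; y < k; y++) {
            alpha[0][y] = Math.exp(node[0][y]);
        }
        for (int t = 1; t < n; t++) {
            for (int y = 0; y < k; y++) {
                double sum = 0.0;
                for (int p = 0; p < k; p++) {
                    sum += alpha[t - 1][p] * Math.exp(trans[p][y]);
                }
                alpha[t][y] = sum * Math.exp(node[t][y]);
            }
        }

        // Backward pass: beta[t][y] sums the weight of all suffixes after y.
        for (int y = 0; y < k; y++) {
            beta[n - 1][y] = 1.0;
        }
        for (int t = n - 2; t >= 0; t--) {
            for (int y = 0; y < k; y++) {
                double sum = 0.0;
                for (int q = 0; q < k; q++) {
                    sum += Math.exp(trans[y][q] + node[t + 1][q]) * beta[t + 1][q];
                }
                beta[t][y] = sum;
            }
        }

        // Partition function: total weight of all label sequences.
        double z = 0.0;
        for (int y = 0; y < k; y++) {
            z += alpha[n - 1][y];
        }

        double[][] m = new double[n][k];
        for (int t = 0; t < n; t++) {
            for (int y = 0; y < k; y++) {
                m[t][y] = alpha[t][y] * beta[t][y] / z;
            }
        }
        return m;
    }

    public static void main(String[] args) {
        double[][] node = {{0.0, Math.log(3.0)}, {0.0, 0.0}};
        double[][] trans = {{0.0, 0.0}, {0.0, 0.0}};
        for (double[] row : marginals(node, trans)) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

The sketch exponentiates scores directly, which is fine for a toy input; a version meant for long sequences would work in log space to avoid overflow and underflow.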
Nested Class Summary | |
---|---|
static class | CRFClassifier.TestSequenceModel |
Field Summary | |
---|---|
static String | DEFAULT_CLASSIFIER |
Fields inherited from class edu.stanford.nlp.ie.AbstractSequenceClassifier |
---|
classIndex, featureFactory, flags, JAR_CLASSIFIER_PATH, knownLCWords, pad, readerAndWriter, windowSize |
Constructor Summary | |
---|---|
protected | CRFClassifier() |
 | CRFClassifier(Properties props) |
Method Summary | |
---|---|
protected void | addProcessedData(List processedData, int[][][][] data, int[][] labels, int offset) Adds the List of Lists of CRFDatums to the data and labels arrays, treating each datum as if it were its own document. |
protected Index<CRFLabel> | allLabels(int window, Index classIndex) |
Pair<int[][][][],int[][]> | documentsToDataAndLabels(ObjectBank<List<CoreLabel>> documents) Convert an ObjectBank to arrays of data features and labels. |
Pair<int[][][],int[]> | documentToDataAndLabels(List<? extends CoreLabel> document) Convert a document List into arrays storing the data features and labels. |
void | dropFeaturesBelowThreshold(double threshold) |
protected List<CRFDatum> | extractDatumSequence(int[][][] allData, int beginPosition, int endPosition, List labeledWordInfos) Creates a new CRFDatum from the preprocessed allData format, given the document number, position number, and a List of Object labels. |
static CRFClassifier | getClassifier(File file) |
static CRFClassifier | getClassifier(InputStream in) |
static CRFClassifier | getClassifier(String loadPath) |
static CRFClassifier | getClassifierNoExceptions(File file) Loads a CRF classifier from a file, and returns it. |
static CRFClassifier | getClassifierNoExceptions(InputStream in) |
static CRFClassifier | getClassifierNoExceptions(String loadPath) |
List<CRFCliqueTree> | getCliqueTrees(String filename) Want to make arbitrary probability queries? Then this is the method for you. |
static CRFClassifier | getDefaultClassifier() Used to get the default supplied classifier. |
static CRFClassifier | getJarClassifier(String resourceName, Properties props) Used to load a classifier stored as a resource inside a jar file. |
protected Minimizer | getMinimizer() |
protected Minimizer | getMinimizer(int featurePruneIteration) |
SequenceModel | getSequenceModel(List<? extends CoreLabel> doc) |
void | loadClassifier(InputStream in, Properties props) Loads a classifier from the specified InputStream. |
void | loadDefaultClassifier() This is used to load the default supplied classifier stored within the jar file. |
protected static List | loadProcessedData(String filename) |
void | loadTextClassifier(String text, Properties props) |
static void | main(String[] args) The main method. |
CRFDatum | makeDatum(List<? extends CoreLabel> info, int loc, FeatureFactory featureFactory) |
CRFDatum | makeDatum(List<CoreLabel> info, int loc) |
void | printErrorStuff(List<CoreLabel> document) |
void | printErrorStuff(String testFile) |
void | printFirstOrderProbs(String filename) Takes the file, reads it in, and prints out the likelihood of each possible label at each point. |
void | printFirstOrderProbsDocument(List<CoreLabel> document) Takes a List of CoreLabels and prints the likelihood of each possible label at each point. |
void | printFirstOrderProbsDocuments(ObjectBank<List<CoreLabel>> documents) Takes a List of documents and prints the likelihood of each possible label at each point. |
void | printProbsDocument(List<CoreLabel> document) Takes a List of CoreLabels and prints the likelihood of each possible label at each point. |
protected static void | saveProcessedData(List datums, String filename) |
void | serializeClassifier(String serializePath) |
void | serializeTextClassifier(String serializePath) Serialize the model to a human readable format. |
List<CoreLabel> | test(List<CoreLabel> document) Classify a List of CoreLabels. |
List<CoreLabel> | testGibbs(List<CoreLabel> document) |
List<CoreLabel> | testMaxEnt(List<CoreLabel> document) Do standard sequence inference, using either Viterbi or Beam inference depending on the value of flags.inferenceType. |
void | train(ObjectBank<List<CoreLabel>> docs) Train a classifier from documents. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final String DEFAULT_CLASSIFIER
Constructor Detail |
---|
protected CRFClassifier()
public CRFClassifier(Properties props)
Method Detail |
---|
public void dropFeaturesBelowThreshold(double threshold)
public Pair<int[][][],int[]> documentToDataAndLabels(List<? extends CoreLabel> document)
Convert a document List into arrays storing the data features and labels.
Parameters:
document - Training documents
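The general idea of turning a document into integer arrays can be sketched with a string-to-integer index: each distinct feature string gets a stable id, and each token becomes the array of its feature ids. This is a simplified illustration, not this method's code. The real method returns int[][][] data whose exact layout this page does not document, whereas the sketch uses a flat int[][]; the class name `FeatureIndexSketch` and the feature strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: maps feature strings to dense integer ids. */
final class FeatureIndexSketch {
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> names = new ArrayList<>();

    /** Returns the id of the string, assigning the next free id if it is new. */
    int indexOf(String name) {
        Integer id = ids.get(name);
        if (id == null) {
            id = names.size();
            ids.put(name, id);
            names.add(name);
        }
        return id;
    }

    /** Returns the string that was assigned the given id. */
    String get(int id) {
        return names.get(id);
    }

    /** Number of distinct strings seen so far. */
    int size() {
        return names.size();
    }

    /** Encodes each token's feature strings as an array of feature ids. */
    static int[][] encode(List<List<String>> featuresPerToken, FeatureIndexSketch index) {
        int[][] data = new int[featuresPerToken.size()][];
        for (int t = 0; t < data.length; t++) {
            List<String> features = featuresPerToken.get(t);
            data[t] = new int[features.size()];
            for (int f = 0; f < features.size(); f++) {
                data[t][f] = index.indexOf(features.get(f));
            }
        }
        return data;
    }

    public static void main(String[] args) {
        FeatureIndexSketch index = new FeatureIndexSketch();
        List<List<String>> doc = new ArrayList<>();
        doc.add(List.of("word=Paris", "pos=NNP"));
        doc.add(List.of("word=is", "pos=VBZ"));
        doc.add(List.of("word=Paris"));
        System.out.println(java.util.Arrays.deepToString(encode(doc, index)));
    }
}
```

Labels can be encoded the same way with a second index, giving one integer per token alongside the feature arrays.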
public void printErrorStuff(String testFile) throws Exception
Throws:
Exception
public void printErrorStuff(List<CoreLabel> document)
public Pair<int[][][][],int[][]> documentsToDataAndLabels(ObjectBank<List<CoreLabel>> documents)
Convert an ObjectBank to arrays of data features and labels.
Parameters:
documents
protected Index<CRFLabel> allLabels(int window, Index classIndex)
public CRFDatum makeDatum(List<CoreLabel> info, int loc)
public CRFDatum makeDatum(List<? extends CoreLabel> info, int loc, FeatureFactory featureFactory)
public List<CoreLabel> test(List<CoreLabel> document)
Description copied from class: AbstractSequenceClassifier
Classify a List of CoreLabels.
test in class AbstractSequenceClassifier
Parameters:
document - A List of CoreLabels.
Returns:
The List, but with the elements annotated with their answers (with setAnswer()).
public SequenceModel getSequenceModel(List<? extends CoreLabel> doc)
getSequenceModel in class AbstractSequenceClassifier
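public List<CoreLabel> testMaxEnt(List<CoreLabel> document)
Do standard sequence inference, using either Viterbi or Beam inference depending on the value of flags.inferenceType.
Parameters:
document - Document to classify. Classification happens in place. This document is modified.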
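For readers unfamiliar with the Viterbi option mentioned above, the following is a minimal sketch of Viterbi decoding on a toy first-order chain. It is not the inference code used by testMaxEnt and does not cover beam inference. The class name `ViterbiSketch` and the score arrays are assumptions made for the example.

```java
/** Illustrative only: Viterbi decoding on a toy first-order linear chain. */
final class ViterbiSketch {

    /**
     * @param node  node[t][y] is the log-score of label y at position t
     * @param trans trans[p][y] is the log-score of moving from label p to label y
     * @return the highest-scoring label sequence, one label id per position
     */
    static int[] decode(double[][] node, double[][] trans) {
        int n = node.length;
        if (n == 0) {
            return new int[0];
        }
        int k = trans.length;
        double[][] best = new double[n][k]; // best score of any path ending in y at t
        int[][] back = new int[n][k];       // previous label on that best path

        for (int y = 0; y < k; y++) {
            best[0][y] = node[0][y];
        }
        for (int t = 1; t < n; t++) {
            for (int y = 0; y < k; y++) {
                int arg = 0;
                double max = Double.NEGATIVE_INFINITY;
                for (int p = 0; p < k; p++) {
                    double score = best[t - 1][p] + trans[p][y];
                    if (score > max) {
                        max = score;
                        arg = p;
                    }
                }
                best[t][y] = max + node[t][y];
                back[t][y] = arg;
            }
        }

        // Pick the best final label, then follow back-pointers to the start.
        int last = 0;
        for (int y = 1; y < k; y++) {
            if (best[n - 1][y] > best[n - 1][last]) {
                last = y;
            }
        }
        int[] path = new int[n];
        path[n - 1] = last;
        for (int t = n - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }

    public static void main(String[] args) {
        double[][] node = {{2.0, 0.0}, {0.0, 1.0}, {0.0, 3.0}};
        double[][] trans = {{0.0, -5.0}, {-5.0, 0.0}};
        System.out.println(java.util.Arrays.toString(decode(node, trans)));
    }
}
```

In the sample input, label 0 scores best at the first position in isolation, yet the decoded sequence avoids it because switching labels is heavily penalized; that global trade-off is what distinguishes sequence inference from labeling each token independently.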
public List<CoreLabel> testGibbs(List<CoreLabel> document) throws ClassNotFoundException, SecurityException, NoSuchMethodException, IllegalArgumentException, InstantiationException, IllegalAccessException, InvocationTargetException
Throws:
ClassNotFoundException
SecurityException
NoSuchMethodException
IllegalArgumentException
InstantiationException
IllegalAccessException
InvocationTargetException
public void printProbsDocument(List<CoreLabel> document)
Takes a List of CoreLabels and prints the likelihood of each possible label at each point.
printProbsDocument in class AbstractSequenceClassifier
Parameters:
document - A List of CoreLabels.
public void printFirstOrderProbs(String filename)
Takes the file, reads it in, and prints out the likelihood of each possible label at each point. See getCliqueTrees() for more.
Parameters:
filename - The path to the specified file
public void printFirstOrderProbsDocuments(ObjectBank<List<CoreLabel>> documents)
Takes a List of documents and prints the likelihood of each possible label at each point.
Parameters:
documents - A List of List of CoreLabels.
public List<CRFCliqueTree> getCliqueTrees(String filename)
Want to make arbitrary probability queries? Then this is the method for you.
public void printFirstOrderProbsDocument(List<CoreLabel> document)
Takes a List of CoreLabels and prints the likelihood of each possible label at each point.
Parameters:
document - A List of CoreLabels.
public void train(ObjectBank<List<CoreLabel>> docs)
Train a classifier from documents.
train in class AbstractSequenceClassifier
Parameters:
docs - An ObjectBank representation of documents.
protected Minimizer getMinimizer()
protected Minimizer getMinimizer(int featurePruneIteration)
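protected List<CRFDatum> extractDatumSequence(int[][][] allData, int beginPosition, int endPosition, List labeledWordInfos)
Creates a new CRFDatum from the preprocessed allData format, given the document number, position number, and a List of Object labels.
Parameters:
allData
beginPosition
endPosition
labeledWordInfos
protected void addProcessedData(List processedData, int[][][][] data, int[][] labels, int offset)
Adds the List of Lists of CRFDatums to the data and labels arrays, treating each datum as if it were its own document.
Parameters:
processedData - a List of Lists of CRFDatums
data
labels
offset
protected static void saveProcessedData(List datums, String filename)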
protected static List loadProcessedData(String filename)
public void loadTextClassifier(String text, Properties props) throws ClassCastException, IOException, ClassNotFoundException, InstantiationException, IllegalAccessException
Throws:
ClassCastException
IOException
ClassNotFoundException
InstantiationException
IllegalAccessException
public void serializeTextClassifier(String serializePath)
Serialize the model to a human readable format.
public void serializeClassifier(String serializePath)
serializeClassifier in class AbstractSequenceClassifier
public void loadClassifier(InputStream in, Properties props) throws ClassCastException, IOException, ClassNotFoundException
Loads a classifier from the specified InputStream.
loadClassifier in class AbstractSequenceClassifier
Parameters:
in - The InputStream to load the serialized classifier from
props - This Properties object will be used to update the SeqClassifierFlags which are read from the serialized classifier
Throws:
ClassCastException
IOException
ClassNotFoundException
public void loadDefaultClassifier()
This is used to load the default supplied classifier stored within the jar file.
public static CRFClassifier getDefaultClassifier()
Used to get the default supplied classifier.
public static CRFClassifier getJarClassifier(String resourceName, Properties props)
Used to load a classifier stored as a resource inside a jar file.
public static CRFClassifier getClassifierNoExceptions(File file)
Loads a CRF classifier from a file, and returns it.
Parameters:
file - File to load classifier from
public static CRFClassifier getClassifier(File file) throws IOException, ClassCastException, ClassNotFoundException
Throws:
IOException
ClassCastException
ClassNotFoundException
public static CRFClassifier getClassifierNoExceptions(String loadPath)
public static CRFClassifier getClassifier(String loadPath) throws IOException, ClassCastException, ClassNotFoundException
Throws:
IOException
ClassCastException
ClassNotFoundException
public static CRFClassifier getClassifierNoExceptions(InputStream in)
public static CRFClassifier getClassifier(InputStream in) throws IOException, ClassCastException, ClassNotFoundException
Throws:
IOException
ClassCastException
ClassNotFoundException
public static void main(String[] args) throws Exception
The main method.
Throws:
Exception