public class CMMClassifier<IN extends CoreLabel> extends AbstractSequenceClassifier<IN>
With ColumnDocumentReader, input files are expected to be one word per line, with the columns indicating things like the word, POS, chunk, and class.
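The column layout described above can be illustrated with a small standalone parser. This is only a sketch, not the library's reader: the `ColumnFormatSketch` and `Token` names are hypothetical, and the column order (word, POS, chunk, class), the whitespace separator, and the use of a blank line to end a sentence are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class ColumnFormatSketch {

    /** Hypothetical holder for one input row; not a library class. */
    static final class Token {
        final String word;
        final String pos;
        final String chunk;
        final String label;

        Token(String word, String pos, String chunk, String label) {
            this.word = word;
            this.pos = pos;
            this.chunk = chunk;
            this.label = label;
        }
    }

    /**
     * Parses column-formatted text into sentences.
     * ASSUMPTIONS: exactly four whitespace-separated columns in the order
     * word, POS, chunk, class; a blank line ends a sentence.
     */
    static List<List<Token>> parse(String text) {
        List<List<Token>> sentences = new ArrayList<>();
        List<Token> current = new ArrayList<>();
        for (String line : text.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                // Blank line: close the sentence collected so far, if any.
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
                continue;
            }
            String[] cols = trimmed.split("\\s+");
            if (cols.length != 4) {
                throw new IllegalArgumentException("Expected 4 columns, got: " + line);
            }
            current.add(new Token(cols[0], cols[1], cols[2], cols[3]));
        }
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        String sample = "John NNP B-NP PERSON\nsleeps VBZ B-VP O\n\nParis NNP B-NP LOCATION\n";
        for (List<Token> sentence : parse(sample)) {
            for (Token t : sentence) {
                System.out.println(t.word + "\t" + t.label);
            }
            System.out.println();
        }
    }
}
```

The actual column order and separator depend on how the reader is configured, so treat the fixed four-column layout here as illustrative only.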
Typical usage
For running a trained model with a provided serialized classifier:
java -server -mx1000m edu.stanford.nlp.ie.ner.CMMClassifier -loadClassifier conll.ner.gz -textFile samplesentences.txt
When specifying all parameters in a properties file (train, test, or runtime):
java -mx1000m edu.stanford.nlp.ie.ner.CMMClassifier -prop propFile
To train and test a model from the command line:
java -mx1000m edu.stanford.nlp.ie.ner.CMMClassifier -trainFile trainFile -testFile testFile -goodCoNLL > output
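To make the relationship between command-line flags and a properties file concrete, the sketch below converts `-key value` arguments into a `java.util.Properties` object that could be written out as a propFile. It rests on two assumptions that the text above does not confirm: that property keys mirror the flag names (`trainFile`, `testFile`, `goodCoNLL`), and that a flag given without a value is stored as `true`. The `FlagsToPropertiesSketch` class is hypothetical and is not the library's own argument parser.

```java
import java.io.IOException;
import java.util.Properties;

final class FlagsToPropertiesSketch {

    /**
     * Converts "-key value" pairs into properties.
     * ASSUMPTION: a flag with no following value is stored as "true".
     */
    static Properties toProperties(String[] args) {
        Properties props = new Properties();
        int i = 0;
        while (i < args.length) {
            String arg = args[i];
            if (!arg.startsWith("-") || arg.length() < 2) {
                throw new IllegalArgumentException("Expected a flag, got: " + arg);
            }
            String key = arg.substring(1);
            // A value is present only if the next argument is not itself a flag.
            boolean hasValue = i + 1 < args.length && !args[i + 1].startsWith("-");
            props.setProperty(key, hasValue ? args[i + 1] : "true");
            i += hasValue ? 2 : 1;
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        Properties props = toProperties(
            new String[] {"-trainFile", "trainFile", "-testFile", "testFile", "-goodCoNLL"});
        // Print in the standard .properties format.
        props.store(System.out, "sketch of a propFile");
    }
}
```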
Features are defined by a FeatureFactory; the FeatureFactory which is used by default is NERFeatureFactory, and you should look there for feature templates.
Features are specified either by a Properties file (which is the recommended method) or on the command line. The features are read into a SeqClassifierFlags object, which you need not know much about unless you wish to add new features.
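As a rough illustration of what a feature template can look like, the sketch below emits one string feature per word in a window around a position, padding past the sequence boundaries. It is not the NERFeatureFactory implementation: the `WindowFeatureSketch` class, the `W<offset>=<word>` feature naming, and the `<pad>` symbol are all hypothetical choices made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class WindowFeatureSketch {

    /** Hypothetical padding symbol used for positions outside the sequence. */
    static final String PAD = "<pad>";

    /**
     * Returns one feature per offset in [-window, +window] around position loc,
     * e.g. "W-1=the", "W0=cat", "W+1=sat".
     */
    static List<String> featuresAt(List<String> words, int loc, int window) {
        if (loc < 0 || loc >= words.size()) {
            throw new IndexOutOfBoundsException("loc=" + loc);
        }
        List<String> features = new ArrayList<>();
        for (int offset = -window; offset <= window; offset++) {
            int i = loc + offset;
            String word = (i < 0 || i >= words.size()) ? PAD : words.get(i);
            String sign = offset > 0 ? "+" : "";
            features.add("W" + sign + offset + "=" + word);
        }
        return features;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Jane", "visited", "Paris");
        System.out.println(featuresAt(words, 0, 1));
    }
}
```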
CMMClassifier may also be used programmatically. When creating a new instance, you must specify a properties file. The other way to get a CMMClassifier is to deserialize one via getClassifier(String), which returns a deserialized classifier. You may then tag sentences using either the assorted test or testSentence methods.
Modifier and Type | Field and Description |
---|---|
static java.lang.String | DEFAULT_CLASSIFIER: Default place to look in Jar file for classifier. |
Fields inherited from class AbstractSequenceClassifier: classIndex, featureFactories, flags, knownLCWords, pad, windowSize
Modifier | Constructor and Description |
---|---|
protected | CMMClassifier() |
public | CMMClassifier(java.util.Properties props) |
public | CMMClassifier(SeqClassifierFlags flags) |
Modifier and Type | Method and Description |
---|---|
void | adapt(ObjectBank<java.util.List<IN>> featureLabels, Dataset<java.lang.String,java.lang.String> trainDataset) |
void | adapt(java.lang.String filename, Dataset<java.lang.String,java.lang.String> trainDataset, DocumentReaderAndWriter<IN> readerWriter) |
java.util.List<IN> | classify(java.util.List<IN> document): Classify a List of CoreLabels. |
java.util.List<IN> | classifyWithGlobalInformation(java.util.List<IN> tokenSeq, CoreMap doc, CoreMap sent): Classify a List of something that extends CoreMap using as additional information whatever is stored in the document and sentence. |
protected java.lang.String | classOf(java.util.List<IN> lineInfos, int pos): Returns the most likely class for the word at the given position. |
Dataset<java.lang.String,java.lang.String> | getBiasedDataset(ObjectBank<java.util.List<IN>> data, Index<java.lang.String> featureIndex, Index<java.lang.String> classIndex) |
static CMMClassifier<? extends CoreLabel> | getClassifier(java.io.File file) |
static CMMClassifier<? extends CoreLabel> | getClassifier(java.io.InputStream in) |
static <INN extends CoreMap> CMMClassifier<? extends CoreLabel> | getClassifier(java.io.ObjectInputStream ois) |
static <INN extends CoreMap> CMMClassifier<? extends CoreLabel> | getClassifier(java.io.ObjectInputStream ois, java.util.Properties props) |
static CMMClassifier<? extends CoreLabel> | getClassifier(java.lang.String loadPath) |
static CMMClassifier<? extends CoreLabel> | getClassifierNoExceptions(java.io.File file) |
static CMMClassifier<? extends CoreLabel> | getClassifierNoExceptions(java.io.InputStream in) |
static CMMClassifier<CoreLabel> | getClassifierNoExceptions(java.lang.String loadPath) |
Dataset<java.lang.String,java.lang.String> | getDataset(java.util.Collection<java.util.List<IN>> data): Build a Dataset from some data. |
Dataset<java.lang.String,java.lang.String> | getDataset(java.util.Collection<java.util.List<IN>> data, Index<java.lang.String> featureIndex, Index<java.lang.String> classIndex): Build a Dataset from some data. |
Dataset<java.lang.String,java.lang.String> | getDataset(Dataset<java.lang.String,java.lang.String> oldData, Index<java.lang.String> goodFeatures): Build a Dataset from some data. |
Dataset<java.lang.String,java.lang.String> | getDataset(ObjectBank<java.util.List<IN>> data, Dataset<java.lang.String,java.lang.String> origDataset): Build a Dataset from some data. |
static CMMClassifier<? extends CoreLabel> | getDefaultClassifier(): Used to obtain the default classifier which is stored inside a jar file. |
SequenceModel | getSequenceModel(java.util.List<IN> document) |
java.util.Set<java.lang.String> | getTags(): Returns the Set of entities recognized by this Classifier. |
void | loadClassifier(java.io.ObjectInputStream ois, java.util.Properties props): Load a classifier from the given Stream. |
void | loadDefaultClassifier(): Used to load the default supplied classifier. |
double | loglikelihood(java.util.List<IN> lineInfos): Returns the log conditional likelihood of the given dataset. |
static void | main(java.lang.String[] args): Command-line version of the classifier. |
Datum<java.lang.String,java.lang.String> | makeDatum(java.util.List<IN> info, int loc, java.util.List<FeatureFactory<IN>> featureFactories): Make an individual Datum out of the data list info, focused at position loc. |
Triple<Counter<java.lang.Integer>,Counter<java.lang.Integer>,TwoDimensionalCounter<java.lang.Integer,java.lang.String>> | printProbsDocument(java.util.List<IN> document): Takes a List of CoreLabels and prints the likelihood of each possible label at each point. |
void | retrain(ObjectBank<java.util.List<IN>> doc) |
void | retrain(ObjectBank<java.util.List<IN>> featureLabels, Index<java.lang.String> featureIndex, Index<java.lang.String> labelIndex) |
Counter<java.lang.String> | scoresOf(java.util.List<IN> lineInfos, int pos) |
void | serializeClassifier(java.io.ObjectOutputStream oos): Serialize a sequence classifier to an object output stream |
void | serializeClassifier(java.lang.String serializePath): Serialize a sequence classifier to a file on the given path. |
void | train(java.util.Collection<java.util.List<IN>> wordInfos, DocumentReaderAndWriter<IN> readerAndWriter): Trains a classifier from a Collection of sequences. |
void | trainSemiSup() |
double | weight(java.lang.String feature, java.lang.String label) |
double[][] | weights() |
Methods inherited from class AbstractSequenceClassifier: apply, backgroundSymbol, classify, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswersKBest, classifyAndWriteAnswersKBest, classifyAndWriteViterbiSearchGraph, classifyFile, classifyFilesAndWriteAnswers, classifyFilesAndWriteAnswers, classifyKBest, classifyRaw, classifySentence, classifySentenceWithGlobalInformation, classifyStdin, classifyStdin, classifyToCharacterOffsets, classifyToString, classifyToString, classifyWithInlineXML, countResults, countResultsSegmenter, defaultReaderAndWriter, dumpFeatures, finalizeClassification, getKnownLCWords, getSampler, labels, loadClassifier, loadClassifier, loadClassifier, loadClassifier, loadClassifier, loadClassifier, loadClassifierNoExceptions, loadClassifierNoExceptions, loadClassifierNoExceptions, loadClassifierNoExceptions, loadClassifierNoExceptions, makeObjectBankFromFile, makeObjectBankFromFile, makeObjectBankFromFiles, makeObjectBankFromFiles, makeObjectBankFromFiles, makeObjectBankFromReader, makeObjectBankFromString, makePlainTextReaderAndWriter, makePlainTextReaderAndWriter, makeReaderAndWriter, plainTextReaderAndWriter, printFeatureLists, printFeatures, printProbs, printProbs, printProbsDocuments, printResults, reinit, segmentString, segmentString, train, train, train, train, train, train, windowSize, writeAnswers
public static final java.lang.String DEFAULT_CLASSIFIER
protected CMMClassifier()
public CMMClassifier(java.util.Properties props)
public CMMClassifier(SeqClassifierFlags flags)
public java.util.Set<java.lang.String> getTags()
public java.util.List<IN> classify(java.util.List<IN> document)
Classify a List of CoreLabels.
classify in class AbstractSequenceClassifier<IN extends CoreLabel>
Parameters:
document - A List of CoreLabels to be classified.
Returns:
The List, but with the elements annotated with their answers (stored under the CoreAnnotations.AnswerAnnotation key). The answers will be the class labels defined by the classifier. They might be things like entity labels (in BIO notation or not) or something like "1" vs. "0" on whether to begin a new token here or not (in word segmentation).
protected java.lang.String classOf(java.util.List<IN> lineInfos, int pos)
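The method summary describes classOf as returning the most likely class for the word at a given position, and scoresOf as returning a Counter of label scores. One plausible reading is that the first is an argmax over the second. The sketch below shows that idea with a plain `Map`; the `ArgmaxSketch` class is hypothetical, and the assumption that classOf is computed this way is not confirmed by this page.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ArgmaxSketch {

    /**
     * Returns the label with the highest score.
     * Ties keep the first label in the map's iteration order.
     */
    static String bestLabel(Map<String, Double> scores) {
        if (scores.isEmpty()) {
            throw new IllegalArgumentException("no labels to choose from");
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : scores.entrySet()) {
            if (best == null || entry.getValue() > bestScore) {
                best = entry.getKey();
                bestScore = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical per-label scores for a single position.
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("O", -1.2);
        scores.put("PERSON", 0.7);
        scores.put("LOCATION", -0.3);
        System.out.println(bestLabel(scores));
    }
}
```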
public double loglikelihood(java.util.List<IN> lineInfos)
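The method summary says loglikelihood returns the log conditional likelihood of the given data. As a sketch of what such a quantity is, the code below sums, over positions, the log probability of the gold label under a softmax of per-label scores. The `LogLikelihoodSketch` class, the `double[][]` score layout (position by label), and the softmax normalization are assumptions for illustration and may not match how CMMClassifier computes its value.

```java
final class LogLikelihoodSketch {

    /** Numerically stable log(sum(exp(x))) over one row of scores. */
    static double logSumExp(double[] xs) {
        double max = Double.NEGATIVE_INFINITY;
        for (double x : xs) {
            max = Math.max(max, x);
        }
        double sum = 0.0;
        for (double x : xs) {
            sum += Math.exp(x - max);
        }
        return max + Math.log(sum);
    }

    /**
     * Sums log P(gold[t] | position t), where P is a softmax over scores[t].
     * scores[t][k] is the unnormalized log score of label k at position t.
     */
    static double logLikelihood(double[][] scores, int[] gold) {
        if (scores.length != gold.length) {
            throw new IllegalArgumentException("scores and gold must have the same length");
        }
        double total = 0.0;
        for (int t = 0; t < scores.length; t++) {
            total += scores[t][gold[t]] - logSumExp(scores[t]);
        }
        return total;
    }

    public static void main(String[] args) {
        // Two positions, two labels, hypothetical scores.
        double[][] scores = {{2.0, 0.5}, {0.1, 1.3}};
        int[] gold = {0, 1};
        System.out.println(logLikelihood(scores, gold));
    }
}
```

Because each term is the log of a probability, the total is never positive; values closer to zero mean the model assigns more probability to the gold labels.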
public SequenceModel getSequenceModel(java.util.List<IN> document)
getSequenceModel in class AbstractSequenceClassifier<IN extends CoreLabel>
public void adapt(java.lang.String filename, Dataset<java.lang.String,java.lang.String> trainDataset, DocumentReaderAndWriter<IN> readerWriter)
Parameters:
filename - adaptation file
trainDataset - original dataset (used in training)
public void adapt(ObjectBank<java.util.List<IN>> featureLabels, Dataset<java.lang.String,java.lang.String> trainDataset)
Parameters:
featureLabels - adaptation docs
trainDataset - original dataset (used in training)
public void retrain(ObjectBank<java.util.List<IN>> featureLabels, Index<java.lang.String> featureIndex, Index<java.lang.String> labelIndex)
Parameters:
featureLabels - retrain docs
featureIndex - featureIndex of original dataset (used in training)
labelIndex - labelIndex of original dataset (used in training)
public void retrain(ObjectBank<java.util.List<IN>> doc)
public void train(java.util.Collection<java.util.List<IN>> wordInfos, DocumentReaderAndWriter<IN> readerAndWriter)
Description copied from class: AbstractSequenceClassifier
Trains a classifier from a Collection of sequences.
train in class AbstractSequenceClassifier<IN extends CoreLabel>
Parameters:
wordInfos - An ObjectBank or a collection of sequences of IN
readerAndWriter - A DocumentReaderAndWriter to use when loading test files
public Dataset<java.lang.String,java.lang.String> getDataset(java.util.Collection<java.util.List<IN>> data)
Parameters:
data - This variable is a list of lists of CoreLabel. That is, it is a collection of documents, each of which is represented as a sequence of CoreLabel objects.
public Dataset<java.lang.String,java.lang.String> getDataset(java.util.Collection<java.util.List<IN>> data, Index<java.lang.String> featureIndex, Index<java.lang.String> classIndex)
Parameters:
data - This variable is a list of lists of CoreLabel. That is, it is a collection of documents, each of which is represented as a sequence of CoreLabel objects.
classIndex - if you want to get a Dataset based on featureIndex and classIndex in an existing origDataset
public Dataset<java.lang.String,java.lang.String> getBiasedDataset(ObjectBank<java.util.List<IN>> data, Index<java.lang.String> featureIndex, Index<java.lang.String> classIndex)
public Dataset<java.lang.String,java.lang.String> getDataset(ObjectBank<java.util.List<IN>> data, Dataset<java.lang.String,java.lang.String> origDataset)
Parameters:
data - This variable is a list of lists of CoreLabel. That is, it is a collection of documents, each of which is represented as a sequence of CoreLabel objects.
origDataset - if you want to get a Dataset based on featureIndex and classIndex in an existing origDataset
public Dataset<java.lang.String,java.lang.String> getDataset(Dataset<java.lang.String,java.lang.String> oldData, Index<java.lang.String> goodFeatures)
public void serializeClassifier(java.lang.String serializePath)
Description copied from class: AbstractSequenceClassifier
Serialize a sequence classifier to a file on the given path.
serializeClassifier in class AbstractSequenceClassifier<IN extends CoreLabel>
Parameters:
serializePath - The path/filename to write the classifier to.
public void serializeClassifier(java.io.ObjectOutputStream oos)
Description copied from class: AbstractSequenceClassifier
Serialize a sequence classifier to an object output stream.
serializeClassifier in class AbstractSequenceClassifier<IN extends CoreLabel>
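The serialize and load methods above work through `ObjectOutputStream` and `ObjectInputStream`. The following sketch shows the general round trip with standard Java serialization, using a hypothetical `ToyModel` class in place of a trained classifier. The gzip wrapping is an assumption suggested by the `conll.ner.gz` file name in the usage examples; this is not the classifier's actual on-disk format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class SerializationSketch {

    /** Hypothetical stand-in for a trained model; not a library class. */
    static final class ToyModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final String[] labels;
        final double[][] weights;

        ToyModel(String[] labels, double[][] weights) {
            this.labels = labels;
            this.weights = weights;
        }
    }

    /** Writes the model through a gzip-compressed object stream. */
    static byte[] serialize(ToyModel model) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(new GZIPOutputStream(bytes))) {
            oos.writeObject(model);
        }
        return bytes.toByteArray();
    }

    /** Reads the model back; the cast fails if the stream holds another type. */
    static ToyModel deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois =
                 new ObjectInputStream(new GZIPInputStream(new ByteArrayInputStream(data)))) {
            return (ToyModel) ois.readObject();
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException {
        ToyModel model = new ToyModel(new String[] {"O", "PERSON"}, new double[][] {{0.5, -0.5}});
        ToyModel copy = deserialize(serialize(model));
        System.out.println(copy.labels.length + " labels restored");
    }
}
```

The three failure modes in this sketch (`IOException`, `ClassNotFoundException`, and a `ClassCastException` from the cast) are the same exception types that the getClassifier and loadClassifier signatures on this page declare.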
public void loadDefaultClassifier()
public static CMMClassifier<? extends CoreLabel> getDefaultClassifier()
public void loadClassifier(java.io.ObjectInputStream ois, java.util.Properties props) throws java.lang.ClassCastException, java.io.IOException, java.lang.ClassNotFoundException
loadClassifier in class AbstractSequenceClassifier<IN extends CoreLabel>
Parameters:
ois - The ObjectInputStream to load the serialized classifier from
props - This Properties object will be used to update the SeqClassifierFlags which are read from the serialized classifier
Throws:
java.io.IOException - If there are problems accessing the input stream
java.lang.ClassCastException - If there are problems interpreting the serialized data
java.lang.ClassNotFoundException - If there are problems interpreting the serialized data
public static CMMClassifier<? extends CoreLabel> getClassifierNoExceptions(java.io.File file)
public static CMMClassifier<? extends CoreLabel> getClassifier(java.io.File file) throws java.io.IOException, java.lang.ClassCastException, java.lang.ClassNotFoundException
Throws:
java.io.IOException
java.lang.ClassCastException
java.lang.ClassNotFoundException
public static CMMClassifier<CoreLabel> getClassifierNoExceptions(java.lang.String loadPath)
public static CMMClassifier<? extends CoreLabel> getClassifier(java.lang.String loadPath) throws java.io.IOException, java.lang.ClassCastException, java.lang.ClassNotFoundException
Throws:
java.io.IOException
java.lang.ClassCastException
java.lang.ClassNotFoundException
public static CMMClassifier<? extends CoreLabel> getClassifierNoExceptions(java.io.InputStream in)
public static <INN extends CoreMap> CMMClassifier<? extends CoreLabel> getClassifier(java.io.ObjectInputStream ois) throws java.io.IOException, java.lang.ClassCastException, java.lang.ClassNotFoundException
Throws:
java.io.IOException
java.lang.ClassCastException
java.lang.ClassNotFoundException
public static <INN extends CoreMap> CMMClassifier<? extends CoreLabel> getClassifier(java.io.ObjectInputStream ois, java.util.Properties props) throws java.io.IOException, java.lang.ClassCastException, java.lang.ClassNotFoundException
Throws:
java.io.IOException
java.lang.ClassCastException
java.lang.ClassNotFoundException
public static CMMClassifier<? extends CoreLabel> getClassifier(java.io.InputStream in) throws java.io.IOException, java.lang.ClassCastException, java.lang.ClassNotFoundException
Throws:
java.io.IOException
java.lang.ClassCastException
java.lang.ClassNotFoundException
public Datum<java.lang.String,java.lang.String> makeDatum(java.util.List<IN> info, int loc, java.util.List<FeatureFactory<IN>> featureFactories)
Parameters:
info - A List of IN objects
loc - The position in the info list to focus feature creation on
featureFactories - The factory that constructs features out of the item
public void trainSemiSup()
public double weight(java.lang.String feature, java.lang.String label)
public double[][] weights()
public java.util.List<IN> classifyWithGlobalInformation(java.util.List<IN> tokenSeq, CoreMap doc, CoreMap sent)
Description copied from class: AbstractSequenceClassifier
Classify a List of something that extends CoreMap using as additional information whatever is stored in the document and sentence. This is needed for SUTime (NumberSequenceClassifier), which requires the document date to resolve relative dates.
classifyWithGlobalInformation in class AbstractSequenceClassifier<IN extends CoreLabel>
Parameters:
tokenSeq - A List of something that extends CoreMap
public Triple<Counter<java.lang.Integer>,Counter<java.lang.Integer>,TwoDimensionalCounter<java.lang.Integer,java.lang.String>> printProbsDocument(java.util.List<IN> document)
Takes a List of CoreLabels and prints the likelihood of each possible label at each point. TODO: Write this method!
printProbsDocument in class AbstractSequenceClassifier<IN extends CoreLabel>
Parameters:
document - A List of CoreLabels.
public static void main(java.lang.String[] args) throws java.lang.Exception
Throws:
java.lang.Exception