java.lang.Object
  edu.stanford.nlp.ie.AbstractSequenceClassifier<IN>
    edu.stanford.nlp.ie.ner.CMMClassifier<IN>
public class CMMClassifier<IN extends CoreLabel>
Does Sequence Classification using a Conditional Markov Model.
It could be used for other purposes, but the provided features
are aimed at doing Named Entity Recognition.
The code has functionality for different document encodings, but when
using the standard ColumnDocumentReader, input files are expected to
be one word per line, with the columns indicating things like the word,
POS, chunk, and class.
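As a rough illustration of the one-word-per-line layout described above, the following sketch parses a few lines into per-token column maps. The column order (word, POS, chunk, class), the whitespace separator, the handling of blank lines and the class name ColumnFormatSketch are all assumptions made for illustration; the sketch does not use or reproduce ColumnDocumentReader.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ColumnFormatSketch {
    // Assumed column order; the mapping used by a real reader may differ.
    static final String[] COLUMNS = {"word", "pos", "chunk", "class"};

    /** Parses one-token-per-line text into a list of column-name to value maps. */
    public static List<Map<String, String>> parse(String text) {
        List<Map<String, String>> tokens = new ArrayList<>();
        for (String line : text.split("\n")) {
            line = line.trim();
            if (line.isEmpty()) {
                continue; // blank lines carry no token in this sketch
            }
            String[] fields = line.split("\\s+");
            Map<String, String> token = new LinkedHashMap<>();
            // Pair each field with its assumed column name, ignoring extras.
            for (int i = 0; i < COLUMNS.length && i < fields.length; i++) {
                token.put(COLUMNS[i], fields[i]);
            }
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sample = "U.N. NNP I-NP I-ORG\nofficial NN I-NP O\nEkeus NNP I-NP I-PER\n";
        for (Map<String, String> token : parse(sample)) {
            System.out.println(token);
        }
    }
}
```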
For running a trained model with a provided serialized classifier:
java -server -mx1000m edu.stanford.nlp.ie.ner.CMMClassifier -loadClassifier
conll.ner.gz -textFile samplesentences.txt
When specifying all parameters in a properties file (train, test, or runtime):
java -mx1000m edu.stanford.nlp.ie.ner.CMMClassifier -prop propFile
To train and test a model from the command line:
java -mx1000m edu.stanford.nlp.ie.ner.CMMClassifier
-trainFile trainFile -testFile testFile -goodCoNLL > output
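For the properties-file route, one plausible way to produce such a file is with java.util.Properties. The sketch below assumes that property keys mirror the command-line flag names shown above without the leading dash (trainFile, testFile); that mapping and the class name PropFileSketch are assumptions, not something this page states.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

public class PropFileSketch {
    /** Builds settings equivalent to "-trainFile trainFile -testFile testFile". */
    public static Properties buildProps() {
        Properties props = new Properties();
        // Assumed keys: the flag names from the command line, minus the dash.
        props.setProperty("trainFile", "trainFile");
        props.setProperty("testFile", "testFile");
        return props;
    }

    /** Writes the settings in standard .properties syntax. */
    public static void write(Properties props, Path path) throws IOException {
        try (Writer out = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
            props.store(out, "CMMClassifier settings (illustrative)");
        }
    }

    /** Reads the settings back, as a consumer of the file might. */
    public static Properties read(Path path) throws IOException {
        Properties props = new Properties();
        try (Reader in = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            props.load(in);
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        Path path = Paths.get("propFile");
        write(buildProps(), path);
        System.out.println(read(path));
    }
}
```

A file written this way could then be passed with -prop propFile, as in the command above.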
Features are defined by a FeatureFactory; the FeatureFactory which is
used by default is NERFeatureFactory, and you should look there for
feature templates.
Features are specified either by a Properties file (which is the
recommended method) or on the command line. The features are read into
a SeqClassifierFlags object, which the user need not know much about,
unless one wishes to add new features.
CMMClassifier may also be used programmatically. When creating a new
instance, you must specify the properties (the public constructor takes
a java.util.Properties object). The other way to get a CMMClassifier is
to deserialize one via getClassifier(String), which returns a
deserialized classifier. You may then tag sentences using either the
assorted test or testSentence methods.
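To make the term "Conditional Markov Model" from the class description more concrete, here is a toy sketch of the general idea: each label is chosen from a locally normalized distribution conditioned on features of the current word and on the previously assigned label. The feature names, the hand-set weights, the two-label tag set and the greedy decoding are all assumptions chosen for brevity; nothing here is taken from CMMClassifier's implementation, which may use different features and a different inference procedure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CmmSketch {
    static final String[] LABELS = {"O", "PER"};

    // Hypothetical hand-set weights keyed by "feature|label"; a real model learns these.
    static final Map<String, Double> WEIGHTS = new HashMap<>();
    static {
        WEIGHTS.put("capitalized|PER", 2.0);
        WEIGHTS.put("lowercase|O", 2.0);
        WEIGHTS.put("prev=PER|PER", 1.0);
        WEIGHTS.put("prev=O|O", 0.5);
    }

    /** Features for position i: a word-shape feature plus the previous label. */
    public static List<String> features(List<String> words, int i, String prevLabel) {
        List<String> feats = new ArrayList<>();
        String word = words.get(i);
        feats.add(Character.isUpperCase(word.charAt(0)) ? "capitalized" : "lowercase");
        feats.add("prev=" + prevLabel);
        return feats;
    }

    /** Locally normalized P(label | features): a softmax over summed weights. */
    public static double[] distribution(List<String> feats) {
        double[] probs = new double[LABELS.length];
        double total = 0.0;
        for (int k = 0; k < LABELS.length; k++) {
            double score = 0.0;
            for (String f : feats) {
                score += WEIGHTS.getOrDefault(f + "|" + LABELS[k], 0.0);
            }
            probs[k] = Math.exp(score);
            total += probs[k];
        }
        for (int k = 0; k < LABELS.length; k++) {
            probs[k] /= total;
        }
        return probs;
    }

    /** Greedy left-to-right decoding: each decision conditions on the previous one. */
    public static List<String> classify(List<String> words) {
        List<String> labels = new ArrayList<>();
        String prev = "O"; // assumed start-of-sequence label
        for (int i = 0; i < words.size(); i++) {
            double[] probs = distribution(features(words, i, prev));
            int best = 0;
            for (int k = 1; k < probs.length; k++) {
                if (probs[k] > probs[best]) {
                    best = k;
                }
            }
            prev = LABELS[best];
            labels.add(prev);
        }
        return labels;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("yesterday", "John", "Smith", "arrived");
        System.out.println(classify(words)); // prints [O, PER, PER, O]
    }
}
```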
Field Summary | |
---|---|
static java.lang.String | DEFAULT_CLASSIFIER - Default place to look in Jar file for classifier. |
Fields inherited from class edu.stanford.nlp.ie.AbstractSequenceClassifier |
---|
classIndex, featureFactory, flags, knownLCWords, pad, windowSize |
Constructor Summary | |
---|---|
protected | CMMClassifier() |
public | CMMClassifier(java.util.Properties props) |
Method Summary | |
---|---|
void | adapt(ObjectBank<java.util.List<IN>> featureLabels, Dataset<java.lang.String,java.lang.String> trainDataset) |
void | adapt(java.lang.String filename, Dataset<java.lang.String,java.lang.String> trainDataset, DocumentReaderAndWriter<IN> readerWriter) |
java.util.List<IN> | classify(java.util.List<IN> document) - Classify a List of CoreLabels. |
java.util.List<IN> | classifyWithGlobalInformation(java.util.List<IN> tokenSeq, CoreMap doc, CoreMap sent) - Classify a List of something that extends CoreMap using as additional information whatever is stored in the document and sentence. |
protected java.lang.String | classOf(java.util.List<IN> lineInfos, int pos) - Returns the most likely class for the word at the given position. |
Dataset<java.lang.String,java.lang.String> | getBiasedDataset(ObjectBank<java.util.List<IN>> data, Index<java.lang.String> featureIndex, Index<java.lang.String> classIndex) |
static CMMClassifier | getClassifier(java.io.File file) |
static CMMClassifier | getClassifier(java.io.InputStream in) |
static CMMClassifier | getClassifier(java.lang.String loadPath) |
static CMMClassifier | getClassifierNoExceptions(java.io.File file) |
static CMMClassifier | getClassifierNoExceptions(java.io.InputStream in) |
static CMMClassifier | getClassifierNoExceptions(java.lang.String loadPath) |
Dataset<java.lang.String,java.lang.String> | getDataset(java.util.Collection<java.util.List<IN>> data) - Build a Dataset from some data. |
Dataset<java.lang.String,java.lang.String> | getDataset(java.util.Collection<java.util.List<IN>> data, Index<java.lang.String> featureIndex, Index<java.lang.String> classIndex) - Build a Dataset from some data. |
Dataset<java.lang.String,java.lang.String> | getDataset(Dataset<java.lang.String,java.lang.String> oldData, Index<java.lang.String> goodFeatures) - Build a Dataset from some data. |
Dataset<java.lang.String,java.lang.String> | getDataset(ObjectBank<java.util.List<IN>> data, Dataset<java.lang.String,java.lang.String> origDataset) - Build a Dataset from some data. |
static CMMClassifier | getDefaultClassifier() - Used to obtain the default classifier which is stored inside a jar file. |
Index<java.lang.String> | getFeaturesAboveThreshhold(Dataset<java.lang.String,java.lang.String> dataset, double thresh) |
SequenceModel | getSequenceModel(java.util.List<IN> document) |
java.util.Set<java.lang.String> | getTags() - Returns the Set of entities recognized by this Classifier. |
void | loadClassifier(java.io.ObjectInputStream ois, java.util.Properties props) - Load a classifier from the given Stream. |
void | loadDefaultClassifier() - Used to load the default supplied classifier. |
double | loglikelihood(java.util.List<IN> lineInfos) - Returns the log conditional likelihood of the given dataset. |
static void | main(java.lang.String[] args) - Command-line version of the classifier. |
Datum<java.lang.String,java.lang.String> | makeDatum(java.util.List<IN> info, int loc, FeatureFactory<IN> featureFactory) - Make an individual Datum out of the data list info, focused at position loc. |
void | printProbsDocument(java.util.List<IN> document) - Takes a List of CoreLabels and prints the likelihood of each possible label at each point. |
void | retrain(ObjectBank<java.util.List<IN>> doc) |
void | retrain(ObjectBank<java.util.List<IN>> featureLabels, Index<java.lang.String> featureIndex, Index<java.lang.String> labelIndex) |
Counter<java.lang.String> | scoresOf(java.util.List<IN> lineInfos, int pos) |
void | serializeClassifier(java.lang.String serializePath) - Serialize a sequence classifier to a file on the given path. |
void | train(java.util.Collection<java.util.List<IN>> wordInfos, DocumentReaderAndWriter<IN> readerAndWriter) - Trains a classifier from a Collection of sequences. |
void | trainSemiSup() |
double | weight(java.lang.String feature, java.lang.String label) |
double[][] | weights() |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final java.lang.String DEFAULT_CLASSIFIER
Constructor Detail |
---|
protected CMMClassifier()
public CMMClassifier(java.util.Properties props)
Method Detail |
---|
public java.util.Set<java.lang.String> getTags()
public java.util.List<IN> classify(java.util.List<IN> document)
Classify a List of CoreLabels.
classify in class AbstractSequenceClassifier<IN extends CoreLabel>
document - A List of CoreLabels to be classified.
Returns the same List, but with the elements annotated with their
answers (stored under the CoreAnnotations.AnswerAnnotation key).
protected java.lang.String classOf(java.util.List<IN> lineInfos, int pos)
public double loglikelihood(java.util.List<IN> lineInfos)
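In general terms, the log conditional likelihood of a labeled sequence under a locally normalized model is the sum, over positions, of the log probability assigned to the correct label. The sketch below computes that sum from precomputed per-position distributions; the array-based representation and the names used are assumptions for illustration and do not describe how this method is implemented.

```java
public class LogLikelihoodSketch {
    /**
     * probs[i][k] is the model's probability of label k at position i;
     * gold[i] is the index of the correct label at position i.
     */
    public static double logLikelihood(double[][] probs, int[] gold) {
        if (probs.length != gold.length) {
            throw new IllegalArgumentException("probs and gold must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < gold.length; i++) {
            // Add the log probability the model gave to the correct label.
            sum += Math.log(probs[i][gold[i]]);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] probs = {{0.9, 0.1}, {0.2, 0.8}};
        int[] gold = {0, 1};
        // Equals log(0.9) + log(0.8).
        System.out.println(logLikelihood(probs, gold));
    }
}
```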
public SequenceModel getSequenceModel(java.util.List<IN> document)
getSequenceModel in class AbstractSequenceClassifier<IN extends CoreLabel>
public void adapt(java.lang.String filename, Dataset<java.lang.String,java.lang.String> trainDataset, DocumentReaderAndWriter<IN> readerWriter)
filename - adaptation file
trainDataset - original dataset (used in training)
public void adapt(ObjectBank<java.util.List<IN>> featureLabels, Dataset<java.lang.String,java.lang.String> trainDataset)
featureLabels - adaptation docs
trainDataset - original dataset (used in training)
public void retrain(ObjectBank<java.util.List<IN>> featureLabels, Index<java.lang.String> featureIndex, Index<java.lang.String> labelIndex)
featureLabels - retrain docs
featureIndex - featureIndex of original dataset (used in training)
labelIndex - labelIndex of original dataset (used in training)
public void retrain(ObjectBank<java.util.List<IN>> doc)
public void train(java.util.Collection<java.util.List<IN>> wordInfos, DocumentReaderAndWriter<IN> readerAndWriter)
Description copied from class: AbstractSequenceClassifier
Trains a classifier from a Collection of sequences.
train in class AbstractSequenceClassifier<IN extends CoreLabel>
wordInfos - An ObjectBank or a collection of sequences of IN
readerAndWriter - A DocumentReaderAndWriter to use when loading test files
public Index<java.lang.String> getFeaturesAboveThreshhold(Dataset<java.lang.String,java.lang.String> dataset, double thresh)
public Dataset<java.lang.String,java.lang.String> getDataset(java.util.Collection<java.util.List<IN>> data)
data
- This variable is a list of lists of CoreLabel. That is,
it is a collection of documents, each of which is represented
as a sequence of CoreLabel objects.
public Dataset<java.lang.String,java.lang.String> getDataset(java.util.Collection<java.util.List<IN>> data, Index<java.lang.String> featureIndex, Index<java.lang.String> classIndex)
data - This variable is a list of lists of CoreLabel. That is,
it is a collection of documents, each of which is represented
as a sequence of CoreLabel objects.
classIndex - if you want to get a Dataset based on featureIndex and
classIndex in an existing origDataset
public Dataset<java.lang.String,java.lang.String> getBiasedDataset(ObjectBank<java.util.List<IN>> data, Index<java.lang.String> featureIndex, Index<java.lang.String> classIndex)
public Dataset<java.lang.String,java.lang.String> getDataset(ObjectBank<java.util.List<IN>> data, Dataset<java.lang.String,java.lang.String> origDataset)
data - This variable is a list of lists of CoreLabel. That is,
it is a collection of documents, each of which is represented
as a sequence of CoreLabel objects.
origDataset - if you want to get a Dataset based on featureIndex and
classIndex in an existing origDataset
public Dataset<java.lang.String,java.lang.String> getDataset(Dataset<java.lang.String,java.lang.String> oldData, Index<java.lang.String> goodFeatures)
oldData - This Dataset represents data from which we wish to remove
some features, specifically those features not in the Index goodFeatures.
goodFeatures - An Index of features we wish to retain.
Returns a Dataset where each datapoint contains only features
which were in goodFeatures.
public void serializeClassifier(java.lang.String serializePath)
Description copied from class: AbstractSequenceClassifier
Serialize a sequence classifier to a file on the given path.
serializeClassifier in class AbstractSequenceClassifier<IN extends CoreLabel>
serializePath - The path/filename to write the classifier to.
public void loadDefaultClassifier()
public static CMMClassifier getDefaultClassifier()
public void loadClassifier(java.io.ObjectInputStream ois, java.util.Properties props) throws java.lang.ClassCastException, java.io.IOException, java.lang.ClassNotFoundException
loadClassifier in class AbstractSequenceClassifier<IN extends CoreLabel>
ois - The ObjectInputStream to load the serialized classifier from
props - This Properties object will be used to update the
SeqClassifierFlags which are read from the serialized classifier
java.io.IOException - If there are problems accessing the input stream
java.lang.ClassCastException - If there are problems interpreting the serialized data
java.lang.ClassNotFoundException - If there are problems interpreting the serialized data
public static CMMClassifier getClassifierNoExceptions(java.io.File file)
public static CMMClassifier getClassifier(java.io.File file) throws java.io.IOException, java.lang.ClassCastException, java.lang.ClassNotFoundException
java.io.IOException
java.lang.ClassCastException
java.lang.ClassNotFoundException
public static CMMClassifier getClassifierNoExceptions(java.lang.String loadPath)
public static CMMClassifier getClassifier(java.lang.String loadPath) throws java.io.IOException, java.lang.ClassCastException, java.lang.ClassNotFoundException
java.io.IOException
java.lang.ClassCastException
java.lang.ClassNotFoundException
public static CMMClassifier getClassifierNoExceptions(java.io.InputStream in)
public static CMMClassifier getClassifier(java.io.InputStream in) throws java.io.IOException, java.lang.ClassCastException, java.lang.ClassNotFoundException
java.io.IOException
java.lang.ClassCastException
java.lang.ClassNotFoundException
public Datum<java.lang.String,java.lang.String> makeDatum(java.util.List<IN> info, int loc, FeatureFactory<IN> featureFactory)
info - A List of WordInfo objects
loc - The position in the info list to focus feature creation on
featureFactory - The factory that constructs features out of the item
public void trainSemiSup()
public Counter<java.lang.String> scoresOf(java.util.List<IN> lineInfos, int pos)
public void printProbsDocument(java.util.List<IN> document)
Takes a List of CoreLabels and prints the likelihood
of each possible label at each point.
TODO: Finish or delete this method!
printProbsDocument in class AbstractSequenceClassifier<IN extends CoreLabel>
document - A List of CoreLabels.
public static void main(java.lang.String[] args) throws java.lang.Exception
java.lang.Exception
public double weight(java.lang.String feature, java.lang.String label)
public double[][] weights()
public java.util.List<IN> classifyWithGlobalInformation(java.util.List<IN> tokenSeq, CoreMap doc, CoreMap sent)
Description copied from class: AbstractSequenceClassifier
Classify a List of something that extends CoreMap using, as
additional information, whatever is stored in the document and sentence.
This is needed for SUTime (NumberSequenceClassifier), which requires
the document date to resolve relative dates.
classifyWithGlobalInformation in class AbstractSequenceClassifier<IN extends CoreLabel>