java.lang.Object
  edu.stanford.nlp.ie.AbstractSequenceClassifier<IN>
    edu.stanford.nlp.ie.ner.CMMClassifier<IN>
public class CMMClassifier<IN extends CoreLabel>
Performs sequence classification using a Conditional Markov Model (CMM).
It could be used for other purposes, but the provided features
are aimed at Named Entity Recognition (NER).
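To make the idea of a Conditional Markov Model concrete, here is a minimal sketch of left-to-right tagging in which each label is scored from the current word and the previously chosen label. It is not the CMMClassifier implementation: the class name ToyCmm, the "W=" and "P=" feature prefixes, the hand-set weights, and the greedy decoding strategy are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy conditional Markov model with greedy decoding (illustrative only). */
final class ToyCmm {
    // Hypothetical weight table keyed by "feature|label".
    private final Map<String, Double> weights = new HashMap<>();
    private final List<String> labels;

    ToyCmm(List<String> labels) {
        this.labels = labels;
    }

    void setWeight(String feature, String label, double weight) {
        weights.put(feature + "|" + label, weight);
    }

    /** Linear score of a label given the current word and the previous label. */
    double score(String word, String prevLabel, String label) {
        return weights.getOrDefault("W=" + word + "|" + label, 0.0)
             + weights.getOrDefault("P=" + prevLabel + "|" + label, 0.0);
    }

    /** Tags each word in order, conditioning on the label chosen for the previous word. */
    List<String> tag(List<String> words) {
        List<String> tags = new ArrayList<>();
        String prev = "O"; // assumed start-of-sequence label
        for (String word : words) {
            String best = labels.get(0);
            for (String label : labels) {
                if (score(word, prev, label) > score(word, prev, best)) {
                    best = label;
                }
            }
            tags.add(best);
            prev = best;
        }
        return tags;
    }
}
```

The point of the sketch is the dependency structure: the decision at each position sees the previous decision as a feature. A real model would learn its weights from training data rather than have them set by hand.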
The code has functionality for different document encodings, but when
using the standard ColumnDocumentReader, input files are expected to
be one word per line, with the columns indicating things like the word,
POS, chunk, and class.
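As a rough illustration of the one-word-per-line layout, the following sketch splits such lines into rows. It does not use ColumnDocumentReader; the class name ColumnLineParser, the whitespace column separator, the column order in the usage note, and the use of blank lines as document boundaries are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Parses one-token-per-line column data into documents (illustrative only). */
final class ColumnLineParser {
    /** One token row: the word plus its remaining columns (for example POS, chunk, class). */
    static final class Row {
        final String word;
        final List<String> rest;

        Row(String word, List<String> rest) {
            this.word = word;
            this.rest = rest;
        }
    }

    /** Splits each line on whitespace; a blank line is assumed to end a document. */
    static List<List<Row>> parse(List<String> lines) {
        List<List<Row>> docs = new ArrayList<>();
        List<Row> current = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                if (!current.isEmpty()) {
                    docs.add(current);
                    current = new ArrayList<>();
                }
                continue;
            }
            String[] cols = trimmed.split("\\s+");
            List<String> rest = new ArrayList<>();
            for (int i = 1; i < cols.length; i++) {
                rest.add(cols[i]);
            }
            current.add(new Row(cols[0], rest));
        }
        if (!current.isEmpty()) {
            docs.add(current);
        }
        return docs;
    }
}
```

With this sketch, a line such as "John NNP B-NP PERSON" yields a row whose word is "John" and whose remaining columns are the POS, chunk, and class values, in that assumed order.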
For running a trained model with a provided serialized classifier:
java -server -mx1000m edu.stanford.nlp.ie.ner.CMMClassifier -loadClassifier
conll.ner.gz -textFile samplesentences.txt
When specifying all parameters in a properties file (train, test, or runtime):
java -mx1000m edu.stanford.nlp.ie.ner.CMMClassifier -prop propFile
To train and test a model from the command line:
java -mx1000m edu.stanford.nlp.ie.ner.CMMClassifier
-trainFile trainFile -testFile testFile -goodCoNLL > output
Features are defined by a FeatureFactory; the FeatureFactory which is used
by default is NERFeatureFactory, and you should look there for feature templates.
Features are specified either by a Properties file (which is the
recommended method) or on the command line. The features are read into
a SeqClassifierFlags object, which the user need not know much about
unless they wish to add new features.
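For the properties-file route, a file can be produced with java.util.Properties. The sketch below assumes that property keys mirror the command-line flag names without the leading dash (trainFile, testFile); the source does not spell out the key names, so treat them, the class name PropFileSketch, and the file names in the usage note as hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Renders and re-reads a small properties file (illustrative only). */
final class PropFileSketch {
    /** Returns properties-file text holding the assumed trainFile and testFile keys. */
    static String render(String trainFile, String testFile) {
        Properties props = new Properties();
        props.setProperty("trainFile", trainFile); // assumed key name
        props.setProperty("testFile", testFile);   // assumed key name
        StringWriter out = new StringWriter();
        try {
            props.store(out, "CMMClassifier settings (illustrative)");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    /** Parses properties-file text back into a Properties object. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }
}
```

Text produced by render("train.tsv", "test.tsv") could be saved to a file and passed with the -prop option shown above, provided the key names match what the classifier expects.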
CMMClassifier may also be used programmatically. When creating a new instance, you
must specify a properties file. The other way to get a CMMClassifier is to
deserialize one via getClassifier(String), which returns a
deserialized classifier. You may then tag sentences using either the assorted
test or testSentence methods.
Field Summary | | |
---|---|---|
static String | DEFAULT_CLASSIFIER | Default place to look in Jar file for classifier. |
Fields inherited from class edu.stanford.nlp.ie.AbstractSequenceClassifier |
---|
classIndex, featureFactory, flags, knownLCWords, pad, windowSize |
Constructor Summary | |
---|---|
protected | CMMClassifier() |
public | CMMClassifier(Properties props) |
Method Summary | | |
---|---|---|
void | adapt(ObjectBank<List<IN>> featureLabels, Dataset<String,String> trainDataset) | |
void | adapt(String filename, Dataset<String,String> trainDataset, DocumentReaderAndWriter<IN> readerWriter) | |
List<IN> | classify(List<IN> document) | Classify a List of CoreLabels. |
List<IN> | classifyWithGlobalInformation(List<IN> tokenSeq, CoreMap doc, CoreMap sent) | Classify a List of something that extends CoreMap using as additional information whatever is stored in the document and sentence. |
protected String | classOf(List<IN> lineInfos, int pos) | Returns the most likely class for the word at the given position. |
Dataset<String,String> | getBiasedDataset(ObjectBank<List<IN>> data, Index<String> featureIndex, Index<String> classIndex) | |
static CMMClassifier | getClassifier(File file) | |
static CMMClassifier | getClassifier(InputStream in) | |
static CMMClassifier | getClassifier(String loadPath) | |
static CMMClassifier | getClassifierNoExceptions(File file) | |
static CMMClassifier | getClassifierNoExceptions(InputStream in) | |
static CMMClassifier | getClassifierNoExceptions(String loadPath) | |
Dataset<String,String> | getDataset(Collection<List<IN>> data) | Build a Dataset from some data. |
Dataset<String,String> | getDataset(Collection<List<IN>> data, Index<String> featureIndex, Index<String> classIndex) | Build a Dataset from some data. |
Dataset<String,String> | getDataset(Dataset<String,String> oldData, Index<String> goodFeatures) | Build a Dataset from some data. |
Dataset<String,String> | getDataset(ObjectBank<List<IN>> data, Dataset<String,String> origDataset) | Build a Dataset from some data. |
static CMMClassifier | getDefaultClassifier() | Used to obtain the default classifier which is stored inside a jar file. |
Index<String> | getFeaturesAboveThreshhold(Dataset<String,String> dataset, double thresh) | |
SequenceModel | getSequenceModel(List<IN> document) | |
Set<String> | getTags() | Returns the Set of entities recognized by this Classifier. |
void | loadClassifier(ObjectInputStream ois, Properties props) | Load a classifier from the given Stream. |
void | loadDefaultClassifier() | Used to load the default supplied classifier. |
double | loglikelihood(List<IN> lineInfos) | Returns the log conditional likelihood of the given dataset. |
static void | main(String[] args) | Command-line version of the classifier. |
<T extends CoreLabel> Datum<String,String> | makeDatum(List<IN> info, int loc, FeatureFactory featureFactory) | Make an individual Datum out of the data list info, focused at position loc. |
void | printProbsDocument(List<IN> document) | Takes a List of CoreLabels and prints the likelihood of each possible label at each point. |
List<WordTag> | process(List list) | Assigns NER labels to the words in the given List. |
Document<?,?,WordTag> | processDocument(Document in) | Assigns NER labels to the words in the given Document. |
void | retrain(ObjectBank<List<IN>> doc) | |
void | retrain(ObjectBank<List<IN>> featureLabels, Index<String> featureIndex, Index<String> labelIndex) | |
Counter<String> | scoresOf(List<IN> lineInfos, int pos) | |
void | serializeClassifier(String serializePath) | Serialize a sequence classifier to a file on the given path. |
void | train(Collection<List<IN>> wordInfos, DocumentReaderAndWriter<IN> readerAndWriter) | Trains a classifier from a Collection of sequences. |
void | trainSemiSup() | |
double | weight(String feature, String label) | |
double[][] | weights() | |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final String DEFAULT_CLASSIFIER
Constructor Detail |
---|
protected CMMClassifier()
public CMMClassifier(Properties props)
Method Detail |
---|
public Set<String> getTags()
public List<IN> classify(List<IN> document)
Classify a List of CoreLabels.
classify in class AbstractSequenceClassifier<IN extends CoreLabel>
Parameters:
document - A List of CoreLabels to be classified.
Returns:
The same List, but with the elements annotated with their answers (stored under the CoreAnnotations.AnswerAnnotation key).
protected String classOf(List<IN> lineInfos, int pos)
public double loglikelihood(List<IN> lineInfos)
public SequenceModel getSequenceModel(List<IN> document)
getSequenceModel in class AbstractSequenceClassifier<IN extends CoreLabel>
public void adapt(String filename, Dataset<String,String> trainDataset, DocumentReaderAndWriter<IN> readerWriter)
Parameters:
filename - adaptation file
trainDataset - original dataset (used in training)
public void adapt(ObjectBank<List<IN>> featureLabels, Dataset<String,String> trainDataset)
Parameters:
featureLabels - adaptation docs
trainDataset - original dataset (used in training)
public void retrain(ObjectBank<List<IN>> featureLabels, Index<String> featureIndex, Index<String> labelIndex)
Parameters:
featureLabels - retrain docs
featureIndex - featureIndex of original dataset (used in training)
labelIndex - labelIndex of original dataset (used in training)
public void retrain(ObjectBank<List<IN>> doc)
public void train(Collection<List<IN>> wordInfos, DocumentReaderAndWriter<IN> readerAndWriter)
Description copied from class: AbstractSequenceClassifier
Trains a classifier from a Collection of sequences.
train in class AbstractSequenceClassifier<IN extends CoreLabel>
Parameters:
wordInfos - An ObjectBank or a collection of sequences of IN
readerAndWriter - A DocumentReaderAndWriter to use when loading test files
public Index<String> getFeaturesAboveThreshhold(Dataset<String,String> dataset, double thresh)
public Dataset<String,String> getDataset(Collection<List<IN>> data)
Parameters:
data - A list of lists of CoreLabel. That is, it is a collection of documents, each of which is represented as a sequence of CoreLabel objects.
public Dataset<String,String> getDataset(Collection<List<IN>> data, Index<String> featureIndex, Index<String> classIndex)
Parameters:
data - A list of lists of CoreLabel. That is, it is a collection of documents, each of which is represented as a sequence of CoreLabel objects.
classIndex - Supply this if you want to get a Dataset based on the featureIndex and classIndex of an existing origDataset.
public Dataset<String,String> getBiasedDataset(ObjectBank<List<IN>> data, Index<String> featureIndex, Index<String> classIndex)
public Dataset<String,String> getDataset(ObjectBank<List<IN>> data, Dataset<String,String> origDataset)
Parameters:
data - A list of lists of CoreLabel. That is, it is a collection of documents, each of which is represented as a sequence of CoreLabel objects.
origDataset - Supply this if you want to get a Dataset based on the featureIndex and classIndex of an existing origDataset.
public Dataset<String,String> getDataset(Dataset<String,String> oldData, Index<String> goodFeatures)
Parameters:
oldData - This Dataset represents data from which we wish to remove some features, specifically those features not in the Index goodFeatures.
goodFeatures - An Index of features we wish to retain.
Returns:
A Dataset where each datapoint contains only features which were in goodFeatures.
public void serializeClassifier(String serializePath)
Description copied from class: AbstractSequenceClassifier
Serialize a sequence classifier to a file on the given path.
serializeClassifier in class AbstractSequenceClassifier<IN extends CoreLabel>
Parameters:
serializePath - The path/filename to write the classifier to.
public void loadDefaultClassifier()
public static CMMClassifier getDefaultClassifier()
public void loadClassifier(ObjectInputStream ois, Properties props) throws ClassCastException, IOException, ClassNotFoundException
loadClassifier in class AbstractSequenceClassifier<IN extends CoreLabel>
Parameters:
ois - The ObjectInputStream to load the serialized classifier from
props - This Properties object will be used to update the SeqClassifierFlags which are read from the serialized classifier
Throws:
IOException - If there are problems accessing the input stream
ClassCastException - If there are problems interpreting the serialized data
ClassNotFoundException - If there are problems interpreting the serialized data
public static CMMClassifier getClassifierNoExceptions(File file)
public static CMMClassifier getClassifier(File file) throws IOException, ClassCastException, ClassNotFoundException
Throws:
IOException
ClassCastException
ClassNotFoundException
public static CMMClassifier getClassifierNoExceptions(String loadPath)
public static CMMClassifier getClassifier(String loadPath) throws IOException, ClassCastException, ClassNotFoundException
Throws:
IOException
ClassCastException
ClassNotFoundException
public static CMMClassifier getClassifierNoExceptions(InputStream in)
public static CMMClassifier getClassifier(InputStream in) throws IOException, ClassCastException, ClassNotFoundException
Throws:
IOException
ClassCastException
ClassNotFoundException
public <T extends CoreLabel> Datum<String,String> makeDatum(List<IN> info, int loc, FeatureFactory featureFactory)
Parameters:
info - A List of WordInfo objects
loc - The position in the info list to focus feature creation on
featureFactory - The factory that constructs features out of the item
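To illustrate what "focusing feature creation on a position" can mean, the sketch below builds string features from a window of words around loc, padding past the sequence ends. It is not how makeDatum or NERFeatureFactory works internally: the class name WindowFeatureSketch, the "<PAD>" token, the feature naming scheme, and the radius parameter are all assumptions, and a real Datum would also carry a label.

```java
import java.util.ArrayList;
import java.util.List;

/** Builds window features around one position (illustrative only). */
final class WindowFeatureSketch {
    static final String PAD = "<PAD>"; // hypothetical padding token

    /** Returns the word at index i, or PAD when i falls outside the sequence. */
    static String wordAt(List<String> words, int i) {
        return (i >= 0 && i < words.size()) ? words.get(i) : PAD;
    }

    /** Builds one feature per offset in [-radius, radius] around position loc. */
    static List<String> featuresAt(List<String> words, int loc, int radius) {
        List<String> features = new ArrayList<>();
        for (int offset = -radius; offset <= radius; offset++) {
            String sign = offset >= 0 ? "+" : ""; // negative offsets already print their sign
            features.add("W" + sign + offset + "=" + wordAt(words, loc + offset));
        }
        return features;
    }
}
```

Each feature records the relative offset, so the same word contributes different features depending on where it sits relative to the focus position.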
public void trainSemiSup()
public Counter<String> scoresOf(List<IN> lineInfos, int pos)
public void printProbsDocument(List<IN> document)
Takes a List of CoreLabels and prints the likelihood of each possible label at each point.
TODO: Finish or delete this method!
printProbsDocument in class AbstractSequenceClassifier<IN extends CoreLabel>
Parameters:
document - A List of CoreLabels.
public static void main(String[] args) throws Exception
Throws:
Exception
public Document<?,?,WordTag> processDocument(Document in)
Assigns NER labels to the words in the given Document. Implements the DocumentProcessor interface. Outputs a new document with the same meta-data as the old one, but whose contents are a List of WordTags, where the tags are the NER labels assigned to the word.
processDocument in interface DocumentProcessor
FunctionProcessor
public List<WordTag> process(List list)
Assigns NER labels to the words in the given List. Implements the ListProcessor interface. Checks the input for instances of HasWord and HasTag, or uses the toString() method if the HasWord check fails. Outputs a list of WordTags, where the tag is the NER label assigned to the word.
process in interface ListProcessor<Object,WordTag>
public double weight(String feature, String label)
public double[][] weights()
public List<IN> classifyWithGlobalInformation(List<IN> tokenSeq, CoreMap doc, CoreMap sent)
Description copied from class: AbstractSequenceClassifier
Classify a List of something that extends CoreMap using as additional information whatever is stored in the document and sentence. This is needed for SUTime (NumberSequenceClassifier), which requires the document date to resolve relative dates.
classifyWithGlobalInformation in class AbstractSequenceClassifier<IN extends CoreLabel>