Package edu.stanford.nlp.classify

The classify package provides facilities for training classifiers.


Interface Summary
Classifier<L,F> A simple interface for classifying and scoring data points, implemented by most of the classifiers in this package.
ClassifierCreator<L,F> Creates a classifier with given weights
ClassifierFactory<L,F,C extends Classifier<L,F>> A simple interface for training a Classifier from a Dataset of training examples.
ProbabilisticClassifier<L,F>  
ProbabilisticClassifierCreator<L,F> Creates a probabilistic classifier with given weights
RVFClassifier<L,F> A simple interface for classifying and scoring data points with real-valued features.
 

Class Summary
AbstractLinearClassifierFactory<L,F> Shared methods for training a LinearClassifier.
AdaptedGaussianPriorObjectiveFunction<L,F> Adapt the mean of the Gaussian Prior by shifting the mean to the previously trained weights
BiasedLogConditionalObjectiveFunction Maximizes the conditional likelihood with a given prior.
BiasedLogisticObjectiveFunction  
ClassifierExample Sample code that illustrates the training and use of a linear classifier.
ColumnDataClassifier ColumnDataClassifier provides a command-line interface for doing context-free (independent) classification of a series of data items, where each data item is represented by a line of a file, as a list of String variables, in tab-separated columns.
CrossValidator<L,F> This class is meant to simplify performing cross validation on classifiers for hyper-parameters.
CrossValidator.SavedState  
Dataset<L,F> An interfacing class for ClassifierFactory that incrementally builds a more memory-efficient representation of a List of Datum objects for the purposes of training a Classifier with a ClassifierFactory.
GeneralDataset<L,F> The purpose of this interface is to unify Dataset and RVFDataset.
GeneralizedExpectationObjectiveFunction<L,F> Implementation of Generalized Expectation Objective function for an I.I.D.
LinearClassifier<L,F> Implements a multiclass linear classifier.
LinearClassifierFactory<L,F> Builds various types of linear classifiers, with functionality for setting objective function, optimization method, and other parameters.
LinearClassifierFactory.LinearClassifierCreator<L,F>  
LogConditionalEqConstraintFunction Maximizes the conditional likelihood with a given prior.
LogConditionalObjectiveFunction<L,F> Maximizes the conditional likelihood with a given prior.
LogisticClassifier<L,F> A classifier for binary logistic regression problems.
LogisticClassifierFactory<L,F>  
LogisticObjectiveFunction Maximizes the conditional likelihood with a given prior.
LogPrior A Prior for functions.
NaiveBayesClassifier<L,F>  
NaiveBayesClassifierFactory<L,F>  
NBLinearClassifierFactory<L,F> Provides a medium-weight implementation of Bernoulli (or binary) Naive Bayes via a linear classifier.
NominalDataReader  
PRCurve  
RVFDataset<L,F> An interfacing class for ClassifierFactory that incrementally builds a more memory-efficient representation of a List of RVFDatum objects for the purposes of training a Classifier with a ClassifierFactory.
SemiSupervisedLogConditionalObjectiveFunction Maximizes the conditional likelihood with a given prior.
WeightedDataset<L,F>  
 

Enum Summary
LogPrior.LogPriorType  
 

Package edu.stanford.nlp.classify Description

The classify package provides facilities for training classifiers. In this package, data points are viewed as single instances, not sequences. The most commonly used classifier is the log-linear classifier with binary features. More classifiers, such as SVM and Naive Bayes, are also available in this package.

The Classifier contract only guarantees routines for getting a classification for an example, and the scores assigned to each class for that example. Note that training is dependent upon the individual classifier.

Classifiers operate over Datum objects. A Datum is a list of descriptive features and a class label; features and labels can be any object, but usually Strings are used. Datum objects are grouped using Dataset objects. Some classifiers use Dataset objects as a way of grouping inputs.

Following is a set of examples outlining how to create, train, and use each of the different classifier types.

Linear Classifiers

To build a classifier, one first creates a GeneralDataset, which is a list of Datum objects. A Datum is a list of descriptive features along with a label; features and labels can be any object, though we usually use Strings.

GeneralDataset<String, String> dataSet = new Dataset<String, String>();
while (/* more datums to make */) {
  List<String> featureList = ...; // e.g., ["PrevWord=at", "CurrentTag=NNP", "isUpperCase"]
  String label = ...;             // e.g., "PLACE"
  Datum<String, String> d = new BasicDatum<String, String>(featureList, label);
  dataSet.add(d);
}

There are some useful methods in GeneralDataset such as:

dataSet.applyFeatureCountThreshold(int cutoff);
dataSet.summaryStatistics(); // dumps the number of features and datums
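
The effect of applyFeatureCountThreshold can be sketched in standalone Java. The method name mirrors the library's, but this self-contained version (representing each datum as a plain feature list) is illustrative only:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ThresholdDemo {
    // Conceptual sketch: count how often each feature appears across the
    // whole dataset, then drop features seen fewer than `cutoff` times.
    static List<List<String>> applyFeatureCountThreshold(List<List<String>> datums, int cutoff) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> d : datums) {
            for (String f : d) {
                counts.merge(f, 1, Integer::sum);
            }
        }
        List<List<String>> out = new ArrayList<>();
        for (List<String> d : datums) {
            List<String> kept = new ArrayList<>();
            for (String f : d) {
                if (counts.get(f) >= cutoff) {
                    kept.add(f);
                }
            }
            out.add(kept);
        }
        return out;
    }
}
```

Thresholding rare features this way shrinks the feature index and often reduces overfitting.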

Next, one makes a LinearClassifierFactory and calls its trainClassifier(GeneralDataset dataSet) method:

  LinearClassifierFactory lcFactory = new LinearClassifierFactory();
  LinearClassifier c = lcFactory.trainClassifier(dataSet);

LinearClassifierFactory has options for different optimizers (default: QNMinimizer), the convergence threshold for minimization, and other parameters. Check the class documentation for detailed information.

A classifier, once built, can be used to classify new Datum instances:

Object label = c.classOf(mysteryDatum);

If you want scores instead, you can ask:

Counter scores = c.scoresOf(mysteryDatum);

The scores which are returned by the log-linear classifiers are the feature-weight dot products, not the normalized probabilities.
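The relationship between raw scores and normalized probabilities can be sketched in a few lines of standalone Java (the class and method names here are illustrative, not part of this package's API):

```java
import java.util.Collection;
import java.util.Map;

public class ScoresDemo {
    // Score for one class = dot product of that class's weight vector with
    // the datum's (here binary) feature vector: unseen features contribute 0.
    static double score(Map<String, Double> weights, Collection<String> features) {
        double s = 0.0;
        for (String f : features) {
            s += weights.getOrDefault(f, 0.0);
        }
        return s;
    }

    // Softmax turns raw per-class scores into normalized probabilities;
    // subtracting the max first keeps the exponentials numerically stable.
    static double[] softmax(double[] scores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            max = Math.max(max, s);
        }
        double z = 0.0;
        double[] p = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            z += Math.exp(scores[i] - max);
        }
        for (int i = 0; i < scores.length; i++) {
            p[i] = Math.exp(scores[i] - max) / z;
        }
        return p;
    }
}
```

Applying softmax to the per-class scores is exactly the normalization step that separates scoresOf from a probabilistic classifier's probability output.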

There are other useful methods such as justificationOf(Datum d) and logProbabilityOf(Datum d), as well as various methods for visualizing the weights and the most highly weighted features. This concludes the discussion of log-linear classifiers with binary features.

We can also train log-linear classifiers with real-valued features. In this case, RVFDatum should be used.

Real Valued Classifiers

Real Valued Classifiers (RVF) operate over RVFDatum objects. An RVFDatum is composed of a set of feature/real-value pairs. RVFDatum objects are grouped using an RVFDataset.

To assemble an RVFDatum, fill a Counter with feature/value pairs and assign a label to it:

  Counter<String> features = new ClassicCounter<String>();
  features.incrementCount("FEATURE_A", 1.2);
  features.incrementCount("FEATURE_B", 2.3);
  features.incrementCount("FEATURE_C", 0.5);
  RVFDatum<String, String> rvfDatum = new RVFDatum<String, String>(features, "DATUM_LABEL");

RVFDataset objects are representations of RVFDatum objects that efficiently store the data with which to train the classifier. This type of dataset only accepts RVFDatum objects via its add method (other Datum objects that are not instances of RVFDatum will be ignored), and is equivalent to a Dataset if all RVFDatum objects have only features with value 1.0. Since it is a subclass of GeneralDataset, the methods shown above as applied to the GeneralDataset can also be applied to the RVFDataset.
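The real-valued scoring rule, and its reduction to the binary-feature case when all values are 1.0, can be sketched as standalone Java (illustrative names, not the package API):

```java
import java.util.Map;

public class RvfScoreDemo {
    // With real-valued features the score is sum over features of
    // weight(f) * value(f); binary features are the special case value = 1.0,
    // which recovers the plain dot product used for a Dataset.
    static double score(Map<String, Double> weights, Map<String, Double> datum) {
        double s = 0.0;
        for (Map.Entry<String, Double> e : datum.entrySet()) {
            s += weights.getOrDefault(e.getKey(), 0.0) * e.getValue();
        }
        return s;
    }
}
```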

(TODO) An example for LinearType2Classifier.

Saving a classifier out to a file (from LearningExperiment):

 private static void saveClassifierToFile(LinearClassifier classifier, String serializePath) {
    System.err.print("Serializing classifier to " + serializePath + "...");

    try {
      ObjectOutputStream oos;
      if (serializePath.endsWith(".gz")) {
        oos = new ObjectOutputStream(new BufferedOutputStream(new GZIPOutputStream(new FileOutputStream(serializePath))));
      } else {
        oos = new ObjectOutputStream(new BufferedOutputStream(new FileOutputStream(serializePath)));
      }

      oos.writeObject(classifier);

      oos.close();
      System.err.println("done.");

    } catch (Exception e) {
      e.printStackTrace();
      throw new RuntimeException("Serialization failed: "+e.getMessage());
    }

  }
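
A matching load method can be written the same way. The generic sketch below (illustrative names, plain Java serialization only, not part of this package) round-trips any Serializable object, handling the same .gz convention as the save method above:

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class SerializationDemo {
    // Write any Serializable object, gzip-compressing when the path ends in ".gz".
    static void writeObjectToFile(Object o, String path) {
        try (ObjectOutputStream oos = new ObjectOutputStream(new BufferedOutputStream(
                path.endsWith(".gz") ? new GZIPOutputStream(new FileOutputStream(path))
                                     : new FileOutputStream(path)))) {
            oos.writeObject(o);
        } catch (IOException e) {
            throw new RuntimeException("Serialization failed: " + e.getMessage(), e);
        }
    }

    // Read the object back; cast the result to LinearClassifier (or whatever
    // type was written) at the call site.
    static Object readObjectFromFile(String path) {
        try (ObjectInputStream ois = new ObjectInputStream(new BufferedInputStream(
                path.endsWith(".gz") ? new GZIPInputStream(new FileInputStream(path))
                                     : new FileInputStream(path)))) {
            return ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException("Deserialization failed: " + e.getMessage(), e);
        }
    }
}
```

Closing the outermost stream closes the whole chain, which is what finalizes the gzip trailer correctly.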

Alternatively, if your features are Strings and you wish to serialize to a human-readable text file, you can use saveToFilename in LinearClassifier and reconstitute using loadFromFilename in LinearClassifierFactory. Though the format is not as compact as a serialized object and implicitly presumes the features are Strings, it is useful for debugging purposes.

Author:
Dan Klein, Eric Yeh


Stanford NLP Group