Software/Classifier

From NLPWiki
Revision as of 19:50, 25 October 2012 by ChrisManning (Talk | contribs)

The Stanford Classifier

The Stanford Classifier is a general-purpose classifier: something that takes a set of input data items and assigns each of them to one of a set of categories. It does this by generating features from each datum, and each feature is associated with positive or negative numeric "votes" (weights) for each class. In principle, the weights could be set by hand, but the expected use is for the weights to be learned automatically from hand-classified training data items. (This is referred to as "supervised learning".) The classifier can work with (scaled) real-valued and categorical inputs, and supports several machine learning algorithms. It also supports several forms of regularization, which is generally needed when building models with very large numbers of predictive features.

You can use the classifier on any sort of data, including standard statistics and machine learning data sets. But for small data sets and numeric predictors, you'd generally be better off using another tool such as R or Weka. Where the Stanford Classifier shines is in working with mainly textual data, where it has powerful and flexible means of generating features from character strings. However, if you've also got a few numeric variables, you can throw them in at the same time.

Small Examples

Cheese-Disease: A small textual example

While you can specify most options on the command line, normally the easiest way to train and test models with the Stanford Classifier is through use of properties files that record all the options used. You can find a couple of example data sets and properties files in the examples folder of the Stanford Classifier distribution.

The Cheese-Disease dataset is a play on the MTV game show Idiot Savants from the late 1990s, which had a trivia category of Cheese or Disease? (I guess you had to be there...). The goal is to distinguish cheese names from disease names. Look at the file examples/cheeseDisease.train to see what the data looks like. The first column is the category (1=cheese, 2=disease). The number coding of classes was arbitrary. The two classes could have been called "cheese" and "disease". The second column is the name. The columns are separated by a tab character. Here there is just one class column and one predictive column. This is the minimum for training a classifier, but you can have any number of predictive columns and specify which column has what role.
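To make the file layout concrete, here is a minimal Python sketch of reading data in this two-column, tab-separated format. It is not part of the Stanford Classifier; the function name read_examples and the three sample rows are illustrative assumptions (the names are stand-ins and may not appear in examples/cheeseDisease.train).

```python
import csv
import io

# Illustrative rows only: stand-ins for lines of examples/cheeseDisease.train.
# Column 0 is the class (1=cheese, 2=disease), column 1 is the name.
SAMPLE = "1\tBrie\n2\tMeasles\n1\tGouda\n"


def read_examples(handle, class_col=0, text_col=1):
    """Return (label, text) pairs from tab-separated lines."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[class_col], row[text_col]) for row in reader if row]


examples = read_examples(io.StringIO(SAMPLE))
labels = sorted({label for label, _ in examples})

print(examples)  # list of (class, name) pairs
print(labels)    # the distinct class labels seen
```

The class_col and text_col parameters mirror the idea that you say which column plays which role; with more predictive columns you would simply pick out more of them.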

In the top level folder of the Stanford Classifier, the following command will build a model for this data set and test it on the test data set in the simplest possible way:

java -jar stanford-classifier.jar -prop examples/cheese2007.prop

This prints a lot of information. The first part shows a little bit about the data set. The second part shows the process of optimization (choosing feature weights in training a classifier on the training data). The next part then shows the results of testing the model on a separate test set of data, and the final 5 lines give the test results:

196 examples in test set
Cls 2: TP=123 FN=5 FP=8 TN=60; Acc 0.934 P 0.939 R 0.961 F1 0.950
Cls 1: TP=60 FN=8 FP=5 TN=123; Acc 0.934 P 0.923 R 0.882 F1 0.902
Micro-averaged accuracy/F1: 0.93367
Macro-averaged F1: 0.92603

For each class, the results show the number of true positives, false negatives, false positives, and true negatives, the class accuracy, precision, recall and F1 measure. It then gives a summary F1 over the whole data set, either micro-averaged (each test item counts equally) or macro-averaged (each class counts equally). Distinguishing cheeses and diseases isn't too hard for the classifier!
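If you want to check how these figures relate to the four counts, the following Python sketch recomputes them from the TP/FN/FP/TN values printed above, using the standard definitions of accuracy, precision, recall and F1. The helper name prf is ours, not something from the classifier.

```python
def prf(tp, fn, fp, tn):
    """Accuracy, precision, recall and F1 from one class's confusion counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Counts copied from the classifier output shown above.
counts = {
    "2": dict(tp=123, fn=5, fp=8, tn=60),
    "1": dict(tp=60, fn=8, fp=5, tn=123),
}
per_class = {cls: prf(**c) for cls, c in counts.items()}

# Micro-averaging: every test item counts equally. With exactly one label
# per item this is the same number as overall accuracy.
total_items = 196
micro = sum(c["tp"] for c in counts.values()) / total_items

# Macro-averaging: every class counts equally.
macro = sum(m[3] for m in per_class.values()) / len(per_class)

for cls, (acc, p, r, f1) in per_class.items():
    print(f"Cls {cls}: Acc {acc:.3f} P {p:.3f} R {r:.3f} F1 {f1:.3f}")
print(f"Micro-averaged accuracy/F1: {micro:.5f}")
print(f"Macro-averaged F1: {macro:.5f}")
```

Because class 2 has more test items and a higher F1, the micro-average (0.93367) comes out above the macro-average (0.92603).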

What features does the classifier use, and what is useful in making a decision? Mostly the system is using character n-grams (short subsequences of characters), though it also has a couple of other features, including a class frequency prior and a feature for the bucketed length of the name. In the above example, the -jar command runs the default class in the jar file, which is edu.stanford.nlp.classify.ColumnDataClassifier. In the next example we'll name that class explicitly on the command line. It is also often useful to mix a properties file and some command-line flags: if running a series of experiments, you might keep the baseline classifier configuration in a properties file and put the properties that differ between experiments on the command line. Things specified on the command line override specifications in the properties file. We'll add a command-line flag to print features with high weights:

java -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier -prop examples/cheese2007.prop -printClassifier HighWeight

This now prints out the features with high weights. (This form of output is especially easily interpretable for categorical features.) You see that most of the clearest, best features are particular character n-grams that indicate disease words, such as: ia$, ma$, sis (where $ indicates the end of string). For example, the highest weight feature is:

(1-#E-ia,2)        1.0975

which says that the feature is a string-final (#E) bigram ia from the string in column 1. For that feature, the weight for class 2 (disease) is 1.0975: a strong positive vote that a name with this feature is a disease, not a cheese.
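As a rough picture of how such features arise and vote, here is a small Python sketch. It is a simplification under stated assumptions, not the classifier's actual feature generator: only the 1-#E-ia name and its weight come from the output above, while the #B and plain # prefixes, the n-gram length limits, and the sample name "Pneumonia" are illustrative guesses.

```python
def char_ngram_features(text, column=1, min_n=1, max_n=4):
    """Collect character n-gram feature names for one column's string.

    Naming is modelled on the '1-#E-ia' feature shown above; the '#B'
    (string-initial) and '#' (anywhere) prefixes are assumptions.
    """
    feats = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            feats.add(f"{column}-#-{gram}")
            if i == 0:
                feats.add(f"{column}-#B-{gram}")
            if i + n == len(text):
                feats.add(f"{column}-#E-{gram}")
    return feats


# The single weight reported above: (feature, class) -> vote.
weights = {("1-#E-ia", "2"): 1.0975}

feats = char_ngram_features("Pneumonia")  # illustrative name, not from the data
score_disease = sum(weights.get((f, "2"), 0.0) for f in feats)

print("1-#E-ia" in feats)
print(score_disease)
```

In a real model every active feature would contribute a weight for every class, and the class with the largest total wins.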

The most common features to use in text classification are word features, and you might think of adding them here (even though most of these names are three or fewer words, and many are only one). You can fairly easily do this by adding a couple more flags for features:

java -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier -prop examples/cheese2007.prop -printClassifier HighWeight -1.useSplitWords -1.splitWordsRegexp "\s"

However, at least in this case, accuracy isn't improved beyond just using character n-grams. We get the same results as before:

196 examples in test set
Cls 2: TP=123 FN=5 FP=8 TN=60; Acc 0.934 P 0.939 R 0.961 F1 0.950
Cls 1: TP=60 FN=8 FP=5 TN=123; Acc 0.934 P 0.923 R 0.882 F1 0.902
Micro-averaged accuracy/F1: 0.93367
Macro-averaged F1: 0.92603
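For intuition about what the two extra flags add, here is a hedged Python sketch of splitting a name on the same "\s" regular expression and turning each piece into a feature. The SW feature prefix, the function name, and the sample names are assumptions for illustration; the classifier's own naming may differ.

```python
import re


def split_word_features(text, column=1, split_regexp=r"\s"):
    """Split text on the regexp and name one feature per non-empty word."""
    words = [w for w in re.split(split_regexp, text) if w]
    return [f"{column}-SW-{w}" for w in words]


# Illustrative names, not taken from the data files.
one_word = split_word_features("Cheddar")
two_words = split_word_features("Lyme disease")

print(one_word)
print(two_words)
```

For a one-word name the only word feature is the whole name, which one might expect to add little beyond the character n-grams already in use; that would be consistent with the unchanged scores above.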

Iris data set

Fisher's Iris data set is one of the most famous data sets in statistics and machine learning [1]. Three species of Iris are described by four numeric variables. We show it both as a simple example of numeric classification and as an example of using multiple columns of inputs for each data item. The download includes a version of the 150-item data set divided into 130 training examples and 20 test examples, and a properties file suitable for training a classifier from it.

Note that the provided properties file is set up to run from the top-level folder of the Stanford Classifier distribution. We will assume that STANFORD_CLASSIFIER_HOME points to it. You can do something like:

 STANFORD_CLASSIFIER_HOME=/Users/manning/Software/stanford-classifier-2008-04-18

Here is the provided properties file:

#
# Features
#
# Data format by column is:
# species     sepalLength	sepalWidth	petalLength	petalWidth
#
useClassFeature=true
1.realValued=true
2.realValued=true
3.realValued=true
4.realValued=true

printClassifier=AllWeights

#
# Training input
#
trainFile=./examples/iris.train
testFile=./examples/iris.test

The four predictor variables are all specified as real valued. There are other flags that will let you use numeric variables with a few simple transforms, such as logTransform or logitTransform.
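As a reminder of what transforms with these names usually mean, here is a Python sketch of the textbook log and logit functions. This is an assumption about the mathematics only: how the classifier itself implements logTransform and logitTransform (for example, how it treats zero or out-of-range values) is not shown here.

```python
import math


def log_transform(x):
    """Natural logarithm; defined only for x > 0."""
    return math.log(x)


def logit_transform(p):
    """Log-odds of a proportion; defined only for 0 < p < 1."""
    return math.log(p / (1.0 - p))


print(log_transform(math.e))   # 1.0
print(logit_transform(0.5))    # 0.0
```

The log transform compresses large positive values, while the logit stretches proportions near 0 or 1 onto the whole real line.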

If you run this model:

cd $STANFORD_CLASSIFIER_HOME
java -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier -prop examples/iris2007.prop

Then you'll find that you get the test set completely right!

Built this classifier: Linear classifier with the following weights
        Iris-setosa     Iris-versicolor Iris-virginica 
3-Value -2.27            0.03            2.26          
CLASS    0.34            0.65           -1.01          
4-Value -1.07           -0.91            1.99          
2-Value  1.60           -0.13           -1.43          
1-Value  0.69            0.42           -1.23          
Total:  -0.72            0.05            0.57          
Prob:    0.15            0.32            0.54          

Output format: dataColumn1 goldAnswer classifierAnswer P(classifierAnswer)
5	Iris-setosa	Iris-setosa	0.969786023975717
4.6	Iris-setosa	Iris-setosa	0.9922589089843827
5.1	Iris-setosa	Iris-setosa	0.9622434861270637
4.9	Iris-setosa	Iris-setosa	0.9515812390773056
5.4	Iris-setosa	Iris-setosa	0.9811482146433487
4.4	Iris-setosa	Iris-setosa	0.9682526103461551
5.3	Iris-setosa	Iris-setosa	0.9832118698970074
6.1	Iris-versicolor	Iris-versicolor	0.7091015073390197
6	Iris-versicolor	Iris-versicolor	0.7601066690047942
5.5	Iris-versicolor	Iris-versicolor	0.723249991884404
6.5	Iris-versicolor	Iris-versicolor	0.7913325733592043
6.8	Iris-versicolor	Iris-versicolor	0.8416723165037595
6.2	Iris-versicolor	Iris-versicolor	0.8854234492113978
6.7	Iris-virginica	Iris-virginica	0.8440929745353494
6.4	Iris-virginica	Iris-virginica	0.7816139993113614
5.7	Iris-virginica	Iris-virginica	0.9352983975779943
6.7	Iris-virginica	Iris-virginica	0.8626420107509875
6.8	Iris-virginica	Iris-virginica	0.9442955376893006
7.7	Iris-virginica	Iris-virginica	0.8866439920995643
7.3	Iris-virginica	Iris-virginica	0.8633450387282207

20 examples in test set
Cls Iris-setosa: TP=7 FN=0 FP=0 TN=13; Acc 1.000 P 1.000 R 1.000 F1 1.000
Cls Iris-versicolor: TP=6 FN=0 FP=0 TN=14; Acc 1.000 P 1.000 R 1.000 F1 1.000
Cls Iris-virginica: TP=7 FN=0 FP=0 TN=13; Acc 1.000 P 1.000 R 1.000 F1 1.000
Micro-averaged accuracy/F1: 1.00000
Macro-averaged F1: 1.00000
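The weight table above can be read as a recipe for scoring a flower. The following Python sketch assumes that each real-valued feature contributes weight times value, that the CLASS feature always has value 1, and that class scores are turned into probabilities with a softmax; these are assumptions about the model, and the weights are the rounded ones printed above. The measurements used (5.1, 3.5, 1.4, 0.2) are a well-known setosa row from Fisher's data, not necessarily one of the 20 test items.

```python
import math

CLASSES = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Rounded weights copied from the printed table, one list entry per class.
WEIGHTS = {
    "CLASS":   [0.34, 0.65, -1.01],
    "1-Value": [0.69, 0.42, -1.23],
    "2-Value": [1.60, -0.13, -1.43],
    "3-Value": [-2.27, 0.03, 2.26],
    "4-Value": [-1.07, -0.91, 1.99],
}


def softmax(scores):
    """Turn per-class scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def classify(sepal_len, sepal_wid, petal_len, petal_wid):
    """Score one flower: sum of weight * value per class, then softmax."""
    values = {
        "CLASS": 1.0,
        "1-Value": sepal_len,
        "2-Value": sepal_wid,
        "3-Value": petal_len,
        "4-Value": petal_wid,
    }
    scores = [
        sum(WEIGHTS[name][k] * value for name, value in values.items())
        for k in range(len(CLASSES))
    ]
    probs = softmax(scores)
    best = max(range(len(CLASSES)), key=lambda k: probs[k])
    return CLASSES[best], probs


# Softmax of the printed "Total" row should land close to the "Prob" row.
total_probs = softmax([-0.72, 0.05, 0.57])
label, probs = classify(5.1, 3.5, 1.4, 0.2)

print([round(p, 3) for p in total_probs])
print(label)
```

Under these assumptions the softmax of the Total row agrees with the printed Prob row (0.15, 0.32, 0.54) to within the rounding of the printed weights.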

This is a fairly easy, well-separated classification problem. Indeed you might think that the model is overparameterized, and it is. The number of examples in each class is roughly balanced, so there is presumably little value in the useClassFeature property, which puts in a feature that models the overall distribution of classes. You also don't need to use all the numeric features. See the plots on the Wikipedia Iris flower data set page. You can instead delete the features for columns 2 and 4, using just the sepal and petal lengths and not the widths, and still get 100% accuracy on our test set. (However, it happens that if you delete both the class feature and the two width features, then the model that is built only gets 19/20 of the test set examples right....)

Larger Examples

20 Newsgroups