Difference between revisions of "Software/Classifier"

From NLPWiki
(20 Newsgroups)
(Cheese-Disease: A small textual example)
(19 intermediate revisions by 3 users not shown)
 
= The Stanford Classifier =
 
= The Stanford Classifier =
  
The Stanford Classifier is a general purpose classifier - something that takes input data and assigns it to one of a number of categories. It can work with (scaled) real-valued and categorical inputs, and supports several machine learning algorithms. It also supports several forms of regularization, which is generally needed when building models with very large numbers of predictive features.
+
The [http://nlp.stanford.edu/software/classifier.shtml Stanford Classifier] is a general purpose classifier - something that takes a set of input data and assigns each of them to one of a set of categories. It does this by generating features from each datum which are associated with positive or negative numeric "votes" (weights) for each class.  In principle, the weights could be set by hand, but the expected use is for the weights to be learned automatically based on hand-classified training data items.  (This is referred to as "supervised learning".)  The classifier can work with (scaled) real-valued and categorical inputs, and supports several machine learning algorithms. It also supports several forms of regularization, which is generally needed when building models with very large numbers of predictive features.
  
You can use the classifier on any sort of data, including standard statistics and machine learning data sets.  But for small data sets and numeric predictors, you'd generally be better off using another tool such as [http://www.cs.waikato.ac.nz/ml/weka/ Weka] or [http://www.r-project.org/ R]. Where the Stanford Classifier shines is in working with mainly textual data, where it has powerful and flexible means of generating features from character strings.  But if you've also got a few numeric variables, you can throw them in at the same time.
+
You can use the classifier on any sort of data, including standard statistics and machine learning data sets.  But for small data sets and numeric predictors, you'd generally be better off using another tool such as [http://www.r-project.org/ R] or [http://www.cs.waikato.ac.nz/ml/weka/ Weka]. Where the Stanford Classifier shines is in working with mainly textual data, where it has powerful and flexible means of generating features from character strings.  However, if you've also got a few numeric variables, you can throw them in at the same time.
  
== Examples ==
+
== Small Examples ==
  
=== 20 Newsgroups ===
+
=== Cheese-Disease: A small textual example ===
  
Now let's walk through a more realistic example of using the Stanford Classifier on the well-known 20 Newsgroups dataset. There are several versions of 20 Newsgroups.  We'll use Jason Rennie's "bydate" version from [http://people.csail.mit.edu/jrennie/20Newsgroups/].  The precise commands shown below should work on Linux or Mac OS X systems.  The Java parts should also be fine under Windows, but you'll need to do the downloading and reformatting a little differently.
+
While you can specify most options on the command line, normally the easiest way to train and test models with the Stanford Classifier is through use of properties files that record all the options used. You can find a couple of example data sets and properties files in the <tt>examples</tt> folder of the Stanford Classifier distribution.
  
First we download the corpus:
+
The Cheese-Disease dataset is a play on the MTV game show Idiot Savants from the late 1990s, which had a trivia category of Cheese or Disease? (I guess you had to be there...).  The goal is to distinguish cheese names from disease names.  Look at the file <tt>examples/cheeseDisease.train</tt> to see what the data looks like.  The first column is the category (1=cheese, 2=disease).  The number coding of classes was arbitrary.  The two classes could have been called "cheese" and "disease".  The second column is the name.  The columns are separated by a tab character.  Here there is just one class column and one predictive column.  This is the minimum for training a classifier, but you can have any number of predictive columns and specify which column has what role.
curl -O http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
+
Then we unpack it:
+
  tar -xzf 20news-bydate.tar.gz
+
The 20 Newsgroups data comes in a format of one file per document, with the correct class shown by the directory name. The Stanford Classifier works with tab-delimited text files.  We convert it into this latter format with a simple shell script:
+
curl -O http://nlp.stanford.edu/software/classifier/convert-to-stanford-classifier.csh
+
  chmod 755 convert-to-stanford-classifier.csh
+
  ./convert-to-stanford-classifier.csh
+
We do this by converting line endings to spaces.  This loses line break information which could easily have some value in classification.  (We could have done something trickier like converting line endings to a vertical tab or form feed, but this will do for this example.) As part of the conversion, we also convert the original 8-bit newsgroup posts to UTF-8.  It's 2010 now.
+
  
Check that everything worked and you have the right number of documents:
+
In the top level folder of the Stanford Classifier, the following command will build a model for this data set and test it on the test data set in the simplest possible way:
  wc -l 20news-bydate*-stanford-classifier.txt
+
  java -jar stanford-classifier.jar -prop examples/cheese2007.prop
    7532 20news-bydate-test-stanford-classifier.txt
+
This prints a lot of information.  The first part shows a little bit about the data set.  The second part shows the process of optimization (choosing feature weights in training a classifier on the training data).  The next part then shows the results of testing the model on a separate test set of data, and the final 5 lines give the test results:
  11314 20news-bydate-train-stanford-classifier.txt
+
196 examples in test set
  18846 total
+
Cls 2: TP=123 FN=5 FP=8 TN=60; Acc 0.934 P 0.939 R 0.961 F1 0.950
The correct number should be as shown.
+
Cls 1: TP=60 FN=8 FP=5 TN=123; Acc 0.934 P 0.923 R 0.882 F1 0.902
 +
Micro-averaged accuracy/F1: 0.93367
 +
Macro-averaged F1: 0.92603
 +
For each class, the results show the number of true positives, false negatives, false positives, and true negatives, the class accuracy, precision, recall and F1 measure.  It then gives a summary F1 over the whole data set, either micro-averaged (each test item counts equally) or macro-averaged (each class counts equally). For skewed data sets, macro-averaged F1 is a good measure of how well a classifier does on uncommon classes.
 +
Distinguishing cheeses and diseases isn't too hard for the classifier!
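The per-class and averaged figures in that output can be recomputed from the contingency counts alone. The following is a minimal Python sketch, not part of the Stanford Classifier; the function name <code>prf</code> and the dictionary layout are ours, and the counts are copied from the test output shown above.

```python
def prf(tp, fn, fp):
    """Precision, recall and F1 from contingency-table counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts copied from the cheese/disease test output.
counts = {
    "2": dict(tp=123, fn=5, fp=8, tn=60),
    "1": dict(tp=60, fn=8, fp=5, tn=123),
}

per_class_f1 = {}
for label, c in counts.items():
    p, r, f1 = prf(c["tp"], c["fn"], c["fp"])
    per_class_f1[label] = f1
    print(f"Cls {label}: P {p:.3f} R {r:.3f} F1 {f1:.3f}")

# Every test item belongs to exactly one class, so TP + FN summed over
# the classes is the size of the test set.
n_items = sum(c["tp"] + c["fn"] for c in counts.values())

# Micro-average: each test item counts equally (here this is plain accuracy).
micro = sum(c["tp"] for c in counts.values()) / n_items
# Macro-average: each class counts equally.
macro = sum(per_class_f1.values()) / len(per_class_f1)

print(f"Micro-averaged accuracy/F1: {micro:.5f}")
print(f"Macro-averaged F1: {macro:.5f}")
```

Because the smaller class (cheese) has the lower F1, the macro average comes out a little below the micro average.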
  
Next we split the training data into initial training data and a development test set. <i>(Methodological note: people often fail to do this.  But if you are going to build a sequence of models trying to find the best classifier, and moreover if you are likely to be looking at the test results to see where your model fails and how you might fix it, then it is vital that you use a development test set that is distinct from your final test set.  Otherwise you <b>will</b> overfit on your final test set and report unrealistically rosy results.)</i>
+
What features does the classifier use, and what is useful in making a decision? Mostly the system is using ''character n-grams'' - short subsequences of characters - though it also has a couple of other features that include a class frequency prior and a feature for the bucketed length of the name.  In the above example,
  grep -P '^\S+\s[0-9]*[1-8]\s' 20news-bydate-train-stanford-classifier.txt > 20news-bydate-devtrain-stanford-classifier.txt
+
the <tt>-jar</tt> command runs the default class in the jar file, which is <tt>edu.stanford.nlp.classify.ColumnDataClassifier</tt>. In this example we'll show running the command explicitly. Also, often it is useful to mix a properties file and some command-line flags: if running a series of experiments, you might have the baseline classifier configuration in a properties file but put differences in properties for a series of experiments on the command-line.  Things specified on the command-line override specifications in the properties file.  We'll add a command-line flag to print features with high weights:
  grep -P '^\S+\s[0-9]*[90]\s' 20news-bydate-train-stanford-classifier.txt > 20news-bydate-devtest-stanford-classifier.txt
+
java -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier -prop examples/cheese2007.prop -printClassifier HighWeight
This gives a roughly random partition of 80% of the data in devtrain and 20% in devtest based on the final digit of the newsgroup item number.
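If <code>grep -P</code> is not available on your system, the same split can be sketched in Python. This is an illustration under stated assumptions: it reuses the two regular expressions from the <code>grep</code> commands unchanged, and the sample lines (including their item numbers) are made up for the demonstration.

```python
import re

# Same patterns as in the grep commands: class, whitespace, then an item
# number whose final digit decides which partition the line goes to.
DEVTRAIN = re.compile(r"^\S+\s[0-9]*[1-8]\s")
DEVTEST = re.compile(r"^\S+\s[0-9]*[90]\s")

def split_lines(lines):
    """Partition tab-delimited lines into (devtrain, devtest)."""
    devtrain, devtest = [], []
    for line in lines:
        if DEVTRAIN.match(line):
            devtrain.append(line)
        elif DEVTEST.match(line):
            devtest.append(line)
    return devtrain, devtest

# Hypothetical lines in the class<TAB>item-number<TAB>text layout.
sample = [
    "sci.space\t60151\tsome text",
    "sci.space\t60159\tmore text",
    "rec.autos\t101560\tcar text",
    "rec.autos\t101562\tmore car text",
]
devtrain, devtest = split_lines(sample)
print(len(devtrain), "devtrain lines,", len(devtest), "devtest lines")
```

Item numbers ending in 1-8 go to devtrain and those ending in 9 or 0 go to devtest, which is where the roughly 80/20 proportion comes from.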
+
This now prints out the features with high weights. (This form of output is especially easily interpretable for categorical features.)  You see that most of the clearest, best features are particular character n-grams that indicate disease words, such as: ia$, ma$, sis (where $ indicates the end of string). For example, the highest weight feature is:
 +
(1-#E-ia,2)        1.0975
 +
which says that the feature is a string final (#E) bigram of ''ia'' from the String in column 1.  For that feature, the weight for class 2 (disease) is 1.0975 - this is a strong positive vote for this feature indicating a disease not a cheese.
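To make the feature notation concrete, here is a small Python sketch that generates string-final character n-grams and labels them in the style of the output above. It is only an illustration: the function <code>final_ngrams</code>, its defaults, and the example names are ours, and the classifier's real feature extraction is written in Java and produces other feature types as well.

```python
def final_ngrams(text, column=1, min_n=1, max_n=4):
    """String-final character n-grams, labelled like '1-#E-ia'.

    The 'column-#E-ngram' layout imitates the printed feature names;
    min_n and max_n are illustrative defaults, not the classifier's.
    """
    longest = min(max_n, len(text))
    return [f"{column}-#E-{text[-n:]}" for n in range(min_n, longest + 1)]

for name in ["Pneumonia", "Tuberculosis", "Brie"]:
    print(name, final_ngrams(name))
```

The list for "Pneumonia" includes <code>1-#E-ia</code>, the feature that the classifier weighted most heavily towards class 2 (disease).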
  
We'll assume that <code>$STANFORD_CLASSIFIER_JAR</code> points at the Stanford Classifier jar.  So, depending on your shell, do something like:
+
The commonest features to use in text classification are word features, and you might think of adding them here (even though most of these names are three or fewer words, and many are only one).  You can fairly easily do this by adding a couple more flags for features:
  STANFORD_CLASSIFIER_JAR=/Users/manning/Software/stanford-classifier-2008-04-18/stanford-classifier.jar
+
java -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier -prop examples/cheese2007.prop \
 +
        -printClassifier HighWeight -1.useSplitWords -1.splitWordsRegexp "\s"
 +
However, at least in this case, accuracy isn't improved beyond just using character n-grams.  We get the same results as before:
 +
196 examples in test set
 +
Cls 2: TP=123 FN=5 FP=8 TN=60; Acc 0.934 P 0.939 R 0.961 F1 0.950
 +
Cls 1: TP=60 FN=8 FP=5 TN=123; Acc 0.934 P 0.923 R 0.882 F1 0.902
 +
Micro-averaged accuracy/F1: 0.93367
 +
Macro-averaged F1: 0.92603
  
This next command builds pretty much the simplest classifier that you could.  It divides the input documents on white space and then trains a classifier on the resulting tokens. The command is normally entered as all one line without the trailing backslashes, but we've split it so it formats better on this page.
+
=== Iris data set ===
java -mx1800m -cp $STANFORD_CLASSIFIER_JAR edu.stanford.nlp.classify.ColumnDataClassifier \
+
-trainFile 20news-bydate-devtrain-stanford-classifier.txt -testFile 20news-bydate-devtest-stanford-classifier.txt \
+
-2.useSplitWords -2.splitWordsRegexp "\\s+"
+
Note that once the dataset is reasonably large, you have to give a fair amount of memory to the classifier. (There are some methods for reducing memory usage that we'll discuss later on.)  [Also, if you have any problems with this command, it's probably an issue with symbol escaping in your shell, and so you should probably just skip it and go on to the next example that uses a properties file.]
+
  
This command generates a lot of output. The last part shows the accuracy of the classifier:
+
Fisher's Iris data set is one of the most famous data sets in statistics and machine learning [http://en.wikipedia.org/wiki/Iris_flower_data_set]. Three species of Iris are described by four numeric variables. We show it both as a simple example of numeric classification and as an example of using multiple columns of inputs for each data item. In the download, there is a version of the 150 item data set divided into 130 training examples and 20 test examples, and a properties file suitable for training a classifier from it.
2241 examples in test set
+
Cls alt.atheism: TP=80 FN=20 FP=5 TN=2136; Acc 0.989 P 0.941 R 0.800 F1 0.865
+
Cls comp.graphics: TP=82 FN=33 FP=30 TN=2096; Acc 0.972 P 0.732 R 0.713 F1 0.722
+
Cls comp.os.ms-windows.misc: TP=91 FN=24 FP=20 TN=2106; Acc 0.980 P 0.820 R 0.791 F1 0.805
+
Cls comp.sys.ibm.pc.hardware: TP=87 FN=31 FP=32 TN=2091; Acc 0.972 P 0.731 R 0.737 F1 0.734
+
Cls comp.sys.mac.hardware: TP=90 FN=24 FP=23 TN=2104; Acc 0.979 P 0.796 R 0.789 F1 0.793
+
Cls comp.windows.x: TP=112 FN=8 FP=20 TN=2101; Acc 0.988 P 0.848 R 0.933 F1 0.889
+
Cls misc.forsale: TP=100 FN=17 FP=47 TN=2077; Acc 0.971 P 0.680 R 0.855 F1 0.758
+
Cls rec.autos: TP=95 FN=19 FP=21 TN=2106; Acc 0.982 P 0.819 R 0.833 F1 0.826
+
Cls rec.motorcycles: TP=98 FN=14 FP=14 TN=2115; Acc 0.988 P 0.875 R 0.875 F1 0.875
+
Cls rec.sport.baseball: TP=112 FN=7 FP=13 TN=2109; Acc 0.991 P 0.896 R 0.941 F1 0.918
+
Cls rec.sport.hockey: TP=113 FN=4 FP=4 TN=2120; Acc 0.996 P 0.966 R 0.966 F1 0.966
+
Cls sci.crypt: TP=108 FN=8 FP=5 TN=2120; Acc 0.994 P 0.956 R 0.931 F1 0.943
+
Cls sci.electronics: TP=93 FN=24 FP=24 TN=2100; Acc 0.979 P 0.795 R 0.795 F1 0.795
+
Cls sci.med: TP=104 FN=20 FP=8 TN=2109; Acc 0.988 P 0.929 R 0.839 F1 0.881
+
Cls sci.space: TP=113 FN=8 FP=9 TN=2111; Acc 0.992 P 0.926 R 0.934 F1 0.930
+
Cls soc.religion.christian: TP=107 FN=12 FP=22 TN=2100; Acc 0.985 P 0.829 R 0.899 F1 0.863
+
Cls talk.politics.guns: TP=96 FN=10 FP=5 TN=2130; Acc 0.993 P 0.950 R 0.906 F1 0.928
+
Cls talk.politics.mideast: TP=104 FN=7 FP=3 TN=2127; Acc 0.996 P 0.972 R 0.937 F1 0.954
+
Cls talk.politics.misc: TP=87 FN=12 FP=5 TN=2137; Acc 0.992 P 0.946 R 0.879 F1 0.911
+
Cls talk.religion.misc: TP=49 FN=18 FP=10 TN=2164; Acc 0.988 P 0.831 R 0.731 F1 0.778
+
Micro-averaged accuracy/F1: 0.85721
+
Macro-averaged F1: 0.85670
+
We see the statistics for each class and averaged over all the data. For each class, we see the four cells of counts in a contingency table, and then the accuracy and precision, recall and F-measure calculated for them. This model already seems to perform quite well.  But we can do a little better.
+
  
As soon as you want to start specifying a lot of options, you'll probably want a properties file to specify everything.  Indeed, some options you can only successfully set with a properties file.  One of the first things to address seems to be better tokenization.  Tokenizing on whitespace is fairly naive.  One can usually write a rough-and-ready but usable tokenizer inside <code>ColumnDataClassifier</code> by using the <code>splitWordsTokenizerRegexp</code> property.  Another alternative would be to use a tool like the Stanford tokenizer to pre-tokenize the data.  In general, this will probably work a bit better for English-language text, but is beyond what we consider here. Here's a simple properties file which you can [http://nlp.stanford.edu/software/classifier/20news1.prop download]:
+
Note that the provided properties file is set up to run from the top-level folder of the Stanford Classifier distribution.  We will assume that STANFORD_CLASSIFIER_HOME points to it.  You can do something like:
trainFile=20news-bydate-devtrain-stanford-classifier.txt
+
  STANFORD_CLASSIFIER_HOME=/Users/manning/Software/stanford-classifier-2008-04-18
testFile=20news-bydate-devtest-stanford-classifier.txt
+
2.useSplitWords=true
+
2.splitWordsTokenizerRegexp=[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.
+
2.splitWordsIgnoreRegexp=\\s+
+
This tokenizer recognizes tokens starting with a letter followed by letters and ASCII digits, or some number, money, and percent expressions, or a run of whitespace, or any other single character.  The whitespace tokens are then ignored.
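You can get a feel for what this tokenizer produces by trying an approximation of it in Python. This is a sketch under assumptions: Python's <code>re</code> module has no <code>\p{L}</code>, so we substitute <code>[^\W\d_]</code> (a word character that is neither a digit nor an underscore), which is close to but not identical with "any letter"; we also assume the ignore pattern has to match a whole token. The sample sentence is made up.

```python
import re

# Approximation of the splitWordsTokenizerRegexp value from the properties file.
TOKEN = re.compile(
    r"[^\W\d_](?:[^\W\d_]|[0-9])*"          # a letter, then letters / ASCII digits
    r"|(?:\$ ?)?[0-9]+(?:\.[0-9]{2})?%?"    # number, money and percent expressions
    r"|\s+"                                 # a run of whitespace
    r"|[\x80-\uFFFD]"                       # a single non-ASCII character
    r"|."                                   # any other single character
)
# Counterpart of splitWordsIgnoreRegexp: tokens that are all whitespace are dropped.
IGNORE = re.compile(r"\s+")

def tokenize(text):
    """Split text into tokens, discarding the whitespace tokens."""
    return [t for t in TOKEN.findall(text) if not IGNORE.fullmatch(t)]

print(tokenize("Price: $5.00 for X11, 50% off!"))
```

Punctuation marks come out as single-character tokens, while <code>$5.00</code>, <code>50%</code> and <code>X11</code> each stay in one piece.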
+
  
Just a bit of work on tokenization gains us about 2% in accuracy!
+
Here is the provided properties file:
  java -mx1800m -cp $STANFORD_CLASSIFIER_JAR edu.stanford.nlp.classify.ColumnDataClassifier -prop 20news1.prop
+
  #
  ...
+
# Features
  Micro-averaged accuracy/F1: 0.87773
+
#
  Macro-averaged F1: 0.87619
+
# Data format by column is:
 +
# species    sepalLength sepalWidth petalLength petalWidth
 +
#
 +
useClassFeature=true
 +
1.realValued=true
 +
2.realValued=true
 +
3.realValued=true
 +
  4.realValued=true
 +
   
 +
printClassifier=AllWeights
 +
 +
#
 +
# Training input
 +
#
 +
trainFile=./examples/iris.train
 +
  testFile=./examples/iris.test
  
You can look at the output of the tokenizer by examining the features the classifier generates.  We can do this with this command:
+
The four predictor variables are all specified as real valued.  There are other flags that will let you use numeric variables with a few simple transforms, such as <code>logTransform</code> or <code>logitTransform</code>.
java -mx1800m -cp $STANFORD_CLASSIFIER_JAR edu.stanford.nlp.classify.ColumnDataClassifier -prop 20news1.prop -printFeatures prop1
+
Look at the resulting (very large) file <code>prop1.train</code> . You might be able to get a bit better performance by fine-tuning the tokenization, though, often, for text categorization, a fairly simple tokenization is sufficient, providing (1) it's enough to recognize most semantically contentful word units, and (2) it doesn't produce a huge number of rarely observed features.  (E.g., for this data set, there are a few uuencoded files in newsgroup postings.  Under whitespace tokenization, each line of the file became a token that almost certainly only occurred once.  Now they'll get split up on characters that aren't letters and digits.  That not only reduces the token space, but probably some of the letter strings that do result will recur, and become slightly useful features.  It seems like it might be better to remove the uuencoded content altogether, while leaving just enough to know that there was a uuencoded file in the news posting. Some much more developed processors such as the [http://www.cs.cmu.edu/~mccallum/bow/ bow] tokenizer handle recognizing and stripping uuencoded files.  The Stanford Classifier doesn't.  uuencoded text isn't so common in 2010. But we don't worry about this; it probably doesn't make much difference.)
+
  
There are many other kinds of features that you could consider putting into the classifier which might improve performance. The length of a newsgroup posting might be informative, but it probably isn't linearly related to its class, so we bin lengths into 4 categories, which become categorical features. You have to choose those cut-offs manually, but <code>ColumnDataClassifier</code> can print simple statistics of how many documents of each class fall in each bin, which can help you see if you've chosen very bad cut-offs.  Here's the properties file: [http://nlp.stanford.edu/software/classifier/20news2.prop] .
+
If you run this model:
  trainFile=20news-bydate-devtrain-stanford-classifier.txt
+
cd $STANFORD_CLASSIFIER_HOME
  testFile=20news-bydate-devtest-stanford-classifier.txt
+
java -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier -prop examples/iris2007.prop
  2.useSplitWords=true
+
Then you'll find that you get the test set completely right!
  2.splitWordsTokenizerRegexp=[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.
+
Built this classifier: Linear classifier with the following weights
  2.splitWordsIgnoreRegexp=\\s+
+
        Iris-setosa    Iris-versicolor Iris-virginica
  2.binnedLengths=500,1500,4500,13500
+
3-Value -2.27            0.03            2.26         
  2.binnedLengthsStatistics=true
+
  CLASS    0.34            0.65          -1.01         
 +
4-Value -1.07          -0.91            1.99         
 +
2-Value  1.60          -0.13          -1.43         
 +
1-Value  0.69            0.42          -1.23         
 +
Total: -0.72            0.05            0.57         
 +
Prob:   0.15            0.32            0.54         
 +
 +
Output format: dataColumn1 goldAnswer classifierAnswer P(classifierAnswer)
 +
5 Iris-setosa Iris-setosa 0.969786023975717
 +
  4.6 Iris-setosa Iris-setosa 0.9922589089843827
 +
5.1 Iris-setosa Iris-setosa 0.9622434861270637
 +
  4.9 Iris-setosa Iris-setosa 0.9515812390773056
 +
5.4 Iris-setosa Iris-setosa 0.9811482146433487
 +
  4.4 Iris-setosa Iris-setosa 0.9682526103461551
 +
  5.3 Iris-setosa Iris-setosa 0.9832118698970074
 +
6.1 Iris-versicolor Iris-versicolor 0.7091015073390197
 +
6 Iris-versicolor Iris-versicolor 0.7601066690047942
 +
5.5 Iris-versicolor Iris-versicolor 0.723249991884404
 +
6.5 Iris-versicolor Iris-versicolor 0.7913325733592043
 +
  6.8 Iris-versicolor Iris-versicolor 0.8416723165037595
 +
6.2 Iris-versicolor Iris-versicolor 0.8854234492113978
 +
6.7 Iris-virginica Iris-virginica 0.8440929745353494
 +
6.4 Iris-virginica Iris-virginica 0.7816139993113614
 +
5.7 Iris-virginica Iris-virginica 0.9352983975779943
 +
6.7 Iris-virginica Iris-virginica 0.8626420107509875
 +
6.8 Iris-virginica Iris-virginica 0.9442955376893006
 +
7.7 Iris-virginica Iris-virginica 0.8866439920995643
 +
7.3 Iris-virginica Iris-virginica 0.8633450387282207
 +
 +
20 examples in test set
 +
Cls Iris-setosa: TP=7 FN=0 FP=0 TN=13; Acc 1.000 P 1.000 R 1.000 F1 1.000
 +
  Cls Iris-versicolor: TP=6 FN=0 FP=0 TN=14; Acc 1.000 P 1.000 R 1.000 F1 1.000
 +
Cls Iris-virginica: TP=7 FN=0 FP=0 TN=13; Acc 1.000 P 1.000 R 1.000 F1 1.000
 +
Micro-averaged accuracy/F1: 1.00000
 +
Macro-averaged F1: 1.00000
 +
This is a fairly easy, well-separated classification problem. Indeed you might think that the model is overparameterized, and it is. The number of examples in each class is roughly balanced, so there is presumably little value in the <code>useClassFeature</code> property which puts in a feature that models the overall distribution of classes. You also don't need to use all the numeric features. See the plots on the [http://en.wikipedia.org/wiki/Iris_flower_data_set Wikipedia Iris flower data set page].  You can instead delete the features for columns 2 and 4, using just the sepal and petal lengths and not the widths, and still get 100% accuracy on our test set. (However, it happens that if you delete both the classFeature and the two width features, then the model that is built only gets 19/20 of the test set examples right....)
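In the printed classifier output above, the <code>Prob</code> row looks like the <code>Total</code> row passed through a softmax, which is what you would expect from a maximum entropy (multiclass logistic regression) model. That reading is our assumption rather than something the output states, and the Python sketch below only checks it numerically. The totals are copied from the output; because they are rounded to two decimal places, the recomputed probabilities agree with the printed ones only approximately.

```python
import math

def softmax(scores):
    """Turn per-class scores into probabilities that sum to 1."""
    top = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# "Total" row copied from the printed classifier output.
totals = {"Iris-setosa": -0.72, "Iris-versicolor": 0.05, "Iris-virginica": 0.57}
probs = dict(zip(totals, softmax(list(totals.values()))))

for label, p in probs.items():
    print(f"{label}: {p:.3f}")
```

The class with the largest total vote gets the largest probability, and that class is the classifier's answer for the datum.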
  
In this case, that doesn't help:
+
== Larger Examples ==
Micro-averaged accuracy/F1: 0.87773
+
Macro-averaged F1: 0.87619
+
  
Other feature types that are often useful with text documents are token prefixes and suffixes and the "shape" of a token (whether it contains upper or lowercase or digits or certain kinds of symbols as equivalence classes).  We've also changed the output to show the highest weight features in the model. That's often informative to look at. This gives our next properties file:  [http://nlp.stanford.edu/software/classifier/20news3.prop] .
+
[[Software/Classifier/20 Newsgroups| 20 Newsgroups text classification]]
trainFile=20news-bydate-devtrain-stanford-classifier.txt
+
testFile=20news-bydate-devtest-stanford-classifier.txt
+
2.useSplitWords=true
+
2.splitWordsTokenizerRegexp=[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.
+
2.splitWordsIgnoreRegexp=\\s+
+
2.useSplitPrefixSuffixNGrams=true
+
2.maxNGramLeng=4
+
2.minNGramLeng=1
+
2.splitWordShape=chris4
+
printClassifier=HighWeight
+
printClassifierParam=400
+
  
This pushes performance up another percent, roughly:
+
[[Software/Classifier/Sentiment| Sentiment Analysis]]
Micro-averaged accuracy/F1: 0.88755
+
Macro-averaged F1: 0.88588
+
 
+
As well as fiddling with features, we can also fiddle with the machine learning and optimization.  By default you get a maximum entropy (roughly, multiclass logistic regression) model with L2 regularization (a.k.a., a gaussian prior) optimized by the L-BFGS quasi-Newton method. You might be able to get a bit of improvement by adjusting the amount of regularization, which you can do by altering the sigma parameter:
+
sigma=3
+
You can also change the type of regularization altogether.  Lately, L1 regularization has been popular for producing well-performing compact models.  You can select it by specifying the L1 regularization parameter:
+
l1reg=0.1
+
We tried a couple of settings of both of these, but nothing really seemed to beat out the defaults.
+
 
+
And so we return to fiddling with features. Academic papers don't spend much time discussing fiddling with features, but in practice it's usually where most of the gains come from (once you've got a basically competent machine learning method). In the last properties file, we had it print out the highest weight features. That's often useful to look at.  Here are the top few:
+
(1-S#B-X,comp.windows.x)                  1.3971
+
(1-S#B-Win,comp.os.ms-windows.misc)        0.9708
+
(1-SW-car,rec.autos)                      0.9538
+
(1-S#E-car,rec.autos)                      0.9178
+
(1-S#B-Mac,comp.sys.mac.hardware)          0.9054
+
(1-S#E-dows,comp.os.ms-windows.misc)      0.9020
+
(1-S#B-x,comp.windows.x)                  0.8157
+
(1-S#B-car,rec.autos)                      0.7689
+
(1-S#E-ows,comp.os.ms-windows.misc)        0.7382
+
(1-S#E-ale,misc.forsale)                  0.7361
+
They basically make sense.  Note that all but one of them are beginning or end split words n-gram features (S#B or S#E).  This partly makes sense: these features generalize over multiple actual words, so starting with "X" will match "X", "Xwindows" or "X-windows".  It's part of what makes these features useful. But it also suggests that we might really be missing out by not collapsing case distinctions: S#E-ale is a good feature for <code>misc.forsale</code> precisely because it matches both "Sale" and "sale".  So let's try tackling that.  There are several possible variants.  One thing to try would be to just lowercase everything.  Another would be to instead put in lowercased splitWords or both the regular splitWords features <i>and</i> lowercased versions of them.  We tried several things.  Relevant properties are <code>lowercase</code>, <code>useLowercaseSplitWords</code>, and <code>lowercaseNGrams</code>.  The best thing seemed to be to put in the splitWords regular case and lowercase, but to keep the character n-grams cased.
+
 
+
<i>Technical point: the top features list also shows that many of the features are highly collinear: you get pairs like SW-car and S#E-car or S#E-dows and S#E-ows which mainly match in the same documents. This is common with textual features, and we don't try to solve this problem. The best we can do is to observe that maximum entropy models are reasonably tolerant of this sort of feature overlap: The fact that the model with both lowercased and regular case features seems to work best is indicative of this.</i>
+
 
+
We'll use that.  You might then also want to save your built classifier so you can run it on data sets later.  (You can later load it either directly with <code>ColumnDataClassifier</code> or, in your own program, with a method like <code>LinearClassifier.readClassifier(filename)</code>.) This gives us our final properties file:
+
[http://nlp.stanford.edu/software/classifier/20news4.prop] .
+
trainFile=20news-bydate-devtrain-stanford-classifier.txt
+
testFile=20news-bydate-devtest-stanford-classifier.txt
+
2.useLowercaseSplitWords=true
+
2.useSplitWords=true
+
2.splitWordsTokenizerRegexp=[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.
+
2.splitWordsIgnoreRegexp=\\s+
+
2.useSplitPrefixSuffixNGrams=true
+
2.maxNGramLeng=4
+
2.minNGramLeng=1
+
2.splitWordShape=chris4
+
printClassifier=HighWeight
+
printClassifierParam=400
+
serializeTo=20newsgroups4.ser.gz
+
 
+
Here are the devtest set results:
+
2241 examples in test set
+
Cls alt.atheism: TP=90 FN=10 FP=5 TN=2136; Acc 0.993 P 0.947 R 0.900 F1 0.923
+
Cls comp.graphics: TP=91 FN=24 FP=26 TN=2100; Acc 0.978 P 0.778 R 0.791 F1 0.784
+
Cls comp.os.ms-windows.misc: TP=98 FN=17 FP=20 TN=2106; Acc 0.983 P 0.831 R 0.852 F1 0.841
+
Cls comp.sys.ibm.pc.hardware: TP=90 FN=28 FP=35 TN=2088; Acc 0.972 P 0.720 R 0.763 F1 0.741
+
Cls comp.sys.mac.hardware: TP=92 FN=22 FP=15 TN=2112; Acc 0.983 P 0.860 R 0.807 F1 0.833
+
Cls comp.windows.x: TP=112 FN=8 FP=16 TN=2105; Acc 0.989 P 0.875 R 0.933 F1 0.903
+
Cls misc.forsale: TP=99 FN=18 FP=29 TN=2095; Acc 0.979 P 0.773 R 0.846 F1 0.808
+
Cls rec.autos: TP=98 FN=16 FP=15 TN=2112; Acc 0.986 P 0.867 R 0.860 F1 0.863
+
Cls rec.motorcycles: TP=106 FN=6 FP=6 TN=2123; Acc 0.995 P 0.946 R 0.946 F1 0.946
+
Cls rec.sport.baseball: TP=114 FN=5 FP=11 TN=2111; Acc 0.993 P 0.912 R 0.958 F1 0.934
+
Cls rec.sport.hockey: TP=113 FN=4 FP=5 TN=2119; Acc 0.996 P 0.958 R 0.966 F1 0.962
+
Cls sci.crypt: TP=110 FN=6 FP=2 TN=2123; Acc 0.996 P 0.982 R 0.948 F1 0.965
+
Cls sci.electronics: TP=97 FN=20 FP=16 TN=2108; Acc 0.984 P 0.858 R 0.829 F1 0.843
+
Cls sci.med: TP=113 FN=11 FP=3 TN=2114; Acc 0.994 P 0.974 R 0.911 F1 0.942
+
Cls sci.space: TP=117 FN=4 FP=4 TN=2116; Acc 0.996 P 0.967 R 0.967 F1 0.967
+
Cls soc.religion.christian: TP=113 FN=6 FP=18 TN=2104; Acc 0.989 P 0.863 R 0.950 F1 0.904
+
Cls talk.politics.guns: TP=99 FN=7 FP=4 TN=2131; Acc 0.995 P 0.961 R 0.934 F1 0.947
+
Cls talk.politics.mideast: TP=107 FN=4 FP=1 TN=2129; Acc 0.998 P 0.991 R 0.964 F1 0.977
+
Cls talk.politics.misc: TP=91 FN=8 FP=4 TN=2138; Acc 0.995 P 0.958 R 0.919 F1 0.938
+
Cls talk.religion.misc: TP=49 FN=18 FP=7 TN=2167; Acc 0.989 P 0.875 R 0.731 F1 0.797
+
Micro-averaged accuracy/F1: 0.89201
+
Macro-averaged F1: 0.89099
+
 
+
This then leaves the final test where we train on the full training set and then test on the test set:
+
java -mx1800m -cp $STANFORD_CLASSIFIER_JAR edu.stanford.nlp.classify.ColumnDataClassifier -prop 20news4.prop \
+
  -trainFile 20news-bydate-train-stanford-classifier.txt -testFile 20news-bydate-test-stanford-classifier.txt
+
 
+
Here are the final test set results:
+
 
+
You'll notice that they are quite a bit lower. Results being a bit lower is to be expected (after all, we were overfitting to the devtest set by doing multiple runs), but the results here are a lot lower.  I think this is because in the <code>bydate</code> version of 20 Newsgroups, the test set is all from a later time period than the training set, whereas when we subdivided the training set, we took a roughly uniform sample across it as the devtest set.  In most time series text data sets, there's enough extra similarity between documents close in time, and enough temporal movement in what gets posted, that you get a substantial difference like this.
+
 
+
Nevertheless, these results are pretty good!  Here are results we could find in the academic literature for the same data set:
+
 
+
{|
+
!Paper
+
!Model
+
!Micro-ave accuracy
+
!Notes
+
|-
+
|Jason D. M. Rennie, 2003, On The Value of Leave-One-Out Cross-Validation Bounds
+
|regularized least squares classifier
+
|0.8486
+
|na
+
|Optimal regularization chosen post-hoc on test set
+
|-
+
|Larochelle, H and Bengio, Y, 2008, Classification using Discriminative Restricted Boltzmann Machines
+
|hybrid discriminative RBM
+
|0.762
+
|Only 5000 most frequent tokens used as features
+
|}
+
 
+
Other results on 20 Newsgroups
+
Rennie ICML 2003 accuracy: Multinomial NB 0.848, TWCNB 0.861 and SVM 0.862 on average of 80/20 splits
+
Lan, Tan, and Low AAAI 2006 accuracy: SVM 0.81, kNN 0.69
+
Gu and Zhou SIAM Datamining 2009 accuracy:
+

Revision as of 06:41, 27 October 2012


The Stanford Classifier

The Stanford Classifier is a general purpose classifier - something that takes a set of input data and assigns each of them to one of a set of categories. It does this by generating features from each datum which are associated with positive or negative numeric "votes" (weights) for each class. In principle, the weights could be set by hand, but the expected use is for the weights to be learned automatically based on hand-classified training data items. (This is referred to as "supervised learning".) The classifier can work with (scaled) real-valued and categorical inputs, and supports several machine learning algorithms. It also supports several forms of regularization, which is generally needed when building models with very large numbers of predictive features.

You can use the classifier on any sort of data, including standard statistics and machine learning data sets. But for small data sets and numeric predictors, you'd generally be better off using another tool such as R or Weka. Where the Stanford Classifier shines is in working with mainly textual data, where it has powerful and flexible means of generating features from character strings. However, if you've also got a few numeric variables, you can throw them in at the same time.

Small Examples

Cheese-Disease: A small textual example

While you can specify most options on the command line, normally the easiest way to train and test models with the Stanford Classifier is through use of properties files that record all the options used. You can find a couple of example data sets and properties files in the examples folder of the Stanford Classifier distribution.

The Cheese-Disease dataset is a play on the MTV game show Idiot Savants from the late 1990s, which had a trivia category of Cheese or Disease? (I guess you had to be there...). The goal is to distinguish cheese names from disease names. Look at the file examples/cheeseDisease.train to see what the data looks like. The first column is the category (1=cheese, 2=disease). The number coding of classes was arbitrary. The two classes could have been called "cheese" and "disease". The second column is the name. The columns are separated by a tab character. Here there is just one class column and one predictive column. This is the minimum for training a classifier, but you can have any number of predictive columns and specify which column has what role.
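
As a rough illustration of the two-column, tab-separated format described above, here is a small Python sketch. It is not part of the Stanford Classifier, and the two sample rows are made up for illustration rather than copied from examples/cheeseDisease.train.

```python
def parse_line(line):
    """Split one tab-separated line into (class label, name)."""
    label, name = line.rstrip("\n").split("\t", 1)
    return label, name

# Made-up rows in the described format (not copied from the distributed file).
sample = "1\tBrie\n2\tMalaria\n"
rows = [parse_line(line) for line in sample.splitlines()]

# The numeric class codes are arbitrary; map them to readable names.
class_names = {"1": "cheese", "2": "disease"}
labelled = [(class_names[label], name) for label, name in rows]
```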

In the top level folder of the Stanford Classifier, the following command will build a model for this data set and test it on the test data set in the simplest possible way:

java -jar stanford-classifier.jar -prop examples/cheese2007.prop

This prints a lot of information. The first part shows a little bit about the data set. The second part shows the process of optimization (choosing feature weights in training a classifier on the training data). The next part then shows the results of testing the model on a separate test set of data, and the final 5 lines give the test results:

196 examples in test set
Cls 2: TP=123 FN=5 FP=8 TN=60; Acc 0.934 P 0.939 R 0.961 F1 0.950
Cls 1: TP=60 FN=8 FP=5 TN=123; Acc 0.934 P 0.923 R 0.882 F1 0.902
Micro-averaged accuracy/F1: 0.93367
Macro-averaged F1: 0.92603

For each class, the results show the number of true positives, false negatives, false positives, and true negatives, the class accuracy, precision, recall and F1 measure. It then gives a summary F1 over the whole data set, either micro-averaged (each test item counts equally) or macro-averaged (each class counts equally). For skewed data sets, macro-averaged F1 is a good measure of how well a classifier does on uncommon classes. Distinguishing cheeses and diseases isn't too hard for the classifier!
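
If you want to check how these numbers relate to each other, the following Python sketch recomputes them from the counts printed above. It uses the standard definitions of precision, recall and F1, and is not the classifier's own code.

```python
def prf(tp, fn, fp):
    """Precision, recall and F1 from true positive, false negative and false positive counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts copied from the output above.
counts = {"2": dict(tp=123, fn=5, fp=8), "1": dict(tp=60, fn=8, fp=5)}
total = 196

per_class = {c: prf(**v) for c, v in counts.items()}
# Each test item counts equally: correct items over all items.
micro_accuracy = sum(v["tp"] for v in counts.values()) / total
# Each class counts equally: unweighted mean of the per-class F1 values.
macro_f1 = sum(f1 for _, _, f1 in per_class.values()) / len(per_class)

print(f"{micro_accuracy:.5f} {macro_f1:.5f}")  # prints "0.93367 0.92603"
```

When every item gets exactly one class, micro-averaged F1 and accuracy coincide, which is why the output reports them as a single number.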

What features does the classifier use, and what is useful in making a decision? Mostly the system is using character n-grams - short subsequences of characters - though it also has a couple of other features that include a class frequency prior and a feature for the bucketed length of the name. In the above example, the -jar command runs the default class in the jar file, which is edu.stanford.nlp.classify.ColumnDataClassifier. In this example we'll show running the command explicitly. Also, often it is useful to mix a properties file and some command-line flags: if running a series of experiments, you might have the baseline classifier configuration in a properties file but put differences in properties for a series of experiments on the command-line. Things specified on the command-line override specifications in the properties file. We'll add a command-line flag to print features with high weights:

java -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier -prop examples/cheese2007.prop -printClassifier HighWeight

This now prints out the features with high weights. (This form of output is especially easily interpretable for categorical features.) You see that most of the clearest, best features are particular character n-grams that indicate disease words, such as: ia$, ma$, sis (where $ indicates the end of string). For example, the highest weight feature is:

(1-#E-ia,2)        1.0975

which says that the feature is a string final (#E) bigram of ia from the String in column 1. For that feature, the weight for class 2 (disease) is 1.0975 - this is a strong positive vote for this feature indicating a disease not a cheese.
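
To get a feel for what string-final n-grams look like, here is a simplified Python sketch. It follows the $ notation used in the text above rather than the classifier's internal feature names, it is not the tool's implementation, and the two example names are chosen only for illustration.

```python
def char_ngrams(name, min_n=2, max_n=3, end_marker="$"):
    """Return the set of character n-grams of name, with end_marker appended
    so that string-final n-grams are distinguishable from internal ones."""
    marked = name + end_marker
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams

disease = char_ngrams("anemia")  # illustrative disease name
cheese = char_ngrams("brie")     # illustrative cheese name

# "ia$" (a final "ia") occurs in the disease name but not in the cheese name.
has_final_ia = ("ia$" in disease, "ia$" in cheese)
```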

The commonest features to use in text classification are word features, and you might think of adding them here (even though most of these names are 3 or fewer words, and many are only 1). You can fairly easily do this by adding a couple more flags for features:

java -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier -prop examples/cheese2007.prop \
        -printClassifier HighWeight -1.useSplitWords -1.splitWordsRegexp "\s"

However, at least in this case, accuracy isn't improved beyond just using character n-grams. We get the same results as before:

196 examples in test set
Cls 2: TP=123 FN=5 FP=8 TN=60; Acc 0.934 P 0.939 R 0.961 F1 0.950
Cls 1: TP=60 FN=8 FP=5 TN=123; Acc 0.934 P 0.923 R 0.882 F1 0.902
Micro-averaged accuracy/F1: 0.93367
Macro-averaged F1: 0.92603

Iris data set

Fisher's Iris data set is one of the most famous data sets in statistics and machine learning [1]. Three species of Iris are described by four numeric variables. We show it both as a simple example of numeric classification and as an example of using multiple columns of inputs for each data item. In the download, there is a version of the 150 item data set divided into 130 training examples and 20 test examples, and a properties file suitable for training a classifier from it.

Note that the provided properties file is set up to run from the top-level folder of the Stanford Classifier distribution. We will assume that STANFORD_CLASSIFIER_HOME points to it. You can do something like:

 STANFORD_CLASSIFIER_HOME=/Users/manning/Software/stanford-classifier-2008-04-18

Here is the provided properties file:

#
# Features
#
# Data format by column is:
# species     sepalLength	sepalWidth	petalLength	petalWidth
#
useClassFeature=true
1.realValued=true
2.realValued=true
3.realValued=true
4.realValued=true

printClassifier=AllWeights

#
# Training input
#
trainFile=./examples/iris.train
testFile=./examples/iris.test

The four predictor variables are all specified as real valued. There are other flags that will let you use numeric variables with a few simple transforms, such as logTransform or logitTransform.
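
For reference, the mathematical transforms those two flag names suggest can be sketched in Python as below. This assumes the flags apply the standard natural log and logit to the column value; the function names are hypothetical, and you should check the ColumnDataClassifier documentation for the exact behavior.

```python
import math

def log_transform(x):
    """Natural log of a positive value."""
    return math.log(x)

def logit_transform(p):
    """Log-odds of a value strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

# The logit is symmetric around 0.5: logit(0.5) is 0 and logit(0.9) == -logit(0.1).
examples = [log_transform(1.0), logit_transform(0.5)]
```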

If you run this model:

cd $STANFORD_CLASSIFIER_HOME
java -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier -prop examples/iris2007.prop

Then you'll find that you get the test set completely right!

Built this classifier: Linear classifier with the following weights
        Iris-setosa     Iris-versicolor Iris-virginica 
3-Value -2.27            0.03            2.26          
CLASS    0.34            0.65           -1.01          
4-Value -1.07           -0.91            1.99          
2-Value  1.60           -0.13           -1.43          
1-Value  0.69            0.42           -1.23          
Total:  -0.72            0.05            0.57          
Prob:    0.15            0.32            0.54          

Output format: dataColumn1 goldAnswer classifierAnswer P(classifierAnswer)
5	Iris-setosa	Iris-setosa	0.969786023975717
4.6	Iris-setosa	Iris-setosa	0.9922589089843827
5.1	Iris-setosa	Iris-setosa	0.9622434861270637
4.9	Iris-setosa	Iris-setosa	0.9515812390773056
5.4	Iris-setosa	Iris-setosa	0.9811482146433487
4.4	Iris-setosa	Iris-setosa	0.9682526103461551
5.3	Iris-setosa	Iris-setosa	0.9832118698970074
6.1	Iris-versicolor	Iris-versicolor	0.7091015073390197
6	Iris-versicolor	Iris-versicolor	0.7601066690047942
5.5	Iris-versicolor	Iris-versicolor	0.723249991884404
6.5	Iris-versicolor	Iris-versicolor	0.7913325733592043
6.8	Iris-versicolor	Iris-versicolor	0.8416723165037595
6.2	Iris-versicolor	Iris-versicolor	0.8854234492113978
6.7	Iris-virginica	Iris-virginica	0.8440929745353494
6.4	Iris-virginica	Iris-virginica	0.7816139993113614
5.7	Iris-virginica	Iris-virginica	0.9352983975779943
6.7	Iris-virginica	Iris-virginica	0.8626420107509875
6.8	Iris-virginica	Iris-virginica	0.9442955376893006
7.7	Iris-virginica	Iris-virginica	0.8866439920995643
7.3	Iris-virginica	Iris-virginica	0.8633450387282207

20 examples in test set
Cls Iris-setosa: TP=7 FN=0 FP=0 TN=13; Acc 1.000 P 1.000 R 1.000 F1 1.000
Cls Iris-versicolor: TP=6 FN=0 FP=0 TN=14; Acc 1.000 P 1.000 R 1.000 F1 1.000
Cls Iris-virginica: TP=7 FN=0 FP=0 TN=13; Acc 1.000 P 1.000 R 1.000 F1 1.000
Micro-averaged accuracy/F1: 1.00000
Macro-averaged F1: 1.00000

This is a fairly easy, well-separated classification problem. Indeed you might think that the model is overparameterized, and it is. The number of examples in each class is roughly balanced, so there is presumably little value in the useClassFeature property, which puts in a feature that models the overall distribution of classes. You also don't need to use all the numeric features. See the plots on the Wikipedia Iris flower data set page. Instead, you can delete the features for columns 2 and 4, using just the sepal and petal lengths and not the widths, and still get 100% accuracy on our test set. (However, it happens that if you delete both the classFeature and the two width features, then the model that is built only gets 19/20 of the test set examples right....)
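
The Prob row in the weights table printed above appears to be a softmax over the Total row. The Python sketch below checks that reading; it is an assumption based on the printed numbers, which it matches only up to rounding, and it is not the classifier's own code.

```python
import math

def softmax(scores):
    """Turn per-class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
totals = [-0.72, 0.05, 0.57]   # the Total row printed above (already rounded)
probs = softmax(totals)

for name, p in zip(classes, probs):
    print(f"{name}: {p:.3f}")
```

Because the Total values are themselves rounded to two decimals, the recomputed probabilities can differ from the printed Prob row in the last digit.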

Larger Examples

20 Newsgroups text classification

Sentiment Analysis