The Stanford Classifier

The Stanford Classifier is a general-purpose classifier: something that takes input data and assigns it to one of a number of categories. It can work with (scaled) real-valued and categorical inputs, offers several machine learning algorithms, and supports several forms of regularization, which are generally needed when building models with large numbers of predictive features.

You can use it on anything, including standard statistics and machine learning data sets. But for small data sets and numeric predictors, you'd generally be better off using another tool such as Weka (http://www.cs.waikato.ac.nz/ml/weka/) or R (http://www.r-project.org/). Where the Stanford Classifier shines is in working with mainly textual data, where it has powerful and flexible means of generating features from character strings. But if you've also got a few numeric variables, you can throw them in at the same time.

Examples

20 Newsgroups

Now let's walk through a more realistic example of using the Stanford Classifier on the well-known 20 Newsgroups dataset. There are several versions of 20 Newsgroups. We'll use Jason Rennie's "bydate" version from http://people.csail.mit.edu/jrennie/20Newsgroups/. The precise commands shown below should work on Linux or Mac OS X systems. The Java parts should also be fine under Windows, but you'll need to do the downloading and reformatting a little differently.

First we download the corpus:

curl -O http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz

Then we unpack it:

tar -xzf 20news-bydate.tar.gz

The 20 Newsgroups data comes in a format of one file per document, with the correct class shown by the directory name. The Stanford Classifier works with tab-delimited text files. We convert it into this latter format with a simple shell script:

curl -O http://nlp.stanford.edu/software/classifier/convert-to-stanford-classifier.csh
chmod 755 convert-to-stanford-classifier.csh
./convert-to-stanford-classifier.csh

We do this by converting line endings to spaces. This loses line break information which could easily have some value in classification. (We could have done something trickier, like converting line endings to a vertical tab or form feed, but this will do for this example.) As part of the conversion, we also convert the original 8-bit newsgroup posts to UTF-8. It's 2010 now.
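
The resulting files have one document per line: the newsgroup name (taken from the directory the file came from) in the first column, a tab, and then the whole post with its line breaks turned into spaces. The text of the line below is made up, but it shows the shape (<TAB> stands for a real tab character). The 1. prefix on the feature options used later (written -1.useSplitWords on the command line) picks out this second column, counting from 0; the correct class sits in column 0.

rec.autos<TAB>From: someone@example.com Subject: Re: V8 engine swap I'm looking for advice on ...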

Check that everything worked and you have the right number of documents:

wc -l 20news-bydate*-stanford-classifier.txt
   7532 20news-bydate-test-stanford-classifier.txt
  11314 20news-bydate-train-stanford-classifier.txt
  18846 total

The correct number should be as shown.

We'll assume that $STANFORD_CLASSIFIER_JAR points at the Stanford Classifier jar. So, depending on your shell, do something like:

 STANFORD_CLASSIFIER_JAR=/Users/manning/Software/stanford-classifier-2008-04-18/stanford-classifier.jar

This next command builds pretty much the simplest classifier that you could build. It divides the input documents on white space and then trains a classifier on the resulting tokens. The command is normally entered as all one line without the trailing backslashes, but we've split it so it formats better on this page.

java -mx1800m -cp $STANFORD_CLASSIFIER_JAR edu.stanford.nlp.classify.ColumnDataClassifier \
-trainFile 20news-bydate-train-stanford-classifier.txt -testFile 20news-bydate-test-stanford-classifier.txt \
-1.useSplitWords -1.splitWordsRegexp "\\s+"

Note that once the dataset is reasonably large, you have to give a fair amount of memory to the classifier. (There are some methods for reducing memory usage that we'll discuss later.)

This command generates a lot of output. The last part shows the accuracy of the classifier:

7532 examples in test set
Cls alt.atheism: TP=215 FN=104 FP=91 TN=7122; Acc 0.974 P 0.703 R 0.674 F1 0.688
Cls comp.graphics: TP=257 FN=132 FP=211 TN=6932; Acc 0.954 P 0.549 R 0.661 F1 0.600
Cls comp.os.ms-windows.misc: TP=260 FN=134 FP=90 TN=7048; Acc 0.970 P 0.743 R 0.660 F1 0.699
Cls comp.sys.ibm.pc.hardware: TP=264 FN=128 FP=158 TN=6982; Acc 0.962 P 0.626 R 0.673 F1 0.649
Cls comp.sys.mac.hardware: TP=278 FN=107 FP=108 TN=7039; Acc 0.971 P 0.720 R 0.722 F1 0.721
Cls comp.windows.x: TP=300 FN=95 FP=85 TN=7052; Acc 0.976 P 0.779 R 0.759 F1 0.769
Cls misc.forsale: TP=346 FN=44 FP=114 TN=7028; Acc 0.979 P 0.752 R 0.887 F1 0.814
Cls rec.autos: TP=306 FN=90 FP=82 TN=7054; Acc 0.977 P 0.789 R 0.773 F1 0.781
Cls rec.motorcycles: TP=358 FN=40 FP=52 TN=7082; Acc 0.988 P 0.873 R 0.899 F1 0.886
Cls rec.sport.baseball: TP=340 FN=57 FP=87 TN=7048; Acc 0.981 P 0.796 R 0.856 F1 0.825
Cls rec.sport.hockey: TP=357 FN=42 FP=32 TN=7101; Acc 0.990 P 0.918 R 0.895 F1 0.906
Cls sci.crypt: TP=328 FN=68 FP=23 TN=7113; Acc 0.988 P 0.934 R 0.828 F1 0.878
Cls sci.electronics: TP=271 FN=122 FP=133 TN=7006; Acc 0.966 P 0.671 R 0.690 F1 0.680
Cls sci.med: TP=288 FN=108 FP=73 TN=7063; Acc 0.976 P 0.798 R 0.727 F1 0.761
Cls sci.space: TP=328 FN=66 FP=41 TN=7097; Acc 0.986 P 0.889 R 0.832 F1 0.860
Cls soc.religion.christian: TP=354 FN=44 FP=104 TN=7030; Acc 0.980 P 0.773 R 0.889 F1 0.827
Cls talk.politics.guns: TP=310 FN=54 FP=131 TN=7037; Acc 0.975 P 0.703 R 0.852 F1 0.770
Cls talk.politics.mideast: TP=294 FN=82 FP=16 TN=7140; Acc 0.987 P 0.948 R 0.782 F1 0.857
Cls talk.politics.misc: TP=172 FN=138 FP=59 TN=7163; Acc 0.974 P 0.745 R 0.555 F1 0.636
Cls talk.religion.misc: TP=143 FN=108 FP=73 TN=7208; Acc 0.976 P 0.662 R 0.570 F1 0.612
Micro-averaged accuracy/F1: 0.76593 
Macro-averaged F1: 0.76098 

We see the statistics for each class and averaged over all the data. For each class, we see the four cells of counts in a contingency table, and then the accuracy and precision, recall and F-measure calculated for them. This model is already quite competitive in performance. Recent published academic papers present a best micro-averaged F1 around 0.81. But we can do a little better.
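
For example, taking the alt.atheism line above, the per-class figures come straight from those four counts:

Precision = TP / (TP + FP) = 215 / (215 + 91)  = 0.703
Recall    = TP / (TP + FN) = 215 / (215 + 104) = 0.674
F1        = 2PR / (P + R)                      = 0.688
Accuracy  = (TP + TN) / 7532 = (215 + 7122) / 7532 = 0.974

The micro-averaged figure pools the counts over all 20 classes; since each test document has exactly one correct class, micro-averaged precision, recall, and F1 all reduce to plain accuracy, which is why the output labels that line "accuracy/F1". The macro-averaged figure is just the average of the 20 per-class F1 scores (you can check: the 20 F1 values above average to 0.761).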

As soon as you want to start specifying a lot of options, you'll probably want a properties file to specify everything. Indeed, some options you can only successfully set with a properties file. One of the first things to address seems to be better tokenization. Tokenizing on whitespace is fairly naive. One can usually write a rough-and-ready but usable tokenizer inside ColumnDataClassifier by using the splitWordsTokenizerRegexp property. Another alternative would be to use the Stanford tokenizer to pre-tokenize the data. In general, this will work a bit better for English-language text, but is beyond what we consider here. Here's a simple properties file, which you can download from http://nlp.stanford.edu/software/classifier/20news1.prop:

trainFile=20news-bydate-train-stanford-classifier.txt
testFile=20news-bydate-test-stanford-classifier.txt
1.useSplitWords=true
1.splitWordsTokenizerRegexp=[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.
1.splitWordsIgnoreRegexp=\\s+

This tokenizer recognizes tokens consisting of a letter followed by further letters and ASCII digits, certain number, money, and percent expressions, runs of whitespace, or any other single character. The whitespace tokens are then ignored.
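
If you want to get a feel for what this pattern does before running the classifier, you can exercise it with plain java.util.regex. This is just a sketch (the input string is made up); the classifier applies the same pattern repeatedly to the text column and then drops any token matching splitWordsIgnoreRegexp.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenizeDemo {
  public static void main(String[] args) {
    // The same pattern as 1.splitWordsTokenizerRegexp above, as a Java string literal
    Pattern p = Pattern.compile(
        "[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.");
    Matcher m = p.matcher("I'm selling my 2 Macs for $250.00 each!");
    while (m.find()) {
      String tok = m.group();
      if (!tok.matches("\\s+")) {   // mimic 1.splitWordsIgnoreRegexp=\s+
        System.out.print("[" + tok + "] ");
      }
    }
    System.out.println();
  }
}

This prints something like [I] ['] [m] [selling] [my] [2] [Macs] [for] [$250.00] [each] [!].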

Just a bit of work on tokenization gives us almost 3%!

java -mx1800m -cp $STANFORD_CLASSIFIER_JAR edu.stanford.nlp.classify.ColumnDataClassifier -prop 20news1.prop
...
Micro-averaged accuracy/F1: 0.79501
Macro-averaged F1: 0.78963 

You can look at the output of the tokenizer by examining the features the classifier generates. We can do this with this command:

java -mx1800m -cp $STANFORD_CLASSIFIER_JAR edu.stanford.nlp.classify.ColumnDataClassifier -prop 20news1.prop -printFeatures prop1

Look at the resulting (very large) file prop1.train. You might be able to get a bit better performance by fine-tuning the tokenization, but, often, for text categorization, a fairly simple tokenization is sufficient, provided it's enough to recognize most semantically contentful word units and doesn't produce a huge number of rarely observed features. (E.g., for this data set, there are a few uuencoded files in newsgroup postings. Under whitespace tokenization, each line of such a file became a token that almost certainly only occurred once. Now they'll get split up on characters that aren't letters and digits. That not only reduces the token space, but probably some of the letter strings that do result will recur and become slightly useful features. Though quite likely it would be better to remove the uuencoded content altogether. Some much more developed processors, such as the bow tokenizer (http://www.cs.cmu.edu/~mccallum/bow/), do this. But we don't. uuencoded text isn't so common in 2010.)

There are many other kinds of features that you could consider putting into the classifier which might improve performance. The length of a newsgroup posting might be informative, but it probably isn't linearly related to its class, so we bin lengths into 4 categories, which become categorical features. You have to choose those cut-offs manually, but ColumnDataClassifier can print simple statistics of how many documents of each class fall in each bin, which can help you see if you've chosen very bad cut-offs. Here's the properties file: http://nlp.stanford.edu/software/classifier/20news2.prop

trainFile=20news-bydate-train-stanford-classifier.txt
testFile=20news-bydate-test-stanford-classifier.txt
1.useSplitWords=true
1.splitWordsTokenizerRegexp=[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.
1.splitWordsIgnoreRegexp=\\s+
1.binnedLengths=500,2500,12500
1.binnedLengthsStatistics=true

In this case, that doesn't help:

Micro-averaged accuracy/F1: 0.79408
Macro-averaged F1: 0.78853 

Other good feature ideas might be: to use token prefixes and suffixes, and to use the "shape" of a token (whether it contains upper or lowercase letters, digits, or certain kinds of symbols, as equivalence classes). We also turn off the printing of the documents in the output so that the output is not quite so voluminous. This gives our next properties file: http://nlp.stanford.edu/software/classifier/20news3.prop

trainFile=20news-bydate-train-stanford-classifier.txt
testFile=20news-bydate-test-stanford-classifier.txt
displayedColumn=-1
1.useSplitWords=true
1.splitWordsTokenizerRegexp=[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.
1.splitWordsIgnoreRegexp=\\s+
1.useSplitPrefixSuffixNGrams=true
1.maxNGramLeng=4
1.minNGramLeng=1
1.splitWordShape=chris4
printClassifier=HighWeight
printClassifierParam=400

This pushes performance up a tiny bit further:

Micro-averaged accuracy/F1: 0.80324
Macro-averaged F1: 0.79777

As well as fiddling with features, we can also fiddle with the machine learning and optimization. By default you get a maximum entropy (roughly, multiclass logistic regression) model with L2 regularization (a.k.a., a gaussian prior) optimized by the L-BFGS quasi-Newton method. You might be able to get a bit of improvement by adjusting the amount of regularization, which you can do by altering the sigma parameter:

sigma=3

You can also change the type of regularization altogether. Lately, L1 regularization has been popular for producing well-performing compact models. You can select it by specifying the L1 regularization parameter:

l1reg=0.1

We tried a couple of settings of both of these, but nothing really seemed to beat out the defaults.

And so we return to fiddling with features. Academic papers don't spend much time discussing fiddling with features, but in practice it's usually where most of the gains come from (once you've got a basically competent method). In the last properties file, we had it print out the highest-weight features. That's often useful to look at. Here are the top few (each line shows a (feature, class) pair and its weight):

(1-S#B-X,comp.windows.x)                   1.3971
(1-S#B-Win,comp.os.ms-windows.misc)        0.9708
(1-SW-car,rec.autos)                       0.9538
(1-S#E-car,rec.autos)                      0.9178
(1-S#B-Mac,comp.sys.mac.hardware)          0.9054
(1-S#E-dows,comp.os.ms-windows.misc)       0.9020
(1-S#B-x,comp.windows.x)                   0.8157
(1-S#B-car,rec.autos)                      0.7689
(1-S#E-ows,comp.os.ms-windows.misc)        0.7382
(1-S#E-ale,misc.forsale)                   0.7361

They basically make sense. Note that all but one of them is a beginning or end split-words n-gram feature (S#B or S#E). This partly makes sense: these features generalize over multiple actual words, so starting with "X" will match "X", "Xwindows", or "X-windows". It's part of what makes these features useful. But it also suggests that we might really be missing out by not collapsing case distinctions: S#E-ale is a good feature precisely because it matches both "Sale" and "sale". So let's try tackling that. One thing to try would be to just lowercase everything. Another would be to put in both the regular splitWords features AND lowercased versions of them. We tried both.

As a more technical point, the top features list also shows that many of the features are highly collinear: you get pairs like SW-car and S#E-car or S#E-dows and S#E-ows which mainly match in the same documents. This is common with textual features, and we don't try to solve this problem. The best we can do is to observe that maximum entropy models are reasonably tolerant of this sort of feature overlap.


You might also want to save your built classifier so you can run it on other data sets later. (You can do this either directly with ColumnDataClassifier or, in your own program, you'll want to load the classifier using a method like LinearClassifier.readClassifier(filename).) This gives us our final properties file: http://nlp.stanford.edu/software/classifier/20news4.prop
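
For programmatic use, a minimal sketch might look like the following. It assumes you've already serialized the trained model to a file (check which serialization property your ColumnDataClassifier version supports, e.g. something like serializeTo), and it follows the readClassifier method name mentioned above; the exact class and method signatures may differ a little between releases, and the file name and feature strings here are purely illustrative.

import java.util.Arrays;
import edu.stanford.nlp.classify.LinearClassifier;
import edu.stanford.nlp.ling.BasicDatum;
import edu.stanford.nlp.ling.Datum;

public class LoadAndClassify {
  public static void main(String[] args) throws Exception {
    // Load a previously serialized classifier (illustrative file name)
    LinearClassifier<String, String> lc =
        LinearClassifier.readClassifier("20news.classifier.ser.gz");
    // Make a datum with the same kind of string features the training run generated
    // (feature names like "1-SW-car" are the split-word features shown above;
    // the others here are made up for illustration)
    Datum<String, String> d = new BasicDatum<String, String>(
        Arrays.asList("1-SW-car", "1-SW-engine", "1-SW-mileage"));
    System.out.println(lc.classOf(d));   // prints the predicted newsgroup, e.g. rec.autos
  }
}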

Other results on 20 Newsgroups

Rennie ICML 2003 accuracy: Multinomial NB 0.848, TWCNB 0.861, and SVM 0.862 on average of 80/20 splits
Lan, Tan, and Low AAAI 2006 accuracy: SVM 0.81, kNN 0.69
Gu and Zhou SIAM Datamining 2009 accuracy: