From NLPWiki
Revision as of 15:19, 15 September 2010 by ChrisManning (Talk | contribs)

Stanford Classifier


20 Newsgroups

Now let's walk through a more realistic example of using the Stanford Classifier on the well-known 20 Newsgroups dataset. There are several versions of 20 Newsgroups. We'll use Jason Rennie's "bydate" version. The precise commands shown below should work on Linux or Mac OS X systems. The Java parts should also be fine under Windows, but you'd need to do the downloading and reformatting a little differently.

First we download the corpus:

curl -O

Then we unpack it:

tar -xvzf 20news-bydate.tar.gz

The 20 Newsgroups data comes in a format of one file per document, with the correct class shown by the directory name. The Stanford Classifier works with tab-delimited text files. We convert it into this latter format with a simple shell script:

 curl -O
chmod 755 convert-to-stanford-classifier.csh

Note that we do this by converting line endings to spaces. This loses line break information, which could easily have some value in classification. (We could have done something trickier, like converting line endings to a vertical tab or form feed, but this will do for this example.)
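The conversion described above can be sketched roughly as follows. This is not the downloaded script: convert_split is a hypothetical helper, and the sketch assumes the archive unpacks into 20news-bydate-train and 20news-bydate-test with one subdirectory per newsgroup, and that the classifier expects the class in the first column and the text in the second.

```shell
# Sketch only: write one "class<TAB>text" line per document.
# convert_split is a hypothetical helper, not the script downloaded above.
convert_split() {
  src_dir=$1    # directory holding one subdirectory per class
  out_file=$2   # tab-delimited file to create
  for class_dir in "$src_dir"/*/; do
    class=$(basename "$class_dir")
    for doc in "$class_dir"*; do
      printf '%s\t' "$class"
      # A tab would add a column and a line ending would end the record,
      # so map tab, CR and LF to spaces. LC_ALL=C keeps tr byte-oriented.
      LC_ALL=C tr '\t\r\n' '   ' < "$doc"
      printf '\n'
    done
  done > "$out_file"
}

# Assumed directory names; splits that are not present are skipped.
for split in train test; do
  if [ -d "20news-bydate-$split" ]; then
    convert_split "20news-bydate-$split" "20news-bydate-$split-stanford-classifier.txt"
  fi
done
```

Because every line ending becomes a space, each document ends up on exactly one output line, which is why the line counts checked next equal the document counts.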

Check that everything worked and you have the right number of documents:

wc -l 20news-bydate*-stanford-classifier.txt
   7532 20news-bydate-test-stanford-classifier.txt
  11314 20news-bydate-train-stanford-classifier.txt
  18846 total

The correct number should be as shown.

We'll assume that $STANFORD_CLASSIFIER_JAR points at the Stanford Classifier jar, so set that variable in whatever way your shell requires.
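For instance, with a Bourne-style shell the variable could be set as below. The path is only a placeholder for wherever you unpacked the classifier, not a real install location.

```shell
# bash/zsh/sh syntax. The path is a placeholder: substitute your own jar location.
export STANFORD_CLASSIFIER_JAR="$HOME/stanford-classifier/stanford-classifier.jar"

# csh/tcsh equivalent (shown as a comment because the syntax differs):
#   setenv STANFORD_CLASSIFIER_JAR ~/stanford-classifier/stanford-classifier.jar
```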


This command builds pretty much the simplest classifier that you could. It divides the input documents on white space and then trains a classifier on the resulting tokens:

 java -mx1500m -cp $STANFORD_CLASSIFIER_JAR edu.stanford.nlp.classify.ColumnDataClassifier -trainFile 20news-bydate-train-stanford-classifier.txt -testFile 20news-bydate-test-stanford-classifier.txt -1.useSplitWords -1.splitWordsRegexp "\\s+"

Note that once the dataset is reasonably large, you have to give a fair amount of memory to the classifier. We discuss options for reducing memory usage below.

There's a lot of output. The last part shows the accuracy of the classifier:

7532 examples in test set

Cls alt.atheism: TP=215 FN=104 FP=91 TN=7122; Acc 0.974 P 0.703 R 0.674 F1 0.688
Cls TP=257 FN=132 FP=211 TN=6932; Acc 0.954 P 0.549 R 0.661 F1 0.600
Cls TP=260 FN=134 FP=90 TN=7048; Acc 0.970 P 0.743 R 0.660 F1 0.699
Cls TP=264 FN=128 FP=158 TN=6982; Acc 0.962 P 0.626 R 0.673 F1 0.649
Cls comp.sys.mac.hardware: TP=278 FN=107 FP=108 TN=7039; Acc 0.971 P 0.720 R 0.722 F1 0.721
Cls TP=300 FN=95 FP=85 TN=7052; Acc 0.976 P 0.779 R 0.759 F1 0.769
Cls TP=346 FN=44 FP=114 TN=7028; Acc 0.979 P 0.752 R 0.887 F1 0.814
Cls TP=306 FN=90 FP=82 TN=7054; Acc 0.977 P 0.789 R 0.773 F1 0.781
Cls TP=358 FN=40 FP=52 TN=7082; Acc 0.988 P 0.873 R 0.899 F1 0.886
Cls TP=340 FN=57 FP=87 TN=7048; Acc 0.981 P 0.796 R 0.856 F1 0.825
Cls TP=357 FN=42 FP=32 TN=7101; Acc 0.990 P 0.918 R 0.895 F1 0.906
Cls sci.crypt: TP=328 FN=68 FP=23 TN=7113; Acc 0.988 P 0.934 R 0.828 F1 0.878
Cls sci.electronics: TP=271 FN=122 FP=133 TN=7006; Acc 0.966 P 0.671 R 0.690 F1 0.680
Cls TP=288 FN=108 FP=73 TN=7063; Acc 0.976 P 0.798 R 0.727 F1 0.761
Cls TP=328 FN=66 FP=41 TN=7097; Acc 0.986 P 0.889 R 0.832 F1 0.860
Cls soc.religion.christian: TP=354 FN=44 FP=104 TN=7030; Acc 0.980 P 0.773 R 0.889 F1 0.827
Cls talk.politics.guns: TP=310 FN=54 FP=131 TN=7037; Acc 0.975 P 0.703 R 0.852 F1 0.770
Cls talk.politics.mideast: TP=294 FN=82 FP=16 TN=7140; Acc 0.987 P 0.948 R 0.782 F1 0.857
Cls talk.politics.misc: TP=172 FN=138 FP=59 TN=7163; Acc 0.974 P 0.745 R 0.555 F1 0.636
Cls talk.religion.misc: TP=143 FN=108 FP=73 TN=7208; Acc 0.976 P 0.662 R 0.570 F1 0.612
Micro-averaged accuracy/F1: 0.7659320233669676
Macro-averaged F1: 0.7609758953314028

We see the statistics for each class and averaged over all the data. This is already quite competitive performance. Recent published papers (from 2008-10) often present a best macro-averaged F1 around 0.79. But we can do a little better.
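To see how the per-class figures relate to the raw counts, here is a small worked check using the standard definitions of accuracy, precision, recall, and F1, applied to the alt.atheism counts from the output above. The awk one-off is just an illustration and is not part of the classifier.

```shell
# Counts copied from the alt.atheism line of the classifier output.
metrics=$(awk -v tp=215 -v fn=104 -v fp=91 -v tn=7122 'BEGIN {
  acc = (tp + tn) / (tp + fn + fp + tn)  # correct decisions for this class over all test documents
  p   = tp / (tp + fp)                   # precision
  r   = tp / (tp + fn)                   # recall
  f1  = 2 * p * r / (p + r)              # harmonic mean of precision and recall
  printf "Acc %.3f P %.3f R %.3f F1 %.3f", acc, p, r, f1
}')
echo "$metrics"
# prints: Acc 0.974 P 0.703 R 0.674 F1 0.688
```

The micro-averaged figure can be checked the same way: the TP counts of the 20 classes sum to 5769, and 5769/7532 is approximately 0.7659.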

As soon as you want to start specifying a lot of options, you'll probably want a properties file to specify everything. Indeed, some options you can only successfully set with a properties file. One of the first things to address seems to be better tokenization. Tokenizing on whitespace is fairly naive. One can usually write a rough-and-ready but usable tokenizer inside ColumnDataClassifier by using the splitWordsTokenizerRegexp property. Another alternative would be to use the Stanford tokenizer to pre-tokenize the data. In general, this will work a bit better for English-language text, but is beyond what we consider here. Here's a simple properties file which you can download:

1.splitWordsTokenizerRegexp=[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.

This tokenizer recognizes tokens that start with a letter and continue with letters and ASCII digits, some number, money, and percent expressions, runs of whitespace, or, failing all of those, any single character. The whitespace tokens are then ignored.
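To get a feel for what such a pattern produces, the following is a rough ASCII-only approximation using grep. It is an assumption-laden stand-in, not the classifier's own tokenizer: [A-Za-z] replaces \p{L}, the non-ASCII alternative is dropped, whitespace is simply never matched rather than matched and then ignored, and the sample sentence is made up.

```shell
# ASCII-only approximation of the tokenizer pattern; the sample text is invented.
tokens=$(printf '%s\n' 'Price: $ 5.00 (50% off) for x86 chips' |
  grep -oE '[A-Za-z][A-Za-z0-9]*|(\$ ?)?[0-9]+(\.[0-9]{2})?%?|[^[:space:]]')
printf '%s\n' "$tokens"   # one token per line
```

On this sample the money expression "$ 5.00" and the percentage "50%" each come out as a single token, "x86" stays together, and punctuation such as ":" and "(" becomes single-character tokens.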

Just a bit of work on tokenization gains us roughly 2.5 points of F1!

Micro-averaged accuracy/F1: 0.7922198619224642
Macro-averaged F1: 0.7865248366245726
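A properties file for a run like this might look roughly like the sketch below. Only the splitWordsTokenizerRegexp line is taken from the text above; the file name 20news-sketch.prop is hypothetical, and the remaining lines are assumptions about how the command-line flags used earlier and the whitespace-ignoring behavior would be written as properties.

```shell
# Write a hypothetical properties file; the quoted 'EOF' keeps the backslashes intact.
cat > 20news-sketch.prop <<'EOF'
trainFile=20news-bydate-train-stanford-classifier.txt
testFile=20news-bydate-test-stanford-classifier.txt
1.useSplitWords=true
1.splitWordsTokenizerRegexp=[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.
1.splitWordsIgnoreRegexp=\\s+
EOF

# It would then be passed to the classifier with -prop, for example:
#   java -mx1500m -cp $STANFORD_CLASSIFIER_JAR edu.stanford.nlp.classify.ColumnDataClassifier -prop 20news-sketch.prop
```

Keeping the regular expression in a file also sidesteps shell quoting problems with characters such as $, | and \.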

You can look at the output of the tokenizer by examining the features the classifier generates. We can do this with this command:

java -mx1500m -cp $STANFORD_CLASSIFIER_JAR edu.stanford.nlp.classify.ColumnDataClassifier -prop /Users/manning/Projects/Classification-20Newsgroups/20news1.prop -printFeatures prop1

Look at the resulting (very large) file prop1.train. You might be able to get slightly better performance by fine-tuning the tokenization, but often, for text categorization, a fairly simple tokenization is sufficient, provided it is enough to recognize most semantically contentful word units and doesn't produce a huge number of rarely observed features. (E.g., for this data set, there are a few uuencoded files in newsgroup postings. Under whitespace tokenization, each line of such a file became a token that almost certainly occurred only once. Now these lines get split up on characters that aren't letters or digits. That not only reduces the token space, but some of the resulting letter strings will probably recur and become slightly useful features.)