
From NLPWiki

20 Newsgroups Text Classification Dataset

Now let's walk through a more realistic example of using the Stanford Classifier on the well-known 20 Newsgroups dataset.

Getting set up with the data

There are several versions of 20 Newsgroups. We will use Jason Rennie's "bydate" version from [1]. The precise commands shown below should work on Linux or Mac OS X systems. The Java parts should also be fine under Windows, but you'll need to do the downloading and reformatting a little differently.

First we download the corpus. The curl version shown below is good for Mac OS X or BSD; on Linux systems you can use wget with the same URL instead:

curl -O

Then we unpack it:

tar -xzf 20news-bydate.tar.gz

The 20 Newsgroups data comes in a format of one file per document, with the correct class shown by the directory name. The Stanford Classifier works with tab-delimited text files. We convert it into this latter format with a simple shell script:

curl -O
chmod 755 convert-to-stanford-classifier.csh

We do this by converting line endings to spaces. This loses line break information, which could easily have some value in classification. (We could have done something trickier, like converting line endings to a vertical tab or form feed, but this will do for this example.) As part of the conversion, we also convert the original 8-bit newsgroup posts to UTF-8. It's 2010 now.
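As a rough sketch of what the conversion does (this is our own illustrative Java, not the actual .csh script; the class and method names are invented): one tab-separated row per document, with the class taken from the directory name and line breaks collapsed to spaces.

```java
import java.io.IOException;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative sketch (not the actual conversion script): write one
// tab-separated row per document, with the class taken from the directory
// name and all line breaks collapsed to spaces.
public class ToClassifierFormat {

    // Collapse runs of whitespace (including line breaks) to single spaces.
    static String flatten(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    // One row: class <TAB> document id <TAB> flattened text.
    static String row(String cls, String docId, String text) {
        return cls + "\t" + docId + "\t" + flatten(text);
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args[0]);  // e.g. 20news-bydate-train
        PrintStream out = new PrintStream(System.out, true, "UTF-8");  // emit UTF-8
        try (DirectoryStream<Path> groups = Files.newDirectoryStream(root)) {
            for (Path group : groups) {                    // e.g. alt.atheism
                if (!Files.isDirectory(group)) continue;
                try (DirectoryStream<Path> docs = Files.newDirectoryStream(group)) {
                    for (Path doc : docs) {
                        // Read as ISO-8859-1 so arbitrary 8-bit bytes survive.
                        String text = new String(Files.readAllBytes(doc),
                                                 StandardCharsets.ISO_8859_1);
                        out.println(row(group.getFileName().toString(),
                                        doc.getFileName().toString(), text));
                    }
                }
            }
        }
    }
}
```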

Check that everything worked and you have the right number of documents (18846):

wc -l 20news-bydate*-stanford-classifier.txt
   7532 20news-bydate-test-stanford-classifier.txt
  11314 20news-bydate-train-stanford-classifier.txt
  18846 total

The correct number should be as shown.

Next we split the training data into initial training data and a development test set. (Methodological note: people often fail to do this. But if you are going to build a sequence of models trying to find the best classifier, then unless you do this, you will be gradually overfitting on the test data and will get and report falsely good results. Moreover, if you are likely to be looking at the test results to see where your model fails and how you might fix it, then it is even more vital that you use a development test set that is distinct from your final test set for these purposes. Otherwise you will hugely overfit on your final test set and report unrealistically rosy results.)

grep -P '^\S+\s[0-9]*[1-8]\s' 20news-bydate-train-stanford-classifier.txt > 20news-bydate-devtrain-stanford-classifier.txt
grep -P '^\S+\s[0-9]*[90]\s' 20news-bydate-train-stanford-classifier.txt > 20news-bydate-devtest-stanford-classifier.txt

This gives a roughly random partition of 80% of the data in devtrain and 20% in devtest based on the final digit of the newsgroup item number.
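The selection rule that the grep patterns encode can be written as a tiny predicate (a sketch; the class and method names here are our own):

```java
// Sketch of the final-digit split rule used by the grep commands above:
// items whose number ends in 1-8 go to devtrain (80%),
// items ending in 9 or 0 go to devtest (20%).
public class DevSplit {
    static boolean inDevTrain(String itemNumber) {
        char last = itemNumber.charAt(itemNumber.length() - 1);
        return last >= '1' && last <= '8';
    }
}
```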

Building classifiers

We'll assume that $STANFORD_CLASSIFIER_JAR points at the Stanford Classifier jar. So, depending on your shell, do something like the following (the path is a placeholder; point it at wherever your jar actually lives):

export STANFORD_CLASSIFIER_JAR=/path/to/stanford-classifier.jar
This next command builds pretty much the simplest classifier that you could build. It divides the input documents on whitespace and then trains a classifier on the resulting tokens. The command is normally entered as all one line without the trailing backslashes, but we've split it so it formats better on this page. (On Unix systems you can indicate that a command continues by putting a backslash at the end of a line; if you're turning this back into a single-line command, delete the backslashes at the ends of the lines.)

java -mx1800m -cp $STANFORD_CLASSIFIER_JAR edu.stanford.nlp.classify.ColumnDataClassifier \
       -trainFile 20news-bydate-devtrain-stanford-classifier.txt -testFile 20news-bydate-devtest-stanford-classifier.txt \
       -2.useSplitWords -2.splitWordsRegexp "\\s+"

Note that once the dataset is reasonably large, you have to give a fair amount of memory to the classifier. (There are some methods for reducing memory usage that we'll discuss later on.) [Also, if you have any problems with this command, it's probably an issue with symbol escaping in your shell, and so you should probably just skip it and go on to the next example that uses a properties file.]

This command generates a lot of output. The last part shows the accuracy of the classifier:

2241 examples in test set
Cls alt.atheism: TP=80 FN=20 FP=5 TN=2136; Acc 0.989 P 0.941 R 0.800 F1 0.865
Cls comp.graphics: TP=82 FN=33 FP=30 TN=2096; Acc 0.972 P 0.732 R 0.713 F1 0.722
Cls comp.os.ms-windows.misc: TP=91 FN=24 FP=20 TN=2106; Acc 0.980 P 0.820 R 0.791 F1 0.805
Cls comp.sys.ibm.pc.hardware: TP=87 FN=31 FP=32 TN=2091; Acc 0.972 P 0.731 R 0.737 F1 0.734
Cls comp.sys.mac.hardware: TP=90 FN=24 FP=23 TN=2104; Acc 0.979 P 0.796 R 0.789 F1 0.793
Cls comp.windows.x: TP=112 FN=8 FP=20 TN=2101; Acc 0.988 P 0.848 R 0.933 F1 0.889
Cls misc.forsale: TP=100 FN=17 FP=47 TN=2077; Acc 0.971 P 0.680 R 0.855 F1 0.758
Cls rec.autos: TP=95 FN=19 FP=21 TN=2106; Acc 0.982 P 0.819 R 0.833 F1 0.826
Cls rec.motorcycles: TP=98 FN=14 FP=14 TN=2115; Acc 0.988 P 0.875 R 0.875 F1 0.875
Cls rec.sport.baseball: TP=112 FN=7 FP=13 TN=2109; Acc 0.991 P 0.896 R 0.941 F1 0.918
Cls rec.sport.hockey: TP=113 FN=4 FP=4 TN=2120; Acc 0.996 P 0.966 R 0.966 F1 0.966
Cls sci.crypt: TP=108 FN=8 FP=5 TN=2120; Acc 0.994 P 0.956 R 0.931 F1 0.943
Cls sci.electronics: TP=93 FN=24 FP=24 TN=2100; Acc 0.979 P 0.795 R 0.795 F1 0.795
Cls sci.med: TP=104 FN=20 FP=8 TN=2109; Acc 0.988 P 0.929 R 0.839 F1 0.881
Cls sci.space: TP=113 FN=8 FP=9 TN=2111; Acc 0.992 P 0.926 R 0.934 F1 0.930
Cls soc.religion.christian: TP=107 FN=12 FP=22 TN=2100; Acc 0.985 P 0.829 R 0.899 F1 0.863
Cls talk.politics.guns: TP=96 FN=10 FP=5 TN=2130; Acc 0.993 P 0.950 R 0.906 F1 0.928
Cls talk.politics.mideast: TP=104 FN=7 FP=3 TN=2127; Acc 0.996 P 0.972 R 0.937 F1 0.954
Cls talk.politics.misc: TP=87 FN=12 FP=5 TN=2137; Acc 0.992 P 0.946 R 0.879 F1 0.911
Cls talk.religion.misc: TP=49 FN=18 FP=10 TN=2164; Acc 0.988 P 0.831 R 0.731 F1 0.778
Micro-averaged accuracy/F1: 0.85721
Macro-averaged F1: 0.85670

We see the statistics for each class and averaged over all the data. For each class, we see the four cells of counts in a contingency table, and then the accuracy, precision, recall, and F-measure calculated from them. This model already seems to perform quite well (85.7% accuracy!). But we can do a little better.
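If you want to double-check any line of this output, each per-class statistic is computed from the four contingency-table counts. A quick sketch (the class name here is our own):

```java
// Sketch: how the per-class numbers in the classifier's output are derived
// from the contingency-table counts TP, FN, FP, TN.
public class ClassStats {
    static double accuracy(int tp, int fn, int fp, int tn) {
        return (double) (tp + tn) / (tp + fn + fp + tn);
    }
    static double precision(int tp, int fp) {
        return (double) tp / (tp + fp);
    }
    static double recall(int tp, int fn) {
        return (double) tp / (tp + fn);
    }
    static double f1(double p, double r) {
        return 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // The alt.atheism row above: TP=80 FN=20 FP=5 TN=2136
        double p = precision(80, 5);
        double r = recall(80, 20);
        System.out.printf("Acc %.3f P %.3f R %.3f F1 %.3f%n",
                accuracy(80, 20, 5, 2136), p, r, f1(p, r));
        // prints: Acc 0.989 P 0.941 R 0.800 F1 0.865
    }
}
```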

As soon as you want to start specifying a lot of options, you'll probably want a properties file to specify everything. Indeed, some options can only be set successfully with a properties file. One of the first things to address seems to be better tokenization, since tokenizing on whitespace is fairly naive. One can usually write a rough-and-ready but usable tokenizer inside ColumnDataClassifier by using the splitWordsTokenizerRegexp property. Another alternative would be to use a tool like the Stanford tokenizer to pre-tokenize the data. In general, that will probably work a bit better for English-language text, but it is beyond what we consider here. Here's a simple properties file (20news1.prop):

2.splitWordsTokenizerRegexp=[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.

This tokenizer recognizes tokens consisting of a letter followed by letters and ASCII digits; number, money, and percent expressions; runs of whitespace; high (non-ASCII) characters; or any other single character. The whitespace tokens are then ignored.
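You can try this regexp out directly in plain Java, since it is an ordinary java.util.regex pattern (the doubled backslashes in the properties file are just properties-file escaping). A sketch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: apply the splitWordsTokenizerRegexp by hand to see what tokens
// it yields. This Java string literal denotes the same regular expression
// as the (properties-file-escaped) value above.
public class TokenizerDemo {
    static final Pattern TOKEN = Pattern.compile(
        "[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String tok = m.group();
            if (!tok.trim().isEmpty()) {   // whitespace tokens are ignored
                tokens.add(tok);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("For sale: 2 disks, $19.99 (50% off!)"));
    }
}
```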

Just a bit of work on tokenization gains us about 2%!

java -mx1800m -cp $STANFORD_CLASSIFIER_JAR edu.stanford.nlp.classify.ColumnDataClassifier -prop 20news1.prop
Micro-averaged accuracy/F1: 0.87773
Macro-averaged F1: 0.87619

You can look at the output of the tokenizer by examining the features the classifier generates. We can do this with the following command:

java -mx1800m -cp $STANFORD_CLASSIFIER_JAR edu.stanford.nlp.classify.ColumnDataClassifier -prop 20news1.prop -printFeatures prop1

Look at the resulting (very large) file prop1.train: the things after "2-SW" are the word features generated. You might be able to get a bit better performance by fine-tuning the general English tokenization, though often, for text categorization, a fairly simple tokenization is sufficient, providing (1) it's enough to recognize most semantically contentful word units, and (2) it doesn't produce a huge number of rarely observed features. In general, anything more than this is probably approaching the point of diminishing returns.

But if you dig through the data, there is one particular problem. The newsgroup posts contain a bunch of uuencoded binary files, which are actually split across multiple posts (because you were only meant to post short messages in those days). Uuencoding differs from the more current Base64 encoding in that space is used as an encoding letter. So the binary coding is divided into tokens, but results in a very large number of additional semantically junky tokens, many of which probably occur only once. About the only meaning the classifier can get from them is to learn which newsgroups tend to have binary files posted to them. But it doesn't need hundreds of thousands of features to do that. It might be better to remove the uuencoded content altogether, while leaving just enough signal to know that there was a uuencoded file in the news posting. Some processors, such as the bow tokenizer, handle recognizing and stripping uuencoded files. The Stanford Classifier doesn't. Uuencoded text really isn't so common in 2010.

But if you check up on how uuencode works, you will find that all lines of a uuencoded file except the first and the last start with a capital "M", followed by precisely 60 characters from the ASCII subset from " " to "`", and then a newline (which we've mapped to space). So, we can tokenize such lines and then choose to ignore them with the following regular expression; it is unlikely that any ordinary text lines will match this pattern (since it doesn't include the lowercase letters):

/M[ -`]{60} /

This gives us the next properties file:

# The first option is for a (full) uuencoded line, which is then ignored
2.splitWordsTokenizerRegexp=M[ -`]{60} |[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.
2.splitWordsIgnoreRegexp=M[ -`]{60} |\\s+
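As a quick sanity check on this pattern in plain Java (a sketch; the sample strings are invented):

```java
import java.util.regex.Pattern;

// Sketch: the token pattern for a full uuencoded body line, as above:
// a capital "M", then exactly 60 characters from the ASCII range
// space through backtick, then the space that the newline was mapped to.
public class UuencodeLineDemo {
    static final Pattern UU_LINE = Pattern.compile("M[ -`]{60} ");

    static boolean isUuencodedLine(String token) {
        return UU_LINE.matcher(token).matches();
    }

    public static void main(String[] args) {
        // 60 in-range characters after the "M", plus the trailing space: matches.
        System.out.println(isUuencodedLine("M" + "#".repeat(60) + " "));
        // Ordinary text starting with "M": too short and contains lowercase.
        System.out.println(isUuencodedLine("Many people think "));
    }
}
```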

With this tokenization, the classifier works a fraction better:

Micro-averaged accuracy/F1: 0.87818
Macro-averaged F1: 0.87671

But that difference may well not be significant. The main advantage is that we've gotten rid of a lot of junk features. Before we had 127994 features, which we've reduced by over 15% to 106380 features.

There are many other kinds of features that you could consider putting into the classifier which might improve performance. The length of a newsgroup posting might be informative, but it probably isn't linearly related to its class, so we bin lengths into 4 categories, which become categorical features. You have to choose the cut-offs manually, but ColumnDataClassifier can print simple statistics of how many documents of each class fall into each bin, which can help you see whether you've chosen very bad cut-offs. Here's the properties file:

2.splitWordsTokenizerRegexp=M[ -_]{60} |[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.
2.splitWordsIgnoreRegexp=M[ -_]{60} |\\s+
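The binning idea itself is simple. A sketch in Java, with invented cut-offs (in the actual run, binning is configured through ColumnDataClassifier properties, not code like this):

```java
// Sketch of turning a raw document length into one of 4 categorical
// features. The cut-offs (500, 1500, 3000 characters) are invented for
// illustration; the feature names are our own too.
public class LengthBin {
    static String lengthFeature(int length) {
        if (length < 500)  return "Len-short";
        if (length < 1500) return "Len-medium";
        if (length < 3000) return "Len-long";
        return "Len-very-long";
    }
}
```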

In this case, this doesn't really help:

Micro-averaged accuracy/F1: 0.87907
Macro-averaged F1: 0.87808

Other feature types that are often good with text documents are token prefixes and suffixes (character n-grams at the starts and ends of words) and the "shape" of a token (whether it contains uppercase, lowercase, digits, or certain kinds of symbols, treated as equivalence classes). We've also changed the output to show the highest-weight features in the model. That's often informative to look at. This gives our next properties file:

2.splitWordsTokenizerRegexp=M[ -`]{60} |[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.
2.splitWordsIgnoreRegexp=M[ -`]{60} |\\s+

This pushes performance up another percent, roughly:

Micro-averaged accuracy/F1: 0.88844
Macro-averaged F1: 0.88681

As well as fiddling with features, we can also fiddle with the machine learning and optimization. By default you get a maximum entropy (roughly, multiclass logistic regression) model with L2 regularization (a.k.a., a Gaussian prior) optimized by the L-BFGS quasi-Newton method. You might be able to get a bit of improvement by adjusting the amount of regularization, which you can do by altering the sigma parameter in the properties file; for example (0.5 here is just an illustrative value; the default is 1, and smaller values mean stronger regularization):

sigma=0.5
You can also change the type of regularization altogether. We tried a couple of settings of both of these, but nothing really seemed to beat out the defaults.

And so we return to fiddling with features. Academic papers don't spend much time discussing fiddling with features, but in practice it's usually where most of the gains come from (once you've got a basically competent machine learning method). In the last properties file, we had it print out the highest weight features. That's often useful to look at. Here are the top few:

(2-S#B-X,comp.windows.x)               1.0797
(2-SW-car,rec.autos)                   0.7447
(2-S#B-Win,comp.os.ms-windows.misc)    0.7444
(2-S#E-car,rec.autos)                  0.7243
(2-S#E-dows,comp.os.ms-windows.misc)   0.7177
(2-S#B-Mac,comp.sys.mac.hardware)      0.6749
(2-S#B-car,rec.autos)                  0.6303
(2-S#E-hics,comp.graphics)             0.6172
(2-S#E-ows,comp.os.ms-windows.misc)    0.6138
(2-S#E-ale,misc.forsale)               0.6130

They basically make sense. Note that all but one of them is a beginning or end split words n-gram feature (S#B or S#E). This partly makes sense: these features generalize over multiple actual words, so a feature for starting with "X" will match "X", "Xwindows", or "X-windows". It's part of what makes these features useful. But it also suggests that we might really be missing out by not collapsing case distinctions: S#E-ale is a good feature for misc.forsale precisely because it matches both "Sale" and "sale". So let's try tackling that. There are several possible variants. One thing to try would be to just lowercase everything. Another would be to instead put in lowercased splitWords, or both the regular splitWords features and lowercased versions of them. We tried several things. The relevant properties are lowercase, useLowercaseSplitWords, and lowercaseNGrams. The best approach seemed to be to put in the splitWords features in both regular case and lowercase, but to keep the character n-grams cased.
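In code terms, the variant that worked best amounts to emitting each split word twice when its case matters. A sketch (the SW-/LSW- feature-name prefixes are our own illustration, not the classifier's internal names):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the variant that seemed to work best: emit each split word both
// in its original case and lowercased, as two separate features. The "SW-"
// and "LSW-" prefixes are our own illustration, not the classifier's
// internal feature names.
public class CasedAndLowercased {
    static List<String> wordFeatures(List<String> tokens) {
        List<String> feats = new ArrayList<>();
        for (String tok : tokens) {
            feats.add("SW-" + tok);
            String lower = tok.toLowerCase();
            if (!lower.equals(tok)) {
                feats.add("LSW-" + lower);  // only when lowercasing changes it
            }
        }
        return feats;
    }
}
```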

Technical point: the top features list also shows that many of the features are highly collinear: you get pairs like SW-car and S#E-car, or S#E-dows and S#E-ows, which mainly match in the same documents. This is common with textual features, and we don't try to solve this problem. The best we can do is to observe that maximum entropy models are reasonably tolerant of this sort of feature overlap: the fact that the model with both lowercased and regular-case features seems to work best is indicative of this.

You might then also want to save your built classifier so you can run it on other data sets later. You can run it later directly with ColumnDataClassifier using the -loadClassifier option. Or, you can write your own program and load the classifier using a method like LinearClassifier.readClassifier(filename). You will most likely still want to use the ColumnDataClassifier code to featurize your examples in the same way. See the example code included in the distribution.

This gives us the properties file:

2.splitWordsTokenizerRegexp=M[ -`]{60} |[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.
2.splitWordsIgnoreRegexp=M[ -`]{60} |\\s+

Here are the devtest set results:

Micro-averaged accuracy/F1: 0.89157
Macro-averaged F1: 0.89058

We can continue to improve the results, but at the cost of building much larger models (note the increased memory requirement below!). Rather than just using token prefixes and suffixes, we can throw in all document n-grams up to a certain length. This might be useful in part because they will span short tokens, such as single-character tokens that are not letters or digits, and may extract useful patterns in them. Here's the properties file (20news5.prop):

2.splitWordsTokenizerRegexp=M[ -`]{60} |[\\p{L}][\\p{L}0-9]*|(?:\\$ ?)?[0-9]+(?:\\.[0-9]{2})?%?|\\s+|[\\x80-\\uFFFD]|.
2.splitWordsIgnoreRegexp=M[ -`]{60} |\\s+
java -mx10g -cp $STANFORD_CLASSIFIER_JAR edu.stanford.nlp.classify.ColumnDataClassifier -prop 20news5.prop
Micro-averaged accuracy/F1: 0.90361
Macro-averaged F1: 0.90277

This gives another percent.

If we stop our model search here, this leaves doing a run on the test set, where we train on the full training set and then test on the test set:

java -mx10g -cp $STANFORD_CLASSIFIER_JAR edu.stanford.nlp.classify.ColumnDataClassifier -prop 20news5.prop \
  -trainFile 20news-bydate-train-stanford-classifier.txt -testFile 20news-bydate-test-stanford-classifier.txt

Here are the final test set results:

7532 examples in test set
Cls alt.atheism: TP=237 FN=82 FP=68 TN=7145; Acc 0.980 P 0.777 R 0.743 F1 0.760
Cls comp.graphics: TP=288 FN=101 FP=140 TN=7003; Acc 0.968 P 0.673 R 0.740 F1 0.705
Cls comp.os.ms-windows.misc: TP=290 FN=104 FP=88 TN=7050; Acc 0.975 P 0.767 R 0.736 F1 0.751
Cls comp.sys.ibm.pc.hardware: TP=276 FN=116 FP=122 TN=7018; Acc 0.968 P 0.693 R 0.704 F1 0.699
Cls comp.sys.mac.hardware: TP=314 FN=71 FP=98 TN=7049; Acc 0.978 P 0.762 R 0.816 F1 0.788
Cls comp.windows.x: TP=321 FN=74 FP=47 TN=7090; Acc 0.984 P 0.872 R 0.813 F1 0.841
Cls misc.forsale: TP=354 FN=36 FP=80 TN=7062; Acc 0.985 P 0.816 R 0.908 F1 0.859
Cls rec.autos: TP=334 FN=62 FP=54 TN=7082; Acc 0.985 P 0.861 R 0.843 F1 0.852
Cls rec.motorcycles: TP=372 FN=26 FP=32 TN=7102; Acc 0.992 P 0.921 R 0.935 F1 0.928
Cls rec.sport.baseball: TP=353 FN=44 FP=72 TN=7063; Acc 0.985 P 0.831 R 0.889 F1 0.859
Cls rec.sport.hockey: TP=373 FN=26 FP=42 TN=7091; Acc 0.991 P 0.899 R 0.935 F1 0.916
Cls sci.crypt: TP=360 FN=36 FP=26 TN=7110; Acc 0.992 P 0.933 R 0.909 F1 0.921
Cls sci.electronics: TP=272 FN=121 FP=105 TN=7034; Acc 0.970 P 0.721 R 0.692 F1 0.706
Cls sci.med: TP=314 FN=82 FP=54 TN=7082; Acc 0.982 P 0.853 R 0.793 F1 0.822
Cls sci.space: TP=347 FN=47 FP=36 TN=7102; Acc 0.989 P 0.906 R 0.881 F1 0.893
Cls soc.religion.christian: TP=376 FN=22 FP=68 TN=7066; Acc 0.988 P 0.847 R 0.945 F1 0.893
Cls talk.politics.guns: TP=317 FN=47 FP=126 TN=7042; Acc 0.977 P 0.716 R 0.871 F1 0.786
Cls talk.politics.mideast: TP=316 FN=60 FP=10 TN=7146; Acc 0.991 P 0.969 R 0.840 F1 0.900
Cls talk.politics.misc: TP=185 FN=125 FP=45 TN=7177; Acc 0.977 P 0.804 R 0.597 F1 0.685
Cls talk.religion.misc: TP=157 FN=94 FP=63 TN=7218; Acc 0.979 P 0.714 R 0.625 F1 0.667
Micro-averaged accuracy/F1: 0.81731
Macro-averaged F1: 0.81158

You'll notice that these results are quite a bit lower. Somewhat lower results are always to be expected (after all, we were overfitting on the devtest set by doing multiple runs), but the results here are a lot lower. This is probably because in the bydate version of 20 Newsgroups, the test set is all from a later time period than the training set, whereas, when we subdivided the training set, we took a roughly uniform sample across it as the devtest set. Topics under discussion shift over time, so documents close in time resemble each other more than documents from different periods. In most time-series text data sets, this produces a substantial performance difference like this one when you compare a randomly selected test set with a test set drawn from a separate, later time period.

Nevertheless, these results are pretty good! Below are results we could find in the academic literature for the same data set. The results reported by Rennie look surprisingly high in comparison to the results of more recent papers. If we discount those results as suspect (they minimally involve post hoc parameter selection), it seems like we're close to the state of the art with this large model.

Technical point: it can be subtle to make sure that you're doing an apples-to-apples comparison. As well as there being several versions of the 20 Newsgroups data set in general, even when you use the recommended 20news-bydate version, there can be differences. The full bydate set has 18846 documents, and that is what is used here. However, various of the commonly used preprocessing scripts for the data set delete documents that are empty or very short after various preprocessing steps, giving smaller document sets. Some of these changes are discussed on the 20 Newsgroups page. But omitting documents like this removes the documents that are hardest to classify correctly, and so tends to boost results. For instance, the commonly used Matlab version of the data (with 18824 documents) has only 7505 documents in the test set, while the original version has 7532. So 27 hard-to-classify documents are omitted. This change likely boosts the results of all systems evaluated on the Matlab version by about 0.3%, a non-trivial difference.

Paper / Model / Micro-averaged accuracy / Notes:

- Lan, M., Tan, C.-L., and Low, H.-B. (2006), Proposing a New Term Weighting Scheme for Text Categorization. SVM: 0.808.
- Larochelle, H. and Bengio, Y. (2008), Classification using Discriminative Restricted Boltzmann Machines. Hybrid discriminative RBM: 0.762. (Only the 5000 most frequent tokens were used as features.)
- Li, B. and Vogel, C. (2010), Improving Multiclass Text Classification with Error-Correcting Output Coding and Sub-class Partitions. ECOC Naive Bayes: 0.818.
- Rennie, J. D. M. (2003), On the Value of Leave-One-Out Cross-Validation Bounds. Regularized least squares classifier: 0.8486. (Optimal regularization chosen post hoc on the test set.)