Now let's walk through a more realistic example of using the Stanford Classifier on the well-known 20 Newsgroups dataset. There are several versions of 20 Newsgroups. We'll use Jason Rennie's "bydate" version. The precise commands shown below should work on Linux or Mac OS X systems. The Java parts should also be fine under Windows, but you'd need to do the downloading and reformatting a little differently.
First we download the corpus:
Then we unpack it:
tar -xvzf 20news-bydate.tar.gz
The 20 Newsgroups data comes in a format of one file per document, with the correct class shown by the directory name. The Stanford Classifier works with tab-delimited text files. We convert it into this latter format with a simple shell script:
curl -O http://nlp.stanford.edu/software/classifier/convert-to-stanford-classifier.csh
chmod 755 convert-to-stanford-classifier.csh
./convert-to-stanford-classifier.csh
Note that we do this by converting line endings to spaces. This loses line break information, which could easily have some value in classification. (We could have done something trickier, like converting line endings to a vertical tab or form feed, but this will do for this example.)
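To make the conversion concrete, here is a rough sketch of the same idea in plain bash. It is not the contents of convert-to-stanford-classifier.csh; `convert_dir` is a hypothetical helper, and the sketch assumes the unpacked corpus has one subdirectory per class with one file per document.

```shell
#!/bin/bash
# Sketch only: flatten a one-file-per-document corpus into "class<TAB>text" lines.
# convert_dir is a hypothetical helper, not part of the Stanford Classifier.
convert_dir() {
  local root="$1"            # e.g. 20news-bydate-train
  local class_dir class doc
  for class_dir in "$root"/*/; do
    class=$(basename "$class_dir")   # the directory name is the class label
    for doc in "$class_dir"*; do
      [ -f "$doc" ] || continue
      # Turn tabs, carriage returns, and newlines into spaces so that each
      # document becomes a single line with exactly one tab (after the label).
      # LC_ALL=C keeps tr from rejecting bytes that are not valid UTF-8.
      printf '%s\t%s\n' "$class" "$(LC_ALL=C tr '\t\r\n' '   ' < "$doc")"
    done
  done
}
```

With this helper, something like `convert_dir 20news-bydate-train > train.txt` would produce a file in the same two-column shape, where `train.txt` is just an illustrative output name.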
Check that everything worked and you have the right number of documents:
wc -l 20news-bydate*-stanford-classifier.txt
    7532 20news-bydate-test-stanford-classifier.txt
   11314 20news-bydate-train-stanford-classifier.txt
   18846 total
The correct number should be as shown.
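As an optional extra check, you could also count documents per class. This is a minimal sketch, assuming the class label is the first tab-separated column of each line; `class_counts` is a hypothetical helper name introduced here for illustration.

```shell
# class_counts is a hypothetical helper: it prints one "count label" row per class.
class_counts() {
  # The class label is the first tab-separated column of each line.
  cut -f1 "$1" | sort | uniq -c
}

# Usage, assuming the converted training file is in the current directory:
# class_counts 20news-bydate-train-stanford-classifier.txt
```

If the conversion worked, you would expect 20 rows, one per newsgroup.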
We'll assume that $STANFORD_CLASSIFIER_JAR points at the Stanford Classifier jar. So, depending on your shell, do something like:
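For instance, in a Bourne-style shell the variable could be set as sketched below. The jar path is a placeholder assumption; substitute wherever you actually unpacked the classifier.

```shell
# Bourne-style shells (bash, zsh, sh).
# /path/to/stanford-classifier.jar is a placeholder, not a real install location.
export STANFORD_CLASSIFIER_JAR=/path/to/stanford-classifier.jar

# csh/tcsh equivalent, shown as a comment:
#   setenv STANFORD_CLASSIFIER_JAR /path/to/stanford-classifier.jar
```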
This command builds pretty much the simplest classifier that you could build. It splits the input documents on whitespace and then trains a classifier on the resulting tokens:
java -mx1500m -cp $STANFORD_CLASSIFIER_JAR edu.stanford.nlp.classify.ColumnDataClassifier -trainFile 20news-bydate-train-stanford-classifier.txt -testFile 20news-bydate-test-stanford-classifier.txt -1.useSplitWords -1.splitWordsRegexp "\\s+"
Note that once the dataset is reasonably large, you have to give a fair amount of memory to the classifier. We discuss options for reducing memory usage below.
There's a fair bit of output. The last part shows the accuracy of the classifier: