Software/Classifier/Sentiment
This page gives a couple of examples of building sentiment classifiers with the Stanford Classifier.
Short sentiment snippets (the Pang/Lee Rotten Tomatoes dataset)
This example shows good/bad ("fresh"/"rotten") sentiment classification based on a collection of short review excerpts from Rotten Tomatoes collected by Bo Pang and Lillian Lee. Begin by downloading the data from http://www.cs.cornell.edu/People/pabo/movie-review-data/.
You want the file http://www.cs.cornell.edu/People/pabo/movie-review-data/rt-polaritydata.tar.gz (it's referred to on the page as "sentence polarity dataset v1.0").
Unpack it:
tar -xzf rt-polaritydata.tar.gz
mv rt-polaritydata.README.1.0.txt rt-polaritydata
cd rt-polaritydata
ls
Unfortunately, the data is encoded in Windows CP1252, and the Stanford Classifier doesn't yet support specifying a character encoding (unlike most of the rest of our packages). So we'll convert it to UTF-8:
iconv -f cp1252 -t utf-8 < rt-polarity.neg > rt-polarity.neg.utf8
iconv -f cp1252 -t utf-8 < rt-polarity.pos > rt-polarity.pos.utf8
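Next, prepend the class label and a tab to every line, so that each file becomes a two-column TSV with the label in column 0:
perl -ne 'print "neg\t" . $_' < rt-polarity.neg.utf8 > rt-polarity.neg.utf8.tsv
perl -ne 'print "pos\t" . $_' < rt-polarity.pos.utf8 > rt-polarity.pos.utf8.tsv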
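Before going further, it can be worth checking that every row really has the shape label, tab, text. The following is only a sketch: count_bad_rows is a hypothetical helper name, and it assumes the review text itself never contains a tab character.

```shell
# Hypothetical helper: count rows that are NOT "<label>\t<text>"
# with the label being exactly "neg" or "pos".
# Assumption: the review text contains no tab characters.
count_bad_rows() {
  awk -F'\t' '
    NF != 2 || ($1 != "neg" && $1 != "pos") { bad++ }
    END { print bad + 0 }   # "+ 0" prints 0 rather than an empty line
  ' "$@"
}

# Usage (in the rt-polaritydata directory):
#   count_bad_rows rt-polarity.neg.utf8.tsv rt-polarity.pos.utf8.tsv
```

A result of 0 means every row parsed as two columns with a known label.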
Here we'll hold out the last tenth of each class as a test set, which amounts to testing one fold of a 10-fold split.
head -n 4798 < rt-polarity.neg.utf8.tsv > rt-polarity.utf8.train.tsv
head -n 4798 < rt-polarity.pos.utf8.tsv >> rt-polarity.utf8.train.tsv
tail -n 533 < rt-polarity.neg.utf8.tsv > rt-polarity.utf8.test.tsv
tail -n 533 < rt-polarity.pos.utf8.tsv >> rt-polarity.utf8.test.tsv
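The numbers 4798 and 533 come from the 5331 lines in each class file: 5331 / 10 rounds down to 533 test lines, leaving 4798 for training. If you would rather not hard-code them, the same split can be sketched as below. split_tenth is a hypothetical helper name, and note that it appends to its output files, so remove any old train and test files first.

```shell
# Hypothetical helper: append the first 90% of a file to a training file
# and the last 10% to a test file. Sizes are derived from the line count.
split_tenth() {
  in_file=$1
  train_out=$2
  test_out=$3
  total=$(wc -l < "$in_file")
  test_n=$((total / 10))        # integer division: 5331 / 10 = 533
  train_n=$((total - test_n))   # 5331 - 533 = 4798
  head -n "$train_n" "$in_file" >> "$train_out"
  tail -n "$test_n" "$in_file" >> "$test_out"
}

# Usage:
#   rm -f rt-polarity.utf8.train.tsv rt-polarity.utf8.test.tsv
#   split_tenth rt-polarity.neg.utf8.tsv rt-polarity.utf8.train.tsv rt-polarity.utf8.test.tsv
#   split_tenth rt-polarity.pos.utf8.tsv rt-polarity.utf8.train.tsv rt-polarity.utf8.test.tsv
```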
Check it's all okay:
$ wc -l *.tsv
    5331 rt-polarity.neg.utf8.tsv
    5331 rt-polarity.pos.utf8.tsv
    1066 rt-polarity.utf8.test.tsv
    9596 rt-polarity.utf8.train.tsv
   21324 total
Now classify it.
$ cat > rt1.prop
trainFile = rt-polarity.utf8.train.tsv
testFile = rt-polarity.utf8.test.tsv
useClassFeature = true
1.splitWordsRegexp = \\s
1.useSplitWords = true
useNB = true
goldAnswerColumn = 0
java -cp ~/Software/stanford-classifier-2011-12-16/stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier -prop rt1.prop
1066 examples in test set
Cls neg: TP=431 FN=102 FP=107 TN=426; Acc 0.804 P 0.801 R 0.809 F1 0.805
Cls pos: TP=426 FN=107 FP=102 TN=431; Acc 0.804 P 0.807 R 0.799 F1 0.803
Micro-averaged accuracy/F1: 0.80394
Macro-averaged F1: 0.80394
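The per-class figures in this output can be rechecked from the confusion counts. The sketch below assumes the usual definitions (P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R), Acc = (TP+TN)/all); prf is a hypothetical helper name and is not part of the Stanford Classifier.

```shell
# Hypothetical helper: recompute accuracy, precision, recall and F1
# from confusion counts given in the order TP FN FP TN.
# It does not guard against zero denominators (e.g. TP + FP = 0).
prf() {
  awk -v tp="$1" -v fn="$2" -v fp="$3" -v tn="$4" 'BEGIN {
    p   = tp / (tp + fp)
    r   = tp / (tp + fn)
    f1  = 2 * p * r / (p + r)
    acc = (tp + tn) / (tp + fn + fp + tn)
    printf "Acc %.3f P %.3f R %.3f F1 %.3f\n", acc, p, r, f1
  }'
}

# The neg row of the first run:
prf 431 102 107 426
# prints: Acc 0.804 P 0.801 R 0.809 F1 0.805
```

With only two classes, one class's false positives are the other's false negatives, which is why the neg and pos rows mirror each other.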
java -cp ~/Software/stanford-classifier-2011-12-16/stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier -prop rt2.prop
1066 examples in test set
Cls neg: TP=446 FN=87 FP=110 TN=423; Acc 0.815 P 0.802 R 0.837 F1 0.819
Cls pos: TP=423 FN=110 FP=87 TN=446; Acc 0.815 P 0.829 R 0.794 F1 0.811
Micro-averaged accuracy/F1: 0.81520
Macro-averaged F1: 0.81511
Short sentiment snippets (the Kaggle competition version of the Stanford Sentiment Treebank)
This example uses the same Rotten Tomatoes data, but in the form of sentiment judgments on the constituents of a parse of each example. These judgments were originally made for the Stanford Sentiment Dataset and are also distributed as a Kaggle competition.
You first need to create a Kaggle account, if you don't already have one, at [1]. Then you can get the data from: Kaggle movie reviews data.
If you download and unpack the data, you will have files train.tsv and test.tsv [really a devtest set, hopefully]. These both have a header row, which the Stanford Classifier doesn't by default know how to ignore, so you should edit the two files and delete the first row entirely.
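If you'd rather not open the files in an editor, the header row can be dropped from the command line. This is a minimal sketch: drop_first_line is a hypothetical helper name, and it rewrites the file in place, so run it exactly once per file (a second run would delete a real data row).

```shell
# Hypothetical helper: remove the first line of a file in place.
# Writes via a temporary file so the original is only overwritten
# if tail succeeded.
drop_first_line() {
  tmp=$(mktemp) || return 1
  tail -n +2 "$1" > "$tmp" && cat "$tmp" > "$1"
  status=$?
  rm -f "$tmp"
  return "$status"
}

# Usage (once only):
#   drop_first_line train.tsv
#   drop_first_line test.tsv
```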
The data is a tsv file with 4 columns: columns 0 and 1 are phrase and sentence ID and then columns 2 and 3 give a phrase and its sentiment score (from 0 through 4, ranging from negative to positive).
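To get a feel for how the five sentiment scores are distributed, you can count the values in column 3. The sketch below assumes the header row has already been removed; label_counts is a hypothetical helper name. Note that awk numbers fields from 1, so the page's column 3 is awk's $4.

```shell
# Hypothetical helper: print each sentiment score and how often it occurs,
# as "score<TAB>count", sorted by score.
# Column 3 in the 0-based numbering used on this page is $4 in awk.
label_counts() {
  awk -F'\t' '{ n[$4]++ } END { for (k in n) print k "\t" n[k] }' "$1" | sort -n
}

# Usage:
#   label_counts train.tsv
```

A heavily skewed distribution would be worth knowing about before reading too much into the accuracy figure.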
Since this data was already tokenized (with the Stanford Tokenizer), we can probably just use whitespace tokenization. (Unless maybe we wanted to try being clever like splitting on some hyphens.)
So, here's our first attempt at a prop file. We will do the initial experiments using 10-fold cross validation on the training data, since we don't have answers for the devtest data. This properties file gives us a Naive Bayes model:
$ cat krt1.prop
trainFile = train.tsv
crossValidationFolds = 10
testFile = test.tsv
useClassFeature = true
2.splitWordsRegexp = \\s
2.useSplitWords = true
useNB = true
goldAnswerColumn = 3
$ java -cp "*" edu.stanford.nlp.classify.ColumnDataClassifier -prop krt1.prop