From NLPWiki

This page gives a couple of examples of building sentiment classifiers with the Stanford Classifier.

Short sentiment snippets (the Pang/Lee Rotten Tomatoes dataset)

This example shows good/bad ("fresh"/"rotten") sentiment classification based on a collection of short review excerpts from Rotten Tomatoes collected by Bo Pang and Lillian Lee. Begin by downloading the data from

You want the file rt-polaritydata.tar.gz (it's referred to on the page as "sentence polarity dataset v1.0").

Unpack it:

tar -xzf rt-polaritydata.tar.gz
mv rt-polaritydata.README.1.0.txt rt-polaritydata
cd rt-polaritydata

Unfortunately, the data is encoded in Windows CP1252, and the Stanford Classifier doesn't yet support specifying a character encoding (unlike most of the rest of our packages). So we'll convert it to UTF-8, and then prepend the class label (neg or pos) and a tab to each line to get tab-separated files.

iconv -f cp1252 -t utf-8 < rt-polarity.neg > rt-polarity.neg.utf8
iconv -f cp1252 -t utf-8 < rt-polarity.pos > rt-polarity.pos.utf8
perl -ne 'print "neg\t" . $_' < rt-polarity.neg.utf8 > rt-polarity.neg.utf8.tsv
perl -ne 'print "pos\t" . $_' < rt-polarity.pos.utf8 > rt-polarity.pos.utf8.tsv
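
Before going on, it can be worth confirming that every line of a labeled file has the shape the classifier will read: a label, a tab, and some text. The following is a minimal sketch rather than part of the original recipe. The function name check_labeled_tsv is made up for illustration, and it assumes a POSIX shell with awk.

```shell
# check_labeled_tsv FILE LABEL   (hypothetical helper, not a Stanford tool)
# Prints the number of lines in FILE that are NOT exactly
# LABEL <tab> non-empty text. 0 means every line has the expected shape.
check_labeled_tsv() {
  awk -F '\t' -v label="$2" '
    NF != 2 || $1 != label || $2 == "" { bad++ }
    END { print bad + 0 }
  ' "$1"
}

# Example use, inside rt-polaritydata after the commands above:
#   check_labeled_tsv rt-polarity.neg.utf8.tsv neg
#   check_labeled_tsv rt-polarity.pos.utf8.tsv pos
```

A line whose text itself contains a tab is also counted as bad, since it would show up as an extra column.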

Here we'll hold out the last tenth of each class (533 of its 5331 lines) as the test set for a single fold, and train on the first 4798 lines of each class.

head -n 4798 < rt-polarity.neg.utf8.tsv > rt-polarity.utf8.train.tsv
head -n 4798 < rt-polarity.pos.utf8.tsv >> rt-polarity.utf8.train.tsv
tail -n 533 < rt-polarity.neg.utf8.tsv > rt-polarity.utf8.test.tsv
tail -n 533 < rt-polarity.pos.utf8.tsv >> rt-polarity.utf8.test.tsv
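
The counts 4798 and 533 are hard-coded for files of 5331 lines. As a sketch of how the same split could be derived from the line count instead, here is a small shell function. The name split_last_tenth is hypothetical, and the code assumes a POSIX shell with head, tail and wc.

```shell
# split_last_tenth TRAIN TEST FILE...   (hypothetical helper)
# For each FILE, appends the last tenth of its lines (rounded down) to TEST
# and all the lines before that to TRAIN. TRAIN and TEST are emptied first.
# The body runs in a subshell, so its variables do not leak into the caller.
split_last_tenth() (
  train_out=$1
  test_out=$2
  shift 2
  : > "$train_out"
  : > "$test_out"
  for f in "$@"; do
    n=$(( $(wc -l < "$f") ))     # arithmetic strips any padding from wc
    n_test=$(( n / 10 ))
    n_train=$(( n - n_test ))
    head -n "$n_train" < "$f" >> "$train_out"
    tail -n "$n_test" < "$f" >> "$test_out"
  done
)

# Example use, equivalent to the four commands above:
#   split_last_tenth rt-polarity.utf8.train.tsv rt-polarity.utf8.test.tsv \
#       rt-polarity.neg.utf8.tsv rt-polarity.pos.utf8.tsv
```

For a 5331-line file this gives 533 test lines and 4798 training lines, the same numbers used above.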

Check it's all okay:

$ wc -l *.tsv
   5331 rt-polarity.neg.utf8.tsv
   5331 rt-polarity.pos.utf8.tsv
   1066 rt-polarity.utf8.test.tsv
   9596 rt-polarity.utf8.train.tsv
  21324 total

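Now train and test a classifier. The properties below tell ColumnDataClassifier to split column 1 (the review text) on whitespace to make word features, add a class prior feature, and train a Naive Bayes model, with the gold label in column 0. Since cat > rt1.prop reads the properties from the terminal, end the input with Ctrl-D before running java.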

$ cat > rt1.prop
trainFile = rt-polarity.utf8.train.tsv
testFile = rt-polarity.utf8.test.tsv
useClassFeature = true
1.splitWordsRegexp = \\s
1.useSplitWords = true
useNB = true
goldAnswerColumn = 0
$ java -cp ~/Software/stanford-classifier-2011-12-16/stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier -prop rt1.prop
1066 examples in test set
Cls neg: TP=431 FN=102 FP=107 TN=426; Acc 0.804 P 0.801 R 0.809 F1 0.805
Cls pos: TP=426 FN=107 FP=102 TN=431; Acc 0.804 P 0.807 R 0.799 F1 0.803
Micro-averaged accuracy/F1: 0.80394
Macro-averaged F1: 0.80394
$ java -cp ~/Software/stanford-classifier-2011-12-16/stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier -prop rt2.prop
1066 examples in test set
Cls neg: TP=446 FN=87 FP=110 TN=423; Acc 0.815 P 0.802 R 0.837 F1 0.819
Cls pos: TP=423 FN=110 FP=87 TN=446; Acc 0.815 P 0.829 R 0.794 F1 0.811
Micro-averaged accuracy/F1: 0.81520
Macro-averaged F1: 0.81511
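
The per-class figures in this output follow from the four confusion counts by the usual definitions: accuracy is (TP+TN)/(TP+FN+FP+TN), precision is TP/(TP+FP), recall is TP/(TP+FN), and F1 is the harmonic mean of precision and recall. As a worked check, and not a part of the classifier itself, the sketch below recomputes them with awk. The function name prf is made up for illustration.

```shell
# prf TP FN FP TN   (hypothetical helper)
# Prints accuracy, precision, recall and F1 for one class, to three decimals.
# No guard against zero denominators: it assumes TP+FP > 0 and TP+FN > 0.
prf() {
  awk -v tp="$1" -v fn="$2" -v fp="$3" -v tn="$4" 'BEGIN {
    acc = (tp + tn) / (tp + fn + fp + tn)
    p   = tp / (tp + fp)
    r   = tp / (tp + fn)
    f1  = 2 * p * r / (p + r)
    printf "Acc %.3f P %.3f R %.3f F1 %.3f\n", acc, p, r, f1
  }'
}

prf 431 102 107 426   # counts from the "Cls neg" line of the rt1.prop run
```

This prints Acc 0.804 P 0.801 R 0.809 F1 0.805, matching the "Cls neg" line of the first run. With the counts from the second run, prf 446 87 110 423 gives Acc 0.815 P 0.802 R 0.837 F1 0.819.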