Software/Classifier/Sentiment


This page gives a couple of examples of building sentiment classifiers with the Stanford Classifier.

Short sentiment snippets (the Pang/Lee Rotten Tomatoes dataset)

This example shows good/bad ("fresh"/"rotten") sentiment classification based on a collection of short review excerpts from Rotten Tomatoes collected by Bo Pang and Lillian Lee. Begin by downloading the data from http://www.cs.cornell.edu/People/pabo/movie-review-data/.

You want the file http://www.cs.cornell.edu/People/pabo/movie-review-data/rt-polaritydata.tar.gz (it's referred to on the page as "sentence polarity dataset v1.0").

Unpack it:

tar -xzf rt-polaritydata.tar.gz
mv rt-polaritydata.README.1.0.txt rt-polaritydata
cd rt-polaritydata
ls

Unfortunately, the data is in Windows CP1252, and the Stanford Classifier doesn't yet support specifying a character encoding (unlike most of the rest of our packages). So we'll convert it to UTF-8 and then prefix each line with its class label and a tab, giving one tab-separated file per class.

iconv -f cp1252 -t utf-8 < rt-polarity.neg > rt-polarity.neg.utf8
iconv -f cp1252 -t utf-8 < rt-polarity.pos > rt-polarity.pos.utf8
perl -ne 'print "neg\t" . $_' <  rt-polarity.neg.utf8 > rt-polarity.neg.utf8.tsv
perl -ne 'print "pos\t" . $_' <  rt-polarity.pos.utf8 > rt-polarity.pos.utf8.tsv
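
The labeling step turns each review into a two-column, tab-separated row: class label first, text second. As a rough sketch, the same step can be wrapped in a small helper and tried on a made-up line. The helper is written with awk rather than perl, and both the name `label_lines` and the sample sentence are illustrative assumptions, not part of the dataset or the classifier.

```shell
# Hypothetical helper: prefix every line read from stdin with a class label
# and a tab (an awk equivalent of the perl one-liners above).
label_lines() {
    awk -v label="$1" '{ print label "\t" $0 }'
}

# Made-up review text, not taken from the dataset.
printf 'a charming , funny film .\n' | label_lines pos
# prints: pos<TAB>a charming , funny film .
```

With such a helper, `label_lines neg < rt-polarity.neg.utf8 > rt-polarity.neg.utf8.tsv` should produce the same file as the first perl command.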

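Here we'll hold out the last tenth of each class (533 of its 5331 lines) as a test set and train on the first 4798 lines. This corresponds to one fold of a 10-fold cross-validation.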

head -n 4798 < rt-polarity.neg.utf8.tsv > rt-polarity.utf8.train.tsv
head -n 4798 < rt-polarity.pos.utf8.tsv >> rt-polarity.utf8.train.tsv
tail -n 533 < rt-polarity.neg.utf8.tsv > rt-polarity.utf8.test.tsv
tail -n 533 < rt-polarity.pos.utf8.tsv >> rt-polarity.utf8.test.tsv
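
The counts 4798 and 533 come from the 5331 lines per class: 5331 / 10 rounds down to 533, and 5331 - 533 = 4798. A minimal sketch of the same split with the sizes computed rather than typed in is given here; `split_tenth` is a hypothetical helper name, and the function only assumes that the input is a line-oriented text file.

```shell
# Hypothetical helper: append the first ~90% of the lines of a file to a
# training file and the last ~10% to a test file.
# Usage: split_tenth INPUT TRAIN_OUT TEST_OUT
split_tenth() {
    src=$1
    train_out=$2
    test_out=$3
    total=$(wc -l < "$src")
    n_test=$((total / 10))          # integer division rounds down
    n_train=$((total - n_test))
    head -n "$n_train" < "$src" >> "$train_out"
    tail -n "$n_test" < "$src" >> "$test_out"
}
```

Because the helper appends, it can be called once per class (first with the neg file, then with the pos file); remove any old output files before starting so that stale rows are not carried over.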

Check it's all okay:

$ wc -l *.tsv
   5331 rt-polarity.neg.utf8.tsv
   5331 rt-polarity.pos.utf8.tsv
   1066 rt-polarity.utf8.test.tsv
   9596 rt-polarity.utf8.train.tsv
  21324 total
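
Line counts alone do not show whether the labels are balanced or whether every row has exactly two columns. The following sketch adds two further checks; `label_counts` and `count_bad_rows` are hypothetical helper names, and both assume the label-tab-text layout produced above.

```shell
# Hypothetical helper: print "label<TAB>count" for each class label found in
# the first column of a tab-separated file.
label_counts() {
    cut -f 1 "$1" | sort | uniq -c | awk '{ print $2 "\t" $1 }'
}

# Hypothetical helper: print the number of rows that do not have exactly two
# tab-separated columns (0 means every row is well formed).
count_bad_rows() {
    awk -F '\t' 'NF != 2 { bad++ } END { print bad + 0 }' "$1"
}
```

For the split above, `label_counts rt-polarity.utf8.train.tsv` should report 4798 rows for each of neg and pos, and `label_counts rt-polarity.utf8.test.tsv` should report 533 for each.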

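Now classify it. The properties file below tells ColumnDataClassifier that column 0 holds the gold class, that the text in column 1 should be split on whitespace into word features, and that a Naive Bayes classifier should be used.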

$ cat > rt1.prop <<'EOF'
trainFile = rt-polarity.utf8.train.tsv
testFile = rt-polarity.utf8.test.tsv
useClassFeature = true
1.splitWordsRegexp = \\s
1.useSplitWords = true
useNB = true
goldAnswerColumn = 0
EOF
$ java -cp ~/Software/stanford-classifier-2011-12-16/stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier -prop rt1.prop
1066 examples in test set
Cls neg: TP=431 FN=102 FP=107 TN=426; Acc 0.804 P 0.801 R 0.809 F1 0.805
Cls pos: TP=426 FN=107 FP=102 TN=431; Acc 0.804 P 0.807 R 0.799 F1 0.803
Micro-averaged accuracy/F1: 0.80394
Macro-averaged F1: 0.80394
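
A second run with a different properties file, rt2.prop (not shown here), scores slightly higher:

$ java -cp ~/Software/stanford-classifier-2011-12-16/stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier -prop rt2.prop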
1066 examples in test set
Cls neg: TP=446 FN=87 FP=110 TN=423; Acc 0.815 P 0.802 R 0.837 F1 0.819
Cls pos: TP=423 FN=110 FP=87 TN=446; Acc 0.815 P 0.829 R 0.794 F1 0.811
Micro-averaged accuracy/F1: 0.81520
Macro-averaged F1: 0.81511
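
The per-class figures in the output follow directly from the confusion counts: accuracy is (TP + TN) / total, precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is the harmonic mean of precision and recall. As a worked check, the sketch below recomputes them with awk; `class_metrics` is a hypothetical helper, and its output format merely imitates the classifier's.

```shell
# Hypothetical helper: print accuracy, precision, recall and F1 for one class
# from its confusion counts, passed in the order TP FN FP TN.
class_metrics() {
    awk -v tp="$1" -v fn="$2" -v fp="$3" -v tn="$4" 'BEGIN {
        acc = (tp + tn) / (tp + fn + fp + tn)
        p   = tp / (tp + fp)
        r   = tp / (tp + fn)
        f1  = 2 * p * r / (p + r)
        printf "Acc %.3f P %.3f R %.3f F1 %.3f\n", acc, p, r, f1
    }'
}

# Counts for class neg in the first run above.
class_metrics 431 102 107 426
# prints: Acc 0.804 P 0.801 R 0.809 F1 0.805
```

The neg line of the second run can be checked the same way with `class_metrics 446 87 110 423`. In both runs the micro-averaged accuracy is the number of correct test examples over all 1066, that is 857/1066 = 0.80394 and 869/1066 = 0.81520.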