Software/Classifier/Sentiment

From NLPWiki

This page gives a couple of examples of building sentiment classifiers with the Stanford Classifier.

Short sentiment snippets (the Pang/Lee Rotten Tomatoes dataset)

This example shows good/bad ("fresh"/"rotten") sentiment classification based on a collection of short review excerpts from Rotten Tomatoes collected by Bo Pang and Lillian Lee. Begin by downloading the data from http://www.cs.cornell.edu/People/pabo/movie-review-data/.

You want the file http://www.cs.cornell.edu/People/pabo/movie-review-data/rt-polaritydata.tar.gz (it's referred to on the page as "sentence polarity dataset v1.0").

Unpack it:

tar -xzf rt-polaritydata.tar.gz
mv rt-polaritydata.README.1.0.txt rt-polaritydata
cd rt-polaritydata
ls

Unfortunately, the data is encoded in Windows CP1252, and the Stanford Classifier doesn't yet support specifying a character encoding (unlike most of the rest of our packages). So we'll convert it to UTF-8 and, while we're at it, prefix each line with its class label and a tab to get tab-separated files.

iconv -f cp1252 -t utf-8 < rt-polarity.neg > rt-polarity.neg.utf8
iconv -f cp1252 -t utf-8 < rt-polarity.pos > rt-polarity.pos.utf8
perl -ne 'print "neg\t" . $_' <  rt-polarity.neg.utf8 > rt-polarity.neg.utf8.tsv
perl -ne 'print "pos\t" . $_' <  rt-polarity.pos.utf8 > rt-polarity.pos.utf8.tsv

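Here we'll hold out one tenth of the data as a test set, giving a single train/test fold: the first 4798 lines of each file go to training and the last 533 to testing.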

head -n 4798 < rt-polarity.neg.utf8.tsv > rt-polarity.utf8.train.tsv
head -n 4798 < rt-polarity.pos.utf8.tsv >> rt-polarity.utf8.train.tsv
tail -n 533 < rt-polarity.neg.utf8.tsv > rt-polarity.utf8.test.tsv
tail -n 533 < rt-polarity.pos.utf8.tsv >> rt-polarity.utf8.test.tsv
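
The line counts 4798 and 533 come from splitting each 5331-line file roughly 90/10. A minimal sketch of that arithmetic using shell integer division (the variable names are just illustrative):

```shell
total=5331                    # lines in each of the neg and pos files
test_n=$((total / 10))        # integer division gives 533
train_n=$((total - test_n))   # the remaining 4798 lines
echo "train=$train_n test=$test_n"
```

Because head takes the first lines and tail takes the last, the two sets don't overlap as long as the two counts sum to the file length.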

Check it's all okay:

$ wc -l *.tsv
   5331 rt-polarity.neg.utf8.tsv
   5331 rt-polarity.pos.utf8.tsv
   1066 rt-polarity.utf8.test.tsv
   9596 rt-polarity.utf8.train.tsv
  21324 total

Now classify it.

$ cat > rt1.prop
trainFile = rt-polarity.utf8.train.tsv
testFile = rt-polarity.utf8.test.tsv
useClassFeature = true
1.splitWordsRegexp = \\s
1.useSplitWords = true
useNB = true
goldAnswerColumn = 0
$ java -cp ~/Software/stanford-classifier-2011-12-16/stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier -prop rt1.prop
1066 examples in test set
Cls neg: TP=431 FN=102 FP=107 TN=426; Acc 0.804 P 0.801 R 0.809 F1 0.805
Cls pos: TP=426 FN=107 FP=102 TN=431; Acc 0.804 P 0.807 R 0.799 F1 0.803
Micro-averaged accuracy/F1: 0.80394
Macro-averaged F1: 0.80394
$ java -cp ~/Software/stanford-classifier-2011-12-16/stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier -prop rt2.prop
1066 examples in test set
Cls neg: TP=446 FN=87 FP=110 TN=423; Acc 0.815 P 0.802 R 0.837 F1 0.819
Cls pos: TP=423 FN=110 FP=87 TN=446; Acc 0.815 P 0.829 R 0.794 F1 0.811
Micro-averaged accuracy/F1: 0.81520
Macro-averaged F1: 0.81511
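
The per-class figures in this output follow directly from the confusion counts. As a quick check, this sketch recomputes the "Cls neg" line of the first run (rt1.prop) with awk, using the standard definitions of accuracy, precision, recall and F1; the variable names are just illustrative:

```shell
# Counts for class "neg" copied from the rt1.prop output.
line=$(awk -v tp=431 -v fn=102 -v fp=107 -v tn=426 'BEGIN {
  p   = tp / (tp + fp)                    # precision
  r   = tp / (tp + fn)                    # recall
  f1  = 2 * p * r / (p + r)               # harmonic mean of P and R
  acc = (tp + tn) / (tp + fn + fp + tn)   # fraction classified correctly
  printf "Acc %.3f P %.3f R %.3f F1 %.3f", acc, p, r, f1
}')
echo "$line"
```

With two classes, the TP/FN/FP/TN counts for "pos" are the same four numbers with the roles swapped, which is why both classes share the same accuracy.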


Short sentiment snippets (the Kaggle competition version of the Stanford Sentiment Treebank)

This example uses the same Rotten Tomatoes data, but in the form of sentiment judgments on the constituents of a parse of each example. These judgments were made initially for the Stanford Sentiment Treebank, but were also distributed as a Kaggle competition.

You first need to create a Kaggle account if you don't have one. Then you can get the data from the Kaggle movie reviews competition page.

If you download and unpack the data, you will have files train.tsv and test.tsv [really a devtest set, hopefully]. These both have a header row, which the Stanford Classifier doesn't by default know how to ignore, so you should edit the two files and delete the first row entirely.
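
If you'd rather not open an editor, one way to drop the header row is tail -n +2, which prints a file from its second line onward. A minimal sketch on a made-up file (demo.tsv, its column names and its contents are hypothetical stand-ins for train.tsv):

```shell
# Hypothetical two-line file: a header row plus one data row.
printf 'PhraseId\tSentenceId\tPhrase\tSentiment\n1\t1\ta good film\t3\n' > demo.tsv

# Keep everything from line 2 onward, i.e. drop the header.
tail -n +2 demo.tsv > demo.noheader.tsv
cat demo.noheader.tsv
```

Writing to a new file keeps the original download intact; if you do this for the real files, the trainFile and testFile properties would need to point at the header-free copies.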

The data is a TSV file with 4 columns, numbered from 0: columns 0 and 1 are the phrase ID and sentence ID, and columns 2 and 3 give the phrase and its sentiment score (from 0 through 4, ranging from negative to positive).
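
Note that the classifier counts columns from 0, while tools like cut and awk count fields from 1, so the score in column 3 is field 4 to them. A minimal sketch that tallies the scores in a made-up file (demo.cols.tsv and its rows are hypothetical, in the same layout as the headerless train.tsv):

```shell
# Hypothetical rows: phrase ID, sentence ID, phrase, sentiment score.
printf '1\t1\ta good film\t3\n2\t1\tgood\t3\n3\t2\tdull\t1\n' > demo.cols.tsv

# Count how often each score occurs; $4 is the classifier's column 3.
counts=$(awk -F'\t' '{ n[$4]++ } END { for (k in n) print k "\t" n[k] }' demo.cols.tsv | sort)
echo "$counts"
```

Running the same tally on the real training file is a quick way to see how balanced the five classes are before training.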

Since this data was already tokenized (with the Stanford Tokenizer), we can probably just use whitespace tokenization. (Unless maybe we wanted to try being clever like splitting on some hyphens.)

So, here's our first attempt at a prop file. We will do the initial experiments using 10-fold cross validation on the training data, since we don't have answers for the devtest data. This properties file gives us a Naive Bayes model:

$ cat krt1.prop
trainFile = train.tsv
crossValidationFolds = 10
testFile = test.tsv
useClassFeature = true
2.splitWordsRegexp = \\s
2.useSplitWords = true
useNB = true
goldAnswerColumn = 3
$ java -cp "*" edu.stanford.nlp.classify.ColumnDataClassifier -prop krt1.prop