Sentiment Extraction and Classification of Movie Reviews
Final Project Write-Up
By Kavi Goel, Anthony Hui
CS 224N, Spring 2004
1. Past Literature
Several past attempts have been made to develop systems that can automatically classify movie reviews as “Thumbs Up” or “Thumbs Down.” Turney (2002) describes a system that classifies reviews of several different types of products, including movies. The system identifies two-word sequences containing particular combinations of nouns, adjectives, and adverbs and calculates their semantic orientation by measuring how often each phrase co-occurs with the words “excellent” and “poor” in WWW-based documents. While the technique was applied more successfully in some of the other problem domains (such as car reviews), performance was lowest with movie reviews (movie review classification was correct 66% of the time, vs. 84% for automobile reviews).
Pang et al. (2002) applied several standard machine learning techniques to the movie review problem. The best accuracy, in the range of 81-83% across the machine learning techniques applied, was achieved using a bag-of-words feature set. Performance was best when all words were considered and no structural elements of sentences were used in the feature set.
Both papers mention that a possible problem is “thwarted expectations,” situations in which the reviewer deliberately contrasts her overall opinions with evidence that opposes this opinion, as in the sentence (taken from Pang et al): “This film should be brilliant. It sounds like a great plot, the actors are first grade, and the supporting cast is good as well, and Stallone is attempting to deliver a good performance. However, it can't hold up.” Such contrasts are very common in the movie review domain.
To our knowledge, no attempts have been published in which sophisticated sentence analyses (structural or semantic) have been applied to the movie review classification problem.
2. Algorithm Overview
For our project, we decided to explore a different method of classifying movie reviews. Our hypothesis was that movie reviews tend to have certain key sentences that express the overall sentiment of the author. Successfully extracting such key sentences could help counter the noise from expressed sentiment that is irrelevant to the movie itself (e.g., “The protagonist is a great leader.”) or that runs counter to the overall opinion of the reviewer, as in the “thwarted expectations” situation described previously.
Our classification task thus broke down into two components: extracting key sentences and classifying those sentences as positive (thumbs up) or negative (thumbs down).
2.1 Key Sentence Extraction
Our algorithm begins by scoring each sentence in the review according to presumed sentence importance. The review is passed through a sentence boundary detector and a POS tagger, and each tagged sentence is scored. The sentences are then ranked according to score. We looked for three types of features when scoring sentences: the adjectives present in the sentence (and their scores), mentions of the movie title or of synonyms of “movie,” and the presence of contrast words; these features are described in detail in Section 3.4.
2.2 Sentence Classification
For classification, we calculated a new score for relevant sentences. The score was based on the Z-statistic of the adjectives in the sentence. However, we took extra measures to deal with misleading situations; in particular, we accounted for “concessions” (or “contrast phrases”) and negations. We define concessions as sentences with subordinate clauses or prepositional phrases opposite in meaning to the overall sentiment, such as “Whereas the characters in the movie were endearing, they could not make up for a lousy storyline.” In this sentence, the “whereas” clause would be ignored in computing the sentence score. Sentences with the coordinating conjunction “but” account for another common type of concession; in such sentences, clauses preceding the word “but” are ignored in scoring.
Negations are of the form “This is not a good movie.” We implemented a rather simplistic strategy to handle negations: we ignored any adjectives in a sentence that appear after the words “not,” “never,” or “no.” This strategy cannot capture more sophisticated forms of negation such as “The movie lacked pizzazz,” where the verb “lacked” implies negation. Despite this obvious limitation, to our knowledge no movie review classification system has been developed that employs a more sophisticated negation detection scheme.
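As an illustration of this filtering strategy, the sketch below drops adjectives that follow a negation word and adjectives that precede a “but”; the class name, method signature, and word lists are our own illustrative choices rather than the actual code in Review.java (which also handles subordinate-clause concessions such as “whereas”).

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch (not the submitted Review.java code) of the negation and
// "but"-concession filtering described above. Tokens are assumed to be lower-cased.
public class AdjectiveFilter {
    private static final List<String> NEGATIONS = Arrays.asList("not", "never", "no");

    // tokens: the words of one sentence; adjectivePositions: indices the tagger marked JJ
    public static List<Integer> scorableAdjectives(List<String> tokens,
                                                   List<Integer> adjectivePositions) {
        int lastBut = tokens.lastIndexOf("but");        // clauses before "but" are ignored
        int firstNegation = Integer.MAX_VALUE;          // adjectives after "not"/"never"/"no" are ignored
        for (int i = 0; i < tokens.size(); i++) {
            if (NEGATIONS.contains(tokens.get(i))) { firstNegation = i; break; }
        }
        List<Integer> kept = new ArrayList<>();
        for (int pos : adjectivePositions) {
            if (pos > lastBut && pos < firstNegation) kept.add(pos);
        }
        return kept;
    }
}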
3. Implementation
3.1 Obtaining Data
We obtained our data from the Cornell movie review database at http://www.cs.cornell.edu/people/pabo/movie-review-data/. Our training data consisted of 1400 pre-classified movie reviews, 700 positive and 700 negative. However, for much of our testing we used a smaller (200-300) random subset of these reviews. Our test data is a small subset of the enormous IMDb movie review archive of HTML reviews (also available on the Cornell website). We wrote a script (described in the “Files in our Submission” section) to clean the files so that they were suitable for analyzing.
3.2 Sentence Boundary Detection and Tagging
We developed a simple heuristic-based sentence boundary detection module to break up test reviews into sentences. The algorithm looks for periods, exclamation marks, and question marks as sentence boundary points, and also takes into account common misleading situations such as a period followed by a quotation mark or a period in the middle of a common abbreviation (such as Mr., Mrs., or Ms.). It works for the vast majority of reviews – we have yet to come across a case in which the sentence boundary detector gives a wrong boundary.
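A rough sketch of this kind of heuristic is shown below; it is not the submitted module, and the abbreviation list and method names are illustrative only.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Rough sketch of a heuristic sentence boundary detector in the spirit of the one
// described above: split on ., ! and ?, keep a closing quotation mark with its
// sentence, and do not split after a few common abbreviations.
public class SentenceSplitter {
    private static final List<String> ABBREVIATIONS = Arrays.asList("mr", "mrs", "ms", "dr");

    public static List<String> split(String review) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : review.split("\\s+")) {
            current.append(token).append(' ');
            boolean endsSentence = token.matches(".*[.!?][\"')]*");   // punctuation, possibly followed by a quote
            String bare = token.replaceAll("[^A-Za-z]", "").toLowerCase();
            if (endsSentence && !ABBREVIATIONS.contains(bare)) {      // "Mr." etc. do not end a sentence
                sentences.add(current.toString().trim());
                current.setLength(0);
            }
        }
        if (current.length() > 0) sentences.add(current.toString().trim());
        return sentences;
    }
}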
We used the Brill tagger to part-of-speech tag each sentence (in both the training and test data sets).
3.3 Scoring Words Using the Z-Statistic
As mentioned above, we used our training set to obtain positive and negative counts for each adjective encountered, and computed an adjective score based on those two counts. The scoring metric we used is the Z-statistic, whose formula is as follows:
z = (observedRatio − expectedRatio) / SE = (p_i − p_o) / √(p_o(1 − p_o)/n)

where SE = √(p_o(1 − p_o)/n) is the standard error.
The observed ratio p_i is the ratio of positive counts to total (positive + negative) counts, while the expected ratio p_o is the prior expected value of p_i, which we set to 0.5. The population size n is, in this case, the sum of positive and negative counts for the adjective concerned. The more the observed ratio deviates from the expected ratio, the higher the magnitude of the score, given a constant standard error. Since the calculation of the standard error takes into account the population size, the larger the total number of counts for the word, the lower the standard error and thus the higher the magnitude of the Z-statistic, and vice versa. Thus a confidence measure is inherent in the calculation of the Z-statistic.
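For concreteness, the sketch below computes the Z-statistic exactly as defined above; the class and method names are ours, and the example counts in main come from the adjective table in the Results section.

// Sketch of the Z-statistic score for a single adjective, following the formula above.
public class ZStat {
    /** posCount / negCount: occurrences of the adjective in positive / negative training reviews. */
    public static double score(int posCount, int negCount) {
        int n = posCount + negCount;                              // population size
        if (n == 0) return 0.0;
        double pObserved = (double) posCount / n;                 // p_i
        double pExpected = 0.5;                                   // p_o, the prior expectation
        double se = Math.sqrt(pExpected * (1 - pExpected) / n);   // standard error
        return (pObserved - pExpected) / se;
    }

    public static void main(String[] args) {
        // "bad": 54 occurrences in positive reviews, 133 in negative reviews
        System.out.println(score(54, 133));                       // about -5.78, matching the table below
    }
}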
3.4 Sentence Ranking
Before classifying a review, we break it into separate sentences and then rank each sentence by its importance. To measure the importance of a given sentence, we take into account several features – the presence of adjectives in the sentence (their Z-stat scores), the presence of the movie title (or synonyms of “movie”), and the presence of contrast words as discussed above. More specifically, the score of a given sentence is calculated as follows. First we sum the absolute values of the Z-stat scores of all the adjectives present in the sentence. Then we give hard-coded bonus points based on the presence of movie titles/synonyms and contrast words. For sentences that specifically mention the title of the movie, we give a bonus of 3 points. For sentences that do not mention the title of the movie but do mention some synonym of the word “movie,” we give a bonus of 2 points. Finally, if the sentence also mentions a contrast word such as “whereas” or “although,” we give a bonus of 2 points.
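The importance score can be summarized in a short sketch like the one below; the constant names, the synonym and contrast word sets, and the method signature are illustrative (the actual constants live in our submitted Java files).

import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the sentence-importance score: sum of |Z-score| over the sentence's
// adjectives plus hard-coded bonus points, as described above. Tokens are assumed
// to be lower-cased.
public class ImportanceScorer {
    private static final double TITLE_BONUS = 3.0;     // sentence mentions the movie title
    private static final double SYNONYM_BONUS = 2.0;   // sentence mentions a synonym of "movie"
    private static final double CONTRAST_BONUS = 2.0;  // sentence contains a contrast word

    public static double importance(List<String> tokens, List<String> adjectives,
                                    Map<String, Double> zScores, String movieTitle,
                                    Set<String> movieSynonyms, Set<String> contrastWords) {
        double score = 0.0;
        for (String adj : adjectives) {
            score += Math.abs(zScores.getOrDefault(adj, 0.0));
        }
        String sentence = String.join(" ", tokens);
        if (sentence.contains(movieTitle.toLowerCase())) {
            score += TITLE_BONUS;
        } else if (movieSynonyms.stream().anyMatch(tokens::contains)) {
            score += SYNONYM_BONUS;
        }
        if (contrastWords.stream().anyMatch(tokens::contains)) {
            score += CONTRAST_BONUS;
        }
        return score;
    }
}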
3.5 Classification of Sentences
As mentioned in the Algorithm Overview section, to classify a sentence as positive or negative, the primary feature we looked at was the adjectives present in the sentence. Each adjective is assigned a score using the Z-statistic formula described above, and the score of the entire sentence is the sum of the scores of all its adjectives.
To target sentences with negations and concessions, we also implemented specific features to avoid counting adjectives that are contrary to the overall opinion of the sentence. For example, the clause immediately following the contrast word “although” will be ignored in classifying the sentence.
The overall score of the sentence determines its sentiment orientation – a positive score indicates a positive orientation, and a negative score indicates a negative orientation.
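A sentence's orientation can then be summarized as below; this sketch assumes the list of adjectives has already been filtered for negations and concessions (as in the filtering sketch in Section 2), and the names are again illustrative.

import java.util.List;
import java.util.Map;

// Sketch of sentence orientation: the signed sum of adjective Z-scores, computed
// over adjectives that survived the negation/concession filtering.
public class SentenceClassifier {
    /** Returns +1 for a positive sentence, -1 for a negative one, 0 for neutral. */
    public static int orientation(List<String> scorableAdjectives, Map<String, Double> zScores) {
        double score = 0.0;
        for (String adj : scorableAdjectives) {
            score += zScores.getOrDefault(adj, 0.0);
        }
        return Double.compare(score, 0.0);
    }
}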
3.6 Files in our Submission
Besides the executables listed below, we have submitted the other necessary Java files as well as our training and testing data.
Java Executables
ReviewClassifier.class –
Usage: java ReviewClassifier RATINGS
This is the main executable to run our algorithm. It will train on the data set in traindata/ and test on the data in testdata/. The argument RATINGS is a file containing a list of review numbers (corresponding to files in testdata/) and the correct classifications of those reviews. ReviewClassifier will output a summary of its classifications and overall accuracy to the file ClassificationStats.out.
There are a number of constants in ReviewClassifier.java that can be used to turn on and off various features in the training and testing process. In addition, there is a constant array of strings where you can specify which parts of speech the algorithm will use for learning and testing. In general, we used {“JJ”}. See ReviewClassifier.java for more details.
NOTE: When modifying these values, be sure to “make clean” before “make”-ing. Sometimes the make process does not remake all the files that it should unless make clean is performed first.
Review.class –
Usage: java Review reviewFileName reviewFileNameTagged
Example: java Review testdata/new5010 testdata/tagged5010_sentences
Executing this file will train on the data in traindata/ and then evaluate a specific review (in this case review #5010), printing to the screen the details of the calculations used to classify the review. Review.java contains two constant arrays of Strings which contain the list of “concession” words (i.e. “despite,” “whereas,” etc.) and the list of negation words (i.e. “no,” “not,” etc.). These can be modified as desired.
Scripts
testdata/cleanup –
Usage: ./cleanup firstFile lastFile
Example: ./cleanup 5001 5099
This is a script for preprocessing HTML files. It performs the “clean up” duties of removing HTML tags and metadata. In the above example, cleanup will clean up the files named 5001.html through 5099.html.
tagTestFiles –
Usage: ./tagTestFiles
Calls programs to do sentence boundary detection and tagging of all files named testdata/new50xx, where xx ranges from 00 to 99. The formatted versions of each file are named tagged50xx_sentences. This should be run on files created by cleanup to prepare them for reviewing by Review.class or ReviewClassifier.class.
4. Results
We observed that most of the adjectives were scored appropriately by our scoring metric, in the sense that most positive adjectives were assigned positive scores and most negative adjectives were assigned negative scores. Because of the way the Z-statistic is calculated, adjectives that occur more often were given scores of higher magnitude than adjectives that occur less often. Some examples of the adjective scores obtained with a training set of 125 positive and 125 negative reviews are given below:
Adjective | Score | Occurrences in positive reviews | Occurrences in negative reviews
bad | -5.777 | 54 | 133
trashy | -1.732 | 0 | 3
uninteresting | -4 | 0 | 16
terrible | -3.1277 | 4 | 19
good | 1.27 | 134 | 114
terrific | 1.667 | 7 | 2
outstanding | 2.646 | 7 | 0
superb | 1.069 | 9 | 5
By manually going over the adjective scores, we believe that the scores were assigned appropriately. The Z-statistic takes into account the population size for each adjective, so a confidence measure is inherent in the score.
However, we did notice a few wrongly scored adjectives: adjectives that express a negative sentiment were assigned a positive score, and vice versa. This might be due to noise, and also to the fact that we did not take negation into account while collecting statistics. In other words, “not bad” in a positive review in our training set will contribute to the positive occurrences of the adjective “bad” in the system, which can be misleading. Another potential reason is that our training set is still relatively small; a larger training set could help counter the noise from sentiment that runs counter to the overall opinion of the reviewer.
We did most of our experiments with a training set of 125 positive reviews and 125 negative reviews. We found that training with larger training sets decreases our overall accuracy.
We believe that one major reason why our performance degraded on larger training sets is that the bonus points for the presence of movie titles or synonyms and for the presence of contrast words are not scaled according to the scores of the strongly indicative adjectives. Because of the way the Z-statistic is calculated, increasing the size of the training set leads to higher average scores among the words, because of the higher average number of positive and negative samples for each adjective (and thus higher confidence).
The highest accuracy occurs when the training set is around 250 reviews (125 positive and 125 negative), at which point the accuracy is 76% (54 of 71 test cases) with all the extra features turned on and a classification approach that considers the top three sentences in each review.
We experimented with larger training set sizes, but they all gave lower accuracy over our test set:
Training Set Size | Accuracy
250 | 76% (54/71)
300 | 70% (50/71)
400 | 66% (47/71)
600 | 58% (41/71)
1400 | 66% (47/71)
We obtained the above results using a classification approach that looks at the top three sentences of each review, with our extra features turned on.
We implemented extra features targeting concession and negation sentence structures to improve our classifier's sentence extraction and classification, and we also experimented with different approaches to classifying reviews.
We essentially look only at the top three most important sentences in the review to arrive at the classification result. Each of the top three sentences has a sentence score indicating its sentiment orientation: a positive sentence score under our model means that the sentence expresses a positive sentiment orientation, and a negative sentence score indicates a negative one. A positive sentence casts a positive vote and a negative sentence casts a negative vote. We use majority rule – the review is classified as a positive (“thumbs up”) review if there are more positive votes than negative votes, and vice versa. When one of the top three sentences has a score of zero and thus expresses neither a positive nor a negative sentiment, we break ties by looking at the aggregate score of all the sentences: a positive aggregate score means the review is positive, while a negative aggregate score means the review is negative.
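The voting scheme can be sketched as follows; the Sentence class and field names are illustrative stand-ins for the structures in our code, and the tie-break follows the aggregate-score rule described above.

import java.util.Comparator;
import java.util.List;

// Sketch of the top-three majority vote with an aggregate-score tie-break.
public class ReviewVoter {
    public static class Sentence {
        double importance;        // score from the sentence-ranking step
        double orientationScore;  // signed sum of adjective Z-scores
        public Sentence(double importance, double orientationScore) {
            this.importance = importance;
            this.orientationScore = orientationScore;
        }
    }

    /** Returns true for "thumbs up", false for "thumbs down". */
    public static boolean classify(List<Sentence> sentences) {
        sentences.sort(Comparator.comparingDouble((Sentence s) -> s.importance).reversed());
        List<Sentence> top = sentences.subList(0, Math.min(3, sentences.size()));
        int votes = 0;
        for (Sentence s : top) {
            votes += Double.compare(s.orientationScore, 0.0);   // +1, 0, or -1
        }
        if (votes != 0) return votes > 0;                        // majority rule among the top three
        double aggregate = 0.0;                                  // tie-break: aggregate over all sentences
        for (Sentence s : sentences) aggregate += s.orientationScore;
        return aggregate > 0;
    }
}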
Using this approach of focusing on the top three sentences of the review, with a training set of 125 positive and 125 negative reviews and all our extra features turned on, we obtained an accuracy of 76% – 54 correctly classified reviews out of a total of 71 reviews in the test set. This is the best accuracy rate we have obtained to date.
We also experimented with different settings by changing the number of top sentences considered in classifying the reviews, as well as by turning the extra features on and off. The following table summarizes our results:
Sentences Considered | Extra Features Turned On | Extra Features Turned Off
top two | 75% (53/71) | 69% (49/71)
top three | 76% (54/71) | 65% (46/71)
top four | 75% (53/71) | 68% (48/71)
top five | 70% (50/71) | 65% (46/71)
top six | 67% (48/71) | 66% (47/71)
all | 62% (44/71) | 59% (42/71)
We obtained all of the above results using a training set consisting of 125 positive and 125 negative reviews. As expected, accuracy increased significantly when the extra features were turned on. In addition, accuracy generally decreased as we increased the number of top sentences considered in classifying each review. This confirmed our belief that looking at key sentences to classify a review is more effective than looking at the entire set of sentences in the review, probably because of the noise from expressed sentiment that is irrelevant to the movie (e.g., “The protagonist is a great leader.”) or that runs counter to the overall opinion of the reviewer, as in the “thwarted expectations” situation.
In addition to the approach above, we also experimented with other classification approaches in which the number of sentences we look at in each review is dynamic. We tried three similar approaches. In approach B, we look at the top two sentences in the ranked list and see whether together they give a majority vote. If there is a majority, we stop and classify based on that majority. If, on the other hand, the first two sentences both give a neutral classification or there is a tie, we go on to look at the vote of the third sentence on our list. If the third sentence also gives a neutral classification, we go on to the fourth most important sentence, and so on, stopping once there is a majority vote.
We also experimented with a variant of this approach in which we looked at two sentences at a time – if the first two do not give a majority, we go on to look at the next two, and if there is still no majority, we look at the first six, and so on. This is approach C. Finally, we also experimented with an approach in which we classify based on the single most important sentence and only move on to the next two sentences if it gives a neutral classification. This is approach A.
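As an illustration, approach B might look like the sketch below; the input is the list of sentence orientation scores already ranked by importance, and the behavior when no majority is ever reached is our own assumption (the write-up does not specify it).

import java.util.List;

// Sketch of approach B: start with the two most important sentences, then add one
// sentence at a time until the positive/negative votes yield a majority.
public class DynamicVoter {
    public static boolean classify(List<Double> rankedOrientationScores) {
        int votes = 0;
        int considered = 0;
        for (double score : rankedOrientationScores) {
            votes += Double.compare(score, 0.0);                 // +1 positive, -1 negative, 0 neutral
            considered++;
            if (considered >= 2 && votes != 0) return votes > 0; // stop as soon as there is a majority
        }
        return votes > 0;   // assumption: fall back on the sign of the total vote
    }
}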
The following table gives our experiment results with these three approaches. Again we obtained the following results using a training set of 125 positive reviews and 125 negative reviews.
Approach | Extra Features Turned On | Extra Features Turned Off
A: start with the top sentence, then 2 sentences at a time | 73% (52/71) | 69% (49/71)
B: start with first 2, then jump 1 at a time | 72% (51/71) | 64% (46/71)
C: start with first 2, then jump 2 at a time | 68% (48/71) | 64% (46/71)
Dynamically changing the number of sentences considered in classifying a particular review did not seem to have a significant effect on performance. The underlying reason is that most of the time it suffices to look at the very top sentences in the ranked list to arrive at a correct classification. Our experiments with approach A indicated that most of the time we only have to look at the very first sentence to arrive at a correct classification.
The results above all illustrate that our extra features targeting sentences with negations and concessions are effective in significantly raising the accuracy of our system. This is in accordance with our expectations: writers often use concession sentence structures and negation words when writing reviews, and being able to recognize those structures and adjust our scoring accordingly is important for capturing the right key sentences and making the right classification for each review.
5. Future Work
5.1 Automatic Learning of Weight Parameters
One major difficulty that we encountered in this project was assigning appropriate weights to all the features we employed in determining the importance of sentences and in classifying them. There are a good number of weight parameters in the system, and it is not at all clear how we should set their values – for example, the bonus points for the presence of movie titles and synonyms, the bonus points for the presence of contrast phrases in a sentence, and potentially the relative weights of different parts of speech. One major source of improvement would be to implement automatic learning of these weight parameters. This would probably enable us to use parts of speech beyond adjectives, such as adverbs and verbs, to extract sentiment more effectively.
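One simple way such learning could be done is a grid search over the bonus weights against a held-out set, as sketched below; the Evaluator hook and candidate values are hypothetical and not part of our submitted code.

// Hedged sketch of automatic weight tuning via grid search: try candidate bonus
// values and keep the combination with the best accuracy on a held-out set.
public class WeightSearch {
    /** Hypothetical hook into the existing classifier; not part of the submitted code. */
    public interface Evaluator {
        double evaluateAccuracy(double titleBonus, double synonymBonus, double contrastBonus);
    }

    public static double[] gridSearch(Evaluator eval) {
        double[] candidates = {0.0, 1.0, 2.0, 3.0, 4.0, 5.0};
        double bestAccuracy = -1.0;
        double[] best = new double[3];
        for (double title : candidates) {
            for (double synonym : candidates) {
                for (double contrast : candidates) {
                    double accuracy = eval.evaluateAccuracy(title, synonym, contrast);
                    if (accuracy > bestAccuracy) {
                        bestAccuracy = accuracy;
                        best = new double[] {title, synonym, contrast};
                    }
                }
            }
        }
        return best;
    }
}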
Another possible benefit would be the ability to incorporate phrases longer than one word, such as bigrams. Although bigram data is sparser, which would introduce more noise, we could compensate for this by weighting bigrams appropriately.
Finally, it would be useful to incorporate a better negation handling system. Given more time, this would have been an interesting area of further exploration.