Sentiment Extraction and Classification of Movie Reviews

 

Final Project Write-Up

By Kavi Goel, Anthony Hui

 

CS 224N, Spring 2004

June 4, 2004

 

1.      Past Literature

 

Several past attempts have been made to develop systems that can automatically classify movie reviews as “Thumbs Up” or “Thumbs Down.”  In Turney (2002), a system is described that classifies reviews of several different types of products, including movies.  The system identifies two-word sequences containing particular combinations of nouns, adjectives, and adverbs and calculates their semantic orientation by measuring the co-occurrence frequency of each phrase with the words “excellent” and “poor” in Web documents.  While the technique was applied more successfully in some of the other problem domains (such as car reviews), performance was lowest with movie reviews (movie review classification was correct 66% of the time, vs. 84% for automobile reviews).

 

Pang et al. (2002) applied several standard machine learning techniques to the movie review problem.  The best accuracy was achieved using a bag-of-words feature set, with performance in the range of 81-83% across all machine learning techniques applied.  Performance was best when all words were considered and no structural elements of sentences were used in the feature set.

 

Both papers mention that a possible problem is “thwarted expectations”: situations in which the reviewer deliberately contrasts her overall opinion with evidence that opposes it, as in this sentence (taken from Pang et al.): “This film should be brilliant. It sounds like a great plot, the actors are first grade, and the supporting cast is good as well, and Stallone is attempting to deliver a good performance.  However, it can't hold up.”  Such contrasts are very common in the movie review domain.

 

To our knowledge, no attempts have been published in which sophisticated sentence analyses (structural or semantic) have been applied to the movie review classification problem.

 

2. Algorithm Overview

 

For our project, we decided to explore a different method of classifying movie reviews.  Our hypothesis was that movie reviews tend to have certain key sentences that express the overall sentiment of the author.  Successfully extracting such key sentences could help to counter the noise from sentiment that is irrelevant to the movie (e.g., “The protagonist is a great leader.”) or sentiment that runs counter to the overall opinion of the reviewer, as in the “thwarted expectations” situation described previously.

 

Our classification task thus broke down into two components: extracting key sentences, and classifying those sentences as positive (thumbs up) or negative (thumbs down).

 

2.1 Key Sentence Extraction

 

Our algorithm begins by scoring each sentence in the review according to presumed sentence importance.  The review is passed through a sentence boundary detector and a POS tagger, and each tagged sentence is scored.  The sentences are then ranked according to score.  We looked for three types of features when scoring sentences.

 

  1. Presence of movie title in sentence: in general, a sentence which contains the movie title should be more likely to express sentiment about the movie as a whole than one which does not.  We also assigned points to a sentence's score for the presence of a synonym, such as the words “movie” or “film.”
  2. Presence of strongly indicative adjectives: past research has shown that adjectives tend to be good indicators of sentiment.  A more generalized algorithm could have examined other word types (and we explored this idea briefly), but adjectives alone provided the best performance out-of-the-box.  We trained our system over a number of pre-classified reviews, collecting statistics on how frequently each adjective appeared in positive and negative reviews.  Adjectives in the sentence under consideration that tended to appear predominantly in positive reviews or predominantly in negative reviews contributed to the sentence score.  The formula we used is the z-statistic, described later.
  3. Presence of contrast words: sentences containing words such as “despite,” “whereas,” or “but,” which signal contrasting sentiment, were given a higher score.  These sentences often contain important indicators of the overall sentiment of the reviewer.

 

2.2 Classification of Key Sentences

 

For classification, we calculated a new score for relevant sentences.  The score was based on the z-statistic of the adjectives in the sentence.  However, we took extra measures to help deal with misleading situations.  In particular, we accounted for “concessions” (or “contrast phrases”) and negations.  We define concessions as sentences with subordinate clauses or prepositional phrases opposite in meaning to the overall sentiment, such as “Whereas the characters in the movie were endearing, they could not make up for a lousy storyline.”  In this sentence the “whereas” clause would be ignored in computing the sentence score.  Sentences with the coordinating conjunction “but” account for another common type of concession.  In sentences with “but,” clauses preceding the word are ignored in scoring.

 

Negations are of the form “This is not a good movie.”  We implemented a rather simplistic strategy to handle negations: we ignored any adjectives in the sentence that appear after the words “not,” “never,” or “no.”  This strategy cannot capture more sophisticated forms of negation such as “The movie lacked pizzazz,” where the verb “lacked” implies negation.  Despite this obvious limitation, to our knowledge no published movie review classification system employs a more sophisticated negation detection scheme.

 

3.       Implementation

 

3.1    Obtaining data

 

We obtained our data from the Cornell movie review database at http://www.cs.cornell.edu/people/pabo/movie-review-data/

 

Our training data consisted of 1400 pre-classified movie reviews, 700 positive and 700 negative.  However, for much of our testing we used a smaller random subset (200-300) of these reviews.  Our test data is a small subset of the enormous IMDb archive of HTML movie reviews (also available on the Cornell website).  We wrote a script (described in the “Files in our Submission” section) to clean the files so that they were suitable for analysis.

 

3.2    Sentence Boundary Detection and Tagging

 

We developed a simple heuristic-based sentence boundary detection module to break test reviews into sentences.  The algorithm looks for periods, exclamation marks, and question marks as sentence boundary points and also takes into account common misleading situations, such as a period followed by a quotation mark or a period inside a common abbreviation (such as Mr., Mrs., or Ms.).  It works for the vast majority of reviews – we have yet to come across a case in which the sentence boundary detector gives a wrong boundary.
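
For illustration, the following is a minimal sketch of this kind of heuristic; the abbreviation list and the quotation-mark handling shown here are simplified stand-ins for those in our actual module.

import java.util.ArrayList;
import java.util.List;

// A minimal sketch of a heuristic sentence boundary detector.
// The abbreviation list and quotation handling are illustrative only.
public class SimpleSentenceSplitter {
    private static final String[] ABBREVIATIONS = {"Mr.", "Mrs.", "Ms.", "Dr.", "vs."};

    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            current.append(c);
            if (c == '!' || c == '?' || (c == '.' && !endsWithAbbreviation(current))) {
                // Pull a closing quotation mark into the current sentence.
                if (i + 1 < text.length() && text.charAt(i + 1) == '"') {
                    current.append('"');
                    i++;
                }
                String s = current.toString().trim();
                if (!s.isEmpty()) sentences.add(s);
                current.setLength(0);
            }
        }
        String rest = current.toString().trim();
        if (!rest.isEmpty()) sentences.add(rest);
        return sentences;
    }

    private static boolean endsWithAbbreviation(StringBuilder sb) {
        String s = sb.toString();
        for (String abbr : ABBREVIATIONS) {
            if (s.endsWith(abbr)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(split("Mr. Smith liked the film. \"Superb!\" he wrote. Was it really?"));
    }
}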

 

We used the Brill tagger to Part-of-Speech tag each sentence (both in the training and test data sets).

 

3.3    Scoring Words using the Z-Statistic

 

As mentioned above, we used our training set to collect positive and negative counts for each adjective encountered and computed an adjective score from those two counts.  The scoring metric we used is the z-statistic, whose formula is as follows:

 

z = (observed ratio − expected ratio) / SE

  = (p_i − p_o) / √( p_o (1 − p_o) / n )

(SE = standard error = √( p_o (1 − p_o) / n ))

 

The observed ratio p_i is the ratio of positive counts to total (positive + negative) counts, while the expected ratio p_o is the prior expected value of p_i, which we set to 0.5.  n is the population size, in this case the sum of positive and negative counts for the adjective concerned.
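
To make the formula concrete, the sketch below computes the z-statistic from an adjective's positive and negative counts.  The example counts (54 positive and 133 negative occurrences for “bad,” 7 and 0 for “outstanding”) are taken from the table in section 4.1, and the sketch reproduces the scores listed there.

// A minimal sketch of the z-statistic computation described above.
public class ZStatistic {
    // pos and neg are the counts of the adjective in positive and negative reviews.
    public static double score(int pos, int neg) {
        int n = pos + neg;                       // population size
        if (n == 0) return 0.0;                  // unseen adjective: no evidence
        double pObserved = (double) pos / n;     // observed ratio p_i
        double pExpected = 0.5;                  // expected ratio p_o (prior)
        double se = Math.sqrt(pExpected * (1 - pExpected) / n);  // standard error
        return (pObserved - pExpected) / se;
    }

    public static void main(String[] args) {
        // Counts for "bad" from the table in section 4.1: 54 positive, 133 negative.
        System.out.println(score(54, 133));   // approximately -5.78
        // Counts for "outstanding": 7 positive, 0 negative.
        System.out.println(score(7, 0));      // approximately 2.65
    }
}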

 

The more the observed ratio deviates from the expected ratio, the higher the magnitude of the score is, given a constant standard error.  Since the calculation of the standard error takes into account the population size, the larger the total number of counts for the word, the lower the standard error and thus the higher the magnitude of the Z-statistic, and vice versa.  Thus a confidence measure is inherent in the calculation of the Z-statistic.

 

3.4    Sentence ranking

 

Before classifying a review, we break it into separate sentences and then rank each sentence by importance.  To measure the importance of a given sentence, we take into account several features – the presence of adjectives in the sentence (their z-statistic scores), the presence of the movie title (or synonyms), and the presence of contrast words, as discussed above.  More specifically, the score of a given sentence is calculated as follows.  First we sum up the absolute values of the z-statistic scores of all the adjectives present in the sentence.  Then we give hard-coded bonus points based on the presence of the movie title or synonyms and of contrast words.  For sentences that specifically mention the title of the movie, we give a bonus of 3 points.  For sentences that do not mention the title but do mention a synonym of the word “movie,” we give a bonus of 2 points.  Finally, if the sentence also mentions a contrast word such as “whereas” or “although,” we give a bonus of 2 points.
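
The following simplified sketch illustrates this sentence-importance score.  The boolean parameters stand in for the feature checks in our actual code (title match, synonym match, contrast-word match) and are purely illustrative.

import java.util.List;

// A minimal sketch of the sentence-importance score described above.
// adjectiveZScores holds the z-statistics of the adjectives in the sentence;
// the boolean flags are stand-ins for the title/synonym/contrast-word checks.
public class SentenceRanker {
    public static double importance(List<Double> adjectiveZScores,
                                     boolean containsMovieTitle,
                                     boolean containsMovieSynonym,
                                     boolean containsContrastWord) {
        double score = 0.0;
        // Sum of the absolute z-statistic values of all adjectives in the sentence.
        for (double z : adjectiveZScores) {
            score += Math.abs(z);
        }
        // Hard-coded bonuses from section 3.4.
        if (containsMovieTitle) {
            score += 3.0;               // sentence mentions the movie title
        } else if (containsMovieSynonym) {
            score += 2.0;               // sentence mentions "movie", "film", etc.
        }
        if (containsContrastWord) {
            score += 2.0;               // sentence contains "whereas", "although", ...
        }
        return score;
    }

    public static void main(String[] args) {
        // e.g. a sentence containing "good" (z = 1.27) and the movie title
        System.out.println(importance(List.of(1.27), true, false, false)); // 4.27
    }
}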

 

3.5    Classification of Sentences

 

As mentioned in the Algorithm Overview section, to classify a sentence as positive or negative, the primary feature we look at is the adjectives present in the sentence.  Each adjective is assigned a score using the z-statistic formula above, and the score of the entire sentence is the sum of the scores of all its adjectives.

 

To target sentences with negations and concessions, we also implemented specific features to avoid counting adjectives that are contrary to the overall opinion of the sentence.  For example, the clause immediately following the contrast word “although” is ignored when classifying the sentence.

 

The overall score of the sentence determines its sentiment orientation – a positive score indicates a positive orientation, and a negative score indicates a negative one.
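
The following simplified sketch pulls these pieces together: adjectives after a negation word are skipped, a clause opened by a concession word is skipped (here, as a simplification, only until the next comma), material before “but” is discarded, and the remaining adjective z-statistics are summed.  The word lists and helper structure are illustrative rather than the exact code in Review.java.

import java.util.Arrays;
import java.util.List;
import java.util.Map;

// A simplified sketch of the sentence classification step.  tokens are assumed
// to be lowercased words, tags[i] is the POS tag of tokens[i], and
// adjectiveScores maps adjectives to their z-statistics (section 3.3).
public class SentenceClassifier {
    private static final List<String> NEGATION_WORDS = Arrays.asList("not", "never", "no");
    private static final List<String> CONCESSION_WORDS = Arrays.asList("although", "whereas", "despite");

    public static double score(String[] tokens, String[] tags, Map<String, Double> adjectiveScores) {
        double total = 0.0;
        boolean negated = false;      // true once a negation word has been seen
        boolean inConcession = false; // true while inside a concession clause
        for (int i = 0; i < tokens.length; i++) {
            String word = tokens[i];
            if (word.equals("but")) {
                // Clauses before "but" are ignored: restart the running score.
                total = 0.0;
                negated = false;
                inConcession = false;
                continue;
            }
            if (CONCESSION_WORDS.contains(word)) { inConcession = true; continue; }
            if (word.equals(",")) { inConcession = false; continue; } // simplification: clause ends at the comma
            if (NEGATION_WORDS.contains(word)) { negated = true; continue; }
            if (tags[i].equals("JJ") && !negated && !inConcession) {
                total += adjectiveScores.getOrDefault(word, 0.0);
            }
        }
        return total;  // positive => thumbs up, negative => thumbs down
    }

    public static void main(String[] args) {
        // Adjective scores taken from the table in section 4.1.
        Map<String, Double> scores = Map.of("superb", 1.069, "terrible", -3.128, "bad", -5.777);
        String[] tokens = {"although", "the", "acting", "was", "superb", ",",
                           "the", "plot", "is", "terrible", "and", "the", "pacing", "is", "bad"};
        String[] tags   = {"IN", "DT", "NN", "VBD", "JJ", ",",
                           "DT", "NN", "VBZ", "JJ", "CC", "DT", "NN", "VBZ", "JJ"};
        // "superb" is inside the concession clause and is skipped; the result is negative.
        System.out.println(score(tokens, tags, scores));
    }
}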

 

3.6    Files in our Submission

 

Besides the executables listed below, we have submitted the other necessary Java files as well as training and testing data.

 

Java Executables

 

ReviewClassifier.class

Usage: java ReviewClassifier RATINGS

 

This is the main executable to run our algorithm.  It will train on the data set in traindata/ and test on the data in testdata/.  The argument RATINGS is a file containing a list of review numbers (corresponding to files in testdata/) and the correct classifications of those reviews.  ReviewClassifier will output a summary of its classifications and overall accuracy to the file ClassificationStats.out.

 

There are a number of constants in ReviewClassifier.java that can be used to turn on and off various features in the training and testing process.  In addition, there is a constant array of strings where you can specify which parts of speech the algorithm will use for learning and testing.  In general, we used {“JJ”}.  See ReviewClassifier.java for more details.

 

NOTE: When modifying these values, be sure to “make clean” before “make”-ing.  Sometimes the make process does not remake all the files that it should unless make clean is performed first.

 

Review.class

Usage: java Review reviewFileName reviewFileNameTagged

Example: java Review testdata/new5010 testdata/tagged5010_sentences

 

Executing this file will train on the data in traindata/ and then evaluate a specific review (in this case review #5010), printing to the screen the details of the calculations used to classify the review.  Review.java contains two constant arrays of Strings which contain the list of “concession” words (i.e. “despite,” “whereas,” etc.) and the list of negation words (i.e. “no,” “not,” etc.).  These can be modified as desired.

 

Scripts

 

testdata/cleanup

Usage: ./cleanup firstFile lastFile

Example: ./cleanup 5001 5099

 

This is a script for preprocessing HTML files.  It performs the “clean up” duties of removing HTML tags and metadata.  In the above example, cleanup will clean up the files named 5001.html through 5099.html.

 

tagTestFiles

Usage: ./tagTestFiles

 

Calls programs to do sentence boundary detection and tagging of all files named testdata/new50xx, where xx ranges from 00 to 99.  The formatted version of each file is named tagged50xx_sentences.  This should be run on files created by cleanup to prepare them for processing by Review.class or ReviewClassifier.class.

 

4.      Evaluation

 

4.1 Adjective Scoring

 

We observed that most of the adjectives were scored appropriately by our scoring metric, in the sense that most positive adjectives were assigned positive scores and most negative adjectives were assigned negative scores.  Because of the way the z-statistic is calculated, adjectives that occur more often were given higher-magnitude scores than adjectives that occur less often.  Some examples of the adjective scores obtained with a training set of 125 positive and 125 negative reviews are given below:

 

 

Adjective        Score      Occurrences in positive reviews    Occurrences in negative reviews
bad              -5.777     54                                 133
trashy           -1.732     0                                  3
uninteresting    -4         0                                  16
terrible         -3.1277    4                                  19
good             1.27       134                                114
terrific         1.667      7                                  2
outstanding      2.646      7                                  0
superb           1.069      9                                  5

 

 

By manually going over the adjective scores, we believe that the scores were assigned appropriately.  The z-statistic takes into account the population size for each adjective, so a confidence measure is inherent in the score.

 

However, we did notice a few wrongly scored adjectives: adjectives that express a negative sentiment were assigned positive scores, and vice versa.  This might be due to noise, and also to the fact that we did not take negation into account while collecting statistics.  In other words, “not bad” in a positive review in our training set contributes to the positive occurrences of the adjective “bad” in our system, which is misleading.  Another potential reason is that our training set is still relatively small; a larger training set might help counter the noise from sentiment that runs counter to the overall opinion of the reviewer.

 

4.2 Training with larger training sets

 

We did most of our experiments with a training set of 125 positive reviews and 125 negative reviews.  We found that training with larger training sets decreased our overall accuracy.

 

We believe one major reason our performance degraded on larger training sets is that the bonus points for the presence of movie titles or synonyms and for the presence of contrast words are not scaled according to the scores of the strongly indicative adjectives.  Because of the way the z-statistic is calculated, increasing the size of the training set leads to higher average score magnitudes, since each adjective accumulates more positive and negative samples (and thus higher confidence); the fixed bonuses therefore carry relatively less weight.

 

The highest accuracy occurs when the training set is around 250 reviews (125 positive and 125 negative), at which point the accuracy is 76% (54 of 71 test cases) with all the extra features turned on and a classification approach that considers the top three sentences in each review.

 

We experimented with larger training set sizes, but they all gave lower accuracies over our test set:

 

Training Set Size    Accuracy
250                  76% (54/71)
300                  70% (50/71)
400                  66% (47/71)
600                  58% (41/71)
1400                 66% (47/71)

 

 

We obtained the above results using a classification approach of looking at the top three sentences for each review and with our extra features turned on. 

 

4.3 Performance of Extra Features and Different Classification Approaches

 

We implemented extra features for concession and negation sentence structures to improve sentence extraction and classification, and we also experimented with different approaches to classifying reviews.

 

Classification Approach I – Looking at a fixed number of sentences

 

In this approach we look only at the top three most important sentences in the review to arrive at the classification result.  Each of the top three sentences has a sentence score indicating its sentiment orientation: a positive sentence score under our model means that the sentence expresses a positive orientation, and a negative score indicates a negative orientation.  A positive sentence casts a positive vote and a negative sentence casts a negative vote.  We use majority rule – the review is classified as a positive (“thumbs up”) review if there are more positive votes than negative votes, and vice versa.  In cases where one of the top three sentences has a score of zero, and thus does not express a positive or negative sentiment, we break ties by looking at the aggregate score of all the sentences: a positive aggregate score means that the review is positive, while a negative aggregate score means that the review is negative.
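
A minimal sketch of this voting scheme is given below; the list of sentence scores is assumed to be ordered by decreasing sentence importance, with each entry holding that sentence's sentiment score from section 3.5.

import java.util.List;

// A minimal sketch of Classification Approach I: majority vote over the
// top three sentences, with ties broken by the aggregate score of all sentences.
// sentenceScores is assumed to be ordered by decreasing sentence importance.
public class TopThreeClassifier {
    public static boolean isPositive(List<Double> sentenceScores) {
        int positiveVotes = 0, negativeVotes = 0;
        int considered = Math.min(3, sentenceScores.size());
        for (int i = 0; i < considered; i++) {
            double s = sentenceScores.get(i);
            if (s > 0) positiveVotes++;
            else if (s < 0) negativeVotes++;
            // a score of exactly zero casts no vote
        }
        if (positiveVotes != negativeVotes) {
            return positiveVotes > negativeVotes;
        }
        // Tie (or all neutral): fall back to the aggregate score of every sentence.
        double aggregate = 0.0;
        for (double s : sentenceScores) aggregate += s;
        return aggregate > 0;
    }

    public static void main(String[] args) {
        // Top three sentence scores: 2.1, 0.0, -1.4 -> one vote each way,
        // so the aggregate of all sentences (here 0.3) breaks the tie.
        System.out.println(isPositive(List.of(2.1, 0.0, -1.4, -0.8, 0.4)));
    }
}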

 

Using this approach of focusing on the top three sentences, with a training set of 125 positive and 125 negative reviews and all our extra features turned on, we obtained an accuracy of 76% – 54 correctly classified reviews out of a total of 71 reviews in the test set.  This is the best accuracy we have obtained to date.

 

We also experimented with different settings by changing the number of top sentences considered in classifying the reviews as well as turning the extra features on and off.  The following table summarizes our results:

 

 

 

Sentences Considered    Extra Features Turned On    Extra Features Turned Off
top two                 75% (53/71)                 69% (49/71)
top three               76% (54/71)                 65% (46/71)
top four                75% (53/71)                 68% (48/71)
top five                70% (50/71)                 65% (46/71)
top six                 67% (48/71)                 66% (47/71)
all                     62% (44/71)                 59% (42/71)

 

 

We obtained all the results above using a training set consisting of 125 positive and 125 negative reviews.  As expected, the accuracy increased significantly when the extra features were turned on.  In addition, the accuracy decreased as the number of top sentences considered in classifying each review grew.  This confirmed our belief that looking at the key sentences to classify a review is more effective than looking at the entire set of sentences in the review, probably because of the noise from sentiment that is irrelevant to the movie (e.g., “The protagonist is a great leader.”) or sentiment that runs counter to the overall opinion of the reviewer, as in the “thwarted expectations” situation.

 

Classification Approach II – Dynamically changing the number of sentences considered

 

In addition to the approach above, we also experimented with approaches to classification in which the number of sentences we look at in each review is dynamic.  We tried three similar approaches.  In approach B, we look at the first two sentences in the review and see whether together they give a majority vote.  If there is a majority, we stop and classify based on the majority vote.  If, on the other hand, the first two sentences both give a neutral classification or there is a tie, we go on to look at the vote of the third sentence on our list.  If the third sentence also expresses a neutral classification, we go on to look at the fourth most important sentence, and so on.  We stop once there is a majority vote.

 

We also experimented with a variant of this approach in which we consider two sentences at a time: if the top two do not give a majority, we go on to consider the top four, and if there is still no majority, the top six, and so forth.  This is approach C.  Finally, we also experimented with an approach in which we classify based on the single most important sentence alone, and only go on to consider the next two sentences if it gives a neutral classification.  This is approach A.
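
The sketch below illustrates approach B (approaches A and C differ only in how many sentences are considered before and between majority checks); the final fall-back to the aggregate score when no majority is ever reached is an assumption made for the sketch rather than a detail spelled out above.

import java.util.List;

// A minimal sketch of Approach B: start with the top two sentences and keep
// adding the next most important sentence until the votes produce a majority.
// sentenceScores is assumed to be ordered by decreasing sentence importance.
public class DynamicClassifier {
    public static boolean isPositiveApproachB(List<Double> sentenceScores) {
        int positiveVotes = 0, negativeVotes = 0;
        for (int i = 0; i < sentenceScores.size(); i++) {
            double s = sentenceScores.get(i);
            if (s > 0) positiveVotes++;
            else if (s < 0) negativeVotes++;
            // Check for a majority once at least the top two sentences have been considered.
            if (i >= 1 && positiveVotes != negativeVotes) {
                return positiveVotes > negativeVotes;
            }
        }
        // No majority after all sentences: fall back to the aggregate score (an assumption).
        double aggregate = 0.0;
        for (double s : sentenceScores) aggregate += s;
        return aggregate > 0;
    }

    public static void main(String[] args) {
        // The top two sentences tie (one positive, one negative), so the third
        // sentence's vote decides the classification.
        System.out.println(isPositiveApproachB(List.of(1.8, -0.6, 2.3)));
    }
}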

 

The following table gives our experimental results with these three approaches.  Again, we obtained these results using a training set of 125 positive reviews and 125 negative reviews.

 

 

Approach                                                      Extra Features Turned On    Extra Features Turned Off
A: start with the top sentence, then 2 sentences at a time    73% (52/71)                 69% (49/71)
B: start with first 2, then jump 1 at a time                  72% (51/71)                 64% (46/71)
C: start with first 2, then jump 2 at a time                  68% (48/71)                 64% (46/71)

 

Dynamically changing the number of sentences considered in classifying a particular review did not seem to have a significant effect on our performance.  The underlying reason is that most of the time it suffices to look at the very top sentences in the ranked list to arrive at a correct classification.  Our experiments with approach A indicated that, most of the time, looking at the single most important sentence is enough to arrive at a correct classification.

 

The results in sections 4.2 and 4.3 all illustrate that our extra features targeting sentences with negations and concessions are effective in significantly raising the accuracy of our system.  This is in accordance with our expectations: writers often use concession structures and negation words in reviews, and being able to recognize those structures and adjust our scoring metric accordingly is important for capturing the right key sentences and making the right classification for each review.

 

5.  Ideas for further research/improvements

 

5.1  Automatic learning of weight parameters

 

One major difficulty we encountered in this project was assigning appropriate weights to all the features used in ranking and classifying sentences.  There are a good number of weight parameters in the system, and it is not at all clear how they should be set: for example, the bonus points for the presence of movie titles and synonyms, the bonus points for the presence of contrast phrases in a sentence, and potentially the relative weights of different parts of speech.  One major source of improvement would be to learn these weight parameters automatically.  This would probably also enable us to use parts of speech beyond adjectives, such as adverbs and verbs, to extract sentiment more effectively.

 

Another possible improvement would be to incorporate phrases longer than one word, such as bigrams.  Though bigram data is sparser and would introduce more noise, we could compensate for this by weighting bigrams appropriately.

 

Finally, it would be useful to incorporate a better negation handling system.  Given more time, this would have been an interesting area of further exploration.