Sentiment Analysis
What is Sentiment Analysis?
Positive or negative movie review?
• unbelievably disappointing
• Full of zany characters and richly applied satire, and some great plot twists
• this is the greatest screwball comedy ever filmed
• It was pathetic. The worst part about it was the boxing scenes.
Google Shopping aspects
https://www.google.com/shopping/product/14023500906804577211
Twitter sentiment versus Gallup Poll of Consumer Confidence
Brendan O'Connor, Ramnath Balasubramanyan, Bryan R. Routledge, and Noah A. Smith. 2010. From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series. ICWSM-2010.
Twitter sentiment:
Johan Bollen, Huina Mao, Xiaojun Zeng. 2011. Twitter mood predicts the stock market. Journal of Computational Science 2:1, 1-8. doi:10.1016/j.jocs.2010.12.007.
Target Sentiment on Twitter
• Twitter Sentiment App
• Alec Go, Richa Bhayani, Lei Huang. 2009. Twitter Sentiment Classification using Distant Supervision.
Very fancy sentiment detectors
• http://nlp.stanford.edu:8080/sentiment/rntnDemo.html
Sentiment analysis has many other names
• Opinion extraction
• Opinion mining
• Sentiment mining
• Subjectivity analysis
Why sentiment analysis?
• Movie: is this review positive or negative?
• Products: what do people think about the new iPhone?
• Public sentiment: how is consumer confidence? Is despair increasing?
• Politics: what do people think about this candidate or issue?
• Prediction: predict election outcomes or market trends from sentiment
Scherer Typology of Affective States
• Emotion: brief, organically synchronized … evaluation of a major event
  • angry, sad, joyful, fearful, ashamed, proud, elated
• Mood: diffuse, non-caused, low-intensity, long-duration change in subjective feeling
  • cheerful, gloomy, irritable, listless, depressed, buoyant
• Interpersonal stances: affective stance toward another person in a specific interaction
  • friendly, flirtatious, distant, cold, warm, supportive, contemptuous
• Attitudes: enduring, affectively colored beliefs, dispositions towards objects or persons
  • liking, loving, hating, valuing, desiring
• Personality traits: stable personality dispositions and typical behavior tendencies
  • nervous, anxious, reckless, morose, hostile, jealous
Sentiment Analysis
• Sentiment analysis is the detection of attitudes: “enduring, affectively colored beliefs, dispositions towards objects or persons”
  1. Holder (source) of attitude
  2. Target (aspect) of attitude
  3. Type of attitude
     • From a set of types: like, love, hate, value, desire, etc.
     • Or (more commonly) simple weighted polarity: positive, negative, neutral, together with strength
  4. Text containing the attitude
     • Sentence or entire document
Sentiment Analysis
• Simplest task:
  • Is the attitude of this text positive or negative?
• More complex:
  • Rank the attitude of this text from 1 to 5
• Advanced:
  • Detect the target (stance detection)
  • Detect the source
  • Detect complex attitude types
Sentiment Analysis
What is Sentiment Analysis?
Sentiment Analysis
A Baseline Algorithm
Sentiment Classification in Movie Reviews
• Polarity detection: is an IMDB movie review positive or negative?
• Data: Polarity Data 2.0: http://www.cs.cornell.edu/people/pabo/movie-review-data

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. EMNLP-2002, 79-86.
Bo Pang and Lillian Lee. 2004. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. ACL, 271-278.
IMDB data in the Pang and Lee database

✓ (positive): when _star wars_ came out some twenty years ago , the image of traveling throughout the stars has become a commonplace image . […] when han solo goes light speed , the stars change to bright lines , going towards the viewer in lines that converge at an invisible point . cool . _october sky_ offers a much simpler image – that of a single white dot , traveling horizontally across the night sky . […]

✗ (negative): “ snake eyes ” is the most aggravating kind of movie : the kind that shows so much potential then becomes unbelievably disappointing . it’s not just because this is a brian depalma film , and since he’s a great director and one who’s films are always greeted with at least some fanfare . and it’s not even because this was a film starring nicolas cage and since he gives a brauvara performance , this film is hardly worth his talents . (raw tokenized review text; spelling as in the data)
Baseline Algorithm (adapted from Pang and Lee)
• Tokenization
• Feature extraction
• Classification using different classifiers:
  • Naive Bayes
  • MaxEnt
  • SVM
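A minimal sketch of this pipeline in Python, using scikit-learn as a stand-in (my choice of library; the original Pang and Lee experiments used their own implementations, and train_texts/train_labels are hypothetical placeholders for the Polarity Data reviews):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["just plain boring", "the most fun film of the summer"]  # placeholder reviews
train_labels = ["neg", "pos"]                                           # placeholder labels

model = make_pipeline(
    CountVectorizer(),  # tokenization + bag-of-words feature extraction
    MultinomialNB(),    # swap in LogisticRegression or LinearSVC for MaxEnt / SVM
)
model.fit(train_texts, train_labels)
print(model.predict(["predictable with no fun"]))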
Sentiment Tokenization Issues
• Deal with HTML and XML markup
• Twitter mark-up (names, hashtags)
• Capitalization (preserve for words in all caps)
• Phone numbers, dates
• Emoticons
• Useful code:
  • Christopher Potts sentiment tokenizer
  • Brendan O’Connor twitter tokenizer

Potts emoticons (regex fragment):
[<>]?                        # optional hat/brow
[:;=8]                       # eyes
[\-o\*\']?                   # optional nose
[\)\]\(\[dDpP/\:\}\{@\|\\]   # mouth
|                            # reverse orientation
[\)\]\(\[dDpP/\:\}\{@\|\\]   # mouth
[\-o\*\']?                   # optional nose
[:;=8]                       # eyes
[<>]?                        # optional hat/brow
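A sketch of how the Potts pattern can be used in Python; re.VERBOSE keeps the inline comments legal inside the pattern (the pattern is my reconstruction of the flattened fragment above):

import re

EMOTICON = re.compile(r"""
    (?:
      [<>]?                          # optional hat/brow
      [:;=8]                         # eyes
      [\-o\*\']?                     # optional nose
      [\)\]\(\[dDpP/\:\}\{@\|\\]     # mouth
    |                                # reverse orientation
      [\)\]\(\[dDpP/\:\}\{@\|\\]     # mouth
      [\-o\*\']?                     # optional nose
      [:;=8]                         # eyes
      [<>]?                          # optional hat/brow
    )""", re.VERBOSE)

print(EMOTICON.findall("great movie :-) but the ending 8/ hmm (-:"))
# -> [':-)', '8/', '(-:']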
Extracting Features for Sentiment Classification
• How to handle negation?
  • I didn’t like this movie
  vs.
  • Don’t dismiss this film
Negation
Add NOT_ to every word between a negation and the following punctuation:

didn’t like this movie , but I
→ didn’t NOT_like NOT_this NOT_movie but I
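A sketch of this transformation in Python (my implementation of the rule above; note that this version keeps the punctuation token itself in the output):

def mark_negation(text):
    # Prepend NOT_ to every token between a negation word and the next
    # punctuation mark (the Das & Chen / Pang et al. baseline rule).
    out, negating = [], False
    for tok in text.split():
        if tok in ",.!?;:":          # punctuation closes the negation scope
            negating = False
            out.append(tok)
            continue
        out.append("NOT_" + tok if negating else tok)
        if tok.lower() in ("not", "no", "never") or tok.lower().endswith(("n't", "n’t")):
            negating = True          # open the scope after a negation token
    return " ".join(out)

print(mark_negation("didn’t like this movie , but I"))
# -> didn’t NOT_like NOT_this NOT_movie , but I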
Das, Sanjiv and Mike Chen. 2001. Yahoo! for Amazon: Extracting market sentiment from stock message boards. In Proceedings of the Asia Pacific Finance Association Annual Conference (APFA).
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. EMNLP-2002, 79-86.
Extracting Features for Sentiment Classification
• Which words to use?
  • Only adjectives
  • All words
• All words turns out to work better, at least on this data
Reminder: Naive Bayes

c_NB = argmax_{c_j∈C} P(c_j) ∏_{i∈positions} P(w_i|c_j)
Reminder: Naive Bayes

Let N_c be the number of documents with class c, and N_doc the total number of documents.

From the textbook (Chapter 6, Naive Bayes and Sentiment Classification):

positions ← all word positions in the test document

c_NB = argmax_{c∈C} P(c) ∏_{i∈positions} P(w_i|c)    (6.9)

Naive Bayes calculations, like calculations for language modeling, are done in log space, to avoid underflow and increase speed. Thus Eq. 6.9 is generally instead expressed as

c_NB = argmax_{c∈C} [ log P(c) + Σ_{i∈positions} log P(w_i|c) ]    (6.10)

By considering features in log space, Eq. 6.10 computes the predicted class as a linear function of the input features. Classifiers that use a linear combination of the inputs to make a classification decision — like naive Bayes and also logistic regression — are called linear classifiers.

6.2 Training the Naive Bayes Classifier

How can we learn the probabilities P(c) and P(f_i|c)? Let's first consider the maximum likelihood estimate. We'll simply use the frequencies in the data. For the document prior P(c) we ask what percentage of the documents in our training set are in each class c. Let N_c be the number of documents in our training data with class c and N_doc be the total number of documents. Then:

P̂(c) = N_c / N_doc    (6.11)

To learn the probability P(f_i|c), we'll assume a feature is just the existence of a word in the document's bag of words, and so we'll want P(w_i|c), which we compute as the fraction of times the word w_i appears among all words in all documents of topic c. We first concatenate all documents with category c into one big "category c" text. Then we use the frequency of w_i in this concatenated document to give a maximum likelihood estimate of the probability:

P̂(w_i|c) = count(w_i, c) / Σ_{w∈V} count(w, c)    (6.13)

Here the vocabulary V consists of the union of all the word types in all classes, not just the words in one class c.

There is a problem, however, with maximum likelihood training. Imagine we are trying to estimate the likelihood of the word "fantastic" given class positive, but suppose there are no training documents that both contain the word "fantastic" and are classified as positive. Perhaps the word "fantastic" happens to occur (sarcastically?) in the class negative. In such a case the probability for this feature will be zero.
Reminder: Naive Bayes
• Likelihoods: P̂(w_i|c) = count(w_i, c) / Σ_{w∈V} count(w, c)
• What about zeros? Suppose "fantastic" never occurs in a class?
• Add-one (Laplace) smoothing: P̂(w_i|c) = (count(w_i, c) + 1) / (Σ_{w∈V} count(w, c) + |V|)
From the textbook, continuing:

P̂("fantastic"|positive) = count("fantastic", positive) / Σ_{w∈V} count(w, positive) = 0    (6.14)

But since naive Bayes naively multiplies all the feature likelihoods together, zero probabilities in the likelihood term for any class will cause the probability of the class to be zero, no matter the other evidence!

The simplest solution is the add-one (Laplace) smoothing introduced in Chapter 4. While Laplace smoothing is usually replaced by more sophisticated smoothing algorithms in language modeling, it is commonly used in naive Bayes text categorization:

P̂(w_i|c) = (count(w_i, c) + 1) / Σ_{w∈V} (count(w, c) + 1) = (count(w_i, c) + 1) / (Σ_{w∈V} count(w, c) + |V|)    (6.15)

Note once again that it is crucial that the vocabulary V consists of the union of all the word types in all classes, not just the words in one class c (try to convince yourself why this must be true; see the exercise at the end of the chapter).

What do we do about words that occur in our test data but are not in our vocabulary at all because they did not occur in any training document in any class? The standard solution for such unknown words is to ignore them: remove them from the test document and not include any probability for them at all.

Finally, some systems choose to completely ignore another class of words: stop words, very frequent words like the and a. This can be done by sorting the vocabulary by frequency in the training set and defining the top 10-100 vocabulary entries as stop words, or alternatively by using one of the many predefined stop word lists available online. Then every instance of these stop words is simply removed from both training and test documents as if it had never occurred. In most text classification applications, however, using a stop word list doesn't improve performance, and so it is more common to make use of the entire vocabulary and not use a stop word list.

Fig. 6.2 shows the final algorithm.

6.3 Worked Example

Let's walk through an example of training and testing naive Bayes with add-one smoothing. We'll use a sentiment analysis domain with the two classes positive (+) and negative (−), and take the following miniature training and test documents simplified from actual movie reviews.

Cat       Documents
Training  −  just plain boring
          −  entirely predictable and lacks energy
          −  no surprises and very few laughs
          +  very powerful
          +  the most fun film of the summer
Test      ?  predictable with no fun

The prior P(c) for the two classes is computed via Eq. 6.11 as N_c/N_doc:
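Filling in the arithmetic (these numbers follow from the miniature training set above; the P(+) line reappears at the start of the next excerpt):

P(−) = 3/5    P(+) = 2/5

The test word "with" does not occur in the training vocabulary, so it is dropped. With add-one smoothing (14 negative tokens, 9 positive tokens, |V| = 20):

P("predictable"|−) = (1+1)/(14+20) = 2/34    P("predictable"|+) = (0+1)/(9+20) = 1/29
P("no"|−) = (1+1)/(14+20) = 2/34             P("no"|+) = (0+1)/(9+20) = 1/29
P("fun"|−) = (0+1)/(14+20) = 1/34            P("fun"|+) = (1+1)/(9+20) = 2/29

P(−)P(S|−) = 3/5 × (2 × 2 × 1)/34³ ≈ 6.1 × 10⁻⁵
P(+)P(S|+) = 2/5 × (1 × 1 × 2)/29³ ≈ 3.2 × 10⁻⁵

The model therefore predicts negative.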
Binarized (Boolean-feature) Multinomial Naive Bayes
• Intuition: for sentiment (and probably for other text classification domains), word occurrence may matter more than word frequency
  • The occurrence of the word fantastic tells us a lot
  • The fact that it occurs 5 times may not tell us much more
• “Binary naive Bayes”: clip all the word counts in each document at 1
Boolean Multinomial Naive Bayes: Learning
• From the training corpus, extract the Vocabulary
• Calculate the P(c_j) terms:
  • For each c_j in C do:
    docs_j ← all docs with class = c_j
    P(c_j) ← |docs_j| / |total # documents|
• Calculate the P(w_k|c_j) terms:
  • Remove duplicates in each doc: for each word type w in doc_j, retain only a single instance of w
  • Text_j ← single doc containing all docs_j
  • For each word w_k in Vocabulary:
    n_k ← # of occurrences of w_k in Text_j
    P(w_k|c_j) ← (n_k + α) / (n + α|Vocabulary|)
Boolean Multinomial Naive Bayes (binary NB) on a test document d
• First remove all duplicate words from d
• Then compute NB using the same equation:

c_NB = argmax_{c_j∈C} P(c_j) ∏_{i∈positions} P(w_i|c_j)
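And the matching decision rule, paired with the training sketch above (unknown words are dropped, as in the textbook excerpt earlier):

def predict_binary_nb(doc, log_prior, log_likelihood, vocab):
    words = set(doc.split())  # remove duplicate words from the test document too
    return max(log_prior, key=lambda c: log_prior[c] +
               sum(log_likelihood[w, c] for w in words if w in vocab))

# e.g.: predict_binary_nb("predictable with no fun", *train_binary_nb(docs))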
Normal vs. Binary NB

From the textbook, the worked example concludes:

P(+)P(S|+) = 2/5 × (1 × 1 × 2)/29³ = 3.2 × 10⁻⁵

The model thus predicts the class negative for the test sentence.

6.4 Optimizing for Sentiment Analysis

While standard naive Bayes text classification can work well for sentiment analysis, some small changes are generally employed that improve performance.

First, for sentiment classification and a number of other text classification tasks, whether a word occurs or not seems to matter more than its frequency. Thus it often improves performance to clip the word counts in each document at 1. This variant is called binary multinomial naive Bayes or binary NB. The variant uses the same Eq. 6.10 except that for each document we remove all duplicate words before concatenating them into the single big document. Fig. 6.3 shows an example in which a set of four documents (shortened and text-normalized for this example) are remapped to binary, with the modified counts shown in the table on the right. The example is worked without add-1 smoothing to make the differences clearer. Note that the resulting counts need not be 1; the word great has a count of 2 even for binary NB, because it appears in multiple documents.

Four original documents:
−  it was pathetic the worst part was the boxing scenes
−  no plot twists or great scenes
+  and satire and great plot twists
+  great scenes great film

After per-document binarization:
−  it was pathetic the worst part boxing scenes
−  no plot twists or great scenes
+  and satire great plot twists
+  great scenes film

            NB counts    binary counts
              +   −        +   −
and           2   0        1   0
boxing        0   1        0   1
film          1   0        1   0
great         3   1        2   1
it            0   1        0   1
no            0   1        0   1
or            0   1        0   1
part          0   1        0   1
pathetic      0   1        0   1
plot          1   1        1   1
satire        1   0        1   0
scenes        1   2        1   2
the           0   2        0   1
twists        1   1        1   1
was           0   2        0   1
worst         0   1        0   1

Figure 6.3: An example of binarization for the binary naive Bayes algorithm.

A second important addition commonly made when doing text classification for sentiment is to deal with negation. Consider the difference between I really like this movie (positive) and I didn’t like this movie (negative). The negation expressed by didn’t completely alters the inferences we draw from the predicate like. Similarly, negation can modify a negative word to produce a positive review (don’t dismiss this film, doesn’t let us get bored).

A very simple baseline that is commonly used in sentiment analysis to deal with negation is, during text normalization, to prepend the prefix NOT to every word after a token of logical negation (n’t, not, no, never) until the next punctuation mark. Thus the phrase didn’t like this movie , but I becomes didn’t NOT_like NOT_this NOT_movie , but I.
Binary NB
• Binary works better than full word counts for sentiment classification

B. Pang, L. Lee, and S. Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. EMNLP-2002, 79-86.
Wang, Sida, and Christopher D. Manning. 2012. Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. Proceedings of ACL, 90-94.
Cross-Validation
• Break up data into 10 folds
  • (Equal positive and negative inside each fold?)
• For each fold:
  • Choose the fold as a temporary test set
  • Train on the other 9 folds, compute performance on the test fold
• Report average performance of the 10 runs (a code sketch follows below)
[Figure: fold rotation across iterations — in each iteration a different fold serves as the test set and the remaining folds form the training set.]
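A sketch of the procedure with scikit-learn (an assumption on my part; X is a document-feature matrix, y an array of labels, and StratifiedKFold addresses the "equal positive and negative inside each fold" concern):

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB

def cross_validate(X, y, n_folds=10):
    scores = []
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(X, y):
        clf = MultinomialNB().fit(X[train_idx], y[train_idx])  # train on the other folds
        scores.append(clf.score(X[test_idx], y[test_idx]))     # test on the held-out fold
    return np.mean(scores)                                     # average over the runs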
Other issues in Classification
• Logistic regression and SVMs tend to do better than naive Bayes
Problems: What makes reviews hard to classify?
• Subtlety:
  • Perfume review in Perfumes: The Guide: “If you are reading this because it is your darling fragrance, please wear it at home exclusively, and tape the windows shut.”
  • Dorothy Parker on Katharine Hepburn: “She runs the gamut of emotions from A to B”
Thwarted Expectations and Ordering Effects
• “This film should be brilliant. It sounds like a great plot, the actors are first grade, and the supporting cast is good as well, and Stallone is attempting to deliver a good performance. However, it can’t hold up.”
• “Well as usual Keanu Reeves is nothing special, but surprisingly, the very talented Laurence Fishbourne is not so good either, I was surprised.”
Sentiment Analysis
A Baseline Algorithm
Sentiment Analysis
Sentiment Lexicons
The General Inquirer
• Home page: http://www.wjh.harvard.edu/~inquirer
• List of categories: http://www.wjh.harvard.edu/~inquirer/homecat.htm
• Spreadsheet: http://www.wjh.harvard.edu/~inquirer/inquirerbasic.xls
• Categories:
  • Positiv (1915 words) and Negativ (2291 words)
  • Strong vs. Weak, Active vs. Passive, Overstated vs. Understated
  • Pleasure, Pain, Virtue, Vice, Motivation, Cognitive Orientation, etc.
• Free for research use

Philip J. Stone, Dexter C. Dunphy, Marshall S. Smith, Daniel M. Ogilvie. 1966. The General Inquirer: A Computer Approach to Content Analysis. MIT Press.
LIWC (Linguistic Inquiry and Word Count)
Pennebaker, J.W., Booth, R.J., & Francis, M.E. 2007. Linguistic Inquiry and Word Count: LIWC 2007. Austin, TX.
• Home page: http://www.liwc.net/
• 2300 words, >70 classes
• Affective processes:
  • negative emotion (bad, weird, hate, problem, tough)
  • positive emotion (love, nice, sweet)
• Cognitive processes:
  • tentative (maybe, perhaps, guess), inhibition (block, constraint)
• Pronouns, negation (no, never), quantifiers (few, many)
• $30 or $90 fee
MPQA Subjectivity Cues Lexicon
• Home page: http://www.cs.pitt.edu/mpqa/subj_lexicon.html
• 6885 words from 8221 lemmas
  • 2718 positive
  • 4912 negative
• Each word annotated for intensity (strong, weak)
• GNU GPL

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. Proc. of HLT-EMNLP-2005.
Riloff and Wiebe. 2003. Learning extraction patterns for subjective expressions. EMNLP-2003.
Bing Liu Opinion Lexicon
• Bing Liu's page on opinion mining: http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
• 6786 words
  • 2006 positive
  • 4783 negative

Minqing Hu and Bing Liu. Mining and Summarizing Customer Reviews. ACM SIGKDD-2004.
Analyzing the polarity of each word in IMDB
• How likely is each word to appear in each sentiment class?
• Count(“bad”) in 1-star, 2-star, 3-star, etc.
• But we can’t use raw counts. Instead, the likelihood:

P(w|c) = f(w, c) / Σ_{w∈c} f(w, c)

• To make them comparable between words, the scaled likelihood:

P(w|c) / P(w)

Potts, Christopher. 2011. On the negativity of negation. SALT 20, 636-659.
Analyzing the polarity of each word in IMDB
Potts, Christopher. 2011. On the negativity of negation. SALT 20, 636-659.
[Potts diagrams (Potts, Christopher. 2011. NSF workshop on restructuring adjectives): panels of scaled likelihood by rating category across four corpora (IMDB, OpenTable, Goodreads, Amazon/Tripadvisor), for positive scalars (good, great, excellent), negative scalars (disappointing, bad, terrible), emphatics (totally, absolutely, utterly), and attenuators (somewhat, fairly, pretty). The attenuator panels report negative quadratic (Cat²) regression coefficients, i.e. middle-peaked curves.]
Other sentiment feature: logical negation
• Is logical negation (no, not) associated with negative sentiment?
• Potts experiment:
  • Count negation (not, n’t, no, never) in online reviews
  • Regress against the review rating

Potts, Christopher. 2011. On the negativity of negation. SALT 20, 636-659.
Potts 2011 results: more negation in negative sentiment
[Figure: scaled likelihood P(w|c)/P(w) of negation tokens plotted by rating category; negation grows more frequent as ratings become more negative.]
Sentiment Analysis
Sentiment Lexicons
Sentiment Analysis
Learning Sentiment Lexicons
Semi-supervised learning of lexicons
• What to do for domains where you don’t have a lexicon? Learn a lexicon!
• Use a small amount of information:
  • A few labeled examples
  • A few hand-built patterns
• …to bootstrap a lexicon
Semi-supervised learning of lexicons

From the textbook (draft chapter on sentiment lexicons):

The General Inquirer is a freely available web resource with lexicons of 1915 positive words and 2291 negative words (and also includes other lexicons we'll discuss in the next section).

The MPQA Subjectivity lexicon (Wilson et al., 2005) has 2718 positive and 4912 negative words drawn from a combination of sources, including the General Inquirer lists, the output of the Hatzivassiloglou and McKeown (1997) system described below, and a bootstrapped list of subjective words and phrases (Riloff and Wiebe, 2003) that was then hand-labeled for sentiment. Each phrase in the lexicon is also labeled for reliability (strongly subjective or weakly subjective). The polarity lexicon of Hu and Liu (2004) gives 2006 positive and 4783 negative words, drawn from product reviews, labeled using a bootstrapping method from WordNet described in the next section.

Positive: admire, amazing, assure, celebration, charm, eager, enthusiastic, excellent, fancy, fantastic, frolic, graceful, happy, joy, luck, majesty, mercy, nice, patience, perfect, proud, rejoice, relief, respect, satisfactorily, sensational, super, terrific, thank, vivid, wise, wonderful, zest

Negative: abominable, anger, anxious, bad, catastrophe, cheap, complaint, condescending, deceit, defective, disappointment, embarrass, fake, fear, filthy, fool, guilt, hate, idiot, inflict, lazy, miserable, mourn, nervous, objection, pest, plot, reject, scream, silly, terrible, unfriendly, vile, wicked

Figure 18.2: Some samples of words with consistent sentiment across three sentiment lexicons: the General Inquirer (Stone et al., 1966), the MPQA Subjectivity lexicon (Wilson et al., 2005), and the polarity lexicon of Hu and Liu (2004).

18.2 Semi-supervised Induction of Sentiment Lexicons

Some affective lexicons are built by having humans assign ratings to words; this was the technique for building the General Inquirer starting in the 1960s (Stone et al., 1966), and for modern lexicons based on crowd-sourcing described in Section 18.5.1. But one of the most powerful ways to learn lexicons is to use semi-supervised learning.

In this section we introduce three methods for semi-supervised learning that are important in sentiment lexicon extraction. The three methods all share the same intuitive algorithm, sketched in Fig. 18.3:

function BuildSentimentLexicon(posseeds, negseeds) returns poslex, neglex
  poslex ← posseeds
  neglex ← negseeds
  until done:
    poslex ← poslex + FindSimilarWords(poslex)
    neglex ← neglex + FindSimilarWords(neglex)
  poslex, neglex ← PostProcess(poslex, neglex)

Figure 18.3: Schematic for semi-supervised sentiment lexicon induction. Different algorithms differ in how words of similar polarity are found, in the stopping criterion, and in the post-processing.
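A direct Python transcription of the schematic (find_similar_words and post_process are placeholders for the algorithm-specific pieces; a fixed iteration count stands in for "until done"):

def build_sentiment_lexicon(pos_seeds, neg_seeds,
                            find_similar_words, post_process, n_iters=5):
    poslex, neglex = set(pos_seeds), set(neg_seeds)
    for _ in range(n_iters):
        poslex |= find_similar_words(poslex)  # grow each pole from its seeds
        neglex |= find_similar_words(neglex)
    return post_process(poslex, neglex)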
Hatzivassiloglou and McKeown intuition for identifying word polarity
• Adjectives conjoined by “and” have the same polarity
  • fair and legitimate, corrupt and brutal
  • *fair and brutal, *corrupt and legitimate
• Adjectives conjoined by “but” do not
  • fair but brutal

Vasileios Hatzivassiloglou and Kathleen R. McKeown. 1997. Predicting the Semantic Orientation of Adjectives. ACL, 174-181.
Hatzivassiloglou & McKeown 1997, Step 1
• Label a seed set of 1336 adjectives (all adjectives with frequency > 20 in a 21-million-word WSJ corpus)
  • 657 positive: adequate central clever famous intelligent remarkable reputed sensitive slender thriving…
  • 679 negative: contagious drunken ignorant lanky listless primitive strident troublesome unresolved unsuspecting…
Hatzivassiloglou & McKeown 1997, Step 2
• Expand the seed set to conjoined adjectives found in text, e.g.:
  nice, helpful
  nice, classy
Hatzivassiloglou & McKeown 1997, Step 3
• A supervised classifier assigns “polarity similarity” to each word pair, resulting in a graph:
[Graph: nodes classy, nice, helpful, fair, brutal, irrational, corrupt, connected by polarity-similarity edges.]
Hatzivassiloglou & McKeown 1997, Step 4
• Clustering partitions the graph into two poles:
[Graph: the same nodes split into a positive cluster (+: classy, nice, helpful, fair) and a negative cluster (−: brutal, irrational, corrupt).]
Output polarity lexicon
• Positive: bold decisive disturbing generous good honest important large mature patient peaceful positive proud sound stimulating straightforward strange talented vigorous witty…
• Negative: ambiguous cautious cynical evasive harmful hypocritical inefficient insecure irrational irresponsible minor outspoken pleasant reckless risky selfish tedious unsupported vulnerable wasteful…
Turney Algorithm
1. Extract a phrasal lexicon from reviews
2. Learn the polarity of each phrase
3. Rate a review by the average polarity of its phrases

Turney. 2002. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews.
Extract two-word phrases with adjectives

First Word         Second Word          Third Word (not extracted)
JJ                 NN or NNS            anything
RB, RBR, or RBS    JJ                   not NN nor NNS
JJ                 JJ                   not NN nor NNS
NN or NNS          JJ                   not NN nor NNS
RB, RBR, or RBS    VB, VBD, VBN, VBG    anything
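A sketch of this extraction step in Python using NLTK's tokenizer and POS tagger (my choice of tools; Turney used a different tagger, and tagger output may vary):

import nltk  # requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

ADV = {"RB", "RBR", "RBS"}
NOUN = {"NN", "NNS"}
VERB = {"VB", "VBD", "VBN", "VBG"}

def extract_phrases(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(text)) + [("</s>", "END")]  # sentinel 3rd tag
    phrases = []
    for (w1, t1), (w2, t2), (_, t3) in zip(tagged, tagged[1:], tagged[2:]):
        ok = ((t1 == "JJ" and t2 in NOUN) or
              (t1 in ADV and t2 == "JJ" and t3 not in NOUN) or
              (t1 == "JJ" and t2 == "JJ" and t3 not in NOUN) or
              (t1 in NOUN and t2 == "JJ" and t3 not in NOUN) or
              (t1 in ADV and t2 in VERB))
        if ok:
            phrases.append(w1 + " " + w2)
    return phrases

print(extract_phrases("the online service was unbelievably disappointing"))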
How to measure the polarity of a phrase?
• Positive phrases co-occur more with “excellent”
• Negative phrases co-occur more with “poor”
• But how to measure co-occurrence?
Pointwise Mutual Information
• Mutual information between two random variables X and Y:

I(X, Y) = Σ_x Σ_y P(x, y) log₂ [ P(x, y) / (P(x)P(y)) ]

• Pointwise mutual information: how much more do events x and y co-occur than if they were independent?

PMI(x, y) = log₂ [ P(x, y) / (P(x)P(y)) ]
Pointwise Mutual Information
• Pointwise mutual information: how much more do events x and y co-occur than if they were independent?

PMI(x, y) = log₂ [ P(x, y) / (P(x)P(y)) ]

• PMI between two words: how much more do two words co-occur than if they were independent?

PMI(word₁, word₂) = log₂ [ P(word₁, word₂) / (P(word₁)P(word₂)) ]
How to Estimate Pointwise Mutual Information
• Query a search engine:
  • P(word) estimated by hits(word)/N
  • P(word₁, word₂) estimated by hits(word₁ NEAR word₂)/N
  • (More correctly, the bigram denominator should be kN, because there are a total of N consecutive bigrams (word₁, word₂) but kN bigrams that are k words apart; we just use N on the rest of this slide and the next.)

PMI(word₁, word₂) = log₂ [ (1/N) hits(word₁ NEAR word₂) / ((1/N) hits(word₁) · (1/N) hits(word₂)) ]
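As a sketch in Python (hits is a hypothetical function wrapping the search-engine query count, N the total number of documents indexed):

import math

def pmi(word1, word2, hits, N):
    p_joint = hits(word1 + " NEAR " + word2) / N
    p1, p2 = hits(word1) / N, hits(word2) / N
    return math.log2(p_joint / (p1 * p2))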
Does the phrase appear more with “poor” or “excellent”?

Polarity(phrase) = PMI(phrase, "excellent") − PMI(phrase, "poor")

= log₂ [ (1/N) hits(phrase NEAR "excellent") / ((1/N) hits(phrase) · (1/N) hits("excellent")) ] − log₂ [ (1/N) hits(phrase NEAR "poor") / ((1/N) hits(phrase) · (1/N) hits("poor")) ]

= log₂ [ hits(phrase NEAR "excellent") · hits("poor") / (hits(phrase NEAR "poor") · hits("excellent")) ]
Phrases from a thumbs-up review

Phrase                    POS tags    Polarity
online service            JJ NN        2.8
online experience         JJ NN        2.3
direct deposit            JJ NN        1.3
local branch              JJ NN        0.42
…
low fees                  JJ NNS       0.33
true service              JJ NN       -0.73
other bank                JJ NN       -0.85
inconveniently located    RB VBN      -1.5
Average                                0.32
Phrases from a thumbs-down review

Phrase                POS tags    Polarity
direct deposits       JJ NNS       5.8
online web            JJ NN        1.9
very handy            RB JJ        1.4
…
virtual monopoly      JJ NN       -2.0
lesser evil           RBR JJ      -2.3
other problems        JJ NNS      -2.8
low funds             JJ NNS      -6.8
unethical practices   JJ NNS      -8.5
Average                           -1.2
Results of Turney algorithm
• 410 reviews from Epinions
  • 170 (41%) negative
  • 240 (59%) positive
• Majority-class baseline: 59%
• Turney algorithm: 74%
• Phrases rather than words
• Learns domain-specific information
Summary on Learning Lexicons
• Why:
  • Learn a lexicon that is specific to a domain
  • Learn a lexicon with more words (more robust) than off-the-shelf
• Intuition:
  • Start with a seed set of words (‘good’, ‘poor’)
  • Find other words that have similar polarity:
    • Using “and” and “but”
    • Using words that occur nearby in the same document
  • Add them to the lexicon
Sentiment Analysis
Learning Sentiment Lexicons
Sentiment Analysis
Other Sentiment Tasks
Finding sentiment of a sentence
• Important for finding aspects or attributes (the target of sentiment)
• “The food was great but the service was awful”
Finding the aspect/attribute/target of sentiment
• Frequent phrases + rules:
  • Find all highly frequent phrases across reviews (“fish tacos”)
  • Filter by rules like “occurs right after a sentiment word” (sketched in code below)
  • “…great fish tacos” means fish tacos is a likely aspect

Casino              casino, buffet, pool, resort, beds
Children's Barber   haircut, job, experience, kids
Greek Restaurant    food, wine, service, appetizer, lamb
Department Store    selection, department, sales, shop, clothing

M. Hu and B. Liu. 2004. Mining and summarizing customer reviews. In Proceedings of KDD.
S. Blair-Goldensohn, K. Hannan, R. McDonald, T. Neylon, G. Reis, and J. Reynar. 2008. Building a Sentiment Summarizer for Local Service Reviews. WWW Workshop.
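A sketch of the frequent-phrases-plus-rules idea (reviews as lists of tokens and sentiment_words as a lexicon set are hypothetical inputs):

from collections import Counter

def candidate_aspects(reviews, sentiment_words, min_count=10):
    counts, after_sentiment = Counter(), Counter()
    for toks in reviews:
        for i in range(len(toks) - 1):
            bigram = toks[i] + " " + toks[i + 1]
            counts[bigram] += 1                      # overall phrase frequency
            if i > 0 and toks[i - 1] in sentiment_words:
                after_sentiment[bigram] += 1         # e.g. "...great fish tacos"
    return [p for p, c in counts.items()
            if c >= min_count and after_sentiment[p] > 0]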
Finding the aspect/attribute/target of sentiment
• The aspect name may not be in the sentence
• For restaurants/hotels, aspects are well-understood
• Supervised classification:
  • Hand-label a small corpus of restaurant review sentences with their aspect: food, décor, service, value, NONE
  • Train a classifier to assign an aspect to a sentence: “Given this sentence, is the aspect food, décor, service, value, or NONE?”
Putting it all together: Finding sentiment for aspects

[Pipeline diagram: Reviews → Text Extractor → sentences & phrases → Sentiment Classifier and Aspect Extractor → Aggregator → Final Summary.]

S. Blair-Goldensohn, K. Hannan, R. McDonald, T. Neylon, G. Reis, and J. Reynar. 2008. Building a Sentiment Summarizer for Local Service Reviews. WWW Workshop.
Results of Blair-Goldensohn et al. method

Rooms (3/5 stars, 41 comments)
(+) The room was clean and everything worked fine – even the water pressure ...
(+) We went because of the free room and was pleasantly pleased ...
(−) ... the worst hotel I had ever stayed at ...

Service (3/5 stars, 31 comments)
(+) Upon checking out another couple was checking early due to a problem ...
(+) Every single hotel staff member treated us great and answered every ...
(−) The food is cold and the service gives new meaning to SLOW.

Dining (3/5 stars, 18 comments)
(+) our favorite place to stay in biloxi. the food is great also the service ...
(+) Offer of free buffet for joining the Play
Summary on Sentiment
• Generally modeled as a classification or regression task: predict a binary or ordinal label
• Features:
  • Negation is important
  • Using all words (in naive Bayes) works well for some tasks
  • Finding subsets of words may help in other tasks
  • Hand-built polarity lexicons
  • Use seeds and semi-supervised learning to induce lexicons