A Survey of Sentiment Analysis

A Survey of Sentiment Analysis Blockseminar “Intelligente Softwaresysteme” 2013/14 TU Berlin 7 Feb 2014 Moritz Platt


Sentiment analysis refers to a set of natural language processing technologies used to extract subjective information from a body of text. While sentiment analysis offers significant insight into public opinion, current implementations leave considerable room for improvement, and the field remains a nascent area of research. This survey gives a brief overview of the technologies commonly used to approach problems in sentiment analysis, paying particular attention to the challenges posed by user-generated content in social media. It seeks to show which technologies are promising in the field in general and for user-generated content in particular.


Page 2: A Survey of Sentiment Analysis

Agenda

Introduction

Algorithms

Benchmarks

Outlook


Sentiment Analysis is an NLP Task

• Sentiment Analysis = Opinion Mining = Subjectivity Analysis
• Extracts opinions on objects from text
• Works on natural language corpora
• Research problem with many applications
• Relatively new, rapidly developing research area
• Related fields:
  • Natural Language Processing
  • Social Media Analysis
  • Text Mining
  • Data Mining


Accessing Opinions: Now and Then

Dot-Com Era and Beyond
• Huge stream of opinionated text
  • 1.2 million daily blog posts [Zabin2008]
  • 45 million daily “status updates” on Facebook [Thomas2010]
• Often featuring opinions towards products or persons

Pre Dot-Com Era
• Extensive measures
  • Surveys
  • Opinion polls
  • Focus groups


Where are today’s opinionated texts coming from?

• Social Networks
• Blogs
• Reviews


The Relationship Between Opinion Holders and Objects

• Consider a set of product reviews for a particular model of a cellular phone
• Edges between opinion holders and features represent opinions
• The time aspect is usually omitted

Object o: a particular model of a cellular phone
Feature f: the voice quality of a particular model of a cellular phone

Opinion holders, their opinionated text on feature f, and the resulting sentiment value:
• John: “Voice quality is wonderful.” (Positive)
• Jack: “Voice sounds terrible.” (Negative)
• James: “Speech quality is average.” (Neutral)


The Aspects of Opinions

Structure of an opinion as defined by Liu [Liu2010]:

$(o_j, f_{jk}, so_{ijkl}, h_i, t_l)$

• Object $o_j$: the target of an opinion (e.g. product, person, event, organisation, topic)
• Feature $f_{jk}$: a component or attribute of an object (e.g. battery life, camera resolution)
• Sentiment value $so_{ijkl}$: the orientation of an opinion from a set of possible choices (e.g. positive, negative, neutral)
• Opinion holder $h_i$: the person expressing the opinion
• Time $t_l$: the time at which the opinion is expressed
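The five components of an opinion map naturally onto a small record type. The following is a minimal sketch, assuming Python; the class name `Opinion`, its field names, and the example timestamp are illustrative choices and do not come from [Liu2010].

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Opinion:
    """One opinion with the five components listed above (names are hypothetical)."""
    obj: str          # object: the target of the opinion
    feature: str      # feature: component or attribute of the object
    sentiment: str    # sentiment value: e.g. "positive", "negative", "neutral"
    holder: str       # opinion holder: the person expressing the opinion
    time: datetime    # time at which the opinion is expressed


# Hypothetical instance based on the cellular phone example; the timestamp is invented.
opinion = Opinion(
    obj="cellular phone model",
    feature="voice quality",
    sentiment="positive",
    holder="John",
    time=datetime(2014, 2, 7, 12, 0),
)
print(opinion)
```

Making the record immutable (`frozen=True`) is a design choice of this sketch: an opinion, once extracted, is treated as a fact about the text rather than something to be updated.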


Algorithms


Approaching Sentiments Algorithmically

Unsupervised Methods
• No training data
• Cross-domain applications
• Example: Point-Wise Mutual Information

Supervised Methods
• Manually labelled training data
• Usually superior to unsupervised approaches
• Examples: Naïve Bayes Classification, Maximum Entropy Classification, Support Vector Machines


PMI-IR

• PMI: Point-wise mutual information
• IR: Information retrieval
• Introduced in 2002 as an unsupervised learning algorithm for classifying reviews [Turney2002]
• Based on the concept of PMI [Church1990]
• Measures how strongly two words are associated: how much more often they co-occur than would be expected if they were independent

$$\mathrm{PMI}(word_1, word_2) = \log_2 \frac{p(word_1 \,\&\, word_2)}{p(word_1)\, p(word_2)}$$
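The PMI formula can be evaluated directly from co-occurrence counts. The sketch below assumes the probabilities are estimated as relative frequencies over a fixed number of observation windows; the function name `pmi`, its parameters, and the counts are hypothetical.

```python
import math


def pmi(count_both, count_1, count_2, total):
    """Point-wise mutual information in bits, estimated from raw counts.

    count_both: windows containing both words
    count_1, count_2: windows containing each word on its own
    total: total number of windows observed
    """
    p_both = count_both / total
    p_1 = count_1 / total
    p_2 = count_2 / total
    return math.log2(p_both / (p_1 * p_2))


# Hypothetical counts: the two words co-occur far more often than chance predicts.
associated = pmi(count_both=20, count_1=100, count_2=50, total=10_000)

# Hypothetical counts chosen so that the words are statistically independent.
independent = pmi(count_both=1, count_1=10, count_2=10, total=100)

print(associated, independent)
```

A positive value indicates that the words occur together more often than independence would predict, while a value near zero indicates no association.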


PMI-IR

• Turney used the words “poor” and “excellent” as seeds for the algorithm
• SO is the semantic orientation value
  • Positive SO value for phrases more associated with “excellent”
  • Negative SO value for phrases more associated with “poor”
• Improvement of results through the IR component
  • Turney used AltaVista
  • Uses the NEAR operator
  • h(query) is the number of hits returned for the query

$$\mathrm{SO}(phrase) = \mathrm{PMI}(phrase, \text{“excellent”}) - \mathrm{PMI}(phrase, \text{“poor”})$$

$$\mathrm{SO}(phrase) = \log_2 \frac{h(phrase \text{ NEAR “excellent”})\; h(\text{“poor”})}{h(phrase \text{ NEAR “poor”})\; h(\text{“excellent”})}$$
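The hit-count form of SO is easy to compute once the four hit counts are known. The sketch below assumes the counts have already been obtained from some search engine; the function name, the hit counts, and the small smoothing constant (added here only to avoid division by zero) are assumptions of this example.

```python
import math


def semantic_orientation(hits_near_excellent, hits_near_poor,
                         hits_excellent, hits_poor, smoothing=0.01):
    """SO(phrase) from hit counts, following the second formula above.

    The smoothing constant is an assumption of this sketch; it keeps the
    ratio defined when a phrase never appears near one of the seed words.
    """
    numerator = (hits_near_excellent + smoothing) * hits_poor
    denominator = (hits_near_poor + smoothing) * hits_excellent
    return math.log2(numerator / denominator)


# Hypothetical hit counts for two phrases; not real search engine results.
so_positive_phrase = semantic_orientation(120, 15, 1_000_000, 800_000)
so_negative_phrase = semantic_orientation(10, 90, 1_000_000, 800_000)

print(so_positive_phrase, so_negative_phrase)
```

A review could then be classified by averaging the SO values of its extracted phrases and checking the sign of the average.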


Naïve Bayes Classification

• Based on Bayes’ rule [Bayes1763]
• Simple to train, probabilistic, effective
• Operates on a “bag of words” of an input document d
• Fixed set of classes C, e.g. C = {positive, negative}
• d can be reduced by omitting irrelevant words

All Words [Jurafsky2013]

I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!

Opinionated Words [Jurafsky2013]

love, sweet, satirical, great, fun, whimsical, romantic, laughing, recommend, several, happy, again (all other words masked out)
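Reducing a document to a bag of words, optionally restricted to opinion-bearing words, can be sketched in a few lines. The word list `OPINION_WORDS`, the tokenisation rule, and the shortened review text are assumptions of this example; a real system would use a much larger lexicon.

```python
import re
from collections import Counter

# Hypothetical lexicon of opinion-bearing words; real lexicons are far larger.
OPINION_WORDS = {"love", "sweet", "great", "fun", "happy", "recommend"}


def bag_of_words(text, keep=None):
    """Count word occurrences, ignoring order; optionally keep only listed words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if keep is not None:
        tokens = [t for t in tokens if t in keep]
    return Counter(tokens)


# Shortened, hypothetical review text in the spirit of the example above.
review = "I love this movie! It's sweet, and the adventure scenes are fun. I love it."

full_bag = bag_of_words(review)
reduced_bag = bag_of_words(review, keep=OPINION_WORDS)
print(reduced_bag)
```

The reduced bag keeps only the counts of the retained words, which is the input the classifier works with.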


Naïve Bayes at work

1. Estimate P(c) of each class c by dividing the number of words in documents in c by the total number of words in the corpus
2. Estimate P(w|c) for all words w and classes c
3. The score for a document d, consisting of the words $w_1, \dots, w_n$, to be in class c is

$$\mathrm{score}(d, c) = P(c) \prod_{i=1}^{n} P(w_i \mid c)$$

4. The most likely class for a document is the one with the highest score [Potts2011]
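The four steps can be sketched as a small classifier. This is an illustration under stated assumptions, not the implementation from [Potts2011]: the add-one smoothing of P(w|c), the use of log probabilities to avoid numerical underflow, the skipping of unseen words, and the tiny training set are all choices made for this example.

```python
import math
from collections import Counter


def train(labelled_docs):
    """labelled_docs: list of (list_of_words, class_label) pairs."""
    word_counts = {}  # class label -> Counter of word frequencies
    for words, label in labelled_docs:
        word_counts.setdefault(label, Counter()).update(words)
    vocab = {w for counts in word_counts.values() for w in counts}
    total_words = sum(sum(c.values()) for c in word_counts.values())
    # Step 1: P(c) = words in documents of class c / words in the corpus
    prior = {c: sum(counts.values()) / total_words
             for c, counts in word_counts.items()}
    # Step 2: P(w|c), with add-one smoothing (an assumption of this sketch)
    likelihood = {
        c: {w: (counts[w] + 1) / (sum(counts.values()) + len(vocab))
            for w in vocab}
        for c, counts in word_counts.items()
    }
    return prior, likelihood


def classify(words, prior, likelihood):
    scores = {}
    for c in prior:
        # Step 3: log of P(c) * product of P(w|c); unseen words are skipped
        scores[c] = math.log(prior[c]) + sum(
            math.log(likelihood[c][w]) for w in words if w in likelihood[c]
        )
    # Step 4: the class with the highest score wins
    return max(scores, key=scores.get)


# Hypothetical toy training data.
training = [
    (["great", "fun", "love"], "positive"),
    (["sweet", "great"], "positive"),
    (["terrible", "boring"], "negative"),
    (["poor", "terrible", "dull"], "negative"),
]
prior, likelihood = train(training)
print(classify(["great", "fun"], prior, likelihood))
print(classify(["terrible", "dull"], prior, likelihood))
```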


Maximum Entropy Classification

“Ignorance is preferable to error, and he is less remote from the truth who believes nothing than he who believes what is wrong.” (Thomas Jefferson)

• Find weights for the features that maximize the likelihood of the training data
• Add constraints based on training data
• More constraints = less entropy = distribution is closer to the data
• More difficult to implement than Naïve Bayes [Potts2011]
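For two classes, maximum entropy classification coincides with logistic regression, so the weight search can be illustrated with plain gradient ascent on the log-likelihood. The sketch below makes several assumptions: features are word counts, updates are applied one document at a time, and the learning rate, epoch count, and toy data are arbitrary choices for illustration.

```python
import math


def train_maxent(docs, labels, vocab, epochs=200, rate=0.5):
    """Fit one weight per word by gradient ascent on the log-likelihood.

    labels are 1 (positive) or 0 (negative); binary case only.
    """
    weights = {w: 0.0 for w in vocab}
    bias = 0.0
    for _ in range(epochs):
        for words, y in zip(docs, labels):
            z = bias + sum(weights[w] for w in words if w in weights)
            p = 1.0 / (1.0 + math.exp(-z))  # current P(positive | document)
            # The gradient is (observed label - predicted probability) per feature
            for w in words:
                if w in weights:
                    weights[w] += rate * (y - p)
            bias += rate * (y - p)
    return weights, bias


def prob_positive(words, weights, bias):
    z = bias + sum(weights[w] for w in words if w in weights)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical toy training data.
docs = [["great", "fun"], ["love", "great"], ["terrible", "dull"], ["poor", "terrible"]]
labels = [1, 1, 0, 0]
vocab = {w for d in docs for w in d}

weights, bias = train_maxent(docs, labels, vocab)
print(prob_positive(["great"], weights, bias))
print(prob_positive(["terrible"], weights, bias))
```

Unlike Naïve Bayes, the weights are fitted jointly rather than estimated word by word, which is one reason the method takes more effort to implement.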


Support Vector Machines

• Most intuitive for two-class, separable training data sets
• Find a hyperplane (defined by its normal vector) that separates the data sets while maximizing the margin (A vs. B)
• The margin is limited by the support vectors
• Applicable to more complicated problems too, through transformation into higher dimensions
  • n-class space
  • Inseparable training data

[Figure: two-dimensional plot (axes x, y) with labels A and B]
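The margin criterion can be made concrete by measuring how far the closest training point lies from a candidate separator. The sketch below does not train a support vector machine; it only compares two hand-picked candidate lines on hypothetical two-dimensional data, and all names and values are assumptions of this example.

```python
import math


def margin(w, b, points, labels):
    """Smallest signed distance of any point to the hyperplane w.x + b = 0.

    labels are +1 or -1; a positive result means every point lies on the
    side of the hyperplane that matches its label.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )


# Hypothetical, linearly separable two-class data.
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 4.0), (5.0, 4.0)]
labels = [-1, -1, +1, +1]

# Two hypothetical candidate separators, given as (w, b).
first_w, first_b = (1.0, 0.0), -3.0     # vertical line x = 3
second_w, second_b = (1.0, 1.0), -5.5   # diagonal line x + y = 5.5

first_margin = margin(first_w, first_b, points, labels)
second_margin = margin(second_w, second_b, points, labels)
print(first_margin, second_margin)
```

Both candidates separate the data, but the second leaves more room to the closest points, so a maximum-margin method would prefer it. The points that attain the minimum distance play the role of the support vectors.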


Benchmarks


Benchmarking Sentiment Analysis

• Benchmarking NB and ME with in-domain testing [Potts2011]
• Binary classification
• 6,000 restaurant reviews


Benchmarking Sentiment Analysis

• Benchmarking NB and ME with testing on a different domain [Potts2011]
• Binary classification
• Trained on 6,000 restaurant reviews
• Tested on 6,000 product reviews


Outlook


Opinionated Data in the Wild

• Works well under laboratory conditions
  • Proper spelling
  • Highly opinionated
  • Pre-defined object
• Common NLP problems still remain
  • Named entity recognition
  • Context-specific meaning
  • Language ambiguity
• Benchmarking corpora do not reflect real-world data quality


Opinionated Data in the Wild

• Social media data
  • Highly relevant
  • Huge corpus
  • Constantly growing
• Very noisy
  • Questionable text quality
    • Spelling
    • Grammar
  • Spam
  • Unclear context
  • Figurative speech
  • Slang
  • Irony

Authentic status updates from https://www.facebook.com/pages/Canon-Cameras:
• Warren Scott M. (21 January at 18:31): “your mxf format is a joke. DO NOT BUY CANON”
• Leon H. (11 January at 10:58): “Why battery 6L in my Canon sx280 have pretty low life”
• Phil D. (11 January at 04:39): “Youse guys did a solid on my wife's TI3- warranty expired last month, but did the job good! Thanks Canon”
• Cole J. (28 January 2010): “Got a canon gl1 I love it, but a little fuzzy”


Conclusions / Future Work

• Development of algorithms is on the right track
  • Evolution beyond binary classification
  • Algorithms will become more robust on less homogeneous sources
• Industry aims to apply algorithms to noisy data


Appendix


References

[Bayes1763] Bayes, T. An essay towards solving a problem in the doctrine of chances. Phil. Trans. of the Royal Soc. of London, 1763, Vol. 53, pp. 370-418.

[Church1990] Church, K.W. & Hanks, P. Word Association Norms, Mutual Information, and Lexicography. Comput. Linguist., MIT Press, 1990, Vol. 16(1), pp. 22-29.

[Jurafsky2013] Dan Jurafsky, E. Naïve Bayes and Text Classification. 2013.

[Liu2010] Liu, B. Sentiment analysis and subjectivity. Handbook of Natural Language Processing, Second Edition. Taylor and Francis Group, Boca, 2010.

[Potts2011] Potts, C. Sentiment Symposium Tutorial: Classifiers. http://sentiment.christopherpotts.net/classifiers.html, 2011.

[Thomas2010] Thomas, A. & Applegate, J. Pay Attention!: How to Listen, Respond, and Profit from Customer Feedback. Wiley, 2010.

[Turney2002] Turney, P.D. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. Proceedings 40th Annual Meeting of the ACL (2002), 2002, pp. 417-424.

[Zabin2008] Zabin, J. & Jefferies, A. Social Media Monitoring and Analysis: Generating Consumer Insights from Online Conversation. Aberdeen Group Benchmark Report, 2008.


Picture Credit

Icons
Page 8: Arrow by Jamison Wieser from The Noun Project

Photography
Page 1: “Thumbs up on diving down” by JamesHuckaby is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic License. Based on a work at http://www.flickr.com/photos/raveller/1117899371/. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/2.0/legalcode.

Page 3: “Coventry Solihull Warwickshire Sub-Regional Planning Study Questionnaire” by TheJRJamesArchive is licensed under a Creative Commons Attribution-NonCommercial 2.0 Generic License. Based on a work at http://www.flickr.com/photos/jrjamesarchive/9371523446/. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/2.0/legalcode.

Page 14: “Svm intro.svg” by FabianBürger is licensed under a Creative Commons Attribution 3.0 License. Based on a work at http://commons.wikimedia.org/wiki/File:Svm_intro.svg. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/legalcode.
