A Survey of Sentiment Analysis
Block seminar “Intelligente Softwaresysteme” 2013/14, TU Berlin
7 Feb 2014, Moritz Platt
Agenda
• Introduction
• Algorithms
• Benchmarks
• Outlook
Intelligente Softwaresysteme 2013/14 2
Sentiment Analysis is an NLP Task
• Sentiment Analysis = Opinion Mining = Subjectivity Analysis
• Extract opinions on objects from text
• Working on natural language corpora
• Research problem with a lot of applications
• Relatively new research area, rapidly developing field
• Related fields:
  • Natural Language Processing
  • Social Media Analysis
  • Text Mining
  • Data Mining
Introduction > Algorithms > Benchmarks > Outlook
Accessing Opinions: Now and Then
Dot-Com Era and Beyond
• Huge stream of opinionated text
  • 1.2 million daily blog posts [Zabin2008]
  • 45 million daily “status updates” on Facebook [Thomas2010]
• Often featuring opinions towards products or persons
Pre Dot-Com Era
• Extensive measures
  • Surveys
  • Opinion polls
  • Focus groups
Where are today’s opinionated texts coming from?
• Social Networks
• Blogs
• Reviews
The Relationship Between Opinion Holders and Objects
• Edges between opinion holders and features represent opinions
• The time aspect is usually omitted
[Figure: opinion holders (John, Jack, James) are connected to the feature f (the voice quality of a particular model of a cellular phone) of the object o (a particular model of a cellular phone). Each edge carries an opinionated text and its sentiment value: “Voice quality is wonderful.” (positive), “Voice sounds terrible.” (negative), “Speech quality is average.” (neutral).]
• Consider a set of product reviews for a particular model of a cellular phone
The Aspects of Opinions
Structure of an opinion as defined by Liu [Liu2010]:
$(o_j, f_{jk}, so_{ijkl}, h_i, t_l)$
• Object $o_j$: The target of an opinion (e.g. product, person, event, organisation, topic)
• Feature $f_{jk}$: Components/attributes of an object (e.g. battery life, camera resolution)
• Sentiment Value $so_{ijkl}$: The orientation of an opinion from a set of possible choices (e.g. positive, negative, neutral)
• Opinion Holder $h_i$: The person expressing the opinion
• Time $t_l$: The time at which the opinion is expressed
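The quintuple maps naturally onto a small record type. The following is only a sketch: the class name `Opinion`, the helper `sentiments_for`, the dates, and the assignment of statements to John, Jack and James are illustrative assumptions, not part of Liu's definition or of the slides.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Opinion:
    """One opinion quintuple (o_j, f_jk, so_ijkl, h_i, t_l)."""
    obj: str        # o_j: the target of the opinion
    feature: str    # f_jk: the component/attribute being judged
    sentiment: str  # so_ijkl: e.g. "positive", "negative", "neutral"
    holder: str     # h_i: the person expressing the opinion
    time: date      # t_l: when the opinion was expressed


# Hypothetical data: holders and dates are made up for illustration.
opinions = [
    Opinion("phone model X", "voice quality", "positive", "John", date(2014, 1, 10)),
    Opinion("phone model X", "voice quality", "negative", "Jack", date(2014, 1, 11)),
    Opinion("phone model X", "voice quality", "neutral", "James", date(2014, 1, 12)),
]


def sentiments_for(opinions, obj, feature):
    """Collect the sentiment values expressed about one feature of one object."""
    return [o.sentiment for o in opinions if o.obj == obj and o.feature == feature]
```

Dropping the `time` field gives the simplified view in which the time aspect is omitted.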
Algorithms
Approaching Sentiments Algorithmically
Unsupervised Methods
• No training data
• Cross-domain applications
• Example: Point-Wise Mutual Information
Supervised Methods
• Manually labelled training data
• Usually superior to unsupervised approaches
• Examples: Naïve Bayes Classification, Maximum Entropy Classification, Support Vector Machines
PMI-IR
• PMI: Point-wise mutual information
• IR: Information retrieval
• Introduced in 2002 as an unsupervised learning algorithm for classifying reviews [Turney2002]
• Based on the concept of PMI [Church1990]
• Measures how much more often two words co-occur than would be expected if they were independent
$\mathrm{PMI}(\mathit{word}_1, \mathit{word}_2) = \log_2 \frac{p(\mathit{word}_1 \,\&\, \mathit{word}_2)}{p(\mathit{word}_1)\, p(\mathit{word}_2)}$
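As a rough illustration of the PMI formula, the probabilities can be estimated from raw counts. The function name `pmi`, its parameters and the example counts are assumptions made for this sketch; they do not come from [Church1990] or [Turney2002].

```python
import math


def pmi(count_both, count_1, count_2, total):
    """Point-wise mutual information (base 2) estimated from counts.

    count_both: number of contexts containing both words
    count_1, count_2: number of contexts containing each word
    total: total number of contexts
    """
    if min(count_both, count_1, count_2, total) <= 0:
        raise ValueError("all counts must be positive")
    p_both = count_both / total
    p_1 = count_1 / total
    p_2 = count_2 / total
    return math.log2(p_both / (p_1 * p_2))


# Hypothetical counts: the pair co-occurs four times as often as
# independence would predict, so the PMI is about 2 bits.
example = pmi(count_both=20, count_1=100, count_2=50, total=1000)
```

A value near zero means the words co-occur about as often as chance predicts; negative values mean they tend to avoid each other.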
PMI-IR
• Turney used the words poor and excellent as seeds for the algorithm
• SO is the sentiment orientation value
  • Positive SO-value for phrases more associated with excellent
  • Negative SO-value for phrases more associated with poor
$\mathrm{SO}(\mathit{phrase}) = \mathrm{PMI}(\mathit{phrase}, \text{“excellent”}) - \mathrm{PMI}(\mathit{phrase}, \text{“poor”})$
• Improvement of results through IR component
  • Turney used AltaVista
  • Uses the NEAR operator
  • h(query) is the number of hits returned given the query
$\mathrm{SO}(\mathit{phrase}) = \log_2 \frac{h(\mathit{phrase}\ \mathrm{NEAR}\ \text{“excellent”})\; h(\text{“poor”})}{h(\mathit{phrase}\ \mathrm{NEAR}\ \text{“poor”})\; h(\text{“excellent”})}$
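The hit-count form of SO can be sketched directly. The search engine is replaced here by plain integer arguments, and the `smoothing` constant that guards against zero hit counts is an assumption of this sketch, as are the function name and the example numbers.

```python
import math


def semantic_orientation(hits_near_excellent, hits_near_poor,
                         hits_excellent, hits_poor, smoothing=0.01):
    """SO(phrase) computed from search-engine hit counts.

    hits_near_excellent: h(phrase NEAR "excellent")
    hits_near_poor:      h(phrase NEAR "poor")
    hits_excellent:      h("excellent")
    hits_poor:           h("poor")
    smoothing: small constant (an assumption) so that a phrase with
               zero NEAR-hits does not cause a division by zero.
    """
    numerator = (hits_near_excellent + smoothing) * hits_poor
    denominator = (hits_near_poor + smoothing) * hits_excellent
    return math.log2(numerator / denominator)


# Hypothetical hit counts for two phrases.
so_positive = semantic_orientation(200, 50, 1000, 1000)   # leans towards "excellent"
so_negative = semantic_orientation(50, 200, 1000, 1000)   # leans towards "poor"
```

A review would then be classified by averaging the SO values of its extracted phrases and checking the sign of the average.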
Naïve Bayes Classification
• Based on Bayes' rule [Bayes1763]
• Simple to train, probabilistic, effective
• “Bag of words” of an input document d
• Fixed set of classes C, e.g. C = {positive, negative}
• d can be reduced by omitting irrelevant words
All Words [Jurafsky2013]
I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!
Opinionated Words [Jurafsky2013]
(all other words masked) love … sweet … satirical … great … fun … whimsical … romantic … laughing … recommend … several … happy … again
Naïve Bayes at work
1. Estimate P(c) of each class c by dividing the number of words in documents in c by the total number of words in the corpus
2. Estimate P(w|c) for all words w and classes c
3. The score for a document d with words $w_1, \dots, w_n$ to be in class c is $\mathrm{score}(d, c) = P(c) \prod_{i=1}^{n} P(w_i \mid c)$
4. The most likely class for a document is the one with the highest score [Potts2011]
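The four steps above can be sketched in a few lines. This is a minimal illustration, not the implementation benchmarked in [Potts2011]: the function names, the toy training data, the add-one smoothing and the choice to skip unseen words are all assumptions of the sketch. Scores are summed in log space to avoid numeric underflow on long documents.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """labelled_docs: iterable of (list_of_words, class_label) pairs."""
    word_counts = defaultdict(Counter)  # class -> word -> count
    for words, label in labelled_docs:
        word_counts[label].update(words)
    class_totals = {c: sum(counts.values()) for c, counts in word_counts.items()}
    corpus_total = sum(class_totals.values())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    # Step 1: P(c) = words in documents of class c / words in the corpus
    priors = {c: class_totals[c] / corpus_total for c in class_totals}
    return priors, word_counts, class_totals, vocabulary


def log_score(words, c, model):
    """Step 3 in log space: log P(c) + sum of log P(w|c)."""
    priors, word_counts, class_totals, vocabulary = model
    score = math.log(priors[c])
    for w in words:
        if w not in vocabulary:
            continue  # assumption: words never seen in training are ignored
        # Step 2: P(w|c), with add-one smoothing (an assumption of this sketch)
        p_w_given_c = (word_counts[c][w] + 1) / (class_totals[c] + len(vocabulary))
        score += math.log(p_w_given_c)
    return score


def classify(words, model):
    """Step 4: pick the class with the highest score."""
    return max(model[0], key=lambda c: log_score(words, c, model))


# Hypothetical toy corpus.
model = train([
    ("love great fun".split(), "positive"),
    ("great sweet romantic".split(), "positive"),
    ("terrible boring".split(), "negative"),
    ("poor terrible dull".split(), "negative"),
])
```

Calling `classify("great fun movie".split(), model)` scores the document against both classes and returns the label with the higher score.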
Maximum Entropy Classification
“Ignorance is preferable to error, and he is less remote from the truth who believes nothing than he who believes what is wrong.” (Thomas Jefferson)
• Find weights for the features that maximize the likelihood of the training data
• Add constraints based on training data
• More constraints = less entropy = distribution is closer to data
• More difficult to implement than Naïve Bayes [Potts2011]
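For two classes, a maximum entropy classifier is equivalent to logistic regression, so the weight search can be sketched as stochastic gradient ascent on the log-likelihood. Everything concrete here is an assumption of the sketch: the function names, binary word-presence features, the learning rate and epoch count, the absence of regularisation, and the toy data.

```python
import math


def predict_proba(words, weights, bias):
    """P(positive | document) under the current feature weights."""
    z = bias + sum(weights.get(w, 0.0) for w in set(words))
    return 1.0 / (1.0 + math.exp(-z))


def train_maxent(docs, labels, epochs=200, learning_rate=0.5):
    """docs: list of word lists; labels: 1 (positive) or 0 (negative).

    Each distinct word is a binary feature. For one example, the gradient
    of the log-likelihood with respect to the weight of an active feature
    is (label - predicted probability).
    """
    weights = {}
    bias = 0.0
    for _ in range(epochs):
        for words, y in zip(docs, labels):
            error = y - predict_proba(words, weights, bias)
            bias += learning_rate * error
            for w in set(words):
                weights[w] = weights.get(w, 0.0) + learning_rate * error
    return weights, bias


# Hypothetical toy corpus.
docs = [
    ["love", "great", "fun"],
    ["great", "sweet"],
    ["terrible", "boring"],
    ["poor", "terrible", "dull"],
]
labels = [1, 1, 0, 0]
weights, bias = train_maxent(docs, labels)
```

Unlike Naïve Bayes, the weights are fitted jointly, so correlated features do not get counted twice; the price is the iterative optimisation.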
Support Vector Machines
• Most intuitive for two-class, separable training data sets
• Find a separating hyperplane (described by its normal vector) that maximizes the margin between the data sets (A vs B)
• The margin is limited by support vectors
• Applicable to more complicated problems too
  • n-class space
  • Inseparable training data, through transformation into higher dimensions
[Figure: two classes of points, A and B, in the x-y plane]
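For the two-class, separable case in two dimensions, the margin-maximizing separator can be approximated with sub-gradient descent on the regularised hinge loss. This is a simplified sketch, not a production SVM solver: the function names, hyperparameters and the coordinates of the A and B points are assumptions, and no kernel transformation is shown.

```python
def train_linear_svm(points, labels, epochs=200, learning_rate=0.01, reg=0.01):
    """points: list of (x, y) tuples; labels: +1 or -1.

    Minimises reg/2 * |w|^2 + hinge loss by per-example sub-gradient steps.
    """
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x, y), label in zip(points, labels):
            margin = label * (w[0] * x + w[1] * y + b)
            # Regularisation term: shrinking w widens the margin.
            w[0] -= learning_rate * reg * w[0]
            w[1] -= learning_rate * reg * w[1]
            if margin < 1:
                # Point lies inside the margin or on the wrong side.
                w[0] += learning_rate * label * x
                w[1] += learning_rate * label * y
                b += learning_rate * label
    return w, b


def predict(point, w, b):
    """Return +1 or -1 depending on the side of the separating line."""
    return 1 if w[0] * point[0] + w[1] * point[1] + b >= 0 else -1


# Hypothetical data: class A (label -1) and class B (label +1).
class_a = [(-1, -1), (-2, -1), (-1, -2)]
class_b = [(1, 1), (2, 1), (1, 2)]
points = class_a + class_b
labels = [-1] * len(class_a) + [1] * len(class_b)
w, b = train_linear_svm(points, labels)
```

The points that end up with a margin close to 1 play the role of the support vectors; the others could be removed without moving the separator.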
Benchmarks
Benchmarking Sentiment Analysis
• Benchmarking NB and ME with in-domain testing [Potts2011]
• Binary classification
• 6,000 restaurant reviews
Benchmarking Sentiment Analysis
• Benchmarking NB and ME with testing on a different domain [Potts2011]
• Binary classification
• Trained on 6,000 restaurant reviews
• Tested on 6,000 product reviews
Outlook
Opinionated Data in the Wild
• Sentiment analysis works well under laboratory conditions
  • Proper spelling
  • Highly opinionated
  • Pre-defined object
• Common NLP problems still remain
  • Named entity recognition
  • Context-specific meaning
  • Language ambiguity
• Benchmarking corpora do not reflect real-world data quality
Opinionated Data in the Wild
• Social media data
  • Highly relevant
  • Huge corpus
  • Constantly growing
• Very noisy
  • Questionable text quality
    • Spelling
    • Grammar
  • Spam
  • Unclear context
  • Figurative speech
  • Slang
  • Irony
• Warren Scott M. (21 January at 18:31): “your mxf format is a joke. DO NOT BUY CANON”
• Leon H. (11 January at 10:58): “Why battery 6L in my Canon sx280 have pretty low life”
• Phil D. (11 January at 04:39): “Youse guys did a solid on my wife's TI3- warranty expired last month, but did the job good! Thanks Canon”
• Cole J. (28 January 2010): “Got a canon gl1 I love it, but a little fuzzy”
Authentic status updates from https://www.facebook.com/pages/Canon-Cameras
Conclusions / Future Work
• Development of algorithms is on the right track
  • Evolution beyond binary classification
  • Algorithms will become more robust on less homogeneous sources
• Industry aims to apply algorithms to noisy data
Appendix
References
[Bayes1763] Bayes, T. An essay towards solving a problem in the doctrine of chances. Phil. Trans. of the Royal Soc. of London, 1763, Vol. 53, pp. 370-418.
[Church1990] Church, K.W. & Hanks, P. Word Association Norms, Mutual Information, and Lexicography. Comput. Linguist., MIT Press, 1990, Vol. 16(1), pp. 22-29.
[Jurafsky2013] Jurafsky, D. Naïve Bayes and Text Classification. 2013.
[Liu2010] Liu, B. Sentiment analysis and subjectivity. Handbook of Natural Language Processing, Second Edition. Taylor and Francis Group, Boca, 2010.
[Potts2011] Potts, C. Sentiment Symposium Tutorial: Classifiers. http://sentiment.christopherpotts.net/classifiers.html, 2011.
[Thomas2010] Thomas, A. & Applegate, J. Pay Attention!: How to Listen, Respond, and Profit from Customer Feedback. Wiley, 2010.
[Turney2002] Turney, P.D. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. Proceedings 40th Annual Meeting of the ACL (2002), 2002, pp. 417-424.
[Zabin2008] Zabin, J. & Jefferies, A. Social Media Monitoring and Analysis: Generating Consumer Insights from Online Conversation. Aberdeen Group Benchmark Report, 2008.
Picture Credit
Icons
Page 8: Arrow by Jamison Wieser from The Noun Project
Photography
Page 1: “Thumbs up on diving down” by JamesHuckaby is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Generic License. Based on a work at http://www.flickr.com/photos/raveller/1117899371/. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/2.0/legalcode.
Page 3: “Coventry Solihull Warwickshire Sub-Regional Planning Study Questionnaire” by TheJRJamesArchive is licensed under a Creative Commons Attribution-NonCommercial 2.0 Generic License. Based on a work at http://www.flickr.com/photos/jrjamesarchive/9371523446/. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/2.0/legalcode.
Page 14: “Svm intro.svg” by FabianBürger is licensed under a Creative Commons Attribution 3.0 License. Based on a work at http://commons.wikimedia.org/wiki/File:Svm_intro.svg. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/legalcode.