Lexicon-Based Sentiment Analysis at GHC 2014

Lexicon-Based Sentiment Analysis Using the Most-Mentioned Word Tree. Bo-Hyun Kim, Sr. Software Engineer, HP Big Data Business Unit. Oct 10th, 2014. #GHC14

description

I attended the Grace Hopper Celebration to present this work in the Data Science track. The presentation covers HP Vertica Pulse and how to improve its accuracy through suitable pre-processing, training with a Naïve Bayes classifier, and tuning.

Transcript of Lexicon-Based Sentiment Analysis at GHC 2014

Page 1: Lexicon-Based Sentiment Analysis at GHC 2014

2014

Lexicon-Based Sentiment Analysis Using the Most-Mentioned Word Tree

Bo-Hyun Kim, Sr. Software Engineer

HP Big Data Business Unit

Oct 10th, 2014

#GHC14

Page 2: Lexicon-Based Sentiment Analysis at GHC 2014

What to Expect

Sentiment Analysis
− What is it?
− Why is it interesting?
− How HP Vertica Pulse works
− Achieving greater accuracy
− Different point of view using the most-mentioned word tree

Page 3: Lexicon-Based Sentiment Analysis at GHC 2014

What I Expect

A 5-star rating on GHC app

I just expect you to enjoy and learn!

Page 4: Lexicon-Based Sentiment Analysis at GHC 2014

Sentiment Analysis

In plain English
− the process of automatically detecting if a text segment contains emotional or opinionated content and determining its polarity (e.g., “thumbs up” or “thumbs down”), is a field of research that has received significant attention in recent years, both in academia and in industry. [Wright, 2009]

Page 5: Lexicon-Based Sentiment Analysis at GHC 2014

Gimme Examples!

Also known as:
− Opinion Mining
− Text Mining

Determine people’s general opinion
− “I just got a new car, and I’m loving it”
− “My new car isn’t as fast as I thought.”

Page 6: Lexicon-Based Sentiment Analysis at GHC 2014

Why are we interested?

Increasing (every minute!) web usage
− Articles
− Blogs
− Comments

Power of Social Media
− Online Shopping
− Customer Reviews
− Recommended products on Amazon
− How other people feel about the product

Page 7: Lexicon-Based Sentiment Analysis at GHC 2014

Product Review

Page 8: Lexicon-Based Sentiment Analysis at GHC 2014

Data… Data… Data…

Page 9: Lexicon-Based Sentiment Analysis at GHC 2014

HP Vertica Pulse

Page 10: Lexicon-Based Sentiment Analysis at GHC 2014

How to Analyze?

Lexicon-based approach – HP Labs [Zhang et al., 2011]

Choose a product, person, event, organization, or topic [Hu and Liu, 2004] to analyze the opinion

Determine the Semantic Orientation score of opinion lexicons

Word        Semantic Orientation Value
Fabulous    +3
Good        +1
Bad         -1
Nasty       -3
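
The lookup step can be sketched in a few lines of Python. This is only an illustration of a lexicon lookup, not how HP Vertica Pulse is implemented: the four values come from the table above, while the tokenizer, the function names, and the sample sentence are assumptions made for the example.

```python
import re

# Semantic orientation values copied from the slide's table.
SEMANTIC_ORIENTATION = {"fabulous": 3, "good": 1, "bad": -1, "nasty": -3}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def raw_orientation(text, lexicon=SEMANTIC_ORIENTATION):
    """Sum the orientation values of every lexicon word found in the text."""
    return sum(lexicon.get(token, 0) for token in tokenize(text))


# Hypothetical sentence: 'fabulous' contributes +3 and 'bad' contributes -1.
print(raw_orientation("The screen is fabulous but the battery is bad"))
```

Words missing from the lexicon contribute nothing, which is why the quality of the word lists matters so much for this approach.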

Page 11: Lexicon-Based Sentiment Analysis at GHC 2014

Sentiment Scoring

Input: text or sentence

Output: For each attribute or entity, generates a sentiment score ranging from -1 to 1
− -1: Negative sentiment
− 0: Neutral sentiment
− 1: Positive sentiment

Entity-level lexicon-based sentiment scoring

Page 12: Lexicon-Based Sentiment Analysis at GHC 2014

Limitation

Semantic Orientation value(‘missed’) = -1

Gives more weight to the closely located word

Accuracy can suffer
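
To see how proximity weighting can mislead, here is a rough Python sketch. It assumes inverse-distance weighting (each opinion word counts for 1/distance to the entity) and a simple normalization into the -1 to 1 range; both are assumptions chosen for illustration, and the slides do not state the exact formula Pulse uses. The sentence is hypothetical.

```python
import re

# 'missed' = -1 is taken from the slide; 'good' = +1 from the lexicon table.
LEXICON = {"missed": -1, "good": 1}


def entity_score(text, entity, lexicon=LEXICON):
    """Score an entity in [-1, 1], weighting opinion words by 1 / distance."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if entity not in tokens:
        return 0.0
    entity_pos = tokens.index(entity)
    weighted = 0.0
    total_weight = 0.0
    for pos, token in enumerate(tokens):
        value = lexicon.get(token)
        if value is None or pos == entity_pos:
            continue
        weight = 1.0 / abs(pos - entity_pos)  # closer words count more
        weighted += value * weight
        total_weight += abs(value) * weight
    # Assumed normalization: divide by the total absolute weight.
    return weighted / total_weight if total_weight else 0.0


# The writer is fond of the old phone, yet 'missed' sits closer to 'phone'
# than 'good' does, so the score comes out negative.
print(round(entity_score("I missed my old phone, the new one is good", "phone"), 2))
```

The result is negative even though a human reader would call the sentiment toward the phone positive, which is the kind of accuracy loss the slide describes.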

Page 13: Lexicon-Based Sentiment Analysis at GHC 2014

Improve accuracy

Accuracy is what we strive for!

More robust pre-processing
− Prune data to fit different types of user opinion (e.g., Twitter vs. YouTube comments)

Naïve Bayes Classifier Training

Tune accordingly

Page 14: Lexicon-Based Sentiment Analysis at GHC 2014

Data Set

Test dataset
− Collected by Stanford students
− In 2009
− Over 3 million tweets with tested score
− Analyzed 3500 tweets

Collected dataset
− HP Vertica Pulse Twitter Connector
− In 2014
− Total of 1.2 million tweets over 30 days

Page 15: Lexicon-Based Sentiment Analysis at GHC 2014

Data Pruning

Remove
− Job postings
  • #job, #jobs, #tweetmyjob
− Links
  • http://this.is/nogood
− Duplicates
− Twitter specific characters
  • RT, @, #
− Emoticons
  • I hate my life :-), sarcasm is a widespread disease

After pruning
− ~287000 tweets, 24% of the 1.2 million tweets
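
The pruning rules above could look roughly like this in Python. It is a sketch under assumptions, not the pipeline used for the talk: the regular expressions, the `prune` function name, and the sample tweets are all made up for illustration, and the emoticon pattern only covers a few common Western-style emoticons.

```python
import re

JOB_TAGS = {"#job", "#jobs", "#tweetmyjob"}
URL_RE = re.compile(r"https?://\S+")
RETWEET_RE = re.compile(r"\bRT\b")
# Eyes, optional nose, one or more mouth characters, not glued to a word.
EMOTICON_RE = re.compile(r"(?<!\w)[:;=][-']?[)(\]\[DPp/\\|]+(?!\w)")


def prune(tweets):
    """Drop job postings and duplicates; strip links, emoticons, RT, @ and #."""
    seen = set()
    kept = []
    for tweet in tweets:
        words = tweet.lower().split()
        if any(word.strip(".,!?:;") in JOB_TAGS for word in words):
            continue  # job postings are removed entirely
        text = URL_RE.sub("", tweet)
        text = EMOTICON_RE.sub("", text)
        text = RETWEET_RE.sub("", text)
        text = text.replace("@", "").replace("#", "")
        text = " ".join(text.split())  # collapse leftover whitespace
        key = text.lower()
        if not text or key in seen:
            continue  # empty after cleaning, or a duplicate
        seen.add(key)
        kept.append(text)
    return kept


sample = [
    "RT @alice: Loving my new car :-) http://this.is/nogood",
    "Loving my new car",
    "LOVING my new car",
    "We are hiring! #jobs #tweetmyjob",
    "#healthcare costs are rising",
]
print(prune(sample))
```

Links are removed before emoticons so that the `:/` inside a URL is never mistaken for an emoticon.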

Page 16: Lexicon-Based Sentiment Analysis at GHC 2014

Naïve Bayes Classifier

Supervised learning
− Probabilistic classifier based on Bayes’ theorem
− Requires a small amount of data
− Assumes the presence/absence of a particular feature of a class is unrelated to the presence/absence of any other feature
− Classifies the object based on its included features
− Open source found at [nltk.org]
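
The independence assumption is easiest to see in code. The slide points to the open-source implementation at nltk.org (NLTK ships it as `nltk.NaiveBayesClassifier`); the sketch below instead writes a tiny multinomial Naive Bayes from scratch in plain Python so it runs without dependencies. The class name, the add-one smoothing, and the four training sentences are assumptions for illustration and have nothing to do with the dataset or the accuracy reported in the talk.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocabulary = {
            word for counts in self.word_counts.values() for word in counts
        }
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_log_prob = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            log_prob = math.log(doc_count / total_docs)  # class prior
            for word in text.lower().split():
                if word in self.vocabulary:  # skip words never seen in training
                    # Each word is treated as independent of the others.
                    log_prob += math.log((counts[word] + 1) / denom)
            if log_prob > best_log_prob:
                best_label, best_log_prob = label, log_prob
        return best_label


# Hypothetical training data.
texts = ["i love my new car", "what a great fast car",
         "my car is slow", "i hate this awful traffic"]
labels = ["pos", "pos", "neg", "neg"]
model = TinyNaiveBayes().fit(texts, labels)
print(model.predict("love this great car"), model.predict("awful slow traffic"))
```

Working in log space avoids underflow when many small word probabilities are multiplied together.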

Page 17: Lexicon-Based Sentiment Analysis at GHC 2014

Naïve Bayes Classifier

Results:
− Final accuracy: 0.788

Page 18: Lexicon-Based Sentiment Analysis at GHC 2014

Tuning Pulse

− Positive words
− Negative words
− Neutral words
− White lists
− Stop words
− Synonym mappings

Page 19: Lexicon-Based Sentiment Analysis at GHC 2014

Accuracy Comparison

Sentiment scores generated for each phase

Keyword     Ideal    Original  Pruning  Training  Tuning
Healthcare  -0.1515  -0.0333   -0.0833  -0.1      -0.125
Obama       0.308    0.0944    0.1535   0.1535    0.1842
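
One way to read the table is to measure how far each phase is from the Ideal column. The snippet below does that arithmetic in Python; the numbers are copied from the table, while the choice of absolute error as the metric and the function name are assumptions for illustration.

```python
PHASES = ["Original", "Pruning", "Training", "Tuning"]
IDEAL = {"Healthcare": -0.1515, "Obama": 0.308}
SCORES = {
    "Healthcare": [-0.0333, -0.0833, -0.1, -0.125],
    "Obama": [0.0944, 0.1535, 0.1535, 0.1842],
}


def absolute_errors(keyword):
    """Distance between each phase's score and the ideal score."""
    return {
        phase: round(abs(score - IDEAL[keyword]), 4)
        for phase, score in zip(PHASES, SCORES[keyword])
    }


for keyword in IDEAL:
    print(keyword, absolute_errors(keyword))
```

For Healthcare the gap shrinks from 0.1182 to 0.0265 across the phases, and for Obama from 0.2136 to 0.1238, with no change between the pruning and training phases.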

Page 20: Lexicon-Based Sentiment Analysis at GHC 2014

Trend/Targeted Analysis

Targeted dataset analysis can help improve accuracy

Identify the most-mentioned words
− Use the most-recurrent words to narrow the scope of analysis

Find new trends
− Government healthcare (2009) vs. Obamacare (2014)

Are we looking at the targeted data?
− “Solve healthcare challenges with technology!”
− “Healthcare After ObamaCare”
− “Get affordable healthcare at HealthCare.gov”
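
Finding the most-mentioned words around a keyword can be sketched with a simple counter. The first three tweets are the examples from the slide; the last two tweets, the stop-word list, and the function name are hypothetical additions so that one word clearly stands out.

```python
import re
from collections import Counter

# Tiny, hypothetical stop-word list for this example only.
STOP_WORDS = {"a", "after", "at", "for", "get", "is", "my", "the", "with"}


def most_mentioned(tweets, keyword, top_n=3):
    """Count words that co-occur with `keyword`, ignoring stop words."""
    counts = Counter()
    for tweet in tweets:
        tokens = re.findall(r"[a-z]+", tweet.lower())
        if keyword not in tokens:
            continue
        counts.update(
            token for token in tokens
            if token != keyword and token not in STOP_WORDS
        )
    return counts.most_common(top_n)


TWEETS = [
    "Solve healthcare challenges with technology!",
    "Healthcare After ObamaCare",
    "Get affordable healthcare at HealthCare.gov",
    "Obamacare changed my healthcare plan",  # hypothetical
    "Is Obamacare good for healthcare?",     # hypothetical
]
print(most_mentioned(TWEETS, "healthcare", top_n=1))
```

The top word can then be used to narrow the dataset before sentiment scoring is run again.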

Page 21: Lexicon-Based Sentiment Analysis at GHC 2014

Generating Tree

Increase the relevancy of the sentiment score by running the sentiment analysis on the entity, as well as on the most-recurrent words, to identify:
− Homonyms that machines do not understand
− More accurate scores based on user interest

Generate tree using Text Search
− Merge stemmed words, e.g. query, queries, querying…
− Lucene (Apache open source)
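
Merging stemmed words means counting variants such as query, queries, and querying as one word. The talk used Lucene, which is a Java library; the Python sketch below is a deliberately naive suffix stripper written for illustration only and is not Lucene's stemming algorithm. The function names and suffix rules are assumptions.

```python
from collections import Counter

# (suffix, replacement) pairs tried in order; a toy rule set, not a real stemmer.
SUFFIX_RULES = (("ying", "y"), ("ies", "y"), ("ing", ""), ("s", ""))


def toy_stem(word):
    """Strip one common suffix if at least three letters remain."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


def merge_counts(words):
    """Count words after collapsing each one to its toy stem."""
    return Counter(toy_stem(word) for word in words)


print(merge_counts(["query", "queries", "querying", "Query"]))
```

A production stemmer handles far more cases, but the effect on the word counts is the same: variants stop competing with each other for the most-mentioned spot.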

Page 22: Lexicon-Based Sentiment Analysis at GHC 2014

Tree View

healthcare

obamacare !(Obamacare)

obama !(Obama) !(health) health
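
The slide's diagram did not survive extraction, so the following Python sketch is one possible reading of it, not a reconstruction: it assumes `!(word)` means "tweets that do not mention word" and that each level splits the parent set on a most-mentioned word. The branch layout, function names, and sample tweets are all assumptions.

```python
import re


def mentions(tweet, word):
    """True if `word` appears as a whole token in the tweet."""
    return word in re.findall(r"[a-z]+", tweet.lower())


def split(tweets, word):
    """Partition tweets into those mentioning `word` and those that do not."""
    yes = [t for t in tweets if mentions(t, word)]
    no = [t for t in tweets if not mentions(t, word)]
    return yes, no


def build_tree(tweets):
    """Assumed layout: healthcare -> obamacare / !(obamacare) -> next level."""
    root = [t for t in tweets if mentions(t, "healthcare")]
    obamacare, not_obamacare = split(root, "obamacare")
    obama, not_obama = split(obamacare, "obama")
    health, not_health = split(not_obamacare, "health")
    return {
        "healthcare": {
            "obamacare": {"obama": obama, "!(obama)": not_obama},
            "!(obamacare)": {"health": health, "!(health)": not_health},
        }
    }


# Hypothetical tweets, apart from the two quoted on the earlier slide.
tree = build_tree([
    "Healthcare After ObamaCare",
    "Obama defends obamacare and healthcare reform",
    "Public health and healthcare spending",
    "Solve healthcare challenges with technology!",
    "I love my new car",
])
print(tree["healthcare"]["obamacare"]["obama"])
```

Sentiment scoring could then be run separately on each leaf, so that tweets about different senses of the keyword do not dilute one another.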

Page 23: Lexicon-Based Sentiment Analysis at GHC 2014

Thank you

[email protected]

[email protected]

Many thanks to*:

Tim Donar, Solution Engineer

Beth Favini, Tech Pubs Sr. Manager

Judith Plummer, Tech Pubs Editor in Chief

* In alphabetical order

Page 24: Lexicon-Based Sentiment Analysis at GHC 2014

Got Feedback?

Rate and Review the session using the GHC Mobile App

To download visit www.gracehopper.org