Lexicon-Based Sentiment Analysis at GHC 2014

2014

Lexicon-Based Sentiment Analysis Using the Most-Mentioned Word Tree

Bo-Hyun Kim, Sr. Software Engineer

HP Big Data Business Unit

Oct 10th, 2014

#GHC14

What to Expect

Sentiment Analysis
− What is it?
− Why is it interesting?
− How HP Vertica Pulse works
− Achieving greater accuracy
− A different point of view using the most-mentioned word tree

What I Expect

A 5-star rating on GHC app

I just expect you to enjoy and learn!

Sentiment Analysis

In plain English
− Sentiment analysis, the process of automatically detecting if a text segment contains emotional or opinionated content and determining its polarity (e.g., “thumbs up” or “thumbs down”), is a field of research that has received significant attention in recent years, both in academia and in industry. [Wright, 2009]

Gimme Examples!

Also known as:
− Opinion Mining
− Text Mining

Determine people’s general opinion
− “I just got a new car, and I’m loving it”
− “My new car isn’t as fast as I thought.”

Why are we interested?

Increasing (every minute!) web usage
− Articles
− Blogs
− Comments

Power of Social Media
− Online Shopping
− Customer Reviews
− Recommended products on Amazon
− How other people feel about the product

Product Review

Data… Data… Data…

HP Vertica Pulse

How to Analyze?

Lexicon-based approach – HP Labs [Zhang et al., 2011]
Choose a product, person, event, organization, or topic [Hu and Liu, 2004] to analyze the opinion
Determine the Semantic Orientation score of opinion lexicons

Word        Semantic Orientation Value
Fabulous    +3
Good        +1
Bad         -1
Nasty       -3
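
To make the lexicon lookup concrete, here is a minimal sketch that sums the semantic orientation values of the words found in a sentence. The four values come from the table above; the tokenization (lowercasing, whitespace split, punctuation stripping) and the function name raw_orientation are assumptions for illustration, not the HP Vertica Pulse implementation.

```python
# Semantic orientation values taken from the table in the slide.
LEXICON = {"fabulous": 3, "good": 1, "bad": -1, "nasty": -3}


def raw_orientation(text):
    """Sum the semantic orientation of every lexicon word in the text."""
    # Assumed tokenization: split on whitespace, strip punctuation, lowercase.
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    # Words missing from the lexicon contribute 0 (neutral).
    return sum(LEXICON.get(w, 0) for w in words)


print(raw_orientation("The food was fabulous but the service was bad."))  # 3 + (-1) = 2
```

A plain sum like this is unbounded and ignores which entity each opinion word refers to, which is why an entity-level score is needed.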

Sentiment Scoring

Input: text or sentence
Output: For each attribute or entity, generates a sentiment score ranging from -1 to 1
− -1: Negative sentiment
− 0: Neutral sentiment
− 1: Positive sentiment

Entity-level lexicon-based sentiment scoring

Limitation

Semantic Orientation value(‘missed’) = -1
Gives more weight to the closely located word
Accuracy can suffer
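
The limitation can be illustrated with a distance-weighted score, a common formulation in entity-level lexicon methods. The slides do not give the actual formula, so the weighting below (each opinion word's value divided by its token distance to the entity, normalized so the result stays in [-1, 1]) is an assumption, as are the function name and the example sentences. Only the value of 'missed' comes from this slide; 'fabulous' comes from the earlier lexicon table.

```python
# 'missed' = -1 is from the slide; 'fabulous' = +3 is from the lexicon table.
LEXICON = {"missed": -1, "fabulous": 3}


def entity_score(text, entity):
    """Distance-weighted sentiment for one entity, in the range [-1, 1]."""
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    if entity not in words:
        return 0.0
    pos = words.index(entity)
    weighted = 0.0  # sum of value / distance
    norm = 0.0      # sum of |value| / distance, used to normalize
    for i, word in enumerate(words):
        value = LEXICON.get(word, 0)
        if value == 0 or i == pos:
            continue
        distance = abs(i - pos)
        weighted += value / distance
        norm += abs(value) / distance
    return weighted / norm if norm else 0.0


# The only opinion word is 'missed', so the entity scores fully negative,
# even though the author is not criticizing the concert itself.
print(entity_score("I missed the concert", "concert"))  # -1.0
```

In a longer sentence such as "The concert was fabulous but I missed the encore", the nearby 'fabulous' outweighs the distant 'missed', which is the proximity effect the slide describes.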

Improve accuracy

Accuracy is what we strive for!
More robust pre-processing
− Prune data to fit different types of user opinion (e.g. Twitter vs. YouTube comments)
Naïve Bayes Classifier Training
Tune accordingly

Data Set

Test dataset
− Collected by Stanford students
− In 2009
− Over 3 million tweets with tested scores
− Analyzed 3500 tweets

Collected dataset
− HP Vertica Pulse Twitter Connector
− In 2014
− Total of 1.2 million tweets over 30 days

Data Pruning

Remove
− Job postings
  • #job, #jobs, #tweetmyjob
− Links
  • http://this.is/nogood
− Duplicates
− Twitter-specific characters
  • RT, @, #
− Emoticons
  • I hate my life :-) (sarcasm is a widespread disease)

After pruning
− ~287,000 tweets, 24% of the 1.2 million tweets
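
The pruning rules listed on this slide can be sketched as a small filter. This is an illustrative reading of the list, not the actual pipeline: the emoticon list, the order of the steps, and the choice to compare duplicates after cleaning are all assumptions.

```python
import re

JOB_TAGS = {"#job", "#jobs", "#tweetmyjob"}   # from the slide
EMOTICONS = [":-)", ":)", ":-(", ":(", ";-)", ";)", ":-D", ":D"]  # assumed list
URL_RE = re.compile(r"https?://\S+")


def prune(tweets):
    """Drop job postings and duplicates; strip links, RT/@/# and emoticons."""
    seen = set()
    kept = []
    for tweet in tweets:
        tokens = [t.strip(".,!?") for t in tweet.lower().split()]
        if any(t in JOB_TAGS for t in tokens):
            continue  # job posting: drop the whole tweet
        text = URL_RE.sub("", tweet)
        for emoticon in EMOTICONS:
            text = text.replace(emoticon, "")
        text = re.sub(r"\bRT\b", "", text)   # retweet marker
        text = re.sub(r"[@#]", "", text)     # keep the word, drop the symbol
        text = " ".join(text.split())        # collapse leftover whitespace
        key = text.lower()
        if not text or key in seen:
            continue  # empty after cleaning, or a duplicate
        seen.add(key)
        kept.append(text)
    return kept


sample = [
    "RT @bob: I hate my life :-) http://this.is/nogood",
    "Hiring now #jobs",
    "RT @bob: I hate my life :-) http://this.is/nogood",
    "Loving #healthcare",
]
print(prune(sample))  # ['bob: I hate my life', 'Loving healthcare']
```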

Naïve Bayes Classifier

Supervised learning
− Probabilistic classifier based on Bayes’ theorem
− Requires a small amount of data
− Assumes the presence/absence of a particular feature of a class is unrelated to the presence/absence of any other feature
− Classifies the object based on its included features
− Open source found at [nltk.org]
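
To show what the independence assumption means in practice, here is a minimal from-scratch Naïve Bayes sketch in plain Python. It is not the NLTK implementation used in the talk (NLTK exposes a comparable classifier as nltk.NaiveBayesClassifier); the class name, the add-one smoothing, and the four training sentences are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Word-count Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.label_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, labeled_texts):
        for text, label in labeled_texts:
            self.label_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def classify(self, text):
        total = sum(self.label_counts.values())
        best_label, best_logp = None, -math.inf
        for label, count in self.label_counts.items():
            # log P(label) + sum of log P(word | label); each word is treated
            # as independent of the others, which is the "naive" assumption.
            logp = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in text.lower().split():
                logp += math.log((self.word_counts[label][word] + 1) / denom)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label


# Hypothetical training data, far smaller than a real labeled tweet set.
clf = NaiveBayes()
clf.train([
    ("i love my new car", "pos"),
    ("loving this great phone", "pos"),
    ("my car is slow and bad", "neg"),
    ("i hate this nasty phone", "neg"),
])
print(clf.classify("love this great car"))  # pos
print(clf.classify("nasty slow phone"))     # neg
```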

Naïve Bayes Classifier

Results:
− Final accuracy: 0.788

Tuning Pulse

Positive words
Negative words
Neutral words
White lists
Stop words
Synonym mappings

Accuracy Comparison

Sentiment scores generated for each phase

Keyword      Ideal     Original  Pruning   Training  Tuning
Healthcare   -0.1515   -0.0333   -0.0833   -0.1      -0.125
Obama         0.308     0.0944    0.1535    0.1535    0.1842
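
One way to read the comparison is to compute how far each phase's score is from the ideal score. The sketch below does that arithmetic on the numbers from the table; using the absolute difference as the measure is an assumption, since the slides do not say how closeness was judged.

```python
# Scores copied from the table in the slide.
IDEAL = {"Healthcare": -0.1515, "Obama": 0.308}
PHASES = ["Original", "Pruning", "Training", "Tuning"]
SCORES = {
    "Healthcare": [-0.0333, -0.0833, -0.1, -0.125],
    "Obama": [0.0944, 0.1535, 0.1535, 0.1842],
}


def gaps_to_ideal(keyword):
    """Absolute distance between each phase's score and the ideal score."""
    ideal = IDEAL[keyword]
    return {phase: abs(score - ideal)
            for phase, score in zip(PHASES, SCORES[keyword])}


for keyword in SCORES:
    gaps = gaps_to_ideal(keyword)
    print(keyword, {phase: round(gap, 4) for phase, gap in gaps.items()})
```

For both keywords the gap never grows from one phase to the next: Healthcare goes from about 0.118 to about 0.027, and Obama from about 0.214 to about 0.124.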

Trend/Targeted Analysis

Targeted dataset analysis can help improve accuracy
Identify the most-mentioned words
− Use the most-recurrent words to narrow the scope of analysis
Find new trends
− Government healthcare (2009) vs. Obamacare (2014)
Are we looking at the targeted data?
− “Solve healthcare challenges with technology!”
− “Healthcare After ObamaCare”
− “Get affordable healthcare at HealthCare.gov”
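
Identifying the most-mentioned words can be sketched as a co-occurrence count over the tweets that contain the keyword. The first three tweets are the examples from this slide; the fourth tweet, the stop-word list, and the function name are hypothetical additions for illustration.

```python
from collections import Counter

# Hypothetical stop-word list; a tuned deployment would use its own.
STOP_WORDS = {"with", "at", "get", "after", "the", "a", "to"}


def most_mentioned(tweets, keyword, top_n=1):
    """Count the words that co-occur with the keyword, once per tweet."""
    counts = Counter()
    for tweet in tweets:
        words = [w.strip(".,!?\"'").lower() for w in tweet.split()]
        if keyword not in words:
            continue
        # dict.fromkeys removes repeats within a tweet but keeps word order.
        for word in dict.fromkeys(words):
            if word != keyword and word not in STOP_WORDS:
                counts[word] += 1
    return counts.most_common(top_n)


tweets = [
    "Solve healthcare challenges with technology!",
    "Healthcare After ObamaCare",
    "Get affordable healthcare at HealthCare.gov",
    "Obamacare changes healthcare enrollment",  # hypothetical extra tweet
]
print(most_mentioned(tweets, "healthcare"))  # [('obamacare', 2)]
```

The top co-occurring word then becomes a candidate for narrowing the analysis to the tweets a user actually cares about.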

Generating Tree

Increase the relevancy of the sentiment score by running the sentiment analysis on the entity, as well as on the most-recurrent words, to identify:
− Homonyms that machines do not understand
− More accurate scores based on user interest

Generate tree using Text Search
− Merge stemmed words, e.g. query, queries, querying…
− Lucene (Apache open source)
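
A rough sketch of the two steps, under stated assumptions: the talk uses Lucene for text search and stemming, whereas crude_stem below is a deliberately simple stand-in written only for this example, and split_on reflects one possible reading of the tree, in which each node divides the tweets into those that mention a word and those that do not, written as !(word).

```python
from collections import Counter


def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix, replacement in (("ies", "y"), ("ing", ""), ("s", "")):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


def stems(text):
    """Tokenize (assumed: whitespace split, punctuation strip) and stem."""
    return [crude_stem(w.strip(".,!?\"'").lower()) for w in text.split()]


def merged_counts(tweets):
    """Count mentions after merging words that share a stem."""
    counts = Counter()
    for tweet in tweets:
        counts.update(stems(tweet))
    return counts


def split_on(tweets, word):
    """Split tweets into a node for `word` and a node for its complement."""
    tree = {word: [], "!(" + word + ")": []}
    for tweet in tweets:
        key = word if word in stems(tweet) else "!(" + word + ")"
        tree[key].append(tweet)
    return tree


print(merged_counts(["query queries querying"]))  # Counter({'query': 3})
```

Calling split_on on the healthcare tweets with "obamacare" would give two child nodes, each of which could then be scored separately or split again on the next most-mentioned word.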

Tree View

[Figure: most-mentioned word tree rooted at “healthcare”, with nodes “obamacare”, “!(Obamacare)”, “obama”, “!(Obama)”, “health”, and “!(health)”]

Thank you

[email protected]

Many thanks to*:
Tim Donar, Solution Engineer
Beth Favini, Tech Pubs Sr. Manager
Judith Plummer, Tech Pubs Editor in Chief

* In alphabetical order

Got Feedback?

Rate and Review the session using the GHC Mobile App

To download visit www.gracehopper.org
