Experiences with Sentiment Analysis with Peter Zadrozny

Peter Zadrozny


Transcript of Experiences with Sentiment Analysis with Peter Zadrozny

Page 1: Experiences with Sentiment Analysis with Peter Zadrozny

Peter Zadrozny

Page 2: Experiences with Sentiment Analysis with Peter Zadrozny

The contents of this presentation are part of the book “Big Data Analytics Using Splunk” by Peter Zadrozny and Raghu Kodali

Page 3: Experiences with Sentiment Analysis with Peter Zadrozny

Introduction
The technical side
The Splunk sentiment analysis app
The world sentiment indicator project
Conclusions

Agenda

Page 4: Experiences with Sentiment Analysis with Peter Zadrozny

Introduction

Page 5: Experiences with Sentiment Analysis with Peter Zadrozny

Sentiment Analysis
Is the process of examining text or speech to find out the opinions, views, or feelings of the author or speaker
This definition applies to a computer system
When a human does this, it's called reading

The words in the title describe highly subjective and ambiguous concepts for a human
Even more challenging for a computer program

Opinions, Views, Beliefs, Convictions

Page 6: Experiences with Sentiment Analysis with Peter Zadrozny

Words or expressions have different meanings depending on the knowledge domain (domain of expertise)
Example: "Go Around"

Sarcasm, jokes, etc.
Domains of expertise usually have slang
Conclusion: sentiment is contextual and domain dependent

Opinions, Views, Beliefs, Convictions

Page 7: Experiences with Sentiment Analysis with Peter Zadrozny

Analysis tends to be done by
Domain of expertise
Media channel
Newspaper articles follow grammar rules, use proper words, and have no orthographical mistakes
Tweets lack sentence structure, likely use slang, include emoticons (such as :-) and :-( ), and sometimes lengthen words ("I looooooove chocolate")

Sentiment Analysis

Page 8: Experiences with Sentiment Analysis with Peter Zadrozny

Companies want to know what their
Customers
Competitors
General public

Think about their
Products
Services
Brands

Usually associated with marketing and public relations

Commercial Uses

Page 9: Experiences with Sentiment Analysis with Peter Zadrozny

When done correctly, sentiment analysis is powerful
"From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series", O'Connor et al., 2010
Analysis of surveys on consumer confidence and political opinion correlates with sentiment word frequencies in Twitter by as much as 80%

These results highlight the potential of text streams as a substitute and supplement for traditional polling.

Commercial Uses

Page 10: Experiences with Sentiment Analysis with Peter Zadrozny

When not well done…
"The Hathaway Effect: How Anne Gives Warren Buffett a Rise", Dan Mirvish, Huffington Post, 2011
Suspicions that some robotic trading programs on Wall Street include sentiment analysis
Every time Anne Hathaway makes the headlines, the stock of Warren Buffett's company Berkshire Hathaway goes up

Commercial Uses

Page 11: Experiences with Sentiment Analysis with Peter Zadrozny

The Technical Side

Page 12: Experiences with Sentiment Analysis with Peter Zadrozny

Sentiment Analysis is text categorization
The results fall into two categories:
Polarity: positive, negative, neutral

Range of polarity: ratings or rankings
Example: 1 to 5 stars for movie reviews

The Technical Side

Page 13: Experiences with Sentiment Analysis with Peter Zadrozny

Extracting and categorizing sentiment is based on "features":
Frequency: words that appear most often decide the polarity
Term presence: the most unique words define polarity
N-grams: the position of a word determines polarity
Parts of speech: adjectives define the polarity
Syntax: attempts to analyze syntactic relations haven't been very successful
Negation: explicit negation terms reverse polarity

Text classifiers tend to use combinations of features, as the sketch below illustrates

The Technical Side
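A rough Python sketch of how two of the features above, term presence and explicit negation, might be combined in one feature extractor; the function and the negation word list are illustrative, not taken from the Splunk app:

NEGATIONS = {"not", "no", "never"}  # illustrative negation terms

def extract_features(text):
    """Map a sentence to binary term-presence features, marking words
    that follow a negation term so their polarity can be reversed."""
    features = {}
    negated = False
    for word in text.lower().split():
        if word in NEGATIONS:
            negated = True
            continue
        key = ("NOT_" + word) if negated else word
        features["has(" + key + ")"] = True  # presence, not frequency
    return features

print(extract_features("I do not like this movie"))
# {'has(i)': True, 'has(do)': True, 'has(NOT_like)': True, ...}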

Page 14: Experiences with Sentiment Analysis with Peter Zadrozny

To assign contextual polarity, you need a base polarity
Use a lexicon, which provides a polarity for each word
Word → Phrase → Sentence → Document

Use training documents (preferred)

The Technical Side
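To make the lexicon option concrete, here is a minimal Python sketch with a toy hand-made lexicon; the word scores are invented for the example, and a real lexicon assigns polarities to thousands of words:

# Toy lexicon: positive scores for positive words, negative for negative
LEXICON = {"good": 1, "great": 2, "bad": -1, "awful": -2}

def lexicon_polarity(text):
    """Sum the base polarity of each known word; the sign is the label."""
    score = sum(LEXICON.get(word, 0) for word in text.lower().split())
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(lexicon_polarity("a great movie with a good cast"))  # positive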

Page 15: Experiences with Sentiment Analysis with Peter Zadrozny

Training documents
Contain a number of sentences
Are classified with a specific polarity

Polarity for each word is based on a combination of feature extractors and its appearance in the different classifications
The more sentences, the more accurate
Results are placed in a "model"

The Technical Side
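A minimal sketch of this training step using NLTK's Naive Bayes classifier (the same family of classifier the Splunk app builds on); the four-sentence corpus is far too small for real use and only shows the mechanics:

import nltk

# Tiny illustrative training corpus: (sentence, polarity) pairs
train = [
    ("I love this phone", "positive"),
    ("what a great movie", "positive"),
    ("this is terrible", "negative"),
    ("I hate waiting", "negative"),
]

def features(sentence):
    # Term presence over unigrams
    return {word: True for word in sentence.lower().split()}

train_set = [(features(text), label) for text, label in train]
model = nltk.NaiveBayesClassifier.train(train_set)  # the "model"
print(model.classify(features("I love this movie")))  # positive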

Page 16: Experiences with Sentiment Analysis with Peter Zadrozny

Machine learning tools:

Naïve Bayes Classifier
Generally uses N-grams, frequency, and term presence; sometimes parts of speech

Maximum Entropy
Naïve Bayes assumes each feature is independent; Maximum Entropy does not
Allows for overlap of words

Support Vector Machines
One vector per feature
Linear, polynomial, sigmoid, and other functions are applied to the vectors

The Technical Side
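The three families can be tried side by side; here is a minimal scikit-learn sketch, where LogisticRegression stands in for Maximum Entropy (the same model under a different name) and the four training sentences are invented for the example:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

texts = ["I love it", "great stuff", "awful film", "I hate it"]
labels = ["positive", "positive", "negative", "negative"]

# Unigram and bigram counts as features
X = CountVectorizer(ngram_range=(1, 2)).fit_transform(texts)

for clf in (MultinomialNB(), LogisticRegression(), LinearSVC()):
    clf.fit(X, labels)
    print(type(clf).__name__, clf.predict(X))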

Page 17: Experiences with Sentiment Analysis with Peter Zadrozny

The Technical Side

[Diagram: a training corpus of positive, negative, and neutral documents feeds a Trainer, which produces a Model; a testing corpus feeds a Tester, which reports accuracy and margin of error; a Processor then uses the Model to assign a Sentiment to each Document.]
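A minimal sketch of the Tester stage, assuming a model and featurize function like the ones sketched earlier; the 95% margin of error uses the usual normal approximation:

import math

def test(model, featurize, testing_corpus):
    """testing_corpus is a list of (text, expected_label) pairs."""
    hits = sum(model.classify(featurize(text)) == label
               for text, label in testing_corpus)
    n = len(testing_corpus)
    accuracy = hits / n
    margin = 1.96 * math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy, margin

# accuracy, margin = test(model, features, testing_corpus)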

Page 18: Experiences with Sentiment Analysis with Peter Zadrozny

The Splunk Sentiment Analysis App

Page 19: Experiences with Sentiment Analysis with Peter Zadrozny

Based on the Naïve Bayes Classifier
Has three commands: sentiment, language, and token

Includes a training/testing program and two models
Twitter: 190,862 positive and 37,469 negative tweets
IMDb: range of polarity from 1 to 10
Each ranking has 11 movie reviews, averaging 200 words

The Splunk Sentiment Analysis App

Page 20: Experiences with Sentiment Analysis with Peter Zadrozny

index=twitter lang=en
| where like(text, "%love%")
| sentiment twitter text
| stats avg(sentiment)

The Splunk Sentiment Analysis App

Page 21: Experiences with Sentiment Analysis with Peter Zadrozny

Love, Hate and Justin Bieber

Page 22: Experiences with Sentiment Analysis with Peter Zadrozny

index=twitter lang=en
| rename entities.hashtags{}.text as hashtags
| fields text, hashtags
| mvexpand hashtags
| where like(hashtags, "Beliebers")
| sentiment twitter text
| stats avg(sentiment)

The Beliebers Search

Page 23: Experiences with Sentiment Analysis with Peter Zadrozny

index=twitter lang=en
| rename entities.hashtags{}.text as hashtags
| fields text, hashtags
| mvexpand hashtags
| where like(hashtags, "Beliebers")
| sentiment twitter text
| stats avg(sentiment)

The Beliebers Search

So that we don't have to type entities.hashtags{}.text every time we want to refer to a hashtag, rename this multi-value field to hashtags

Page 24: Experiences with Sentiment Analysis with Peter Zadrozny

index=twitter lang=en
| rename entities.hashtags{}.text as hashtags
| fields text, hashtags
| mvexpand hashtags
| where like(hashtags, "Beliebers")
| sentiment twitter text
| stats avg(sentiment)

The Beliebers Search

We only want the fields that contain the tweet and the hashtags

Page 25: Experiences with Sentiment Analysis with Peter Zadrozny

index=twitter lang=en
| rename entities.hashtags{}.text as hashtags
| fields text, hashtags
| mvexpand hashtags
| where like(hashtags, "Beliebers")
| sentiment twitter text
| stats avg(sentiment)

The Beliebers Search

Expand the values of this multi-value field into separate Splunk events
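As a rough Python analogy (not Splunk code) for what mvexpand does here, one event carrying a multi-value hashtags field becomes one event per hashtag:

event = {"text": "Off to the concert!", "hashtags": ["Beliebers", "JB"]}

# One copy of the event per value of the multi-value field
expanded = [dict(event, hashtags=tag) for tag in event["hashtags"]]
print(expanded)
# [{'text': 'Off to the concert!', 'hashtags': 'Beliebers'},
#  {'text': 'Off to the concert!', 'hashtags': 'JB'}]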

Page 26: Experiences with Sentiment Analysis with Peter Zadrozny

index=twitter lang=en
| rename entities.hashtags{}.text as hashtags
| fields text, hashtags
| mvexpand hashtags
| where like(hashtags, "Beliebers")
| sentiment twitter text
| stats avg(sentiment)

The Beliebers Search

Page 27: Experiences with Sentiment Analysis with Peter Zadrozny

The training corpus is key to accuracy
Beware: Naïve Bayes is not an exact algorithm
The best accuracy obtained using Naïve Bayes is approximately 83%

Key factors to increase accuracy:
Similarity to the data being analyzed
Size of the corpus

Training and Testing Data

Page 28: Experiences with Sentiment Analysis with Peter Zadrozny

Training and Testing Data

Test Data               Size         Accuracy   Margin of Error
University of Michigan  1.5 million  72.49%     1.05%
Splunk                  228,000      68.79%     1.12%
Sanders                 5,500        60.61%     0.76%

Page 29: Experiences with Sentiment Analysis with Peter Zadrozny

Love, Hate & Justin Bieber: Sanders Model

Page 30: Experiences with Sentiment Analysis with Peter Zadrozny

The World Sentiment Indicator Project

Page 31: Experiences with Sentiment Analysis with Peter Zadrozny

Based on news headlines
From news web sites all around the world
Collecting RSS feeds in English

The World Sentiment Indicator

Page 32: Experiences with Sentiment Analysis with Peter Zadrozny

Steps for this project:
1. Collect the RSS feeds
2. Index the headlines into Splunk
3. Define the sentiment corpus
4. Create a visualization of the results

The World Sentiment Indicator

Page 33: Experiences with Sentiment Analysis with Peter Zadrozny

Collecting the RSS Feeds
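A minimal Python sketch of this collection step, assuming the feedparser package; the two feed URLs are placeholder examples standing in for the project's worldwide list of news sources:

import feedparser

FEEDS = [
    "http://feeds.bbci.co.uk/news/rss.xml",  # example feed URL
    "http://rss.cnn.com/rss/edition.rss",    # example feed URL
]

for url in FEEDS:
    feed = feedparser.parse(url)
    for entry in feed.entries:
        print(entry.title)  # headline, to be indexed into Splunk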

Page 34: Experiences with Sentiment Analysis with Peter Zadrozny

Create your own
Crowd-source
University of Michigan ‒ Kaggle competition
Bootstrap
"Twitter Sentiment Classification Using Distant Supervision", Go et al., 2010
Uses emoticons to classify tweets
Accuracy for unigrams and bigrams:
Naïve Bayes 82.7%
Maximum Entropy 82.7%
Support Vector Machine 81.6%

Training Corpus Creation
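A minimal sketch of the bootstrap idea from Go et al.: emoticons act as "distant" labels and are then stripped so the classifier cannot simply learn them; the emoticon lists here are abbreviated examples:

POSITIVE = (":)", ":-)", ":D", "=)")
NEGATIVE = (":(", ":-(", "=(")

def distant_label(tweet):
    """Return (cleaned_text, label), or None if no emoticon is found."""
    if any(e in tweet for e in POSITIVE):
        label = "positive"
    elif any(e in tweet for e in NEGATIVE):
        label = "negative"
    else:
        return None
    for e in POSITIVE + NEGATIVE:
        tweet = tweet.replace(e, "")
    return tweet.strip(), label

print(distant_label("Loved the show :)"))  # ('Loved the show', 'positive')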

Page 35: Experiences with Sentiment Analysis with Peter Zadrozny

Issues with subjectivity:
Pope Benedict XVI announces resignation
Pope 'too frail' to carry on
Pope steps down as head of Catholic church
Pope quits for health reasons

Average size of an RSS headline: 47.8 characters, 7.6 words
Twitter average: 78 characters, 14 words

Training Corpus Considerations

Page 36: Experiences with Sentiment Analysis with Peter Zadrozny

Create a special corpus based on news headlines
Version 1: 100 positive, 100 negative, 100 neutral
Version 2: 200 positive, 200 negative, 200 neutral

Use an existing Twitter corpus
The one included with the Splunk app
University of Michigan

Use a movie review corpus
Pang & Lee: 1,000 positive, 1,000 negative

Training Corpus Strategy

Page 37: Experiences with Sentiment Analysis with Peter Zadrozny

Training Corpus Accuracy

Training Corpus  Size                Accuracy  Margin of Error
Headlines V1     300 headlines       38.89%    1.02%
Headlines V2     600 headlines       47.22%    1.05%
Splunk Twitter   228,000 tweets      40.80%    1.16%
U of Michigan    1.5 million tweets  43.81%    1.11%
Movie Reviews    2,000 reviews       36.79%    1.23%

Page 38: Experiences with Sentiment Analysis with Peter Zadrozny

The World Sentiment Indicator

Page 39: Experiences with Sentiment Analysis with Peter Zadrozny

The key to accuracy is the quality of the training data
Train with the same kind of data you will analyze
A larger training corpus improves accuracy
The subjectivity of crowd-sourcing tends to even out as the amount of training data increases

All machine learning tools tend to converge to similar levels of accuracy
Use the easiest one for you

Conclusions

Page 40: Experiences with Sentiment Analysis with Peter Zadrozny

Questions?