Post on 27-Jan-2015
Peter Zadrozny
The contents of this presentation are part of the book “Big Data Analytics Using Splunk” by Peter Zadrozny and Raghu Kodali
Introduction
The technical side
The Splunk sentiment analysis app
The world sentiment indicator project
Conclusions
Agenda
Introduction
Sentiment Analysis
Is the process of examining text or speech to find out the opinions, views, or feelings of the author or speaker
This definition applies to a computer system
When a human does this, it's called reading
The words in the title describe highly subjective and ambiguous concepts for a human
Even more challenging for a computer program
Opinions, Views, Beliefs, Convictions
Words or expressions have different meanings depending on the knowledge domain (domain of expertise)
Example: "Go Around"
Sarcasm, jokes, etc.
Domains of expertise usually have slang
Conclusion: sentiment is contextual and domain dependent
Opinions, Views, Beliefs, Convictions
Analysis tends to be done by
Domain of expertise
Media channel
Newspaper articles follow grammar rules, use proper words, and have no orthographical mistakes
Tweets lack sentence structure, likely use slang, include emoticons, and sometimes lengthen words ("I looooooove chocolate")
Sentiment Analysis
Companies want to know what their
Customers
Competitors
General public
Think about their
Products
Services
Brands
Usually associated with marketing and public relations
Commercial Uses
When done correctly, sentiment analysis is powerful
"From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series", O'Connor et al., 2010
Analysis of surveys on consumer confidence and political opinion correlates with sentiment word frequencies in Twitter by as much as 80%
These results highlight the potential of text streams as a substitute and supplement for traditional polling.
Commercial Uses
When not well done…
"The Hathaway Effect: How Anne Gives Warren Buffett a Rise", Dan Mirvish, Huffington Post, 2011
Suspicions that some robotic trading programs on Wall Street include sentiment analysis
Every time Anne Hathaway makes the headlines, the stock of Warren Buffett's company Berkshire Hathaway goes up
Commercial Uses
The Technical Side
Sentiment analysis is text categorization
The results fall into two categories
Polarity: positive, negative, neutral
Range of polarity: ratings or rankings
Example: 1 to 5 stars for movie reviews
The Technical Side
Extracting and categorizing sentiment is based on "features"
Frequency: words that appear most often decide the polarity
Term presence: the most unique words define polarity
N-grams: the position of a word determines polarity
Parts of speech: adjectives define the polarity
Syntax: attempts to analyze syntactic relations haven't been very successful
Negation: explicit negation terms reverse polarity
Text classifiers tend to use combinations of features
The Technical Side
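The feature types listed above can be made concrete with a small sketch. This is illustrative Python, not the extractor any of these tools actually uses; whitespace tokenization, the NOT_ prefix for negation, and the bigram choice are all simplifying assumptions.

```python
# Sketch: combine frequency, negation, and bigram (N-gram) features.
NEGATIONS = {"not", "no", "never", "n't"}

def extract_features(text):
    tokens = text.lower().split()
    features = {}
    negated = False
    for tok in tokens:
        # Negation feature: mark words that follow a negation term,
        # since explicit negation reverses polarity.
        key = ("NOT_" + tok) if negated else tok
        features[key] = features.get(key, 0) + 1  # frequency feature
        if tok in NEGATIONS:
            negated = True
    # Bigrams capture some of the word-position context N-grams provide.
    for a, b in zip(tokens, tokens[1:]):
        bigram = a + " " + b
        features[bigram] = features.get(bigram, 0) + 1
    return features
```

A classifier would consume these feature counts; the slide's point is that combinations of such features work better than any single one.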
To assign contextual polarity, you need a base polarity
Use a lexicon, which provides a polarity for each word
Word → Phrase → Sentence → Document
Use training documents (preferred)
The Technical Side
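A minimal sketch of the lexicon approach, aggregating polarity from word to sentence to document; the lexicon entries and weights here are invented for illustration, and real lexicons are far larger.

```python
# Toy lexicon: each word carries a base polarity score (made-up values).
LEXICON = {"love": 2, "great": 1, "bad": -1, "hate": -2}

def score_sentence(sentence):
    # Sentence polarity: sum of the base polarities of its words.
    return sum(LEXICON.get(w, 0) for w in sentence.lower().split())

def score_document(document):
    # Document polarity aggregates sentence scores; the sign gives the class.
    total = sum(score_sentence(s) for s in document.split("."))
    return "positive" if total > 0 else "negative" if total < 0 else "neutral"
```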
Training documents
Contain a number of sentences
Are classified with a specific polarity
Polarity for each word is based on a combination of feature extractors and its appearance in the different classifications
The more sentences, the more accurate
Results are placed in a "model"
The Technical Side
Machine learning tools
Naïve Bayes Classifier
Generally uses N-grams, frequency, and term presence; sometimes parts of speech
Maximum Entropy
Naïve Bayes assumes each feature is independent; Maximum Entropy does not
Allows for overlap of words
Support Vector Machines
One vector per feature
Linear, polynomial, sigmoid, and other functions are applied to the vectors
The Technical Side
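As a rough sketch of the first of these tools, here is a hand-rolled multinomial Naïve Bayes over unigram frequencies with add-one smoothing. It is not the Splunk app's implementation; a real classifier would add the other features listed above.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial Naive Bayes: unigram counts, add-one smoothing."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)  # per-class word frequencies
        self.class_counts = Counter(labels)      # class priors
        self.vocab = set()
        for doc, label in zip(docs, labels):
            words = doc.lower().split()
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, doc):
        best, best_lp = None, float("-inf")
        n_docs = sum(self.class_counts.values())
        for label in self.class_counts:
            lp = math.log(self.class_counts[label] / n_docs)  # log prior
            total = sum(self.word_counts[label].values())
            for w in doc.lower().split():
                # "Naive" independence: multiply per-word likelihoods,
                # smoothed so unseen words don't zero out the class.
                lp += math.log((self.word_counts[label][w] + 1)
                               / (total + len(self.vocab)))
            if lp > best_lp:
                best, best_lp = label, lp
        return best
```

Training on a handful of labeled sentences and predicting on new text mirrors the trainer/model/processor flow the deck diagrams next.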
The Technical Side
[Diagram: a Training Corpus of positive, negative, and neutral documents feeds a Trainer, which produces a Model; a Testing Corpus of positive, negative, and neutral documents feeds a Tester, which reports Accuracy & Margin of Error; a Document passes through the Processor, which applies the Model and outputs its Sentiment]
The Splunk Sentiment Analysis App
Based on the Naïve Bayes Classifier
Has three commands: sentiment, language, token
Includes a training/testing program and two models
Twitter: 190,862 positive and 37,469 negative tweets
IMDb: range of polarity from 1 to 10; each ranking has 11 movie reviews, averaging 200 words
The Splunk Sentiment Analysis App
index=twitter lang=en
| where like(text, "%love%")
| sentiment twitter text
| stats avg(sentiment)
The Splunk Sentiment Analysis App
Love, Hate and Justin Bieber
index=twitter lang=en
| rename entities.hashtags{}.text as hashtags
| fields text, hashtags
| mvexpand hashtags
| where like(hashtags, "Beliebers")
| sentiment twitter text
| stats avg(sentiment)
The Beliebers Search
The same search, step by step:
rename: so that we don't have to type entities.hashtags{}.text every time we want to refer to a hashtag, rename this multi-value field to hashtags
fields: we only want the fields that contain the tweet and the hashtags
mvexpand: expand the values of this multi-value field into separate Splunk events
The training corpus is key to accuracy
Beware: Naïve Bayes is not an exact algorithm
The best accuracy obtained using Naïve Bayes is approximately 83%
Key factors to increase accuracy:
Similarity to the data being analyzed
Size of the corpus
Training and Testing Data
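The tables in this section report accuracy together with a margin of error, but the deck does not say how that margin was computed. A standard way to attach one to an accuracy figure is the 95% normal-approximation half-width for a proportion, assuming a test set of n labeled examples; this is a sketch of that convention, not necessarily the calculation behind the deck's numbers.

```python
import math

def margin_of_error(accuracy, n, z=1.96):
    """95% confidence half-width for a proportion (normal approximation)."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n)
```

The test-set sizes behind the tables are not given, so their margins cannot be checked against this formula.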
Test Data               | Size        | Accuracy | Margin of Error
University of Michigan  | 1.5 million | 72.49%   | 1.05%
Splunk                  | 228,000     | 68.79%   | 1.12%
Sanders                 | 5,500       | 60.61%   | 0.76%
Love, Hate & Justin Bieber: Sanders Model
The World Sentiment Indicator Project
Based on news headlines
From news web sites all around the world
Collecting RSS feeds in English
The World Sentiment Indicator
Steps for this project:
1. Collect the RSS feeds
2. Index the headlines into Splunk
3. Define the sentiment corpus
4. Create a visualization of the results
The World Sentiment Indicator
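Step 1 of the list above can be sketched with the Python standard library. A real collector would fetch each feed URL on a schedule (e.g. with urllib) before handing the headlines to Splunk; here a sample RSS 2.0 string keeps the example self-contained, and the feed content is made up.

```python
import xml.etree.ElementTree as ET

# Sample RSS 2.0 feed (invented content, for illustration only).
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Example News</title>
  <item><title>Pope steps down as head of Catholic church</title></item>
  <item><title>Markets rally on jobs report</title></item>
</channel></rss>"""

def headline_titles(feed_xml):
    """Extract headline titles from an RSS 2.0 feed string."""
    root = ET.fromstring(feed_xml)
    # RSS 2.0 places each headline at channel/item/title.
    return [item.findtext("title") for item in root.iter("item")]
```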
Collecting the RSS Feeds
Create your own
Crowd-source: University of Michigan ‒ Kaggle competition
Bootstrap: "Twitter Sentiment Classification Using Distant Supervision", Go et al., 2010
Uses emoticons to classify tweets
Accuracy for unigrams and bigrams:
Naïve Bayes 82.7%
Maximum Entropy 82.7%
Support Vector Machine 81.6%
Training Corpus Creation
Issues with subjectivity:
Pope Benedict XVI announces resignation
Pope 'too frail' to carry on
Pope steps down as head of Catholic church
Pope quits for health reasons
Average size of an RSS headline: 47.8 characters, 7.6 words
Twitter average: 78 characters, 14 words
Training Corpus Considerations
Create a special corpus based on news headlines
Version 1: 100 positive, 100 negative, 100 neutral
Version 2: 200 positive, 200 negative, 200 neutral
Use an existing Twitter corpus
The one included with the Splunk app
University of Michigan
Use a movie review corpus
Pang & Lee: 1,000 positive, 1,000 negative
Training Corpus Strategy
Training Corpus Accuracy
Training Corpus | Size               | Accuracy | Margin of Error
Headlines V1    | 300 headlines      | 38.89%   | 1.02%
Headlines V2    | 600 headlines      | 47.22%   | 1.05%
Splunk Twitter  | 228,000 tweets     | 40.80%   | 1.16%
U of Michigan   | 1.5 million tweets | 43.81%   | 1.11%
Movie Reviews   | 2,000 reviews      | 36.79%   | 1.23%
The World Sentiment Indicator
The key to accuracy is the quality of the training data
Train with the same kind of data you will analyze
A larger training corpus improves accuracy
The subjectivity of crowd-sourcing tends to even out as the amount of training data increases
All machine learning tools tend to converge to similar levels of accuracy
Use the easiest one for you
Conclusions
Questions?