Post on 27-Jan-2015
Peter Zadrozny
The contents of this presentation are part of the book “Big Data Analytics Using Splunk” by Peter Zadrozny and Raghu Kodali
Introduction
The technical side
The Splunk sentiment analysis app
The world sentiment indicator project
Conclusions
Agenda
Introduction
Sentiment Analysis
Is the process of examining text or speech to find out the opinions, views, or feelings of the author or speaker
This definition applies to a computer system
When a human does this, it's called reading
The words in the title describe highly subjective and ambiguous concepts for a human
Even more challenging for a computer program
Opinions, Views, Beliefs, Convictions
Words or expressions have different meanings depending on the knowledge domain (domain of expertise)
Example: "Go Around"
Sarcasm, jokes, etc.
Domains of expertise usually have slang
Conclusion: sentiment is contextual and domain dependent
Opinions, Views, Beliefs, Convictions
Analysis tends to be done by
Domain of expertise
Media channel
Newspaper articles follow grammar rules, use proper words, and have no orthographical mistakes
Tweets lack sentence structure, likely use slang, include emoticons, and sometimes lengthen words ("I looooooove chocolate")
Sentiment Analysis
Companies want to know what their
Customers
Competitors
General public
Think about their
Products
Services
Brands
Usually associated with marketing and public relations
Commercial Uses
When done correctly, sentiment analysis is powerful
"From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series", O'Connor et al., 2010
Analysis of surveys on consumer confidence and political opinion correlates with sentiment word frequencies in Twitter by as much as 80%
These results highlight the potential of text streams as a substitute and supplement for traditional polling.
Commercial Uses
When not well done…
"The Hathaway Effect: How Anne Gives Warren Buffett a Rise", Dan Mirvish, Huffington Post, 2011
Suspicions that some robotic trading programs on Wall Street include sentiment analysis
Every time Anne Hathaway makes the headlines, the stock of Warren Buffett's company Berkshire Hathaway goes up
Commercial Uses
The Technical Side
Sentiment analysis is text categorization
The results fall into two categories
Polarity: positive, negative, neutral
Range of polarity: ratings or rankings
Example: 1 to 5 stars for movie reviews
The Technical Side
Extracting and categorizing sentiment is based on "features"
Frequency: words that appear most often decide the polarity
Term presence: the most unique words define polarity
N-grams: the position of a word determines polarity
Parts of speech: adjectives define the polarity
Syntax: attempts to analyze syntactic relations haven't been very successful
Negation: explicit negation terms reverse polarity
Text classifiers tend to use combinations of features
The Technical Side
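The feature types listed above can be made concrete with a small sketch. This is illustrative Python, not the extractor any of these tools actually uses; whitespace tokenization, the NOT_ prefix for negation, and the bigram choice are all simplifying assumptions.

```python
# Sketch: combine frequency, negation, and bigram (N-gram) features.
NEGATIONS = {"not", "no", "never", "n't"}

def extract_features(text):
    tokens = text.lower().split()
    features = {}
    negated = False
    for tok in tokens:
        # Negation feature: mark words that follow a negation term,
        # since explicit negation reverses polarity.
        key = ("NOT_" + tok) if negated else tok
        features[key] = features.get(key, 0) + 1  # frequency feature
        if tok in NEGATIONS:
            negated = True
    # Bigrams capture some of the word-position context N-grams provide.
    for a, b in zip(tokens, tokens[1:]):
        bigram = a + " " + b
        features[bigram] = features.get(bigram, 0) + 1
    return features
```

A classifier would consume these feature counts; the slide's point is that combinations of such features work better than any single one.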
To assign contextual polarity, you need a base polarity
Use a lexicon, which provides a polarity for each word
Word → Phrase → Sentence → Document
Use training documents (preferred)
The Technical Side
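A minimal sketch of the lexicon approach, aggregating polarity from word to sentence to document; the lexicon entries and weights here are invented for illustration, and real lexicons are far larger.

```python
# Toy lexicon: each word carries a base polarity score (made-up values).
LEXICON = {"love": 2, "great": 1, "bad": -1, "hate": -2}

def score_sentence(sentence):
    # Sentence polarity: sum of the base polarities of its words.
    return sum(LEXICON.get(w, 0) for w in sentence.lower().split())

def score_document(document):
    # Document polarity aggregates sentence scores; the sign gives the class.
    total = sum(score_sentence(s) for s in document.split("."))
    return "positive" if total > 0 else "negative" if total < 0 else "neutral"
```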
Training documents
Contain a number of sentences
Are classified with a specific polarity
Polarity for each word is based on a combination of feature extractors and its appearance in the different classifications
The more sentences, the more accurate
Results are placed in a "model"
The Technical Side
Machine learning tools
Naïve Bayes Classifier
Generally uses N-grams, frequency, and term presence; sometimes parts of speech
Maximum Entropy
Naïve Bayes assumes each feature is independent; Maximum Entropy does not
Allows for overlap of words
Support Vector Machines
One vector per feature
Linear, polynomial, sigmoid, and other functions are applied to the vectors
The Technical Side
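As a rough sketch of the first of these tools, here is a hand-rolled multinomial Naïve Bayes over unigram frequencies with add-one smoothing. It is not the Splunk app's implementation; a real classifier would add the other features listed above.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial Naive Bayes: unigram counts, add-one smoothing."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)  # per-class word frequencies
        self.class_counts = Counter(labels)      # class priors
        self.vocab = set()
        for doc, label in zip(docs, labels):
            words = doc.lower().split()
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, doc):
        best, best_lp = None, float("-inf")
        n_docs = sum(self.class_counts.values())
        for label in self.class_counts:
            lp = math.log(self.class_counts[label] / n_docs)  # log prior
            total = sum(self.word_counts[label].values())
            for w in doc.lower().split():
                # "Naive" independence: multiply per-word likelihoods,
                # smoothed so unseen words don't zero out the class.
                lp += math.log((self.word_counts[label][w] + 1)
                               / (total + len(self.vocab)))
            if lp > best_lp:
                best, best_lp = label, lp
        return best
```

Training on a handful of labeled sentences and predicting on new text mirrors the trainer/model/processor flow the deck diagrams next.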
The Technical Side
[Diagram: a Training Corpus of positive, negative, and neutral documents feeds a Trainer, which produces a Model; a Testing Corpus of positive, negative, and neutral documents feeds a Tester, which reports Accuracy & Margin of Error; a Document passes through the Processor, which applies the Model and outputs its Sentiment]
The Splunk Sentiment Analysis App
Based on the Naïve Bayes Classifier
Has three commands: sentiment, language, token
Includes a training/testing program and two models
Twitter: 190,862 positive and 37,469 negative tweets
IMDb: range of polarity from 1 to 10; each ranking has 11 movie reviews, averaging 200 words
The Splunk Sentiment Analysis App
index=twitter lang=en
| where like(text, "%love%")
| sentiment twitter text
| stats avg(sentiment)
The Splunk Sentiment Analysis App
Love, Hate and Justin Bieber
index=twitter lang=en
| rename entities.hashtags{}.text as hashtags
| fields text, hashtags
| mvexpand hashtags
| where like(hashtags, "Beliebers")
| sentiment twitter text
| stats avg(sentiment)
The Beliebers Search
The same search, step by step:
rename: so that we don't have to type entities.hashtags{}.text every time we want to refer to a hashtag, rename this multi-value field to hashtags
fields: we only want the fields that contain the tweet and the hashtags
mvexpand: expand the values of this multi-value field into separate Splunk events
The training corpus is key to accuracy
Beware: Naïve Bayes is not an exact algorithm
The best accuracy obtained using Naïve Bayes is approximately 83%
Key factors to increase accuracy:
Similarity to the data being analyzed
Size of the corpus
Training and Testing Data
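The tables in this section report accuracy together with a margin of error, but the deck does not say how that margin was computed. A standard way to attach one to an accuracy figure is the 95% normal-approximation half-width for a proportion, assuming a test set of n labeled examples; this is a sketch of that convention, not necessarily the calculation behind the deck's numbers.

```python
import math

def margin_of_error(accuracy, n, z=1.96):
    """95% confidence half-width for a proportion (normal approximation)."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n)
```

The test-set sizes behind the tables are not given, so their margins cannot be checked against this formula.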
Test Data               | Size        | Accuracy | Margin of Error
University of Michigan  | 1.5 million | 72.49%   | 1.05%
Splunk                  | 228,000     | 68.79%   | 1.12%
Sanders                 | 5,500       | 60.61%   | 0.76%
Love, Hate & Justin Bieber: Sanders Model
The World Sentiment Indicator Project
Based on news headlines
From news web sites all around the world
Collecting RSS feeds in English
The World Sentiment Indicator
Steps for this project:
1. Collect the RSS feeds
2. Index the headlines into Splunk
3. Define the sentiment corpus
4. Create a visualization of the results
The World Sentiment Indicator
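Step 1 of the list above can be sketched with the Python standard library. A real collector would fetch each feed URL on a schedule (e.g. with urllib) before handing the headlines to Splunk; here a sample RSS 2.0 string keeps the example self-contained, and the feed content is made up.

```python
import xml.etree.ElementTree as ET

# Sample RSS 2.0 feed (invented content, for illustration only).
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Example News</title>
  <item><title>Pope steps down as head of Catholic church</title></item>
  <item><title>Markets rally on jobs report</title></item>
</channel></rss>"""

def headline_titles(feed_xml):
    """Extract headline titles from an RSS 2.0 feed string."""
    root = ET.fromstring(feed_xml)
    # RSS 2.0 places each headline at channel/item/title.
    return [item.findtext("title") for item in root.iter("item")]
```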
Collecting the RSS Feeds
Create your own
Crowd-source: University of Michigan ‒ Kaggle competition
Bootstrap: "Twitter Sentiment Classification Using Distant Supervision", Go et al., 2010
Uses emoticons to classify tweets
Accuracy for unigrams and bigrams:
Naïve Bayes 82.7%
Maximum Entropy 82.7%
Support Vector Machine 81.6%
Training Corpus Creation
Issues with subjectivity:
Pope Benedict XVI announces resignation
Pope 'too frail' to carry on
Pope steps down as head of Catholic church
Pope quits for health reasons
Average size of an RSS headline: 47.8 characters, 7.6 words
Twitter average: 78 characters, 14 words
Training Corpus Considerations
Create a special corpus based on news headlines
Version 1: 100 positive, 100 negative, 100 neutral
Version 2: 200 positive, 200 negative, 200 neutral
Use an existing Twitter corpus
The one included with the Splunk app
University of Michigan
Use a movie review corpus
Pang & Lee: 1,000 positive, 1,000 negative
Training Corpus Strategy
Training Corpus Accuracy
Training Corpus | Size               | Accuracy | Margin of Error
Headlines V1    | 300 headlines      | 38.89%   | 1.02%
Headlines V2    | 600 headlines      | 47.22%   | 1.05%
Splunk Twitter  | 228,000 tweets     | 40.80%   | 1.16%
U of Michigan   | 1.5 million tweets | 43.81%   | 1.11%
Movie Reviews   | 2,000 reviews      | 36.79%   | 1.23%
The World Sentiment Indicator
The key to accuracy is the quality of the training data
Train with the same kind of data you will analyze
A larger training corpus improves accuracy
The subjectivity of crowd-sourcing tends to even out as the amount of training data increases
All machine learning tools tend to converge to similar levels of accuracy
Use the easiest one for you
Conclusions
Questions?