AVAYA: Sentiment Analysis in Twitter with Self-Training and Polarity Lexicon Expansion
Lee Becker, George Erhart, David Skiba, and Valentine Matula
Avaya Labs
June 16, 2013
SemEval 2013 Task 2

Participation
Subtasks:
• A: Contextual Polarity Disambiguation
• B: Message Polarity Classification
Training Conditions:
• Constrained
• Unconstrained
Testing Conditions:
• Tweet
• SMS

Guiding Intuitions
• Boost recall of positive/negative instances (A, B)
• Don’t worry about neutral instances (A, B)
• Encode polarity cues into features (A, B)
• Exploit the context (A)

System Overview: Task B Constrained
[Diagram components: Sentiment Labeled Tweets, Feature Extraction, Polarity Lexicon, Constrained Model]

System Overview: Task B Unconstrained
[Diagram components: Unlabeled Tweets, Auto-Labeled Tweets, Expanded Polarity Lexicon, Feature Extraction, Unconstrained Model, Constrained Model]

Overview: Task A Models
[Diagram components, constrained: Sentiment Labeled Contexts, Feature Extraction, Polarity Lexicon, Constrained Model]
[Diagram components, unconstrained: Sentiment Labeled Contexts, Feature Extraction, Expanded Polarity Lexicon, Unconstrained Model]

Preprocessing
• Normalization:
  o URLs
  o @Mentions
• NLP Pipeline
  o Written in ClearTK framework
  o ClearNLP Wrappers
    • Tokenization (preserves emoticons and URLs)
    • POS Tagging
    • Lemmatization
    • Dependency Parsing
  o PTB POS -> ArkTweet POS (Gimpel et al., 2011)
  o Dependencies -> Collapsed Dependencies
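
A minimal sketch of the normalization step, assuming each URL and @mention is replaced by a single placeholder token. The placeholder strings and the regular expressions are illustrative assumptions; the slides do not say what the system actually emits.

```python
import re

# Hypothetical placeholder tokens (not taken from the slides).
URL_TOKEN = "<URL>"
MENTION_TOKEN = "<MENTION>"

# Simple illustrative patterns: http(s)/www-style URLs and @handles.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
MENTION_RE = re.compile(r"(?<!\w)@\w+")  # lookbehind avoids e-mail addresses


def normalize(tweet: str) -> str:
    """Replace URLs and @mentions with placeholder tokens."""
    tweet = URL_RE.sub(URL_TOKEN, tweet)
    return MENTION_RE.sub(MENTION_TOKEN, tweet)


print(normalize("@lee check http://leebecker.com/resources/semeval-2013/ now"))
```

Collapsing every URL and handle to one token keeps the bag-of-words vocabulary from filling up with strings that occur only once.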

Resources
• MPQA Subjectivity Lexicon (Wilson, Wiebe, and Hoffmann, 2005)
• Hand-Crafted Negation Word Dictionary
• Hand-Crafted Emoticon Polarity Dictionary
http://leebecker.com/resources/semeval-2013/

Task B Features
• Polarized Bag-of-Words
  o Easy way to double the feature space (e.g. happy & NOT_happy)
  o Example sentence (negation window marked on the slide):
    I am not too happy about this, but I’m still pumped and thrilled for tomorrow.
  o Features:
    • Token
    • Token + PTB POS
    • Token + Simplified POS
    • Lemma
    • Lemma + PTB POS
    • Lemma + Simplified POS
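
The polarized bag-of-words idea can be sketched as follows. This is only an illustration: the negation word list, the regex tokenizer, and the rule that a negation window closes at the next punctuation mark are all assumptions, not details given on the slide.

```python
import re
from typing import List

# Hypothetical negation cues; the system uses its own hand-crafted dictionary.
NEGATION_WORDS = {"not", "no", "never", "cannot"}
# Assumption: a negation window closes at the next punctuation token.
WINDOW_END = {".", ",", ";", ":", "!", "?"}


def polarized_tokens(text: str) -> List[str]:
    """Prefix every token inside a negation window with NOT_."""
    tokens = re.findall(r"[\w']+|[.,;:!?]", text.lower())
    out, negated = [], False
    for tok in tokens:
        if tok in WINDOW_END:
            negated = False          # punctuation ends the window
            out.append(tok)
        elif tok in NEGATION_WORDS:
            negated = True           # the cue itself is left unmarked
            out.append(tok)
        else:
            out.append("NOT_" + tok if negated else tok)
    return out


print(polarized_tokens("I am not too happy about this, but I'm still pumped."))
```

Under these assumptions "happy" becomes NOT_happy while "pumped" stays unmarked, so the classifier sees them as separate features.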

Task B Features
• Message Polarity Features
  o Word Sentiment Counts (pos|neg)
  o Emoticon Sentiment Counts (pos|neg)
  o Net word polarity
  o Net emoticon polarity
• Microblogging Features
  o ALL CAPS word counts
  o Counts of words with repeated characters (yaaaaay, booooo)
  o Emphasis (*yes*)
  o Winning sports score (Nuggets 15-0)
• PTB POS Tag counts
• Collapsed Dependency Relations
  o Incorporated negation
  o Text-Text
  o Lemma+Simplified POS - Lemma+Simplified POS
  o POS - Lemma
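
A rough sketch of how the message polarity and microblogging counts might be computed from a token list. The tiny lexicons, the feature names, the regular expressions, and the definition of net polarity as positive count minus negative count are assumptions made for illustration; the sports-score feature is left out.

```python
import re
from typing import Dict, List

# Tiny stand-in lexicons; the system draws on MPQA and hand-crafted dictionaries.
WORD_POLARITY = {"happy": 1, "thrilled": 1, "awful": -1}
EMOTICON_POLARITY = {":)": 1, ":(": -1}

ELONGATED_RE = re.compile(r"(\w)\1{2,}")   # same character three or more times
EMPHASIS_RE = re.compile(r"^\*\w+\*$")     # *yes*


def message_features(tokens: List[str]) -> Dict[str, int]:
    """Count polarity cues and microblogging markers in one message."""
    words = [WORD_POLARITY.get(t.lower(), 0) for t in tokens]
    emos = [EMOTICON_POLARITY.get(t, 0) for t in tokens]
    return {
        "pos_words": sum(p > 0 for p in words),
        "neg_words": sum(p < 0 for p in words),
        "net_word_polarity": sum(words),
        "pos_emoticons": sum(p > 0 for p in emos),
        "neg_emoticons": sum(p < 0 for p in emos),
        "net_emoticon_polarity": sum(emos),
        "all_caps": sum(len(t) > 1 and t.isalpha() and t.isupper() for t in tokens),
        "elongated": sum(bool(ELONGATED_RE.search(t)) for t in tokens),
        "emphasis": sum(bool(EMPHASIS_RE.match(t)) for t in tokens),
    }


print(message_features(["SO", "happy", "yaaaaay", "*yes*", ":)", "awful"]))
```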

Task B: Constrained Model
• LIBLINEAR with logistic regression loss function
• Heavily boosted negative-polarity instances
  o w_positive = 1
  o w_negative = 25
  o w_neutral = 1
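
LIBLINEAR's per-class weight option scales the penalty parameter C for the given class, so errors on a heavily weighted class cost more during training. The sketch below is a simplified illustration of that cost-sensitive idea, not LIBLINEAR's exact one-vs-rest objective; the base penalty `C` and the function name are assumptions.

```python
import math

# Class weights from the slide; C is a hypothetical base penalty.
CLASS_WEIGHTS = {"positive": 1.0, "negative": 25.0, "neutral": 1.0}
C = 1.0


def weighted_log_loss(examples):
    """Sum per-instance log losses, each scaled by C * weight of the gold class.

    `examples` holds (gold_label, probability_assigned_to_gold_label) pairs.
    """
    return sum(-C * CLASS_WEIGHTS[label] * math.log(p) for label, p in examples)


# An equally uncertain prediction costs 25 times more on a negative instance.
print(weighted_log_loss([("negative", 0.5)]) / weighted_log_loss([("positive", 0.5)]))
```

With weights like these, the learner trades some precision on the negative class for higher recall, which matches the stated intuition of boosting recall on polar instances.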

Polarity Lexicon Expansion: Pointwise Mutual Information
• Based on Semantic Orientation for Sentiment (Turney, 2002)
• Intuition: Utilize co-occurrence statistics to measure words’ dependence/independence with a polarity.

\[ \mathrm{PMI}(\text{word}, \text{sentiment}) = \log_2 \frac{p(\text{word}, \text{sentiment})}{p(\text{word})\, p(\text{sentiment})} \]

\[ \mathrm{polarity}(\text{word}) = \operatorname{sgn}\big(\mathrm{PMI}(\text{word}, \text{positive}) - \mathrm{PMI}(\text{word}, \text{negative})\big) \]
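
The two formulas can be sketched in code as follows, assuming probabilities are estimated from token counts over a labeled corpus. The counting scheme, the toy corpus, and the lack of smoothing (a zero joint count yields negative infinity) are assumptions for illustration.

```python
import math
from collections import Counter


def count(corpus):
    """Collect token-level counts from (tokens, label) pairs."""
    joint, words, labels = Counter(), Counter(), Counter()
    for tokens, label in corpus:
        for tok in tokens:
            joint[(tok, label)] += 1
            words[tok] += 1
            labels[label] += 1
    return joint, words, labels, sum(words.values())


def pmi(word, sentiment, joint, words, labels, total):
    """log2( p(word, sentiment) / (p(word) * p(sentiment)) ), unsmoothed."""
    if joint[(word, sentiment)] == 0:
        return float("-inf")
    p_joint = joint[(word, sentiment)] / total
    return math.log2(p_joint / ((words[word] / total) * (labels[sentiment] / total)))


def polarity(word, *stats):
    """sgn(PMI(word, positive) - PMI(word, negative)) as +1, 0, or -1."""
    diff = pmi(word, "positive", *stats) - pmi(word, "negative", *stats)
    return (diff > 0) - (diff < 0)


# Toy corpus, invented for the example.
stats = count([
    (["great", "day"], "positive"),
    (["great", "game"], "positive"),
    (["awful", "day"], "negative"),
])
print(polarity("great", *stats), polarity("awful", *stats))
```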

Polarity Lexicon Expansion: From Tweets to Lexicon
• Differences from Turney (2002)
  o Classifier output instead of seed words
  o Words instead of word phrases
• Procedure
  o Applied to ~475k unlabeled tweets
  o Filtered and balanced corpus via classifier confidence score thresholds
    • 50,789 positive instances (> 0.9)
    • 59,029 negative instances (> 0.7)
    • 70,601 neutral instances (> 0.8)
  o Removed:
    • f(word) < 10
    • neutral polarity words
    • single character words (‘a’, ‘j’, ‘I’, etc.)
    • numbers (1, 20, 1000)
    • punctuation
  o Merged with MPQA subjectivity lexicon
Final lexicon size: 11,740 entries
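
The filtering steps above might look roughly like this in code. The function names and the exact tests for numbers and punctuation are hypothetical; the thresholds and the minimum frequency come from the slide. Corpus balancing and the merge with MPQA are not shown.

```python
# Confidence thresholds from the slide, keyed by predicted label.
THRESHOLDS = {"positive": 0.9, "negative": 0.7, "neutral": 0.8}
MIN_FREQ = 10  # words with f(word) < 10 are removed


def keep_instance(label: str, confidence: float) -> bool:
    """Keep an auto-labeled tweet only if the classifier is confident enough."""
    return confidence > THRESHOLDS[label]


def keep_word(word: str, freq: int, polarity: str) -> bool:
    """Apply the removal criteria to a candidate lexicon entry."""
    if freq < MIN_FREQ or polarity == "neutral":
        return False
    if len(word) == 1:                           # single-character words
        return False
    if word.isdigit():                           # numbers such as 1, 20, 1000
        return False
    if not any(ch.isalnum() for ch in word):     # pure punctuation
        return False
    return True


print(keep_instance("positive", 0.95), keep_word("happy", 12, "positive"))
```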

Task B: Unconstrained Model
• Self-trained model
  o ~470k instances produced by the constrained model
  o ~10k original instances
• Expanded polarity lexicon
• Heavily discounted neutral instances
  o w_positive = 2
  o w_negative = 5
  o w_neutral = 0.1

Task B Results
System                 F_pos   F_neg   F_neu   F_avg(+/-)   Rank
Tweet
NRC-Canada             .733    .647    .744    .690         1
Avaya-Unconstrained    .700    .582    .713    .641         5
Avaya-Constrained      .669    .548    .608    .608         12
Mean                   .626    .450    .538    .538         -
SMS
NRC-Canada             .730    .639    .799    .685         1
Avaya-Constrained      .648    .553    .778    .600         4
Avaya-Unconstrained    .633    .557    .759    .595         5
Mean                   .546    .456    .627    .501         -
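
The averaged score in the results is the mean of the positive and negative F-scores; the neutral F-score does not enter it. A quick check against three of the reported rows (the function name is just for illustration):

```python
def f_avg(f_pos: float, f_neg: float) -> float:
    """Average of the positive-class and negative-class F-scores."""
    return (f_pos + f_neg) / 2


# (F_pos, F_neg, reported average) taken from the Task B results.
rows = {
    "NRC-Canada (Tweet)": (0.733, 0.647, 0.690),
    "Avaya-Unconstrained (Tweet)": (0.700, 0.582, 0.641),
    "Avaya-Unconstrained (SMS)": (0.633, 0.557, 0.595),
}
for name, (f_pos, f_neg, reported) in rows.items():
    print(f"{name}: computed {f_avg(f_pos, f_neg):.3f}, reported {reported:.3f}")
```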

Task A: Features
• Same as Task B
  o Polarized Bag of Words
  o Contextual Polarity Features
    • Word Sentiment Counts (pos|neg)
    • Emoticon Sentiment Counts (pos|neg)
    • Net word polarity
    • Net emoticon polarity
  o Microblogging Features
  o PTB POS tags
• Additional Features:
  o Scoped Dependencies
  o Dependency Paths

Task A Features: Scoped Dependencies
Example: You do not want to miss this tomorrow night.
[Dependency arcs shown on the slide: root, nsubj, neg, xcomp, aux, tmod]
• OUT_neg_nsubj(want, you)
• OUT_neg(want, not)
• IN_xcomp(want, miss)
• IN_aux(miss, to)
• OUT_tmod(miss, tomorrow)
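
One way to read the IN_/OUT_ prefixes is that a relation is IN when both of its words fall inside the annotated span and OUT when only one does. The sketch below implements that reading only. The target span "want to miss" is an assumption inferred from the listed features, words stand in for token positions, and the extra negation marking seen in OUT_neg_nsubj is not reproduced.

```python
from typing import List, Set, Tuple

Dep = Tuple[str, str, str]  # (relation, head, dependent)


def scoped_features(deps: List[Dep], span: Set[str]) -> List[str]:
    """Prefix each relation with IN_ if both ends lie in the span, else OUT_.

    Relations with neither end inside the span are skipped.
    """
    feats = []
    for rel, head, dep in deps:
        inside = (head in span) + (dep in span)
        if inside == 0:
            continue
        scope = "IN" if inside == 2 else "OUT"
        feats.append(f"{scope}_{rel}({head}, {dep})")
    return feats


deps = [
    ("nsubj", "want", "you"),
    ("neg", "want", "not"),
    ("xcomp", "want", "miss"),
    ("aux", "miss", "to"),
    ("tmod", "miss", "tomorrow"),
]
print(scoped_features(deps, {"want", "to", "miss"}))
```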

Task A Features: Dependency Paths
Example: Criminals killed Sadat and in the process they killed Egypt.
[Dependency arcs shown on the slide: dobj, conj, root]
• POS Path: {NNP} dobj < {VBD} < conj {VBD} < root
• Sentiment POS Path: {^/neutral} < {V/negative} < {V/negative} < {root}
• In Subject: False
• In Object: True

Task A Models
• Constrained: MPQA Subjectivity Lexicon
• Unconstrained: Expanded Polarity Lexicon
• LIBLINEAR
  o w_positive = 11
  o w_negative = 2
  o w_neutral = 1

Task A Results
System                 F_pos   F_neg   F_neu   F_avg(+/-)   Rank
Tweet
NRC-Canada             .910    .869    .110    .889         1
Avaya-Unconstrained    .898    .849    .311    .874         2
Avaya-Constrained      .896    .843    .309    .870         3
Mean                   .773    .677    .115    .725         -
SMS
GUMLTLT                .865    .902    .086    .884         1
Avaya-Unconstrained    .842    .874    .138    .858         3
Avaya-Constrained      .823    .856    .125    .839         4
Mean                   .710    .698    .099    .704         -

Discussion
• Dictionary expansion via supervised sentiment models provides a relatively simple way to expand the feature space and increase coverage.
• Dependency-based features provide additional context and richer information.
• Future work
  o Ablation studies
  o Better tuning of self-training

Thank you!
• Task 2 Organizers and Participants
• SemEval 2013 Organizers
• Anonymous Reviewers