Domain-based Lexicon Enhancement for Sentiment Analysis
A. Muhammad, N. Wiratunga, R. Lothian, R. Glassey
IDEAS Research Institute, Robert Gordon University, Aberdeen
Introduction
• Sentiment Classification
• Sentiment Analysis
– A wider task that also involves identification of:
• Object/Aspects
• Opinion holder
• Time
Text Sentiment Classification
BCS-SGAI-SMA-2013, Cambridge UK
Sentiment Classification
• Machine Learning
The movie is good : +
The movie is horrible : -
I don’t like the movie : -
I love the movie : +
…
Classifier, e.g. NB, SVMs
Model
The movie is nice : ?
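As a rough illustration of the machine learning route, the sketch below trains a tiny multinomial Naive Bayes model on the four labelled sentences from the slide. The function names (`tokenize`, `train`, `predict`), the whitespace tokenizer, the add-one smoothing and the choice to skip unseen words are all assumptions made for this example, not details taken from the talk.

```python
import math
from collections import Counter, defaultdict

# The four labelled training sentences shown on the slide.
TRAIN = [
    ("The movie is good", "+"),
    ("The movie is horrible", "-"),
    ("I don't like the movie", "-"),
    ("I love the movie", "+"),
]

def tokenize(text):
    # Assumption: lowercase and split on whitespace only.
    return text.lower().split()

def train(examples):
    """Count word and class frequencies for a multinomial Naive Bayes model."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return word_counts, class_counts, vocab

def predict(text, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, class_counts, vocab = model
    n_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / n_docs)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train(TRAIN)
print(predict("The movie is good", model))
```

Note that "nice" never occurs in the training sentences, so in this sketch it contributes nothing to the decision for "The movie is nice".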
Sentiment Classif… Cont’d
• Lexicon-Based
Contextual analysis/Aggregation
The movie is nice : ?
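A minimal sketch of the lexicon-based route, assuming a plain sum of per-term scores as the aggregation step. The `LEXICON` entries for "nice", "good" and "horrible" are hypothetical values for illustration, and contextual analysis (for example negation handling) is deliberately left out.

```python
# Toy lexicon: term -> (positive score, negative score).
# The entries for nice/good/horrible are hypothetical illustration values.
LEXICON = {
    "nice": (0.8, 0.2),
    "good": (0.9, 0.1),
    "horrible": (0.05, 0.95),
    "sucks": (0.132, 0.868),
}

def classify(text, lexicon=LEXICON):
    """Sum positive and negative scores of known terms and compare the totals."""
    pos = neg = 0.0
    for token in text.lower().split():
        token = token.strip(".,!?")  # drop surrounding punctuation
        if token in lexicon:
            p, n = lexicon[token]
            pos += p
            neg += n
    if pos == neg:
        return "?"  # no evidence either way
    return "+" if pos > neg else "-"

print(classify("The movie is nice"))
```

Unlike the trained classifier, this approach needs no labelled sentences, but it can only score words that the lexicon covers.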
Lexicon Generation
Manual Corpus Dictionary
• Could be too narrow
• Could be too general
• Ugh!! this movie sucks!
• This movie is fantastic
Sentiment Lexicons
• Dictionary-based: SentiWordNet (Baccianella et al., 2010)
• Corpus-based
– Generated from target domain
– Existing approaches rely on well-formed spelling/grammar
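[Figure: corpus-based lexicon generation. Seed words (good, bad, terrible, nice) are related to corpus words (horrible, happy, affordable, rubbish, enjoyable, mad, …) either through the conjunctions "and" and "but" (Hatzivassiloglou and McKeown, 1997) or through co-occurrence with "excellent" and "poor" (Turney, 2002).]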
Corpus-based lexicon
• Distant supervision (Read 2005; Go et al. 2009)
– Automated approach for labelling
– Based on the appearance of positive and negative emoticons
I’m happy with chocolate on vday
I’m at work today
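The labelling step can be sketched as follows. The emoticon lists and the `label_tweet` helper are assumptions for illustration; the emoticons attached to the two example tweets are also assumed, since they do not survive in this transcript.

```python
# Assumed emoticon sets; the exact lists used in the talk are not given.
POSITIVE = (":)", ":-)", ":D")
NEGATIVE = (":(", ":-(")

def label_tweet(tweet):
    """Return (text without emoticons, label), or None if the tweet is unusable."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:
        return None  # no emoticon at all, or conflicting emoticons
    label = "+" if has_pos else "-"
    text = tweet
    for emoticon in POSITIVE + NEGATIVE:
        text = text.replace(emoticon, "")  # the emoticon is the label, not a feature
    return " ".join(text.split()), label

print(label_tweet("I'm happy with chocolate on vday :)"))
print(label_tweet("I'm at work today :("))
```

Tweets without an emoticon, or with emoticons of both polarities, are dropped here rather than labelled.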
Scores Generation
• Proportion-based
– Scores are compatible with SentiWordNet
$$ds_c(t) = \frac{\sum_{d \in c} tf(t)}{\sum_{AllDocs} tf(t)}$$
Term     + score   - score
ugh      0.077     0.923
sucks    0.132     0.868
luv      0.958     0.042
xoxo     0.792     0.208
…        …         …
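A minimal sketch of proportion-based scoring, reading the score as a term's frequency in one class divided by its frequency over all labelled documents. That reading, the `domain_scores` name, the whitespace tokenizer and the toy tweets are assumptions for illustration.

```python
from collections import Counter

def domain_scores(labelled_tweets):
    """Map each term to (positive score, negative score); the two sum to 1."""
    tf = {"+": Counter(), "-": Counter()}
    for text, label in labelled_tweets:
        tf[label].update(text.lower().split())
    scores = {}
    for term in set(tf["+"]) | set(tf["-"]):
        total = tf["+"][term] + tf["-"][term]
        scores[term] = (tf["+"][term] / total, tf["-"][term] / total)
    return scores

# Hypothetical distant-supervised tweets, not data from the talk.
tweets = [
    ("luv this", "+"),
    ("luv it", "+"),
    ("luv u", "+"),
    ("ugh luv", "-"),
]
scores = domain_scores(tweets)
print(scores["luv"])
```

Because each pair of scores lies in [0, 1] and sums to 1, informal terms such as "luv" or "ugh" can be scored on the same scale as dictionary words.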
Integration with SentiWordNet
• General Scores are extracted from SentiWordNet
$$gs_c(t) = \frac{1}{|senses(t, PoS)|} \sum_{i=1}^{|senses(t, PoS)|} SenseScore_c(t_i)$$
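The general score can be sketched as an average over the senses of a term for a given part of speech. The `general_score` helper and the sense scores in the usage line are hypothetical; real values would be read from SentiWordNet.

```python
def general_score(sense_scores):
    """Average (positive, negative) scores over all senses of a term.

    sense_scores: list of (pos, neg) pairs, one per sense of the term
    for the chosen part of speech. Returns None if the term has no senses.
    """
    if not sense_scores:
        return None  # term is not covered by the general lexicon
    n = len(sense_scores)
    pos = sum(p for p, _ in sense_scores) / n
    neg = sum(q for _, q in sense_scores) / n
    return pos, neg

# Hypothetical scores for a term with two senses.
print(general_score([(0.75, 0.0), (0.25, 0.5)]))
```

Returning None for uncovered terms makes it explicit where only the domain lexicon can supply a score.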
Evaluation
• 20,000 Dist-Sup tweets used to:
– Generate domain lexicon
– Train machine learning classifiers (for comparison)
• 359 hand-labelled tweets used for evaluation
Evaluation Cont’d
• Individual lexicons vs Combined
– General < Domain < Combined
• Difference not significant between Domain and Combined
• Machine learning vs Combined
– SVM < NB < LogReg < Combined
• Difference not significant between LogReg and Combined
Evaluation Cont’d
• Varying data sizes
– Performance improves with increasing size for all except SVM
Conclusions
• Sentiment lexicon is generated using distant-supervision
• Sentiment classification improves with combination of domain-dependent and domain-independent lexicons
• Accuracy of the combination is better than machine learning
Future work
• Lexicon refinement
• Improve aggregation strategy
• Extend approach to other social media platforms
• Extend Dist-sup to neutral labelling
• Experiment with ‘big data’
Thank you for Listening!