Domain-based Lexicon Enhancement for Sentiment Analysis


A. Muhammad, N. Wiratunga, R. Lothian, R. Glassey

IDEAS Research Institute, Robert Gordon University, Aberdeen

Introduction

• Sentiment Classification

• Sentiment Analysis
  – A wider task; involves identification of:
    • Object/Aspects
    • Opinion holder
    • Time

Text Sentiment Classification

BCS-SGAI-SMA-2013, Cambridge UK

Sentiment Classification

• Machine Learning

Training examples:
  The movie is good : +
  The movie is horrible : -
  I don’t like the movie : -
  I love the movie : +
  …

Classifier (e.g. NB, SVMs) → Model

The movie is nice : ?
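The machine learning route can be illustrated with a minimal sketch: a tiny multinomial Naive Bayes classifier trained on the four example sentences from the slide. The function names (`tokenize`, `train_nb`, `classify_nb`), the whitespace tokenisation, and the add-one smoothing are assumptions made for illustration; the slides do not specify the classifier configuration.

```python
import math
from collections import Counter, defaultdict

# Training examples taken from the slide; labels are "+" and "-".
TRAIN = [
    ("The movie is good", "+"),
    ("The movie is horrible", "-"),
    ("I don't like the movie", "-"),
    ("I love the movie", "+"),
]


def tokenize(text):
    # Assumption: lower-cased whitespace tokens are good enough for a sketch.
    return text.lower().split()


def train_nb(examples):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab


def classify_nb(text, model):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # words never seen in training carry no evidence here
            count = word_counts[label][tok]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_nb(TRAIN)
print(classify_nb("I love it", model))
```

For "The movie is nice", the word "nice" never occurs in the training data, so in this sketch the decision rests entirely on the remaining words. That is the gap the "?" on the slide points at.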


Sentiment Classif… Cont’d

• Lexicon-Based


Contextual analysis/Aggregation

The movie is nice : ?
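A lexicon-based classifier looks terms up in a sentiment lexicon and aggregates their scores. The sketch below assumes a hypothetical three-entry lexicon with made-up scores and a simple sum as the aggregation step; the slides do not state which aggregation or contextual analysis the authors use.

```python
# Hypothetical lexicon for illustration: term -> (positive score, negative score).
LEXICON = {
    "nice": (0.75, 0.25),
    "good": (0.80, 0.20),
    "horrible": (0.10, 0.90),
}


def classify_with_lexicon(text, lexicon):
    """Sum positive and negative scores of known terms and compare the totals."""
    pos_total = 0.0
    neg_total = 0.0
    for tok in text.lower().split():
        if tok in lexicon:
            pos, neg = lexicon[tok]
            pos_total += pos
            neg_total += neg
    if pos_total == neg_total:
        return "?"  # no known terms, or evidence exactly balanced
    return "+" if pos_total > neg_total else "-"


print(classify_with_lexicon("The movie is nice", LEXICON))
```

No training data is needed here, but the result depends entirely on which terms the lexicon covers and how they are scored.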


Lexicon Generation


Manual | Corpus | Dictionary

• Could be too narrow

• Could be too general

• Ugh!! this movie sucks!
• This movie is fantastic


Sentiment Lexicons

• Dictionary-based: SentiWordNet (Baccianella et al., 2010)

• Corpus-based
  – Generated from target domain
  – Existing approaches rely on well-formed spelling/grammar


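[Diagram: existing corpus-based approaches]

• Seed words (good, bad, terrible, nice) are linked to corpus words (horrible, happy, affordable, rubbish, enjoyable, mad, …) through the conjunctions "and" and "but" (Hatzivassiloglou and McKeown, 1997)

• Corpus words (horrible, happy, affordable, rubbish, enjoyable, mad, …) are scored by co-occurrence with "excellent" and "poor" (Turney, 2002)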


Corpus-based lexicon

• Distant supervision (Read, 2005; Go et al., 2009)
  – Automated approach for labelling
  – Based on the appearance of emoticons


I’m happy with chocolate on vday

I’m at work today
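Emoticon-based labelling can be sketched as follows. The emoticon sets, the function name `distant_label`, and the choice to discard tweets with no emoticon or with conflicting emoticons are assumptions for illustration; the emoticons attached to the example tweets in the slide did not survive extraction, so the pairings in the usage lines are also assumed.

```python
# Assumed emoticon sets; the slides do not list the exact emoticons used.
POSITIVE_EMOTICONS = {":)", ":-)", ":D"}
NEGATIVE_EMOTICONS = {":(", ":-("}
ALL_EMOTICONS = POSITIVE_EMOTICONS | NEGATIVE_EMOTICONS


def distant_label(tweet):
    """Return (text without emoticons, label), or None if no usable signal."""
    tokens = tweet.split()
    has_pos = any(t in POSITIVE_EMOTICONS for t in tokens)
    has_neg = any(t in NEGATIVE_EMOTICONS for t in tokens)
    if has_pos == has_neg:
        return None  # no emoticon at all, or both polarities present
    label = "+" if has_pos else "-"
    # Strip the emoticon so it cannot leak into the lexicon as a term.
    text = " ".join(t for t in tokens if t not in ALL_EMOTICONS)
    return text, label


print(distant_label("I'm happy with chocolate on vday :)"))
print(distant_label("I'm at work today :("))
```

The labels obtained this way are noisy, but they require no manual annotation.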


Scores Generation

• Proportion-based

– Scores are compatible with SentiWordNet


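$$ds(t)_c = \frac{\sum_{d \in c} tf(t)}{\sum_{AllDocs} tf(t)}$$

where ds(t)_c is the domain score of term t for class c, and tf(t) is the frequency of t in a document d.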

Term     + score   - score
ugh      0.077     0.923
sucks    0.132     0.868
luv      0.958     0.042
xoxo     0.792     0.208
…        …         …
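The proportion-based scoring can be sketched as below: a term's score for a class is its frequency in that class's documents divided by its frequency across all documents, so the positive and negative scores sum to 1, as in the table. The function name `domain_scores`, the whitespace tokenisation, and the toy documents are assumptions for illustration.

```python
from collections import Counter


def domain_scores(labelled_docs):
    """Map each term to (positive proportion, negative proportion).

    labelled_docs: iterable of (text, label) pairs with label "+" or "-".
    """
    tf = {"+": Counter(), "-": Counter()}
    for text, label in labelled_docs:
        tf[label].update(text.lower().split())
    scores = {}
    for term in set(tf["+"]) | set(tf["-"]):
        total = tf["+"][term] + tf["-"][term]
        scores[term] = (tf["+"][term] / total, tf["-"][term] / total)
    return scores


# Toy documents, invented for this sketch.
docs = [
    ("luv this movie", "+"),
    ("luv it", "+"),
    ("ugh this sucks", "-"),
    ("luv sucks", "-"),
]
scores = domain_scores(docs)
print(scores["luv"], scores["ugh"])
```

Because the two proportions sum to 1, the resulting scores live on the same 0 to 1 scale as SentiWordNet scores.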


Integration with SentiWordNet

• General Scores are extracted from SentiWordNet


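$$gs(t)_c = \frac{1}{|senses(t, PoS)|} \sum_{i=1}^{|senses(t, PoS)|} SenseScore(t_i)_c$$

where gs(t)_c is the general score of term t for class c, averaged over the senses of t for a given part of speech (PoS).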
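The general score is an average over a term's senses for a given part of speech. A minimal sketch, assuming the per-sense scores have already been looked up and are passed in as (positive, negative) pairs; the function name `general_score` and the example values are hypothetical, and no SentiWordNet API is called here.

```python
def general_score(sense_scores):
    """Average (positive, negative) scores over all senses of a term.

    sense_scores: list of (pos, neg) pairs, one per sense for a given PoS.
    Returns None when the term has no senses to average.
    """
    if not sense_scores:
        return None
    n = len(sense_scores)
    pos = sum(p for p, _ in sense_scores) / n
    neg = sum(q for _, q in sense_scores) / n
    return pos, neg


# Hypothetical scores for a term with two senses.
print(general_score([(0.75, 0.0), (0.25, 0.25)]))
```

Averaging treats every sense as equally likely, which is a simplification when one sense dominates in the target domain.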


Evaluation

• 20,000 distant-supervision tweets used to:
  – Generate the domain lexicon
  – Train machine learning classifiers (for comparison)

• 359 hand-labelled tweets used for evaluation


Evaluation Cont’d

• Individual lexicons vs Combined
  – General < Domain < Combined
  – Difference between Domain and Combined is not significant

• Machine learning vs Combined
  – SVM < NB < LogReg < Combined
  – Difference between LogReg and Combined is not significant


Evaluation Cont’d

• Varying data sizes
  – Performance improves with increasing size for all except SVM


Conclusions

• Sentiment lexicon is generated using distant-supervision

• Sentiment classification improves with combination of domain-dependent and domain-independent lexicons

• Accuracy of the combination is better than machine learning


Future work

• Lexicon refinement
• Improve aggregation strategy
• Extend approach to other social media platforms
• Extend distant supervision to neutral labelling
• Experiment with ‘big data’


Thank you for Listening!

