Images from Wikimedia Commons.
Topic Models for Sentiment: Basics, Implementation and Experiments
Aditya M Joshi
IITB-Monash Research Academy
30th January, 2014
Outline
• Motivation and Introduction (Blei (2011))
• Building blocks of LDA: Dirichlet and Multinomials (Kulis (2012))
• Estimation using LDA (Heinrich (2004))
• Evaluation of LDA (Wallach (2009))
• Plugging in sentiment (Jo & Oh (2011), Lin & He (2009))
• Experimentation
Revisiting classifiers
What did Prof. Pushpak Bhattacharyya talk about in “Topics in NLP” lecture today?
[Slide figure: a lecture transcript is fed to a Classifier that picks among labels (NLP, Databases, Compilers, SA, MT, Wordnet)]
Topic models can do much more than this, using only an unlabeled corpus.
“Topic-document distribution”
[Slide figure: lectures from 2008 to 2013 are fed to a Topic Modeler, which discovers topics labelled NLP, Academic, and Cultural; the word clouds include strong AI, parser, alignment, ACL, thwarting, co-reference resolution, demo, RPC, MTP, Krishna, Raag, Mahabharat, Swar-sandhya]
*Hypothetical example
NLP = 0.7 Academic = 0.2 Cultural = 0.1
Proportion of each topic in a document: “Multiple membership”
* And in context of sentiment analysis?
“Word-topic distribution”
“Aaditya, you are not making sense.”
“Let’s study word sense disambiguation”
[Slide figure: the same lectures from 2008 to 2013 fed to the Topic Modeler; the word "sense" now appears under several topics, alongside words such as logic, explanation, confused, wordnet, polysemy, iterative]
*Hypothetical example
"Relevance of each word to a topic": the words grouped under different "topics" here actually indicate the different senses in which the word occurs.
Definition
• Topic models are a suite of algorithms that discover thematic structures in a data collection. (Blei (2011) )
• What is a thematic structure? A topic: a collection of words
• Used for a wide variety of tasks, such as author recognition, aspect extraction, and sentiment modelling
Black box
Topic Modeler
Unlabeled corpus
Document-topic distribution
Overall word-topic distribution
FAQs:
Can you make predictions for a test document? Not directly.
Is there only one way to construct a topic model? No. By structuring the model intelligently, you can derive different kinds of useful information.
LDA Model
• Latent Dirichlet Allocation (LDA) model is a basic probabilistic topic model
• This presentation focuses on LDA and its adaptations with sentiment as the goal.
Plate Notation (2/2)
[Slide figures: three plate diagrams, each with topic variable z and word w inside a plate of Nd words nested in a plate of D documents; the third adds a plate of Ns sentences. They depict word-level topics, document-level topics, and sentence-level topics respectively.]
Growing LDA further
[Plate diagram: z and w inside plates Nd and D, now with the document-topic distribution θ and the word-topic distribution ϕ shown explicitly]
θ(z): NLP = 0.7, culture = 0.2, motivation = 0.1
ϕ(z, word): (NLP, sense) = 0.7, (culture, sense) = 0.1, (motivation, sense) = 0.2
Let us now focus on these two multinomial distributions.
Multinomial distribution
• Training an LDA model means learning the parameters of its multinomial distributions
• We now focus on a multinomial distribution and the way it is modelled in the case of LDA.
Parameter estimation (Heinrich (2004))
P(θ|x) = P(x|θ) P(θ) / P(x)
(posterior = likelihood × prior / marginal likelihood)
where P(x) = ∫ P(x|θ) P(θ) dθ is the marginal likelihood.
Why estimate the posterior P(θ|x)? Goal: to estimate θd and ϕwz as accurately as possible, given the data (documents). Both are categorical distributions.
Since P(x) does not depend on θ: P(θ|x) ∝ P(x|θ) P(θ)
Binomial distribution & MLE
• Toss of a biased coin: P(X=1) = q, P(X=0) = 1-q
• Data: X = {x1, x2, ..., xN}, with P(xi|q) = q^xi (1-q)^(1-xi)

MLE = argmax P(X|q)
    = argmax P(x1|q) · P(x2|q) ··· P(xN|q)
    = argmax q^x1 (1-q)^(1-x1) ··· q^xN (1-q)^(1-xN)
    = argmax q^(x1+...+xN) (1-q)^(N-(x1+...+xN))
    = argmax q^m (1-q)^(n-m), where m = x1+...+xN and n = N

Taking logs: argmax (m log q + (n-m) log(1-q)). Equating the derivative to zero gives m/q = (n-m)/(1-q), i.e. q = m/n.
m and n are the "sufficient statistics" of a binomial distribution.
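The closed form q = m/n can be checked numerically; the sketch below (a toy simulation, all numbers illustrative) estimates a biased coin's parameter from samples.

```python
import random

# MLE of a biased coin's parameter q is the fraction of heads, q_hat = m / n.
random.seed(0)
q_true = 0.7
n = 10000
tosses = [1 if random.random() < q_true else 0 for _ in range(n)]
m = sum(tosses)       # number of heads: the sufficient statistic
q_mle = m / n         # closed-form MLE derived above
print(q_mle)          # close to 0.7
```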
MAP of Binomial distribution (1/2)
MAP = argmax P(q|X)
    = argmax P(X|q) P(q)
    = argmax q^m (1-q)^(n-m) P(q)
Problem! Strictly speaking, P(q) can be any distribution: computationally difficult!
Assume P(q) is a beta distribution:
P(q) = q^(α-1) (1-q)^(β-1) / B(α, β)
MAP of Binomial distribution (2/2)
MAP = argmax q^m (1-q)^(n-m) P(q)
    ∝ argmax q^m (1-q)^(n-m) · q^(α-1) (1-q)^(β-1)
    = argmax q^(m+α-1) (1-q)^(n-m+β-1)

Taking logs: argmax ((m+α-1) log q + (n-m+β-1) log(1-q)). Equating the derivative to zero:
(m+α-1)/q = (n-m+β-1)/(1-q)
(m+α-1) - q(m+α-1) = q(n-m+β-1)
(m+α-1) = q(n+α+β-2)
q = (m+α-1) / (n+α+β-2)
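The closed form above is easy to encode; a minimal sketch (the counts are hypothetical):

```python
# MAP estimate of the coin parameter under a Beta(alpha, beta) prior,
# using the closed form q_MAP = (m + alpha - 1) / (n + alpha + beta - 2).
def map_estimate(m, n, alpha, beta):
    return (m + alpha - 1) / (n + alpha + beta - 2)

# With a uniform prior Beta(1, 1), the MAP reduces to the MLE m/n.
print(map_estimate(7, 10, 1, 1))   # 0.7
# A stronger prior Beta(5, 5) pulls the estimate toward 0.5:
print(map_estimate(7, 10, 5, 5))   # 11/18 ≈ 0.611
```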
Conjugate prior
• A distribution is a conjugate prior to a posterior distribution if both of them have the same form
• “Algebraic convenience”
Beta distribution is a conjugate prior of binomial distribution
What is it for categorical distribution?
Categorical distribution
• Roll of a die: P(X=1) = q1, P(X=2) = q2, ..., P(X=6) = q6
• P(xi = k | q) = qk
• X = {x1, ..., xN} ~ Cat(q)

MLE = argmax P(X|q) = argmax P(x1|q) · P(x2|q) ··· P(xN|q)
    = argmax Πj qj^cj        (cj: count of outcome j in X)

MAP = argmax P(X|q) P(q)
    = argmax Πj qj^cj P(q)

Assume the Dirichlet prior P(q) ∝ Πj qj^(αj-1). Then:
MAP ∝ argmax Πj qj^cj · qj^(αj-1)
    = argmax Πj qj^(αj+cj-1)
The Dirichlet distribution is the conjugate prior of the categorical distribution.
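This conjugacy is what makes the posterior easy to compute: observing counts simply adds pseudo-counts to the Dirichlet parameters. A minimal numeric sketch (all numbers illustrative):

```python
# Dirichlet-categorical conjugacy: observing counts c_j gives
# posterior Dir(alpha_j + c_j).
alpha = [1.0, 1.0, 1.0]                 # Dirichlet prior over 3 outcomes
counts = [5, 2, 3]                      # observed counts c_j
posterior = [a + c for a, c in zip(alpha, counts)]

# MAP estimate from the derivation above: q_j = (alpha_j + c_j - 1) / (n + sum(alpha) - K)
n, K = sum(counts), len(counts)
q_map = [(a + c - 1) / (n + sum(alpha) - K) for a, c in zip(alpha, counts)]
print(posterior)   # [6.0, 3.0, 4.0]
print(q_map)       # [0.5, 0.2, 0.3] -- with a uniform prior, just the frequencies
```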
Binomial & Categorical distribution
Binomial:    q ~ Beta(α, β);  x ~ Binomial(q)
Categorical: θ ~ Dir(α);      z ~ Categorical(θ), with P(z|θ) = θz

In both cases: hyper-parameters → distribution → random variable assignments.
Does the name Latent Dirichlet Allocation seem justifiable now?
Estimation of LDA model
P(θ, ϕ | w) = P(w | θ, ϕ) P(θ, ϕ) / P(w)

The denominator P(w) is computationally intractable; hence, Gibbs sampling is used. We now describe the generative story.

Every LDA paper has: plate notation, a generative story, and Gibbs sampling formulas.
Generative story

[Plate diagram: hyper-parameters α, β; α → θ → z → w ← ϕ ← β, with z and w inside a plate of Nd words nested in a plate of D documents]

Sample ϕ ~ Dir(β)
For each document:
  Generate θ ~ Dir(α)
  For each word:
    Sample z ~ Multinomial(θ)
    Sample w ~ ϕ(z)
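The generative story can be run directly; a toy sketch in Python (the sizes, seed, and the Gamma-based Dirichlet sampler are illustrative choices, not part of the slides):

```python
import random

# Toy run of the LDA generative story: K topics, vocabulary of V word ids,
# D documents of Nd words each.
random.seed(1)
K, V, D, Nd = 2, 5, 3, 6

def sample_dirichlet(concentration, dim):
    # Standard construction: normalize independent Gamma draws.
    xs = [random.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(xs)
    return [x / total for x in xs]

phi = [sample_dirichlet(0.1, V) for _ in range(K)]   # Sample phi ~ Dir(beta), per topic
corpus = []
for _ in range(D):
    theta = sample_dirichlet(0.5, K)                 # Generate theta ~ Dir(alpha)
    doc = []
    for _ in range(Nd):
        z = random.choices(range(K), weights=theta)[0]    # z ~ Multinomial(theta)
        w = random.choices(range(V), weights=phi[z])[0]   # w ~ phi(z)
        doc.append(w)
    corpus.append(doc)
print(corpus)
```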
Implementing topic models

Implementation follows the generative story: sample ϕ ~ Dir(β); for each document, generate θ ~ Dir(α); for each word, sample z ~ Multinomial(θ) and w ~ ϕ(z).
Sampling from multinomial
Input: θ with P(z=0) = 0.1, P(z=1) = 0.3, P(z=2) = 0.6
Goal: sample a z from this distribution.
Form the cumulative boundaries 0, 0.1, 0.4, 1; draw u ~ Uniform(0, 1) and pick z = 0 if u < 0.1, z = 1 if u < 0.4, and z = 2 otherwise.
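The cumulative boundaries above encode inverse-CDF sampling; a sketch with the slide's distribution:

```python
import random

# The boundaries 0 | 0.1 | 0.4 | 1 are the cumulative distribution of
# theta = (0.1, 0.3, 0.6): draw u ~ Uniform(0, 1) and locate its interval.
theta = [0.1, 0.3, 0.6]

def sample_z(theta):
    u, cum = random.random(), 0.0
    for z, p in enumerate(theta):
        cum += p
        if u < cum:
            return z
    return len(theta) - 1        # guard against floating-point round-off

random.seed(0)
n = 100000
freq = [0.0] * len(theta)
for _ in range(n):
    freq[sample_z(theta)] += 1.0 / n
print(freq)   # empirically close to [0.1, 0.3, 0.6]
```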
Gibbs sampling
Initialize all word positions to random z's; compute θ & ϕ accordingly.
For each iteration:
  For each document:
    For each word:
      Re-sample its z based on θ and ϕ(w|z)
  Re-compute θ & ϕ
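The loop above is only a sketch; a standard way to implement it is collapsed Gibbs sampling, which integrates out θ and ϕ and re-samples each word's topic from its conditional. A toy version (the corpus, sizes, and hyper-parameters are illustrative):

```python
import random

# Collapsed Gibbs sampling for LDA on a toy corpus of word ids (vocab V = 5).
random.seed(0)
docs = [[0, 1, 2, 0], [2, 3, 4], [0, 0, 1]]
K, V, alpha, beta = 2, 5, 0.5, 0.1

# Initialize all word positions to random z's and build the count tables.
z_assign = [[random.randrange(K) for _ in doc] for doc in docs]
n_dk = [[0] * K for _ in docs]          # topic counts per document
n_kw = [[0] * V for _ in range(K)]      # word counts per topic
n_k = [0] * K                           # total words per topic
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z_assign[d][i]
        n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1

for _ in range(200):                    # Gibbs iterations
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z_assign[d][i]          # remove the word's current assignment
            n_dk[d][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1
            # p(z = j | rest) ∝ (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
            weights = [(n_dk[d][j] + alpha) * (n_kw[j][w] + beta) / (n_k[j] + V * beta)
                       for j in range(K)]
            k = random.choices(range(K), weights=weights)[0]
            z_assign[d][i] = k
            n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1

# Read off theta and phi from the final counts.
theta = [[(n_dk[d][k] + alpha) / (len(docs[d]) + K * alpha) for k in range(K)]
         for d in range(len(docs))]
phi = [[(n_kw[k][w] + beta) / (n_k[k] + V * beta) for w in range(V)] for k in range(K)]
print(theta[0], phi[0])
```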
Evaluation
• Qualitative evaluation (Understanding topic cohesion) (Mukherjee et al (2012))
• Classification accuracy based on topics uncovered
• Held-out likelihood (Likelihood of data given parameters) (Wallach et al (2009))
A naïve addition:
• Measuring sentiment cohesion: Count of positive and negative words in each topic
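The naive sentiment-cohesion measure can be sketched directly (the lexicon and topic words below are hypothetical):

```python
# Count positive vs. negative lexicon words among a topic's top words.
positive = {"amazing", "hilarious", "great"}
negative = {"boring", "lame", "predictable"}

def sentiment_cohesion(top_words):
    pos = sum(w in positive for w in top_words)
    neg = sum(w in negative for w in top_words)
    return pos, neg

print(sentiment_cohesion(["boring", "lame", "watching", "predictable", "plot"]))
# (0, 3): a sentiment-coherent (negative) topic
```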
Experiments with LDA
• Goal: understand topic models & obtain sentiment-coherent topics from an LDA model
• Implementation:
  – Topic model implementation using Gibbs sampling
– Hyper-parameter estimation as given in Heinrich (2009)
– “Left to right” likelihood algorithm by Wallach (2009)
Data set
• Movie review data set from Amazon by McAuley & Leskovec (2013).
– Training data set: 11000 movie reviews
– Test data set: 2000 movie reviews
• Average length of a review: ~140 words
Discovering sentiment-coherent topics
Modify basic LDA in one of the following ways:
1) Bootstrapping sentiment priors with word lists
2) Modifying the structure of the topic model
Existing topic models
• Lin & He(2009) present a Joint Sentiment-Topic Model with sentiment as a latent variable.
• Jo & Ho(2011) extract senti-aspects: (sentiment, feature) pairs.
• Titov & McDonald (2008) use a sliding window model to incorporate discourse nature of reviews.
• Mukherjee & Liu (2012b) identify words belonging to six types of review comment expressions from an unlabeled corpus.
Discovering sentiment-coherent topics
Modify basic LDA in one of the following ways:
1) Bootstrapping sentiment priors with word lists
2) Modifying the structure of the topic model
Discovering sentiment: Use of priors
• Induce positive and negative words to belong to certain topics (based on Lin & He (2009)):
  – For negative words, set β(word, z) = 2β for topics z = 0, ..., Z/2 and β(word, z) = 0 for topics z = Z/2, ..., Z.
  – Set the corresponding (mirrored) β values for positive words.
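This prior scheme can be sketched as an asymmetric β matrix (the word list, ids, and sizes below are hypothetical):

```python
# Asymmetric beta prior: a negative word gets doubled mass on the first Z/2
# topics and zero on the rest, so only "negative" topics can generate it;
# positive words get the mirror image.
Z, V, beta = 10, 5000, 0.01
word_id = {"terrible": 17, "amazing": 42}        # hypothetical vocabulary ids
negative_words, positive_words = ["terrible"], ["amazing"]

beta_matrix = [[beta] * V for _ in range(Z)]     # default symmetric prior
for word in negative_words:
    for z in range(Z):
        beta_matrix[z][word_id[word]] = 2 * beta if z < Z // 2 else 0.0
for word in positive_words:
    for z in range(Z):
        beta_matrix[z][word_id[word]] = 0.0 if z < Z // 2 else 2 * beta
print(beta_matrix[0][17], beta_matrix[9][17])
```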
Use of Priors: Results (1/2)
• "Basic": imposing priors on only 12 sentiment words
• Leads to more sentiment words being identified in the correct topics
Use of Priors: Results(2/2)
Qualitative evaluation:
Topic 38: 7.330 horror, 2.392 killer, 2.248 scary, 2.147 house, 2.072 gore
Topic 13: 6.931 michael, 3.929 fans, 3.423 live, 2.379 amazing, 2.354 concert
Some topics are positive while others are negative, depending on the priors.
Discovering sentiment-coherent topics
Modify basic LDA in one of the following ways:
1) Bootstrapping sentiment priors with word lists
2) Modifying the structure of the topic model
Discovering sentiment: Modifying structure
• Sentiment is explicitly modelled as a latent variable (based on the joint sentiment/topic model by Lin & He (2009))
• Two variants: SLDA and SLDA-Split
Sentiment as a Variable: Results (2/2)
• SLDA:
  Topic 13, s = 0: 9.551 show, 9.254 humor, 7.166 comedy, 4.846 watch, 4.680 hilarious
  Topic 13, s = 2: 6.964 rock, 5.547 children, 5.38 school, 4.636 remember, 3.432 learn
  No equivalence between topic 13 for s = 0 and s = 2.
• SLDA-Split (S = 3):
  Topic 31, s = 0: 8.006 product, 6.277 received, 5.244 amazon, 4.119 condition, 4.043 seller
  Topic 31, s = 1: 5.206 return, 4.661 problem, 4.412 disappoint, 3.654 case, 3.616 copy
  Topic 31, s = 2: 10.358 amazon, 9.213 play, 7.068 player, 3.651 dvds, 3.594 purchased
  A topic essentially captures "different polarities" in the same "context"
Failed Experiment: Incorporating emotions
Both models repeated for eight basic emotions (joy, anger, trust, surprise, sadness, fear, disgust, anticipation)
1) LDA with priors on eight basic emotions
2) Sentiment LDA models with S = 8
Failed Experiment: Results
• LDA
• SLDA-Split
Fear: boring, lame, watching, predictable
Joy: happy, purchase, find, store
Anger: kids, children, adults, parents
Trust: final, end, planet, heroes, stay
Disgust: action, horror, scenes
Anticipation: people, life, man, samurai
Failed Experiment: Analysis
• A better way to incorporate emotion priors is needed
• The current data set consists of reviews; it is likely that users do not express diverse emotions in them.
Future directions (1/2)
Immediate tasks:
Sentiment topic models to discover interests of users
e.g., show that a topic model built on the tweets of a political commentator reflects his/her political ideology
Future directions (2/2)
Other identified tasks:
1. Models that discover dimensions of emotion in a corpus
2. Cross-lingual topic models
3. Topic models that include eye-tracking data
4. Using sentiment annotation complexity in case of an ensemble of classifiers
The Big Question
Topic models for the 5-tuple representation
"I love Tim Tams": Positive. Formally, a 5-tuple: (holder, target, feature, sentiment, time)
"I like the taste of Tim Tams but I wish the smell was different."
→ (I, Tim Tam, taste, positive, present), (I, Tim Tam, smell, negative, present)
"I liked Tim Tams a couple of years ago."
→ (I, Tim Tam, -, positive, "couple of years ago")
"I think you will like Tim Tams."
→ (I > you, Tim Tam, -, positive, future)
Can five latent variables be structured intelligently to extract 5-tuples of this form?
Conclusion
• LDA can be used to uncover thematic structure in the form of topics.
• LDA models could be adapted effectively to extract different information.
References (1/2)
• Balamurali, A., Joshi, A., & Bhattacharyya, P. (2011). Harnessing wordnet senses for supervised sentiment classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, (pp. 1081–1091). Association for Computational Linguistics.
• Balamurali, A., Joshi, A., & Bhattacharyya, P. (2012). Cross-lingual sentiment analysis for indian languages using linked wordnets. In COLING (Posters), (pp. 73–82).
• Balamurali, A., Khapra, M. M., & Bhattacharyya, P. (2013). Lost in translation: viability of machine translation for cross language sentiment analysis. In Computational Linguistics and Intelligent Text Processing (pp. 38–49). Springer.
• Banea, C., Mihalcea, R., Wiebe, J., & Hassan, S. (2008). Multilingual subjectivity analysis using machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, (pp. 127–135). Association for Computational Linguistics.
• Blei, D. M. (2011). Introduction to probabilistic topic models.
• Blei, D. M., Ng, A. Y., Jordan, M. I., & Lafferty, J. (2003). Latent dirichlet allocation. Journal of Machine Learning Research, 3, 2003.
• Boyd-Graber, J., Chang, J., Gerrish, S., Wang, C., & Blei, D. (2009). Reading tea leaves: How humans interpret topic models. In Neural Information Processing Systems (NIPS).
• Brody, S. & Elhadad, N. (2010). An unsupervised aspect-sentiment model for online reviews. In HLT-NAACL, (pp. 804–812). The Association for Computational Linguistics.
• Carl, M. (2012). Translog-ii: a program for recording user activity data for empirical reading and writing research. In LREC, (pp. 4108–4112).
• Dragsted, B. (2010). Coordination of reading and writing processes in translation. Translation and Cognition, American Translators Association Scholarly Monograph Series. Amsterdam/Philadelphia: Benjamins, 41–62.
• Duh, K., Fujino, A., & Nagata, M. (2011). Is machine translation ripe for cross-lingual sentiment classification? In ACL (Short Papers), (pp. 429–433).
• Fellbaum, C. (2010). Wordnet: An electronic lexical database. 1998. WordNet is available from http://www.cogsci.princeton.edu/wn.
• Jo, Y. & Oh, A. (2011). Aspect and sentiment unification model for online review analysis. In Proceedings of the fourth ACM international conference on Web search and data mining, (pp. 815–824). ACM.
References (2/2)
• Joshi, S., Kanojia, D., & Bhattacharyya, P. (2013). More than meets the eye: Study of human cognition in sense annotation. In Proceedings of NAACL-HLT, (pp. 733–738).
• Kulis, B. (2012). Conjugate priors.
• Lin, C. & He, Y. (2009). Joint sentiment/topic model for sentiment analysis. In Cheung, D. W.-L., Song, I.-Y., Chu, W. W., Hu, X., & Lin, J. J. (Eds.), CIKM, (pp. 375–384). ACM.
• Lu, B., Tan, C., Cardie, C., & Tsou, B. K. Joint bilingual sentiment classification with unlabeled parallel corpora.
• McAuley, J. J. & Leskovec, J. (2013). From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. In Proceedings of the 22nd international conference on World Wide Web, (pp. 897–908). International World Wide Web Conferences Steering Committee.
• McCallum, A. (2002). MALLET: A machine learning for language toolkit.
• Meng, X., Wei, F., Liu, X., Zhou, M., Xu, G., & Wang, H. (2012). Cross-lingual mixture model for sentiment classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, (pp. 572–581). Association for Computational Linguistics.
• Mukherjee, A. & Liu, B. (2012a). Aspect extraction through semi-supervised modeling. In ACL (1), (pp. 339–348). The Association for Computer Linguistics.
• Mukherjee, A. & Liu, B. (2012b). Modeling review comments. In ACL (1), (pp. 320–329). The Association for Computer Linguistics.
• Mukherjee, A. & Liu, B. (2013). Discovering user interactions in ideological discussions. In ACL (1), (pp. 671–681). The Association for Computer Linguistics.
• Mukherjee, S. & Bhattacharyya, P. (2012). Wikisent: Weakly supervised sentiment analysis through extractive summarization with wikipedia. In Machine Learning and Knowledge Discovery in Databases (pp. 774–793). Springer.
• Nallapati, R., Ahmed, A., Xing, E. P., & Cohen, W. W. (2008). Joint latent topic models for text and citations. In Li, Y., 0001, B. L., & Sarawagi, S. (Eds.), KDD, (pp. 542–550). ACM.
• Pang, B. & Lee, L. (2004). A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd annual meeting on Association for Computational Linguistics, (pp. 271). Association for Computational Linguistics.
• Pang, B. & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and trends in information retrieval, 2 (1-2), 1–135.
• Prettenhofer, P. & Stein, B. (2010). Cross-language text classification using structural correspondence learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, (pp. 1118–1127). Association for Computational Linguistics.
• Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The author-topic model for authors and documents. In 20th Conference on Uncertainty in Artificial Intelligence, volume 21, Banff Park Lodge, Banff, Canada.
• Scott, G. G., O'Donnell, P. J., & Sereno, S. C. (2012). Emotion words affect eye fixations during reading. Journal of Experimental Psychology: Learning, Memory, and Cognition, 38 (3), 783.
• Searle, J. R. (1992). The rediscovery of the mind. The MIT Press.
• Titov, I. & McDonald, R. T. (2008a). A joint model of text and aspect ratings for sentiment summarization. In McKeown, K., Moore, J. D., Teufel, S., Allan, J., & Furui, S. (Eds.), ACL, (pp. 308–316). The Association for Computer Linguistics.
• Titov, I. & McDonald, R. T. (2008b). Modeling online reviews with multi-grain topic models. CoRR, abs/0801.1063.
• Wallach, H. M., Mimno, D. M., & McCallum, A. (2009). Rethinking lda: Why priors matter. In NIPS, volume 22, (pp. 1973–1981).
• Wallach, H. M., Murray, I., Salakhutdinov, R., & Mimno, D. (2009). Evaluation methods for topic models. In Proceedings of the 26th Annual International Conference on Machine Learning, (pp. 1105–1112). ACM.
• Wang, X., McCallum, A., & Wei, X. (2007). Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In Proceedings of the 7th IEEE International Conference on Data Mining (ICDM), Nebraska, USA.
• Yin, Y., Zhou, C., & Zhu, J. (2010). A pipe route design methodology by imitating human imaginal thinking. CIRP Annals-Manufacturing Technology, 59 (1), 167–170.
(pp. 1105{1112). ACM. • Wang, X., McCallum, A., & Wei, X. (2007). Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In Proceedings of the 7th IEEE • International Conference on Data Mining (ICDM), Nebraska, USA. • Yin, Y., Zhou, C., & Zhu, J. (2010). A pipe route design methodology by imitating human imaginal thinking. CIRP Annals-Manufacturing Technology, 59 (1), 167{170.