Generative Topic Models for Community Analysis
Pilfered from: Ramesh Nallapati, http://www.cs.cmu.edu/~wcohen/10-802/lda-sep-18.ppt
2 / 57
Objectives
• Cultural literacy for ML:
  – Q: What are “topic models”?
  – A1: A popular indoor sport for machine learning researchers
  – A2: A particular way of applying unsupervised learning of Bayes nets to text
• Quick historical survey of some sample papers in the area
3 / 57
Outline
• Part I: Introduction to Topic Models
  – Naive Bayes model
  – Mixture Models
    • Expectation Maximization
  – PLSA
  – LDA
    • Variational EM
    • Gibbs Sampling
• Part II: Topic Models for Community Analysis
  – Citation modeling with PLSA
  – Citation modeling with LDA
  – Author Topic Model
  – Author Topic Recipient Model
  – Modeling influence of Citations
  – Mixed membership Stochastic Block Model
4 / 57
Introduction to Topic Models
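• Multinomial Naïve Bayes
[Plate diagram: class node C with word nodes W1, W2, W3, …, WN as children, repeated for M documents]
• For each document d = 1, …, M
  • Generate $C_d \sim \mathrm{Mult}(\cdot \mid \pi)$
  • For each position n = 1, …, $N_d$
    • Generate $w_n \sim \mathrm{Mult}(\cdot \mid \beta, C_d)$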
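The generative steps above can be sketched in a few lines of NumPy. This is only an illustration: the function name, the array layout, and the symbols `pi` (class prior) and `beta` (one word distribution per class) are assumptions, not something taken from the slides.

```python
import numpy as np

def generate_naive_bayes_corpus(pi, beta, doc_lengths, seed=0):
    """Sample (class, words) pairs from a multinomial naive Bayes model.

    pi          : shape (K,), class prior (assumed name).
    beta        : shape (K, V), row k is the word distribution of class k.
    doc_lengths : one N_d per document.
    """
    rng = np.random.default_rng(seed)
    pi = np.asarray(pi, dtype=float)
    beta = np.asarray(beta, dtype=float)
    n_classes, vocab_size = beta.shape
    corpus = []
    for n_d in doc_lengths:
        # One class per document: C_d ~ Mult(. | pi)
        c_d = int(rng.choice(n_classes, p=pi))
        # Every word of the document is drawn from that class's distribution
        words = rng.choice(vocab_size, size=n_d, p=beta[c_d])
        corpus.append((c_d, words))
    return corpus

corpus = generate_naive_bayes_corpus(
    pi=[0.5, 0.5],
    beta=[[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]],
    doc_lengths=[4, 6],
)
```

Note that a single class is drawn per document, which is exactly the restriction the later models relax.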
5 / 57
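Introduction to Topic Models
• Naïve Bayes Model: Compact representation
[Plate diagrams: the unrolled model (C with children W1, W2, W3, …, WN, repeated for M documents) next to its compact form (C → W, with W in a plate of size N, all inside a plate of size M)]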
6 / 57
Introduction to Topic Models
• Mixture model: unsupervised naïve Bayes model
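[Plate diagram: C → W, with W in a plate of size N, all inside a plate of size M; the unobserved class is labelled Z]
• Joint probability of words and classes:
• But classes are not visible: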
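One standard way to write these two quantities is given here as a sketch. The symbols are assumptions: $\pi$ is the class prior, $\beta$ the per-class word distributions, $K$ the number of classes, and $Z_d$ the latent class of document $d$.

```latex
% Joint probability when the classes C_d are observed
\[
P(\mathbf{w}, \mathbf{C} \mid \pi, \beta)
  = \prod_{d=1}^{M} P(C_d \mid \pi) \prod_{n=1}^{N_d} P(w_{d,n} \mid \beta, C_d)
\]

% Classes hidden: sum the latent class Z_d out of each document
\[
P(\mathbf{w} \mid \pi, \beta)
  = \prod_{d=1}^{M} \sum_{k=1}^{K} P(Z_d = k \mid \pi)
    \prod_{n=1}^{N_d} P(w_{d,n} \mid \beta, Z_d = k)
\]
```

The sum inside the product is what prevents a closed-form maximum-likelihood solution and motivates EM.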
8 / 57
Introduction to Topic Models
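• Probabilistic Latent Semantic Analysis Model
[Plate diagram: d → z → w, with z and w in a plate of size N, all inside a plate of size M; $\theta_d$ is the topic distribution of document d]
• Select document $d \sim \mathrm{Mult}(\pi)$
• For each position n = 1, …, $N_d$
  • Generate $z_n \sim \mathrm{Mult}(\cdot \mid \theta_d)$
  • Generate $w_n \sim \mathrm{Mult}(\cdot \mid \beta_{z_n})$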
9 / 57
Introduction to Topic Models
• Probabilistic Latent Semantic Analysis Model
– Learning using EM
– Not a complete generative model
• Has a distribution over the training set of documents: no new document can be generated!
– Nevertheless, more realistic than mixture model
• Documents can discuss multiple topics!
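A minimal sketch of EM for PLSA on a document-word count matrix follows. It assumes the usual parameterisation with $P(z \mid d)$ and $P(w \mid z)$; the function names and the dense (M, V, K) responsibility array are illustrative choices that only suit small corpora.

```python
import numpy as np

def plsa_em(counts, n_topics, n_iters=50, seed=0):
    """Fit PLSA by EM. counts: (M, V) matrix of word counts per document.

    Returns P(z | d) with shape (M, K) and P(w | z) with shape (K, V).
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    n_docs, vocab_size = counts.shape
    p_z_d = rng.dirichlet(np.ones(n_topics), size=n_docs)      # (M, K)
    p_w_z = rng.dirichlet(np.ones(vocab_size), size=n_topics)  # (K, V)
    for _ in range(n_iters):
        # E-step: responsibilities P(z | d, w), shape (M, V, K)
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        resp = joint / np.maximum(joint.sum(axis=2, keepdims=True), 1e-12)
        # M-step: re-estimate both tables from expected counts
        weighted = counts[:, :, None] * resp
        p_w_z = weighted.sum(axis=0).T
        p_w_z /= np.maximum(p_w_z.sum(axis=1, keepdims=True), 1e-12)
        p_z_d = weighted.sum(axis=1)
        p_z_d /= np.maximum(p_z_d.sum(axis=1, keepdims=True), 1e-12)
    return p_z_d, p_w_z

def plsa_log_likelihood(counts, p_z_d, p_w_z):
    """Log-likelihood sum_{d,w} n(d,w) log sum_z P(z|d) P(w|z)."""
    p_w_d = p_z_d @ p_w_z
    return float((np.asarray(counts) * np.log(np.maximum(p_w_d, 1e-12))).sum())
```

Because $P(z \mid d)$ is a table indexed by training documents, the fitted model has no way to assign a topic mixture to an unseen document, which is the limitation noted above.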
10 / 57
Introduction to Topic Models
• PLSA topics (TDT-1 corpus)
12 / 57
Introduction to Topic Models
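• Latent Dirichlet Allocation
[Plate diagram: α → θ → z → w ← β, with z and w in a plate of size N, all inside a plate of size M]
• For each document d = 1, …, M
  • Generate $\theta_d \sim \mathrm{Dir}(\cdot \mid \alpha)$
  • For each position n = 1, …, $N_d$
    • Generate $z_n \sim \mathrm{Mult}(\cdot \mid \theta_d)$
    • Generate $w_n \sim \mathrm{Mult}(\cdot \mid \beta_{z_n})$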
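The LDA generative process above can be sketched as follows. The function name and the argument layout (`alpha` as a length-K Dirichlet parameter, `beta` as a K × V matrix of topic-word distributions) are assumptions made for the example.

```python
import numpy as np

def generate_lda_corpus(alpha, beta, doc_lengths, seed=0):
    """Sample documents from LDA.

    alpha       : shape (K,), Dirichlet parameter for the topic mixtures.
    beta        : shape (K, V), row k is the word distribution of topic k.
    doc_lengths : one N_d per document.
    Returns a list of (theta_d, z, w) triples.
    """
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta, dtype=float)
    n_topics, vocab_size = beta.shape
    docs = []
    for n_d in doc_lengths:
        theta_d = rng.dirichlet(alpha)                  # theta_d ~ Dir(. | alpha)
        z = rng.choice(n_topics, size=n_d, p=theta_d)   # z_n ~ Mult(. | theta_d)
        # w_n ~ Mult(. | beta_{z_n}): one draw per position, from its own topic
        w = np.array([rng.choice(vocab_size, p=beta[k]) for k in z], dtype=int)
        docs.append((theta_d, z, w))
    return docs

docs = generate_lda_corpus(
    alpha=[0.5, 0.5],
    beta=[[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]],
    doc_lengths=[8, 8, 8],
)
```

Unlike PLSA, the topic mixture of each document is itself drawn from a prior, so the same code can generate documents that were never in a training set.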
13 / 57
Introduction to Topic Models
• Latent Dirichlet Allocation
– Overcomes the issues with PLSA
  • Can generate any random document
– Parameter learning:
  • Variational EM
    – Numerical approximation using lower bounds
    – Results in biased solutions
    – Convergence has numerical guarantees
  • Gibbs Sampling
    – Stochastic simulation
    – Unbiased solutions
    – Stochastic convergence
14 / 57
Introduction to Topic Models
• Variational EM for LDA
– Approximate the posterior by a simpler distribution
  • A convex function in each parameter!
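As a sketch, the lower bound used in the standard mean-field treatment of LDA can be written as below. The variational parameters $\gamma$ (Dirichlet) and $\phi_n$ (multinomial) are the usual names, assumed here because the slide's own equation did not survive in the transcript.

```latex
% Factorised approximation to the posterior over theta and z for one document
\[
q(\theta, \mathbf{z} \mid \gamma, \phi)
  = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)
\]

% Lower bound on the log-likelihood (Jensen's inequality)
\[
\log p(\mathbf{w} \mid \alpha, \beta)
  \;\ge\;
  \mathbb{E}_q\!\left[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)\right]
  - \mathbb{E}_q\!\left[\log q(\theta, \mathbf{z})\right]
\]
```

The gap between the two sides is $\mathrm{KL}\big(q(\theta, \mathbf{z}) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)\big)$, so tightening the bound in $\gamma$ and $\phi$ pulls $q$ toward the true posterior.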
15 / 57
Introduction to Topic Models
• Gibbs sampling
– Applicable when joint distribution is hard to evaluate but conditional distribution is known
– Sequence of samples comprises a Markov Chain
– Stationary distribution of the chain is the joint distribution
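To make this concrete for LDA, here is a minimal sketch of a collapsed Gibbs sampler, in which each topic assignment is resampled from its conditional given all the others. The slides do not say which sampler is meant, so the collapsed form, the symmetric priors `alpha` and `eta`, and all names are assumptions.

```python
import numpy as np

def lda_collapsed_gibbs(docs, n_topics, vocab_size,
                        alpha=0.1, eta=0.01, n_sweeps=200, seed=0):
    """docs: list of documents, each a list of word ids in [0, vocab_size)."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), n_topics))   # topic counts per document
    n_kw = np.zeros((n_topics, vocab_size))  # word counts per topic
    n_k = np.zeros(n_topics)                 # total tokens per topic
    assignments = []
    for d, doc in enumerate(docs):           # random initial assignments
        z_d = rng.integers(n_topics, size=len(doc))
        assignments.append(z_d)
        for w, k in zip(doc, z_d):
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(n_sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the current token from the counts
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # Conditional P(z_i = k | all other assignments, words)
                p = (n_dk[d] + alpha) * (n_kw[:, w] + eta) / (n_k + vocab_size * eta)
                k = rng.choice(n_topics, p=p / p.sum())
                assignments[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    # Point estimates from the final state of the chain
    theta = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
    phi = (n_kw + eta) / (n_kw + eta).sum(axis=1, keepdims=True)
    return theta, phi, assignments
```

In practice one would discard early sweeps as burn-in and average over several samples rather than read estimates off the last state alone.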
16 / 57
Introduction to Topic Models
• LDA topics
17 / 57
Introduction to Topic Models
• LDA’s view of a document
18 / 57
Introduction to Topic Models
• Perplexity comparison of various models
[Plot: perplexity of the Unigram, Mixture model, PLSA, and LDA models; lower is better]
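For reference, perplexity on held-out documents is commonly defined as the exponential of the negative log-likelihood per token. A small sketch follows; the function name and its inputs are assumptions, and the per-document log-likelihoods are taken to be natural logs supplied by whatever model is being evaluated.

```python
import math

def perplexity(doc_log_likelihoods, doc_lengths):
    """exp(-sum_d log p(w_d) / sum_d N_d) over a held-out set."""
    total_tokens = sum(doc_lengths)
    if total_tokens == 0:
        raise ValueError("perplexity is undefined for an empty test set")
    return math.exp(-sum(doc_log_likelihoods) / total_tokens)

# A model that spreads probability uniformly over a 4-word vocabulary
# assigns log(1/4) to every token, so its perplexity is the vocabulary size.
uniform = perplexity([5 * math.log(0.25), 3 * math.log(0.25)], [5, 3])
```

The uniform case gives an intuition for the scale: a perplexity of V means the model is no better than guessing among V words.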
20 / 57
Hyperlink modeling using PLSA
21 / 57
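Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]
[Plate diagram: d → z → w in a plate of size N, and d → z → c in a plate of size L, all inside a plate of size M; $\theta_d$ is the topic distribution of document d]
• Select document $d \sim \mathrm{Mult}(\pi)$
• For each position n = 1, …, $N_d$
  • Generate $z_n \sim \mathrm{Mult}(\cdot \mid \theta_d)$
  • Generate $w_n \sim \mathrm{Mult}(\cdot \mid \beta_{z_n})$
• For each citation j = 1, …, $L_d$
  • Generate $z_j \sim \mathrm{Mult}(\cdot \mid \theta_d)$
  • Generate $c_j \sim \mathrm{Mult}(\cdot \mid \gamma_{z_j})$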
22 / 57
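Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]
[Plate diagram: d → z → w in a plate of size N, and d → z → c in a plate of size L, all inside a plate of size M]
• PLSA likelihood:
• New likelihood:
• Learning using EM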
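Following the generative process for words and citations, the two likelihoods for a single document $d$ can be sketched as below. The notation is assumed: $K$ topics, $N_d$ words, $L_d$ citations.

```latex
% PLSA: words only
\[
P(\mathbf{w}_d \mid d)
  = \prod_{n=1}^{N_d} \sum_{k=1}^{K} P(z = k \mid d)\, P(w_n \mid z = k)
\]

% With hyperlinks: citations share the document's topic mixture P(z | d)
\[
P(\mathbf{w}_d, \mathbf{c}_d \mid d)
  = \prod_{n=1}^{N_d} \sum_{k=1}^{K} P(z = k \mid d)\, P(w_n \mid z = k)
    \;\times\;
    \prod_{j=1}^{L_d} \sum_{k=1}^{K} P(z = k \mid d)\, P(c_j \mid z = k)
\]
```

Sharing $P(z \mid d)$ between the two products is what couples content and links: a topic must explain both the words a document uses and the documents it cites.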
23 / 57
Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]
• Heuristic:
  • $0 \le \alpha \le 1$ determines the relative importance of content (weight $\alpha$) and hyperlinks (weight $1-\alpha$)
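A sketch of such a weighted objective is given below: the content log-likelihood is scaled by α and the hyperlink log-likelihood by 1 − α. Normalising the counts within each document, and every name in the code, are assumptions made for illustration rather than details confirmed by the slides.

```python
import numpy as np

def weighted_log_likelihood(word_counts, cite_counts, p_z_d, p_w_z, p_c_z, alpha):
    """alpha * content term + (1 - alpha) * hyperlink term.

    word_counts : (M, V) word counts per document.
    cite_counts : (M, C) citation counts per document.
    p_z_d       : (M, K) topic mixture per document.
    p_w_z       : (K, V) word distribution per topic.
    p_c_z       : (K, C) citation distribution per topic.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    p_z_d = np.asarray(p_z_d, dtype=float)

    def term(counts, p_x_z):
        counts = np.asarray(counts, dtype=float)
        totals = counts.sum(axis=1, keepdims=True)
        totals = np.where(totals > 0, totals, 1.0)  # documents with no items add 0
        log_p = np.log(np.maximum(p_z_d @ np.asarray(p_x_z, dtype=float), 1e-12))
        return float(((counts / totals) * log_p).sum())

    return alpha * term(word_counts, p_w_z) + (1.0 - alpha) * term(cite_counts, p_c_z)
```

Setting α = 1 recovers a content-only PLSA objective and α = 0 a links-only one, so sweeping α between the two is a simple way to study the trade-off.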
24 / 57
Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]
• Classification performance
[Two plots of classification performance; axis labels in each: Hyperlink, content]
25 / 57
Hyperlink modeling using LDA
26 / 57
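Hyperlink modeling using LDA [Erosheva, Fienberg, Lafferty, PNAS, 2004]
[Plate diagram: θ → z → w in a plate of size N, and θ → z → c in a plate of size L, all inside a plate of size M]
• For each document d = 1, …, M
  • Generate $\theta_d \sim \mathrm{Dir}(\cdot \mid \alpha)$
  • For each position n = 1, …, $N_d$
    • Generate $z_n \sim \mathrm{Mult}(\cdot \mid \theta_d)$
    • Generate $w_n \sim \mathrm{Mult}(\cdot \mid \beta_{z_n})$
  • For each citation j = 1, …, $L_d$
    • Generate $z_j \sim \mathrm{Mult}(\cdot \mid \theta_d)$
    • Generate $c_j \sim \mathrm{Mult}(\cdot \mid \gamma_{z_j})$
• Learning using variational EM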
28 / 57
Author-Topic Model for Scientific Literature
29 / 57
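Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]
[Plate diagram: nodes a, x, z, w; z, x and w in a plate of size N inside a plate of size M; author parameters in a plate of size A; topic parameters in a plate of size K]
• For each author a = 1, …, A
  • Generate $\theta_a \sim \mathrm{Dir}(\cdot \mid \alpha)$
• For each topic k = 1, …, K
  • Generate $\phi_k \sim \mathrm{Dir}(\cdot \mid \beta)$
• For each document d = 1, …, M
  • For each position n = 1, …, $N_d$
    • Generate author $x \sim \mathrm{Unif}(\cdot \mid a_d)$
    • Generate $z_n \sim \mathrm{Mult}(\cdot \mid \theta_x)$
    • Generate $w_n \sim \mathrm{Mult}(\cdot \mid \phi_{z_n})$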
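The author-topic generative process can be sketched as follows: each word first picks one of the document's authors uniformly, then a topic from that author's mixture, then a word from that topic. The function name, the symmetric prior values, and the encoding of authors as integer ids are assumptions for the example.

```python
import numpy as np

def generate_author_topic_corpus(doc_authors, doc_lengths, n_topics, vocab_size,
                                 alpha=0.5, eta=0.5, seed=0):
    """doc_authors: list of author-id lists, one per document (ids start at 0)."""
    rng = np.random.default_rng(seed)
    n_authors = 1 + max(a for authors in doc_authors for a in authors)
    # One topic mixture per author, one word distribution per topic
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_authors)  # (A, K)
    phi = rng.dirichlet(np.full(vocab_size, eta), size=n_topics)     # (K, V)
    docs = []
    for authors, n_d in zip(doc_authors, doc_lengths):
        # x ~ Unif(a_d): an author is chosen independently for every position
        x = rng.choice(authors, size=n_d)
        z = np.array([rng.choice(n_topics, p=theta[a]) for a in x], dtype=int)
        w = np.array([rng.choice(vocab_size, p=phi[k]) for k in z], dtype=int)
        docs.append((x, z, w))
    return theta, phi, docs

theta, phi, docs = generate_author_topic_corpus(
    doc_authors=[[0, 1], [2], [1, 2]],
    doc_lengths=[6, 4, 5],
    n_topics=3,
    vocab_size=10,
)
```

The topic mixture here belongs to authors rather than documents, so a document's topics are determined entirely by who wrote it.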
30 / 57
Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]
• Learning: Gibbs sampling
[Plate diagram: nodes a, x, z, w; plates of size N, M, A and K]
31 / 57
Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]
• Topic-Author visualization
32 / 57
Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]
33 / 57
Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]
Gibbs sampling
34 / 57
Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]
• Datasets
– Enron email data
  • 23,488 messages between 147 users
– McCallum’s personal email
  • 23,488(?) messages with 128 authors
35 / 57
Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]
• Topic Visualization: Enron set
36 / 57
Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]
• Topic Visualization: McCallum’s data
37 / 57
Modeling Citation Influences
38 / 57
Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]
• Citation influence model
39 / 57
Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]
• Citation influence graph for LDA paper
40 / 57
Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]
• Words in LDA paper assigned to citations
41 / 57
Link-PLSA-LDA: Topic Influence in Blogs (ICWSM 2008)
Ramesh Nallapati, Amr Ahmed, Eric Xing
42 / 57