Generative Topic Models for Community Analysis


Page 1: Generative Topic Models for Community Analysis

Generative Topic Models for Community Analysis

Pilfered from: Ramesh Nallapati, http://www.cs.cmu.edu/~wcohen/10-802/lda-sep-18.ppt

Page 2: Generative Topic Models for Community Analysis

Objectives

• Cultural literacy for ML:
– Q: What are “topic models”?
– A1: popular indoor sport for machine learning researchers
– A2: a particular way of applying unsupervised learning of Bayes nets to text

• Quick historical survey of some sample papers in the area

Page 3: Generative Topic Models for Community Analysis

Outline

• Part I: Introduction to Topic Models
– Naive Bayes model
– Mixture Models
  • Expectation Maximization
– PLSA
– LDA
  • Variational EM
  • Gibbs Sampling

• Part II: Topic Models for Community Analysis
– Citation modeling with PLSA
– Citation modeling with LDA
– Author Topic Model
– Author Topic Recipient Model
– Modeling influence of Citations
– Mixed membership Stochastic Block Model

Page 4: Generative Topic Models for Community Analysis

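Introduction to Topic Models

• Multinomial Naïve Bayes

[Figure: graphical model with class node C and word nodes W1, W2, W3, …, WN, inside a plate over M documents]

• For each document $d = 1, \dots, M$
  • Generate $C_d \sim \mathrm{Mult}(\cdot \mid \pi)$
  • For each position $n = 1, \dots, N_d$
    • Generate $w_n \sim \mathrm{Mult}(\cdot \mid \beta, C_d)$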
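The generative story above can be simulated directly. The following is a minimal sketch, assuming the class prior is called `pi` and the class-conditional word distributions are the rows of a matrix `beta`; both names, the function name, and the toy parameter values are illustrative and not taken from the slides.

```python
import numpy as np

def sample_naive_bayes_corpus(pi, beta, doc_lengths, seed=0):
    """Sample a class and a bag of word ids for every document."""
    rng = np.random.default_rng(seed)
    classes, docs = [], []
    for n_d in doc_lengths:
        # C_d ~ Mult(. | pi): one class for the whole document.
        c_d = int(rng.choice(len(pi), p=pi))
        # w_n ~ Mult(. | beta, C_d): every word drawn from that class.
        words = rng.choice(beta.shape[1], size=n_d, p=beta[c_d])
        classes.append(c_d)
        docs.append(words)
    return classes, docs

# Toy parameters (assumed): 2 classes, vocabulary of 4 word ids.
pi = np.array([0.5, 0.5])
beta = np.array([[0.7, 0.2, 0.1, 0.0],
                 [0.0, 0.1, 0.2, 0.7]])
classes, docs = sample_naive_bayes_corpus(pi, beta, doc_lengths=[5, 8, 3])
```

Because each document has exactly one class, all of its words come from a single row of `beta`; this is the restriction that PLSA and LDA later relax.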

Page 5: Generative Topic Models for Community Analysis

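Introduction to Topic Models

• Naïve Bayes Model: Compact representation

[Figure: the unrolled graphical model (C with word nodes W1, W2, W3, …, WN, plate M) next to its compact plate notation (C and W, with plates N and M)]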

Page 6: Generative Topic Models for Community Analysis

Introduction to Topic Models

• Mixture model: unsupervised naïve Bayes model

[Figure: plate diagram with C and W, plates N and M; the class node becomes the hidden variable Z]

• Joint probability of words and classes:

• But classes are not visible:
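The two formulas referred to on this page did not survive in the transcript. Under the naïve Bayes generative process, and writing $\pi$ for the class prior, $\beta$ for the class-conditional word distributions, and $K$ for the number of classes (notation assumed here), the standard forms would be roughly:

```latex
% Joint probability of words and (observed) classes
P(\mathbf{w}, \mathbf{C} \mid \pi, \beta)
  = \prod_{d=1}^{M} \pi_{C_d} \prod_{n=1}^{N_d} \beta_{C_d,\, w_{dn}}

% Classes hidden: sum the class out of each document
P(\mathbf{w} \mid \pi, \beta)
  = \prod_{d=1}^{M} \sum_{k=1}^{K} \pi_{k} \prod_{n=1}^{N_d} \beta_{k,\, w_{dn}}
```

The sum inside the product is what prevents a closed-form maximum-likelihood solution and motivates Expectation Maximization.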

Page 7: Generative Topic Models for Community Analysis

Introduction to Topic Models
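The outline lists Expectation Maximization as the learning method for the mixture model. The following is a minimal sketch of EM for a mixture of multinomials, assuming documents are given as an (M, V) matrix of word counts; the function name, the smoothing constant, and the toy data are assumptions made for illustration.

```python
import numpy as np

def em_multinomial_mixture(counts, n_classes, n_iter=50, seed=0, smooth=1e-2):
    """EM for a mixture of multinomials on an (M, V) word-count matrix."""
    rng = np.random.default_rng(seed)
    n_docs, vocab = counts.shape
    pi = np.full(n_classes, 1.0 / n_classes)               # class prior
    beta = rng.dirichlet(np.ones(vocab), size=n_classes)   # (K, V) word distributions
    for _ in range(n_iter):
        # E-step: posterior over the hidden class of each document, in log space.
        log_post = np.log(pi) + counts @ np.log(beta).T    # (M, K)
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from expected counts (lightly smoothed).
        pi = (resp.sum(axis=0) + smooth) / (n_docs + n_classes * smooth)
        beta = resp.T @ counts + smooth
        beta /= beta.sum(axis=1, keepdims=True)
    return pi, beta, resp

# Toy corpus (assumed): two documents about words 0-1, two about words 2-3.
counts = np.array([[5, 4, 0, 0],
                   [6, 3, 0, 0],
                   [0, 0, 4, 5],
                   [0, 1, 5, 4]])
pi, beta, resp = em_multinomial_mixture(counts, n_classes=2)
```

Each row of `resp` is the soft class assignment of one document; the smoothing keeps every probability strictly positive so the logarithms stay finite.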

Page 8: Generative Topic Models for Community Analysis

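Introduction to Topic Models

• Probabilistic Latent Semantic Analysis Model

[Figure: plate diagram d → z → w, with plates N and M; $\theta_d$ is the topic distribution of document d]

• Select document $d \sim \mathrm{Mult}(\pi)$
• For each position $n = 1, \dots, N_d$
  • Generate $z_n \sim \mathrm{Mult}(\cdot \mid \theta_d)$
  • Generate $w_n \sim \mathrm{Mult}(\cdot \mid \beta_{z_n})$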
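PLSA is usually fitted with EM over the per-token topic posteriors. The sketch below is one compact way to write it, assuming an (M, V) count matrix and using `theta` for the per-document topic distributions and `beta` for the per-topic word distributions; these names and the toy data are assumptions, and the dense (M, K, V) array only suits small examples.

```python
import numpy as np

def plsa_em(counts, n_topics, n_iter=100, seed=0):
    """EM for PLSA on an (M, V) document-word count matrix."""
    rng = np.random.default_rng(seed)
    n_docs, vocab = counts.shape
    theta = rng.dirichlet(np.ones(n_topics), size=n_docs)  # P(z | d), shape (M, K)
    beta = rng.dirichlet(np.ones(vocab), size=n_topics)    # P(w | z), shape (K, V)
    eps = 1e-12
    for _ in range(n_iter):
        # E-step: P(z | d, w) is proportional to P(z | d) P(w | z).
        post = theta[:, :, None] * beta[None, :, :]        # (M, K, V)
        post /= post.sum(axis=1, keepdims=True) + eps
        # M-step: expected counts n(d, w) P(z | d, w).
        expected = counts[:, None, :] * post               # (M, K, V)
        theta = expected.sum(axis=2) + eps
        theta /= theta.sum(axis=1, keepdims=True)
        beta = expected.sum(axis=0) + eps
        beta /= beta.sum(axis=1, keepdims=True)
    return theta, beta

counts = np.array([[4, 3, 0, 0],
                   [0, 0, 5, 2],
                   [2, 2, 2, 2]])
theta, beta = plsa_em(counts, n_topics=2, n_iter=30)
```

Note that `theta` has one row per training document, which is the reason PLSA has no natural way to generate or score an unseen document.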

Page 9: Generative Topic Models for Community Analysis

Introduction to Topic Models

• Probabilistic Latent Semantic Analysis Model
– Learning using EM
– Not a complete generative model
  • Has a distribution over the training set of documents: no new document can be generated!
– Nevertheless, more realistic than mixture model
  • Documents can discuss multiple topics!

Page 10: Generative Topic Models for Community Analysis

Introduction to Topic Models

• PLSA topics (TDT-1 corpus)

Page 11: Generative Topic Models for Community Analysis

Introduction to Topic Models

Page 12: Generative Topic Models for Community Analysis

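Introduction to Topic Models

• Latent Dirichlet Allocation

[Figure: plate diagram z → w, with plates N and M]

• For each document $d = 1, \dots, M$
  • Generate $\theta_d \sim \mathrm{Dir}(\cdot \mid \alpha)$
  • For each position $n = 1, \dots, N_d$
    • Generate $z_n \sim \mathrm{Mult}(\cdot \mid \theta_d)$
    • Generate $w_n \sim \mathrm{Mult}(\cdot \mid \beta_{z_n})$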
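The LDA process differs from PLSA only in that each document's topic distribution is itself drawn from a Dirichlet. A minimal simulation sketch follows, assuming a Dirichlet parameter vector `alpha` and a (K, V) topic-word matrix `beta`; the function name and toy values are illustrative assumptions.

```python
import numpy as np

def sample_lda_corpus(alpha, beta, doc_lengths, seed=0):
    """Sample topic proportions, topic assignments and words for each document."""
    rng = np.random.default_rng(seed)
    n_topics, vocab = beta.shape
    thetas, docs = [], []
    for n_d in doc_lengths:
        theta_d = rng.dirichlet(alpha)                      # theta_d ~ Dir(. | alpha)
        z = rng.choice(n_topics, size=n_d, p=theta_d)       # z_n ~ Mult(. | theta_d)
        # w_n ~ Mult(. | beta_{z_n}): each word drawn from its own topic.
        w = np.array([rng.choice(vocab, p=beta[k]) for k in z], dtype=int)
        thetas.append(theta_d)
        docs.append((z, w))
    return thetas, docs

alpha = np.array([0.5, 0.5, 0.5])
beta = np.array([[0.6, 0.4, 0.0, 0.0, 0.0, 0.0],
                 [0.0, 0.0, 0.5, 0.5, 0.0, 0.0],
                 [0.0, 0.0, 0.0, 0.0, 0.3, 0.7]])
thetas, docs = sample_lda_corpus(alpha, beta, doc_lengths=[10, 4])
```

Since `theta_d` is sampled rather than looked up per training document, the same code can produce arbitrarily many new documents.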

Page 13: Generative Topic Models for Community Analysis

Introduction to Topic Models

• Latent Dirichlet Allocation
– Overcomes the issues with PLSA
  • Can generate any random document
– Parameter learning:
  • Variational EM
    – Numerical approximation using lower-bounds
    – Results in biased solutions
    – Convergence has numerical guarantees
  • Gibbs Sampling
    – Stochastic simulation
    – Unbiased solutions
    – Stochastic convergence

Page 14: Generative Topic Models for Community Analysis

Introduction to Topic Models

• Variational EM for LDA
– Approximate the posterior by a simpler distribution
  • A convex function in each parameter!
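The formulas on this page are missing from the transcript. As a sketch of what the "simpler distribution" usually looks like for LDA, the standard mean-field choice is a fully factorized family with free parameters $\gamma$ and $\phi$ (symbols assumed here, not recovered from the slide), together with the resulting lower bound on the log likelihood:

```latex
% Factorized approximation to the posterior p(\theta, z \mid w, \alpha, \beta)
q(\theta, \mathbf{z} \mid \gamma, \phi)
  = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)

% Lower bound that variational EM maximizes
\log p(\mathbf{w} \mid \alpha, \beta)
  \;\ge\; \mathbb{E}_q\!\left[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)\right]
        - \mathbb{E}_q\!\left[\log q(\theta, \mathbf{z})\right]
```

The bound is optimized by coordinate updates, one group of parameters at a time, which is why only a per-parameter property is needed.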

Page 15: Generative Topic Models for Community Analysis

Introduction to Topic Models

• Gibbs sampling
– Applicable when joint distribution is hard to evaluate but conditional distribution is known
– Sequence of samples comprises a Markov Chain
– Stationary distribution of the chain is the joint distribution
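For LDA, a common concrete instance of this idea is the collapsed Gibbs sampler, which resamples one topic assignment at a time from its conditional given all the others. The slide does not say which sampler is meant, so the sketch below is an assumption; `alpha` and `eta` are symmetric Dirichlet hyperparameters chosen for illustration.

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab, alpha=0.1, eta=0.01, n_sweeps=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of lists of word ids."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), n_topics))   # topic counts per document
    nkw = np.zeros((n_topics, vocab))       # word counts per topic
    nk = np.zeros(n_topics)                 # total tokens per topic
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            ndk[d, k] += 1
            nkw[k, w] += 1
            nk[k] += 1
    for _ in range(n_sweeps):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]
                # Remove the current assignment from the counts.
                ndk[d, k] -= 1
                nkw[k, w] -= 1
                nk[k] -= 1
                # Conditional of z_n given every other assignment and the words.
                p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + vocab * eta)
                k = rng.choice(n_topics, p=p / p.sum())
                # Record the new assignment.
                z[d][n] = k
                ndk[d, k] += 1
                nkw[k, w] += 1
                nk[k] += 1
    return z, ndk, nkw

docs = [[0, 1, 0, 1], [2, 3, 3, 2], [0, 1, 2, 3]]
z, ndk, nkw = lda_gibbs(docs, n_topics=2, vocab=4, n_sweeps=20)
```

Successive states of `z` form the Markov chain mentioned above; in practice early sweeps are discarded as burn-in before the counts are used to estimate topics.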

Page 16: Generative Topic Models for Community Analysis

Introduction to Topic Models

• LDA topics

Page 17: Generative Topic Models for Community Analysis

Introduction to Topic Models

• LDA’s view of a document

Page 18: Generative Topic Models for Community Analysis

Introduction to Topic Models

• Perplexity comparison of various models

[Figure: perplexity curves for Unigram, Mixture model, PLSA, and LDA; lower is better]
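Perplexity is the exponential of the negative average log-likelihood per token. The sketch below assumes the model's per-document word probabilities can be written as `theta @ beta`; for a real held-out comparison, `theta` for unseen documents would first have to be inferred, a step this illustration leaves out.

```python
import numpy as np

def perplexity(counts, theta, beta):
    """exp(-(total log-likelihood) / (total number of tokens))."""
    word_probs = theta @ beta                     # (M, V): p(w | d)
    log_lik = np.sum(counts * np.log(word_probs + 1e-12))
    return float(np.exp(-log_lik / counts.sum()))

# Sanity check: a uniform model over 4 words has perplexity of about 4.
counts = np.array([[1, 2, 3, 4]])
theta = np.array([[1.0]])
beta = np.array([[0.25, 0.25, 0.25, 0.25]])
uniform_perplexity = perplexity(counts, theta, beta)
```

A perplexity of V corresponds to guessing uniformly among V words, so lower values mean the model concentrates probability on the words that actually occur.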

Page 19: Generative Topic Models for Community Analysis

Outline

• Part I: Introduction to Topic Models
– Naive Bayes model
– Mixture Models
  • Expectation Maximization
– PLSA
– LDA
  • Variational EM
  • Gibbs Sampling

• Part II: Topic Models for Community Analysis
– Citation modeling with PLSA
– Citation modeling with LDA
– Author Topic Model
– Author Topic Recipient Model
– Modeling influence of Citations
– Mixed membership Stochastic Block Model

Page 20: Generative Topic Models for Community Analysis

Hyperlink modeling using PLSA

Page 21: Generative Topic Models for Community Analysis

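Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]

[Figure: plate diagram in which document d generates topics z for words w (plate N) and topics z for citations c (plate L), inside a plate over M documents]

• Select document $d \sim \mathrm{Mult}(\pi)$
• For each position $n = 1, \dots, N_d$
  • Generate $z_n \sim \mathrm{Mult}(\cdot \mid \theta_d)$
  • Generate $w_n \sim \mathrm{Mult}(\cdot \mid \beta_{z_n})$
• For each citation $j = 1, \dots, L_d$
  • Generate $z_j \sim \mathrm{Mult}(\cdot \mid \theta_d)$
  • Generate $c_j \sim \mathrm{Mult}(\cdot \mid \Omega_{z_j})$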
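Words and citations are generated the same way: pick a topic from the document's topic distribution, then pick an item from that topic. A minimal sketch, assuming `beta` holds the topic-word distributions and `omega` the topic-to-cited-document distributions (both names and the toy values are assumptions):

```python
import numpy as np

def sample_linked_document(theta_d, beta, omega, n_words, n_citations, rng):
    """Sample the words and the outgoing citations of one document."""
    n_topics = len(theta_d)
    # Words: z_n ~ Mult(. | theta_d), then w_n from topic z_n.
    z_words = rng.choice(n_topics, size=n_words, p=theta_d)
    words = np.array([rng.choice(beta.shape[1], p=beta[k]) for k in z_words], dtype=int)
    # Citations: z_j ~ Mult(. | theta_d), then c_j from topic z_j.
    z_cites = rng.choice(n_topics, size=n_citations, p=theta_d)
    cites = np.array([rng.choice(omega.shape[1], p=omega[k]) for k in z_cites], dtype=int)
    return words, cites

rng = np.random.default_rng(0)
theta_d = np.array([0.8, 0.2])
beta = np.array([[0.5, 0.5, 0.0, 0.0],
                 [0.0, 0.0, 0.5, 0.5]])     # 2 topics, 4 words
omega = np.array([[0.9, 0.1, 0.0],
                  [0.0, 0.1, 0.9]])         # 2 topics, 3 citable documents
words, cites = sample_linked_document(theta_d, beta, omega, 6, 3, rng)
```

Sharing `theta_d` between the two loops is what ties a document's text to the documents it links to.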

Page 22: Generative Topic Models for Community Analysis

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]

[Figure: plate diagram with nodes d, z, w, c and plates N, L, M]

PLSA likelihood:

New likelihood:

Learning using EM
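The two likelihood formulas were images and are not in the transcript. Based on the generative process for words and citations, they would plausibly take the following form, writing $P(z \mid d)$ for the document's topic distribution; this is a reconstruction under that assumption, not a copy of the slide.

```latex
% PLSA likelihood: words only
\mathcal{L}_{\text{PLSA}}
  = \prod_{d=1}^{M} \prod_{n=1}^{N_d} \sum_{z} P(z \mid d)\, P(w_n \mid z)

% New likelihood: words and citations share P(z | d)
\mathcal{L}_{\text{new}}
  = \prod_{d=1}^{M}
    \left[ \prod_{n=1}^{N_d} \sum_{z} P(z \mid d)\, P(w_n \mid z) \right]
    \left[ \prod_{j=1}^{L_d} \sum_{z} P(z \mid d)\, P(c_j \mid z) \right]
```

Both are mixtures inside a product, so the same EM recipe used for PLSA applies, with one extra M-step for $P(c \mid z)$.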

Page 23: Generative Topic Models for Community Analysis

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]

Heuristic:

$0 \le \alpha \le 1$ determines the relative importance of content (weight $\alpha$) and hyperlinks (weight $1 - \alpha$)
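The weighting heuristic can be read as a convex combination of the content and hyperlink log-likelihoods. The sketch below assumes the mixing weight is called `alpha`, with `alpha` on the content term, and uses count matrices for words and citations; it omits any per-document normalization of the counts, and all names are illustrative.

```python
import numpy as np

def weighted_log_likelihood(word_counts, cite_counts, theta, beta, omega, alpha):
    """alpha * (content log-likelihood) + (1 - alpha) * (hyperlink log-likelihood)."""
    eps = 1e-12
    word_ll = np.sum(word_counts * np.log(theta @ beta + eps))    # content term
    cite_ll = np.sum(cite_counts * np.log(theta @ omega + eps))   # hyperlink term
    return float(alpha * word_ll + (1.0 - alpha) * cite_ll)

# One document, one topic, two words, one citable document (toy values).
theta = np.array([[1.0]])
beta = np.array([[0.5, 0.5]])
omega = np.array([[1.0]])
word_counts = np.array([[1, 2]])
cite_counts = np.array([[4]])
content_only = weighted_log_likelihood(word_counts, cite_counts, theta, beta, omega, 1.0)
links_only = weighted_log_likelihood(word_counts, cite_counts, theta, beta, omega, 0.0)
```

Setting `alpha` to 1 recovers plain PLSA on the text, and setting it to 0 fits the link structure alone.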

Page 24: Generative Topic Models for Community Analysis

Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS, 2001]

• Classification performance

[Figure: two classification-performance plots, each with an axis labeled "Hyperlink" at one end and "content" at the other]

Page 25: Generative Topic Models for Community Analysis

Hyperlink modeling using LDA

Page 26: Generative Topic Models for Community Analysis

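Hyperlink modeling using LDA [Erosheva, Fienberg, Lafferty, PNAS, 2004]

[Figure: plate diagram with z → w (plate N) and z → c (plate L), inside a plate over M documents]

• For each document $d = 1, \dots, M$
  • Generate $\theta_d \sim \mathrm{Dir}(\cdot \mid \alpha)$
  • For each position $n = 1, \dots, N_d$
    • Generate $z_n \sim \mathrm{Mult}(\cdot \mid \theta_d)$
    • Generate $w_n \sim \mathrm{Mult}(\cdot \mid \beta_{z_n})$
  • For each citation $j = 1, \dots, L_d$
    • Generate $z_j \sim \mathrm{Mult}(\cdot \mid \theta_d)$
    • Generate $c_j \sim \mathrm{Mult}(\cdot \mid \Omega_{z_j})$

Learning using variational EM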

Page 27: Generative Topic Models for Community Analysis

Hyperlink modeling using LDA [Erosheva, Fienberg, Lafferty, PNAS, 2004]

Page 28: Generative Topic Models for Community Analysis

Author-Topic Model for Scientific Literature

Page 29: Generative Topic Models for Community Analysis

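Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]

[Figure: plate diagram with nodes a, x, z, w and plates labeled N, M, A, P, K]

• For each author $a = 1, \dots, A$
  • Generate $\theta_a \sim \mathrm{Dir}(\cdot \mid \alpha)$
• For each topic $k = 1, \dots, K$
  • Generate $\phi_k \sim \mathrm{Dir}(\cdot \mid \beta)$
• For each document $d = 1, \dots, M$
  • For each position $n = 1, \dots, N_d$
    • Generate author $x \sim \mathrm{Unif}(\cdot \mid a_d)$
    • Generate $z_n \sim \mathrm{Mult}(\cdot \mid \theta_x)$
    • Generate $w_n \sim \mathrm{Mult}(\cdot \mid \phi_{z_n})$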
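In the author-topic process each word first picks one of the document's authors uniformly, then a topic from that author's distribution. A minimal simulation sketch, assuming `theta` holds one topic distribution per author and `phi` one word distribution per topic; the function name, hyperparameter values, and sizes are illustrative assumptions.

```python
import numpy as np

def sample_author_topic_doc(authors_d, theta, phi, n_words, rng):
    """Sample (author, topic, word) triples for one document with author list authors_d."""
    xs, zs, ws = [], [], []
    for _ in range(n_words):
        x = int(rng.choice(authors_d))                     # x ~ Unif(a_d)
        z = int(rng.choice(theta.shape[1], p=theta[x]))    # z_n ~ Mult(. | theta_x)
        w = int(rng.choice(phi.shape[1], p=phi[z]))        # w_n ~ Mult(. | phi_{z_n})
        xs.append(x)
        zs.append(z)
        ws.append(w)
    return xs, zs, ws

rng = np.random.default_rng(0)
n_authors, n_topics, vocab = 3, 2, 5
theta = rng.dirichlet(np.full(n_topics, 0.5), size=n_authors)   # theta_a ~ Dir(alpha)
phi = rng.dirichlet(np.full(vocab, 0.5), size=n_topics)         # phi_k ~ Dir(beta)
xs, zs, ws = sample_author_topic_doc([0, 2], theta, phi, n_words=10, rng=rng)
```

Topic proportions belong to authors rather than documents here, so a document's mixture is determined entirely by who wrote it.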

Page 30: Generative Topic Models for Community Analysis

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]

Learning: Gibbs sampling

[Figure: plate diagram with nodes a, x, z, w and plates labeled N, M, A, P, K]

Page 31: Generative Topic Models for Community Analysis

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI, 2004]

• Topic-Author visualization

Page 32: Generative Topic Models for Community Analysis

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]

Page 33: Generative Topic Models for Community Analysis

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]

Gibbs sampling

Page 34: Generative Topic Models for Community Analysis

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]

• Datasets
– Enron email data
  • 23,488 messages between 147 users
– McCallum’s personal email
  • 23,488(?) messages with 128 authors

Page 35: Generative Topic Models for Community Analysis

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]

• Topic Visualization: Enron set

Page 36: Generative Topic Models for Community Analysis

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI’05]

• Topic Visualization: McCallum’s data

Page 37: Generative Topic Models for Community Analysis

Modeling Citation Influences

Page 38: Generative Topic Models for Community Analysis

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]

• Citation influence model

Page 39: Generative Topic Models for Community Analysis

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]

• Citation influence graph for LDA paper

Page 40: Generative Topic Models for Community Analysis

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]

• Words in LDA paper assigned to citations

Page 41: Generative Topic Models for Community Analysis

Link-PLSA-LDA: Topic Influence in Blogs (ICWSM 2008)

Ramesh Nallapati, Amr Ahmed, Eric Xing

Page 42: Generative Topic Models for Community Analysis
