Techniques for Dimensionality Reduction: Topic Models/LDA (Latent Dirichlet Allocation)

Transcript of the slide deck (71 slides).

Page 1: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

Techniques for Dimensionality Reduction

Topic Models/LDA

Latent Dirichlet Allocation

Page 2: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

Outline
• Recap: PCA
• Directed models for text
  – Naïve Bayes
  – unsupervised Naïve Bayes (mixtures of multinomials)
  – probabilistic latent semantic indexing (pLSI)
    • and connection to PCA, LSI, MF
  – latent Dirichlet allocation (LDA)
    • inference with Gibbs sampling
• LDA-like models for graphs, etc.

Page 3: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

Quick Recap….

Page 4: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

PCA as MF

[Figure: a 1000 images × 10,000 pixels matrix V (10,000,000 entries), with V[i,j] = pixel j in image i, approximated as the product of a 1000 × 2 coefficient matrix (rows [x_i, y_i]) and a 2 × 10,000 matrix whose rows are the two prototypes, PC1 and PC2.]
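A minimal numpy sketch of this view, using randomly generated stand-in data (the 1000 × 10,000 shape and the choice of two components follow the slide; the data itself is made up):

```python
import numpy as np

# Stand-in data: 1000 "images" of 10,000 pixels each (the slide's V matrix).
rng = np.random.default_rng(0)
V = rng.random((1000, 10_000))

# Center the features, then take the top-2 principal components via SVD.
V_centered = V - V.mean(axis=0)
U, S, Vt = np.linalg.svd(V_centered, full_matrices=False)

k = 2
Z = U[:, :k] * S[:k]          # 1000 x 2: the [x_i, y_i] coefficients
prototypes = Vt[:k, :]        # 2 x 10,000: PC1 and PC2

# Rank-2 reconstruction: each image is a weighted sum of the two prototypes.
V_hat = Z @ prototypes + V.mean(axis=0)
print(V_hat.shape)            # (1000, 10000)
```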

Page 5: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

[Figure: the same 1000 images × 10,000 pixels factorization, now showing one image's coefficients: with weights 1.4 and 0.5 on the two prototypes, the image is reconstructed as 1.4*PC1 + 0.5*PC2.]

Page 6: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

PCA for movie recommendation

[Figure: an n users × m movies ratings matrix V, with V[i,j] = user i’s rating of movie j, approximated as an n × 2 matrix of user coefficients times a 2 × m matrix of movie prototypes; one highlighted row is Bob’s rating vector V_Bob.]

Page 7: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

….. vs k-means

[Figure: the original n examples × m features data matrix X approximated as Z*M, where Z is an n × r matrix of 0/1 indicators for the r clusters and M holds the r cluster means as rows.]

Question: what other generative models look like MF?
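A small numpy/scikit-learn sketch of the k-means-as-MF view, assuming synthetic 2-D data and r = 3 clusters (all values here are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2)) + rng.choice([-4.0, 0.0, 4.0], size=(300, 1))

r = 3
km = KMeans(n_clusters=r, n_init=10, random_state=0).fit(X)

# Z: one-hot cluster indicators (n x r); M: cluster means (r x m).
Z = np.eye(r)[km.labels_]
M = km.cluster_centers_

# X is approximated by Z @ M: each row is replaced by its cluster mean.
print(np.mean((X - Z @ M) ** 2))   # average reconstruction error
```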

Page 8: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

LDA AND OTHER DIRECTED MODELS FOR MODELING TEXT

Page 9: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

Supervised Multinomial Naïve Bayes
• Naïve Bayes Model: compact notation

[Figure: two equivalent graphical models. Left: class variable C with children W1, W2, W3, …, WN, inside a plate over the M documents. Right (plate notation): C with a single child W inside a plate over the N word positions, nested in the plate over the M documents.]

Page 10: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

Supervised Multinomial Naïve Bayes
• Naïve Bayes Model: compact representation

[Figure: the same plate diagram, now also showing a plate over the K classes holding the per-class word multinomials.]

Page 11: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

Supervised Naïve Bayes
• Multinomial Naïve Bayes

[Figure: plate diagram with class C and words W1, W2, W3, …, WN inside a plate over the M documents, plus a plate over the K classes.]

• For each class k = 1..K:
  • Construct a multinomial β_k over words
• For each document d = 1,…,M:
  • Generate C_d ~ Mult(· | π)
  • For each position n = 1,…,N_d:
    • Generate w_n ~ Mult(· | β_{C_d}) … or if you prefer, w_n ~ Pr(w | C_d)
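A hedged Python sketch of this generative story (the symbol names β and π above are restored from context; the sizes below are arbitrary illustration values):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, M = 3, 50, 10          # classes, vocabulary size, documents

pi = rng.dirichlet(np.ones(K))            # class prior pi
beta = rng.dirichlet(np.ones(V), size=K)  # one word multinomial beta_k per class

docs, labels = [], []
for d in range(M):
    c = rng.choice(K, p=pi)                      # C_d ~ Mult(. | pi)
    n_d = rng.integers(20, 40)                   # document length
    words = rng.choice(V, size=n_d, p=beta[c])   # w_n ~ Mult(. | beta_{C_d})
    docs.append(words)
    labels.append(c)
```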

Page 12: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

Unsupervised Naïve Bayes
• Mixture model: unsupervised naïve Bayes

[Figure: plate diagram with latent class Z and word W inside nested plates over the N word positions and M documents, plus a plate over the K classes.]

• Joint probability of words and classes: Pr(C_d = c, w_d) = π_c ∏_n β_{c, w_{d,n}}
• But classes are not visible: we only observe Σ_c Pr(C_d = c, w_d)

Solved using EM

Page 13: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

Beyond Naïve Bayes - Probabilistic Latent Semantic Indexing (pLSI)
• Every document is a mixture of topics

[Figure: plate diagram with latent topic Z and word W inside nested plates over the N word positions and M documents, plus a plate over the K topic multinomials; compare to unsupervised NB/mixture of multinomials.]

• For i = 1…K:
  • Let β_i be a multinomial over words
• For each document d:
  • Let π_d be a distribution over {1,..,K}
  • For each word position in d:
    • Pick a topic z from π_d
    • Pick a word w from β_z

Latent variable for each word, not for each document. Can still learn with EM.
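A sketch of how this differs from the mixture model above: a topic is drawn per word, not per document. Sizes and names are illustrative; in pLSI the π_d are fitted parameters, but here they are just sampled randomly for the illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, M = 3, 50, 10

beta = rng.dirichlet(np.ones(V), size=K)      # K topic-word multinomials
pi = rng.dirichlet(np.ones(K), size=M)        # one topic mixture per document (stand-in values)

docs = []
for d in range(M):
    n_d = rng.integers(20, 40)
    z = rng.choice(K, size=n_d, p=pi[d])      # a topic for EACH word position
    words = np.array([rng.choice(V, p=beta[zi]) for zi in z])
    docs.append(words)
```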

Page 14: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

LATENT SEMANTIC INDEXING VS. SINGULAR VALUE DECOMPOSITION VS. PRINCIPAL COMPONENTS ANALYSIS

Page 15: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

PCA as MF

[Figure: the same 1000 images × 10,000 pixels factorization as before, with V[i,j] = pixel j in image i and the two prototypes PC1 and PC2.]

Remember C_X, the covariance of the features of the original data matrix? We can also look at C_Z and see what its covariance is. Turns out C_Z[i,j] = 0 for i ≠ j and C_Z[i,i] = λ_i.

Page 16: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

Background: PCA vs LSI
• Some connections
  – PCA is closely related to singular value decomposition (SVD): see handout
    • we factored the data matrix X into Z*V, where V is the eigenvectors of C_X
    • one more step: factor Z into U*Σ, where Σ is diagonal, Σ(i,i) = sqrt(λ_i), and the variables in U have unit variance
    • then X = U*Σ*V, and this is called SVD
  • When X is a term-document matrix this is called latent semantic indexing (LSI)
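A brief sketch of that pipeline on a toy term-document matrix, using scikit-learn's TruncatedSVD, which is a common way to run LSI in practice (the toy corpus and the choice of 2 components are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock prices fell sharply today",
    "markets and stock traders reacted",
]

X = CountVectorizer().fit_transform(docs)   # documents x words count matrix
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vecs = svd.fit_transform(X)             # n_docs x 2 latent coordinates (U * Sigma)
word_dirs = svd.components_                 # 2 x n_words (rows of V)

print(doc_vecs.round(2))                    # similar documents get similar coordinates
```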

Page 17: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

LSI

[Figure: an n documents × m words matrix V, with V[i,j] = frequency of word j in document i, approximated as the product of an n × 2 document-coordinate matrix and a 2 × m matrix of word directions.]

Page 18: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

LSI

http://technowiki.wordpress.com/2011/08/27/latent-semantic-analysis-lsa-tutorial/

Page 19: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

Beyond Naïve Bayes - Probabilistic Latent Semantic Indexing (pLSI)
• Every document is a mixture of topics

[Figure: the same plate diagram as before: latent topic Z and word W inside nested plates over the N word positions and M documents, plus a plate over the K topics; compare to unsupervised NB/mixture of multinomials.]

• For i = 1…K:
  • Let β_i be a multinomial over words
• For each document d:
  • Let π_d be a distribution over {1,..,K}
  • For each word position in d:
    • Pick a topic z from π_d
    • Pick a word w from β_z

Latent variable for each word, not for each document.

Page 20: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

Beyond Naïve Bayes - Probabilistic Latent Semantic Indexing (pLSI)
• For i = 1…K:
  • Let β_i be a multinomial over words
• For each document d:
  • Let π_d be a distribution over {1,..,K}
  • For each word position in d:
    • Pick a topic z from π_d
    • Pick a word w from β_z

[Figure: pLSI drawn as matrix factorization: an n documents × m words matrix is approximated by an n × 2 matrix of per-document topic mixtures π_1,…,π_n times a 2 × m matrix whose rows are the topic multinomials β_1 and β_2; the plate diagram with Z, W, and the K topics appears alongside.]

Page 21: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

LSI or pLSI

[Figure: the same n documents × m words matrix V, with V[i,j] = frequency of word j in document i, factored into document coordinates times word/topic directions.]

Page 22: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

PLSI and MF and Nonnegative MF

Page 23: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

PLSI and MF and Nonnegative MF

We previously considered minimizing L2 reconstruction error. Another choice: constrain C, H to be non-negative and minimize J_NMF, the generalized KL divergence (I-divergence) between V and C*H.

Ding et al.: the objective functions for NMF and PLSI are equivalent.
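A short illustration of this choice in scikit-learn, where the NMF solver can use the generalized KL divergence instead of the Frobenius (L2) objective (toy count data; all parameter values are arbitrary):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = rng.poisson(1.0, size=(20, 30)).astype(float)   # toy nonnegative count matrix

# beta_loss='kullback-leibler' gives the objective Ding et al. tie to PLSI;
# solver='mu' (multiplicative updates) is required for that loss in scikit-learn.
model = NMF(n_components=2, beta_loss="kullback-leibler", solver="mu",
            init="random", max_iter=500, random_state=0)
C = model.fit_transform(V)      # nonnegative document factors
H = model.components_           # nonnegative topic/word factors

print(np.abs(V - C @ H).mean()) # rough check of reconstruction quality
```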

Page 24: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

Beyond Naïve Bayes - Probabilistic Latent Semantic Indexing (pLSI)
• Every document is a mixture of topics

[Figure: the same pLSI plate diagram: latent topic Z and word W inside nested plates over the N word positions and M documents, plus a plate over the K topics.]

• For i = 1…K:
  • Let β_i be a multinomial over words
• For each document d:
  • Let π_d be a distribution over {1,..,K}
  • For each word position in d:
    • Pick a topic z from π_d
    • Pick a word w from β_z

• Turns out to be hard to fit:
  • Lots of parameters!
  • Also: only applies to the training data

Page 25: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

LATENT DIRICHLET ALLOCATION (LDA)

Page 26: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

The LDA Topic Model

Page 27: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

LDA
• Motivation

[Figure: plate diagram with word w inside nested plates over the N word positions and M documents.]

Assumptions: 1) documents are i.i.d.; 2) within a document, words are i.i.d. (bag of words)

• For each document d = 1,…,M:
  • Generate θ_d ~ D1(…)
  • For each word n = 1,…,N_d:
    • generate w_n ~ D2(· | θ_d)

Now pick your favorite distributions for D1, D2.

Page 28: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

• Latent Dirichlet Allocation

[Figure: LDA plate diagram: topic z_n and word w_n inside nested plates over the N word positions and M documents, with a plate over the K topic multinomials.]

• For each document d = 1,…,M:
  • Generate θ_d ~ Dir(· | α)
  • For each position n = 1,…,N_d:
    • generate z_n ~ Mult(· | θ_d)
    • generate w_n ~ Mult(· | β_{z_n})

“Mixed membership”

[Equation: the conditional Pr(z_n = j | z_1, z_2, …, z_{n-1}), which the Dirichlet prior makes easy to compute in closed form.]
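A hedged numpy sketch of exactly this generative story (the hyperparameter values, number of topics, and sizes are made-up illustration values; drawing the topic multinomials from a symmetric Dirichlet is a convenience, not part of the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, M = 5, 100, 20
alpha, eta = 0.1, 0.01

beta = rng.dirichlet(np.full(V, eta), size=K)     # K topic-word multinomials
docs, thetas = [], []
for d in range(M):
    theta_d = rng.dirichlet(np.full(K, alpha))    # theta_d ~ Dir(alpha)
    n_d = rng.integers(50, 100)
    z = rng.choice(K, size=n_d, p=theta_d)        # z_n ~ Mult(theta_d)
    w = np.array([rng.choice(V, p=beta[k]) for k in z])  # w_n ~ Mult(beta_{z_n})
    docs.append(w)
    thetas.append(theta_d)
```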

Page 29: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

• LDA’s view of a document

Page 30: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

• LDA topics

Page 31: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

LDA used as dimension reduction for classification: 50 topics vs. all words, SVM
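A sketch of that kind of setup with scikit-learn; the dataset, category choice, and classifier settings below are placeholders, not the experiment reported on the slide:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])

clf = make_pipeline(
    CountVectorizer(max_features=5000, stop_words="english"),
    LatentDirichletAllocation(n_components=50, random_state=0),  # 50 topics as features
    LinearSVC(),                                                 # SVM on the topic mixtures
)
clf.fit(train.data, train.target)
```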

Page 32: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

LDA used for CF (collaborative filtering): users rating 100+ movies only

Page 33: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

LDA
• Latent Dirichlet Allocation
  – Parameter learning:
    • Variational EM
      – Numerical approximation using lower bounds
      – Results in biased solutions
      – Convergence has numerical guarantees
    • Gibbs Sampling
      – Stochastic simulation
      – Unbiased solutions
      – Stochastic convergence

Page 34: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

LDA
• Gibbs sampling – works for any directed model!
  – Applicable when the joint distribution is hard to evaluate but the conditional distributions are known
  – The sequence of samples comprises a Markov chain
  – The stationary distribution of the chain is the joint distribution

Key capability: estimate the distribution of one latent variable given the other latent variables and the observed variables.
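As a minimal, self-contained illustration of the idea (deliberately unrelated to LDA): a Gibbs sampler for a bivariate Gaussian, resampling each variable from its conditional given the other. The correlation value and sample counts are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                       # target: standard bivariate normal with correlation rho
x, y = 0.0, 0.0
samples = []

for t in range(10_000):
    # Conditionals of a bivariate normal: x | y ~ N(rho*y, 1 - rho^2), and symmetrically.
    x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))
    y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))
    samples.append((x, y))

samples = np.array(samples[1000:])          # drop burn-in
print(np.corrcoef(samples.T)[0, 1])         # should be close to rho
```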

Page 35: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

Why does Gibbs sampling work?
• What’s the fixed point?
  – The stationary distribution of the chain is the joint distribution
• When will it converge (in the limit)?
  – If the graph defined by the chain is connected
• How long will it take to converge?
  – Depends on the second eigenvalue of that graph

Page 36: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.
Page 37: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

Called “collapsed Gibbs sampling” since you’ve marginalized away some variables

From: “Parameter estimation for text analysis”, Gregor Heinrich

Page 38: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

LDA
• Latent Dirichlet Allocation

[Figure: the LDA plate diagram again: z and w inside nested plates over the N positions and M documents, with a plate over the K topics.]

• Randomly initialize each z_{m,n}
• Repeat for t = 1,….
  • For each doc m, word n:
    • Find Pr(z_{m,n} = k | other z’s)
    • Sample z_{m,n} according to that distribution

“Mixed membership”

Page 39: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

EVEN MORE DETAIL ON LDA…

Page 40: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

Way way more detail

Page 41: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

More detail

Page 42: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.
Page 43: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.
Page 44: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

What gets learned…..

Page 45: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

In a Math-ier Notation

[Figure: the count tables used in the sampler below: N[d,k] (per-document topic counts), N[*,k] (overall topic counts), M[w,k] (word-topic counts, written W[w,k] on the next slides), and N[*,*] = V.]
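In that notation, the conditional sampled at each step is the standard collapsed-Gibbs update for LDA, stated here from the literature (Heinrich) rather than read off the slide image; α and β are symmetric Dirichlet hyperparameters, |vocab| is the vocabulary size, and the counts exclude the current position (d, j):

Pr(z[d,j] = k | all other z’s) ∝ (N[d,k] + α) · (W[w,k] + β) / (N[*,k] + |vocab|·β), where w = w[d,j].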

Page 46: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

for each document d and word position j in d:
  • z[d,j] = k, a random topic
  • N[d,k]++
  • W[w,k]++, where w = id of the j-th word in d

Page 47: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

for each pass t = 1, 2, ….:
  for each document d and word position j in d:
    • z[d,j] = k, a new random topic
    • update N, W to reflect the new assignment of z:
      • N[d,k]++; N[d,k’]-- where k’ is the old z[d,j]
      • W[w,k]++; W[w,k’]-- where w is w[d,j]
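Putting the two slides together, a compact and unoptimized collapsed-Gibbs sketch, assuming the conditional given with the math-ier notation and symmetric hyperparameters alpha and beta (values chosen arbitrarily):

```python
import numpy as np

def collapsed_gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, passes=200, seed=0):
    """docs: list of lists of word ids in [0, V). Returns the count tables N[d,k] and W[w,k]."""
    rng = np.random.default_rng(seed)
    N = np.zeros((len(docs), K))          # N[d,k]: topic counts per document
    W = np.zeros((V, K))                  # W[w,k]: topic counts per word
    Nk = np.zeros(K)                      # N[*,k]: total words per topic
    z = []
    for d, doc in enumerate(docs):        # initialization pass
        zd = rng.integers(K, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            N[d, k] += 1; W[w, k] += 1; Nk[k] += 1
    for t in range(passes):               # Gibbs passes
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                k_old = z[d][j]           # remove the current assignment from the counts
                N[d, k_old] -= 1; W[w, k_old] -= 1; Nk[k_old] -= 1
                p = (N[d] + alpha) * (W[w] + beta) / (Nk + V * beta)
                k_new = rng.choice(K, p=p / p.sum())   # sample from the conditional
                z[d][j] = k_new
                N[d, k_new] += 1; W[w, k_new] += 1; Nk[k_new] += 1
    return N, W
```

Usage might look like `N, W = collapsed_gibbs_lda(docs, K=5, V=100)` on documents generated the way the earlier sketches do.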

Page 48: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.
Page 49: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

Some comments on LDA
• Very widely used model
• Also a component of many other models

Page 50: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

Techniques for Modeling Graphs
Latent Dirichlet Allocation++

Page 51: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

Review - LDA
• Motivation

[Figure: plate diagram with word w inside nested plates over the N word positions and M documents.]

Assumptions: 1) documents are i.i.d.; 2) within a document, words are i.i.d. (bag of words)

• For each document d = 1,…,M:
  • Generate θ_d ~ D1(…)
  • For each word n = 1,…,N_d:
    • generate w_n ~ D2(· | θ_d)

Docs and words are exchangeable.

Page 52: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

A Matrix Can Also Represent a Graph

[Figure: a small ten-node graph (nodes A–J) and its 10 × 10 adjacency matrix; entry (i,j) = 1 when there is an edge between node i and node j (e.g., row A contains three 1s, one per edge of A).]
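A tiny illustration of the representation (the edge list here is made up, not the graph in the figure):

```python
import numpy as np

nodes = list("ABCDEFGHIJ")
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("E", "F"), ("F", "G")]  # hypothetical edges

idx = {n: i for i, n in enumerate(nodes)}
A = np.zeros((len(nodes), len(nodes)), dtype=int)
for u, v in edges:
    A[idx[u], idx[v]] = 1
    A[idx[v], idx[u]] = 1        # undirected graph: symmetric adjacency matrix

print(A[idx["A"]])               # row A: 1s in the columns of A's neighbors
```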

Page 53: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

Karate club

Page 54: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

Schoolkids

Page 55: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

College football

Page 56: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.
Page 57: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

citations

Page 58: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.
Page 59: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.
Page 60: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.
Page 61: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

Stochastic Block Models

• “Stochastic block model”, aka “block-stochastic matrix”:
  – Draw n_i nodes in block i
  – With probability p_ij, connect pairs (u, v) where u is in block i and v is in block j
  – Special, simple case: p_ii = q_i, and p_ij = s for all i ≠ j
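A minimal generator for the simple special case just described (block sizes and probabilities are arbitrary illustration values):

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [20, 20, 20]           # n_i nodes in block i
q, s = 0.3, 0.02               # within-block prob p_ii = q, between-block prob p_ij = s

blocks = np.repeat(np.arange(len(sizes)), sizes)   # block id of each node
n = len(blocks)
P = np.where(blocks[:, None] == blocks[None, :], q, s)   # pairwise edge probabilities
A = (rng.random((n, n)) < P).astype(int)
A = np.triu(A, 1); A = A + A.T                     # undirected, no self-loops

print(A.sum() // 2, "edges")
```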

Page 62: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

Stochastic Block models: assume 1) nodes within a block z are exchangeable and 2) edges between blocks z_p, z_q are exchangeable

[Figure: plate diagram: a block assignment z_p for each of the N nodes p, and an edge indicator a_pq for each of the N² node pairs, whose distribution depends on the pair of blocks (z_p, z_q).]

Page 63: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

Stochastic Block models: assume 1) nodes within a block z are exchangeable and 2) edges between blocks z_p, z_q are exchangeable

[Figure: the same plate diagram: block assignments z_p over the N nodes and edge indicators a_pq over the N² pairs.]

Gibbs sampling:
• Randomly initialize z_p for each node p.
• For t = 1…
  • For each node p:
    • Compute Pr(z_p | other z’s)
    • Sample z_p

See: Snijders & Nowicki, 1997, Estimation and Prediction for Stochastic Blockmodels for Groups with Latent Graph Structure

Page 64: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

Mixed Membership Stochastic Block models

[Figure: plate diagram: each node p has a mixed-membership distribution π_p over blocks; for each ordered pair (p, q), per-pair block indicators z_{p·} and z_{·q} are drawn from π_p and π_q, and the edge indicator a_pq depends on that pair of blocks; plates over the N nodes and N² pairs.]

Airoldi et al., JMLR 2008

Page 65: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

Another mixed membership block model

Page 66: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

Another mixed membership block model

• z = (z_i, z_j) is a pair of block ids
• n_z = # pairs z
• q_{z1,i} = # links to i from block z1
• q_{z1,·} = # outlinks in block z1
• δ = indicator for diagonal
• M = # nodes

Page 67: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

Another mixed membership block model

Page 68: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

Sparse network model vs spectral clustering

Page 69: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.

Hybrid models
Balasubramanyan & Cohen, BioNLP 2012

Page 70: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.
Page 71: Techniques for Dimensionality Reduction Topic Models/LDA Latent Dirichlet Allocation.