Techniques for Dimensionality Reduction: Topic Models / LDA (Latent Dirichlet Allocation)


Techniques for Dimensionality Reduction

Topic Models/LDA

Latent Dirichlet Allocation

Outline

• Recap: PCA
• Directed models for text
  – Naïve Bayes
  – unsupervised Naïve Bayes (mixtures of multinomials)
  – probabilistic latent semantic indexing (pLSI)
    • and connection to PCA, LSI, MF
  – latent Dirichlet allocation (LDA)
    • inference with Gibbs sampling
• LDA-like models for graphs etc.

Quick Recap….

PCA as MF

[Figure: a 1000 images × 10,000 pixels data matrix V is factored as the product of a 1000 × 2 matrix of per-image coordinates (x_i, y_i) and a 2 × 10,000 matrix whose two rows are the prototypes PC1 and PC2]

V[i,j] = pixel j in image i

[Figure: the same factorization read row by row: image 1's coordinates (1.4, 0.5) mean that 1.4*PC1 + 0.5*PC2 ≈ image 1]
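To make the matrix-factorization picture concrete, here is a minimal sketch in numpy (the image matrix is simulated random data and the sizes are scaled down; variable names are illustrative, not from the slides): it centers the data, takes the top-2 principal directions via SVD, and reconstructs each image as mean + x_i*PC1 + y_i*PC2.

    import numpy as np

    # Toy stand-in for the images-x-pixels matrix V (scaled down, random data).
    rng = np.random.default_rng(0)
    V = rng.random((200, 1000))

    # Center each pixel (column), then take the top-2 right singular vectors.
    mean = V.mean(axis=0)
    Vc = V - mean
    U, S, Wt = np.linalg.svd(Vc, full_matrices=False)

    prototypes = Wt[:2]             # 2 x 1000: PC1 and PC2
    coords = Vc @ prototypes.T      # 200 x 2: coordinates (x_i, y_i) for each image

    # Rank-2 reconstruction: image i ~ mean + x_i*PC1 + y_i*PC2
    V_hat = mean + coords @ prototypes
    print(np.linalg.norm(V - V_hat) / np.linalg.norm(V))   # relative reconstruction error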

PCA for movie recommendation…

[Figure: an n users × m movies rating matrix V is factored as the product of an n × 2 matrix of per-user coordinates (x_i, y_i) and a 2 × m matrix of movie "prototypes"]

V[i,j] = user i's rating of movie j

V_Bob = Bob's row of the rating matrix

….. vs k-means

[Figure: the original data matrix X (n examples) is factored as X ≈ Z * M, where Z is an n × r matrix of 0/1 indicators for r clusters and M holds the r cluster means]

Question: what other generative models look like MF?
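As a companion sketch for the k-means picture (data and cluster count are made up, not from the slides): the one-hot indicator matrix Z times the matrix M of cluster means reconstructs each example by its cluster's mean, so k-means is also a (constrained) matrix factorization.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.random((100, 20))                  # n examples x m features

    r = 3
    km = KMeans(n_clusters=r, n_init=10, random_state=0).fit(X)

    # Z: n x r one-hot cluster indicators; M: r x m cluster means.
    Z = np.eye(r)[km.labels_]
    M = km.cluster_centers_

    # Each row of Z @ M is the mean of the cluster that example belongs to.
    X_hat = Z @ M
    print(np.linalg.norm(X - X_hat) ** 2)      # within-cluster sum of squares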

LDA AND OTHER DIRECTED MODELS FOR MODELING TEXT

Supervised Multinomial Naïve Bayes

• Naïve Bayes Model: compact (plate) notation

[Figure: the explicit model, a class node C pointing to word nodes W1 W2 W3 ….. WN and repeated for M documents, is redrawn as a plate diagram: C and a single W node inside a plate of size N, inside a plate of size M; a second version adds a plate of size K for the per-class parameters]

Supervised Naïve Bayes

• Multinomial Naïve Bayes

[Plate diagram: class node C, word nodes W1 W2 W3 ….. WN, repeated for M documents, with K classes]

• For each class i = 1..K
  • Construct a multinomial β_i over words
• For each document d = 1,…,M
  • Generate C_d ~ Mult(. | π)
  • For each position n = 1,…,N_d
    • Generate w_n ~ Mult(. | β, C_d) … or if you prefer, w_n ~ Pr(w | C_d)
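A minimal sketch of that generative story in Python; the vocabulary, class prior π, per-class multinomials β_i, and document lengths are all made-up toy choices, not values from the slides.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["apple", "ball", "cat", "dog"]
    K, M = 2, 5                                   # classes, documents

    pi = np.array([0.6, 0.4])                     # class prior pi
    beta = rng.dirichlet(np.ones(len(vocab)), K)  # beta[i] = word multinomial for class i

    for d in range(M):
        C_d = rng.choice(K, p=pi)                 # C_d ~ Mult(. | pi)
        N_d = rng.poisson(6) + 1                  # document length (arbitrary choice)
        words = rng.choice(vocab, size=N_d, p=beta[C_d])   # w_n ~ Mult(. | beta, C_d)
        print(f"doc {d} (class {C_d}):", " ".join(words))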

Unsupervised Naïve Bayes

• Mixture model: unsupervised naïve Bayes

[Plate diagram: the observed class node C becomes a hidden node Z; Z and W sit inside plates of size N and M, with K mixture components]

• Joint probability of words and classes: Pr(Z_d = k, w_1, …, w_{N_d}) = π_k ∏_n β_{k, w_n}
• But classes are not visible, so we fit the marginal: Pr(w_1, …, w_{N_d}) = Σ_k π_k ∏_n β_{k, w_n}
• Solved using EM
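A compact sketch of EM for this mixture of multinomials, assuming bag-of-words count vectors as input; the toy data, smoothing constant, and iteration count are arbitrary choices.

    import numpy as np

    def em_mixture_of_multinomials(X, K, iters=50, seed=0):
        """X: n_docs x vocab matrix of word counts. Returns (pi, beta, responsibilities)."""
        rng = np.random.default_rng(seed)
        n, V = X.shape
        pi = np.full(K, 1.0 / K)
        beta = rng.dirichlet(np.ones(V), K)           # K x V word probabilities per component
        for _ in range(iters):
            # E-step: responsibilities r[d,k] = Pr(Z_d = k | words in d)
            log_r = np.log(pi) + X @ np.log(beta).T   # n x K unnormalized log posterior
            log_r -= log_r.max(axis=1, keepdims=True)
            r = np.exp(log_r)
            r /= r.sum(axis=1, keepdims=True)
            # M-step: re-estimate pi and beta from soft counts (small smoothing added)
            pi = r.mean(axis=0)
            beta = r.T @ X + 1e-3
            beta /= beta.sum(axis=1, keepdims=True)
        return pi, beta, r

    X = np.random.default_rng(1).poisson(1.0, size=(30, 12))   # toy count data
    pi, beta, resp = em_mixture_of_multinomials(X, K=3)
    print(resp.argmax(axis=1))                                  # hard cluster of each doc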

Beyond Naïve Bayes - Probabilistic Latent Semantic Indexing (PLSI)

• Every document is a mixture of topics

[Plate diagram: topic node Z and word node W inside plates of size N and M, with K topic multinomials]

• For i = 1…K:
  • Let β_i be a multinomial over words
• For each document d:
  • Let θ_d be a distribution over {1,…,K}
  • For each word position in d:
    • Pick a topic z from θ_d
    • Pick a word w from β_z
• Compare to unsupervised NB / mixture of multinomials: here the latent variable is attached to each word, not to each document
• Can still learn with EM

LATENT SEMANTIC INDEXING VS SINGULAR VALUE DECOMPOSITION VS PRINCIPAL COMPONENTS ANALYSIS

PCA as MF

[Figure: as before, the 1000 images × 10,000 pixels matrix V (V[i,j] = pixel j in image i) is factored into per-image coordinates times the two prototypes PC1 and PC2]

Remember C_X, the covariance of the features of the original data matrix? We can also look at C_Z and see what its covariance is. It turns out that C_Z[i,j] = 0 for i ≠ j and C_Z[i,i] = λ_i.

Background: PCA vs LSI

• Some connections
  – PCA is closely related to singular value decomposition (SVD): see handout
    • we factored the data matrix X into Z*V, where V holds the eigenvectors of C_X
    • one more step: factor Z into U*Σ, where Σ is diagonal, Σ(i,i) = sqrt(λ_i), and the variables in U have unit variance
    • then X = U*Σ*V, and this is called SVD
  – When X is a term-document matrix, this is called latent semantic indexing (LSI)
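A sketch of LSI on a toy term-document matrix, using scikit-learn's CountVectorizer and TruncatedSVD; the corpus and the choice of 2 latent dimensions are illustrative, not from the slides.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = [
        "topic models for text",
        "latent dirichlet allocation topic model",
        "principal components for images",
        "singular value decomposition of matrices",
    ]

    # X[i, j] = frequency of word j in document i (the V of the slides).
    X = CountVectorizer().fit_transform(docs)

    # LSI: factor X ~ U * Sigma * V^T and keep only 2 latent dimensions.
    svd = TruncatedSVD(n_components=2, random_state=0)
    doc_coords = svd.fit_transform(X)        # documents in the 2-D latent space
    word_coords = svd.components_.T          # words in the same latent space
    print(doc_coords.round(2))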

LSI

[Figure: an n documents × m words matrix V is factored as the product of an n × 2 matrix of document coordinates and a 2 × m matrix of word prototypes]

V[i,j] = freq of word j in document i

LSI

http://technowiki.wordpress.com/2011/08/27/latent-semantic-analysis-lsa-tutorial/

Beyond Naïve Bayes - Probabilistic Latent Semantic Indexing (PLSI)

• Every document is a mixture of topics: for each of K topics, a multinomial β_i over words; for each document d, a distribution θ_d over {1,…,K}; each word position picks a topic z from θ_d and then a word from β_z.
• The latent variable is attached to each word, not to each document (compare to unsupervised NB / mixture of multinomials).

Beyond Naïve Bayes - Probabilistic Latent Semantic Indexing (PLSI)

• For i = 1…K:
  • Let β_i be a multinomial over words
• For each document d:
  • Let θ_d be a distribution over {1,…,K}
  • For each word position in d:
    • Pick a topic z from θ_d
    • Pick a word w from β_z

[Figure: the PLSI parameters laid out as matrices: a K × m (words) matrix whose rows β_1, β_2, … are the topic-word multinomials, an n (documents) × K matrix whose rows are the per-document topic distributions, and per-document weights π_1, π_2, …, π_n]

LSI or pLSI

[Figure: the same matrix picture either way: an n documents × m words matrix V factored into document coordinates times word prototypes/topics]

V[i,j] = freq of word j in document i

PLSI and MF and Nonnegative MF

• So far we minimized the L2 reconstruction error of V ≈ C*H.
• Another choice: constrain C and H to be non-negative and minimize a divergence-based objective J_NMF (see the objective sketched below).
• Ding et al.: the objective functions for NMF and PLSI are equivalent.
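For reference, the divergence-style NMF objective that Ding et al. relate to PLSI is usually written as the generalized KL (I-) divergence between V and the product C*H; this is a sketch of that standard form, under the assumption that it matches the J_NMF on the slide:

    J_{\mathrm{NMF}}(C,H) \;=\; \sum_{i,j}\Big( V_{ij}\,\log\frac{V_{ij}}{(CH)_{ij}} \;-\; V_{ij} \;+\; (CH)_{ij} \Big), \qquad C \ge 0,\; H \ge 0 .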

Beyond Naïve Bayes - Probabilistic Latent Semantic Indexing (PLSI)

• Every document is a mixture of topics (the same generative story as above: topic multinomials β_i, per-document topic distributions θ_d)
• Turns out to be hard to fit:
  • Lots of parameters!
  • Also: only applies to the training data

LATENT DIRICHLET ALLOCATION (LDA)

The LDA Topic Model

LDA

• Motivation

[Plate diagram: word node w inside a plate of size N, inside a plate of size M]

Assumptions: 1) documents are i.i.d.; 2) within a document, words are i.i.d. (bag of words)

• For each document d = 1,…,M
  • Generate θ_d ~ D1(…)
  • For each word n = 1,…,N_d
    • Generate w_n ~ D2(. | θ_{d,n})

Now pick your favorite distributions for D1, D2.

• Latent Dirichlet Allocation

[Plate diagram: α → θ_d → z_n → w_n, with the topic multinomials β_k; plates of size N (word positions), M (documents), and K (topics)]

• For each document d = 1,…,M
  • Generate θ_d ~ Dir(. | α)
  • For each position n = 1,…,N_d
    • Generate z_n ~ Mult(. | θ_d)
    • Generate w_n ~ Mult(. | β_{z_n})

"Mixed membership"

Collapsing out θ_d, the topic assignments within a document are exchangeable, with predictive distribution

    Pr(z_n = j | z_1, z_2, …, z_{n-1}) = (n_j + α_j) / (n − 1 + Σ_k α_k)

where n_j is the number of earlier positions assigned to topic j.
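A minimal sketch of that generative process in Python; the number of topics, vocabulary size, α, and the topic multinomials β are toy choices, not values from the slides.

    import numpy as np

    rng = np.random.default_rng(0)
    K, V, M = 3, 10, 4                        # topics, vocabulary size, documents
    alpha = np.full(K, 0.1)                   # Dirichlet prior on theta_d
    beta = rng.dirichlet(np.ones(V), K)       # beta[k] = word multinomial for topic k

    for d in range(M):
        theta_d = rng.dirichlet(alpha)        # theta_d ~ Dir(. | alpha)
        N_d = rng.poisson(8) + 1
        z = rng.choice(K, size=N_d, p=theta_d)         # z_n ~ Mult(. | theta_d)
        w = [rng.choice(V, p=beta[k]) for k in z]      # w_n ~ Mult(. | beta_{z_n})
        print(f"doc {d}: topics {z.tolist()}, word ids {w}")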

• LDA’s view of a document

• LDA topics

LDA used as dimension reduction for classification: 50 topics vs. all words, with an SVM.

LDA used for collaborative filtering (CF): users rating 100+ movies only.

LDA

• Latent Dirichlet Allocation - parameter learning:
  • Variational EM
    – Numerical approximation using lower-bounds
    – Results in biased solutions
    – Convergence has numerical guarantees
  • Gibbs Sampling
    – Stochastic simulation
    – Unbiased solutions
    – Stochastic convergence

LDA

• Gibbs sampling - works for any directed model!
  – Applicable when the joint distribution is hard to evaluate but the conditional distributions are known
  – The sequence of samples comprises a Markov chain
  – The stationary distribution of the chain is the joint distribution

Key capability: estimate the distribution of one latent variable given the other latent variables and the observed variables.
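As a generic illustration of that capability (not LDA-specific, and not from the slides): a Gibbs sampler for a bivariate Gaussian, where each full conditional Pr(x | y) and Pr(y | x) is a known one-dimensional Gaussian, so alternating the two conditional draws produces samples from the joint.

    import numpy as np

    # Target: (x, y) ~ N(0, [[1, rho], [rho, 1]]).  Each full conditional is a
    # 1-D Gaussian, so we can alternate sampling x | y and y | x.
    rho = 0.8
    rng = np.random.default_rng(0)

    x, y = 0.0, 0.0
    samples = []
    for t in range(20_000):
        x = rng.normal(rho * y, np.sqrt(1 - rho**2))    # x | y
        y = rng.normal(rho * x, np.sqrt(1 - rho**2))    # y | x
        if t > 1_000:                                   # discard burn-in
            samples.append((x, y))

    samples = np.array(samples)
    print(np.corrcoef(samples.T)[0, 1])   # should be close to rho = 0.8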

Why does Gibbs sampling work?

• What's the fixed point?
  – The stationary distribution of the chain is the joint distribution
• When will it converge (in the limit)?
  – If the graph defined by the chain is connected
• How long will it take to converge?
  – Depends on the second eigenvector of that graph

Called “collapsed Gibbs sampling” since you’ve marginalized away some variables

From: Parameter estimation for text analysis - Gregor Heinrich

LDA

• Latent Dirichlet Allocation

[Plate diagram: z and w inside plates of size N, M, and K, as before]

• Randomly initialize each z_{m,n}
• Repeat for t = 1,…
  • For each doc m, word n
    • Find Pr(z_{m,n} = k | the other z's)
    • Sample z_{m,n} according to that distribution

"Mixed membership"

EVEN MORE DETAIL ON LDA…

Way way more detail

More detail

What gets learned…..

In A Math-ier Notation

• N[d,k] = number of word positions in document d assigned to topic k
• N[*,k] = total number of word positions assigned to topic k
• W[w,k] = number of times word w is assigned to topic k
• N[*,*] = V

Initialization: for each document d and word position j in d
  • z[d,j] = k, a random topic
  • N[d,k]++
  • W[w,k]++, where w = id of the j-th word in d

Then, for each pass t = 1, 2, ….: for each document d and word position j in d
  • z[d,j] = k, a new topic sampled from Pr(z[d,j] = k | the other z's)
  • update N and W to reflect the new assignment of z:
    • N[d,k]++; N[d,k'] -- where k' is the old z[d,j]
    • W[w,k]++; W[w,k'] -- where w is w[d,j]
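A compact sketch of that collapsed Gibbs loop in Python, using the count tables N[d,k], W[w,k], and the per-topic totals. The toy corpus and the hyperparameters α and η (symmetric Dirichlet smoothing on topics and words) are illustrative choices, and the update Pr(z[d,j]=k | other z's) ∝ (N[d,k]+α)(W[w,k]+η)/(N[*,k]+Vη) is the standard collapsed form rather than a formula copied from the slides.

    import numpy as np

    def collapsed_gibbs_lda(docs, K, V, alpha=0.1, eta=0.01, passes=200, seed=0):
        """docs: list of lists of word ids in 0..V-1. Returns (z, Nd, Ww)."""
        rng = np.random.default_rng(seed)
        Nd = np.zeros((len(docs), K))           # N[d,k]: words in doc d assigned topic k
        Ww = np.zeros((V, K))                   # W[w,k]: times word w assigned topic k
        Nk = np.zeros(K)                        # N[*,k]: total words assigned topic k
        z = [rng.integers(K, size=len(doc)) for doc in docs]   # random initialization
        for d, doc in enumerate(docs):          # build the count tables
            for j, w in enumerate(doc):
                Nd[d, z[d][j]] += 1; Ww[w, z[d][j]] += 1; Nk[z[d][j]] += 1
        for _ in range(passes):
            for d, doc in enumerate(docs):
                for j, w in enumerate(doc):
                    k_old = z[d][j]             # remove current assignment from the counts
                    Nd[d, k_old] -= 1; Ww[w, k_old] -= 1; Nk[k_old] -= 1
                    # Pr(z[d,j] = k | all other z's), up to a constant
                    p = (Nd[d] + alpha) * (Ww[w] + eta) / (Nk + V * eta)
                    k_new = rng.choice(K, p=p / p.sum())
                    z[d][j] = k_new             # add the new assignment back
                    Nd[d, k_new] += 1; Ww[w, k_new] += 1; Nk[k_new] += 1
        return z, Nd, Ww

    docs = [[0, 1, 1, 2], [2, 3, 3, 4], [0, 1, 4, 4]]   # toy word-id documents
    z, Nd, Ww = collapsed_gibbs_lda(docs, K=2, V=5)
    print(Nd)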

Some comments on LDA

• Very widely used model
• Also a component of many other models

Techniques for Modeling Graphs - Latent Dirichlet Allocation++

Review - LDA

• Motivation

[Plate diagram: word node w inside a plate of size N, inside a plate of size M]

Assumptions: 1) documents are i.i.d.; 2) within a document, words are i.i.d. (bag of words)

• For each document d = 1,…,M
  • Generate θ_d ~ D1(…)
  • For each word n = 1,…,N_d
    • Generate w_n ~ D2(. | θ_{d,n})

Docs and words are exchangeable.

A Matrix Can Also Represent a Graph

[Figure: an example graph on nodes A-J and its adjacency matrix, with entry (i,j) = 1 when there is an edge between node i and node j]

Karate club

Schoolkids

College football

citations

Stochastic Block Models

• "Stochastic block model", aka "block-stochastic matrix":
  – Draw n_i nodes in block i
  – With probability p_ij, connect pairs (u,v) where u is in block i and v is in block j
  – Special, simple case: p_ii = q_i, and p_ij = s for all i ≠ j
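A sketch of sampling a graph from the special, simple case just described; the block sizes n_i, within-block probabilities q_i, and between-block probability s are toy values, not from the slides.

    import numpy as np

    rng = np.random.default_rng(0)
    n = [10, 10, 15]                       # n_i nodes in block i
    q = [0.6, 0.5, 0.7]                    # p_ii = q_i (within-block edge probability)
    s = 0.05                               # p_ij = s for i != j (between-block probability)

    blocks = np.repeat(np.arange(len(n)), n)           # block id of each node
    P = np.where(blocks[:, None] == blocks[None, :],   # pairwise edge probabilities
                 np.array(q)[blocks][:, None], s)
    A = (rng.random(P.shape) < P).astype(int)          # sample the adjacency matrix
    A = np.triu(A, 1); A = A + A.T                     # symmetric, no self-loops
    print(A.sum() // 2, "edges among", len(blocks), "nodes")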

Stochastic block models: assume 1) nodes within a block z and 2) edges between blocks z_p, z_q are exchangeable.

[Plate diagram: each node p has a block assignment z_p drawn from π (plate of size N); each pair of nodes (p,q) has an edge indicator a_pq whose probability depends on the pair of blocks (z_p, z_q) (plate of size N²)]

Gibbs sampling:

• Randomly initialize z_p for each node p.
• For t = 1…
  • For each node p
    • Compute the distribution of z_p given the other z's
    • Sample z_p from it (a sketch follows the reference below)

See: Snijders & Nowicki, 1997, Estimation and Prediction for Stochastic Blockmodels for Graphs with Latent Block Structure
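A sketch of one Gibbs sweep for the (non-collapsed) block model, assuming the between-block edge probabilities p[k,l] are held fixed while the block assignments z_p are resampled; the adjacency matrix, prior, and p are toy values, not the estimator from Snijders & Nowicki.

    import numpy as np

    def gibbs_pass_sbm(A, z, p, prior, rng):
        """One Gibbs sweep over node block assignments z, given edge probs p[k,l]."""
        N, K = len(z), len(prior)
        for u in range(N):
            logp = np.log(prior).copy()          # log Pr(z_u = k) before seeing edges
            for k in range(K):
                pk = np.delete(p[k, z], u)       # edge prob from block k to every other node
                a = np.delete(A[u], u)           # u's edges to every other node
                logp[k] += np.sum(a * np.log(pk) + (1 - a) * np.log(1 - pk))
            probs = np.exp(logp - logp.max())
            z[u] = rng.choice(K, p=probs / probs.sum())   # sample z_u given the other z's
        return z

    # Toy usage: 2 blocks, dense diagonal / sparse off-diagonal edge probabilities.
    rng = np.random.default_rng(0)
    A = (rng.random((20, 20)) < 0.2).astype(int); A = np.triu(A, 1); A = A + A.T
    p = np.array([[0.6, 0.05], [0.05, 0.6]])
    z = rng.integers(2, size=20)
    z = gibbs_pass_sbm(A, z, p, prior=np.array([0.5, 0.5]), rng=rng)
    print(z)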

Mixed Membership Stochastic Block Models

[Plate diagram: each node p has a membership vector π_p (plate of size N); for each pair (p,q), per-pair block indicators z_{p·} and z_{·q} drawn from π_p and π_q determine the edge indicator a_pq (plate of size N²)]

Airoldi et al., JMLR 2008

Another mixed membership block model

• z = (z_i, z_j) is a pair of block ids
• n_z = # pairs z
• q_{z1,i} = # links to i from block z1
• q_{z1,.} = # outlinks in block z1
• δ = indicator for diagonal
• M = # nodes

Another mixed membership block model

Sparse network model vs spectral clustering

Hybrid models - Balasubramanyan & Cohen, BioNLP 2012