


Latent variable models for discrete data

Jianfei Chen

Department of Computer Science and Technology, Tsinghua University, Beijing 100084

[email protected]

January 13, 2014

Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012. Chapter 27.

Introduction

We want to model three types of discrete data:

Sequence of tokens: $p(y_{i,1:L_i})$

Bag of words: $p(n_i)$

Discrete features: $p(y_{i,1:R})$
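As a concrete illustration of these three encodings, here is one short document over a toy 4-word vocabulary (a hypothetical example, not from the slides):

```python
# Toy vocabulary: 0=cat, 1=dog, 2=runs, 3=sleeps (hypothetical)
doc_tokens = [0, 2, 0, 3]   # sequence of tokens y_{i,1:L_i}, here L_i = 4
doc_bow = [2, 0, 1, 1]      # bag of words n_i: one count per vocabulary word
doc_features = [1, 0, 2]    # R = 3 discrete features y_{i,1:R}, one categorical value each
```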


Outline

Mixture Models

LSA / PLSI / LDA / GaP / NMF

LDA

Evaluation
Inference
Variants: CTM, DTM, LDA-HMM, SLDA, MedLDA, etc.

RBM


Mixture models

$p(y_i) = \sum_k p(y_i \mid q_i = k)\, p(q_i = k)$

Sequence of tokens: $p(y_{i,1:L_i} \mid q_i = k) = \prod_{l=1}^{L_i} \mathrm{Cat}(y_{il} \mid b_k)$

Discrete features: $p(y_{i,1:R} \mid q_i = k) = \prod_{r=1}^{R} \mathrm{Cat}(y_{ir} \mid b_k^{(r)})$

Bag of words (known $L_i$): $p(n_i \mid L_i, q_i = k) = \mathrm{Mu}(n_i \mid L_i, b_k)$

Bag of words (unknown $L_i$): $p(n_i \mid q_i = k) = \prod_{v=1}^{V} \mathrm{Poi}(n_{iv} \mid \lambda_{vk})$
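A minimal numerical sketch of the bag-of-words case (the mixture weights, cluster word distributions, and counts below are made up for illustration):

```python
import numpy as np

mix = np.array([0.6, 0.4])            # p(q_i = k), K = 2 clusters
B = np.array([[0.7, 0.2, 0.1],        # b_1: word distribution of cluster 1
              [0.1, 0.3, 0.6]])       # b_2: word distribution of cluster 2
n_i = np.array([3, 1, 0])             # observed counts over V = 3 words, L_i = 4

# p(n_i | L_i) = sum_k p(q_i = k) Mu(n_i | L_i, b_k); the multinomial
# coefficient is the same for every k, so it is dropped here.
per_cluster = np.prod(B ** n_i, axis=1)
p_ni = mix @ per_cluster
```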


Mixture models

Theorem

If $X_i \sim \mathrm{Poi}(\lambda_i)$ for all $i$, let $n = \sum_i X_i$; then

$p(X_1, \dots, X_K \mid n) = \mathrm{Mu}(X \mid n, \pi) \quad \text{where } \pi_i = \dfrac{\lambda_i}{\sum_k \lambda_k}.$
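The proof is one line once we use the fact that a sum of independent Poissons is itself Poisson with rate $\Lambda = \sum_k \lambda_k$:

```latex
p(X_1,\dots,X_K \mid n)
  = \frac{\prod_{i=1}^{K} e^{-\lambda_i} \lambda_i^{X_i} / X_i!}
         {e^{-\Lambda}\, \Lambda^{n} / n!}
  = \frac{n!}{\prod_i X_i!} \prod_{i=1}^{K} \Big(\frac{\lambda_i}{\Lambda}\Big)^{X_i}
  = \mathrm{Mu}(X \mid n, \pi), \qquad \pi_i = \frac{\lambda_i}{\Lambda}.
```

This is why the independent-Poisson bag-of-words model, conditioned on the document length, recovers the multinomial model.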


Exponential Family PCA

latent semantic analysis (LSA) / latent semantic indexing (LSI)

Sequence of tokens: $p(y_{i,1:L_i} \mid z_i) = \prod_{l=1}^{L_i} \mathrm{Cat}(y_{il} \mid S(W z_i))$

Discrete features: $p(y_{i,1:R} \mid z_i) = \prod_{r=1}^{R} \mathrm{Cat}(y_{ir} \mid S(W_r z_i))$

Bag of words (known $L_i$): $p(n_i \mid L_i, z_i) = \mathrm{Mu}(n_i \mid L_i, S(W z_i))$

Bag of words (unknown $L_i$): $p(n_i \mid z_i) = \prod_{v=1}^{V} \mathrm{Poi}(n_{iv} \mid \exp(w_{v,:} z_i))$

where $S(\cdot)$ is the softmax transformation, $z_i \in \mathbb{R}^K$, and $W, W_r \in \mathbb{R}^{V \times K}$.

Inference

coordinate ascent / degenerate EM (problem: overfitting?)

variational EM / MCMC
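A minimal sketch of the model's reconstruction step with toy sizes (all numbers assumed): degenerate EM would alternate between maximizing this log-likelihood over each $z_i$ and over $W$, and treating $z_i$ as a free point estimate per document is exactly where the overfitting concern comes from.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())   # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
V, K = 4, 2                       # toy vocabulary and latent sizes
W = rng.normal(size=(V, K))
z_i = rng.normal(size=K)

theta = softmax(W @ z_i)          # document word distribution S(W z_i)
n_i = np.array([3, 0, 2, 1])      # bag-of-words counts, L_i = 6
# log Mu(n_i | L_i, S(W z_i)) up to the constant multinomial coefficient
loglik = n_i @ np.log(theta)
```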


LSA / PLSI / LDA

Unigram: $p(y_{i,1:L_i} \mid q_i = k) = \prod_{l=1}^{L_i} \mathrm{Cat}(y_{il} \mid b_k)$

LSI: $p(y_{i,1:L_i} \mid z_i) = \prod_{l=1}^{L_i} \mathrm{Cat}(y_{il} \mid S(W z_i))$

PLSI: $p(y_{i,1:L_i} \mid \pi_i) = \prod_{l=1}^{L_i} \mathrm{Cat}(y_{il} \mid B \pi_i)$

LDA: $p(y_{i,1:L_i} \mid \pi_i) = \prod_{l=1}^{L_i} \mathrm{Cat}(y_{il} \mid B \pi_i)$, with $\pi_i \sim \mathrm{Dir}(\pi_i \mid \alpha)$

LDA for other data types

Bag of words: $p(n_i \mid L_i, \pi_i) = \mathrm{Mu}(n_i \mid L_i, B \pi_i)$

Discrete features: $p(y_{i,1:R} \mid \pi_i) = \prod_{r=1}^{R} \mathrm{Cat}(y_{ir} \mid B^{(r)} \pi_i)$

Question: What is the dual parameter? Why is it convenient?


Marlin, Benjamin M. "Modeling user rating profiles for collaborative filtering." Advances in Neural Information Processing Systems, 2003.

Gamma-Poisson Model

LDA

models $p(n_i \mid L_i, \pi_i) = \mathrm{Mu}(n_i \mid L_i, B \pi_i)$

Prior: $\pi_i \sim \mathrm{Dir}(\alpha)$

Constraints: $0 \le \pi_{ik}$, $\sum_k \pi_{ik} = 1$; $0 \le B_{vk}$, $\sum_v B_{vk} = 1$

GaP

models $p(n_i \mid z_i^+) = \prod_{v=1}^{V} \mathrm{Poi}(n_{iv} \mid b_{v,:}^\top z_i^+)$

Prior: $p(z_i^+) = \prod_k \mathrm{Ga}(z_{ik}^+ \mid \alpha_k, \beta_k)$

Constraints: $0 \le z_{ik}$, $0 \le B_{vk}$; can use a sparsity-inducing prior (27.17)

GaP has only non-negativity constraints
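A minimal generative sketch of GaP (toy sizes and hyperparameters are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
V, K = 5, 3
B = rng.random((V, K))                        # non-negative loadings, 0 <= B_vk
alpha, beta = 1.0, 1.0                        # Ga(alpha_k, beta_k) hyperparameters

z_pos = rng.gamma(alpha, 1.0 / beta, size=K)  # z_i^+ ~ prod_k Ga(alpha_k, beta_k)
rate = B @ z_pos                              # Poisson rates b_{v,:}^T z_i^+
n_i = rng.poisson(rate)                       # n_iv ~ Poi(b_{v,:}^T z_i^+)
```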


Non-negative matrix factorization

Given a non-negative matrix $V$, find non-negative matrix factors $W, H$ such that

$V \approx WH, \qquad V_{i,:} \approx \sum_k W_{ik} H_{k,:}$

Can be viewed as GaP when the prior has $\alpha_k = \beta_k = 0$.
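As a sketch of how the factors can be computed, here are Lee & Seung's multiplicative updates for the squared-error objective (the cited paper also derives a KL-divergence variant, which matches the Poisson likelihood view); sizes and iteration count are illustrative:

```python
import numpy as np

def nmf(V, K, iters=200, eps=1e-9, seed=0):
    """Multiplicative updates for V ~ W H under squared error."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, K))
    H = rng.random((K, m))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # H stays non-negative
        W *= (V @ H.T) / (W @ H @ H.T + eps)  # W stays non-negative
    return W, H
```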


Lee, D. D., and H. S. Seung. "Algorithms for non-negative matrix factorization." Advances in Neural Information Processing Systems.

Latent Dirichlet Allocation (LDA)

Notation

$\pi_i \mid \alpha \sim \mathrm{Dir}(\alpha)$ (1)

$q_{il} \mid \pi_i \sim \mathrm{Cat}(\pi_i)$ (2)

$b_k \mid \gamma \sim \mathrm{Dir}(\gamma)$ (3)

$y_{il} \mid q_{il} = k, B \sim \mathrm{Cat}(b_k)$ (4)
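A minimal sketch of sampling one document from (1)-(4), with toy sizes and hyperparameters assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, L_i = 3, 10, 20
alpha, gamma = 0.5, 0.1

B = rng.dirichlet(np.full(V, gamma), size=K)      # b_k ~ Dir(gamma)       (3)
pi_i = rng.dirichlet(np.full(K, alpha))           # pi_i ~ Dir(alpha)      (1)
q = rng.choice(K, size=L_i, p=pi_i)               # q_il ~ Cat(pi_i)       (2)
y = np.array([rng.choice(V, p=B[k]) for k in q])  # y_il ~ Cat(b_{q_il})   (4)
```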

Geometric interpretation

Simplex: handle ambiguity (?)

Unidentifiability: addressed by, e.g., Labeled LDA


D. Blei et al. "Latent Dirichlet allocation." JMLR.
G. Heinrich. "Parameter estimation for text analysis."
D. Ramage et al. "Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora." EMNLP.

http://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf

Evaluation: Perplexity

Perplexity of a language model $q$ given a language $p$ is defined as (both $p$ and $q$ are stochastic processes)

$\mathrm{perplexity}(p, q) = 2^{H(p,q)}$

where $H(p, q)$ is the cross-entropy

$H(p, q) = \lim_{N \to \infty} -\frac{1}{N} \sum_{y_{1:N}} p(y_{1:N}) \log q(y_{1:N})$

Approximations

N is finite

$p(y_{1:N}) = \delta_{y^*_{1:N}}(y_{1:N})$


Evaluation: Perplexity

$H(p, q) = -\frac{1}{N} \log q(y^*_{1:N})$

Intuition: weighted average branching factor

For the unigram model,

$H = -\frac{1}{N} \sum_{i=1}^{N} \frac{1}{L_i} \sum_{l=1}^{L_i} \log q(y^*_{il})$

For LDA

$H = -\frac{1}{N} \sum_{i=1}^{N} \log p(y^*_{i,1:L_i})$

Use variational evidence lower bound (ELBO)

Use annealed importance sampling

Use a validation set and a plug-in approximation
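A sketch of the resulting computation, assuming per-token base-2 log-probabilities $\log_2 q(y^*_{il})$ have already been obtained (e.g., via one of the approximations above); the averaging follows the per-document formula on this slide:

```python
import numpy as np

def perplexity(log2q_tokens, doc_lengths):
    """2**H, where H averages the per-document mean negative log2
    token probability over all N documents (slide's convention)."""
    H, start = 0.0, 0
    for L in doc_lengths:
        H -= np.mean(log2q_tokens[start:start + L])
        start += L
    return 2.0 ** (H / len(doc_lengths))
```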


H. Wallach et al. "Evaluation methods for topic models." ICML 2009.

Evaluation: Coherence

TODO


D. Newman et al. "Automatic evaluation of topic coherence." NAACL HLT 2010.

Inference

Exponential number of inference algorithms

Variational inference vs sampling vs both

Collapsed vs non-collapsed

Online vs stochastic vs offline

Empirical Bayes vs fully Bayes

Other algorithms: expectation propagation, etc.


Inference: towards large scale

algorithms

Online / stochastic
Sparsity
Spectral methods

system

Distributed: Yahoo-LDA, Petuum, Parameter-Server, etc.
GPU: BIDMach, etc.


Model Selection

Compute evidence with AIS / ELBO

Cross validation

Bayesian non-parametrics


Teh et al. "Hierarchical Dirichlet processes." Journal of the American Statistical Association (2006).

Extensions of LDA

Correlation: Correlated topic model

Time series: Dynamic topic model

Syntax: LDA-HMM

Supervision: many

1D categorical label: SLDA (generative), DLDA (discriminative), MedLDA (regularized)
nD label: MR-LDA, random effects mixture of experts, conditional topic random field, Dirichlet multinomial regression LDA
K labels per document: Labeled LDA
Labels per word: TagLDA

Structural: RTM


Restricted Boltzmann machines

$p(h, v \mid \theta) = \frac{1}{Z(\theta)} \prod_{r=1}^{R} \prod_{k=1}^{K} \psi_{rk}(v_r, h_k)$

where $h, v$ are binary vectors.

Factorized posterior:

$p(h \mid v, \theta) = \prod_k p(h_k \mid v, \theta)$

Advantage: symmetric; both posterior inference (backward) and generation (forward) are easy.

Exponential family harmonium (a harmonium is a 2-layer UGM)


Restricted Boltzmann machines

Binary latent and binary visible units (other models exist, see Table 27.2)

$p(v, h \mid \theta) = \frac{1}{Z(\theta)} \exp(-E(v, h; \theta))$ (5)

$E(v, h; \theta) = -v^\top W h$ (6)

$p(h \mid v, \theta) = \prod_k \mathrm{Ber}(h_k \mid \mathrm{sigm}(w_{:,k}^\top v))$ (7)

$p(v \mid h, \theta) = \prod_r \mathrm{Ber}(v_r \mid \mathrm{sigm}(w_{r,:} h))$ (8)
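A minimal sketch of Eqs. (7)-(8) as sampling routines, assuming $W \in \mathbb{R}^{R \times K}$ and 1-D binary arrays for $v$ and $h$:

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h(v, W, rng):
    """h ~ p(h | v): p(h_k = 1 | v) = sigm(w_{:,k}^T v), Eq. (7)."""
    p = sigm(v @ W)                                 # shape (K,)
    return (rng.random(p.shape) < p).astype(float), p

def sample_v(h, W, rng):
    """v ~ p(v | h): p(v_r = 1 | h) = sigm(w_{r,:} h), Eq. (8)."""
    p = sigm(W @ h)                                 # shape (R,)
    return (rng.random(p.shape) < p).astype(float), p
```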


Restricted Boltzmann machines

Goal: maximize $p(v \mid \theta)$

$\nabla_W \ell = \mathbb{E}_{p_{\mathrm{emp}}}[v h^\top] - \mathbb{E}_{p(\cdot \mid \theta)}[v h^\top]$
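The second expectation is intractable; a standard approximation (not named on the slide) is Hinton's contrastive divergence, which replaces it with statistics from a single Gibbs step started at the data. A sketch, reusing sample_h / sample_v and the numpy import from the block above:

```python
def cd1_step(v0, W, rng, lr=0.1):
    """One CD-1 update on W for a single binary visible vector v0."""
    h0, p_h0 = sample_h(v0, W, rng)   # positive phase: v clamped to data
    v1, _ = sample_v(h0, W, rng)      # one Gibbs step away from the data
    _, p_h1 = sample_h(v1, W, rng)    # negative phase statistics
    grad = np.outer(v0, p_h0) - np.outer(v1, p_h1)
    return W + lr * grad              # gradient ascent on log-likelihood
```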


Conclusions

Why there are many things to do

Exponential number of inference algorithms

Exponential number of models

Exponential × exponential number of solutions

Application, evaluation, theory (e.g. spectral), etc.

Need a way for information retrieval and data mining practitioners to find correct & fast solutions for their problems...
