Learning Partially Observed Graphical Models
+ Variational EM

10-418 / 10-618 Machine Learning for Structured Data
Matt Gormley
Lecture 24
Nov. 18, 2019
Machine Learning Department, School of Computer Science, Carnegie Mellon University
Q&A

Q: How does the reduction of MAP Inference to Variational Inference work again…?
A: Let’s look at an example…
Reminders
• Homework 4: Topic Modeling
  – Out: Wed, Nov. 6
  – Due: Mon, Nov. 18 at 11:59pm
• Homework 5: Variational Inference
  – Out: Mon, Nov. 18
  – Due: Mon, Nov. 25 at 11:59pm
• 618 Midway Poster
  – Submission: Thu, Nov. 21 at 11:59pm
  – Presentation: Fri, Nov. 22 or Mon, Nov. 25
MEAN FIELD VARIATIONAL INFERENCE
Variational Inference
Whiteboard:
– Background: KL Divergence
– Mean Field Variational Inference (overview)
– Evidence Lower Bound (ELBO)
– The ELBO’s relation to log p(x)
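The KL divergence and ELBO from the whiteboard can be checked numerically on a toy discrete example. This is a minimal sketch: the distributions and the pretend evidence value below are made up purely for illustration, and the assertion verifies the identity log p(x) = ELBO(q) + KL(q || p(z|x)).

```python
import numpy as np

def kl(q, p):
    """KL(q || p) for discrete distributions given as arrays summing to 1."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0  # terms with q(z) = 0 contribute 0 to the sum
    return float(np.sum(q[mask] * (np.log(q[mask]) - np.log(p[mask]))))

# Toy posterior p(z | x) over 3 states, and an approximation q:
p_z_given_x = np.array([0.7, 0.2, 0.1])
q = np.array([0.6, 0.3, 0.1])

# With joint p(x, z) = p(z | x) p(x), the ELBO satisfies
#   log p(x) = ELBO(q) + KL(q || p(z|x))  >=  ELBO(q).
log_p_x = np.log(0.25)                      # pretend evidence p(x) = 0.25
log_joint = np.log(p_z_given_x) + log_p_x   # log p(x, z)
elbo = float(np.sum(q * log_joint) - np.sum(q * np.log(q)))

assert abs(log_p_x - (elbo + kl(q, p_z_given_x))) < 1e-12
```

The gap between log p(x) and the ELBO is exactly the KL divergence, which is why maximizing the ELBO over q minimizes KL(q || p(z|x)).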
Variational Inference
Whiteboard:
– Mean Field Variational Inference (derivation)
– Algorithm Summary (CAVI)
– Example: Factor Graph with Discrete Variables
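The CAVI updates can be sketched on a tiny discrete factor graph. The two-variable model and its potentials below are hypothetical; each coordinate update sets one mean-field factor proportional to the exponentiated expected log potentials under the other factor.

```python
import numpy as np

# Hypothetical two-variable factor graph: p(x1, x2) ∝ f1[x1] f2[x2] F[x1, x2]
f1 = np.array([1.0, 2.0])                 # unary potential on x1
f2 = np.array([3.0, 1.0])                 # unary potential on x2
F = np.array([[4.0, 1.0],
              [1.0, 4.0]])                # pairwise potential (favors agreement)

def normalize(v):
    return v / v.sum()

# CAVI: coordinate ascent on the mean-field factors q1, q2
q1, q2 = np.full(2, 0.5), np.full(2, 0.5)
for _ in range(50):
    # q1(x1) ∝ exp( log f1[x1] + Σ_{x2} q2(x2) log F[x1, x2] )
    q1 = normalize(np.exp(np.log(f1) + np.log(F) @ q2))
    # q2(x2) ∝ exp( log f2[x2] + Σ_{x1} q1(x1) log F[x1, x2] )
    q2 = normalize(np.exp(np.log(f2) + np.log(F).T @ q1))

# Compare against the exact marginals (feasible here by brute force)
joint = normalize(f1[:, None] * f2[None, :] * F)
print("exact p(x1):", joint.sum(axis=1))
print("CAVI q1(x1):", q1)
```

Because q is restricted to a product q1(x1) q2(x2), the converged q1 only approximates the exact marginal; the gap is the price of the mean-field assumption.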
Variational Inference
Whiteboard:
– Example: two-variable factor graph
  • Iterated Conditional Modes
  • Gibbs Sampling
  • Mean Field Variational Inference
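All three methods above use the same full conditionals of a two-variable factor graph; they differ in what they do with them. A sketch of Gibbs sampling on a hypothetical model (the potentials are made up): each step resamples one variable from its conditional, whereas ICM would take the argmax of each conditional instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model: p(x1, x2) ∝ f1[x1] f2[x2] F[x1, x2]
f1 = np.array([1.0, 2.0])
f2 = np.array([3.0, 1.0])
F = np.array([[4.0, 1.0],
              [1.0, 4.0]])

def cond(f, Fslice):
    """Full conditional p(xi | x_other) ∝ f[xi] * (row/column of F)."""
    p = f * Fslice
    return p / p.sum()

x1, x2 = 0, 0
counts = np.zeros((2, 2))
for t in range(20000):
    x1 = rng.choice(2, p=cond(f1, F[:, x2]))   # resample x1 | x2
    x2 = rng.choice(2, p=cond(f2, F[x1, :]))   # resample x2 | x1
    if t >= 1000:                              # discard burn-in
        counts[x1, x2] += 1

est = counts / counts.sum()
exact = f1[:, None] * f2[None, :] * F
exact /= exact.sum()
# Gibbs samples approximate the exact joint; ICM would instead lock onto
# a single locally optimal state, and mean field would fit a product q1*q2.
```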
EXACT INFERENCE ON GRID CRF
An example of why we need approximate inference
Application: Pose Estimation
© Eric Xing @ CMU, 2005-2015
Feature Functions for CRF in Vision
Case Study: Image Segmentation
• Image segmentation (FG/BG) by modeling interactions between random variables
  – Images are noisy.
  – Objects occupy continuous regions in an image.

[Figure: input image; pixel-wise separate optimal labeling; locally-consistent joint optimal labeling. From Nowozin & Lampert (2012).]
Y* = argmax_{Y ∈ {0,1}^n} [ Σ_{i∈S} V_i(y_i, X) + Σ_{i∈S} Σ_{j∈N_i} V_{i,j}(y_i, y_j) ]

Y: labels; X: data (features); S: pixels; N_i: neighbors of pixel i
Unary term: V_i(y_i, X). Pairwise term: V_{i,j}(y_i, y_j).
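The objective above is easy to evaluate for any single labeling; it is the argmax over all 2^n labelings that is hard. A sketch of the score function, assuming a simple Potts-style pairwise term that rewards agreeing neighbors (the random unary scores and the weight w_pair are hypothetical):

```python
import numpy as np

def grid_score(y, unary, w_pair):
    """Score a binary labeling y (M x M) under a grid CRF:
    sum of unary terms V_i(y_i, X) plus pairwise terms over
    right/down neighbors (so each grid edge is counted once)."""
    M = y.shape[0]
    s = unary[np.arange(M)[:, None], np.arange(M)[None, :], y].sum()
    # Potts-style pairwise term: reward neighboring pixels that agree.
    s += w_pair * (y[:, :-1] == y[:, 1:]).sum()   # horizontal edges
    s += w_pair * (y[:-1, :] == y[1:, :]).sum()   # vertical edges
    return float(s)

rng = np.random.default_rng(0)
M = 4
unary = rng.normal(size=(M, M, 2))    # V_i(y_i, X) for y_i ∈ {0, 1}
y = rng.integers(0, 2, size=(M, M))   # one candidate labeling
print(grid_score(y, unary, w_pair=0.5))
```

Scoring one labeling is O(M^2); the next slides show why maximizing over all labelings with variable elimination blows up.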
Grid CRF
• Suppose we want to do image segmentation using a grid model
• What happens when we run variable elimination?

Assuming we divide into foreground / background, each factor is a table with 2^2 entries.
After a few elimination steps, the new factor has 2^5 entries.

For an M×M grid, the new factor has 2^M entries.
In general, for high-treewidth graphs like this, we turn to approximate inference (which we’ll cover soon!).
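The blow-up can be simulated symbolically: track only the scope of each factor during elimination and measure the largest intermediate table. This is a sketch assuming binary variables and a raster (row-by-row) elimination order; the function name is ours.

```python
def elimination_max_factor(M):
    """Simulate variable elimination on an M x M grid of binary variables
    (raster order) and return the largest intermediate factor's table size."""
    var = lambda r, c: r * M + c
    scopes = []                      # represent each factor by its scope
    for r in range(M):
        for c in range(M):
            if c + 1 < M:
                scopes.append(frozenset({var(r, c), var(r, c + 1)}))
            if r + 1 < M:
                scopes.append(frozenset({var(r, c), var(r + 1, c)}))
    largest = 0
    for v in range(M * M):           # eliminate variables in raster order
        touching = [s for s in scopes if v in s]
        scopes = [s for s in scopes if v not in s]
        new_scope = frozenset().union(*touching) - {v} if touching else frozenset()
        largest = max(largest, 2 ** len(new_scope))
        if new_scope:
            scopes.append(new_scope)
    return largest

for M in (2, 3, 4, 5):
    print(M, elimination_max_factor(M))
```

The largest intermediate factor grows as 2^M, matching the slide: eliminating a pixel couples the whole "frontier" of M not-yet-eliminated neighbors.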
VARIATIONAL INFERENCE RESULTS
Collapsed Variational Bayesian LDA
• Explicit Variational Inference

Latent Dirichlet allocation generative process:
For each topic k ∈ {1, …, K}:
  φ_k ~ Dir(β)  [draw distribution over words]
For each document m ∈ {1, …, M}:
  θ_m ~ Dir(α)  [draw distribution over topics]
  For each word n ∈ {1, …, N_m}:
    z_mn ~ Mult(1, θ_m)  [draw topic]
    x_mn ~ φ_{z_mn}  [draw word]

[Figure: LDA plate diagram: Dirichlet prior; document-specific topic distribution θ_m; topic assignment z_mn; observed word x_mn; topic Dirichlet φ_k. Approximate the posterior with q.]
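The generative process above can be run directly as a sampler. This is a sketch; the corpus sizes and symmetric Dirichlet hyperparameters below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, M, N = 3, 10, 5, 8     # topics, vocabulary size, documents, words/doc
alpha, beta = 0.5, 0.1       # symmetric Dirichlet hyperparameters

phi = rng.dirichlet(np.full(V, beta), size=K)     # φ_k: topic-word dists
docs = []
for m in range(M):
    theta_m = rng.dirichlet(np.full(K, alpha))    # θ_m: doc-topic dist
    z = rng.choice(K, size=N, p=theta_m)          # topic assignments z_mn
    x = np.array([rng.choice(V, p=phi[k]) for k in z])  # words x_mn
    docs.append(x)
```

Explicit variational inference keeps q factors for θ, φ, and z; collapsed variational inference (next slide) integrates θ and φ out analytically and only approximates the posterior over z.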
Collapsed Variational Bayesian LDA
• Collapsed Variational Inference

[Figure: the same LDA plate diagram, but with the topic Dirichlets and document-specific topic distributions integrated out; only the topic assignments z_mn are approximated with q.]
Collapsed Variational Bayesian LDA
• First row: test set per-word log probabilities as functions of the number of iterations, for VB, CVB, and Gibbs.
• Second row: histograms of final test set per-word log probabilities across 50 random initializations.
• Results shown for data from dailykos.com and from NeurIPS proceedings.
Slide from Teh et al. (2007)
Online Variational Bayes for LDA
[Figure: top panel: perplexity (600-900) on held-out documents vs. documents seen (log scale, roughly 10^3.5 to 10^6.5) for Batch 98K, Online 98K, and Online 3.3M. Bottom panel: the top eight words of one topic as the number of documents analyzed grows from 2,048 to 65,536, shifting from (systems, road, made, service, announced, national, west, language) toward (business, industry, service, companies, services, company, management, public).]
Figure 1: Top: Perplexity on held-out Wikipedia documents as a function of number of documents analyzed, i.e., the number of E steps. Online VB run on 3.3 million unique Wikipedia articles is compared with online VB run on 98,000 Wikipedia articles and with the batch algorithm run on the same 98,000 articles. The online algorithms converge much faster than the batch algorithm does. Bottom: Evolution of a topic about business as online LDA sees more and more documents.
to summarize the latent structure of massive document collections that cannot be annotated by hand. A central research problem for topic modeling is to efficiently fit models to larger corpora [4, 5].

To this end, we develop an online variational Bayes algorithm for latent Dirichlet allocation (LDA), one of the simplest topic models and one on which many others are based. Our algorithm is based on online stochastic optimization, which has been shown to produce good parameter estimates dramatically faster than batch algorithms on large datasets [6]. Online LDA handily analyzes massive collections of documents and, moreover, online LDA need not locally store or collect the documents; each can arrive in a stream and be discarded after one look.

In the subsequent sections, we derive online LDA and show that it converges to a stationary point of the variational objective function. We study the performance of online LDA in several ways, including by fitting a topic model to 3.3M articles from Wikipedia without looking at the same article twice. We show that online LDA finds topic models as good as or better than those found with batch VB, and in a fraction of the time (see Figure 1). Online variational Bayes is a practical new method for estimating the posterior of complex hierarchical Bayesian models.
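At the heart of such online stochastic optimization is a blending step: the global variational parameters move partway toward an estimate computed from the current minibatch, with a decaying step size. This is a minimal sketch; the per-minibatch E step that produces the estimate is omitted, and the function name and default values are our assumptions.

```python
import numpy as np

def online_blend(lam, lam_hat, t, tau0=1.0, kappa=0.7):
    """One online VB step: blend current topic parameters lam with the
    minibatch estimate lam_hat using step size rho_t = (tau0 + t)^(-kappa).
    kappa in (0.5, 1] is the usual condition for stochastic optimization
    step sizes to guarantee convergence."""
    rho = (tau0 + t) ** (-kappa)
    return (1.0 - rho) * lam + rho * lam_hat
```

Early steps (large rho) let the minibatch estimates dominate; later steps make smaller and smaller corrections, which is why the online runs in Figure 1 converge so much faster than batch VB at comparable final quality.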
2 Online variational Bayes for latent Dirichlet allocation
Latent Dirichlet Allocation (LDA) [7] is a Bayesian probabilistic model of text documents. It assumes a collection of K “topics.” Each topic defines a multinomial distribution over the vocabulary and is assumed to have been drawn from a Dirichlet, β_k ~ Dirichlet(η). Given the topics, LDA assumes the following generative process for each document d. First, draw a distribution over topics θ_d ~ Dirichlet(α). Then, for each word i in the document, draw a topic index z_di ∈ {1, …, K} from the topic weights, z_di ~ θ_d, and draw the observed word w_di from the selected topic, w_di ~ β_{z_di}. For simplicity, we assume symmetric priors on θ and β, but this assumption is easy to relax [8].

Note that if we sum over the topic assignments z, then we get p(w_di | θ_d, β) = Σ_k θ_dk β_k,w_di. This leads to the “multinomial PCA” interpretation of LDA; we can think of LDA as a probabilistic factorization of the matrix of word counts n (where n_dw is the number of times word w appears in document d) into a matrix of topic weights θ and a dictionary of topics β [9]. Our work can thus
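The "multinomial PCA" view is literally a matrix product: stacking the θ_d as rows and the β_k as rows gives per-document word distributions Θβ. A sketch with arbitrary random sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, V = 4, 3, 6                           # documents, topics, vocab size
theta = rng.dirichlet(np.ones(K), size=D)   # D x K matrix of topic weights
beta = rng.dirichlet(np.ones(V), size=K)    # K x V dictionary of topics

# p(w | d) = sum_k theta_dk * beta_kw, i.e. a low-rank factorization
p_w = theta @ beta
assert np.allclose(p_w.sum(axis=1), 1.0)    # each row is a distribution
```

Each row of Θβ is a valid distribution over the vocabulary because it is a convex combination of the topic rows of β.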
Figures from Hoffman et al. (2010)
Online Variational Bayes for LDA
Figures from Hoffman et al. (2010)
Fully-Connected CRF
Figures from Krähenbühl & Koltun (2011)
[Figure panels: Model | Results]
This is a fully connected graph!
Inference:
• Can do MCMC, but it is slow
• Instead, use Variational Inference
• Then filter some variables for a speed-up
Fully-Connected CRF
Figures from Krähenbühl & Koltun (2011)
[Figure panels: Model | Follow-up Work (combine with CNN)]
This is a fully connected graph!
Inference:
• Can do MCMC, but it is slow
• Instead, use Variational Inference
• Then filter some variables for a speed-up
Joint Parsing and Alignment with Weakly Synchronized Grammars
Figures from Burkett et al. (2010)
Figures from Burkett & Klein (ACL 2013 tutorial)
HIDDEN-STATE CRFS
Case Study: Object Recognition
Data consists of images x and labels y.
[Figure: example images labeled pigeon, leopard, llama, and rhinoceros.]
Case Study: Object Recognition
Data consists of images x and labels y (e.g. leopard).
• Preprocess data into “patches”
• Posit a latent labeling z describing the object’s parts (e.g. head, leg, tail, torso, grass)
• Define a graphical model with these latent variables in mind
• z is not observed at train or test time
Case Study: Object Recognition
[Figure: factor graph for this model: observed patches X1-X7, latent part labels Z1-Z7 joined by factors ψ, and the image label Y.]
Hidden-state CRFs

Data: D = {x^(n), y^(n)}_{n=1}^N

Joint model: p_θ(y, z | x) = (1 / Z(x, θ)) Π_α ψ_α(y_α, z_α, x)

Marginalized model: p_θ(y | x) = Σ_z p_θ(y, z | x)

[Figure: the factor graph from the previous slide, with factors ψ_α over patches X_i, latent part labels Z_i, and label Y.]
Hidden-state CRFs

Data: D = {x^(n), y^(n)}_{n=1}^N

Joint model: p_θ(y, z | x) = (1 / Z(x, θ)) Π_α ψ_α(y_α, z_α, x)
Marginalized model: p_θ(y | x) = Σ_z p_θ(y, z | x)

We can train using gradient-based methods (the values x are omitted below for clarity):

dℓ(θ|D)/dθ = Σ_{n=1}^N ( E_{z ~ p_θ(·|y^(n))}[f_j(y^(n), z)] − E_{y,z ~ p_θ(·,·)}[f_j(y, z)] )
           = Σ_{n=1}^N Σ_α ( Σ_{z_α} p_θ(z_α | y^(n)) f_{α,j}(y^(n)_α, z_α) − Σ_{y_α,z_α} p_θ(y_α, z_α) f_{α,j}(y_α, z_α) )

The first term requires inference on the clamped factor graph (with y fixed to y^(n)); the second requires inference on the full factor graph.
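On a tiny model, the marginalized likelihood and its gradient (clamped expectation minus full expectation) can be computed by brute-force enumeration. The label space, latent parts, and indicator features below are all hypothetical; in a real hidden-state CRF these expectations come from inference on the clamped and full factor graphs.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Tiny hidden-state CRF: label y ∈ {0,1}, latent parts z = (z1, z2) ∈ {0,1}^2.
# Hypothetical indicator features f_j(y, z):
def features(y, z):
    z1, z2 = z
    return np.array([y == z1, y == z2, z1 == z2], dtype=float)

theta = rng.normal(size=3)

def log_score(y, z):
    return theta @ features(y, z)   # log of the product of factor scores

z_space = list(itertools.product((0, 1), repeat=2))
states = [(y, z) for y in (0, 1) for z in z_space]

# Marginalized model p(y | x) = Σ_z p(y, z | x), by brute-force enumeration.
Z = sum(np.exp(log_score(y, z)) for y, z in states)
p_y = {y: sum(np.exp(log_score(y, z)) for z in z_space) / Z for y in (0, 1)}

# Gradient of log p(y_obs | x): clamped expectation minus full expectation.
y_obs = 1
w = np.array([np.exp(log_score(y_obs, z)) for z in z_space])
clamped = (w[:, None] * np.array([features(y_obs, z) for z in z_space])).sum(0) / w.sum()
full = sum(np.exp(log_score(y, z)) * features(y, z) for y, z in states) / Z
grad = clamped - full
```

A finite-difference check on log p(y_obs | x) confirms the clamped-minus-full form of the gradient.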
VARIATIONAL EM

Variational EM
Whiteboard:
– Example: Unsupervised POS Tagging
– Variational Bayes
– Variational EM