Learning Partially Observed Graphical Models
+ Variational EM

10-418 / 10-618 Machine Learning for Structured Data
Matt Gormley
Lecture 24
Nov. 18, 2019
Machine Learning Department, School of Computer Science, Carnegie Mellon University
Q&A

Q: How does the reduction of MAP Inference to Variational Inference work again…?
A: Let’s look at an example…
Reminders
• Homework 4: Topic Modeling
  – Out: Wed, Nov. 6
  – Due: Mon, Nov. 18 at 11:59pm
• Homework 5: Variational Inference
  – Out: Mon, Nov. 18
  – Due: Mon, Nov. 25 at 11:59pm
• 618 Midway Poster
  – Submission: Thu, Nov. 21 at 11:59pm
  – Presentation: Fri, Nov. 22 or Mon, Nov. 25
MEAN FIELD VARIATIONAL INFERENCE
Variational Inference
Whiteboard:
– Background: KL Divergence
– Mean Field Variational Inference (overview)
– Evidence Lower Bound (ELBO)
– The ELBO’s relation to log p(x)
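The KL divergence and ELBO from the whiteboard can be checked numerically on a toy discrete example. This is a minimal sketch: the distributions and the pretend evidence value below are made up purely for illustration, and the assertion verifies the identity log p(x) = ELBO(q) + KL(q || p(z|x)).

```python
import numpy as np

def kl(q, p):
    """KL(q || p) for discrete distributions given as arrays summing to 1."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0  # terms with q(z) = 0 contribute 0 to the sum
    return float(np.sum(q[mask] * (np.log(q[mask]) - np.log(p[mask]))))

# Toy posterior p(z | x) over 3 states, and an approximation q:
p_z_given_x = np.array([0.7, 0.2, 0.1])
q = np.array([0.6, 0.3, 0.1])

# With joint p(x, z) = p(z | x) p(x), the ELBO satisfies
#   log p(x) = ELBO(q) + KL(q || p(z|x))  >=  ELBO(q).
log_p_x = np.log(0.25)                      # pretend evidence p(x) = 0.25
log_joint = np.log(p_z_given_x) + log_p_x   # log p(x, z)
elbo = float(np.sum(q * log_joint) - np.sum(q * np.log(q)))

assert abs(log_p_x - (elbo + kl(q, p_z_given_x))) < 1e-12
```

The gap between log p(x) and the ELBO is exactly the KL divergence, which is why maximizing the ELBO over q minimizes KL(q || p(z|x)).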
Variational Inference
Whiteboard:
– Mean Field Variational Inference (derivation)
– Algorithm Summary (CAVI)
– Example: Factor Graph with Discrete Variables
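The CAVI updates can be sketched on a tiny discrete factor graph. The two-variable model and its potentials below are hypothetical; each coordinate update sets one mean-field factor proportional to the exponentiated expected log potentials under the other factor.

```python
import numpy as np

# Hypothetical two-variable factor graph: p(x1, x2) ∝ f1[x1] f2[x2] F[x1, x2]
f1 = np.array([1.0, 2.0])                 # unary potential on x1
f2 = np.array([3.0, 1.0])                 # unary potential on x2
F = np.array([[4.0, 1.0],
              [1.0, 4.0]])                # pairwise potential (favors agreement)

def normalize(v):
    return v / v.sum()

# CAVI: coordinate ascent on the mean-field factors q1, q2
q1, q2 = np.full(2, 0.5), np.full(2, 0.5)
for _ in range(50):
    # q1(x1) ∝ exp( log f1[x1] + Σ_{x2} q2(x2) log F[x1, x2] )
    q1 = normalize(np.exp(np.log(f1) + np.log(F) @ q2))
    # q2(x2) ∝ exp( log f2[x2] + Σ_{x1} q1(x1) log F[x1, x2] )
    q2 = normalize(np.exp(np.log(f2) + np.log(F).T @ q1))

# Compare against the exact marginals (feasible here by brute force)
joint = normalize(f1[:, None] * f2[None, :] * F)
print("exact p(x1):", joint.sum(axis=1))
print("CAVI q1(x1):", q1)
```

Because q is restricted to a product q1(x1) q2(x2), the converged q1 only approximates the exact marginal; the gap is the price of the mean-field assumption.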
Variational Inference
Whiteboard:
– Example: two-variable factor graph
  • Iterated Conditional Modes
  • Gibbs Sampling
  • Mean Field Variational Inference
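All three methods above use the same full conditionals of a two-variable factor graph; they differ in what they do with them. A sketch of Gibbs sampling on a hypothetical model (the potentials are made up): each step resamples one variable from its conditional, whereas ICM would take the argmax of each conditional instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model: p(x1, x2) ∝ f1[x1] f2[x2] F[x1, x2]
f1 = np.array([1.0, 2.0])
f2 = np.array([3.0, 1.0])
F = np.array([[4.0, 1.0],
              [1.0, 4.0]])

def cond(f, Fslice):
    """Full conditional p(xi | x_other) ∝ f[xi] * (row/column of F)."""
    p = f * Fslice
    return p / p.sum()

x1, x2 = 0, 0
counts = np.zeros((2, 2))
for t in range(20000):
    x1 = rng.choice(2, p=cond(f1, F[:, x2]))   # resample x1 | x2
    x2 = rng.choice(2, p=cond(f2, F[x1, :]))   # resample x2 | x1
    if t >= 1000:                              # discard burn-in
        counts[x1, x2] += 1

est = counts / counts.sum()
exact = f1[:, None] * f2[None, :] * F
exact /= exact.sum()
# Gibbs samples approximate the exact joint; ICM would instead lock onto
# a single locally optimal state, and mean field would fit a product q1*q2.
```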
EXACT INFERENCE ON GRID CRF
An example of why we need approximate inference
Application: Pose Estimation
© Eric Xing @ CMU, 2005-2015
Feature Functions for CRF in Vision
Case Study: Image Segmentation
• Image segmentation (FG/BG) by modeling interactions between random variables
  – Images are noisy.
  – Objects occupy continuous regions in an image.

[Figure: input image; pixel-wise separate optimal labeling; locally-consistent joint optimal labeling. From Nowozin & Lampert (2012).]
Y* = argmax_{Y ∈ {0,1}^n} [ Σ_{i∈S} V_i(y_i, X) + Σ_{i∈S} Σ_{j∈N_i} V_{i,j}(y_i, y_j) ]

Y: labels; X: data (features); S: pixels; N_i: neighbors of pixel i
Unary term: V_i(y_i, X). Pairwise term: V_{i,j}(y_i, y_j).
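The objective above is easy to evaluate for any single labeling; it is the argmax over all 2^n labelings that is hard. A sketch of the score function, assuming a simple Potts-style pairwise term that rewards agreeing neighbors (the random unary scores and the weight w_pair are hypothetical):

```python
import numpy as np

def grid_score(y, unary, w_pair):
    """Score a binary labeling y (M x M) under a grid CRF:
    sum of unary terms V_i(y_i, X) plus pairwise terms over
    right/down neighbors (so each grid edge is counted once)."""
    M = y.shape[0]
    s = unary[np.arange(M)[:, None], np.arange(M)[None, :], y].sum()
    # Potts-style pairwise term: reward neighboring pixels that agree.
    s += w_pair * (y[:, :-1] == y[:, 1:]).sum()   # horizontal edges
    s += w_pair * (y[:-1, :] == y[1:, :]).sum()   # vertical edges
    return float(s)

rng = np.random.default_rng(0)
M = 4
unary = rng.normal(size=(M, M, 2))    # V_i(y_i, X) for y_i ∈ {0, 1}
y = rng.integers(0, 2, size=(M, M))   # one candidate labeling
print(grid_score(y, unary, w_pair=0.5))
```

Scoring one labeling is O(M^2); the next slides show why maximizing over all labelings with variable elimination blows up.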
Grid CRF
• Suppose we want to do image segmentation using a grid model
• What happens when we run variable elimination?

Assuming we divide into foreground / background, each factor is a table with 2^2 entries.
After a few elimination steps, the new factor has 2^5 entries.

For an M×M grid, the new factor has 2^M entries.
In general, for high-treewidth graphs like this, we turn to approximate inference (which we’ll cover soon!).
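The blow-up can be simulated symbolically: track only the scope of each factor during elimination and measure the largest intermediate table. This is a sketch assuming binary variables and a raster (row-by-row) elimination order; the function name is ours.

```python
def elimination_max_factor(M):
    """Simulate variable elimination on an M x M grid of binary variables
    (raster order) and return the largest intermediate factor's table size."""
    var = lambda r, c: r * M + c
    scopes = []                      # represent each factor by its scope
    for r in range(M):
        for c in range(M):
            if c + 1 < M:
                scopes.append(frozenset({var(r, c), var(r, c + 1)}))
            if r + 1 < M:
                scopes.append(frozenset({var(r, c), var(r + 1, c)}))
    largest = 0
    for v in range(M * M):           # eliminate variables in raster order
        touching = [s for s in scopes if v in s]
        scopes = [s for s in scopes if v not in s]
        new_scope = frozenset().union(*touching) - {v} if touching else frozenset()
        largest = max(largest, 2 ** len(new_scope))
        if new_scope:
            scopes.append(new_scope)
    return largest

for M in (2, 3, 4, 5):
    print(M, elimination_max_factor(M))
```

The largest intermediate factor grows as 2^M, matching the slide: eliminating a pixel couples the whole "frontier" of M not-yet-eliminated neighbors.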
VARIATIONAL INFERENCE RESULTS
Collapsed Variational Bayesian LDA
• Explicit Variational Inference

Latent Dirichlet allocation generative process:
For each topic k ∈ {1, …, K}:
  φ_k ~ Dir(β)  [draw distribution over words]
For each document m ∈ {1, …, M}:
  θ_m ~ Dir(α)  [draw distribution over topics]
  For each word n ∈ {1, …, N_m}:
    z_mn ~ Mult(1, θ_m)  [draw topic]
    x_mn ~ φ_{z_mn}  [draw word]

[Figure: LDA plate diagram: Dirichlet prior; document-specific topic distribution θ_m; topic assignment z_mn; observed word x_mn; topic Dirichlet φ_k. Approximate the posterior with q.]
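The generative process above can be run directly as a sampler. This is a sketch; the corpus sizes and symmetric Dirichlet hyperparameters below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, M, N = 3, 10, 5, 8     # topics, vocabulary size, documents, words/doc
alpha, beta = 0.5, 0.1       # symmetric Dirichlet hyperparameters

phi = rng.dirichlet(np.full(V, beta), size=K)     # φ_k: topic-word dists
docs = []
for m in range(M):
    theta_m = rng.dirichlet(np.full(K, alpha))    # θ_m: doc-topic dist
    z = rng.choice(K, size=N, p=theta_m)          # topic assignments z_mn
    x = np.array([rng.choice(V, p=phi[k]) for k in z])  # words x_mn
    docs.append(x)
```

Explicit variational inference keeps q factors for θ, φ, and z; collapsed variational inference (next slide) integrates θ and φ out analytically and only approximates the posterior over z.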
Collapsed Variational Bayesian LDA
• Collapsed Variational Inference

[Figure: the same LDA plate diagram, but with the topic Dirichlets and document-specific topic distributions integrated out; only the topic assignments z_mn are approximated with q.]
Collapsed Variational Bayesian LDA
• First row: test set per-word log probabilities as functions of the number of iterations, for VB, CVB, and Gibbs.
• Second row: histograms of final test set per-word log probabilities across 50 random initializations.
• Results shown for data from dailykos.com and from NeurIPS proceedings.
Slide from Teh et al. (2007)
Online Variational Bayes for LDA
[Figure: top panel: perplexity (600-900) on held-out documents vs. documents seen (log scale, roughly 10^3.5 to 10^6.5) for Batch 98K, Online 98K, and Online 3.3M. Bottom panel: the top eight words of one topic as the number of documents analyzed grows from 2,048 to 65,536, shifting from (systems, road, made, service, announced, national, west, language) toward (business, industry, service, companies, services, company, management, public).]
Figure 1: Top: Perplexity on held-out Wikipedia documents as a function of number of documents analyzed, i.e., the number of E steps. Online VB run on 3.3 million unique Wikipedia articles is compared with online VB run on 98,000 Wikipedia articles and with the batch algorithm run on the same 98,000 articles. The online algorithms converge much faster than the batch algorithm does. Bottom: Evolution of a topic about business as online LDA sees more and more documents.
to summarize the latent structure of massive document collections that cannot be annotated by hand. A central research problem for topic modeling is to efficiently fit models to larger corpora [4, 5].

To this end, we develop an online variational Bayes algorithm for latent Dirichlet allocation (LDA), one of the simplest topic models and one on which many others are based. Our algorithm is based on online stochastic optimization, which has been shown to produce good parameter estimates dramatically faster than batch algorithms on large datasets [6]. Online LDA handily analyzes massive collections of documents and, moreover, online LDA need not locally store or collect the documents; each can arrive in a stream and be discarded after one look.

In the subsequent sections, we derive online LDA and show that it converges to a stationary point of the variational objective function. We study the performance of online LDA in several ways, including by fitting a topic model to 3.3M articles from Wikipedia without looking at the same article twice. We show that online LDA finds topic models as good as or better than those found with batch VB, and in a fraction of the time (see Figure 1). Online variational Bayes is a practical new method for estimating the posterior of complex hierarchical Bayesian models.
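At the heart of such online stochastic optimization is a blending step: the global variational parameters move partway toward an estimate computed from the current minibatch, with a decaying step size. This is a minimal sketch; the per-minibatch E step that produces the estimate is omitted, and the function name and default values are our assumptions.

```python
import numpy as np

def online_blend(lam, lam_hat, t, tau0=1.0, kappa=0.7):
    """One online VB step: blend current topic parameters lam with the
    minibatch estimate lam_hat using step size rho_t = (tau0 + t)^(-kappa).
    kappa in (0.5, 1] is the usual condition for stochastic optimization
    step sizes to guarantee convergence."""
    rho = (tau0 + t) ** (-kappa)
    return (1.0 - rho) * lam + rho * lam_hat
```

Early steps (large rho) let the minibatch estimates dominate; later steps make smaller and smaller corrections, which is why the online runs in Figure 1 converge so much faster than batch VB at comparable final quality.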
2 Online variational Bayes for latent Dirichlet allocation
Latent Dirichlet Allocation (LDA) [7] is a Bayesian probabilistic model of text documents. It assumes a collection of K “topics.” Each topic defines a multinomial distribution over the vocabulary and is assumed to have been drawn from a Dirichlet, β_k ~ Dirichlet(η). Given the topics, LDA assumes the following generative process for each document d. First, draw a distribution over topics θ_d ~ Dirichlet(α). Then, for each word i in the document, draw a topic index z_di ∈ {1, …, K} from the topic weights, z_di ~ θ_d, and draw the observed word w_di from the selected topic, w_di ~ β_{z_di}. For simplicity, we assume symmetric priors on θ and β, but this assumption is easy to relax [8].

Note that if we sum over the topic assignments z, then we get p(w_di | θ_d, β) = Σ_k θ_dk β_k,w_di. This leads to the “multinomial PCA” interpretation of LDA; we can think of LDA as a probabilistic factorization of the matrix of word counts n (where n_dw is the number of times word w appears in document d) into a matrix of topic weights θ and a dictionary of topics β [9]. Our work can thus
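The "multinomial PCA" view is literally a matrix product: stacking the θ_d as rows and the β_k as rows gives per-document word distributions Θβ. A sketch with arbitrary random sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, V = 4, 3, 6                           # documents, topics, vocab size
theta = rng.dirichlet(np.ones(K), size=D)   # D x K matrix of topic weights
beta = rng.dirichlet(np.ones(V), size=K)    # K x V dictionary of topics

# p(w | d) = sum_k theta_dk * beta_kw, i.e. a low-rank factorization
p_w = theta @ beta
assert np.allclose(p_w.sum(axis=1), 1.0)    # each row is a distribution
```

Each row of Θβ is a valid distribution over the vocabulary because it is a convex combination of the topic rows of β.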
Figures from Hoffman et al. (2010)
Online Variational Bayes for LDA
Figures from Hoffman et al. (2010)
Fully-Connected CRF
Figures from Krähenbühl & Koltun (2011)
[Figure panels: Model | Results]
This is a fully connected graph!
Inference:
• Can do MCMC, but it is slow
• Instead, use Variational Inference
• Then filter some variables for a speed-up
Fully-Connected CRF
Figures from Krähenbühl & Koltun (2011)
[Figure panels: Model | Follow-up Work (combine with CNN)]
This is a fully connected graph!
Inference:
• Can do MCMC, but it is slow
• Instead, use Variational Inference
• Then filter some variables for a speed-up
Joint Parsing and Alignment with Weakly Synchronized Grammars
Figures from Burkett et al. (2010)
Figures from Burkett & Klein (ACL 2013 tutorial)
HIDDEN-STATE CRFS
Case Study: Object Recognition
Data consists of images x and labels y.
[Figure: example images labeled pigeon, leopard, llama, and rhinoceros.]
Case Study: Object Recognition
Data consists of images x and labels y (e.g. leopard).
• Preprocess data into “patches”
• Posit a latent labeling z describing the object’s parts (e.g. head, leg, tail, torso, grass)
• Define a graphical model with these latent variables in mind
• z is not observed at train or test time
Case Study: Object Recognition
[Figure: factor graph for this model: observed patches X1-X7, latent part labels Z1-Z7 joined by factors ψ, and the image label Y.]
Hidden-state CRFs

Data: D = {x^(n), y^(n)}_{n=1}^N

Joint model: p_θ(y, z | x) = (1 / Z(x, θ)) Π_α ψ_α(y_α, z_α, x)

Marginalized model: p_θ(y | x) = Σ_z p_θ(y, z | x)

[Figure: the factor graph from the previous slide, with factors ψ_α over patches X_i, latent part labels Z_i, and label Y.]
Hidden-state CRFs

Data: D = {x^(n), y^(n)}_{n=1}^N

Joint model: p_θ(y, z | x) = (1 / Z(x, θ)) Π_α ψ_α(y_α, z_α, x)
Marginalized model: p_θ(y | x) = Σ_z p_θ(y, z | x)

We can train using gradient-based methods (the values x are omitted below for clarity):

dℓ(θ|D)/dθ = Σ_{n=1}^N ( E_{z ~ p_θ(·|y^(n))}[f_j(y^(n), z)] − E_{y,z ~ p_θ(·,·)}[f_j(y, z)] )
           = Σ_{n=1}^N Σ_α ( Σ_{z_α} p_θ(z_α | y^(n)) f_{α,j}(y^(n)_α, z_α) − Σ_{y_α,z_α} p_θ(y_α, z_α) f_{α,j}(y_α, z_α) )

The first term requires inference on the clamped factor graph (with y fixed to y^(n)); the second requires inference on the full factor graph.
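On a tiny model, the marginalized likelihood and its gradient (clamped expectation minus full expectation) can be computed by brute-force enumeration. The label space, latent parts, and indicator features below are all hypothetical; in a real hidden-state CRF these expectations come from inference on the clamped and full factor graphs.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Tiny hidden-state CRF: label y ∈ {0,1}, latent parts z = (z1, z2) ∈ {0,1}^2.
# Hypothetical indicator features f_j(y, z):
def features(y, z):
    z1, z2 = z
    return np.array([y == z1, y == z2, z1 == z2], dtype=float)

theta = rng.normal(size=3)

def log_score(y, z):
    return theta @ features(y, z)   # log of the product of factor scores

z_space = list(itertools.product((0, 1), repeat=2))
states = [(y, z) for y in (0, 1) for z in z_space]

# Marginalized model p(y | x) = Σ_z p(y, z | x), by brute-force enumeration.
Z = sum(np.exp(log_score(y, z)) for y, z in states)
p_y = {y: sum(np.exp(log_score(y, z)) for z in z_space) / Z for y in (0, 1)}

# Gradient of log p(y_obs | x): clamped expectation minus full expectation.
y_obs = 1
w = np.array([np.exp(log_score(y_obs, z)) for z in z_space])
clamped = (w[:, None] * np.array([features(y_obs, z) for z in z_space])).sum(0) / w.sum()
full = sum(np.exp(log_score(y, z)) * features(y, z) for y, z in states) / Z
grad = clamped - full
```

A finite-difference check on log p(y_obs | x) confirms the clamped-minus-full form of the gradient.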
VARIATIONAL EM

Variational EM
Whiteboard:
– Example: Unsupervised POS Tagging
– Variational Bayes
– Variational EM