Jun Luo, 02/2020

Modern Generative Models: Restricted Boltzmann Machines

Based on the presentation by Hung Chao: https://people.cs.pitt.edu/~milos/courses/cs3750/lectures/class22.pdf

• Unsupervised learning: use only the inputs for learning
  - Automatically extract meaningful features from the data
  - Leverage the availability of unlabeled data
  - Can use the negative log-likelihood to learn the underlying feature representation
• We will see two neural network models for unsupervised learning
  - Restricted Boltzmann Machines
  - Variational Autoencoders

RESTRICTED BOLTZMANN MACHINE


GENERATIVE MODELS

• Given training data, we want to generate new samples from the same distribution
• Training data ∼ pdata(x); generated samples ∼ pmodel(x)
• Want to learn pmodel(x) similar to pdata(x)

Figure source: CIFAR-10 dataset (Krizhevsky and Hinton, 2009)

GENERATIVE MODELS

• Why generative models?
• Realistic samples for artwork, super-resolution, colorization, etc.
• Generative models of time-series data can be used for simulation and planning (reinforcement learning applications!)
• Training generative models can also enable inference of latent representations that can be useful as general features

Figure source: Internet


RESTRICTED BOLTZMANN MACHINE

• Many interesting theoretical results about undirected models depend on the assumption that p̃(x) > 0 for all x. A convenient way to enforce this condition is to use an energy-based model, where

  p̃(x) = exp(−E(x))

• E(x) is known as the energy function
• Any distribution of this form is an example of a Boltzmann distribution. For this reason, many energy-based models are called Boltzmann machines.
• Normalized probability:

  p(x) = exp(−E(x)) / Z,  with  Z = Σ_x exp(−E(x))

• e.g., for an undirected graph over variables a, b, c, d, e, f with edges (a,b), (b,c), (a,d), (b,e), (e,f), the energy decomposes over the pairwise cliques:

  E(a, b, c, d, e, f) = E_{a,b}(a, b) + E_{b,c}(b, c) + E_{a,d}(a, d) + E_{b,e}(b, e) + E_{e,f}(e, f)

• Restricted Boltzmann machines (RBMs) are undirected probabilistic graphical models containing a layer of observable variables and a single layer of latent variables
• An RBM is a bipartite graph, with no connections permitted between any variables in the observed layer or between any units in the latent layer


RESTRICTED BOLTZMANN MACHINE

Structure (Markov network view): a layer of hidden units h fully connected to a layer of visible units x, with a factor on each edge.

Energy function:

  E(x, h) = −h^T W x − c^T h − b^T x

Distribution:

  p(x, h) = exp(−E(x, h)) / Z,  where Z = Σ_{x,h} exp(−E(x, h)) is the partition function (intractable)

The notation based on an energy function is simply an alternative to the representation as a product of factors.


RESTRICTED BOLTZMANN MACHINE

Markov network view (scalar form): hidden units h_1, h_2, ..., h_H and visible units x_1, x_2, ..., x_D, connected by pair-wise factors exp(W_{jk} h_j x_k) on the edges and unary factors exp(c_j h_j), exp(b_k x_k) on the nodes. The scalar visualization is more informative of the structure within the vectors.

INFERENCE

Conditional distributions (each layer factorizes given the other):

  p(h | x) = Π_j p(h_j | x),  with  p(h_j = 1 | x) = sigm(c_j + W_{j·} x)
  p(x | h) = Π_k p(x_k | h),  with  p(x_k = 1 | h) = sigm(b_k + h^T W_{·k})

where sigm denotes the logistic sigmoid.
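As a minimal NumPy sketch of these conditionals and one block-Gibbs step, assuming binary units and toy array shapes that are not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def p_h_given_x(x, W, c):
    # p(h_j = 1 | x) = sigm(c_j + W_{j.} x), computed for all j at once
    return sigmoid(c + W @ x)

def p_x_given_h(h, W, b):
    # p(x_k = 1 | h) = sigm(b_k + h^T W_{.k})
    return sigmoid(b + W.T @ h)

def gibbs_step(x, W, b, c):
    # One block-Gibbs step: sample h | x, then x | h
    h = rng.binomial(1, p_h_given_x(x, W, c))
    x_new = rng.binomial(1, p_x_given_h(h, W, b))
    return x_new, h

# Toy sizes: D visible units, H hidden units (illustrative only)
D, H = 6, 3
W = 0.1 * rng.standard_normal((H, D))
b, c = np.zeros(D), np.zeros(H)
x = rng.binomial(1, 0.5, size=D)
x_new, h = gibbs_step(x, W, b, c)
```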


FREE ENERGY

What about p(x)? Summing out the hidden units h gives

  p(x) = Σ_h p(x, h) = exp(−F(x)) / Z

where F(x) is the free energy:

  F(x) = −b^T x − Σ_j log(1 + exp(c_j + W_{j·} x))


FREE ENERGY

The terms of F(x) = −b^T x − Σ_j log(1 + exp(c_j + W_{j·} x)) can be read as:
• b^T x: biases the probability of x
• c_j: bias of each feature
• W_{j·} x: how much feature j is expected in x
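A small NumPy helper for this free energy, under the same illustrative shapes as the Gibbs sketch above (an assumption, not code from the slides):

```python
import numpy as np

def free_energy(x, W, b, c):
    # F(x) = -b^T x - sum_j log(1 + exp(c_j + W_{j.} x)), using logaddexp for stability
    return -b @ x - np.sum(np.logaddexp(0.0, c + W @ x))
```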

Training

Training objective: to train an RBM, we minimize the average negative log-likelihood (NLL)

  (1/T) Σ_t −log p(x^(t))

We'd like to proceed by stochastic gradient descent:

  ∂(−log p(x^(t)))/∂θ = E_h[ ∂E(x^(t), h)/∂θ | x^(t) ] − E_{x,h}[ ∂E(x, h)/∂θ ]
                         (positive phase)                 (negative phase)


Training

The negative phase expectation E_{x,h}[ ∂E(x, h)/∂θ ] is hard to compute: it requires summing over every configuration of x and h.

Contrastive Divergence (CD) (Hinton, Neural Computation, 2002)

Idea:
1. obtain a point x̃ by Gibbs sampling
2. start the sampling chain at x^(t)
3. replace the expectation by a point estimate at x̃

x̃ is often called the negative sample.


Contrastive Divergence (CD) (Hinton, Neural Computation, 2002)

Replace the model expectation by a point estimate at the negative sample x̃:

  ∂(−log p(x^(t)))/∂θ ≈ E_h[ ∂E(x^(t), h)/∂θ | x^(t) ] − E_h[ ∂E(x̃, h)/∂θ | x̃ ]

The resulting update pushes the probability of the training example x^(t) up and the probability of the negative sample x̃ down.


Parameter Update

Derivation of ∂E(x, h)/∂θ for θ = W:

  E(x, h) = −h^T W x − c^T h − b^T x  ⇒  ∂E(x, h)/∂W_{jk} = −h_j x_k


Parameter Update

If we define h(x) = sigm(c + W x), i.e. the vector with entries h(x)_j = p(h_j = 1 | x), then

  E_h[ ∂E(x, h)/∂W | x ] = −h(x) x^T

Update of W: given x^(t) and x̃, the learning rule for W becomes

  W ← W + α ( h(x^(t)) (x^(t))^T − h(x̃) x̃^T )

where α is the learning rate.


CD-K: PSEUDOCODE

Contrastive Divergence:
1. For each training sample x^(t):
   i. Generate a negative sample x̃ using k steps of Gibbs sampling, starting at x^(t)
   ii. Update parameters:
       W ← W + α ( h(x^(t)) (x^(t))^T − h(x̃) x̃^T )
       b ← b + α ( x^(t) − x̃ )
       c ← c + α ( h(x^(t)) − h(x̃) )
2. Go back to 1. until the stopping criterion is reached

Contrastive Divergence-k
• CD-k: contrastive divergence with k iterations of Gibbs sampling
• In general, the bigger k is, the less biased the estimate of the gradient will be
• In practice, k = 1 works well for pre-training
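Below is a minimal NumPy sketch of CD-1 for a binary RBM, following the update rules above; the array shapes, toy data, learning rate, and number of epochs are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def cd1_update(x, W, b, c, alpha=0.1):
    """One CD-1 update on a single binary training vector x."""
    # Positive phase: h(x)_j = p(h_j = 1 | x)
    h_data = sigmoid(c + W @ x)
    # One Gibbs step starting at x to get the negative sample x_tilde
    h_sample = (rng.random(h_data.shape) < h_data).astype(float)
    x_tilde = (rng.random(x.shape) < sigmoid(b + W.T @ h_sample)).astype(float)
    h_tilde = sigmoid(c + W @ x_tilde)
    # Parameter updates: positive minus negative statistics
    W += alpha * (np.outer(h_data, x) - np.outer(h_tilde, x_tilde))
    b += alpha * (x - x_tilde)
    c += alpha * (h_data - h_tilde)
    return W, b, c

# Toy training loop on random binary data (illustrative only)
D, H = 6, 3
X = rng.binomial(1, 0.5, size=(100, D)).astype(float)
W = 0.01 * rng.standard_normal((H, D))
b, c = np.zeros(D), np.zeros(H)
for epoch in range(5):
    for x in X:
        W, b, c = cd1_update(x, W, b, c)
```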


Software

• scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.BernoulliRBM.html
• pydbm: https://pypi.org/project/pydbm/
• Other self-implemented versions on GitHub
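For example, a short scikit-learn sketch using BernoulliRBM; the toy data and the hyperparameter values below are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.default_rng(0)
X = rng.binomial(1, 0.5, size=(200, 16)).astype(float)  # toy binary data

# n_components = number of hidden units; learning_rate and n_iter are illustrative
rbm = BernoulliRBM(n_components=8, learning_rate=0.05, n_iter=20, random_state=0)
rbm.fit(X)

H = rbm.transform(X)      # p(h_j = 1 | x) for each sample: the learned features
F = rbm.score_samples(X)  # pseudo-likelihood proxy, useful for monitoring training
```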

References

[1] Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Chapter 20.
[2] Hung Chao's CS 3750 (2018) slides: https://people.cs.pitt.edu/~milos/courses/cs3750/lectures/class22.pdf
[3] Hugo Larochelle. Neural networks class series, Université de Sherbrooke.
[4] Hinton, Geoffrey E. "Training products of experts by minimizing contrastive divergence." Neural Computation 14.8 (2002): 1771-1800.
[5] Tieleman, Tijmen. "Training restricted Boltzmann machines using approximations to the likelihood gradient." Proceedings of the 25th International Conference on Machine Learning, 2008.


Jun Luo, 02/2020

Modern Generative Models: Variational Autoencoders

GENERATIVE MODELS (recap)

• Given training data ∼ pdata(x), we want to generate new samples from the same distribution: learn pmodel(x) similar to pdata(x)

Figure source: CIFAR-10 dataset (Krizhevsky and Hinton, 2009)



Autoencoders (Recap)

• Unsupervised approach for learning a lower-dimensional feature representation from unlabeled training data: an encoder maps input data x to features z
• Encoder architectures: originally linear + nonlinearity (sigmoid); later deep, fully-connected; later ReLU CNNs
• z is usually smaller than x (dimensionality reduction): the feature vector is generally shorter than the input vector, which forces it to extract meaningful features


Autoencoders (Recap)

• How to learn this feature representation? Train such that the features can be used to reconstruct the original data: a decoder maps the features z back to reconstructed data x̂
• "Autoencoding" = encoding the input itself
• Example architecture: encoder = 4-layer conv, decoder = 4-layer upconv

Figure adapted from CS 231n


Autoencoders (Recap)

• Train such that the features can be used to reconstruct the original data, using an L2 loss between the input data x and the reconstructed data x̂: ||x − x̂||²
• Training does not need labels
• Example architecture: encoder = 4-layer conv, decoder = 4-layer upconv

Figure adapted from CS 231n
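A minimal PyTorch sketch of this setup, assuming a small fully-connected encoder/decoder rather than the 4-layer conv/upconv pair from the figure; the layer sizes, toy data, and optimizer settings are illustrative.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, x_dim=784, z_dim=32):
        super().__init__()
        # Encoder: input data x -> features z (z smaller than x)
        self.encoder = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU(), nn.Linear(128, z_dim))
        # Decoder: features z -> reconstructed data x_hat
        self.decoder = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, x_dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # L2 reconstruction loss; no labels needed

x = torch.rand(64, 784)  # toy batch of flattened "images"
x_hat = model(x)
loss = loss_fn(x_hat, x)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```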


Autoencoders (Recap)

• The encoder can be used to initialize a supervised model


• Throw away the decoder after training with the reconstruction loss
• Add labels and a new loss function (e.g. softmax loss): a softmax classifier head on top of the features
• Fine-tune the encoder jointly with the classifier; train for the final task (sometimes with small data)
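A short PyTorch sketch of this reuse pattern; the encoder here stands in for a pretrained one (e.g. from the autoencoder sketch above), and the 10-class head, learning rate, and toy labels are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Pretrained encoder (stand-in for the encoder trained with the reconstruction loss)
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))

# Throw away the decoder; add a new classifier head on top of the features
classifier = nn.Sequential(encoder, nn.Linear(32, 10))  # 10 classes is an assumption

optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-4)  # fine-tune encoder + head jointly
loss_fn = nn.CrossEntropyLoss()  # softmax loss

x = torch.rand(64, 784)
y = torch.randint(0, 10, (64,))  # toy labels for the final supervised task
loss = loss_fn(classifier(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```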



Autoencoders (Recap)

• Autoencoders can reconstruct data, and can learn features to initialize a supervised model
• Features capture factors of variation in the training data. But can we generate new images from an autoencoder?

Variational Autoencoders

• Probabilistic spin on autoencoders: will let us sample from the model to generate data!
• Assume the training data is generated from an underlying unobserved (latent) representation z
• Intuition (remember from autoencoders!): x is an image; z is the latent factors used to generate x
• Generation: sample z from the true prior pθ*(z), then sample x from the true conditional pθ*(x|z)


Variational Autoencoders

• We want to estimate the true parameters θ* of this generative model. How should we represent this model?
• Choose the prior p(z) to be simple, e.g. a Gaussian. Reasonable for latent attributes, e.g. pose, how much smile.
• The conditional p(x|z) is complex (it generates an image) => represent it with a neural network: the decoder network


Variational Autoencoders

• How do we train the model? Similar to Restricted Boltzmann Machines, we maximize the likelihood of the training data:

  pθ(x) = ∫ pθ(z) pθ(x|z) dz

  where pθ(z) is the simple Gaussian prior (✔) and pθ(x|z) is the decoder neural network (✔), both tractable for any given z


Variational Autoencoders

• Data likelihood: pθ(x) = ∫ pθ(z) pθ(x|z) dz. Both factors inside the integral are tractable, but the integral itself is intractable to compute: we cannot evaluate pθ(x|z) for every z
• The posterior density is also intractable: pθ(z|x) = pθ(x|z) pθ(z) / pθ(x), because of the intractable data likelihood pθ(x) in the denominator


Variational Autoencoders

• Solution: in addition to the decoder network modeling pθ(x|z), define an additional encoder network qφ(z|x) that approximates the intractable posterior pθ(z|x)
• We will see that this allows us to derive a lower bound on the data likelihood that is tractable, which we can optimize
• Since we are modeling probabilistic generation of data, the encoder and decoder networks are probabilistic:
  - Encoder network qφ(z|x) (parameters φ): outputs the mean and (diagonal) covariance of z|x
  - Decoder network pθ(x|z) (parameters θ): outputs the mean and (diagonal) covariance of x|z

Kingma and Welling, "Auto-Encoding Variational Bayes", ICLR 2014
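A compact PyTorch sketch of such probabilistic encoder/decoder networks, following the description above; the fully-connected architecture, dimensions, and the choice to output log-variances (for numerical stability) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """q_phi(z | x): outputs mean and (diagonal) log-variance of z given x."""
    def __init__(self, x_dim=784, h_dim=256, z_dim=20):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.logvar(h)

class Decoder(nn.Module):
    """p_theta(x | z): outputs mean and (diagonal) log-variance of x given z."""
    def __init__(self, x_dim=784, h_dim=256, z_dim=20):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, x_dim)
        self.logvar = nn.Linear(h_dim, x_dim)

    def forward(self, z):
        h = self.body(z)
        return self.mu(h), self.logvar(h)
```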


• Given these networks, we can sample z|x from the encoder's Gaussian and sample x|z from the decoder's Gaussian

Variational Autoencoders

Now let us look at the (log) data likelihood again, with the encoder and decoder:

  log pθ(x^(i)) = E_{z∼qφ(z|x^(i))} [ log pθ(x^(i)) ]    (pθ(x^(i)) does not depend on z)

Taking the expectation over z (using the encoder network) will be helpful later on.


  = E_z [ log ( pθ(x^(i)|z) pθ(z) / pθ(z|x^(i)) ) ]    (Bayes' rule)

  = E_z [ log ( pθ(x^(i)|z) pθ(z) qφ(z|x^(i)) / ( pθ(z|x^(i)) qφ(z|x^(i)) ) ) ]    (multiply by 1 = qφ(z|x^(i)) / qφ(z|x^(i)))


  = E_z [ log pθ(x^(i)|z) ] − E_z [ log ( qφ(z|x^(i)) / pθ(z) ) ] + E_z [ log ( qφ(z|x^(i)) / pθ(z|x^(i)) ) ]    (logarithm)

The expectations over z lead to a nice KL-divergence form:

  = E_z [ log pθ(x^(i)|z) ] − DKL( qφ(z|x^(i)) || pθ(z) ) + DKL( qφ(z|x^(i)) || pθ(z|x^(i)) )


Looking at the three terms:
• E_z [ log pθ(x^(i)|z) ]: the decoder network gives pθ(x|z), so we can compute an estimate of this term through sampling. (Sampling is made differentiable through the reparameterization trick, see the paper.)
• DKL( qφ(z|x^(i)) || pθ(z) ): this KL term (between the Gaussian for the encoder and the z prior) has a nice closed-form solution!
• DKL( qφ(z|x^(i)) || pθ(z|x^(i)) ): pθ(z|x) is intractable (as we saw earlier), so we can't compute this KL term. But we know a KL divergence is always >= 0.
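A one-line illustration of the reparameterization trick mentioned above, in the same PyTorch style as the sketches above (the helper name is mine):

```python
import torch

def reparameterize(mu, logvar):
    # z = mu + sigma * eps, with eps ~ N(0, I): sampling z stays differentiable w.r.t. mu and logvar
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps
```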

Dropping the last (intractable, but always >= 0) KL term leaves a tractable lower bound, which we can take the gradient of and optimize! (E_z [ log pθ(x^(i)|z) ] is differentiable, and the closed-form KL term is differentiable.)


Variational Lower Bound ("ELBO"):

  L(x^(i); θ, φ) = E_z [ log pθ(x^(i)|z) ] − DKL( qφ(z|x^(i)) || pθ(z) ),  with  log pθ(x^(i)) >= L(x^(i); θ, φ)

Training: maximize the lower bound

  θ*, φ* = argmax_{θ,φ} Σ_i L(x^(i); θ, φ)

• The first term says: reconstruct the input data
• The second term says: make the approximate posterior distribution close to the prior
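A sketch of this objective as a PyTorch loss, assuming the Encoder/Decoder modules sketched above, a unit-Gaussian prior on z, and a Gaussian decoder; we minimize the negative ELBO, and the closed-form KL for diagonal Gaussians is used.

```python
import math
import torch

def negative_elbo(x, encoder, decoder):
    # Encode: q_phi(z | x), then sample z with the reparameterization trick
    mu_z, logvar_z = encoder(x)
    std = torch.exp(0.5 * logvar_z)
    z = mu_z + std * torch.randn_like(std)

    # Decode: Gaussian p_theta(x | z); log-likelihood of the input under the decoder
    mu_x, logvar_x = decoder(z)
    log_px_given_z = -0.5 * torch.sum(
        logvar_x + (x - mu_x) ** 2 / torch.exp(logvar_x) + math.log(2 * math.pi), dim=1
    )

    # Closed-form KL( q_phi(z|x) || N(0, I) ) for diagonal Gaussians
    kl = -0.5 * torch.sum(1 + logvar_z - mu_z ** 2 - torch.exp(logvar_z), dim=1)

    # Negative ELBO averaged over the minibatch (minimize this to maximize the lower bound)
    return torch.mean(kl - log_px_given_z)
```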


Variational Autoencoders

Putting it all together: maximizing the likelihood lower bound. Let's look at computing the bound (forward pass) for a given minibatch of input data:

1. Pass the input data x through the encoder network qφ(z|x) (parameters φ) to get the mean and (diagonal) covariance of z|x


2. Use the encoder's mean and covariance in the KL term, making the approximate posterior distribution close to the prior
3. Sample z|x from the encoder's Gaussian


4. Pass the sampled z through the decoder network pθ(x|z) (parameters θ) to get the mean and (diagonal) covariance of x|z
5. Sample x|z from the decoder's Gaussian, and maximize the likelihood of the original input being reconstructed


For every minibatch of input data: compute this forward pass, and then backprop!
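A minimal training-loop sketch for this forward-pass-then-backprop procedure, reusing the Encoder/Decoder modules and the negative_elbo function sketched above (an assumption; the random data is a toy stand-in for a real minibatch loader):

```python
import torch

encoder, decoder = Encoder(), Decoder()
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for step in range(100):                        # toy number of steps
    x = torch.rand(64, 784)                    # stand-in minibatch of input data
    loss = negative_elbo(x, encoder, decoder)  # forward pass: compute the (negative) bound
    optimizer.zero_grad()
    loss.backward()                            # backprop
    optimizer.step()
```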

VAE: Generate data

Use the decoder network; now sample z from the prior!

1. Sample z from the prior z ∼ N(0, I)
2. Pass z through the decoder network pθ(x|z) (parameters θ)
3. Sample x|z from the decoder's Gaussian

Kingma and Welling, "Auto-Encoding Variational Bayes", ICLR 2014


By varying z1 and z2 (for a 2-d z), the generated samples trace out the data manifold.
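A generation sketch under the same assumed Decoder module as above; for simplicity it returns the decoder mean instead of sampling x|z, and the latent dimension 20 is the assumed z_dim.

```python
import torch

decoder = Decoder()              # assumed trained decoder from the sketches above
with torch.no_grad():
    z = torch.randn(16, 20)      # sample z from the prior N(0, I)
    mu_x, logvar_x = decoder(z)
    samples = mu_x               # use the decoder mean as the generated images
```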


VAE: Generate data

• Different dimensions of z encode interpretable factors of variation (e.g. varying z1 changes the degree of smile, varying z2 changes the head pose)
• Diagonal prior on z => independent latent variables
• z is also a good feature representation, which can be computed using the encoder network qφ(z|x)

Kingma and Welling, "Auto-Encoding Variational Bayes", ICLR 2014


Software

• vae (PyPI): https://pypi.org/project/vae/
• Deep learning platforms such as TensorFlow and PyTorch

References

[1] Kingma, Diederik P., and Max Welling. "An introduction to variational autoencoders." Foundations and Trends® in Machine Learning 12.4 (2019): 307-392.
[2] CS 3750 previous slides.
[3] Stanford CS 231n, 2017, lecture 13.
[4] Higgins, Irina, et al. "beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework." ICLR, 2017.