
  • Variational Autoencoders - An Introduction

    Devon Graham

    University of British Columbia [email protected]

    Oct 31st, 2017

  • Table of contents

    Introduction

    Deep Learning Perspective

    Probabilistic Model Perspective

    Applications

    Conclusion

  • Introduction

    - Auto-Encoding Variational Bayes, Diederik P. Kingma and Max Welling, ICLR 2014

    - Generative model

    - Running example: want to generate realistic-looking MNIST digits (or celebrity faces, video game plants, cat pictures, etc.)

    - https://jaan.io/what-is-variational-autoencoder-vae-tutorial/

    - Deep Learning perspective and Probabilistic Model perspective

    3 / 30


  • Introduction - Autoencoders

    - Attempt to learn the identity function (a minimal code sketch follows this slide)

    - Constrained in some way (e.g., a small latent vector representation)

    - Can generate new images by giving different latent vectors to the trained network

    - Variational: use a probabilistic latent encoding

    4 / 30

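    To make the picture concrete, here is a minimal sketch of a plain (deterministic) autoencoder. The slides do not name a framework, so PyTorch is assumed, and the layer sizes (400 hidden units, a 20-dimensional latent vector) are illustrative choices rather than values from the talk.

      import torch
      import torch.nn as nn

      class Autoencoder(nn.Module):
          """Tries to reproduce its input through a small latent bottleneck."""
          def __init__(self, input_dim=784, hidden_dim=400, latent_dim=20):
              super().__init__()
              # Encoder: compress the 784-pixel image into a small latent vector z.
              self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU(),
                                           nn.Linear(hidden_dim, latent_dim))
              # Decoder: reconstruct the image from z.
              self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
                                           nn.Linear(hidden_dim, input_dim), nn.Sigmoid())

          def forward(self, x):
              z = self.encoder(x)              # the latent z is much smaller than x
              return self.decoder(z)

      model = Autoencoder()
      x = torch.rand(16, 784)                      # stand-in for a batch of MNIST digits
      x_tilde = model(x)
      loss = nn.functional.mse_loss(x_tilde, x)    # "learn the identity function"

    A VAE replaces this deterministic latent vector with a probabilistic encoding, as developed in the following slides.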

  • Deep Learning Perspective

    5 / 30

  • Deep Learning Perspective

    - Goal: Build a neural network that generates MNIST digits from random (Gaussian) noise (sketched in code after this slide)

    - Define two sub-networks: an Encoder and a Decoder

    - Define a Loss Function

    6 / 30

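    The stated goal, sketched in code: once a decoder has been trained, new digits are generated by feeding it random Gaussian noise. The decoder below is only a placeholder with assumed sizes (20-dimensional latent, 784 output pixels); untrained, it of course produces noise rather than digits.

      import torch
      import torch.nn as nn

      # Placeholder decoder: maps a latent vector z to 784 per-pixel probabilities.
      decoder = nn.Sequential(nn.Linear(20, 400), nn.ReLU(),
                              nn.Linear(400, 784), nn.Sigmoid())

      z = torch.randn(64, 20)                                  # random Gaussian noise
      x_probs = decoder(z)                                     # per-pixel Bernoulli parameters
      samples = torch.bernoulli(x_probs).view(-1, 1, 28, 28)   # 64 "generated" images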

  • Encoder

    - A neural network qθ(z|x) (a code sketch follows this slide)

    - Input: a datapoint x (e.g., a 28 × 28-pixel MNIST digit)

    - Output: an encoding z, drawn from a Gaussian density with parameters θ

    - |z| ≪ |x|

    7 / 30

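    A sketch of such an encoder, again assuming PyTorch and illustrative layer sizes: the network maps a flattened digit x to the mean and log-variance of a diagonal Gaussian over z, from which an encoding can be drawn.

      import torch
      import torch.nn as nn

      class Encoder(nn.Module):
          """q_theta(z|x): maps a flattened 28x28 image to a diagonal Gaussian over z."""
          def __init__(self, input_dim=784, hidden_dim=400, latent_dim=20):
              super().__init__()
              self.net = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
              self.mu = nn.Linear(hidden_dim, latent_dim)       # mean of the Gaussian
              self.logvar = nn.Linear(hidden_dim, latent_dim)   # log-variance of the Gaussian

          def forward(self, x):
              h = self.net(x)
              return self.mu(h), self.logvar(h)

      enc = Encoder()
      x = torch.rand(16, 784)                                   # a batch of flattened digits
      mu, logvar = enc(x)
      z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # draw z from the Gaussian

    Note that the assumed 20-dimensional encoding is far smaller than the 784-dimensional input, i.e., |z| ≪ |x|.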

  • Decoder

    - A neural network pφ(x|z), parameterized by φ (a code sketch follows this slide)

    - Input: the encoding z, i.e. the output from the encoder

    - Output: a reconstruction x̃, drawn from the distribution of the data

    - E.g., output the parameters for 28 × 28 Bernoulli variables

    8 / 30

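    A matching decoder sketch, under the same assumptions: it maps an encoding z to per-pixel Bernoulli parameters for a 28 × 28 image, from which a reconstruction x̃ can be sampled.

      import torch
      import torch.nn as nn

      class Decoder(nn.Module):
          """p_phi(x|z): maps an encoding z to parameters of 28x28 Bernoulli pixels."""
          def __init__(self, latent_dim=20, hidden_dim=400, output_dim=784):
              super().__init__()
              self.net = nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
                                       nn.Linear(hidden_dim, output_dim), nn.Sigmoid())

          def forward(self, z):
              return self.net(z)                       # per-pixel probabilities in [0, 1]

      dec = Decoder()
      z = torch.randn(16, 20)                          # encodings, e.g. from the encoder above
      x_probs = dec(z)                                 # Bernoulli parameters for each pixel
      x_tilde = torch.bernoulli(x_probs)               # one sampled reconstruction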

  • Loss Function

    - x̃ is reconstructed from z, where |z| ≪ |x̃|

    - How much information is lost when we go from x to z to x̃?

    - Measure this with the reconstruction log-likelihood: log pφ(x|z) (computed in the sketch after this slide)

    - Measures how effectively the decoder has learned to reconstruct x given the latent representation z

    9 / 30

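    For Bernoulli pixels, the reconstruction log-likelihood log pφ(x|z) is the negative binary cross-entropy between the decoder's pixel probabilities and the original image. A minimal sketch, showing only this reconstruction term and using stand-in tensors with the shapes assumed in the networks above:

      import torch
      import torch.nn.functional as F

      x = torch.rand(16, 784)                               # original images, values in [0, 1]
      x_probs = torch.rand(16, 784).clamp(1e-4, 1 - 1e-4)   # stand-in for decoder output p_phi(x|z)

      # log p_phi(x|z) for Bernoulli pixels = -(binary cross-entropy), summed over pixels
      recon_log_lik = -F.binary_cross_entropy(x_probs, x, reduction="none").sum(dim=1)
      print(recon_log_lik.shape)                             # one value per image in the batch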