Variational Autoencoders - An Introduction
Devon Graham
University of British Columbia
Oct 31st, 2017
Table of contents
Introduction
Deep Learning Perspective
Probabilistic Model Perspective
Applications
Conclusion
Introduction
- Auto-Encoding Variational Bayes, Diederik P. Kingma and Max Welling, ICLR 2014
- Generative model
- Running example: we want to generate realistic-looking MNIST digits (or celebrity faces, video game plants, cat pictures, etc.)
- https://jaan.io/what-is-variational-autoencoder-vae-tutorial/
- Deep Learning perspective and Probabilistic Model perspective
Introduction - Autoencoders
- Attempt to learn the identity function
- Constrained in some way (e.g., a small latent vector representation)
- Can generate new images by giving different latent vectors to the trained network
- Variational: use a probabilistic latent encoding
Deep Learning Perspective
- Goal: build a neural network that generates MNIST digits from random (Gaussian) noise
- Define two sub-networks: an encoder and a decoder
- Define a loss function
Encoder
- A neural network $q_\theta(z \mid x)$
- Input: datapoint $x$ (e.g., a $28 \times 28$-pixel MNIST digit)
- Output: encoding $z$, drawn from a Gaussian density whose parameters are produced by the network (weights $\theta$)
- $|z| \ll |x|$
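A minimal sketch of what such an encoder computes, assuming a single linear layer in place of a deeper network. The names `encode`, `W_mu`, `W_logvar`, the log-variance parameterization, and the toy sizes are illustrative assumptions, not taken from the slides.

```python
import math
import random

def encode(x, W_mu, b_mu, W_logvar, b_logvar):
    """Map datapoint x to the mean and std of a diagonal Gaussian over z."""
    mu = [sum(w * xi for w, xi in zip(row, x)) + b
          for row, b in zip(W_mu, b_mu)]
    logvar = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(W_logvar, b_logvar)]
    sigma = [math.exp(0.5 * lv) for lv in logvar]  # exp keeps sigma positive
    return mu, sigma

rng = random.Random(0)
x_dim, z_dim = 784, 2                      # 28 x 28 pixels flattened; |z| << |x|
x = [rng.random() for _ in range(x_dim)]   # stand-in for one MNIST image
W_mu = [[rng.gauss(0, 0.01) for _ in range(x_dim)] for _ in range(z_dim)]
W_logvar = [[rng.gauss(0, 0.01) for _ in range(x_dim)] for _ in range(z_dim)]

mu, sigma = encode(x, W_mu, [0.0] * z_dim, W_logvar, [0.0] * z_dim)
z = [rng.gauss(m, s) for m, s in zip(mu, sigma)]  # one draw z ~ q_theta(z | x)
```

Predicting the log-variance and exponentiating is one common way to guarantee a positive standard deviation; other parameterizations would work as well.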
Decoder
- A neural network $p_\phi(x \mid z)$, parameterized by $\phi$
- Input: encoding $z$, the output of the encoder
- Output: reconstruction $\tilde{x}$, drawn from the distribution of the data
- E.g., output parameters for $28 \times 28$ Bernoulli variables
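As a rough illustration of the Bernoulli case, the sketch below maps a latent vector to one probability per pixel and then samples a binary reconstruction. It assumes a single linear layer followed by a sigmoid; `decode`, `W`, `b`, and the random weights are hypothetical stand-ins for a trained network.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def decode(z, W, b):
    """Map latent z to one Bernoulli probability per pixel."""
    return [sigmoid(sum(w * zj for w, zj in zip(row, z)) + bi)
            for row, bi in zip(W, b)]

rng = random.Random(0)
z_dim, x_dim = 2, 784
W = [[rng.gauss(0, 1) for _ in range(z_dim)] for _ in range(x_dim)]
b = [0.0] * x_dim

z = [0.5, -1.0]                 # an encoding, e.g. produced by the encoder
probs = decode(z, W, b)         # p_phi(x_j = 1 | z) for each of the 784 pixels
x_tilde = [1 if rng.random() < p else 0 for p in probs]  # sampled reconstruction
```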
Loss Function
- $\tilde{x}$ is reconstructed from $z$, where $|z| \ll |\tilde{x}|$
- How much information is lost when we go from $x$ to $z$ to $\tilde{x}$?
- Measure this with the reconstruction log-likelihood: $\log p_\phi(x \mid z)$
- Measures how effectively the decoder has learned to reconstruct $x$ given the latent representation $z$
Loss Function
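- Loss function is the negative reconstruction log-likelihood plus a regularizer
- Loss decomposes into a term for each datapoint:
  $$L(\theta, \phi) = \sum_{i=1}^{N} l_i(\theta, \phi)$$
- Loss for datapoint $x_i$:
  $$l_i(\theta, \phi) = -\mathbb{E}_{z \sim q_\theta(z \mid x_i)}\big[\log p_\phi(x_i \mid z)\big] + \mathrm{KL}\big(q_\theta(z \mid x_i) \,\|\, p(z)\big)$$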
Loss Function
- Negative reconstruction log-likelihood:
  $$-\mathbb{E}_{z \sim q_\theta(z \mid x_i)}\big[\log p_\phi(x_i \mid z)\big]$$
- Encourages the decoder to learn to reconstruct the data
- Expectation is taken over the distribution of latent representations
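For the Bernoulli decoder mentioned earlier, $\log p_\phi(x \mid z)$ for a single $z$ is a sum of per-pixel log-probabilities, and its negative is the familiar binary cross-entropy. The sketch below evaluates it for made-up decoder outputs; the function name and the toy numbers are assumptions for illustration only.

```python
import math

def bernoulli_log_likelihood(x, probs, eps=1e-12):
    """log p(x | z) for binary pixels x and per-pixel Bernoulli probabilities."""
    total = 0.0
    for xj, pj in zip(x, probs):
        pj = min(max(pj, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += xj * math.log(pj) + (1 - xj) * math.log(1.0 - pj)
    return total

x = [1, 0, 1, 1]                 # a tiny binary "image"
good = [0.9, 0.1, 0.8, 0.7]      # decoder output that matches x well
bad = [0.5, 0.5, 0.5, 0.5]       # uninformative decoder output

nll_good = -bernoulli_log_likelihood(x, good)
nll_bad = -bernoulli_log_likelihood(x, bad)
```

The better reconstruction gets the smaller negative log-likelihood, which is what the first loss term rewards. In training, the expectation over $z$ is typically approximated by averaging this quantity over one or a few sampled encodings.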
Loss Function
- KL divergence as regularizer:
  $$\mathrm{KL}\big(q_\theta(z \mid x_i) \,\|\, p(z)\big) = \mathbb{E}_{z \sim q_\theta(z \mid x_i)}\big[\log q_\theta(z \mid x_i) - \log p(z)\big]$$
- Measures how much $q_\theta(z \mid x_i)$ diverges from the prior $p(z)$
- We will use $p(z) = \mathcal{N}(0, I)$
- Encourages the encoder to produce $z$'s that are close to a standard normal distribution
- Encoder learns a meaningful representation of MNIST digits
- Representations for images of the same digit are close together in latent space
- Otherwise the encoder could "memorize" the data and map each observed datapoint to a distinct region of space
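When the encoder outputs a diagonal Gaussian and the prior is $\mathcal{N}(0, I)$, this KL term has a well-known closed form, $\tfrac{1}{2}\sum_j \big(\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2\big)$, so no sampling is needed for it. A small sketch, with a hypothetical function name and toy inputs:

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

kl_zero = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])     # q equals the prior
kl_shifted = kl_to_standard_normal([2.0, 0.0], [1.0, 1.0])  # mean pushed away
kl_narrow = kl_to_standard_normal([0.0, 0.0], [0.1, 0.1])   # very confident q
```

The penalty is zero only when the encoding distribution matches the prior, and it grows both when the mean drifts away from the origin and when the variance collapses, which is how it discourages the "memorizing" behaviour described above.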
MNIST latent variable space
Reparameterization trick
- We want to use gradient descent to learn the model's parameters
- Given $z$ drawn from $q_\theta(z \mid x)$, how do we take derivatives of (a function of) $z$ w.r.t. $\theta$?
- We can reparameterize: $z = \mu + \sigma \odot \epsilon$
  - $\epsilon \sim \mathcal{N}(0, I)$, and $\odot$ is the element-wise product
- Can take derivatives of (functions of) $z$ w.r.t. $\mu$ and $\sigma$
- Output of $q_\theta(z \mid x)$ is a vector of $\mu$'s and a vector of $\sigma$'s
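The trick can be checked numerically: once $\epsilon$ is drawn and held fixed, $z$ is a deterministic function of $\mu$ and $\sigma$, with $\partial z_j / \partial \mu_j = 1$ and $\partial z_j / \partial \sigma_j = \epsilon_j$. The sketch below verifies this with finite differences; the helper name and the toy values are assumptions, and an automatic-differentiation library would normally compute these gradients.

```python
import random

def reparameterize(mu, sigma, eps):
    """z = mu + sigma * eps, element-wise."""
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

rng = random.Random(0)
mu, sigma = [0.5, -1.0], [1.0, 0.3]
eps = [rng.gauss(0.0, 1.0) for _ in mu]   # noise does not depend on mu or sigma
z = reparameterize(mu, sigma, eps)

# Finite-difference check of the derivatives of z[0], with eps held fixed.
h = 1e-6
z_mu = reparameterize([mu[0] + h, mu[1]], sigma, eps)
z_sigma = reparameterize(mu, [sigma[0] + h, sigma[1]], eps)
dz_dmu = (z_mu[0] - z[0]) / h         # expected: 1
dz_dsigma = (z_sigma[0] - z[0]) / h   # expected: eps[0]
```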
Summary
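- Deep Learning objective is to minimize the loss function:
  $$L(\theta, \phi) = \sum_{i=1}^{N} \Big( -\mathbb{E}_{z \sim q_\theta(z \mid x_i)}\big[\log p_\phi(x_i \mid z)\big] + \mathrm{KL}\big(q_\theta(z \mid x_i) \,\|\, p(z)\big) \Big)$$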
Probabilistic Model Perspective
- Data $x$ and latent variables $z$
- Joint pdf of the model: $p(x, z) = p(x \mid z)\, p(z)$
- Decomposes into likelihood $p(x \mid z)$ and prior $p(z)$
- Generative process:
  - Draw latent variables $z_i \sim p(z)$
  - Draw datapoint $x_i \sim p(x \mid z_i)$
- Graphical model: $z \rightarrow x$
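The two-step generative process can be sketched as ancestral sampling. The example assumes the standard normal prior used earlier and a Bernoulli likelihood whose parameters come from a hypothetical linear map followed by a sigmoid; `W`, `b`, and the tiny dimensions are made up for illustration.

```python
import math
import random

def sample_datapoint(rng, W, b):
    """Ancestral sampling: first z ~ p(z), then x ~ p(x | z)."""
    z = [rng.gauss(0.0, 1.0) for _ in range(len(W[0]))]    # prior N(0, I)
    logits = [sum(w * zj for w, zj in zip(row, z)) + bi
              for row, bi in zip(W, b)]
    probs = [1.0 / (1.0 + math.exp(-a)) for a in logits]   # Bernoulli parameters
    x = [1 if rng.random() < p else 0 for p in probs]      # binary "pixels"
    return z, x

rng = random.Random(0)
W = [[1.0, -1.0], [0.5, 2.0], [-1.5, 0.3], [0.0, 1.0]]  # hypothetical weights
b = [0.0, -0.5, 0.5, 0.0]
samples = [sample_datapoint(rng, W, b) for _ in range(5)]
```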
Probabilistic Model Perspective
- Suppose we want to do inference in this model
- We would like to infer good values of $z$, given observed data
- Then we could use them to generate real-looking MNIST digits
- We want to calculate the posterior:
  $$p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)}$$
- Need to calculate the evidence: $p(x) = \int p(x \mid z)\, p(z)\, dz$
- Integral over all configurations of latent variables
  - Intractable
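To see why this integral is the bottleneck, consider a toy model where it can still be computed: a one-dimensional $z$ with $p(z) = \mathcal{N}(0, 1)$ and $p(x \mid z) = \mathcal{N}(z, 1)$, for which $p(x) = \mathcal{N}(0, 2)$ in closed form. The toy model, function names, and grid settings below are assumptions chosen for illustration; they are not part of the VAE itself.

```python
import math

def normal_pdf(v, mean, var):
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def evidence_1d(x, lo=-10.0, hi=10.0, n=20000):
    """Midpoint-rule approximation of p(x) = integral of p(x | z) p(z) dz."""
    dz = (hi - lo) / n
    total = 0.0
    for k in range(n):
        z = lo + (k + 0.5) * dz
        total += normal_pdf(x, z, 1.0) * normal_pdf(z, 0.0, 1.0) * dz
    return total

x = 1.3
p_x_numeric = evidence_1d(x)
p_x_exact = normal_pdf(x, 0.0, 2.0)   # closed form, available only for this toy model

# The same grid idea with 100 points per axis and 20 latent dimensions
# would need this many likelihood evaluations:
grid_points_20d = 100 ** 20
```

In one dimension the grid works well, but the number of grid points grows exponentially with the dimension of $z$, and with a neural-network likelihood there is no closed form to fall back on.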
Probabilistic Model Perspective
- Variational inference to the rescue!
- Let's approximate the true posterior $p(z \mid x)$ with the 'best' distribution from some family $q_\lambda(z \mid x)$
- Which choice of $\lambda$ gives the 'best' $q_\lambda(z \mid x)$?
- KL divergence measures how much $q_\lambda$ diverges from $p$
- Choose $\lambda$ to minimize $\mathrm{KL}\big(q_\lambda(z \mid x) \,\|\, p(z \mid x)\big) = \mathrm{KL}(q_\lambda \,\|\, p)$
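A toy illustration of "choosing the best $\lambda$": take the one-dimensional model $p(z) = \mathcal{N}(0, 1)$, $p(x \mid z) = \mathcal{N}(z, 1)$, whose exact posterior is $\mathcal{N}(x/2, 1/2)$, and search a Gaussian family $q_\lambda = \mathcal{N}(m, s^2)$ with $\lambda = (m, s)$ on a grid. The model, the grid, and the function name are assumptions for illustration; in a real VAE the posterior is unknown, so this KL cannot be evaluated directly.

```python
import math

def kl_gauss(m, s, a, v):
    """KL( N(m, s^2) || N(a, v) ) for one-dimensional Gaussians."""
    return 0.5 * (math.log(v / (s * s)) + (s * s + (m - a) ** 2) / v - 1.0)

x = 1.3
post_mean, post_var = x / 2.0, 0.5    # exact posterior of the toy model

# Candidate variational parameters lambda = (m, s) on a coarse grid.
candidates = [(m / 100.0, s / 100.0)
              for m in range(-200, 201)
              for s in range(10, 201)]
best_m, best_s = min(
    candidates,
    key=lambda lam: kl_gauss(lam[0], lam[1], post_mean, post_var),
)
```

The minimizer lands on the posterior mean and, up to grid resolution, on the posterior standard deviation, since the true posterior happens to lie inside the chosen family.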
Probabilistic Model Perspective
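- Expanding the KL divergence, using $\log p(z \mid x) = \log p(x, z) - \log p(x)$:
  $$\begin{aligned}
  \mathrm{KL}(q_\lambda \,\|\, p) &:= \mathbb{E}_{z \sim q_\lambda}\big[\log q_\lambda(z \mid x) - \log p(z \mid x)\big] \\
  &= \mathbb{E}_{z \sim q_\lambda}\big[\log q_\lambda(z \mid x)\big] - \mathbb{E}_{z \sim q_\lambda}\big[\log p(x, z)\big] + \log p(x)
  \end{aligned}$$
- Still contains a $p(x)$ term, so we cannot compute it directly
- But $p(x)$ does not depend on $\lambda$, so there is still hope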
Probabilistic Model Perspective
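- Define the Evidence Lower BOund:
  $$\mathrm{ELBO}(\lambda) := \mathbb{E}_{z \sim q_\lambda}\big[\log p(x, z)\big] - \mathbb{E}_{z \sim q_\lambda}\big[\log q_\lambda(z \mid x)\big]$$
- Then
  $$\begin{aligned}
  \mathrm{KL}(q_\lambda \,\|\, p) &= \mathbb{E}_{z \sim q_\lambda}\big[\log q_\lambda(z \mid x)\big] - \mathbb{E}_{z \sim q_\lambda}\big[\log p(x, z)\big] + \log p(x) \\
  &= -\mathrm{ELBO}(\lambda) + \log p(x)
  \end{aligned}$$
- So minimizing $\mathrm{KL}(q_\lambda \,\|\, p)$ w.r.t. $\lambda$ is equivalent to maximizing $\mathrm{ELBO}(\lambda)$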
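The identity $\log p(x) = \mathrm{ELBO}(\lambda) + \mathrm{KL}(q_\lambda \,\|\, p)$ can be checked exactly on a small discrete example, where every expectation is a finite sum. The three-state latent variable and all probabilities below are made-up numbers for illustration.

```python
import math

prior = [0.5, 0.3, 0.2]           # p(z) for z in {0, 1, 2}
likelihood = [0.10, 0.60, 0.30]   # p(x | z) for one fixed observed x
q = [0.2, 0.5, 0.3]               # an arbitrary approximate posterior q(z | x)

joint = [pz * lx for pz, lx in zip(prior, likelihood)]   # p(x, z)
p_x = sum(joint)                                          # evidence p(x)
posterior = [j / p_x for j in joint]                      # true p(z | x)

# ELBO = E_q[log p(x, z)] - E_q[log q(z | x)]
elbo = sum(qz * (math.log(j) - math.log(qz)) for qz, j in zip(q, joint))
# KL(q || p(z | x)) = E_q[log q(z | x) - log p(z | x)]
kl = sum(qz * (math.log(qz) - math.log(pzx)) for qz, pzx in zip(q, posterior))
log_p_x = math.log(p_x)
```

Because the KL term is never negative, the ELBO never exceeds $\log p(x)$, which is where the name "lower bound" comes from.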
Probabilistic Model Perspective
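- Since no two datapoints share latent variables, we can write:
  $$\mathrm{ELBO}(\lambda) = \sum_{i=1}^{N} \mathrm{ELBO}_i(\lambda)$$
- Where
  $$\mathrm{ELBO}_i(\lambda) = \mathbb{E}_{z \sim q_\lambda(z \mid x_i)}\big[\log p(x_i, z)\big] - \mathbb{E}_{z \sim q_\lambda(z \mid x_i)}\big[\log q_\lambda(z \mid x_i)\big]$$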
Probabilistic Model Perspective
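- We can rewrite the term $\mathrm{ELBO}_i(\lambda)$:
  $$\begin{aligned}
  \mathrm{ELBO}_i(\lambda) &= \mathbb{E}_{z \sim q_\lambda(z \mid x_i)}\big[\log p(x_i, z)\big] - \mathbb{E}_{z \sim q_\lambda(z \mid x_i)}\big[\log q_\lambda(z \mid x_i)\big] \\
  &= \mathbb{E}_{z \sim q_\lambda(z \mid x_i)}\big[\log p(x_i \mid z) + \log p(z)\big] - \mathbb{E}_{z \sim q_\lambda(z \mid x_i)}\big[\log q_\lambda(z \mid x_i)\big] \\
  &= \mathbb{E}_{z \sim q_\lambda(z \mid x_i)}\big[\log p(x_i \mid z)\big] - \mathbb{E}_{z \sim q_\lambda(z \mid x_i)}\big[\log q_\lambda(z \mid x_i) - \log p(z)\big] \\
  &= \mathbb{E}_{z \sim q_\lambda(z \mid x_i)}\big[\log p(x_i \mid z)\big] - \mathrm{KL}\big(q_\lambda(z \mid x_i) \,\|\, p(z)\big)
  \end{aligned}$$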
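The first and last lines of this rewrite can be compared numerically on a small discrete example, where the expectations are finite sums. The three-state latent variable and the probabilities are made-up numbers for illustration.

```python
import math

prior = [0.5, 0.3, 0.2]           # p(z) for z in {0, 1, 2}
likelihood = [0.10, 0.60, 0.30]   # p(x_i | z) for one fixed datapoint x_i
q = [0.2, 0.5, 0.3]               # q(z | x_i)

def expect(values):
    """Expectation of a function of z under q(z | x_i)."""
    return sum(qz * v for qz, v in zip(q, values))

# First line: E_q[log p(x_i, z)] - E_q[log q(z | x_i)]
form_joint = (expect([math.log(pz * lx) for pz, lx in zip(prior, likelihood)])
              - expect([math.log(qz) for qz in q]))

# Last line: E_q[log p(x_i | z)] - KL(q(z | x_i) || p(z))
reconstruction = expect([math.log(lx) for lx in likelihood])
kl_to_prior = expect([math.log(qz) - math.log(pz) for qz, pz in zip(q, prior)])
form_split = reconstruction - kl_to_prior
```

Both forms give the same number; the second separates a reconstruction term from a regularizer toward the prior, matching the structure of the loss from the deep learning perspective.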
Probabilistic Model Perspective
- How do we relate $\lambda$ to $\phi$ and $\theta$ seen earlier?
- We can parameterize the approximate posterior $q_\theta(z \mid x, \lambda)$ by a network that takes data $x$ and outputs parameters $\lambda$
- Parameterize the likelihood $p(x \mid z)$ with a network that takes latent variables and outputs parameters of the data distribution $p_\phi(x \mid z)$
- So we can rewrite
  $$\mathrm{ELBO}_i(\theta, \phi) = \mathbb{E}_{z \sim q_\theta(z \mid x_i)}\big[\log p_\phi(x_i \mid z)\big] - \mathrm{KL}\big(q_\theta(z \mid x_i) \,\|\, p(z)\big)$$
Probabilistic Model Objective
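- Recall the Deep Learning objective derived earlier. We want to minimize:
  $$L(\theta, \phi) = \sum_{i=1}^{N} \Big( -\mathbb{E}_{z \sim q_\theta(z \mid x_i)}\big[\log p_\phi(x_i \mid z)\big] + \mathrm{KL}\big(q_\theta(z \mid x_i) \,\|\, p(z)\big) \Big)$$
- The objective just derived for the Probabilistic Model was to maximize:
  $$\mathrm{ELBO}(\theta, \phi) = \sum_{i=1}^{N} \Big( \mathbb{E}_{z \sim q_\theta(z \mid x_i)}\big[\log p_\phi(x_i \mid z)\big] - \mathrm{KL}\big(q_\theta(z \mid x_i) \,\|\, p(z)\big) \Big)$$
- They are equivalent: $L(\theta, \phi) = -\mathrm{ELBO}(\theta, \phi)$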
Applications - Image generation
A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. arXiv preprint arXiv:1602.02644, 2016.
Applications - Caption generation
Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin. Variational autoencoder for deep learning of images, labels and captions. In NIPS, 2016.
Applications - Semi-/Un-supervised document classification
Z. Yang, Z. Hu, R. Salakhutdinov, and T. Berg-Kirkpatrick. Improved variational autoencoders for text modeling using dilated convolutions. In Proceedings of the 34th International Conference on Machine Learning, 2017.
Applications - Pixel art videogame characters
https://mlexplained.wordpress.com/category/generative-models/vae/
Conclusion
- We derived the same objective from
  - 1) a deep learning point of view, and
  - 2) a probabilistic models point of view
- Showed they are equivalent
- Saw some applications
- Thank you. Questions?