
Transcript of CS7015 (Deep Learning) : Lecture 21 - Variational Autoencoders

Page 1:

1/36

CS7015 (Deep Learning) : Lecture 21
Variational Autoencoders

Mitesh M. Khapra

Department of Computer Science and Engineering
Indian Institute of Technology Madras

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 21

Page 2:

2/36

Acknowledgments

Tutorial on Variational Autoencoders by Carl Doersch

Blog on Variational Autoencoders by Jaan Altosaar


Page 3:

3/36


Page 4:

4/36

Module 21.1: Revisiting Autoencoders


Page 5:

5/36

Figure: Autoencoder with input X, encoder weights W, hidden representation h = g(WX + b), decoder weights W∗, and reconstruction X̂ = f(W∗h + c)

Before we start talking about VAEs, let us quickly revisit autoencoders

An autoencoder contains an encoder which takes the input X and maps it to a hidden representation h

The decoder then takes this hidden representation and tries to reconstruct the input from it as X̂

The training happens using the following objective function

min_{W,W∗,c,b} (1/m) Σ_{i=1}^{m} Σ_{j=1}^{n} (x̂_{ij} − x_{ij})²

where m is the number of training instances {x_i}_{i=1}^{m} and each x_i ∈ R^n (x_ij is thus the j-th dimension of the i-th training instance)
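As a concrete illustration of this objective, here is a minimal numpy sketch of one forward pass and the reconstruction loss; the choice of g (sigmoid), f (identity), and all sizes and weights are illustrative stand-ins, not anything fixed by the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def g(a):  # encoder non-linearity (illustrative choice: sigmoid)
    return 1.0 / (1.0 + np.exp(-a))

def f(a):  # decoder non-linearity (illustrative choice: identity)
    return a

n, d, m = 8, 3, 5                 # input dim, hidden dim, number of instances
W = rng.normal(size=(d, n))       # encoder weights W
W_star = rng.normal(size=(n, d))  # decoder weights W∗
b, c = np.zeros(d), np.zeros(n)   # biases

X = rng.normal(size=(m, n))       # m training instances, each in R^n

H = g(X @ W.T + b)                # h_i = g(W x_i + b), row-wise
X_hat = f(H @ W_star.T + c)       # x̂_i = f(W∗ h_i + c)

# Objective: (1/m) Σ_i Σ_j (x̂_ij − x_ij)²
loss = np.mean(np.sum((X_hat - X) ** 2, axis=1))
```

Training would then minimize this loss with respect to W, W∗, b, c by gradient descent.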


Page 10:

6/36

Figure: Autoencoder with h = g(WX + b) and X̂ = f(W∗h + c)

But where's the fun in this?

We are taking an input and simply reconstructing it

Of course, the fun lies in the fact that we are getting a good abstraction of the input

But RBMs were able to do something more besides abstraction (they were able to do generation)

Let us revisit generation in the context of autoencoders


Page 16:

7/36

Figure: Autoencoder with h = g(WX + b) and X̂ = f(W∗h + c)

Can we do generation with autoencoders?

In other words, once the autoencoder is trained can I remove the encoder, feed a hidden representation h to the decoder and decode an X̂ from it?

In principle, yes! But in practice there is a problem with this approach

h is a very high dimensional vector and only a few vectors in this space would actually correspond to meaningful latent representations of our input

So of all the possible values of h, which values should I feed to the decoder? (we had asked a similar question before: slide 67, bullet 5 of lecture 19)


Page 21:

8/36

Figure: Decoder alone, taking h through weights W∗ to produce X̂ = f(W∗h + c)

Ideally, we should only feed those values of h which are highly likely

In other words, we are interested in sampling from P(h|X) so that we pick only those h's which have a high probability

But unlike RBMs, autoencoders do not have such a probabilistic interpretation

They learn a hidden representation h but not a distribution P(h|X)

Similarly the decoder is also deterministic and does not learn a distribution over X (given an h we can get an X but not P(X|h))
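To make the "deterministic" point concrete: once trained, the decoder is just a fixed function of h, so repeated calls with the same h always give the same X̂, and there is no distribution P(X|h) to sample from. A toy numpy sketch (all weights here are made-up stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 6, 2                       # output dim, hidden dim (toy sizes)
W_star = rng.normal(size=(n, d))  # stand-in decoder weights W∗
c = np.zeros(n)                   # stand-in bias

def decoder(h):
    # X̂ = f(W∗h + c), with f = identity for illustration
    return W_star @ h + c

h = rng.normal(size=d)
x1 = decoder(h)
x2 = decoder(h)
# Same h in, same X̂ out: a function, not a distribution over X
```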


Page 26:

9/36

We will now look at variational autoencoders, which have the same structure as autoencoders but learn a distribution over the hidden variables


Page 27:

10/36

Module 21.2: Variational Autoencoders: The Neural Network Perspective


Page 28:

11/36

Let {x_i}_{i=1}^{N} be the training data

We can think of X as a random variable in R^n

For example, X could be an image and the dimensions of X correspond to pixels of the image

We are interested in learning an abstraction (i.e., given an X, find the hidden representation z)

We are also interested in generation (i.e., given a hidden representation, generate an X)

In probabilistic terms we are interested in P(z|X) and P(X|z) (to be consistent with the literature on VAEs we will use z instead of H and X instead of V)


Page 34:

12/36

Figure: RBM with visible units v_1 … v_m (V ∈ {0, 1}^m, biases b_1 … b_m), hidden units h_1 … h_n (H ∈ {0, 1}^n, biases c_1 … c_n), and weights W ∈ R^{m×n}

Earlier we saw RBMs where we learnt P(z|X) and P(X|z)

Below we list certain characteristics of RBMs

Structural assumptions: We assume certain independencies in the Markov network

Computational: When training with Gibbs sampling we have to run the Markov chain for many time steps, which is expensive

Approximation: When using contrastive divergence, we approximate the expectation by a point estimate

(Nothing wrong with the above but we just mention them to make the reader aware of these characteristics)
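The Gibbs sampling and contrastive divergence points can be made concrete with a toy binary RBM; the sizes and weights below are made up, and only a single v → h → v′ step is shown as a sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy RBM: m visible units, n hidden units (hypothetical sizes)
m, n = 6, 4
W = rng.normal(scale=0.1, size=(m, n))  # W ∈ R^{m×n}
b = np.zeros(m)                         # visible biases
c = np.zeros(n)                         # hidden biases

def gibbs_step(v):
    """One step of block Gibbs sampling: v -> h -> v'."""
    p_h = sigmoid(c + v @ W)                 # P(h_j = 1 | v)
    h = (rng.random(n) < p_h).astype(float)  # sample hidden layer
    p_v = sigmoid(b + W @ h)                 # P(v_i = 1 | h)
    v_new = (rng.random(m) < p_v).astype(float)  # sample visible layer
    return v_new, h, p_h, p_v

v0 = (rng.random(m) < 0.5).astype(float)
v1, h0, p_h, p_v = gibbs_step(v0)
# Running the chain to equilibrium requires many such steps (the
# "Computational" point); CD-1 instead stops after one step and uses
# this single reconstruction as a point estimate of the model
# expectation (the "Approximation" point).
```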


Page 40:

13/36

Figure: VAE. Data X is fed to the encoder Qθ(z|X), which produces the latent z; the decoder Pφ(X|z) produces the reconstruction X̂

θ: the parameters of the encoder neural network
φ: the parameters of the decoder neural network

We now return to our goals

Goal 1: Learn a distribution over the latent variables (Q(z|X))

Goal 2: Learn a distribution over the visible variables (P(X|z))

VAEs use a neural network based encoder for Goal 1 and a neural network based decoder for Goal 2

We will look at the encoder first
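A minimal sketch of this encoder/decoder split, assuming (as the next slide makes explicit) that Qθ(z|X) is a Gaussian whose parameters the encoder predicts; all weights here are random stand-ins for trained networks:

```python
import numpy as np

rng = np.random.default_rng(0)

n, m_lat = 8, 2  # data dim, latent dim (toy sizes)
# Random stand-ins for the trained parameters θ (encoder) and φ (decoder)
W_mu = rng.normal(size=(m_lat, n))
W_logvar = rng.normal(size=(m_lat, n))
W_dec = rng.normal(size=(n, m_lat))

def encoder(x):
    """Q_theta(z|x): returns the parameters (mean, diagonal variance)
    of a Gaussian over z, rather than a single deterministic z."""
    mu = W_mu @ x
    var = np.exp(W_logvar @ x)  # predict log-variance; exponentiate for positivity
    return mu, var

def decoder(z):
    """P_phi(x|z): here just the mean of the output distribution."""
    return W_dec @ z

x = rng.normal(size=n)
mu, var = encoder(x)
z = mu + np.sqrt(var) * rng.normal(size=m_lat)  # sample z ~ Q(z|x)
x_hat = decoder(z)
```

Unlike the plain autoencoder, running the encoder twice on the same x and sampling gives different z's, because the encoder now defines a distribution.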


Page 46: CS7015 (Deep Learning) : Lecture 21 - Variational Autoencoders

14/36

X

z

Qθ(z|X)

Σ, µ

X ∈ Rn, µ ∈ Rm and Σ ∈ Rm×m

Encoder: What do we mean when we say we want to learn a distribution? We mean that we want to learn the parameters of the distribution

But what are the parameters of Q(z|X)? Well, it depends on our modeling assumption!

In VAEs we assume that the latent variables come from a standard normal distribution N(0, I) and the job of the encoder is then to predict the parameters of this distribution

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 21
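The encoder described above can be sketched as a small feed-forward network whose outputs are the parameters of Q(z|X). Below is a minimal NumPy sketch with hypothetical layer sizes; it uses the common simplification of a diagonal Σ, predicted via its log-variance (the slides do not fix these implementation details):

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 784, 20   # input dim (e.g., a 28 x 28 image) and latent dim
H = 128          # hidden width (hypothetical choice)

# Hypothetical weights of a one-hidden-layer encoder; together these
# form the parameters theta, which would be learnt by backpropagation.
W1, b1 = rng.normal(scale=0.01, size=(H, n)), np.zeros(H)
W_mu, b_mu = rng.normal(scale=0.01, size=(m, H)), np.zeros(m)
W_lv, b_lv = rng.normal(scale=0.01, size=(m, H)), np.zeros(m)

def encoder(x):
    """Map an input x to the parameters (mu, diagonal of Sigma) of Q(z|x)."""
    h = np.tanh(W1 @ x + b1)
    mu = W_mu @ h + b_mu
    log_var = W_lv @ h + b_lv      # predicting log-variance keeps Sigma positive
    sigma_diag = np.exp(log_var)   # diagonal entries of Sigma
    return mu, sigma_diag

mu, sigma_diag = encoder(rng.normal(size=n))   # a dummy input
print(mu.shape, sigma_diag.shape)              # → (20,) (20,)
```

Note that the encoder outputs 2m numbers per input, not a sample of z; drawing z from N(µ, Σ) happens afterwards.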

Page 51: CS7015 (Deep Learning) : Lecture 21 - Variational Autoencoders

15/36

Sample

z

Xi

Qθ(z|X)

Σ, µ

Pφ(X|z)

X̂i

Now what about the decoder?

The job of the decoder is to predict a probability distribution over X: P(X|z)

Once again we will assume a certain form for this distribution

For example, if we want to predict 28 × 28 pixels and each pixel belongs to R (i.e., X ∈ R784), then what would be a suitable family for P(X|z)?

We could assume that P(X|z) is a Gaussian distribution with unit variance

The job of the decoder f would then be to predict the mean of this distribution as fφ(z)

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 21
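Under the unit-variance Gaussian assumption, log P(X|z) reduces (up to an additive constant) to the negative squared error between X and the predicted mean fφ(z). A minimal NumPy sketch with a hypothetical one-layer decoder:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 20, 784   # latent and visible dimensions (hypothetical)

# Hypothetical decoder weights; together these form the parameters phi.
W, b = rng.normal(scale=0.01, size=(n, m)), np.zeros(n)

def decoder_mean(z):
    """f_phi(z): the predicted mean of the unit-variance Gaussian P(X|z)."""
    return np.tanh(W @ z + b)

def log_p_x_given_z(x, z):
    """log N(x; f_phi(z), I), dropping the constant -n/2 * log(2*pi)."""
    return -0.5 * np.sum((x - decoder_mean(z)) ** 2)

z = rng.normal(size=m)
x = rng.normal(size=n)
# Maximizing this log-likelihood is the same as minimizing the squared
# reconstruction error ||x - f_phi(z)||^2, so it is always <= 0 here.
print(log_p_x_given_z(x, z) <= 0)   # → True
```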

Page 57: CS7015 (Deep Learning) : Lecture 21 - Variational Autoencoders

16/36

Sample

z

Xi

Qθ(z|X)

Σ, µ

Pφ(X|z)

X̂i

What would be the objective function of the decoder?

For any given training sample xi it should maximize P(xi), given by

P(xi) = ∫ P(z)P(xi|z) dz

Since this integral is intractable, we instead minimize the negative log-likelihood with z drawn from the encoder's distribution:

li(θ) = −Ez∼Qθ(z|xi)[logPφ(xi|z)]

(As usual, we take log for numerical stability)

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 21
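The expectation over z ∼ Qθ(z|xi) has no closed form for a neural decoder, so it is approximated by sampling z from the encoder's distribution. A toy NumPy sketch of this Monte Carlo estimate (the decoder here is a hypothetical parameter-free stand-in whose mean is z itself, not the lecture's fφ):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 2   # latent dimension (hypothetical)

def log_p_x_given_z(x, z):
    """Toy stand-in for log P_phi(x|z): a unit-variance Gaussian
    whose mean is z itself (a hypothetical, parameter-free decoder)."""
    return -0.5 * np.sum((x - z) ** 2)

def neg_expected_log_lik(x, mu, sigma_diag, num_samples=1000):
    """Monte Carlo estimate of -E_{z ~ Q(z|x)}[log P(x|z)],
    drawing z from the encoder's Gaussian N(mu, diag(sigma_diag))."""
    total = 0.0
    for _ in range(num_samples):
        z = mu + np.sqrt(sigma_diag) * rng.normal(size=m)
        total += log_p_x_given_z(x, z)
    return -total / num_samples

x = np.array([0.5, -0.5])
loss = neg_expected_log_lik(x, mu=np.zeros(m), sigma_diag=np.ones(m))
print(loss > 0)   # → True: a negative log-likelihood to be minimized
```

In practice VAEs use a single sample of z per training step, made differentiable with respect to θ via the reparameterization trick.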

Page 61: CS7015 (Deep Learning) : Lecture 21 - Variational Autoencoders

17/36

Sample

z

Xi

Qθ(z|X)

Σ, µ

Pφ(X|z)

X̂i

KL divergence captures the difference (or distance) between 2 distributions

This is the loss function for one data point (li(θ)) and we will just sum over all the data points to get the total loss L(θ)

L(θ) = ∑_{i=1}^{m} li(θ)

In addition, we also want a constraint on the distribution over the latent variables

Specifically, we had assumed P(z) to be N(0, I) and we want Q(z|X) to be as close to P(z) as possible

Thus, we will modify the loss function such that

li(θ, φ) = −Ez∼Qθ(z|xi)[logPφ(xi|z)] + KL(Qθ(z|xi)||P(z))

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 21
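Since Qθ(z|xi) is a Gaussian whose parameters the encoder predicts and P(z) = N(0, I), the KL term needs no sampling: for a diagonal covariance it has the closed form KL = 0.5 ∑_j (σj² + µj² − 1 − log σj²). A minimal sketch, assuming the common log-variance parameterization (which the slides do not spell out):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), computed in closed
    form as 0.5 * sum( exp(log_var) + mu^2 - 1 - log_var )."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

# The penalty vanishes exactly when Q(z|x) already equals N(0, I) ...
print(kl_to_standard_normal(np.zeros(3), np.zeros(3)))   # → 0.0
# ... and grows as the encoder pushes Q(z|x) away from the prior.
print(kl_to_standard_normal(np.ones(3), np.zeros(3)))    # → 1.5
```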

Page 66: CS7015 (Deep Learning) : Lecture 21 - Variational Autoencoders

18/36

Sample

z

Xi

Qθ(z|X)

Σ, µ

Pφ(X|z)

X̂i

li(θ, φ) = −Ez∼Qθ(z|xi)[logPφ(xi|z)] + KL(Qθ(z|xi)||P(z))

The second term in the loss function can actually be thought of as a regularizer

It ensures that the encoder does not cheat by mapping each xi to a different point (a normal distribution with very low variance) in the Euclidean space

In other words, in the absence of the regularizer the encoder could learn a unique mapping for each xi and the decoder could then decode from this unique mapping

Even with high variance in samples from the distribution, we want the decoder to be able to reconstruct the original data very well (the motivation is similar to adding noise)

To summarize, for each data point we predict a distribution such that, with high probability, a sample from this distribution should be able to reconstruct the original data point

But why do we choose a normal distribution? Isn't it too simplistic to assume that z follows a normal distribution?

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 21

Page 72: CS7015 (Deep Learning) : Lecture 21 - Variational Autoencoders

19/36

li(θ, φ) = −Ez∼Qθ(z|xi)[logPφ(xi|z)] + KL(Qθ(z|xi)||P(z))

Isn't it a very strong assumption that P(z) ∼ N(0, I)?

For example, in the 2-dimensional case how can we be sure that P(z) is a normal distribution and not any other distribution?

The key insight here is that any distribution in d dimensions can be generated by the following steps

Step 1: Start with a set of d variables that are normally distributed (that's exactly what we are assuming for P(z))

Step 2: Map these variables through a sufficiently complex function (that's exactly what the first few layers of the decoder can do)

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 21

Page 77: CS7015 (Deep Learning) : Lecture 21 - Variational Autoencoders

20/36

li(θ, φ) = −Ez∼Qθ(z|xi)[logPφ(xi|z)] + KL(Qθ(z|xi)||P(z))

In particular, note that in the adjoining example if z is 2-D and normally distributed then f(z) is roughly ring shaped (giving us the distribution in the bottom figure)

f(z) = z/10 + z/||z||

A non-linear neural network, such as the one we use for the decoder, could learn a complex mapping from z to fφ(z) using its parameters φ

The initial layers of a non-linear decoder could learn their weights such that the output is fφ(z)

The above argument suggests that even if we start with normally distributed variables, the initial layers of the decoder could learn a complex transformation of these variables, say fφ(z), if required

The objective function of the decoder will ensure that an appropriate transformation of z is learnt to reconstruct X

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 21
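The ring-shaped example is easy to verify numerically: pushing 2-D standard-normal samples through f(z) = z/10 + z/||z|| concentrates them near the unit circle. A short NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw 2-D standard-normal samples and apply f(z) = z/10 + z/||z||,
# the example transformation from the slide.
z = rng.normal(size=(10000, 2))
norms = np.linalg.norm(z, axis=1, keepdims=True)
fz = z / 10 + z / norms

# Since ||f(z)|| = 1 + ||z||/10, the transformed samples form a ring
# of radius close to 1.
radii = np.linalg.norm(fz, axis=1)
print(round(radii.mean(), 2))               # → 1.13
print((np.abs(radii - 1) < 0.3).mean())     # most points lie near the ring
```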

Page 82: CS7015 (Deep Learning) : Lecture 21 - Variational Autoencoders

21/36

Module 21.3: Variational autoencoders: (The graphical model perspective)

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 21

Page 83: CS7015 (Deep Learning) : Lecture 21 - Variational Autoencodersmiteshk/CS7015/Slides/... · den representation hto the decoder and de-code a X^ from it ? In principle, yes! But in

22/36

X

z

N

Here we can think of z and X as random vari-ables

We are then interested in the joint prob-ability distribution P (X, z) which factorizesas P (X, z) = P (z)P (X|z)This factorization is natural because we canimagine that the latent variables are fixed firstand then the visible variables are drawn basedon the latent variables

For example, if we want to draw a digit wecould first fix the latent variables: the digit,size, angle, thickness, position and so on andthen draw a digit which corresponds to theselatent variables

And of course, unlike RBMs, this is a directed graphical model


Page 88: CS7015 (Deep Learning) : Lecture 21

23/36

[Figure: plate-notation graphical model with latent variable z generating observed variable X, repeated N times]

Now, at inference time, we are given an X (observed variable) and we are interested in finding the most likely assignments of the latent variables z which would have resulted in this observation

Mathematically, we want to find

P(z|X) = P(X|z)P(z) / P(X)

This is hard to compute because the denominator P(X) is intractable

P(X) = ∫ P(X|z)P(z) dz = ∫∫ ... ∫ P(X|z1, z2, ..., zn)P(z1, z2, ..., zn) dz1 ... dzn

In RBMs, we had a similar integral which we approximated using Gibbs Sampling

VAEs, on the other hand, cast this into an optimization problem and learn the parameters of the optimization problem
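To see concretely why this integral is troublesome, here is a toy sketch of our own construction (a 1-D model, not from the lecture): estimating P(X) = E_{z∼P(z)}[P(X|z)] by sampling z from the prior is possible in principle, but almost all prior samples explain the observed X poorly and contribute essentially nothing to the estimate.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: z ~ N(0, 1), X | z ~ N(z, 0.1^2); we observe X = 3.0
x_obs, sigma = 3.0, 0.1

def lik(x, z, s=sigma):
    # Gaussian density P(X = x | z)
    return np.exp(-0.5 * ((x - z) / s) ** 2) / (s * np.sqrt(2 * np.pi))

# Naive Monte Carlo estimate of P(X) using samples from the prior
z = rng.standard_normal(100_000)
weights = lik(x_obs, z)
p_x_hat = weights.mean()

# Almost all prior samples contribute ~0: only the rare z near 3 matter
frac_useful = (weights > 1e-6).mean()
print(p_x_hat, frac_useful)
```

This waste is exactly what motivates learning a proposal Q(z|X) concentrated on the z's that could have produced X.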


Page 93: CS7015 (Deep Learning) : Lecture 21

24/36

[Figure: plate-notation graphical model with latent variable z generating observed variable X, repeated N times]

Specifically, in VAEs, we assume that instead of P(z|X), which is intractable, the posterior distribution is given by Qθ(z|X)

Further, we assume that Qθ(z|X) is a Gaussian whose parameters are determined by a neural network: µ, Σ = gθ(X)

The parameters of the distribution are thus determined by the parameters θ of a neural network

Our job then is to learn the parameters of thisneural network
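A minimal sketch of such an encoder, assuming a one-hidden-layer network and a diagonal Σ (the shapes, initializations, and the log-variance parameterization here are illustrative choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical shapes: 784-dim input (e.g. a flattened image), 2-D latent
d_in, d_hid, d_z = 784, 64, 2

# Randomly initialised encoder weights (theta); training would tune these
W1 = rng.standard_normal((d_hid, d_in)) * 0.01
b1 = np.zeros(d_hid)
W_mu  = rng.standard_normal((d_z, d_hid)) * 0.01
W_log = rng.standard_normal((d_z, d_hid)) * 0.01

def g_theta(x):
    """Encoder: maps X to the parameters (mu, Sigma) of Q_theta(z|X).
    Sigma is diagonal; the network outputs its log so it stays positive."""
    h = np.tanh(W1 @ x + b1)
    mu = W_mu @ h
    sigma = np.exp(W_log @ h)   # diagonal entries of Sigma
    return mu, sigma

x = rng.standard_normal(d_in)
mu, sigma = g_theta(x)
print(mu.shape, sigma.shape)
```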


Page 97: CS7015 (Deep Learning) : Lecture 21

25/36

[Figure: plate-notation graphical model with latent variable z generating observed variable X, repeated N times]

But what is the objective function for this neural network?

Well, we want the proposed distribution Qθ(z|X) to be as close as possible to the true distribution P(z|X)

We can capture this using the following objective function:

minimize KL(Qθ(z|X) || P(z|X))

What are the parameters of the objective function? (they are the parameters of the neural network - we will return to this again)


Page 102: CS7015 (Deep Learning) : Lecture 21

26/36

Let us expand the KL divergence term

D[Qθ(z|X) || P(z|X)] = ∫ Qθ(z|X) log Qθ(z|X) dz − ∫ Qθ(z|X) log P(z|X) dz

= Ez∼Qθ(z|X)[log Qθ(z|X) − log P(z|X)]

For shorthand we will use EQ = Ez∼Qθ(z|X)

Substituting P(z|X) = P(X|z)P(z)/P(X), we get

D[Qθ(z|X) || P(z|X)] = EQ[log Qθ(z|X) − log P(X|z) − log P(z) + log P(X)]

= EQ[log Qθ(z|X) − log P(z)] − EQ[log P(X|z)] + log P(X)

= D[Qθ(z|X) || P(z)] − EQ[log P(X|z)] + log P(X)

∴ log P(X) = EQ[log P(X|z)] − D[Qθ(z|X) || P(z)] + D[Qθ(z|X) || P(z|X)]


Page 111: CS7015 (Deep Learning) : Lecture 21

27/36

So, we have

log P(X) = EQ[log P(X|z)] − D[Qθ(z|X) || P(z)] + D[Qθ(z|X) || P(z|X)]

Recall that we are interested in maximizing the log likelihood of the data, i.e., log P(X)

Since the KL divergence (the red term) is always ≥ 0, we can say that

EQ[log P(X|z)] − D[Qθ(z|X) || P(z)] ≤ log P(X)

The quantity on the LHS is thus a lower bound on the quantity that we want to maximize and is known as the Evidence Lower BOund (ELBO)

Maximizing this lower bound is the same as maximizing log P(X), and hence our equivalent objective now becomes

maximize EQ[log P(X|z)] − D[Qθ(z|X) || P(z)]

And this method of learning the parameters of probability distributions associated with graphical models using optimization (by maximizing the ELBO) is called variational inference

Why is this any easier? It is easier because of certain assumptions that we make, as discussed on the next slide
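The decomposition and the resulting bound can be verified numerically in a small discrete case. The toy distributions below (z taking three values) are our own construction, chosen only to make the identity log P(X) = ELBO + KL(Q || P(z|X)) checkable by direct summation:

```python
import numpy as np

# Discrete toy: fix a joint over (X = x0, z) and an arbitrary proposal Q(z|x0)
p_z = np.array([0.5, 0.3, 0.2])          # prior P(z)
p_x_given_z = np.array([0.9, 0.2, 0.4])  # P(X = x0 | z)
q = np.array([0.7, 0.2, 0.1])            # some proposal Q(z | x0)

p_x = np.sum(p_x_given_z * p_z)          # evidence P(x0)
p_z_given_x = p_x_given_z * p_z / p_x    # true posterior P(z | x0)

def kl(a, b):
    # KL divergence between two discrete distributions
    return np.sum(a * np.log(a / b))

# ELBO = E_Q[log P(x0|z)] - KL(Q || P(z))
elbo = np.sum(q * np.log(p_x_given_z)) - kl(q, p_z)

# Identity: log P(x0) = ELBO + KL(Q || P(z|x0)), and hence ELBO <= log P(x0)
print(np.log(p_x), elbo)
```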


Page 118: CS7015 (Deep Learning) : Lecture 21

28/36

[Figure: the encoder Qθ(z|X) maps Xi to µ and Σ, a z is sampled from this distribution, and the decoder Pφ(X|z) produces the reconstruction X̂i]

First, we will just reintroduce the parameters in the equation to make things explicit

maximize EQ[log Pφ(X|z)] − D[Qθ(z|X) || P(z)]

At training time, we are interested in learning the parameters θ which maximize the above for every training example xi ∈ {x1, x2, ..., xN}

So our total objective function is

maximize (over θ)  Σ (i = 1 to N)  [ EQ[log Pφ(X = xi|z)] − D[Qθ(z|X = xi) || P(z)] ]

We will shorthand P (X = xi) as P (xi)

However, we will assume that we are using stochastic gradient descent, so we need to deal with only the one term in the summation corresponding to the current training example
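The per-example objective can be sketched numerically. The sketch below assumes a diagonal Gaussian Qθ, a linear stand-in for the decoder, and a Gaussian likelihood with unit observation noise (all illustrative choices of ours, not stated on the slide); the expectation EQ is replaced by a single sample of z, as is common in practice:

```python
import numpy as np

rng = np.random.default_rng(4)

# One training example (hypothetical 4-dim x, 2-dim z)
x_i = np.array([0.2, -0.1, 0.5, 0.3])

# Pretend the encoder produced these for x_i (in practice mu, Sigma = g_theta(x_i))
mu_q, var_q = np.array([0.3, -0.4]), np.array([0.5, 0.9])

z = mu_q + np.sqrt(var_q) * rng.standard_normal(2)   # one sample from Q

# Stand-in decoder: maps z to a mean x_hat, unit-variance Gaussian likelihood
W_dec = rng.standard_normal((4, 2)) * 0.5
x_hat = W_dec @ z
log_p_x_given_z = -0.5 * (np.sum((x_i - x_hat) ** 2) + 4 * np.log(2 * np.pi))

# Closed-form KL between diagonal N(mu_q, var_q) and N(0, I), k = 2
kl = 0.5 * (var_q.sum() + mu_q @ mu_q - 2 - np.log(var_q).sum())

# One-sample estimate of the per-example ELBO
elbo_i = log_p_x_given_z - kl
print(elbo_i)
```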


Page 123: CS7015 (Deep Learning) : Lecture 21 - Variational Autoencodersmiteshk/CS7015/Slides/... · den representation hto the decoder and de-code a X^ from it ? In principle, yes! But in


So our objective function w.r.t. one example is

maximize_θ  EQ[log Pφ(xi|z)] − D[Qθ(z|xi) || P(z)]

Now, first we will do a forward prop through the encoder using Xi and compute µ(X) and Σ(X)

The second term in the above objective function is the KL divergence between two normal distributions, N(µ(X), Σ(X)) and N(0, I)

With some simple trickery you can show that this term reduces to the following closed-form expression:

D[N(µ(X), Σ(X)) || N(0, I)] = 1/2 ( tr(Σ(X)) + (µ(X))^T µ(X) − k − log det(Σ(X)) )

where k is the dimensionality of the latent variables

This term can be computed easily because we have already computed µ(X) and Σ(X) in the forward pass
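The closed-form KL term above is cheap to evaluate once µ(X) and Σ(X) are available. A minimal sketch, assuming a diagonal covariance (the usual VAE choice); the function name is ours, not from the lecture:

```python
import math

def kl_vs_standard_normal(mu, sigma2):
    """KL( N(mu, diag(sigma2)) || N(0, I) ).

    Implements 1/2 * ( tr(Sigma) + mu^T mu - k - log det(Sigma) )
    for a diagonal Sigma, where sigma2 holds the diagonal variances.
    """
    k = len(mu)
    trace = sum(sigma2)                         # tr(Sigma)
    sq_norm = sum(m * m for m in mu)            # mu^T mu
    log_det = sum(math.log(s) for s in sigma2)  # log det(Sigma)
    return 0.5 * (trace + sq_norm - k - log_det)

print(kl_vs_standard_normal([0.0, 0.0], [1.0, 1.0]))  # 0.0: the two distributions coincide
```

As a sanity check, the divergence is zero exactly when µ = 0 and Σ = I, and positive otherwise.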



Now let us look at the other term in the objective function:

Σ_{i=1}^{N} EQ[log Pφ(X|z)]

This is again an expectation and hence intractable (it involves an integral over z)

In VAEs, we approximate this expectation with a single z sampled from N(µ(X), Σ(X))

Hence this term is also easy to compute (of course it is a nasty approximation, but we will live with it!)



Further, as usual, we need to assume some parametric form for P(X|z)

For example, if we assume that P(X|z) is a Gaussian with mean µ(z) and variance I, then

log P(X = Xi|z) = C − 1/2 ||Xi − µ(z)||^2

µ(z) in turn is a function of the parameters of the decoder and can be written as fφ(z):

log P(X = Xi|z) = C − 1/2 ||Xi − fφ(z)||^2

Our effective objective function thus becomes

minimize_{θ,φ}  Σ_{i=1}^{N} [ 1/2 ( tr(Σ(Xi)) + (µ(Xi))^T µ(Xi) − k − log det(Σ(Xi)) ) + ||Xi − fφ(z)||^2 ]
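Putting the two pieces together, the per-example loss can be sketched as below. This is a minimal illustration, assuming a diagonal Σ(Xi) and the Gaussian decoder above; the names (vae_loss, mu, sigma2, x_hat) are ours, not from the lecture:

```python
import math

def vae_loss(x, mu, sigma2, x_hat):
    """Per-example VAE objective (to be minimized):
    KL regularizer + squared-error reconstruction term.

    x      : input Xi (list of floats)
    mu     : encoder mean mu(Xi); sigma2: diagonal of Sigma(Xi)
    x_hat  : decoder output f_phi(z) for a z sampled from the encoder
    """
    k = len(mu)
    kl = 0.5 * (sum(sigma2) + sum(m * m for m in mu)
                - k - sum(math.log(s) for s in sigma2))
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return kl + reconstruction

# A perfect reconstruction with a standard-normal posterior gives zero loss
print(vae_loss([1.0, 2.0], [0.0, 0.0], [1.0, 1.0], [1.0, 2.0]))  # 0.0
```

The constant C and the factor 1/2 on the reconstruction term are absorbed, as is common, since they do not change the minimizer's location (only its scale).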



The above loss can be easily computed and we can update the parameters θ of the encoder and φ of the decoder using backpropagation

However, there is a catch!

The network is not end-to-end differentiable, because the output fφ(z) is not a differentiable function of the input X

Why? Because after passing X through the network we simply compute µ(X) and Σ(X) and then sample a z to be fed to the decoder

This sampling step makes the entire process non-deterministic, and hence fφ(z) is not a continuous function of the input X



VAEs use a neat trick to get around this problem

This is known as the reparameterization trick, wherein we move the process of sampling to an input layer

For the 1-dimensional case, given µ and σ, we can sample from N(µ, σ) by first sampling ε ∼ N(0, 1) and then computing

z = µ + σ · ε

[Figure: the original network samples z inside the network, whereas the reparameterized network takes ε ∼ N(0, I) as an extra input and computes z = µ + Σ · ε deterministically]

The adjacent figure shows the difference between the original network and the reparameterized network

The randomness in fφ(z) is now associated with ε and not with X or the parameters of the model
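The trick itself is a one-liner. A sketch for the 1-dimensional case (the function name and the use of Python's random module are our choices, not the lecture's):

```python
import random

def reparameterize(mu, sigma, rng):
    """Draw z ~ N(mu, sigma^2) as z = mu + sigma * eps with eps ~ N(0, 1).

    For a fixed eps, z is a deterministic function of mu and sigma,
    so gradients can flow through mu and sigma; the randomness lives
    entirely in the input eps.
    """
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * eps

rng = random.Random(0)
print(reparameterize(10.0, 0.0, rng))  # 10.0: with sigma = 0, z is exactly mu
```

In a deep-learning framework, mu and sigma would be network outputs carrying gradients, while eps is sampled outside the computation graph.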


Data: {Xi}, i = 1, ..., N

Model: X̂ = fφ(µ(X) + Σ(X) · ε)

Parameters: θ, φ

Algorithm: gradient descent

Objective:

minimize_{θ,φ}  Σ_{i=1}^{N} [ 1/2 ( tr(Σ(Xi)) + (µ(Xi))^T µ(Xi) − k − log det(Σ(Xi)) ) + ||Xi − fφ(z)||^2 ]

With that we are done with the process of training VAEs

Specifically, we have described the data, model, parameters, objective function and learning algorithm

Now what happens at test time? We need to consider both abstraction and generation

In other words, we are interested in computing a z given an X as well as in generating an X given a z

Let us look at each of these goals


Abstraction

After the model parameters are learned, we feed an X to the encoder

By doing a forward pass using the learned parameters of the model we compute µ(X) and Σ(X)

We then sample a z from the distribution N(µ(X), Σ(X)), either directly or using the same reparameterization trick

In other words, once we have obtained µ(X) and Σ(X), we first sample ε ∼ N(0, I) and then compute

z = µ + σ · ε
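At test time, abstraction is thus just a forward pass plus one reparameterized sample, now in the multi-dimensional case. A sketch assuming hypothetical encoder outputs mu and sigma (per-dimension standard deviations) given as lists:

```python
import random

def encode_sample(mu, sigma, rng):
    """Given encoder outputs mu(X) and sigma(X) (per-dimension std),
    return one latent code z via the reparameterization trick:
    z_j = mu_j + sigma_j * eps_j with each eps_j ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

z = encode_sample([0.5, -0.5], [0.0, 0.0], random.Random(0))
print(z)  # [0.5, -0.5]: zero variance collapses z onto the mean
```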


Generation

After the model parameters are learned, we remove the encoder and feed a z ∼ N(0, I) to the decoder

The decoder will then predict fφ(z) and we can draw an X ∼ N(fφ(z), I)

Why would this work?

Well, we had trained the model to minimize D(Qθ(z|X) || p(z)), where p(z) was N(0, I)

If the model is trained well, then Qθ(z|X) should also become close to N(0, I)

Hence, if we feed in a z ∼ N(0, I), it is almost as if we are feeding in a z ∼ Qθ(z|X), and the decoder was indeed trained to produce a good fφ(z) from such a z

Hence this will work!
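Generation then reduces to sampling z from the prior and decoding it. A sketch, with a toy lambda standing in for the trained decoder fφ (the real decoder would be the learned network):

```python
import random

def generate(decoder, latent_dim, rng):
    """Sample z ~ N(0, I) and decode it. 'decoder' stands in for the
    trained f_phi: any function from a latent_dim-vector to an X."""
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    return decoder(z)

# Toy decoder: a fixed affine map, just to exercise the pipeline
x = generate(lambda z: [2.0 * v + 1.0 for v in z], 3, random.Random(0))
print(len(x))  # 3
```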


Generation

After the model parameters are learned we re-move the encoder and feed a z ∼ N (0, I) tothe decoder

The decoder will then predict fφ(z) and wecan draw an X ∼ N (fφ(z), I)

Why would this work ?

Well, we had trained the model to minimizeD(Qθ(z|X)||p(z)) where p(z) was N (0, I)

If the model is trained well then Qθ(z|X)should also become N (0, I)

Hence, if we feed z ∼ N (0, I), it is almostas if we are feeding a z ∼ Qθ(z|X) and thedecoder was indeed trained to produce a goodfφ(z) from such a z

Hence this will work !

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 21
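The generation procedure described above — discard the encoder, sample z ∼ N (0, I), decode to fφ(z), then draw X ∼ N (fφ(z), I) — can be sketched in a few lines. This is a minimal illustration, not the lecture's actual model: the decoder here is a single random tanh layer standing in for a trained network, and the names `f_phi`, `W`, `c`, `latent_dim`, and `data_dim` are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and decoder parameters. In a real VAE these weights come
# from training; here they are random placeholders, just to show the flow.
latent_dim, data_dim = 2, 4
W = rng.normal(size=(data_dim, latent_dim))
c = rng.normal(size=data_dim)

def f_phi(z):
    # Decoder mean f_phi(z): one tanh layer standing in for the trained network.
    return np.tanh(W @ z + c)

# Generation: the encoder is discarded; sample z ~ N(0, I) ...
z = rng.standard_normal(latent_dim)
# ... decode it, then draw X ~ N(f_phi(z), I) around the decoded mean.
X = f_phi(z) + rng.standard_normal(data_dim)
```

Note that sampling z from the prior N (0, I) is justified precisely by the KL term in training: it pushed Qθ(z|X) toward the prior, so prior samples land in the region the decoder has learned to decode.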