
From Autoencoder to Variational Autoencoder

Hao Dong

Peking University

1


• Vanilla Autoencoder
• Denoising Autoencoder
• Sparse Autoencoder
• Contractive Autoencoder
• Stacked Autoencoder
• Variational Autoencoder (VAE)

2

From Autoencoder to Variational Autoencoder

Feature Representation

Distribution Representation

Video: https://www.youtube.com/watch?v=xH1mBw3tb_c&list=PLe5rNUydzV9QHe8VDStpU0o8Yp63OecdW&index=4&pbjreload=10



3


4

Vanilla Autoencoder

• What is it?

Reconstruct high-dimensional data using a neural network with a narrow bottleneck layer.

The bottleneck layer captures the compressed latent code, so a nice by-product is dimensionality reduction.

The low-dimensional representation can be used as the representation of the data in various applications, e.g., image retrieval, data compression …

[Figure: input x → latent z → reconstruction x̂]


Latent code: the compressed, low-dimensional representation of the input data

5

Vanilla Autoencoder

• How does it work?

[Figure: input x → encoder (X → Z) → latent z → decoder/generator (Z → X) → reconstructed input x̂]

Ideally, the input and the reconstruction are identical.

The encoder network performs dimension reduction, just like PCA.


6

Vanilla Autoencoder

• Training

[Figure: fully connected autoencoder: input layer (x1 … x6) → hidden layer (a1 … a4) → output layer (x̂1 … x̂6); the first half is the Encoder, the second half the Decoder]

Given M data samples, the distance between the input and its reconstruction can be measured by the Mean Squared Error (MSE), averaging over the M samples (and optionally over the n input variables):

ℒ = (1/M) Σ_{i=1}^{M} ||x_i − G(E(x_i))||²

where E is the encoder and G is the decoder.

• The hidden units are usually fewer than the inputs: dimension reduction, i.e., representation learning.
• The autoencoder learns an approximation to the identity function, so that the input is "compressed" into the latent features, discovering interesting structure in the data.
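The MSE loss above can be sketched in a few lines of numpy. This is an illustrative check of the formula only: the encoder E and decoder G are stood in by a precomputed reconstruction `X_hat`, the network itself is not implemented here.

```python
import numpy as np

def mse_reconstruction_loss(X, X_hat):
    """L = (1/M) * sum_i ||x_i - x_hat_i||^2 over M samples."""
    X = np.asarray(X, dtype=float)
    X_hat = np.asarray(X_hat, dtype=float)
    M = X.shape[0]
    return np.sum((X - X_hat) ** 2) / M

# Toy check: two 3-dimensional samples, each with squared error 1.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 0.0]])
X_hat = np.array([[1.0, 1.0, 2.0],
                  [0.0, 0.0, 0.0]])
print(mse_reconstruction_loss(X, X_hat))  # -> 1.0
```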


7

Vanilla Autoencoder

• Testing / Inference

[Figure: encoder only at test time: input layer (x1 … x6) → hidden layer (a1 … a4); the hidden activations are the extracted features]

• An autoencoder is an unsupervised learning method if we consider the latent code as the "output".

• An autoencoder is also a self-supervised (self-taught) learning method: a type of supervised learning where the training labels are derived from the input data.

• Word2Vec (from the RNN lecture) is another unsupervised, self-taught learning example.

[Figure: autoencoder on the MNIST dataset (28×28×1, 784 pixels): x → Encoder → x̂]


8

Vanilla Autoencoder

• Example: compress MNIST (28×28×1) down to a latent code with only 2 variables

[Figure: the reconstructions are lossy]


9

Vanilla Autoencoder

• Power of the Latent Representation
• t-SNE visualization on MNIST: PCA vs. Autoencoder

[Figure: PCA vs. Autoencoder (the winner); from the 2006 Science paper by Hinton and Salakhutdinov]


10

Vanilla Autoencoder

• Discussion

• The hidden layer is overcomplete if it has more units than the input layer


11

Vanilla Autoencoder

• Discussion

• The hidden layer is overcomplete if it has more units than the input layer

• No compression

• No guarantee that the hidden units extract meaningful features



12


13

Denoising Autoencoder (DAE)

• Why?
• Avoid overfitting
• Learn robust representations


14

Denoising Autoencoder

• Architecture

[Figure: corrupted input (x1 … x6, with some entries randomly dropped) → hidden layer (a1 … a4) → output layer (x̂1 … x̂6); Encoder then Decoder]

Applying dropout between the input and the first hidden layer:

• Improves robustness


15

Denoising Autoencoder

• Feature Visualization: visualizing the learned features

[Figure: the weight vector of each hidden neuron, reshaped to the input image size; one neuron == one feature extractor]


16

Denoising Autoencoder

• Denoising Autoencoder & Dropout

The denoising autoencoder was proposed in 2008, four years before the dropout paper (Hinton et al., 2012).

A denoising autoencoder can be seen as applying dropout between the input and the first layer.

A denoising autoencoder can also be seen as a type of data augmentation on the input.
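The corruption step can be sketched as masking noise on the input. A minimal numpy illustration; the function name `corrupt` and the drop rate of 0.3 are illustrative choices, not from the slides. The training target stays the clean input, so the network must learn to restore the zeroed entries.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, drop_prob=0.3, rng=rng):
    """Masking noise: randomly zero out a fraction of the input entries,
    as if applying dropout between the input and the first hidden layer."""
    mask = rng.random(x.shape) >= drop_prob
    return x * mask

x = np.ones(10)
x_noisy = corrupt(x)
print(x_noisy)  # some entries are zeroed; the autoencoder must restore them
```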



17


18

Sparse Autoencoder

• Why?

• Even when the number of hidden units is large (perhaps even greater than the number of input pixels), we can still discover interesting structure by imposing other constraints on the network.

• In particular, if we impose a "sparsity" constraint on the hidden units, the autoencoder will still discover interesting structure in the data, even if the number of hidden units is large.

[Figure: encoder with Sigmoid hidden activations; example outputs 0.02 "inactive", 0.97 "active", 0.01 "inactive", 0.98 "active"]


19

Sparse Autoencoder

• Recap: KL Divergence

Smaller KL divergence == the two distributions are closer


20

Sparse Autoencoder

• Sparsity Regularization

[Figure: encoder with Sigmoid hidden activations, e.g., 0.02 "inactive", 0.97 "active", 0.01 "inactive", 0.98 "active"]

Given M data samples (the batch size) and the Sigmoid activation function, the active ratio of hidden neuron j is:

ρ̂_j = (1/M) Σ_{i=1}^{M} a_j^{(i)}

To make the output "sparse", we would like to enforce the constraint ρ̂_j = ρ, where ρ is a "sparsity parameter", e.g., 0.2 (each neuron should be active for roughly 20% of the samples).

The penalty term is as follows, where s is the number of hidden activation outputs:

ℒ_ρ = Σ_{j=1}^{s} KL(ρ || ρ̂_j) = Σ_{j=1}^{s} [ ρ log(ρ/ρ̂_j) + (1 − ρ) log((1 − ρ)/(1 − ρ̂_j)) ]

The total loss:

ℒ_total = ℒ_MSE + λ ℒ_ρ

The number of hidden units can be greater than the number of input variables.
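The sparsity penalty above can be checked numerically. A small numpy sketch, where `A` is a hypothetical batch of sigmoid activations; a unit whose mean activation already equals ρ contributes nothing, while an overly active unit is penalized.

```python
import numpy as np

def sparsity_penalty(A, rho=0.2):
    """Sum over hidden units j of KL(rho || rho_hat_j).

    A: (M, s) sigmoid activations for M samples and s hidden units.
    rho_hat_j is the mean activation (active ratio) of unit j over the batch.
    """
    rho_hat = A.mean(axis=0)
    kl = rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
    return kl.sum()

A = np.array([[0.2, 0.9],
              [0.2, 0.7]])   # unit 0 matches rho = 0.2, unit 1 is too active
print(sparsity_penalty(A))   # penalty comes only from the overly active unit 1
```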


21

Sparse Autoencoder

• Sparsity Regularization: smaller ρ == more sparse

[Figure: autoencoders on the MNIST dataset: input x, autoencoder reconstruction x̂, sparse autoencoder reconstruction x̂]


22

Sparse Autoencoder

• Different regularization losses (ℒ_ρ is applied to the hidden activation output a)

Method   | Hidden Activation | Reconstruction Activation | Loss Function
Method 1 | Sigmoid           | Sigmoid                   | ℒ_total = ℒ_MSE + ℒ_ρ
Method 2 | ReLU              | Softplus                  | ℒ_total = ℒ_MSE + ||a||₁


23

Sparse Autoencoder

• Sparse Autoencoder vs. Denoising Autoencoder

[Figure: feature extractors of the sparse autoencoder vs. feature extractors of the denoising autoencoder]


24

Sparse Autoencoder

• Autoencoder vs. Denoising Autoencoder vs. Sparse Autoencoder

[Figure: autoencoders on the MNIST dataset: input x, autoencoder reconstruction x̂, sparse autoencoder reconstruction x̂, denoising autoencoder reconstruction x̂]



25


26

Contractive Autoencoder

• Why?

• The denoising autoencoder and the sparse autoencoder address the overcomplete problem via the input layer and the hidden layer, respectively.

• Could we add an explicit term to the loss to avoid uninteresting features?

We want features that ONLY reflect variations observed in the training set.

https://www.youtube.com/watch?v=79sYlJ8Cvlc


27

Contractive Autoencoder

• How?

• Penalize representations that are too sensitive to the input
• Improve robustness to small perturbations
• Measure the sensitivity by the Frobenius norm of the Jacobian matrix of the encoder activations


Contractive Autoencoder

• Recap: Jacobian Matrix

For a mapping x = f(z) with z = (z1, z2) and x = (x1, x2):

J_f = [ ∂x1/∂z1  ∂x1/∂z2 ; ∂x2/∂z1  ∂x2/∂z2 ]

and for the inverse mapping z = f⁻¹(x):

J_f⁻¹ = [ ∂z1/∂x1  ∂z1/∂x2 ; ∂z2/∂x1  ∂z2/∂x2 ]

Example: x = f(z) = (z1 + z2, 2·z1), so J_f = [ 1  1 ; 2  0 ]. The inverse is z = f⁻¹(x) = (x2/2, x1 − x2/2), so J_f⁻¹ = [ 0  1/2 ; 1  −1/2 ], and J_f · J_f⁻¹ = I.
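The worked example above can be verified numerically. A small numpy check that the Jacobian of the inverse map is the matrix inverse of the Jacobian of the forward map.

```python
import numpy as np

# Forward map from the slide: x = f(z) = (z1 + z2, 2*z1)
J_f = np.array([[1.0, 1.0],
                [2.0, 0.0]])

# Inverse map: z = f^{-1}(x) = (x2/2, x1 - x2/2)
J_f_inv = np.array([[0.0, 0.5],
                    [1.0, -0.5]])

# The product is the identity: J_{f^{-1}} = (J_f)^{-1}
print(J_f @ J_f_inv)
```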


29

Contractive Autoencoder

• Jacobian Matrix


30

Contractive Autoencoder

• New Loss: the reconstruction term plus a new regularization term, the squared Frobenius norm of the encoder's Jacobian:

ℒ = ℒ_reconstruction + λ ||J_f(x)||²_F


31

Contractive Autoencoder

• vs. Denoising Autoencoder

• Advantages
• CAE can better model the distribution of the raw data

• Disadvantages
• DAE is easier to implement
• CAE needs second-order optimization (conjugate gradient, L-BFGS)



32


33

Stacked Autoencoder

• Start from the Autoencoder: Learn Features From the Input

[Figure: input (x1 … x6) → hidden 1 (a1¹ … a4¹) → output (x̂1 … x̂6). Red lines indicate the trainable weights; black lines indicate the fixed/non-trainable weights.]

The feature extractor for the input data. Encoder → Decoder, trained unsupervised.


34

Stacked Autoencoder

• 2nd Stage: Learn 2nd-Level Features From 1st-Level Features

[Figure: input → hidden 1 (a1¹ … a4¹, frozen) → hidden 2 (a1² … a4², trainable) → output. Red lines indicate the trainable weights; black lines indicate the fixed/non-trainable weights.]

The feature extractor for the first feature extractor. Encoder → Encoder → Decoder, trained unsupervised.


35

Stacked Autoencoder

• 3rd Stage: Learn 3rd-Level Features From 2nd-Level Features

[Figure: input → hidden 1 → hidden 2 (frozen) → hidden 3 (a1³ … a4³, trainable) → output. Red lines indicate the trainable weights; black lines indicate the fixed/non-trainable weights.]

The feature extractor for the second feature extractor. Encoder → Encoder → Encoder → Decoder, trained unsupervised.


36

Stacked Autoencoder

• 4th Stage: Learn 4th-Level Features From 3rd-Level Features

[Figure: input → hidden 1 → hidden 2 → hidden 3 (frozen) → hidden 4 (a1⁴ … a4⁴, trainable) → output. Red lines indicate the trainable weights; black lines indicate the fixed/non-trainable weights.]

The feature extractor for the third feature extractor. Encoder → Encoder → Encoder → Encoder → Decoder, trained unsupervised.


37

Stacked Autoencoder

• Use the Learned Feature Extractor for Downstream Tasks

[Figure: input → hidden 1 through hidden 4 (frozen) → classification output a1⁵ (trainable). Red lines indicate the trainable weights; black lines indicate the fixed/non-trainable weights.]

Learn to classify the input data by using the labels and the high-level features. Supervised.


38

Stacked Autoencoder

• Fine-tuning

[Figure: the same network, with all layers trainable (all lines red).]

Fine-tune the entire model for classification. Supervised.


39

Stacked Autoencoder

• Discussion

• Advantages
• …

• Disadvantages
• …


• Vanilla Autoencoder
• Denoising Autoencoder
• Sparse Autoencoder
• Contractive Autoencoder
• Stacked Autoencoder
• Variational Autoencoder (VAE)
  • From Neural Network Perspective
  • From Probability Model Perspective

40


41

Before we start

• Question: are the previous autoencoders generative models?

• Recap: we want to learn a probability distribution p(x) over x

o Generation (sampling): x_new ~ p(x). NO: the compressed latent codes of autoencoders do not follow a known prior distribution, so an autoencoder cannot learn to represent the data distribution.

o Density estimation: p(x) is high if x looks like real data. NO.

o Unsupervised representation learning: discovering the underlying structure of the data distribution (e.g., ears, nose, eyes …). YES: autoencoders learn feature representations.



42


43

Variational Autoencoder

• How to perform generation (sampling)?

[Figure, left: autoencoder with latent units z1 … z4 between Encoder and Decoder. Question: can the hidden output follow a prior distribution, e.g., the Normal distribution?]

[Figure, right: sampling. Draw z ~ N(0, 1) and feed it to the decoder (generator), which maps N(0, 1) to the data space.]

p(X) = Σ_Z p(X|Z) p(Z)

Auto-Encoding Variational Bayes. Diederik P. Kingma, Max Welling. ICLR 2014.


44

Variational Autoencoder

• Quick Overview

[Figure: bidirectional mapping between the data space and the latent space: q(z|x) is inference (encoder), p(x|z) is generation (decoder); x → z ~ N(0, 1) → x̂]

ℒ_total = ℒ_MSE + ℒ_KL


45

Variational Autoencoder

• The neural net perspective
• A variational autoencoder consists of an encoder, a decoder, and a loss function


46

Variational Autoencoder

• Encoder, Decoder



47

Variational Autoencoder

• Loss function

[Equation annotations: the reconstruction term can be represented by MSE; the KL term acts as regularization]


48

Variational Autoencoder

• Why KL(Q||P) and not KL(P||Q)?

• Which direction of the KL divergence to use?
• Some applications require an approximation that usually places high probability anywhere that the true distribution places high probability: the left one, KL(P||Q).
• VAE requires an approximation that rarely places high probability anywhere that the true distribution places low probability: the right one, KL(Q||P).
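The asymmetry can be seen on a toy discrete example (the distributions below are made up for illustration): KL(Q||P) blows up when Q puts mass where P is nearly zero, which is exactly the behavior the VAE objective relies on. A numpy sketch:

```python
import numpy as np

def kl(q, p):
    """KL(q || p) for discrete distributions (terms with q = 0 contribute 0)."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

p = np.array([0.5, 0.5, 1e-6]); p = p / p.sum()  # "true" dist: 3rd state nearly impossible
q = np.array([0.4, 0.3, 0.3])                    # approximation puts mass on that state

# KL(q||p) is much larger: q is heavily punished for mass where p is tiny.
print(kl(q, p), kl(p, q))
```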


49

Variational Autoencoder

• Reparameterization Trick

[Figure: the encoder outputs the predicted means (μ1 … μ4) and predicted standard deviations (δ1 … δ4); each latent variable is resampled as z_i ~ N(μ_i, δ_i), and the z's are fed to the decoder to reconstruct the input.]

1. Encode the input
2. Predict the means
3. Predict the standard deviations
4. Use the predicted means and standard deviations to sample the latent variables individually
5. Reconstruct the input

The latent variables are independent.


50

Variational Autoencoder

• Reparameterization Trick
• Sampling z ~ N(μ, σ) directly is not differentiable
• To make sampling z differentiable:
• z = μ + σ * ϵ, where ϵ ~ N(0, 1)
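The trick can be sketched in numpy: z becomes a deterministic, differentiable function of the encoder outputs (μ, σ), with all the randomness moved into ϵ. A toy check (sample size and values are illustrative) that z = μ + σ·ϵ has the intended mean and standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, sigma, rng=rng):
    """z = mu + sigma * eps, with eps ~ N(0, 1).

    Sampling z ~ N(mu, sigma^2) directly is not differentiable w.r.t. (mu, sigma);
    moving the randomness into eps makes z differentiable in the encoder outputs."""
    eps = rng.standard_normal(np.shape(mu))
    return mu + sigma * eps

mu = np.full(100_000, 2.0)
sigma = np.full(100_000, 0.5)
z = reparameterize(mu, sigma)
print(z.mean(), z.std())  # close to 2.0 and 0.5
```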


51

Variational Autoencoder

• Reparameterization Trick



52

Variational Autoencoder

• Loss function



53

Variational Autoencoder

• Where is ‘variational’?




54


55

Variational Autoencoder

• Problem Definition

Goal: given X = {x_1, x_2, x_3, …, x_n}, find p(X) to represent X.

How: it is difficult to model p(X) directly, so alternatively we can write

p(X) = Σ_Z p(X|Z) p(Z)

where p(Z) = N(0, 1) is a prior/known distribution, i.e., sample X via Z.
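A minimal Monte Carlo sketch of this marginalization in numpy, for a made-up 1-D model where the marginal is known in closed form (p(z) = N(0, 1), p(x|z) = N(z, 1), so p(x) = N(0, 2)). Even in one dimension a fairly large n is needed for a tight estimate, and this gets much worse in high dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def mc_marginal(x, n=200_000, rng=rng):
    """p(x) ≈ (1/n) * sum_i p(x | z_i), with z_i ~ p(z) = N(0, 1)."""
    z = rng.standard_normal(n)
    return gaussian_pdf(x, z, 1.0).mean()

x = 1.0
print(mc_marginal(x), gaussian_pdf(x, 0.0, 2.0))  # MC estimate vs exact N(0, 2) density
```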


56

Variational Autoencoder

• The probability model perspective
• p(X) is hard to model directly
• Alternatively, we learn the joint distribution of X and Z:

p(X) = Σ_Z p(X|Z) p(Z)

p(X) = Σ_Z p(X, Z)

p(X, Z) = p(Z) p(X|Z)


57

Variational Autoencoder


• Assumption


58

Variational Autoencoder


• Assumption


59

Variational Autoencoder

• Monte Carlo?
• n might need to be extremely large before we have an accurate estimate of p(X)


60

Variational Autoencoder

• Monte Carlo?
• Pixel difference is different from perceptual difference


61

Variational Autoencoder

• Monte Carlo?
• VAE alters the sampling procedure


62

Variational Autoencoder

• Recap: Variational Inference
• VI turns inference into optimization

[Figure: the ideal posterior and its approximation]

p(z|x) = p(x, z) / p(x) ∝ p(x, z)


63

Variational Autoencoder

• Variational Inference
• VI turns inference into optimization

[Figure annotations: parameter, distribution]


64

Variational Autoencoder

• Setting up the objective
• Maximize p(X)
• Set Q(z) to be an arbitrary distribution

p(z|X) = p(X|z) p(z) / p(X)

Goal: maximize log p(X)


65

Variational Autoencoder

• Setting up the objective

[Equation annotations: encoder, ideal posterior, reconstruction/decoder, KL divergence]

Goal: maximize log p(X); the ideal posterior is difficult to compute, so the goal becomes optimizing the ELBO instead.

[Figure: x → q(z|x) (inference) → z → p(x|z) (generation) → x̂; ℒ_total = ℒ_MSE + ℒ_KL]


66

Variational Autoencoder

• Setting up the objective: ELBO

[Equation annotations: ideal posterior, encoder, −ELBO]

p(z|X) = p(X, z) / p(X)


67

Variational Autoencoder

• Setting up the objective: ELBO


68

Variational Autoencoder

• Recap: The KL Divergence Loss

KL(N(μ, σ²) || N(0, 1))

= ∫ N(μ, σ²) log [ N(μ, σ²) / N(0, 1) ] dx

= ∫ (1/√(2πσ²)) e^(−(x−μ)²/(2σ²)) log { [ (1/√(2πσ²)) e^(−(x−μ)²/(2σ²)) ] / [ (1/√(2π)) e^(−x²/2) ] } dx

= ∫ (1/√(2πσ²)) e^(−(x−μ)²/(2σ²)) log { (1/√(σ²)) e^(½ [ x² − (x−μ)²/σ² ]) } dx

= ½ ∫ (1/√(2πσ²)) e^(−(x−μ)²/(2σ²)) [ −log σ² + x² − (x−μ)²/σ² ] dx


69

Variational Autoencoder

• Recap: The KL Divergence Loss

KL(N(μ, σ²) || N(0, 1))

= ½ ∫ (1/√(2πσ²)) e^(−(x−μ)²/(2σ²)) [ −log σ² + x² − (x−μ)²/σ² ] dx

= ½ (−log σ² + μ² + σ² − 1)

using E[x²] = μ² + σ² and E[(x−μ)²] = σ² under N(μ, σ²).
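The closed form above can be verified against a direct numerical integration. A numpy sketch; the grid limits and resolution are arbitrary choices.

```python
import numpy as np

def kl_closed_form(mu, sigma):
    """KL(N(mu, sigma^2) || N(0, 1)) = 0.5 * (mu^2 + sigma^2 - log(sigma^2) - 1)."""
    return 0.5 * (mu**2 + sigma**2 - np.log(sigma**2) - 1.0)

def kl_numeric(mu, sigma):
    """Numerically integrate q(x) * [log q(x) - log p(x)] on a fine grid."""
    x = np.linspace(-15.0, 15.0, 200_001)
    log_q = -(x - mu) ** 2 / (2 * sigma**2) - np.log(sigma * np.sqrt(2 * np.pi))
    log_p = -x**2 / 2 - 0.5 * np.log(2 * np.pi)
    # exp(log_q) underflows to 0 in the tails, so the integrand stays finite there.
    integrand = np.exp(log_q) * (log_q - log_p)
    return integrand.sum() * (x[1] - x[0])

print(kl_closed_form(1.0, 0.5), kl_numeric(1.0, 0.5))  # the two should agree
```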


70

Variational Autoencoder

• Recap: The KL Divergence Loss



71

Variational Autoencoder

• Optimizing the objective

[Equation annotations: encoder, ideal posterior, reconstruction, KLD; the expectations are taken over the dataset]


72

Variational Autoencoder

• VAE is a Generative Model

• p(Z|X) is not N(0, 1)

• Can we input N(0, 1) to the decoder for sampling?

• YES: the goal of the KL term is to make p(Z|X) close to N(0, 1)


73

Variational Autoencoder

• VAE vs. Autoencoder

• VAE: distribution representation, p(z|x) is a distribution

• AE: feature representation, h = E(x) is deterministic


74

Variational Autoencoder

• Challenges

• Low-quality images

• …



75

Summary: Take Home Message

• Autoencoders learn data representations in an unsupervised / self-supervised way.
• Autoencoders learn data representations but cannot model the data distribution p(X).
• Unlike the vanilla autoencoder, in a sparse autoencoder the number of hidden units can be greater than the number of input variables.
• VAE
• …
• …
• …
• …
• …
• …


Thanks

76