
Variational Autoencoders

Eric Chu
6.882: Bayesian Modeling and Inference

Abstract

The ability of variational autoencoders to reconstruct inputs and learn meaningful representations of data was tested on the MNIST and Freyfaces datasets. I also explored their capacity as generative models by comparing samples generated by a variational autoencoder to those generated by generative adversarial networks. Finally, I attempted to apply them towards speech recognition.

1 Background

Autoencoders are an unsupervised learning technique often used to generate useful representations of data. By setting the target output of a neural network equal to that of the input, the hidden layers can learn to capture useful properties. Typically, there is some regularization mechanism to ensure that the network learns statistical regularities of the data as opposed to an identity function. These mechanisms include sparsity constraints, hidden layer size constraints, and adding noise to the input. Historically, they've been important as a pre-training step used to initialize weights in deep networks. Autoencoders are also used for feature learning or dimensionality reduction [6].

Variational autoencoders (VAE) are a recent addition to the field that casts the problem in a variational framework, under which they become generative models [9]. We care about generative models because they can be used to do semi-supervised learning, generate sequences from sequences, generate more training data, and help us better understand our models [6][10].

2 Variational Autoencoders

Overview. Given a point x, we assume there is a posterior pθ(z|x) that defines a distribution over hidden states z. This is the encoding step. We're interested in finding pθ(z|x) and, because it is intractable, approximate it with a variational recognition model qφ(z|x). There is similarly a probabilistic decoding step pθ(x|z). Once we have both the encoder and decoder, we can sample a z from a given x and use it to generate a new x′. Notably, the variational parameters φ are learned jointly with the generative parameters θ.

In order to make this trainable through standard backpropagation optimization procedures, the authors use the reparameterization trick to rewrite the random variable z as a deterministic variable. The reparameterization trick is also referred to as elliptical standardization in [11]. With this trick, as is the hope in variational methods, sampling can usually be done without Markov Chain Monte Carlo methods. Specifically, we sample z(i) from qφ(z|x(i)) through µ(i) + σ(i) ⊙ ε(i), where ε(i) is Gaussian noise with zero mean and the identity covariance matrix, and ⊙ denotes element-wise multiplication.
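For concreteness, the sampling step can be written in a few lines of Torch. This is a minimal sketch, assuming mu and logVar are the tensors produced by the encoder for a minibatch (the names are illustrative):

require 'torch'

-- z = mu + sigma ⊙ eps, with eps ~ N(0, I) (minimal sketch; mu and logVar are
-- assumed encoder outputs)
local function sampleZ(mu, logVar)
   local sigma = torch.exp(torch.mul(logVar, 0.5))     -- sigma = exp(0.5 * log sigma^2)
   local eps = torch.Tensor(mu:size()):normal(0, 1)    -- eps ~ N(0, I)
   return mu + torch.cmul(sigma, eps)                  -- element-wise product
end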

The details of each part of the model are described as follows.

Encoder. The encoder is a feed-forward neural network that produces the mean and log variance used to sample the latent variables z. From C.2 of the appendix of [9], the mean is given by µ = W4 h + b4, the log variance is given by log σ2 = W5 h + b5, and h is given by tanh(W3 x + b3). The Wi's are the weights of the feed-forward network.
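In the nn package this encoder maps onto a small Sequential network with two linear heads. The sketch below is illustrative only; the layer sizes are assumptions, not the ones used in the experiments:

require 'nn'

-- Encoder sketch: x -> h = tanh(W3 x + b3) -> {mu = W4 h + b4, log sigma^2 = W5 h + b5}
local inputSize, hiddenSize, latentSize = 784, 400, 20   -- assumed sizes

local encoder = nn.Sequential()
encoder:add(nn.Linear(inputSize, hiddenSize))            -- W3, b3
encoder:add(nn.Tanh())
local heads = nn.ConcatTable()
heads:add(nn.Linear(hiddenSize, latentSize))             -- mean head: W4, b4
heads:add(nn.Linear(hiddenSize, latentSize))             -- log-variance head: W5, b5
encoder:add(heads)
-- encoder:forward(x) returns the table {mu, logVar}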

Sampler. Given the mean and log variance from the encoder, we sample from qφ(z|x(i)) using the deterministic function created through the reparameterization trick.

KL-Divergence. We calculate the KL-divergence between the variational approximation qφ(z|x(i)) and the prior pθ(z) over the latent variable z. This term appears in the overall loss function and acts as a regularizer.

Decoder. The form of the decoder depends on whether the outputs are modeled as Gaussian or Bernoulli. In the case where our data is continuous, for example, we use a Gaussian decoder. Our decoder is then similar to our encoder in that it parameterizes a multivariate Gaussian with a diagonal covariance matrix. The decoder error is used in the overall loss function and represents the reconstruction loss.

Overall loss function. The overall objective L(x) used to train the autoencoder is the sum of the (negative) KL-divergence and the decoder error; this is the variational lower bound, which is maximized during training.

\mathcal{L}(x) = -D_{KL}\left(q_\phi(z|x)\,\|\,p_\theta(z)\right) + \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] \qquad (1)

The relationship to standard autoencoders is also clear in the loss function. The loss is composed of two expressions. The first captures the probability distribution of the latent variables and also acts as a regularizer. The second captures the idea of reconstruction error, which is used in all autoencoders.
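For completeness, maximizing L(x) is justified by the standard variational identity relating it to the marginal log-likelihood:

\log p_\theta(x) = \mathcal{L}(x) + D_{KL}\left(q_\phi(z|x)\,\|\,p_\theta(z|x)\right) \;\geq\; \mathcal{L}(x)

Since the KL term on the right is non-negative, L(x) is a lower bound on log pθ(x), and raising it both improves the model likelihood and tightens the approximation qφ(z|x) ≈ pθ(z|x).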


3 Implementation Details

The Lua-based Torch7 library was used to implement the model. Neural networks are built by combining modules from the nn package. Each module must contain a method to calculate the output given an input and a method to calculate the gradient with respect to an input. These methods are used to forward propagate inputs and backpropagate errors. I implemented the KL-Divergence and decoder error as nn modules.
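The contract for a custom module is just those two methods. A bare skeleton (illustrative, not the author's code) looks like this:

require 'nn'

-- Skeleton of a custom nn module: updateOutput implements the forward pass,
-- updateGradInput implements the backward pass (the placeholders below pass values through).
local MyModule, parent = torch.class('nn.MyModule', 'nn.Module')

function MyModule:updateOutput(input)
   self.output = input            -- placeholder: compute the module's output here
   return self.output
end

function MyModule:updateGradInput(input, gradOutput)
   self.gradInput = gradOutput    -- placeholder: compute the gradient w.r.t. the input here
   return self.gradInput
end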

Due to the reparameterization trick, our variational approximation qφ(z|x(i)) is a Gaussian distribution. As described in the appendix of [9], the KL-Divergence term, as it appears in the lower bound, is then given in closed form by:

\mathrm{KLD} = \frac{1}{2} \sum_{j=1}^{J} \left( 1 + \ln \sigma_j^2 - \mu_j^2 - \sigma_j^2 \right) \qquad (2)

While there are auto-differentiation libraries available, the derivatives are straightforward to derive. Keeping in mind that the inputs are the mean and log variance, we use the identity σ2 = exp(ln σ2) to calculate the gradient of the KL-Divergence as follows.

\frac{\partial\,\mathrm{KLD}}{\partial \mu_j} = -\mu_j, \qquad \frac{\partial\,\mathrm{KLD}}{\partial \ln \sigma_j^2} = \frac{1}{2}\left( 1 - \exp(\ln \sigma_j^2) \right) \qquad (3)
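Equations (2) and (3) translate directly into the two methods of an nn criterion. The following is a sketch of how such a module could look, not the author's exact implementation; the input is assumed to be the table {mu, logVar}:

require 'nn'

-- Sketch of the KLD term as an nn criterion over {mu, logVar} (illustrative).
local KLDCriterion, parent = torch.class('nn.KLDCriterion', 'nn.Criterion')

function KLDCriterion:updateOutput(input)
   local mu, logVar = input[1], input[2]
   -- Eq. (2): 0.5 * sum_j (1 + log sigma_j^2 - mu_j^2 - sigma_j^2)
   local terms = torch.add(logVar, 1)
   terms:add(-1, torch.cmul(mu, mu))
   terms:add(-1, torch.exp(logVar))
   self.output = 0.5 * torch.sum(terms)
   return self.output
end

function KLDCriterion:updateGradInput(input)
   local mu, logVar = input[1], input[2]
   -- Eq. (3): dKLD/dmu_j = -mu_j,  dKLD/d(log sigma_j^2) = 0.5 * (1 - exp(log sigma_j^2))
   local gradMu = torch.mul(mu, -1)
   local gradLogVar = torch.exp(logVar):mul(-1):add(1):mul(0.5)
   self.gradInput = {gradMu, gradLogVar}
   return self.gradInput
end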

Similarly, the output of the Gaussian decoder module is given by:

D = \ln \mathcal{N}(x; \mu, \sigma^2) = -\frac{1}{2}\ln 2\pi - \frac{1}{2}\ln \sigma^2 - \frac{(x - \mu)^2}{2\sigma^2} \qquad (4)

The derivatives of the Gaussian output are given by:

\frac{\partial D}{\partial \mu} = (x - \mu)\exp(-\ln \sigma^2), \qquad \frac{\partial D}{\partial \ln \sigma^2} = -\frac{1}{2} + \frac{(x - \mu)^2}{2\sigma^2} \qquad (5)
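As with the KLD term, equations (4) and (5) can be packaged as an nn criterion. The sketch below is illustrative rather than the author's exact module, with the prediction assumed to be the table {mu, logVar} and the target the original input x:

require 'nn'

-- Sketch of a diagonal-Gaussian log-likelihood criterion implementing eqs. (4) and (5)
-- (illustrative; sums the per-pixel log-likelihood).
local GaussianCriterion, parent = torch.class('nn.GaussianLLCriterion', 'nn.Criterion')

function GaussianCriterion:updateOutput(input, x)
   local mu, logVar = input[1], input[2]
   local diff = x - mu
   -- Eq. (4): -0.5*log(2*pi) - 0.5*log sigma^2 - (x - mu)^2 / (2*sigma^2), summed over pixels
   local ll = torch.cmul(diff, diff):cmul(torch.exp(-logVar)):mul(-0.5)
   ll:add(-0.5, logVar)
   ll:add(-0.5 * math.log(2 * math.pi))
   self.output = torch.sum(ll)
   return self.output
end

function GaussianCriterion:updateGradInput(input, x)
   local mu, logVar = input[1], input[2]
   local diff = x - mu
   local invVar = torch.exp(-logVar)
   -- Eq. (5): dD/dmu = (x - mu) * exp(-log sigma^2)
   local gradMu = torch.cmul(diff, invVar)
   -- Eq. (5): dD/d(log sigma^2) = -0.5 + (x - mu)^2 / (2*sigma^2)
   local gradLogVar = torch.cmul(diff, diff):cmul(invVar):mul(0.5):add(-0.5)
   self.gradInput = {gradMu, gradLogVar}
   return self.gradInput
end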

There is a corresponding module for the Bernoulli-based decoder (nn.BCECriterion) already built into nn.

4 Experiment: MNIST and Freyfaces

There are three main motivations for these experiments. First, we show that the implementation of the variational approximation works by plotting the increasing lower bound of the log likelihood. Second, we show that the VAE works as an autoencoder, both in reconstructing inputs and in its ability to learn useful representations. Third, we aim to understand the VAE as a generative model.

The experiments were run on the MNIST digits dataset and the Freyfaces dataset [9]. MNIST was split into 60000 training examples and 10000 test examples. Freyfaces was split into 1600 training examples and 400 test examples. I used Adam with an initial learning rate of 0.001 for the optimization procedure.
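For reference, the optimization follows the standard Torch feval/optim.adam pattern. The sketch below uses a toy stand-in model and random data purely to show the loop structure; it is not the actual VAE training code:

require 'nn'
require 'optim'

-- Toy training-loop sketch showing the feval/optim.adam pattern (stand-in model and
-- random data for illustration only).
local model = nn.Sequential()
model:add(nn.Linear(784, 400)):add(nn.Tanh()):add(nn.Linear(400, 784))
local criterion = nn.MSECriterion()
local params, gradParams = model:getParameters()
local config = {learningRate = 0.001}

for i = 1, 100 do
   local x = torch.rand(20, 784)                       -- stand-in minibatch
   local feval = function(p)
      if p ~= params then params:copy(p) end
      gradParams:zero()
      local out = model:forward(x)
      local f = criterion:forward(out, x)              -- autoencoding: target equals input
      model:backward(x, criterion:backward(out, x))
      return f, gradParams
   end
   optim.adam(feval, params, config)
end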

Likelihood variational lower bounds. The likelihood of the variational approximations at different dimensionalities of z are shown in Figure 1. The y-axis displays the lower bound, and the x-axis displays the number of training examples seen. The blue line corresponds to the training set, and the orange line corresponds to the test set. The fact that there is no extreme overfitting on the test set indicates that the KL-Divergence in the loss function is indeed acting as a regularizer. While it appears the Freyfaces lower bound might have continued to increase, training was capped at 10^7 training examples seen to limit computational cost. The lower bound values are similar to those of the paper.

Reconstruction. Next, we show that the VAE works as an autoencoder by quantitatively and qualitatively measuring its ability to reconstruct an input x. The reconstruction error is calculated using a pixel-wise L2 norm and is displayed for different dimensionalities of z in Table 1. Not surprisingly, as the dimensionality of z increases, the error decreases because the model compresses less and has greater modeling capacity.

Dataset     Split    z = 2     z = 3     z = 5     z = 10    z = 20    z = 200
MNIST       Train    0.5094    0.4618    0.3960    0.3165    0.2806    0.2820
MNIST       Test     0.5959    0.5343    0.4454    0.3431    0.29034   0.2899
Freyfaces   Train    0.1698    –         0.1197    0.0967    0.0856    –
Freyfaces   Test     0.1863    –         0.1288    0.1028    0.0919    –

Table 1: Reconstruction errors (pixel-wise L2 norm)

We can also turn to Figure 2 for a qualitative assessment of the reconstruction. As the dimensionality of the latent space increases, the reconstructions become sharper. We also notice, for example, that the 9 in the third column only becomes correctly reconstructed when z = 20 and z = 200.

Clustering of Latent Space. We further demonstrate VAE's utility as an autoencoder by showing its ability to capture meaningful representations. Recalling that autoencoders are often used for dimensionality reduction, we plot the latent space in Figure 3.



Figure 1: Likelihood variational lower bounds. First row = MNIST at latent variable sizes z in [2, 3, 5, 10, 20, 200]. Second row = Freyfaces at latent variable sizes z in [2, 5, 10, 20]. Blue = training set. Orange = validation set.


Figure 2: a) VAE reconstructions. The first row shows the original inputs; each subsequent row shows the reconstructions at z = [2, 3, 5, 10, 20, 200], with the second row being z = 2, the third row z = 3, and so on. b) Samples generated from the 2D VAE. c) Samples generated from the GAN.

Each color represents a different digit in the MNIST dataset. Plot a) shows how a reasonable clustering is already achieved when the latent space is only of dimension 2. Overlapping clusters include those corresponding to the visually similar "4" and "9" digits. To create plot b), we use t-SNE [13] to project our 20-dimensional latent space onto 2 dimensions. Here, the clusters are well-defined.

(a) z = 2 (b) z = 20, reduced to 2 dimensions using t-SNE

Figure 3: Clustering of latent space on MNIST

Walking over Latent Space. In our quest to understand VAE as a generative model, we can "walk" over the latent space and produce samples. Images for both datasets at z = 2 are shown in Figure 4. In the Freyfaces walk, for example, the diagonal along the space appears to correspond to smiling faces.

(a) MNIST, z = 2 (b) Freyfaces, z = 2

Figure 4: Walking over latent space

4.1 Generative Adversarial Networks (GAN)

In the interest of exploring VAE as a generative model, I also explored the recent generative adversarial network proposed in [5]. A brief summary is as follows. A GAN is composed of two neural nets. The first, the generator, aims to produce samples that are similar to the data. The second, the discriminator, aims to distinguish between samples produced by the generator and samples from the data. More specifically, the generator models latent input noise variables z and produces samples x = G(z; θ(G)). The discriminator is a function D(x; θ(D)) that produces the probability that x comes from the data. D is trained to maximize the correctness of its probability, while G is trained to minimize the probability that D is correct. This can be formulated as a minimax game with the following objective function:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right] \qquad (6)

G is finished training once D produces a probability of 0.5 for all inputs. GANs are similar to VAEs in that they can be trained without sampling through standard backpropagation optimization procedures. In practice, however, they can be difficult to train because the two networks D and G are not symmetric in task difficulty; teaching a network to produce images is harder than teaching one to distinguish whether an image is realistic or not. After an attempt to implement it in Torch, I turned to the Python-based code released with the paper.
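The stopping criterion above comes from the analysis in [5]: for a fixed generator G, the optimal discriminator is

D^{*}_{G}(x) = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_g(x)}

so when the generator's distribution pg matches pdata, the best the discriminator can do is output 0.5 everywhere.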

In plots b and c of Figure 2, we can compare MNIST samples produced by a VAE to those from a GAN. The VAE samples are produced by randomly sampling from a Gaussian distribution in the trained network with z = 2. The GAN samples are produced by passing in a blank image with Gaussian noise. Qualitatively, the VAE samples look smoother, while the GAN samples have sharper corners and look as if produced by jerkier motions. I believe that the smoothness comes from the pixel-wise reconstruction loss in the VAE objective function. GANs, on the other hand, appear to capture local features better than the global style. As pixel-wise loss doesn't reflect the human ability to ignore slight rotations and translations, perhaps a way to bridge this gap is to use a "perceptual loss" built from the decoder layers instead of the pixel-wise reconstruction loss. This technique is used in several papers, as described in [8].

5 Experiment: Speech

Autoencoders are more fun and useful when applied to non-toy datasets, and I wanted to see if a VAE could be used to learn variations in speaking accents. This could be used for pre-training a speech recognition network, language identification, and speaker identification. To that end, I scraped 2000 recordings of speakers from different countries speaking the same English paragraph from the speech accent archive [12]. In order to pass an input of computationally feasible size, I had to segment each recording and extract the first sentence, "Please call Stella." Automatic segmentation using an off-the-shelf segmenter and my own speech recognition system were both inconsistent, so I manually segmented the 600 recordings by native English speakers (the accents still range from American Southern to Northern Irish). Each recording was transformed into a spectrogram using a Hamming window of size 512 and stride of 256. These spectrograms were fed as inputs into the VAE, with spectrogram-like outputs produced as well. While perfect reconstruction from spectrogram back to audio is impossible, intelligible reconstructions are possible through an iterative algorithm described in [2] and available in the MATLAB Spectrogram Inversion toolbox.
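The windowing step is conventional short-time Fourier analysis. The sketch below computes a magnitude spectrogram with only core torch tensor operations; it is an illustrative re-implementation, not the preprocessing code actually used, and `signal` is assumed to be a 1-D torch.Tensor of audio samples:

require 'torch'

-- Magnitude spectrogram via a windowed DFT (Hamming window of 512 samples, stride 256).
local function spectrogram(signal, winSize, stride)
   winSize, stride = winSize or 512, stride or 256
   local nFrames = math.floor((signal:size(1) - winSize) / stride) + 1
   local t = torch.linspace(0, winSize - 1, winSize)
   local window = torch.cos(t * (2 * math.pi / (winSize - 1))):mul(-0.46):add(0.54)  -- Hamming
   local nBins = winSize / 2 + 1
   local k = torch.linspace(0, nBins - 1, nBins)
   local angles = torch.ger(k, t):mul(2 * math.pi / winSize)        -- nBins x winSize
   local cosB, sinB = torch.cos(angles), torch.sin(angles)
   local spec = torch.Tensor(nBins, nFrames)
   for f = 1, nFrames do
      local frame = signal:narrow(1, (f - 1) * stride + 1, winSize):clone():cmul(window)
      local re, im = torch.mv(cosB, frame), torch.mv(sinB, frame)
      spec[{{}, f}] = torch.sqrt(torch.cmul(re, re) + torch.cmul(im, im))
   end
   return spec
end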

My current experiments haven't been able to produce sensible reconstructions. I suspect that because the inputs are much larger and more complex than either an MNIST digit or a Freyface, the encoders and decoders should correspondingly be made deeper and more complex.

6 Future Work and Conclusions

With autoencoders no longer as necessary for training deep networks [6], VAEs become more interesting not because of their power as autoencoders, but because of their formulation as probabilistic generative models. This formulation is easily extended to a semi-supervised format by replacing p(z|x) with p(z|x, y), where the y's are observed for a subset of the points in X [10]. In the speech recognition context, for example, the y's could be accents. Questions remain, however, about how one should evaluate generative models.

There are also recurrent extensions of VAE I would like to explore. [1] and [4] both use the VAE framework on sequential input data. The DRAW network in [7] introduces a sequential VAE with an attention mechanism. This ability would be highly useful in end-to-end speech recognition systems, which rely on stacked recurrent layers. I also made a pass at applying the VAE to the human sketch dataset collected in [3]. There are 250 categories of objects, each of which has 80 sketches collected via Mechanical Turk. Unfortunately, my VAE produced very low probabilities for each pixel. I believe this is because the sketches are extremely sparse, and that a recurrent formulation incorporating ordered stroke data would be more fruitful.



References

[1] Chung, Junyoung, et al. "A recurrent latent variable model for sequential data." Advances in Neural Information Processing Systems. 2015.

[2] Griffin, Daniel, and Jae Lim. "Signal estimation from modified short-time Fourier transform." IEEE Transactions on Acoustics, Speech, and Signal Processing 32.2 (1984): 236-243.

[3] Eitz, Mathias, James Hays, and Marc Alexa. "How do humans sketch objects?." ACM Trans. Graph. 31.4 (2012): 44-1.

[4] Fabius, Otto, and Joost R. van Amersfoort. "Variational recurrent auto-encoders." arXiv preprint arXiv:1412.6581 (2014).

[5] Goodfellow, Ian, et al. "Generative adversarial nets." Advances in Neural Information Processing Systems. 2014.

[6] Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Online.

[7] Gregor, Karol, et al. "DRAW: A recurrent neural network for image generation." arXiv preprint arXiv:1502.04623 (2015).

[8] Johnson, Justin, Alexandre Alahi, and Li Fei-Fei. "Perceptual losses for real-time style transfer and super-resolution." arXiv preprint arXiv:1603.08155 (2016).

[9] Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." arXiv preprint arXiv:1312.6114 (2013).

[10] Kingma, Diederik P., et al. "Semi-supervised learning with deep generative models." Advances in Neural Information Processing Systems. 2014.

[11] Kucukelbir, Alp, et al. "Automatic variational inference in Stan." Advances in Neural Information Processing Systems. 2015.

[12] "Speech Accent Archive." Speech Accent Archive. N.p., n.d. Web. 15 Mar. 2016.

[13] Van der Maaten, Laurens, and Geoffrey Hinton. "Visualizing data using t-SNE." Journal of Machine Learning Research 9 (2008): 2579-2605.
