Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders -...
![Page 1: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/1.jpg)
Neural Discrete Representation Learning
Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu
![Page 2: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/2.jpg)
Generative Models
Goal: Estimate the probability distribution of high-dimensional data, such as images, audio, video, text, ...
Motivation:
- Learn the underlying structure in data.
- Capture the dependencies between the variables.
- Generate new data with similar properties.
- Learn useful features from the data in an unsupervised fashion.
![Page 3: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/3.jpg)
Autoregressive Models
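Autoregressive models factorize the joint distribution with the chain rule, p(x) = p(x_1) p(x_2 | x_1) ... p(x_T | x_1, ..., x_{T-1}). The sketch below only illustrates that factorization; `toy_conditional` is a hypothetical stand-in for a learned network and is not taken from the talk.

```python
import numpy as np

def sequence_log_prob(x, conditional):
    """Sum log p(x_i | x_<i) over all positions (chain rule)."""
    total = 0.0
    for i in range(len(x)):
        probs = conditional(x[:i])       # distribution over the next symbol
        total += np.log(probs[x[i]])
    return total

def toy_conditional(prefix):
    """Hypothetical binary model: repeats the previous symbol with prob. 0.9."""
    if len(prefix) == 0:
        return np.array([0.5, 0.5])
    p = np.full(2, 0.1)
    p[prefix[-1]] = 0.9
    return p

x = [0, 0, 1]
log_p = sequence_log_prob(x, toy_conditional)   # log(0.5 * 0.9 * 0.1)
```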
![Page 4: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/4.jpg)
Recent Autoregressive models at DeepMind
PixelRNN, PixelCNN (van den Oord et al, 2016ab)
[Sample images labeled: White Whale, Hartebeest, Tiger, Geyser]
Video Pixel Networks (Kalchbrenner et al, 2016a)
WaveNet (van den Oord et al, 2016c)
ByteNet (Kalchbrenner et al, 2016b)
![Page 5: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/5.jpg)
Modeling Audio
![Page 6: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/6.jpg)
Causal Convolution
[Figure: Input, Hidden Layer]
![Page 7: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/7.jpg)
Causal Convolution
[Figure: Input, Hidden Layer, Hidden Layer]
![Page 8: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/8.jpg)
Causal Convolution
[Figure: Input, Hidden Layer, Hidden Layer, Hidden Layer]
![Page 9: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/9.jpg)
Causal Convolution
[Figure: Input, Hidden Layer, Hidden Layer, Hidden Layer, Output]
![Page 10: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/10.jpg)
Causal Convolution
[Figure: Input, Hidden Layer, Hidden Layer, Hidden Layer, Output]
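A causal convolution lets the output at time t depend only on inputs at times up to t. The following is a minimal NumPy sketch of that idea for a single channel; the filter width and the weights are made up for illustration and are not the values used in WaveNet.

```python
import numpy as np

def causal_conv1d(x, w):
    """y[t] = sum_k w[k] * x[t - k], with x treated as zero before t = 0."""
    k = len(w)
    x_padded = np.concatenate([np.zeros(k - 1), x])   # pad on the left only
    # The window at t covers x[t-k+1..t]; w is reversed so w[0] meets x[t].
    return np.array([x_padded[t:t + k] @ w[::-1] for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, 0.5])          # hypothetical filter of width 2
y = causal_conv1d(x, w)

# Changing a future input must not change earlier outputs.
x_future = x.copy()
x_future[3] = 100.0
y_future = causal_conv1d(x_future, w)
```

Padding only on the left is what makes the layer causal: a symmetric pad would let `y[t]` see `x[t + 1]`.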
![Page 11: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/11.jpg)
Causal Dilated Convolution
Input
![Page 12: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/12.jpg)
Causal Dilated Convolution
[Figure: Input, Hidden Layer]
![Page 13: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/13.jpg)
Causal Dilated Convolution
[Figure: Input, Hidden Layer (dilation=1), Hidden Layer (dilation=2)]
![Page 14: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/14.jpg)
Causal Dilated Convolution
[Figure: Input, Hidden Layer (dilation=1), Hidden Layer (dilation=2), Hidden Layer (dilation=4)]
![Page 15: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/15.jpg)
Causal Dilated Convolution
[Figure: Input, Hidden Layer (dilation=1), Hidden Layer (dilation=2), Hidden Layer (dilation=4), Output (dilation=8)]
![Page 16: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/16.jpg)
Causal Dilated Convolution
[Figure: Input, Hidden Layer (dilation=1), Hidden Layer (dilation=2), Hidden Layer (dilation=4), Output (dilation=8)]
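With dilations 1, 2, 4 and 8, four layers of width-2 filters cover 16 input timesteps. The sketch below checks this by pushing a unit impulse through such a stack; the filter width of 2 and the all-ones weights are assumptions made for the illustration.

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation):
    """y[t] = sum_k w[k] * x[t - k * dilation], with zeros before t = 0."""
    y = np.zeros_like(x, dtype=float)
    for k, wk in enumerate(w):
        shift = k * dilation
        if shift == 0:
            y += wk * x
        elif shift < len(x):
            y[shift:] += wk * x[:-shift]
    return y

# Stack with the dilations shown on the slide, filter width 2 (assumed).
h = np.zeros(32)
h[0] = 1.0                         # unit impulse at t = 0
for d in (1, 2, 4, 8):
    h = dilated_causal_conv1d(h, np.ones(2), d)

# The impulse spreads over exactly as many steps as the receptive field.
receptive_field = int(np.count_nonzero(h))
```

Without dilation, four width-2 layers would cover only 5 timesteps, which is why dilation is attractive for long audio contexts.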
![Page 17: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/17.jpg)
Multiple Stacks
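Repeating the dilation pattern in several stacks grows the receptive field linearly with the number of stacks. A small sketch of the arithmetic follows; the block of dilations 1 to 512 and the filter width of 2 are assumptions for illustration, not settings read off this slide.

```python
def receptive_field(dilations, kernel_size=2):
    """Number of input timesteps that can influence one output timestep."""
    return 1 + (kernel_size - 1) * sum(dilations)

one_stack = [2 ** i for i in range(10)]          # 1, 2, 4, ..., 512
single = receptive_field(one_stack)              # one stack
three_stacks = receptive_field(one_stack * 3)    # pattern repeated 3 times
```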
![Page 18: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/18.jpg)
Sampling
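Sampling from an autoregressive model is sequential: each new value is drawn from the predicted distribution and fed back as input for the next step. The loop below is a minimal sketch; `toy_next_probs` is a hypothetical stand-in for the network and uses 4 output levels purely to keep the example small.

```python
import numpy as np

def sample_sequence(next_probs, length, rng):
    """Draw x_1..x_T one step at a time, feeding each sample back in."""
    x = []
    for _ in range(length):
        probs = next_probs(x)                     # p(x_t | x_<t)
        x.append(int(rng.choice(len(probs), p=probs)))
    return x

def toy_next_probs(prefix, n_levels=4):
    """Hypothetical model: favours repeating the previous value."""
    probs = np.full(n_levels, 0.1)
    probs[prefix[-1] if prefix else 0] = 0.7
    return probs

rng = np.random.default_rng(0)
samples = sample_sequence(toy_next_probs, length=16, rng=rng)
```

Unlike training, which can process all timesteps in parallel, this loop needs one model evaluation per generated sample.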
![Page 19: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/19.jpg)
Speaker-conditional Generation
...
Does not depend on timestep
Speaker embedding
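One way to read this slide: the speaker embedding h enters every layer as an extra term that is the same at all timesteps, as in the gated unit z = tanh(W_f * x + V_f^T h) ⊙ σ(W_g * x + V_g^T h) from the WaveNet paper. The sketch below shows only the broadcasting; the sizes are hypothetical and the convolution outputs are replaced by zero arrays.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_unit_global(filter_act, gate_act, h, V_f, V_g):
    """filter_act, gate_act: (T, C) conv outputs; h: (E,) speaker embedding."""
    bias_f = h @ V_f          # shape (C,), identical for every timestep
    bias_g = h @ V_g
    return np.tanh(filter_act + bias_f) * sigmoid(gate_act + bias_g)

rng = np.random.default_rng(0)
T, C, E = 5, 3, 2                        # hypothetical sizes
filter_act = np.zeros((T, C))            # stand-ins for the conv outputs
gate_act = np.zeros((T, C))
h = rng.normal(size=E)                   # speaker embedding
V_f, V_g = rng.normal(size=(E, C)), rng.normal(size=(E, C))
z = gated_unit_global(filter_act, gate_act, h, V_f, V_g)
```

Because the conditioning term has no time axis, every row of `z` is identical here; with real convolution outputs only the speaker contribution would be constant over time.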
![Page 20: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/20.jpg)
Text-To-Speech samples
https://deepmind.com/blog/wavenet-generative-model-raw-audio/
![Page 21: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/21.jpg)
Speaker-conditional samples (but not conditioned on text)
https://deepmind.com/blog/wavenet-generative-model-raw-audio/
![Page 22: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/22.jpg)
Piano Music samples
https://deepmind.com/blog/wavenet-generative-model-raw-audio/
![Page 23: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/23.jpg)
VQ-VAE
- Towards modeling a latent space
  - Learn meaningful representations.
  - Abstract away noise and details.
  - Model what’s important in a compressed latent representation.
- Why discrete?
  - Many important real-world things are discrete.
  - Arguably easier to model for the prior (e.g., softmax vs RNADE).
  - Continuous representations are often inherently discretized by encoder/decoder.
![Page 24: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/24.jpg)
VQ-VAE
Related work:
PixelVAE (Gulrajani et al, 2016)
Variational Lossy AutoEncoder (Chen et al, 2016)
![Page 25: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/25.jpg)
VQ-VAE
![Page 26: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/26.jpg)
VQ-VAE
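The core of VQ-VAE is a nearest-neighbour lookup: each encoder output is replaced by the closest vector in a learned codebook, and the index of that vector is the discrete latent. The sketch below shows the forward computation only, with a toy codebook; the sizes and the value of `beta` are assumptions for illustration. In training, the two loss terms differ only in where the gradient is stopped, and the decoder gradient is copied straight through the lookup to the encoder, neither of which plain NumPy can express.

```python
import numpy as np

def quantize(z_e, codebook):
    """z_e: (N, D) encoder outputs; codebook: (K, D) embedding vectors."""
    # Squared distance from every encoder output to every codebook vector.
    d2 = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d2.argmin(axis=1)         # discrete latent codes
    z_q = codebook[indices]             # vectors passed on to the decoder
    return indices, z_q

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])   # toy K=3, D=2
z_e = np.array([[0.9, 1.2], [-0.1, 0.2], [-0.8, 1.7]])
indices, z_q = quantize(z_e, codebook)

beta = 0.25   # commitment weight; the value is an assumption for this sketch
codebook_term = np.mean((z_q - z_e) ** 2)    # pulls codes toward encoder outputs
commitment_term = beta * codebook_term       # keeps the encoder near its code
```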
![Page 27: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/27.jpg)
Images
![Page 28: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/28.jpg)
ImageNet reconstructions
Original 128x128 images
Reconstructions
![Page 29: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/29.jpg)
VQ-VAE - Sample
![Page 30: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/30.jpg)
ImageNet samples
![Page 31: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/31.jpg)
DM-Lab Samples
![Page 32: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/32.jpg)
3 Global Latents Reconstruction
![Page 33: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/33.jpg)
3 Global Latents Reconstruction
Originals
Reconstructions from compressed representations (27 bits per image).
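The 27-bit figure follows from the number of latents times the bits needed to index one codebook entry. A quick check, assuming a codebook of K = 512 entries per latent (a value chosen here because it reproduces the number on the slide):

```python
import math

num_latents = 3            # from the slide
codebook_size = 512        # assumed K, consistent with 27 bits in total
bits_per_latent = math.log2(codebook_size)
bits_per_image = num_latents * bits_per_latent
```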
![Page 34: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/34.jpg)
Video Generation in the latent space
![Page 35: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/35.jpg)
Speech
![Page 36: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/36.jpg)
https://avdnoord.github.io/homepage/vqvae/
![Page 37: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/37.jpg)
Speech - reconstruction
Original Reconstruction
![Page 38: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/38.jpg)
Speech - Sample from prior
![Page 39: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/39.jpg)
https://avdnoord.github.io/homepage/vqvae/
![Page 40: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/40.jpg)
Speech - speaker conditional
![Page 41: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/41.jpg)
https://avdnoord.github.io/homepage/vqvae/
![Page 42: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/42.jpg)
Unsupervised Learning of phonemes
[Figure labels: Encoder, Decoder, Discrete codes (alphabet = codebook), Phonemes]
![Page 43: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/43.jpg)
Unsupervised Learning of phonemes
[Figure axes: Phonemes, Discrete codes]
41-way classification
49.3% accuracy fully unsupervised
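One way such an accuracy can be computed is to map every discrete code to the phoneme it co-occurs with most often and then score that fixed mapping. The sketch below uses made-up aligned labels with 3 codes and 2 phonemes; the exact evaluation protocol behind the 49.3% figure is not given on the slide, so this is an illustration of the idea only.

```python
import numpy as np

def code_to_phoneme_accuracy(codes, phonemes, n_codes, n_phonemes):
    """Map each code to its most frequent phoneme, then score that mapping."""
    counts = np.zeros((n_codes, n_phonemes), dtype=int)
    np.add.at(counts, (codes, phonemes), 1)       # co-occurrence table
    mapping = counts.argmax(axis=1)               # code -> phoneme
    return float(np.mean(mapping[codes] == phonemes)), mapping

# Toy aligned labels (hypothetical): 3 codes, 2 phonemes.
codes    = np.array([0, 0, 0, 1, 1, 2, 2, 2])
phonemes = np.array([0, 0, 1, 1, 1, 0, 1, 1])
accuracy, mapping = code_to_phoneme_accuracy(codes, phonemes, 3, 2)
```

No phoneme labels are used to train the codes in this scheme; they are needed only to build and score the mapping afterwards.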
![Page 44: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/44.jpg)
References and related work
Pixel Recurrent Neural Networks - van den Oord et al, ICML 2016
Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016
WaveNet: A Generative Model For Raw Audio - van den Oord et al, arXiv 2016
Neural Machine Translation in Linear Time - Kalchbrenner et al, arXiv 2016
Video Pixel Networks - Kalchbrenner et al, ICML 2017
Neural Discrete Representation Learning - van den Oord et al, NIPS 2017
Related work:
The Neural Autoregressive Distribution Estimator - Larochelle et al, AISTATS 2011
Generative image modeling using spatial LSTMs - Theis et al, NIPS 2015
SampleRNN: An Unconditional End-to-End Neural Audio Generation Model - Mehri et al, ICLR 2017
PixelVAE: A Latent Variable Model for Natural Images - Gulrajani et al, ICLR 2017
Variational Lossy Autoencoder - Chen et al, ICLR 2017
Soft-to-Hard Vector Quantization for End-to-End Learning Compressible Representations - Agustsson et al, NIPS 2017
![Page 45: Neural Discrete Representation Learning · Conditional Image Generation with PixelCNN Decoders - van den Oord et al, NIPS 2016 WaveNet: A Generative Model For Raw Audio - van den](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f6203b2e488741798165d67/html5/thumbnails/45.jpg)
Thank you!