
Thermodynamic Integration and Variational Inference

Experiments

The TVO is a K-term Riemann integral approximation to log pθ(x). The ELBO is a 1-term Riemann integral approximation to log pθ(x).

The Thermodynamic Variational Objective Vaden Masrani, Tuan Anh Le, Frank Wood [email protected], [email protected], [email protected]

TL;DR

The Thermodynamic Variational Objective (TVO)

Optimizing the TVO

‣ The TVO is a new objective for training both continuous and discrete deep generative models that is as broadly applicable as the ELBO.

‣ The TVO achieves state-of-the-art model and inference network learning without using the reparameterization trick.

‣ The TVO is a generalization of the objectives used in variational inference [5], variational autoencoders [7,8], wake-sleep [3], and inference compilation [6].

‣ The TVO arises from a novel connection between thermodynamic integration (TI) and variational inference (VI).

[Figure: the integrand 𝔼πβ[ log pθ(x, z)/qϕ(z | x) ] plotted against β ∈ [0, 1]; the left Riemann sum gives elbo(θ, ϕ, x) ≤ tvo(θ, ϕ, x) ≤ log pθ(x).]

TI refresher

‣ Consider two densities with intractable normalizing constants

‣ Form a geometric path between π̃0(z) and π̃1(z) using scalar parameter β:

πβ(z) := π̃β(z)/Zβ = π̃1(z)^β π̃0(z)^(1−β) / Zβ

‣ Then compute log(Z1/Z0) using the central TI identity:

log(Z1) − log(Z0) = ∫₀¹ 𝔼πβ[ ∂/∂β log π̃β(z) ] dβ

‣ The integrand is typically evaluated at multiple β points using MCMC, and the integral is then approximated using numerical methods.
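As a sanity check (my own toy example, not from the poster), the TI identity can be verified on two 1-D Gaussians with known normalizing constants. For this geometric path, πβ is itself Gaussian, so direct sampling stands in for MCMC, and the trapezoid rule plays the role of the numerical integrator:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two unnormalized 1-D Gaussians with known normalizers:
#   pi0~(x) = exp(-x^2/2)          => Z0 = sqrt(2*pi)
#   pi1~(x) = 3*exp(-(x-2)^2/2)    => Z1 = 3*sqrt(2*pi)
# so the ground truth is log(Z1/Z0) = log 3.
def log_pi0(x):
    return -0.5 * x**2

def log_pi1(x):
    return np.log(3.0) - 0.5 * (x - 2.0) ** 2

betas = np.linspace(0.0, 1.0, 21)
integrand = []
for b in betas:
    # For this path, pi_beta is exactly N(2*beta, 1), so we can sample it
    # directly; in general one would run MCMC at each beta.
    x = rng.normal(2.0 * b, 1.0, size=100_000)
    # Potential derivative: d/dbeta log pi_beta~(x) = log pi1~(x) - log pi0~(x)
    integrand.append((log_pi1(x) - log_pi0(x)).mean())

# Trapezoid rule over the beta grid approximates the TI integral.
integrand = np.array(integrand)
log_ratio = float(np.sum(np.diff(betas) * (integrand[1:] + integrand[:-1]) / 2))
print(log_ratio)  # ~ log 3 = 1.0986...
```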

‣ Consider the model pθ(x, z) and inference network qϕ(z | x)

‣ Form a geometric path between qϕ(z | x) and pθ(x, z) using scalar parameter β

‣ With this path, the first derivative of the potential is the "instantaneous ELBO"!

log pθ(x) − 0 = ∫₀¹ 𝔼πβ[ log pθ(x, z)/qϕ(z | x) ] dβ

‣ Plugging the instantaneous ELBO into the central TI identity, we arrive at the thermodynamic variational identity (TVI) above.

The Thermodynamic Variational Identity (TVI)

‣ The integrand is strictly increasing in β: it is the ELBO at β = 0 and the EUBO at β = 1. A K-term left Riemann sum gives the TVO:

TVO(θ, ϕ, x) := (1/K) [ ELBO(θ, ϕ, x) + ∑_{k=1}^{K−1} 𝔼πβk[ log pθ(x, z)/qϕ(z | x) ] ] ≤ ∫₀¹ 𝔼πβ[ log pθ(x, z)/qϕ(z | x) ] dβ = log pθ(x)

‣ TI is a technique from physics to calculate the log ratio of two unknown normalizing constants [1, 2].

π̃β(x, zs) / qϕ(zs | x) = pθ(x, zs)^β qϕ(zs | x)^(1−β) / qϕ(zs | x) = ( pθ(x, zs)/qϕ(zs | x) )^β = ws^β

‣ Gradients can be computed using the score-function-based covariance gradient estimator:

∇θ 𝔼πθ[f(z)] = Covπθ[ f(z), ∇θ log π̃θ(z) ] = 𝔼πθ[ (f(z) − 𝔼πθ[f(z)]) (∇θ log π̃θ(z) − 𝔼πθ[∇θ log π̃θ(z)]) ]

‣ The score ∇θ log π̃θ(z) is unnormalized and easy to compute (unlike REINFORCE, ∇θ 𝔼πθ[f(z)] = 𝔼πθ[f(z) ∇θ log πθ(z)], which requires the normalized density).
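To make the estimator concrete, here is a minimal sketch on a toy target of my own where the true gradient is known: πθ = N(θ, 1) and f(z) = z², so ∇θ 𝔼πθ[f(z)] = 2θ, and the score is ∇θ log π̃θ(z) = z − θ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target pi_theta = N(theta, 1) and f(z) = z^2, so
# E[f] = theta^2 + 1 and the true gradient is 2*theta.
theta = 1.5
z = rng.normal(theta, 1.0, size=500_000)

f = z**2
score = z - theta  # grad_theta log pi_theta~(z); no normalizer needed

# Covariance estimator: subtract the "average" baselines from both factors.
grad_est = float(np.mean((f - f.mean()) * (score - score.mean())))
print(grad_est)  # ~ 2 * theta = 3.0
```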

Discrete

Model: sigmoid belief network. Dataset: MNIST. Baselines: VIMCO [9], reweighted wake-sleep (RWS) [4].

[Figure: the integrand 𝔼πβ[ log pθ(x, z)/qϕ(z | x) ] against β, with the point of maximum curvature β* marked and tvo(θ, ϕ, x) shown as the Riemann sum.]

We also investigated the effect of the locations β1:K. We have empirically observed that the integrand has a point of maximum curvature, β*, with 0.2 < β* < 0.4, and compare the regimes β1 ≪ β*, β1 ≈ β*, and β1 ≫ β*.
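The poster does not prescribe a schedule, but one simple way to concentrate partition points at small β (near the observed β*) is a log-uniform grid between a chosen β1 and 1, with β = 0 prepended; `log_uniform_betas` and its arguments are my own naming, a sketch rather than the paper's method:

```python
import numpy as np

def log_uniform_betas(beta_one, K):
    """K points log-uniformly spaced in [beta_one, 1], with beta = 0 prepended."""
    return np.concatenate(([0.0], np.logspace(np.log10(beta_one), 0.0, K)))

# Place beta_1 inside the empirically observed range 0.2 < beta* < 0.4.
betas = log_uniform_betas(0.3, 10)
print(betas)  # monotonically increasing from 0.0 through 0.3 up to 1.0
```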

Connecting TI + VI

‣ Set π̃1(z) := pθ(x, z) and π̃0(z) := qϕ(z | x), so that Z1 = pθ(x), Z0 = 1, π1(z) = π̃1(z)/Z1, and π0(z) = π̃0(z)/Z0.

πβ(z) := π̃β(z)/Zβ = pθ(x, z)^β qϕ(z | x)^(1−β) / Zβ

log(Z1) − log(Z0) = ∫₀¹ 𝔼πβ[ ∂/∂β log π̃β(z) ] dβ

∂/∂β log π̃β(z) = log pθ(x, z)/qϕ(z | x)

‣ log π̃β(z) is referred to as the "potential" in physics

‣ To use the TVO as an objective function, we need to be able to compute expectations and gradients

‣ Expectations under πβ(z) can be computed simply by exponentiating the importance weights by β
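Concretely (a sketch with my own function names and toy model), 𝔼πβ[f] can be estimated by self-normalized importance sampling from qϕ, raising each weight ws to the power β; working with β·log ws and a max-subtraction keeps the normalization numerically stable:

```python
import numpy as np

def expectation_under_pi_beta(log_w, f_vals, beta):
    """Self-normalized estimate of E_{pi_beta}[f] from samples z_s ~ q(z|x),
    using tempered importance weights w_s^beta."""
    a = beta * log_w
    a = a - a.max()             # stable normalization of w_s^beta
    w = np.exp(a)
    return float(np.sum(w / w.sum() * f_vals))

rng = np.random.default_rng(0)

# Toy model: z ~ N(0,1), x | z ~ N(z,1); proposal q(z|x) = N(0,1) (the prior),
# so log w_s = log p(x, z_s) - log q(z_s|x) = log p(x | z_s).
obs = 1.0
z = rng.normal(0.0, 1.0, size=200_000)
log_w = -0.5 * (obs - z) ** 2 - 0.5 * np.log(2.0 * np.pi)

# The TVO integrand is f = log w itself:
# beta = 0 gives the plain average under q (the ELBO),
# beta = 1 gives the usual posterior importance-sampling estimate (the EUBO).
e0 = expectation_under_pi_beta(log_w, log_w, 0.0)
e1 = expectation_under_pi_beta(log_w, log_w, 1.0)
print(e0, e1)  # e0 < e1: the integrand increases with beta
```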

‣ The subtracted means act as the "average" baseline commonly used with REINFORCE; a similar derivation yields a replacement for ∇θ log Zθ.

Partitions are cheap (because we can reuse samples!)


Generalizing Existing Objectives

‣ The left and right endpoints of the TVI correspond to the ELBO and EUBO, respectively:

ELBO = 𝔼qϕ(z|x)[ log pθ(x, z)/qϕ(z | x) ]    EUBO = 𝔼pθ(z|x)[ log pθ(x, z)/qϕ(z | x) ]

‣ Using a right Riemann sum results in an upper bound to log pθ(x) that can be minimized:

TVO_U(θ, ϕ, x) := (1/K) [ EUBO(θ, ϕ, x) + ∑_{k=1}^{K−1} 𝔼πβk[ log pθ(x, z)/qϕ(z | x) ] ] ≥ log pθ(x)

‣ Methods that optimize the ELBO or EUBO are 1-term left / right Riemann approximations to the TVI
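Putting the pieces together on a toy conjugate model of my own where log pθ(x) is tractable: one set of samples and log-weights is reused for every βk (partitions are cheap), and the left and right Riemann sums bracket the true log evidence:

```python
import numpy as np

rng = np.random.default_rng(0)

# Conjugate toy: z ~ N(0,1), x | z ~ N(z,1)  =>  p(x) = N(x; 0, 2) in closed form.
obs = 1.0
true_log_px = -0.5 * np.log(2.0 * np.pi * 2.0) - obs**2 / 4.0

# Proposal q(z|x) = N(0,1) (the prior, deliberately crude so the bounds are loose).
z = rng.normal(0.0, 1.0, size=200_000)
log_w = -0.5 * (obs - z) ** 2 - 0.5 * np.log(2.0 * np.pi)  # log p(x,z)/q(z|x)

def e_pi_beta(log_w, beta):
    # E_{pi_beta}[log w], reusing the same samples/weights for every beta.
    a = beta * log_w
    a = a - a.max()
    w = np.exp(a)
    return float(np.sum(w / w.sum() * log_w))

K = 10
betas = np.linspace(0.0, 1.0, K + 1)            # beta_0 = 0, ..., beta_K = 1
terms = np.array([e_pi_beta(log_w, b) for b in betas])

tvo_lower = terms[:-1].mean()  # left sum:  (1/K)(ELBO + middle terms)
tvo_upper = terms[1:].mean()   # right sum: (1/K)(middle terms + EUBO)
print(tvo_lower, true_log_px, tvo_upper)  # tvo_lower <= true_log_px <= tvo_upper
```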

ELBO objectives:
‣ Variational Inference [5]
‣ Variational AutoEncoders [7,8]
‣ Wake in WS [3]
‣ Inference Compilation [6]

EUBO objectives:
‣ Sleep in WS [3]
‣ Sleep-ϕ in RWS [4]

References

[1] Andrew Gelman and Xiao-Li Meng. Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Statistical Science, pages 163–185, 1998.
[2] Radford M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical report, University of Toronto, 1993.
[3] Geoffrey E. Hinton, Peter Dayan, Brendan J. Frey, and Radford M. Neal. The "wake-sleep" algorithm for unsupervised neural networks. Science, 268(5214):1158–1161, 1995.
[4] Jörg Bornschein and Yoshua Bengio. Reweighted wake-sleep. In International Conference on Learning Representations, 2015.
[5] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
[6] Tuan Anh Le, Atilim Gunes Baydin, and Frank Wood. Inference compilation and universal probabilistic programming. arXiv preprint arXiv:1610.09900, 2016.
[7] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
[8] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.
[9] Andriy Mnih and Danilo Rezende. Variational inference for Monte Carlo objectives. In International Conference on Machine Learning, pages 2188–2196, 2016.
[10] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In International Conference on Learning Representations, 2016.

Inference network learning

Inference networks learned with the TVO are closer to the true posterior pθ(z | x), as measured by the KL divergence, than those learned with VIMCO and RWS.

Continuous

Model: variational autoencoder. Dataset: MNIST. Baselines: Variational Autoencoder (VAE) [7,8], Importance Weighted Autoencoder (IWAE) [10].

Model learning

The TVO (red) achieves state-of-the-art convergence rate and final test log evidence for S = 5, 10. At S = 50, VIMCO achieves a higher test log evidence but converges more slowly.

Standard deviation of the gradient estimator for each objective: the TVO has the lowest variance, VIMCO the highest, and RWS is in between.

The TVO outperforms the VAE and performs competitively with IWAE at 50 samples, despite not using the reparameterization trick. IWAE is the top-performing objective in all cases.

Standard deviation of the gradient estimator for each objective: the TVO has lower variance than IWAE but higher than the VAE.

COV estimator vs reparam & REINFORCE

The covariance estimator has higher variance than the reparameterization trick for all S, but much lower variance than REINFORCE, which is numerically unstable (ELBO objective used in all cases).