Gradient Estimation Using Stochastic Computation Graphs
Yoonho Lee
Department of Computer Science and Engineering, Pohang University of Science and Technology
May 9, 2017
Outline
Stochastic Gradient Estimation
Stochastic Computation Graphs
Stochastic Computation Graphs in RL
Other Examples of Stochastic Computation Graphs
Stochastic Gradient Estimation

$$R(\theta) = \mathbb{E}_{w \sim P_\theta(dw)}[f(\theta, w)]$$

Approximations of $\nabla_\theta R(\theta)$ can be used for:

- Computational finance
- Reinforcement learning
- Variational inference
- Explaining the brain with neural networks
Stochastic Gradient Estimation

$$R(\theta) = \mathbb{E}_{w \sim P_\theta(dw)}[f(\theta,w)] = \int_\Omega f(\theta,w)\, P_\theta(dw) = \int_\Omega f(\theta,w)\, \frac{P_\theta(dw)}{G(dw)}\, G(dw)$$

where $G$ is any sampling distribution over $\Omega$. Then

$$\begin{aligned}
\nabla_\theta R(\theta) &= \nabla_\theta \int_\Omega f(\theta,w)\, \frac{P_\theta(dw)}{G(dw)}\, G(dw) \\
&= \int_\Omega \nabla_\theta \Big( f(\theta,w)\, \frac{P_\theta(dw)}{G(dw)} \Big)\, G(dw) \\
&= \mathbb{E}_{w \sim G(dw)}\Big[ \nabla_\theta f(\theta,w)\, \frac{P_\theta(dw)}{G(dw)} + f(\theta,w)\, \frac{\nabla_\theta P_\theta(dw)}{G(dw)} \Big]
\end{aligned}$$
Stochastic Gradient Estimation

So far we have

$$\nabla_\theta R(\theta) = \mathbb{E}_{w \sim G(dw)}\Big[ \nabla_\theta f(\theta,w)\, \frac{P_\theta(dw)}{G(dw)} + f(\theta,w)\, \frac{\nabla_\theta P_\theta(dw)}{G(dw)} \Big]$$

Letting $G = P_{\theta_0}$ and evaluating at $\theta = \theta_0$, we get

$$\begin{aligned}
\nabla_\theta R(\theta)\Big|_{\theta=\theta_0}
&= \mathbb{E}_{w \sim P_{\theta_0}(dw)}\Big[ \nabla_\theta f(\theta,w)\, \frac{P_\theta(dw)}{P_{\theta_0}(dw)} + f(\theta,w)\, \frac{\nabla_\theta P_\theta(dw)}{P_{\theta_0}(dw)} \Big]\Big|_{\theta=\theta_0} \\
&= \mathbb{E}_{w \sim P_{\theta_0}(dw)}\big[ \nabla_\theta f(\theta,w) + f(\theta,w)\, \nabla_\theta \ln P_\theta(dw) \big]\Big|_{\theta=\theta_0}
\end{aligned}$$
Stochastic Gradient Estimation

So far we have

$$R(\theta) = \mathbb{E}_{w \sim P_\theta(dw)}[f(\theta,w)]$$

$$\nabla_\theta R(\theta)\Big|_{\theta=\theta_0} = \mathbb{E}_{w \sim P_{\theta_0}(dw)}\big[ \nabla_\theta f(\theta,w) + f(\theta,w)\, \nabla_\theta \ln P_\theta(dw) \big]\Big|_{\theta=\theta_0}$$

1. Assuming $w$ fully determines $f$:

$$\nabla_\theta R(\theta)\Big|_{\theta=\theta_0} = \mathbb{E}_{w \sim P_{\theta_0}(dw)}\big[ f(w)\, \nabla_\theta \ln P_\theta(dw) \big]\Big|_{\theta=\theta_0}$$

2. Assuming $P_\theta(dw)$ is independent of $\theta$:

$$\nabla_\theta R(\theta) = \mathbb{E}_{w \sim P(dw)}[\nabla_\theta f(\theta,w)]$$
Stochastic Gradient Estimation: SF vs PD

$$R(\theta) = \mathbb{E}_{w \sim P_\theta(dw)}[f(\theta,w)]$$

Score Function (SF) Estimator:

$$\nabla_\theta R(\theta)\Big|_{\theta=\theta_0} = \mathbb{E}_{w \sim P_{\theta_0}(dw)}\big[ f(w)\, \nabla_\theta \ln P_\theta(dw) \big]\Big|_{\theta=\theta_0}$$

Pathwise Derivative (PD) Estimator:

$$\nabla_\theta R(\theta) = \mathbb{E}_{w \sim P(dw)}[\nabla_\theta f(\theta,w)]$$

- SF allows discontinuous $f$.
- SF requires only sample values of $f$; PD requires the derivative $f'$.
- SF takes derivatives over the probability distribution of $w$; PD takes derivatives over the function $f$.
- SF has high variance when $w$ is high-dimensional.
- PD has high variance when $f$ is rough.

(A toy comparison of the two estimators is sketched below.)
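As a concrete illustration (not from the slides), here is a minimal numpy sketch comparing the two estimators on a toy problem where the true gradient is known in closed form. The choices $f(w) = w^2$ and $P_\theta = \mathcal{N}(\theta, 1)$ are hypothetical:

```python
import numpy as np

# Toy problem (hypothetical): R(theta) = E_{w ~ N(theta, 1)}[f(w)]
# with f(w) = w**2, so the true gradient is dR/dtheta = 2 * theta.
rng = np.random.default_rng(0)
theta, n = 1.5, 100_000

def f(w):
    return w ** 2

# Score function (SF) estimator: E[f(w) * d/dtheta log N(w; theta, 1)],
# where d/dtheta log N(w; theta, 1) = (w - theta).
w = rng.normal(theta, 1.0, size=n)
sf_grad = np.mean(f(w) * (w - theta))

# Pathwise derivative (PD) estimator: reparameterize w = theta + eps with
# eps ~ N(0, 1) and differentiate through the sample: f'(w) * dw/dtheta.
eps = rng.normal(0.0, 1.0, size=n)
pd_grad = np.mean(2.0 * (theta + eps))  # f'(w) = 2w, dw/dtheta = 1

print(sf_grad, pd_grad)  # both approach 2 * theta = 3.0; PD varies less here
```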
Stochastic Gradient Estimation: Choices for PD estimator

blog.shakirm.com/2015/10/machine-learning-trick-of-the-day-4-reparameterisation-tricks/
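The post above surveys several such tricks. As a hedged illustration (not from the slides), here are two standard ways to write the same Gaussian sample as a differentiable function of its parameters, either of which makes the PD estimator applicable:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma = 0.5, 2.0

# Location-scale trick: w = mu + sigma * eps with eps ~ N(0, 1),
# so dw/dmu = 1 and dw/dsigma = eps.
eps = rng.normal(size=10_000)
w_loc_scale = mu + sigma * eps

# Inverse-CDF trick: w = F^{-1}(u; mu, sigma) with u ~ Uniform(0, 1).
u = rng.uniform(size=10_000)
w_inverse_cdf = norm.ppf(u, loc=mu, scale=sigma)

# Both express a N(mu, sigma^2) sample as a differentiable function of
# (mu, sigma), which is exactly what the PD estimator requires.
```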
Outline
Stochastic Gradient Estimation
Stochastic Computation Graphs
Stochastic Computation Graphs in RL
Other Examples of Stochastic Computation Graphs
Gradient Estimation Using Stochastic Computation Graphs (NIPS 2015) [10]

John Schulman, Nicolas Heess, Theophane Weber, Pieter Abbeel
Stochastic Computation Graphs

SF estimator:

$$\nabla_\theta R(\theta)\Big|_{\theta=\theta_0} = \mathbb{E}_{x \sim P_{\theta_0}(dx)}\big[ f(x)\, \nabla_\theta \ln P_\theta(dx) \big]\Big|_{\theta=\theta_0}$$

PD estimator:

$$\nabla_\theta R(\theta) = \mathbb{E}_{z \sim P(dz)}\big[ \nabla_\theta f(x(\theta, z)) \big]$$
Stochastic Computation Graphs

- Stochastic computation graphs are a design choice rather than an intrinsic property of a problem.
- Stochastic nodes 'block' gradient flow.
Stochastic Computation Graphs: Neural Networks

- Neural networks are a special case of stochastic computation graphs.
Stochastic Computation Graphs: Deriving Unbiased Estimators

Condition 1 (Differentiability Requirements). Given input node $\theta \in \Theta$, for all edges $(v, w)$ which satisfy $\theta \prec_D v$ and $\theta \prec_D w$, the following condition holds: if $w$ is deterministic, $\partial w / \partial v$ exists, and if $w$ is stochastic, $\frac{\partial}{\partial v}\, p(w \mid \text{PARENTS}_w)$ exists.

Theorem 1. Suppose $\theta \in \Theta$ satisfies Condition 1. The following equations hold:

$$\begin{aligned}
\frac{\partial}{\partial\theta}\, \mathbb{E}\Big[ \sum_{c \in C} c \Big]
&= \mathbb{E}\Bigg[ \sum_{\substack{w \in S \\ \theta \prec_D w}} \Big( \frac{\partial}{\partial\theta} \log p(w \mid \text{DEPS}_w) \Big)\, \hat{Q}_w + \sum_{\substack{c \in C \\ \theta \prec_D c}} \frac{\partial}{\partial\theta}\, c(\text{DEPS}_c) \Bigg] \\
&= \mathbb{E}\Bigg[ \sum_{c \in C} c \sum_{\substack{w \in S \\ \theta \prec_D w}} \frac{\partial}{\partial\theta} \log p(w \mid \text{DEPS}_w) + \sum_{\substack{c \in C \\ \theta \prec_D c}} \frac{\partial}{\partial\theta}\, c(\text{DEPS}_c) \Bigg]
\end{aligned}$$

Here $S$ is the set of stochastic nodes, $C$ the set of cost nodes, and $\hat{Q}_w$ the sum of cost nodes downstream of $w$.
Stochastic Computation Graphs: Surrogate loss functions

We have this equation from Theorem 1:

$$\frac{\partial}{\partial\theta}\, \mathbb{E}\Big[ \sum_{c \in C} c \Big] = \mathbb{E}\Bigg[ \sum_{\substack{w \in S \\ \theta \prec_D w}} \Big( \frac{\partial}{\partial\theta} \log p(w \mid \text{DEPS}_w) \Big)\, \hat{Q}_w + \sum_{\substack{c \in C \\ \theta \prec_D c}} \frac{\partial}{\partial\theta}\, c(\text{DEPS}_c) \Bigg]$$

Note that by defining

$$L(\Theta, S) := \sum_{w \in S} \log p(w \mid \text{DEPS}_w)\, \hat{Q}_w + \sum_{c \in C} c(\text{DEPS}_c)$$

we have

$$\frac{\partial}{\partial\theta}\, \mathbb{E}\Big[ \sum_{c \in C} c \Big] = \mathbb{E}\Big[ \frac{\partial}{\partial\theta}\, L(\Theta, S) \Big]$$

so differentiating the surrogate loss $L$ with ordinary backpropagation yields an unbiased gradient estimate.
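A minimal PyTorch sketch of this surrogate-loss construction for a single stochastic node (a categorical sample) followed by one cost; the distribution and cost here are hypothetical stand-ins:

```python
import torch
from torch.distributions import Categorical

theta = torch.randn(4, requires_grad=True)  # input node

# Stochastic node: w ~ Categorical(softmax(theta)); sampling blocks gradients.
dist = Categorical(logits=theta)
w = dist.sample()

cost = (w.float() - 2.0) ** 2        # downstream cost c(w)
q_hat = cost.detach()                # \hat{Q}_w: downstream cost, no gradient

# Surrogate loss L = log p(w | DEPS_w) * \hat{Q}_w + (differentiable costs).
# Here the cost depends on theta only through w, so only the SF term remains.
surrogate = dist.log_prob(w) * q_hat
surrogate.backward()                 # theta.grad is an unbiased estimate
print(theta.grad)
```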
Stochastic Computation Graphs: Implementations

- Making stochastic computation graphs with neural network packages is easy:

tensorflow.org/api_docs/python/tf/contrib/bayesflow/stochastic_gradient_estimators
github.com/pytorch/pytorch/blob/6a69f7007b37bb36d8a712cdf5cebe3ee0cc1cc8/torch/autograd/variable.py
github.com/blei-lab/edward/blob/master/edward/inferences/klqp.py
Policy Gradient: Surrogate Loss

- The policy gradient algorithm [12] is Theorem 1 applied to the MDP graph; a sketch of the resulting surrogate loss follows below.
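A hedged sketch of that surrogate for an episodic, REINFORCE-style update; the policy network, batched trajectory tensors, and returns-to-go computation are assumed placeholders:

```python
import torch
from torch.distributions import Categorical

def policy_gradient_surrogate(policy_net, states, actions, returns):
    """Surrogate loss whose gradient is the policy gradient estimate.

    states:  (T, obs_dim) tensor; actions: (T,) tensor;
    returns: (T,) returns-to-go, playing the role of \\hat{Q}_w per action.
    """
    logits = policy_net(states)
    log_probs = Categorical(logits=logits).log_prob(actions)
    # Detach the returns so they act as fixed weights on the score terms;
    # minimizing the negative surrogate ascends the expected return.
    return -(log_probs * returns.detach()).mean()
```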
Stochastic Value Gradient: SVG(0)

Learn $Q_\phi$, and update [5]

$$\Delta\theta \propto \nabla_\theta \sum_{t=1}^{T} Q_\phi\big(s_t, \pi_\theta(s_t, \epsilon_t)\big)$$
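A minimal PyTorch sketch of the SVG(0) update direction; the actor, critic, and noise convention are assumed placeholders:

```python
import torch

def svg0_actor_loss(actor, critic, states, noises):
    """Differentiate the learned Q through a reparameterized policy.

    actions = actor(states, noises) is a deterministic function of theta
    given the sampled noise, so gradients flow critic -> action -> theta.
    """
    actions = actor(states, noises)
    return -critic(states, actions).sum()  # ascend sum_t Q_phi(s_t, a_t)
```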
Stochastic Value Gradient: SVG(1)

Learn $V_\phi$ and a dynamics model $f$ such that $s_{t+1} = f(s_t, a_t) + \delta_t$. Given a transition $(s_t, a_t, s_{t+1})$, infer the noise $\delta_t$. Then [5]

$$\Delta\theta \propto \nabla_\theta \sum_{t=1}^{T} \big[ r_t + \gamma\, V_\phi\big(f(s_t, a_t) + \delta_t\big) \big]$$
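A corresponding hedged sketch for SVG(1), re-using the inferred noise so the next state is a differentiable function of the action; the model, value network, and reward function are placeholders:

```python
import torch

def svg1_actor_loss(actor, model, value_net, reward_fn,
                    states, noises, deltas, gamma=0.99):
    """One-step stochastic value gradient through a learned model.

    deltas are the noises inferred from observed transitions, so
    next_states = model(s, a) + delta is differentiable in the action.
    """
    actions = actor(states, noises)
    next_states = model(states, actions) + deltas
    returns = reward_fn(states, actions) + gamma * value_net(next_states)
    return -returns.sum()
```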
Stochastic Value Gradient: SVG(∞)

Learn $f$ and infer all noise variables $\delta_t$, then differentiate through the whole graph [5].
Deterministic Policy Gradient [11]
Evolution Strategies [9]
Neural Variational Inference [7]

The variational lower bound is

$$\mathcal{L}(x, \theta, \phi) = \mathbb{E}_{h \sim Q_\phi(h \mid x)}\big[ \log P_\theta(x, h) - \log Q_\phi(h \mid x) \big]$$

- The gradient with respect to $\phi$ is estimated with the score function estimator (a sketch follows below).
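A hedged PyTorch sketch of the plain score-function ELBO gradient for the inference network; a Bernoulli latent and the two networks are assumed placeholders, and NVIL's baselines and other variance-reduction terms are omitted:

```python
import torch
from torch.distributions import Bernoulli

def nvil_inference_loss(encoder, log_joint, x):
    """Score-function (REINFORCE) estimate of the ELBO gradient w.r.t.
    the encoder parameters phi; no gradient flows through the sample h."""
    q = Bernoulli(logits=encoder(x))
    h = q.sample()
    elbo = log_joint(x, h) - q.log_prob(h).sum(-1)  # log p(x,h) - log q(h|x)
    # The learning signal multiplies the score; detach it so it plays the
    # role of \hat{Q}_w rather than receiving gradients itself.
    return -(q.log_prob(h).sum(-1) * elbo.detach()).mean()
```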
Variational Autoencoder [6]

- Pathwise derivative estimator applied to the exact same graph as NVIL (a sketch follows below).
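A hedged sketch of the reparameterized ELBO with a Gaussian latent; the encoder and decoder interfaces (including `decoder.log_prob`) are assumed placeholders:

```python
import torch

def vae_elbo_loss(encoder, decoder, x):
    """Reparameterized (pathwise) ELBO estimate: h = mu + sigma * eps
    makes the sample differentiable w.r.t. the encoder parameters."""
    mu, log_sigma = encoder(x)
    eps = torch.randn_like(mu)
    h = mu + log_sigma.exp() * eps
    # Analytic KL(q(h|x) || N(0, I)) plus the reconstruction term.
    kl = 0.5 * (mu**2 + (2 * log_sigma).exp() - 2 * log_sigma - 1).sum(-1)
    return -(decoder.log_prob(x, h) - kl).mean()
```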
DRAW [4]

- Sequential version of the VAE.
- Same graph as SVG(∞) with (known) deterministic dynamics ($e$ = state, $z$ = action, $\mathcal{L}$ = reward).
- Similarly, the VAE can be seen as SVG(∞) with one timestep and deterministic dynamics.
GAN [3]

- The connection to SVG(0) was established in [8].
- The actor has no access to state. The observation is a random choice between a real and a fake image, and the reward is 1 if real, 0 if fake. D learns a value function of the observation; G's learning signal is D.
Bayes by Backprop [1]

The cost function $f$ is

$$f(\mathcal{D}, \theta) = \mathbb{E}_{w \sim q(w \mid \theta)}[f(w, \theta)] = \mathbb{E}_{w \sim q(w \mid \theta)}\big[ -\log p(\mathcal{D} \mid w) + \log q(w \mid \theta) - \log p(w) \big]$$

- From the point of view of stochastic computation graphs, weights and activations are no different.
- Estimating the ELBO gradient of an activation with the PD estimator gives the VAE.
- Estimating the ELBO gradient of a weight with the PD estimator gives Bayes by Backprop (a sketch follows below).
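A hedged sketch of the weight-level reparameterization for one Gaussian weight tensor; the variational parameters `(mu, rho)`, the data-fit term `log_likelihood`, and the unit-Gaussian prior are assumed placeholders:

```python
import torch

def bayes_by_backprop_loss(mu, rho, log_likelihood, prior_std=1.0):
    """Sample weights w = mu + softplus(rho) * eps and evaluate the ELBO;
    the PD estimator flows gradients into (mu, rho) through w."""
    sigma = torch.nn.functional.softplus(rho)
    eps = torch.randn_like(mu)
    w = mu + sigma * eps
    log_q = torch.distributions.Normal(mu, sigma).log_prob(w).sum()
    log_p = torch.distributions.Normal(0.0, prior_std).log_prob(w).sum()
    return -(log_likelihood(w) - (log_q - log_p))
```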
Summary

- Stochastic computation graphs explain many different algorithms with one gradient estimator (Theorem 1).
- Stochastic computation graphs give us insight into the behavior of estimators.
- Stochastic computation graphs reveal equivalences:
  - NVIL [7] = policy gradient [12]
  - VAE [6] / DRAW [4] = SVG(∞) [5]
  - GAN [3] = SVG(0) [5]
References

[1] Charles Blundell et al. "Weight Uncertainty in Neural Networks". In: ICML (2015).
[2] Michael Fu. "Stochastic Gradient Estimation". In: (2005).
[3] Ian J. Goodfellow et al. "Generative Adversarial Networks". In: NIPS (2014).
[4] Karol Gregor et al. "DRAW: A Recurrent Neural Network For Image Generation". In: ICML (2015).
[5] Nicolas Heess et al. "Learning Continuous Control Policies by Stochastic Value Gradients". In: NIPS (2015).
[6] Diederik P Kingma and Max Welling. "Auto-Encoding Variational Bayes". In: ICLR (2014).
[7] Andriy Mnih and Karol Gregor. "Neural Variational Inference and Learning in Belief Networks". In: ICML (2014).
[8] David Pfau and Oriol Vinyals. "Connecting Generative Adversarial Networks and Actor-Critic Methods". In: (2016).
[9] Tim Salimans et al. "Evolution Strategies as a Scalable Alternative to Reinforcement Learning". In: arXiv (2017).
[10] John Schulman et al. "Gradient Estimation Using Stochastic Computation Graphs". In: NIPS (2015).
[11] David Silver et al. "Deterministic Policy Gradient Algorithms". In: ICML (2014).
[12] Richard S Sutton et al. "Policy Gradient Methods for Reinforcement Learning with Function Approximation". In: NIPS (1999).