
Gradient Estimation Using Stochastic Computation Graphs

Yoonho Lee

Department of Computer Science and Engineering, Pohang University of Science and Technology

May 9, 2017

Outline

Stochastic Gradient Estimation

Stochastic Computation Graphs

Stochastic Computation Graphs in RL

Other Examples of Stochastic Computation Graphs


Stochastic Gradient Estimation

R(θ) = E_{w∼P_θ(dw)}[f(θ, w)]

Approximations of ∇_θ R(θ) are useful in:

- Computational Finance
- Reinforcement Learning
- Variational Inference
- Explaining the brain with neural networks

Stochastic Gradient Estimation

R(θ) = E_{w∼P_θ(dw)}[f(θ, w)] = ∫_Ω f(θ, w) P_θ(dw) = ∫_Ω f(θ, w) [P_θ(dw) / G(dw)] G(dw)

∇_θ R(θ) = ∇_θ ∫_Ω f(θ, w) [P_θ(dw) / G(dw)] G(dw)

         = ∫_Ω ∇_θ ( f(θ, w) P_θ(dw) / G(dw) ) G(dw)

         = E_{w∼G(dw)} [ ∇_θ f(θ, w) · P_θ(dw)/G(dw) + f(θ, w) · ∇_θ P_θ(dw)/G(dw) ]

Here G is an arbitrary sampling measure on Ω: importance-weighting by P_θ(dw)/G(dw) moves the expectation under G, after which differentiation is exchanged with integration.



Stochastic Gradient Estimation

So far we have

∇_θ R(θ) = E_{w∼G(dw)} [ ∇_θ f(θ, w) · P_θ(dw)/G(dw) + f(θ, w) · ∇_θ P_θ(dw)/G(dw) ]

Letting G = P_{θ₀} and evaluating at θ = θ₀, we get

∇_θ R(θ) |_{θ=θ₀} = E_{w∼P_{θ₀}(dw)} [ ∇_θ f(θ, w) · P_θ(dw)/P_{θ₀}(dw) + f(θ, w) · ∇_θ P_θ(dw)/P_{θ₀}(dw) ] |_{θ=θ₀}

                  = E_{w∼P_{θ₀}(dw)} [ ∇_θ f(θ, w) + f(θ, w) ∇_θ ln P_θ(dw) ] |_{θ=θ₀}


Stochastic Gradient Estimation

So far we have

R(θ) = E_{w∼P_θ(dw)}[f(θ, w)]

∇_θ R(θ) |_{θ=θ₀} = E_{w∼P_{θ₀}(dw)} [ ∇_θ f(θ, w) + f(θ, w) ∇_θ ln P_θ(dw) ] |_{θ=θ₀}

1. Assuming w fully determines f:

∇_θ R(θ) |_{θ=θ₀} = E_{w∼P_{θ₀}(dw)} [ f(w) ∇_θ ln P_θ(dw) ] |_{θ=θ₀}

2. Assuming P_θ(dw) is independent of θ:

∇_θ R(θ) = E_{w∼P(dw)} [ ∇_θ f(θ, w) ]


Stochastic Gradient Estimation: SF vs PD

R(θ) = E_{w∼P_θ(dw)}[f(θ, w)]

Score Function (SF) Estimator:

∇_θ R(θ) |_{θ=θ₀} = E_{w∼P_{θ₀}(dw)} [ f(w) ∇_θ ln P_θ(dw) ] |_{θ=θ₀}

Pathwise Derivative (PD) Estimator:

∇_θ R(θ) = E_{w∼P(dw)} [ ∇_θ f(θ, w) ]

- SF allows discontinuous f.
- SF requires only sample values of f; PD requires the derivative f′.
- SF takes derivatives over the probability distribution of w; PD takes derivatives over the function f.
- SF has high variance when w is high-dimensional.
- PD has high variance when f is rough.

(A toy numerical comparison of the two estimators is sketched below.)
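A minimal numerical sketch of the two estimators (the toy setup here, w ∼ N(θ, 1) with f(w) = w², is an illustrative assumption, not from the slides). Both sample averages should approach the true gradient d/dθ E[w²] = 2θ, with very different variances:

```python
# Toy comparison of the SF and PD estimators on E_{w~N(theta,1)}[w^2].
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.5, 100_000
true_grad = 2 * theta                                # d/dtheta (theta^2 + 1)

# Score function estimator: f(w) * d/dtheta log N(w; theta, 1) = w^2 * (w - theta)
w = rng.normal(theta, 1.0, size=n)
sf = w ** 2 * (w - theta)

# Pathwise derivative estimator: w = theta + eps, so d/dtheta f(w) = 2 * (theta + eps)
eps = rng.normal(0.0, 1.0, size=n)
pd = 2 * (theta + eps)

print(f"true {true_grad:.3f} | SF {sf.mean():.3f} (std {sf.std():.2f}) "
      f"| PD {pd.mean():.3f} (std {pd.std():.2f})")
```

On this smooth, low-dimensional problem the PD samples have far lower variance than the SF samples, consistent with the bullets above.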


Stochastic Gradient Estimation: Choices for PD estimator

blog.shakirm.com/2015/10/machine-learning-trick-of-the-day-4-reparameterisation-tricks/
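As one concrete instance of the reparameterisations surveyed in the post above, here is a sketch of the Gaussian location-scale trick in PyTorch; the parameter values and the quadratic loss are illustrative assumptions.

```python
# Gaussian reparameterization (location-scale): sample w = mu + sigma * eps so that
# gradients with respect to (mu, sigma) flow through the sampling step.
import torch

mu = torch.tensor(0.5, requires_grad=True)
log_sigma = torch.tensor(-1.0, requires_grad=True)

eps = torch.randn(10_000)                     # noise that does not depend on the parameters
w = mu + log_sigma.exp() * eps                # w ~ N(mu, sigma^2), differentiable in mu and sigma

loss = (w ** 2).mean()                        # any smooth f(w) works here
loss.backward()                               # PD estimate of the gradient
print(mu.grad, log_sigma.grad)                # mu.grad should be close to 2 * mu
```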

Outline

Stochastic Gradient Estimation

Stochastic Computation Graphs

Stochastic Computation Graphs in RL

Other Examples of Stochastic Computation Graphs

Gradient Estimation Using Stochastic Computation Graphs (NIPS 2015)

John Schulman, Nicolas Heess, Theophane Weber, Pieter Abbeel

Stochastic Computation Graphs

SF estimator:

∇_θ R(θ) |_{θ=θ₀} = E_{x∼P_{θ₀}(dx)} [ f(x) ∇_θ ln P_θ(dx) ] |_{θ=θ₀}

PD estimator:

∇_θ R(θ) = E_{z∼P(dz)} [ ∇_θ f(x(θ, z)) ]

Stochastic Computation Graphs

- Stochastic computation graphs are a design choice rather than an intrinsic property of a problem
- Stochastic nodes 'block' gradient flow

Stochastic Computation Graphs: Neural Networks

- Neural networks are a special case of stochastic computation graphs

Stochastic Computation Graphs: Formal Definition

A stochastic computation graph is a directed acyclic graph whose nodes are input nodes (including the parameters Θ), deterministic nodes (functions of their parents), and stochastic nodes (distributed conditionally on their parents); a subset of the nodes is designated as the cost nodes C.

Stochastic Computation Graphs: Deriving Unbiased Estimators

Condition 1 (Differentiability Requirements). Given input node θ ∈ Θ, for all edges (v, w) which satisfy θ ≺_D v and θ ≺_D w, the following condition holds: if w is deterministic, ∂w/∂v exists, and if w is stochastic, (∂/∂v) p(w | PARENTS_w) exists.

Theorem 1. Suppose θ ∈ Θ satisfies Condition 1. The following equations hold:

∂/∂θ E[ Σ_{c∈C} c ] = E[ Σ_{w∈S : θ≺_D w} ( ∂/∂θ log p(w | DEPS_w) ) Q_w  +  Σ_{c∈C : θ≺_D c} ∂/∂θ c(DEPS_c) ]

                    = E[ Σ_{c∈C} c · Σ_{w∈S : θ≺_D w} ∂/∂θ log p(w | DEPS_w)  +  Σ_{c∈C : θ≺_D c} ∂/∂θ c(DEPS_c) ]

(Notation: S is the set of stochastic nodes, C the set of cost nodes, θ ≺_D v means θ influences v along a path of deterministic nodes, DEPS_v denotes the stochastic and input nodes that influence v deterministically, and Q_w is the sum of cost nodes downstream of w.)

Stochastic Computation Graphs: Surrogate loss functions

We have this equation from Theorem 1:

∂/∂θ E[ Σ_{c∈C} c ] = E[ Σ_{w∈S : θ≺_D w} ( ∂/∂θ log p(w | DEPS_w) ) Q_w  +  Σ_{c∈C : θ≺_D c} ∂/∂θ c(DEPS_c) ]

Note that by defining

L(Θ, S) := Σ_{w∈S} log p(w | DEPS_w) Q_w  +  Σ_{c∈C} c(DEPS_c)

we have

∂/∂θ E[ Σ_{c∈C} c ] = E[ ∂/∂θ L(Θ, S) ]


Stochastic Computation Graphs: Implementations

- Making stochastic computation graphs with neural network packages is easy

tensorflow.org/api_docs/python/tf/contrib/bayesflow/stochastic_gradient_estimators

github.com/pytorch/pytorch/blob/6a69f7007b37bb36d8a712cdf5cebe3ee0cc1cc8/torch/autograd/variable.py

github.com/blei-lab/edward/blob/master/edward/inferences/klqp.py
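For instance, a single-sample surrogate loss in the sense of the earlier slide takes only a few lines with torch.distributions; the categorical stochastic node and the cost vector below are illustrative assumptions, not code from the linked repositories.

```python
# Surrogate-loss sketch: autodiff on log p(w|theta) * Q_w reproduces the SF estimator.
import torch
import torch.distributions as D

theta = torch.zeros(3, requires_grad=True)      # logits parameterizing a categorical stochastic node
costs = torch.tensor([1.0, 2.0, 0.5])           # illustrative downstream cost for each outcome

dist = D.Categorical(logits=theta)
w = dist.sample()                               # stochastic node: gradients do not flow through it
q_w = costs[w].detach()                         # Q_w, treated as a constant in the surrogate
surrogate = dist.log_prob(w) * q_w              # log p(w | DEPS_w) * Q_w
surrogate.backward()                            # theta.grad now holds a single-sample SF estimate
print(theta.grad)
```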

Outline

Stochastic Gradient Estimation

Stochastic Computation Graphs

Stochastic Computation Graphs in RL

Other Examples of Stochastic Computation Graphs

MDP

- The MDP formulation itself is a stochastic computation graph

Policy Gradient: Surrogate Loss

- The policy gradient algorithm is Theorem 1 applied to the MDP graph¹

¹ Richard S Sutton et al. “Policy Gradient Methods for Reinforcement Learning with Function Approximation”. In: NIPS (1999).
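A sketch of the resulting surrogate for one trajectory, Σ_t log π_θ(a_t|s_t) · R_t with reward-to-go returns R_t; the linear policy and the randomly generated "environment" below are illustrative stand-ins.

```python
# REINFORCE-style surrogate loss for a (faked) trajectory.
import torch
import torch.distributions as D
from torch import nn

policy = nn.Linear(4, 2)                        # maps state features to action logits

states = torch.randn(10, 4)                     # stand-in trajectory of states
dist = D.Categorical(logits=policy(states))
actions = dist.sample()
rewards = torch.randn(10)                       # stand-in rewards
returns = torch.flip(torch.cumsum(torch.flip(rewards, [0]), 0), [0])  # reward-to-go R_t

surrogate = -(dist.log_prob(actions) * returns).sum()   # minimize the negative surrogate
surrogate.backward()                            # policy-gradient estimate lands in policy.*.grad
```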

Stochastic Value Gradient: SVG(0)

Learn Q_φ, and update²

Δθ ∝ ∇_θ Σ_{t=1}^{T} Q_φ(s_t, π_θ(s_t, ε_t))

² Nicolas Heess et al. “Learning Continuous Control Policies by Stochastic Value Gradients”. In: NIPS (2015).
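A one-batch sketch of this update: the critic Q_φ is differentiated with respect to θ through a reparameterized action π_θ(s, ε). The linear networks and the noise scale are placeholders, not the architecture from the paper.

```python
# SVG(0)-style actor gradient: d/dtheta Q_phi(s, pi_theta(s, eps)).
import torch
from torch import nn

actor = nn.Linear(3, 1)                         # pi_theta: state -> action mean
critic = nn.Linear(4, 1)                        # Q_phi: (state, action) -> value

s = torch.randn(32, 3)
eps = torch.randn(32, 1)                        # reparameterization noise epsilon_t
a = actor(s) + 0.1 * eps                        # action sampled via the PD trick
q = critic(torch.cat([s, a], dim=1)).mean()

(-q).backward()                                 # ascend Q: the gradient reaches the actor through a
# an actor optimizer would step only actor.parameters(); Q_phi is fitted separately (e.g. by TD)
```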

Stochastic Value Gradient: SVG(1)

Learn V_φ and a dynamics model f such that s_{t+1} = f(s_t, a_t) + δ_t. Given a transition (s_t, a_t, s_{t+1}), infer the noise δ_t.³

Δθ ∝ ∇_θ Σ_{t=1}^{T} [ r_t + γ V_φ(f(s_t, a_t) + δ_t) ]

³ Nicolas Heess et al. “Learning Continuous Control Policies by Stochastic Value Gradients”. In: NIPS (2015).
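A sketch of the corresponding computation, assuming a replayed transition from which the noise δ_t is inferred and then held fixed; the networks, reward, and discount below are illustrative.

```python
# SVG(1)-style gradient: backprop through a learned model f and value V_phi,
# with the transition noise delta inferred from observed data.
import torch
from torch import nn

actor = nn.Linear(3, 2)                         # pi_theta: state -> action
model = nn.Linear(5, 3)                         # f: (state, action) -> next-state prediction
value = nn.Linear(3, 1)                         # V_phi

s, s_next = torch.randn(16, 3), torch.randn(16, 3)
a_taken = torch.randn(16, 2)                    # actions stored in the replayed transitions
with torch.no_grad():
    delta = s_next - model(torch.cat([s, a_taken], dim=1))   # inferred noise, held fixed

a = actor(s)                                    # re-evaluate the policy differentiably
r = -(a ** 2).sum(dim=1)                        # stand-in reward that depends on the action
objective = (r + 0.99 * value(model(torch.cat([s, a], dim=1)) + delta).squeeze(1)).sum()
(-objective).backward()                         # gradient flows to the actor through r, f and V_phi
```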

Stochastic Value Gradient: SVG(∞)

Learn f, infer all noise variables δ_t, and differentiate through the whole graph.⁴

⁴ Nicolas Heess et al. “Learning Continuous Control Policies by Stochastic Value Gradients”. In: NIPS (2015).

Deterministic Policy Gradient⁵

⁵ David Silver et al. “Deterministic Policy Gradient Algorithms”. In: ICML (2014).

Evolution Strategies⁶

⁶ Tim Salimans et al. “Evolution Strategies as a Scalable Alternative to Reinforcement Learning”. In: arXiv (2017).

Outline

Stochastic Gradient Estimation

Stochastic Computation Graphs

Stochastic Computation Graphs in RL

Other Examples of Stochastic Computation Graphs

Neural Variational Inference⁷

The variational lower bound is

L(x, θ, φ) = E_{h∼Q_φ(h|x)} [ log P_θ(x, h) − log Q_φ(h|x) ]

- NVIL estimates its gradient with the score function estimator

⁷ Andriy Mnih and Karol Gregor. “Neural Variational Inference and Learning in Belief Networks”. In: ICML (2014).

Variational Autoencoder⁸

- Pathwise derivative estimator applied to the exact same graph as NVIL

⁸ Diederik P Kingma and Max Welling. “Auto-Encoding Variational Bayes”. In: ICLR (2014).
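A minimal sketch of the PD estimator on a VAE-like graph: the encoder outputs (μ, log σ) of q(z|x), z is reparameterized, and the ELBO gradient flows through the sample. Layer sizes and the Bernoulli likelihood are assumptions for illustration.

```python
# VAE-style pathwise ELBO gradient: gradients reach the encoder through z = mu + sigma * eps.
import torch
from torch import nn
import torch.nn.functional as F

enc = nn.Linear(784, 2 * 16)                    # x -> (mu, log_sigma) of q(z|x)
dec = nn.Linear(16, 784)                        # z -> logits of p(x|z)

x = torch.rand(8, 784)                          # stand-in batch of inputs in [0, 1]
mu, log_sigma = enc(x).chunk(2, dim=1)
z = mu + log_sigma.exp() * torch.randn_like(mu)  # reparameterized sample from q(z|x)

recon = F.binary_cross_entropy_with_logits(dec(z), x, reduction="sum")   # -log p(x|z)
kl = 0.5 * (mu ** 2 + (2 * log_sigma).exp() - 2 * log_sigma - 1).sum()   # KL(q(z|x) || N(0, I))
(recon + kl).backward()                         # encoder gradients flow through the sample z
```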

DRAW⁹

- Sequential version of the VAE
- Same graph as SVG(∞) with (known) deterministic dynamics (e = state, z = action, L = reward)
- Similarly, the VAE can be seen as SVG(∞) with one timestep and deterministic dynamics

⁹ Karol Gregor et al. “DRAW: A Recurrent Neural Network For Image Generation”. In: ICML (2015).

GAN¹⁰

- The connection to SVG(0) was established by Pfau and Vinyals¹⁰
- The actor has no access to state. The observation is a random choice between a real and a fake image. The reward is 1 if the image is real and 0 if it is fake. D learns a value function of the observation; G's learning signal comes from D.

¹⁰ David Pfau and Oriol Vinyals. “Connecting Generative Adversarial Networks and Actor-Critic Methods”. (2016).
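A sketch of the generator update seen through this lens: G's gradient arrives only by backpropagating through D, just as the SVG(0) actor differentiates through its critic. The networks and loss below are illustrative placeholders, not the original GAN training loop.

```python
# Generator step sketch: G's learning signal comes entirely from D.
import torch
from torch import nn
import torch.nn.functional as F

G = nn.Linear(8, 32)                            # generator: noise z -> fake "image"
Dnet = nn.Linear(32, 1)                         # discriminator, acting as a learned critic

z = torch.randn(16, 8)
fake = G(z)
score = Dnet(fake)                              # D's logit that the sample is real
g_loss = F.binary_cross_entropy_with_logits(score, torch.ones_like(score))
g_loss.backward()                               # gradients reach G only through Dnet
```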

Bayes by Backprop¹¹

Our cost function f is:

f(D, θ) = E_{w∼q(w|θ)} [ f(w, θ) ]
        = E_{w∼q(w|θ)} [ −log p(D|w) + log q(w|θ) − log p(w) ]

- From the point of view of stochastic computation graphs, weights and activations are no different
- Estimating the ELBO gradient of activations with the PD estimator gives the VAE
- Estimating the ELBO gradient of weights with the PD estimator gives Bayes by Backprop

¹¹ Charles Blundell et al. “Weight Uncertainty in Neural Networks”. In: ICML (2015).
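A sketch of one Bayes-by-Backprop step for a single weight matrix: reparameterize w = μ + softplus(ρ)·ε and differentiate the three terms of f above through the sample. The linear model, unit Gaussian prior, and random data batch are assumptions for illustration.

```python
# Bayes-by-backprop sketch: the PD estimator applied to the weights themselves.
import torch
import torch.nn.functional as F
from torch.distributions import Normal

mu = torch.zeros(10, 1, requires_grad=True)            # variational mean of the weights
rho = torch.full((10, 1), -3.0, requires_grad=True)    # sigma = softplus(rho) > 0

x, y = torch.randn(32, 10), torch.randn(32, 1)         # stand-in data batch D

sigma = F.softplus(rho)
w = mu + sigma * torch.randn_like(mu)                  # one reparameterized weight sample per step

nll = ((x @ w - y) ** 2).sum()                         # -log p(D|w) up to constants (Gaussian likelihood)
log_q = Normal(mu, sigma).log_prob(w).sum()            # log q(w|theta)
log_p = Normal(0.0, 1.0).log_prob(w).sum()             # log p(w), unit Gaussian prior (an assumption)

(nll + log_q - log_p).backward()                       # gradients flow into mu and rho through w
```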

Summary

- Stochastic computation graphs explain many different algorithms with one gradient estimator (Theorem 1)
- Stochastic computation graphs give us insight into the behavior of estimators
- Stochastic computation graphs reveal equivalences:
  - NVIL [7] = policy gradient [12]
  - VAE [6] / DRAW [4] = SVG(∞) [5]
  - GAN [3] = SVG(0) [5]

References I

[1] Charles Blundell et al. “Weight Uncertainty in Neural Networks”. In: ICML (2015).

[2] Michael Fu. “Stochastic Gradient Estimation”. (2005).

[3] Ian J. Goodfellow et al. “Generative Adversarial Networks”. In: NIPS (2014).

[4] Karol Gregor et al. “DRAW: A Recurrent Neural Network For Image Generation”. In: ICML (2015).

[5] Nicolas Heess et al. “Learning Continuous Control Policies by Stochastic Value Gradients”. In: NIPS (2015).

[6] Diederik P Kingma and Max Welling. “Auto-Encoding Variational Bayes”. In: ICLR (2014).

[7] Andriy Mnih and Karol Gregor. “Neural Variational Inference and Learning in Belief Networks”. In: ICML (2014).

References II

[8] David Pfau and Oriol Vinyals. “Connecting Generative Adversarial Networks and Actor-Critic Methods”. (2016).

[9] Tim Salimans et al. “Evolution Strategies as a Scalable Alternative to Reinforcement Learning”. In: arXiv (2017).

[10] John Schulman et al. “Gradient Estimation Using Stochastic Computation Graphs”. In: NIPS (2015).

[11] David Silver et al. “Deterministic Policy Gradient Algorithms”. In: ICML (2014).

[12] Richard S Sutton et al. “Policy Gradient Methods for Reinforcement Learning with Function Approximation”. In: NIPS (1999).

Thank You