Training neural ODEs for density estimation
Chris Finlay
in collaboration with Jörn-Henrik Jacobsen, Levon Nurbekyan and Adam Oberman
Paper: “How to train your Neural ODE”
IPAM HJB II, April 23, 2020
Table of Contents
Background
  Density estimation with Normalizing flows
  Neural ODEs
FFJORD: Neural ODEs + Normalizing flows
Regularized neural ODEs
Results
1. Background: Density estimation with Normalizing flows
Density estimation

[Figure: training data, images of dogs; p(dog) = ?]

Density estimation is a fundamental problem in ML
- Given data {X1, . . . , Xn} drawn from an unknown distribution p(x), estimate p

Typically done through maximum log-likelihood
- select a family of models pθ
- solve maxθ ∑i log pθ(Xi)

Two issues arise
1. How to parameterize pθ?
2. How to sample from the learned distribution pθ? I.e., generate new data?
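To make the maximum log-likelihood recipe concrete, here is a minimal sketch (not from the talk) that fits a one-dimensional Gaussian family pθ = N(μ, σ²) to synthetic samples by maximizing ∑i log pθ(Xi); the data-generating distribution and the scipy-based optimizer are illustrative choices.

```python
# A minimal sketch (not from the talk): density estimation by maximum
# log-likelihood for a 1-D Gaussian family p_theta = N(mu, sigma^2).
# The synthetic data and the parameterization are illustrative only.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=0.5, size=500)           # samples from the "unknown" p

def neg_log_likelihood(theta):
    mu, log_sigma = theta
    return -np.sum(norm.logpdf(X, loc=mu, scale=np.exp(log_sigma)))

theta_hat = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0])).x
print(theta_hat[0], np.exp(theta_hat[1]))               # roughly (2.0, 0.5)
```

For such a simple family both issues above are easy; flexible neural parameterizations of pθ are what make them hard, which is where normalizing flows come in.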
Density estimation and generation with normalizing flows
Today, the gospel of Normalizing Flows [TT13, RM15]

Density estimation
- suppose we have an invertible map fθ : X → Z
- Change of variables applied to log-densities:

    log pθ(x) = log pN(fθ(x)) + log det[dfθ(x)/dx]    (1)

- here N is the standard normal distribution

Generation
- sample z ∼ N
- compute x = fθ⁻¹(z)
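As a toy instance of Eq. (1), the sketch below uses a fixed invertible affine map fθ(x) = Ax + b in two dimensions, so the log-determinant and the inverse are exact; A and b are made-up stand-ins for a learned flow.

```python
# A minimal sketch of Eq. (1) for a toy invertible affine map f(x) = A x + b.
# A and b are illustrative; a real normalizing flow would learn f.
import numpy as np
from scipy.stats import multivariate_normal

A = np.array([[2.0, 0.3], [0.0, 0.5]])        # invertible, so f is invertible
b = np.array([0.1, -0.2])
standard_normal = multivariate_normal(mean=np.zeros(2))

def log_density(x):
    z = A @ x + b                              # z = f(x)
    log_det = np.linalg.slogdet(A)[1]          # log |det df/dx| (constant here)
    return standard_normal.logpdf(z) + log_det

def sample():
    z = standard_normal.rvs()                  # z ~ N(0, I)
    return np.linalg.solve(A, z - b)           # x = f^{-1}(z)

print(log_density(np.array([0.5, 1.0])), sample())
```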
That’s nice, but...
Unfortunately this approach has two difficult problems:
1. Constructing invertible deep neural networks is hard (but see e.g. [CBD+19]).
2. Need to compute Jacobian determinants. Also hard.

Both of these tasks can be made easier by putting structural constraints on the model. However these constraints tend to degrade model performance.

Is there an easier way?
1. Background: Neural ODEs
Neural ODEs
Source: [CRB+18], Figure 1
Neural ODEs [HR17, CRB+18] are generalizations of Residual Networks, where the layer index t is a continuous variable called “time”.
- Compare one layer of a Residual Network

    x^{t+1} = x^t + vθ(x^t, t)

- with an Euler step discretization of the ODE ẋ = vθ(x, t), with step size τ

    x^{t+1} = x^t + τ vθ(x^t, t)
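A minimal sketch of this comparison, with a made-up affine-plus-tanh “layer” standing in for vθ: the residual block is exactly an Euler step with τ = 1.

```python
# A minimal sketch: one ResNet block vs one Euler step of dx/dt = v(x, t).
# The toy vector field v (a fixed affine map + tanh) is illustrative only.
import numpy as np

W = np.array([[0.3, -1.0], [1.0, 0.2]])

def v(x, t):
    return np.tanh(W @ x + 0.1 * t)

def resnet_layer(x, t):
    return x + v(x, t)             # residual block: implicit step size 1

def euler_step(x, t, tau):
    return x + tau * v(x, t)       # explicit Euler step with step size tau

x = np.array([1.0, 0.0])
print(resnet_layer(x, 0.0), euler_step(x, 0.0, tau=0.1))
```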
Neural ODEs
- Rather than fixing the number of layers (time steps) in the ResNet beforehand,
- solve the ODE

    ẋ = vθ(x, t),   x(0) = x0    (2)

  with an adaptive ODE solver, up to some end time T. Here x0 is the input to the “network”.
- The function so defined is the solution map:

    fθ(x0) := x(T) = x0 + ∫_0^T vθ(x(s), s) ds

Benefits: memory savings; tradeoff between accuracy and speed
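A minimal sketch of evaluating the solution map with an adaptive solver (scipy’s RK45 here, an illustrative choice), reusing the toy field above rather than a trained network.

```python
# A minimal sketch: the solution map f(x0) = x(T) computed with an adaptive
# ODE solver. The vector field is the same toy affine+tanh field as above.
import numpy as np
from scipy.integrate import solve_ivp

W = np.array([[0.3, -1.0], [1.0, 0.2]])

def v(t, x):                                   # solve_ivp wants the signature f(t, x)
    return np.tanh(W @ x + 0.1 * t)

def solution_map(x0, T=1.0):
    sol = solve_ivp(v, (0.0, T), x0, method="RK45", rtol=1e-6, atol=1e-8)
    return sol.y[:, -1]                        # x(T)

print(solution_map(np.array([1.0, 0.0])))
```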
Aside: Differentiating losses of neural ODEs
Suppose we have a loss function L(x(T)) which depends only on the final state x(T). How do we differentiate with respect to θ?
- Naive approach: backpropagate through the computation graph of the ODE solver
- Optimal control: use sensitivity analysis [PMB+62]. The gradient dL/dθ is given by

    dL(x(T))/dθ = μ(0)

  where μ, λ and x solve the following ODE (run backwards in time)

    dμ/dt = −λᵀ (d/dθ) vθ(x(t), t),   μ(T) = 0
    dλ/dt = −λᵀ ∇x vθ(x(t), t),       λ(T) = dL(x(T))/dx(T)
    dx/dt = −vθ(x(t), t),             x(T) = x_T
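A minimal sketch of this sensitivity computation on a scalar toy problem ẋ = θx with loss L = x(T), where the exact gradient x0·T·e^{θT} is known; the state equation is kept in forward time and simply integrated from T down to 0, which plays the role of the reversed dynamics above. All constants are illustrative.

```python
# A minimal sketch of the adjoint/sensitivity equations on the toy dynamics
# x' = theta * x with loss L(x(T)) = x(T). All constants are illustrative.
import numpy as np
from scipy.integrate import solve_ivp

theta, x0, T = 0.7, 1.5, 2.0

# forward pass: x' = theta * x on [0, T]
xT = solve_ivp(lambda t, x: theta * x, (0.0, T), [x0],
               rtol=1e-9, atol=1e-11).y[0, -1]

# backward pass: solve (x, lambda, mu) from t = T down to t = 0, with
#   lambda' = -lambda * dv/dx     = -lambda * theta,  lambda(T) = dL/dx(T) = 1
#   mu'     = -lambda * dv/dtheta = -lambda * x,      mu(T) = 0
def augmented(t, y):
    x, lam, mu = y
    return [theta * x, -lam * theta, -lam * x]

mu0 = solve_ivp(augmented, (T, 0.0), [xT, 1.0, 0.0],
                rtol=1e-9, atol=1e-11).y[2, -1]
print(mu0, x0 * T * np.exp(theta * T))   # mu(0) matches the exact gradient
```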
2. FFJORD: Neural ODEs + Normalizing flows
Neural ODEs and Normalizing Flows
We can overcome the two difficulties of normalizing flows by exploiting properties of dynamical systems.

1. If vθ is Lipschitz, then the solution map is invertible! (Picard-Lindelöf)

    fθ⁻¹(x(T)) = x(T) + ∫_T^0 vθ(x(s), s) ds

- i.e. we just need to solve the backwards dynamics

    ẋ = −vθ(x, t),   x(0) = x_T    (3)
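A minimal numerical check of this invertibility, reusing the toy field from before: integrating forward to x(T) and then integrating from T back down to 0 (equivalently, solving Eq. (3)) recovers x0 up to solver tolerance.

```python
# A minimal sketch: numerical invertibility of the flow map. Integrate the toy
# field forward to x(T), then solve the reversed dynamics and recover x0.
import numpy as np
from scipy.integrate import solve_ivp

W = np.array([[0.3, -1.0], [1.0, 0.2]])
v = lambda t, x: np.tanh(W @ x + 0.1 * t)

x0, T = np.array([1.0, 0.0]), 1.0
xT     = solve_ivp(v, (0.0, T), x0, rtol=1e-9, atol=1e-11).y[:, -1]   # forward
x0_rec = solve_ivp(v, (T, 0.0), xT, rtol=1e-9, atol=1e-11).y[:, -1]   # backward
print(np.max(np.abs(x0 - x0_rec)))     # tiny, up to solver tolerance
```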
Neural ODEs and Normalizing Flows
We can overcome the two difficulties of normalizing flows by exploiting properties of dynamical systems.
2. Log-determinants (of particles solving the ODE) have a beautiful time derivative [Vil03, p. 114]

    (d/ds) log det[dx(s)/dx0] = div(v)(x(s), s)

   where div(·) is the divergence operator, div(f) = ∑_i ∂f_i/∂x_i
- i.e.

    log det[dfθ(x)/dx] = ∫_0^T div(v)(x(s), s) ds
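A quick numerical check of this identity on a linear field v(x) = Ax (an illustrative choice): the flow’s Jacobian is expm(AT), so its log-determinant should equal ∫_0^T div(v) ds = T·Tr(A).

```python
# A minimal sketch checking d/ds log det [dx(s)/dx0] = div(v) on a linear field
# v(x) = A x: here dx(T)/dx0 = expm(A T), so log det = T * tr(A) exactly.
import numpy as np
from scipy.linalg import expm

A = np.array([[0.2, -1.0], [0.8, -0.3]])
T = 1.5

log_det = np.linalg.slogdet(expm(A * T))[1]    # log det of the flow's Jacobian
integral_of_divergence = T * np.trace(A)       # ∫_0^T div(v) ds, with div(Ax) = tr(A)
print(log_det, integral_of_divergence)         # agree up to floating point
```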
FFJORD: density estimation
Putting these two observations together, we arrive at the FFJORD algorithm (Free-form Jacobian Of Reversible Dynamics) [GCB+19].

Density estimation
To learn the distribution pθ, solve the following log-likelihood optimization problem

    maxθ ∑_i [ log pN(zi(T)) + ∫_0^T div(vθ)(zi(s), s) ds ]

where zi(s) satisfies the ODE

    dzi(s)/ds = vθ(zi(s), s),   zi(0) = xi
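A minimal sketch of the objective’s inner quantity for a single data point: the state is augmented with the running divergence integral and both are solved together, then log pN(z(T)) is added. The toy field and its hand-coded divergence are illustrative; FFJORD itself uses a neural network for vθ and the stochastic trace estimate described next.

```python
# A minimal sketch of the FFJORD log-likelihood for one data point x_i:
# solve z' = v(z, s) jointly with l' = div(v)(z, s), then return
# log p_N(z(T)) + l(T). Toy field with a hand-coded divergence.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.stats import multivariate_normal

W = np.array([[0.3, -1.0], [1.0, 0.2]])
standard_normal = multivariate_normal(mean=np.zeros(2))

def v(z, s):
    return np.tanh(W @ z + 0.1 * s)

def div_v(z, s):
    u = np.tanh(W @ z + 0.1 * s)
    return float(np.sum((1.0 - u**2) * np.diag(W)))   # sum_i dv_i/dz_i

def augmented(s, state):
    z = state[:2]
    return np.concatenate([v(z, s), [div_v(z, s)]])

def log_likelihood(x, T=1.0):
    out = solve_ivp(augmented, (0.0, T), np.concatenate([x, [0.0]])).y[:, -1]
    zT, logdet = out[:2], out[2]
    return standard_normal.logpdf(zT) + logdet

print(log_likelihood(np.array([0.5, -1.0])))
```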
That’s nice, but...
Isn’t it really hard to compute the divergence term

    div(vθ)(x, s) = ∑_j ∂vθ^j(x, s)/∂x_j

in high dimensions?
Trace estimates
- Divergence is the trace of the Jacobian ∇xvθ(x, t)
- So we can use trace estimates to approximate it:

    div(vθ)(x, t) = Tr(∇xvθ(x, t)) = E_η[ηᵀ ∇xvθ(x, t) η],   where η ∼ N(0, 1)

- this can be computed quickly with reverse-mode automatic differentiation
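A minimal sketch of this estimator, assuming PyTorch and a small made-up MLP in place of vθ (this is not the talk’s implementation): each probe η costs one reverse-mode vector-Jacobian product ηᵀ∇xvθ.

```python
# A minimal sketch of the Hutchinson trace estimate of the divergence,
# div v = Tr(grad_x v) ≈ mean over probes of eta^T (grad_x v) eta,
# using one vector-Jacobian product per probe. The MLP for v is illustrative.
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(3, 16), torch.nn.Tanh(),
                          torch.nn.Linear(16, 3))    # toy v: R^3 -> R^3

def divergence_estimate(x, n_probes=256):
    x = x.clone().requires_grad_(True)
    out = net(x)
    est = 0.0
    for _ in range(n_probes):
        eta = torch.randn_like(x)
        # reverse-mode vJp: grad of <out, eta> w.r.t. x is eta^T (d out / d x)
        vjp, = torch.autograd.grad(out, x, grad_outputs=eta, retain_graph=True)
        est = est + torch.dot(vjp, eta) / n_probes
    return est

x = torch.randn(3)
exact = torch.trace(torch.autograd.functional.jacobian(net, x))
print(divergence_estimate(x).item(), exact.item())   # close for enough probes
```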
FFJORD: generation
FFJORD [GCB+19] generation (density sampling) is simple.

Generation
Sample z ∼ N and solve

    ẋ(s) = −vθ(x(s), s),   x(0) = z

I.e., x = z + ∫_T^0 vθ(x(s), s) ds
3. Regularized neural ODEs
Need for regularity
[Figure: a generic solution; particle paths flowing from the source (data) distribution to the target distribution]
FFJORD is promising, but there are no constraints placed on the paths the particles take.
- As long as the source (data) distribution is mapped to the target (normal) distribution, the log-likelihood is maximized
- i.e. solutions are not unique
- If the particle paths are “wobbly”, the adaptive ODE solver has to take many tiny steps, with many function evaluations. This is time consuming.
Solver is slowed by number of function evaluations
[Figure: number of function evaluations vs Jacobian norm]
- the number of function evaluations (time steps) taken by the solver is related to the total derivative

    dv(x(t), t)/dt = [∇v(x(t), t)] v(x(t), t) + ∂v(x(t), t)/∂t

- In other words, if we can control the size of v and ∇v, we can reduce the number of function evaluations (time steps) taken by the ODE solver
Proposal: regularize the objective to decrease the number of function evaluations
[Figure: the OT map; particles flow from the source density p(x) at t = 0 to the target density p(z(x, T)) at t = 1]
Add two terms to the objective to encourage regularity of trajectories:
- ∫_0^T ‖vθ(x(s), s)‖² ds, the kinetic energy. This is closely related to the Optimal Transport cost between distributions (Benamou-Brenier)
- ∫_0^T ‖∇xvθ(x(s), s)‖²_F ds, a Frobenius norm penalty on the Jacobian
- Frobenius norms can again be computed with a trace estimate:

    ‖A‖²_F = Tr(AᵀA) = E_η[ηᵀAᵀAη] = E_η[‖Aη‖²]

  i.e. we can recycle computations from the divergence estimate (see the sketch below)
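A minimal sketch of the recycling, again with an illustrative PyTorch MLP in place of vθ: the vector-Jacobian product ηᵀ∇xvθ computed for the divergence estimate also gives an unbiased estimate of the squared Frobenius norm, since E‖ηᵀA‖² = Tr(AAᵀ) = ‖A‖²_F (the row-probe counterpart of the E‖Aη‖² identity above).

```python
# A minimal sketch: one vector-Jacobian product per probe yields both the
# Hutchinson divergence estimate and the Jacobian Frobenius-norm estimate.
# The MLP standing in for v_theta is illustrative only.
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(3, 16), torch.nn.Tanh(),
                          torch.nn.Linear(16, 3))

def div_and_frobenius(x, n_probes=256):
    x = x.clone().requires_grad_(True)
    out = net(x)
    div_est, frob_est = 0.0, 0.0
    for _ in range(n_probes):
        eta = torch.randn_like(x)
        vjp, = torch.autograd.grad(out, x, grad_outputs=eta, retain_graph=True)
        div_est  = div_est  + torch.dot(vjp, eta) / n_probes   # ≈ Tr(J)
        frob_est = frob_est + torch.dot(vjp, vjp) / n_probes   # ≈ ||J||_F^2
    return div_est, frob_est

x = torch.randn(3)
d_est, f_est = div_and_frobenius(x)
J = torch.autograd.functional.jacobian(net, x)
print(d_est.item(), torch.trace(J).item())     # divergence: estimate vs exact
print(f_est.item(), (J ** 2).sum().item())     # ||J||_F^2: estimate vs exact
```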
4. Results
Test log-likelihood vs wall clock time
[Figure: bits/dim on the test set vs wall clock time (hours), comparing vanilla FFJORD with FFJORD RNODE (ours); left: MNIST, right: CIFAR10]
Ablation study: regularity measures on MNIST
[Figure: ablation on MNIST, comparing no regularization, kinetic penalty only, Jacobian penalty only, and RNODE (both penalties): mean Jacobian Frobenius norm vs epoch, mean kinetic energy vs epoch, and function evaluations vs epoch]
Some numbers
Some pictures: MNIST & CIFAR10
Some bigger pictures: ImageNet64
[Figure: real ImageNet64 images (left) and ImageNet64 images generated by FFJORD RNODE (right)]
Comments
- The Jacobian norm penalty can be viewed as a continuous time analogue of layer-wise Lipschitz regularization in ResNets. This helps ensure particle paths do not cross, and helps keep networks numerically invertible
- Without regularization, it is very difficult to train neural ODEs with fixed step-size ODE solvers. Need adaptive ODE solvers
- However with regularization, neural ODEs can be trained with fixed step-size ODE solvers (e.g. a four-stage Runge-Kutta scheme; see the sketch below)
- On large images (> 64 pixel width) vanilla FFJORD will not train: the adaptive solver’s time step underflows
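For concreteness, a minimal sketch of a fixed step-size, four-stage Runge-Kutta (classic RK4) integrator of the kind referred to above; the toy vector field stands in for a regularized neural ODE’s vθ.

```python
# A minimal sketch of a fixed step-size, four-stage Runge-Kutta (RK4) solver.
# The toy vector field stands in for a (regularized) neural ODE's v_theta.
import numpy as np

def rk4(v, x0, T, n_steps):
    x, t, h = np.array(x0, dtype=float), 0.0, T / n_steps
    for _ in range(n_steps):
        k1 = v(x, t)
        k2 = v(x + 0.5 * h * k1, t + 0.5 * h)
        k3 = v(x + 0.5 * h * k2, t + 0.5 * h)
        k4 = v(x + h * k3, t + h)
        x = x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return x

v = lambda x, t: np.tanh(np.array([[0.3, -1.0], [1.0, 0.2]]) @ x + 0.1 * t)
print(rk4(v, [1.0, 0.0], T=1.0, n_steps=4))   # four fixed steps, no adaptivity
```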
5. References
References I
TQ Chen, J Behrmann, D Duvenaud and JH Jacobsen, Residual Flows for Invertible Generative Modeling, NeurIPS 2019: 9913–9923.

TQ Chen, Y Rubanova, J Bettencourt and D Duvenaud, Neural Ordinary Differential Equations, NeurIPS 2018: 6572–6583.

W Grathwohl, RT Chen, J Bettencourt, I Sutskever and D Duvenaud, FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models, ICLR 2019.

E Haber and L Ruthotto, Stable architectures for deep neural networks, Inverse Problems, 34 (2017), no. 1, 014004.

L Pontryagin, EF Mischenko, VG Boltyanskii and RV Gamkrelidze, The mathematical theory of optimal processes, 1962.

DJ Rezende and S Mohamed, Variational Inference with Normalizing Flows, ICML 2015: 1530–1538.
References II
EG Tabak and CV Turner, A family of nonparametric density estimation algorithms, Communications on Pure and Applied Mathematics, 66 (2013), no. 2, 145–164.

C Villani, Topics in Optimal Transportation, Graduate Studies in Mathematics, AMS 2003, ISBN 9780821833124.