Inductive Bias and Optimization in Deep Learning
Nati Srebro (TTIC)
Based on work with Behnam Neyshabur (TTIC→Google), Suriya Gunasekar (TTIC→MSR), Ryota Tomioka (TTIC→MSR), Srinadh Bhojanapalli (TTIC→Google), Blake Woodworth, Pedro Savarese, David McAllester (TTIC), Greg Ongie, Becca Willett (Chicago), Daniel Soudry, Elad Hoffer, Mor Shpigel, Itay Sofer (Technion), Ashia Wilson, Becca Roelofs, Mitchell Stern, Ben Recht (Berkeley), Russ Salakhutdinov (CMU), Jason Lee, Zhiyuan Li (Princeton), Yann LeCun (NYU/Facebook)
Feed Forward Neural Networks
• Fix architecture (connection graph $G(V,E)$, transfer function $\sigma$)
$\mathcal{H}_{G(V,E),\sigma} = \{\, f_w : x \mapsto \text{output of the net with weights } w \,\}$
(a minimal code sketch of $f_w$ follows the bullets below)
• Capacity / Generalization ability / Sample Complexity
  • $\tilde{O}(|E|)$ (number of edges, i.e. number of weights) (with threshold $\sigma$, or with ReLU and finite precision; ReLU with infinite precision: $\tilde{\Theta}(|E| \cdot \mathrm{depth})$)
• Expressive Power / Approximation
  • Any continuous function, with a huge network
  • Lots of interesting things naturally with small networks
  • Any time-$T$ computable function with a network of size $\tilde{O}(T)$
• Computation / Optimization
  • NP-hard to find weights, even with 2 hidden units
  • Even if the function is exactly representable by a single hidden layer with $\Theta(\log d)$ units, even with no noise, and even if we allow a much larger network when learning: no poly-time algorithm always works [Kearns Valiant 94; Klivans Sherstov 06; Daniely Linial Shalev-Shwartz '14]
  • Magic property of reality that makes local search "work"
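To make the model class concrete, here is a minimal NumPy sketch of the predictor $f_w$ for one assumed, fixed layered architecture with ReLU transfer (the widths and weights below are illustrative, not anything from the talk):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def f_w(weights, x):
    """Feed-forward predictor f_w(x): alternate linear maps and the ReLU transfer."""
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return weights[-1] @ h  # linear output layer

# Example: a fixed architecture with layer widths 4 -> 8 -> 8 -> 1
rng = np.random.default_rng(0)
widths = [4, 8, 8, 1]
weights = [rng.standard_normal((widths[i + 1], widths[i])) for i in range(len(widths) - 1)]
print(f_w(weights, rng.standard_normal(4)))
```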
Different optimization algorithm → different bias in the optimum reached
→ different inductive bias → different generalization properties
Need to understand the optimization algorithm not just as reaching some (global) optimum, but as reaching a specific optimum
All Functions
[Figure: Path-SGD vs. SGD on MNIST, CIFAR-10, SVHN, and CIFAR-100 (with dropout): cross-entropy training loss, 0/1 training error, and 0/1 test error vs. epoch.]
[Neyshabur Salakhutdinov S NIPS'15]
SGD vs ADAM
[Figure: training error and test error (perplexity) vs. epoch for SGD vs. Adam]
Results on Penn Treebank using 3-layer LSTM
[Wilson Roelofs Stern S Recht, โThe Marginal Value of Adaptive Gradient Methods in Machine Learningโ, NIPSโ17]
The Deep Recurrent Residual Boosting Machine
Joe Flow, DeepFace Labs
Section 1: Introduction
We suggest a new amazing architecture and loss function that is great for learning. All you have to do to learn is fit the model on your training data.
Section 2: Learning
Contribution: our model. The model class $\mathcal{H}$ is amazing. Our learning method is:
$\arg\min_w \ \frac{1}{m} \sum_{i=1}^{m} \mathrm{loss}\bigl(h_w(x_i);\, y_i\bigr)$    (*)
Section 3: Optimization
This is how we solve the optimization problem (*): […]
Section 4: Experiments
It works!
$\min_{X \in \mathbb{R}^{n \times n}} \bigl\| \mathrm{observed}(X) - y \bigr\|_2^2 \;\equiv\; \min_{U, V \in \mathbb{R}^{n \times n}} \bigl\| \mathrm{observed}(U V^\top) - y \bigr\|_2^2$
• Underdetermined, nonsensical problem; lots of useless global minima
• Since $U, V$ are full-dimensional, there is no constraint on $X$: all the same nonsense global minima
[Illustration: a partially observed matrix $Y \approx U V^\top$, with unobserved entries marked "?"]
What happens when we optimize by gradient descent on $U, V$?
Unconstrained Matrix Completion
[Gunasekar Woodworth Bhojanapalli Neyshabur S 2017]
Gradient descent on $f(U, V)$ gets to "good" global minima [Gunasekar Woodworth Bhojanapalli Neyshabur S 2017]
Gradient descent on $f(U, V)$ generalizes better with smaller step size [Gunasekar Woodworth Bhojanapalli Neyshabur S 2017]
Gradient descent on $U, V$ with infinitesimally small stepsize and initialization → min nuclear norm solution:
$\arg\min \|X\|_* \ \text{s.t.}\ \mathrm{observed}(X) = y$   (exact and rigorous only under additional conditions!)
[Gunasekar Woodworth Bhojanapalli Neyshabur S 2017] [Yuanzhi Li, Hongyang Zhang, Tengyu Ma 2018] [Sanjeev Arora, Nadav Cohen, Wei Hu, Yuping Luo 2019]
⇒ good generalization if $Y$ is (approximately) low rank
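A minimal NumPy sketch of this kind of experiment (a small synthetic low-rank matrix and plain gradient descent on full-dimensional factors $U, V$ from small random initialization; all sizes, the step size, and the iteration count are illustrative choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
n, rank, n_obs = 30, 2, 300

# Ground-truth low-rank matrix and a random set of observed entries
Y = rng.standard_normal((n, rank)) @ rng.standard_normal((rank, n))
mask = np.zeros((n, n), dtype=bool)
mask.flat[rng.choice(n * n, size=n_obs, replace=False)] = True

# Unconstrained factorization X = U V^T with full-dimensional factors and small init
U = 1e-3 * rng.standard_normal((n, n))
V = 1e-3 * rng.standard_normal((n, n))
lr = 0.01
for _ in range(30_000):
    R = mask * (U @ V.T - Y)                      # residual on observed entries only
    U, V = U - lr * (R @ V), V - lr * (R.T @ U)   # gradient step on both factors

X = U @ V.T
print("observed RMSE:  ", np.sqrt(np.mean((X - Y)[mask] ** 2)))
print("unobserved RMSE:", np.sqrt(np.mean((X - Y)[~mask] ** 2)))
print("nuclear norm:   ", np.linalg.norm(X, "nuc"))
```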
[Diagram: predictor space ($X$) vs. parameter space ($U, V$)]
Optimization geometry, and hence inductive bias, is affected by:
• Geometry of local search in parameter space
• Choice of parameterization
• Matrix completion (also: reconstruction from linear measurements)
  • $X = U V^\top$ is an over-parametrization of all matrices $X \in \mathbb{R}^{n \times n}$
  • GD on $U, V$ ⇒ implicitly minimize $\|X\|_*$
• Linear convolutional network:
  • Complex over-parametrization of the linear predictors $\beta$
  • GD on the weights ⇒ implicitly minimize $\|\mathrm{DFT}(\beta)\|_p$ for $p = \tfrac{2}{\mathrm{depth}}$ (sparsity in the frequency domain) [Gunasekar Lee Soudry S 2018] (see the small check after this slide)
  $\min \|\beta\|_2 \ \text{s.t.}\ \forall i\ y_i \langle \beta, x_i \rangle \ge 1$
  $\min \|\mathrm{DFT}(\beta)\|_{2/L} \ \text{s.t.}\ \forall i\ y_i \langle \beta, x_i \rangle \ge 1$
  [Figure: resulting predictors for depth 2 and depth 5]
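Why the frequency domain shows up: composing full-width circular convolution layers gives another circular convolution, so in the frequency domain the layers' DFTs multiply elementwise. A small NumPy check of that identity (this only illustrates the convolution theorem behind the parametrization, under the assumption of full-width circular convolutions; it is not the implicit-bias result itself):

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 8, 3
filters = [rng.standard_normal(d) for _ in range(depth)]

# End-to-end linear predictor of a stack of circular convolutions:
# the circular convolution of the individual filters.
beta = filters[0]
for w in filters[1:]:
    beta = np.real(np.fft.ifft(np.fft.fft(beta) * np.fft.fft(w)))

# Convolution theorem: DFT(beta) is the elementwise product of the layers' DFTs,
# so sparsity of the end-to-end predictor is naturally measured in frequency space.
prod_of_dfts = np.prod([np.fft.fft(w) for w in filters], axis=0)
print(np.allclose(np.fft.fft(beta), prod_of_dfts))  # True
```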
• Matrix completion (also: reconstruction from linear measurements)
  • $X = U V^\top$ is an over-parametrization of all matrices $X \in \mathbb{R}^{n \times n}$
  • GD on $U, V$ ⇒ implicitly minimize $\|X\|_*$
• Linear convolutional network:
  • Complex over-parametrization of the linear predictors $\beta$
  • GD on the weights ⇒ implicitly minimize $\|\mathrm{DFT}(\beta)\|_p$ for $p = \tfrac{2}{\mathrm{depth}}$ (sparsity in the frequency domain) [Gunasekar Lee Soudry S 2018]
• Infinite-width ReLU net:
  • Parametrization of essentially all functions $h : \mathbb{R}^d \to \mathbb{R}$
  • Weight decay ⇒ for $d = 1$, implicitly minimize $\max\left( \int |h''(x)|\,dx,\; |h'(-\infty) + h'(+\infty)| \right)$ [Savarese Evron Soudry S 2019]
  • For $d > 1$, implicitly minimize $\int \bigl| \partial_b^{\,d+1} \mathcal{R}\{h\}(w, b) \bigr|\, db\, dw$ (with $\mathcal{R}$ the Radon transform; roughly speaking, this needs to be defined more carefully to handle non-smoothness, with an extra correction term for the linear part) [Ongie Willett Soudry S 2019]
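A rough sketch of where the $\int |h''|$ penalty comes from in the one-dimensional case (one hidden layer, $h''$ taken in the distributional sense, and ignoring the boundary/linear-part term; intuition only, not the full statement):

```latex
h(x) = \sum_i a_i\,[\,w_i x + b_i\,]_+ ,
\qquad
\text{weight decay} = \tfrac{1}{2}\sum_i \bigl(a_i^2 + w_i^2\bigr).

% ReLU is positively homogeneous, so each unit can be rescaled
% (a_i, w_i, b_i) \mapsto (a_i/c,\; c\,w_i,\; c\,b_i) without changing h;
% balancing so that |a_i| = |w_i| gives
\min_{\text{rescalings}} \ \tfrac{1}{2}\sum_i \bigl(a_i^2 + w_i^2\bigr) \;=\; \sum_i |a_i|\,|w_i| .

% Each unit contributes a point mass of weight a_i |w_i| at its kink x = -b_i/w_i:
h''(x) = \sum_i a_i\,|w_i|\;\delta\!\Bigl(x + \tfrac{b_i}{w_i}\Bigr),
\qquad
\int |h''(x)|\,dx \;\le\; \sum_i |a_i|\,|w_i| ,

% with equality when the kinks do not cancel, so minimizing weight decay over all
% networks representing the same h drives the parameter cost down to \int |h''|.
```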
[Diagram: the space of all functions ($h$) vs. parameter space ($w$)]
Optimization geometry, and hence inductive bias, is affected by:
• Geometry of local search in parameter space
• Choice of parameterization
Doesnโt it all boil down to the NTK?
Is it all just a Kernel?
$f(w, x) \approx f(w_0, x) + \langle w - w_0,\ \nabla_w f(w_0, x) \rangle$
corresponding to a kernelized linear model with kernel $K_w(x, x') = \langle \nabla_w f(w, x),\ \nabla_w f(w, x') \rangle$
Kernel regime: the 1st-order approximation about $w_0$ remains valid throughout optimization, $K_{w(t)} \approx K_{w(0)} = K_0$
⇒ GD on the squared loss converges to $\arg\min \|f\|_{K_0} \ \text{s.t.}\ f(x_i) = y_i$
$\phi_0(x) = \nabla_w f(w_0, x)$; focus on "unbiased initialization": $f(w_0, x) = 0$
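A minimal NumPy sketch of the tangent kernel at initialization for a small two-layer ReLU net, with the parameter gradients written out by hand (the architecture and sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 200  # input dimension, hidden width

# Two-layer net f(w, x) = a^T relu(W x), with parameters w = (W, a)
W = rng.standard_normal((m, d)) / np.sqrt(d)
a = rng.standard_normal(m) / np.sqrt(m)

def grad_f(x):
    """Gradient of f(w, x) with respect to all parameters, stacked into one vector."""
    pre = W @ x
    act = np.maximum(pre, 0.0)
    dW = np.outer(a * (pre > 0), x)   # d f / d W
    da = act                          # d f / d a
    return np.concatenate([dW.ravel(), da])

def K0(x, xp):
    """Tangent kernel at initialization: inner product of parameter gradients."""
    return grad_f(x) @ grad_f(xp)

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
print(K0(x1, x2), K0(x1, x1))
```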
Kernel Regime and Scale of Init
• For a $D$-homogeneous model, $f(c \cdot w, x) = c^D f(w, x)$, consider gradient flow
  $\dot{w}_\alpha = -\nabla L_S(w)$ with $w_\alpha(0) = \alpha w_0$ and unbiased $f(w_0, x) = 0$.
  We are interested in $w_\alpha(\infty) = \lim_{t \to \infty} w_\alpha(t)$.
• For the squared loss, under some conditions [Chizat and Bach 18]:
  $\lim_{\alpha \to \infty} \ \sup_t \ \bigl\| w_\alpha\bigl(t / \alpha^{D-1}\bigr) - w_K(t) \bigr\| = 0$
  (where $w_K(t)$ is the gradient flow of linear least squares w.r.t. the tangent kernel $K_0$ at initialization)
  and so $f(w_\alpha(\infty), \cdot) \xrightarrow[\alpha \to \infty]{} f_K(\cdot)$, where $f_K = \arg\min \|f\|_K \ \text{s.t.}\ f(x_i) = y_i$
• But: when $\alpha \to 0$, we get interesting, non-RKHS inductive biases (e.g. nuclear norm, sparsity)
Scale of Init: Kernel vs Rich
Consider linear regression with a squared parametrization:
$f(w, x) = \sum_i \bigl( w_+[i]^2 - w_-[i]^2 \bigr)\, x[i] = \langle \beta(w), x \rangle$ with $\beta(w) = w_+^2 - w_-^2$,
and unbiased initialization $w_\alpha(0) = \alpha \mathbf{1}$ (so that $\beta(w_\alpha(0)) = 0$).
[Diagram: a diagonal linear network over inputs $x_1, x_2, x_3, x_4$]
Equivalently (a special case of matrix factorization with diagonal, commutative measurements):
$f\bigl((u, v), x\bigr) = \Bigl\langle u v^\top,\ \begin{bmatrix} \mathrm{diag}(x) & 0 \\ 0 & -\mathrm{diag}(x) \end{bmatrix} \Bigr\rangle, \qquad u(0) = v(0) = \alpha \mathbf{1}$
Scale of Init: Kernel vs Rich (same setup)
What's the implicit bias of gradient flow w.r.t. the square loss $L_S(w) = \sum_i \bigl( f(w, x_i) - y_i \bigr)^2$?
$\beta_\alpha(\infty) = \lim_{t \to \infty} \beta(w_\alpha(t))$
• Kernel regime $\alpha \to \infty$: $K_0(x, x') = 4 \langle x, x' \rangle$, and so $\beta_\alpha(\infty) \xrightarrow[\alpha \to \infty]{} \hat{\beta}_{L_2} = \arg\min_{X\beta = y} \|\beta\|_2$
• Rich regime $\alpha \to 0$: a special case of matrix factorization with commutative measurements, and $\beta_\alpha(\infty) \xrightarrow[\alpha \to 0]{} \hat{\beta}_{L_1} = \arg\min_{X\beta = y} \|\beta\|_1$
• For any $\alpha$: $\beta_\alpha(\infty) = {?}{?}{?}$
$\beta(t) = w_+(t)^2 - w_-(t)^2, \qquad L = \|X\beta - y\|_2^2, \qquad r(t) = X\beta(t) - y$

$\dot{w}_+(t) = -\nabla_{w_+} L = -2 X^\top r(t) \odot 2 w_+(t)$
$\dot{w}_-(t) = -\nabla_{w_-} L = +2 X^\top r(t) \odot 2 w_-(t)$

$w_+(t) = w_+(0) \odot \exp\!\Bigl( -4 X^\top \!\int_0^t r(s)\, ds \Bigr), \qquad w_-(t) = w_-(0) \odot \exp\!\Bigl( +4 X^\top \!\int_0^t r(s)\, ds \Bigr)$

$\beta(t) = \alpha^2 \Bigl( e^{-8 X^\top \!\int_0^t r(s)\, ds} - e^{+8 X^\top \!\int_0^t r(s)\, ds} \Bigr)$

$\beta(\infty) = \alpha^2 \bigl( e^{X^\top \nu} - e^{-X^\top \nu} \bigr) = 2 \alpha^2 \sinh\bigl( X^\top \nu \bigr), \qquad X\beta(\infty) = y, \qquad \text{with } \nu := -8 \int_0^\infty r(s)\, ds \in \mathbb{R}^n$

Compare with the KKT conditions of $\min Q(\beta) \ \text{s.t.}\ X\beta = y$:
$\nabla Q(\beta^*) = X^\top \nu, \qquad X\beta^* = y$

Taking $\nabla Q(\beta) = \sinh^{-1}\!\dfrac{\beta}{2\alpha^2}$ (elementwise), the two sets of conditions match: $\sinh^{-1}\!\dfrac{\beta(\infty)}{2\alpha^2} = X^\top \nu$ and $X\beta(\infty) = y$. Integrating,

$Q(\beta) = \sum_i \int_0^{\beta[i]} \sinh^{-1}\!\frac{z}{2\alpha^2}\, dz = \sum_i \alpha^2 \left[ \frac{\beta[i]}{\alpha^2} \sinh^{-1}\!\frac{\beta[i]}{2\alpha^2} - \sqrt{4 + \Bigl(\frac{\beta[i]}{\alpha^2}\Bigr)^{\!2}} + 2 \right]$
Scale of Init: Kernel vs Rich (summary)
Linear regression with the squared parametrization $f(w, x) = \langle \beta(w), x \rangle$, $\beta(w) = w_+^2 - w_-^2$, unbiased initialization $w_\alpha(0) = \alpha \mathbf{1}$, and $\beta_\alpha(\infty) = \lim_{t \to \infty} \beta(w_\alpha(t))$:
• Kernel regime $\alpha \to \infty$: $\beta_\alpha(\infty) \to \hat{\beta}_{L_2} = \arg\min_{X\beta = y} \|\beta\|_2$
• Rich regime $\alpha \to 0$: $\beta_\alpha(\infty) \to \hat{\beta}_{L_1} = \arg\min_{X\beta = y} \|\beta\|_1$
• For any $\alpha$:
  $\beta_\alpha(\infty) = \arg\min_{X\beta = y} Q_\alpha(\beta), \qquad Q_\alpha(\beta) = \alpha^2 \sum_i q\!\left( \frac{\beta[i]}{\alpha^2} \right), \qquad q(z) = 2 - \sqrt{4 + z^2} + z \sinh^{-1}\!\frac{z}{2}$
Induced dynamics: $\dot{\beta}_\alpha \propto -\sqrt{\beta_\alpha^2 + 4\alpha^4} \odot \nabla L_S(\beta_\alpha)$ (elementwise)
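A quick check of how the single penalty $Q_\alpha$ interpolates between the two regimes (a sketch using standard expansions of $q$):

```latex
q(z) = 2 - \sqrt{4 + z^2} + z\,\sinh^{-1}\!\tfrac{z}{2}

% Small |z| (kernel regime, \alpha \to \infty, so \beta[i]/\alpha^2 \to 0):
%   \sqrt{4 + z^2} \approx 2 + z^2/4 and \sinh^{-1}(z/2) \approx z/2, hence
q(z) \approx \tfrac{z^2}{4}
\;\Longrightarrow\;
Q_\alpha(\beta) = \alpha^2 \sum_i q\!\left(\tfrac{\beta[i]}{\alpha^2}\right) \approx \tfrac{1}{4\alpha^2}\,\|\beta\|_2^2 .

% Large |z| (rich regime, \alpha \to 0, so |\beta[i]|/\alpha^2 \to \infty):
%   \sqrt{4 + z^2} \approx |z| and \sinh^{-1}(z/2) \approx \mathrm{sign}(z)\,\ln |z|, hence
q(z) \approx |z|\bigl(\ln |z| - 1\bigr)
\;\Longrightarrow\;
Q_\alpha(\beta) \approx \ln\!\tfrac{1}{\alpha^2}\,\|\beta\|_1 \;+\; \sum_i |\beta[i]|\bigl(\ln |\beta[i]| - 1\bigr).

% So the constrained minimizer approaches the minimum-l2 solution as \alpha \to \infty
% and the minimum-l1 solution as \alpha \to 0.
```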
Sparse Learning
$y_i = \langle \beta^*, x_i \rangle + \mathcal{N}(0, 0.01)$, with $d = 1000$, $\|\beta^*\|_0 = 5$, $n = 100$
Same setup with varying sparsity $\|\beta^*\|_0 = k$: how small does $\alpha$ need to be to get $L(\beta_\alpha(\infty)) < 0.025$?
[Figure: required $\alpha$ as a function of $k$]
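A small NumPy sketch in the spirit of this experiment (much smaller dimensions, a finite step size instead of gradient flow, and illustrative hyperparameters; not the settings behind the figure):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 100, 40, 3
X = rng.standard_normal((n, d))
beta_star = np.zeros(d)
beta_star[:k] = 1.0
y = X @ beta_star + 0.01 * rng.standard_normal(n)

def train_diag_net(alpha, lr=2e-4, steps=200_000):
    """Gradient descent on the w_+^2 - w_-^2 parametrization from initialization alpha * ones."""
    w_plus = alpha * np.ones(d)
    w_minus = alpha * np.ones(d)
    for _ in range(steps):
        beta = w_plus**2 - w_minus**2
        g = X.T @ (X @ beta - y)        # half the gradient of ||X beta - y||^2 w.r.t. beta
        w_plus -= lr * 4 * g * w_plus   # chain rule through beta = w_+^2 - w_-^2
        w_minus += lr * 4 * g * w_minus
    return w_plus**2 - w_minus**2

for alpha in [1.0, 0.01]:
    beta = train_diag_net(alpha)
    print(f"alpha={alpha}: train MSE={np.mean((X @ beta - y) ** 2):.4f}, "
          f"||beta - beta*||={np.linalg.norm(beta - beta_star):.3f}, "
          f"||beta||_1={np.abs(beta).sum():.2f}")
```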
More controlling parameters
โข Depth
โข Width
โข Optimization accuracy
โข Stepsize, batchsize, ??
Depth
$\beta(t) = w_+(t)^D - w_-(t)^D$
[Diagram: a depth-$D$ diagonal network over inputs $x_1, \dots, x_8$]
Depth
$\beta(t) = w_+(t)^D - w_-(t)^D, \qquad \beta(\infty) = \arg\min Q_D\!\bigl(\beta / \alpha^D\bigr) \ \text{s.t.}\ X\beta = y$
where $Q_D(\beta) = \sum_i q_D(\beta[i])$, $q_D = \int h_D^{-1}$, and
$h_D(z) = \alpha^D \left[ \bigl( 1 + \alpha^{D-2} D (D - 2)\, z \bigr)^{-\frac{1}{D-2}} - \bigl( 1 - \alpha^{D-2} D (D - 2)\, z \bigr)^{-\frac{1}{D-2}} \right]$
Depth
$\beta(t) = w_+(t)^D - w_-(t)^D, \qquad \beta(\infty) = \arg\min Q_D\!\bigl(\beta / \alpha^D\bigr) \ \text{s.t.}\ X\beta = y$
• For all depths $D \ge 2$: $\beta(\infty) \xrightarrow[\alpha \to 0]{} \arg\min_{X\beta = y} \|\beta\|_1$
• Contrast with explicit regularization: for $R_\alpha(\beta) = \min_{\beta = w_+^D - w_-^D} \|w - \alpha \mathbf{1}\|_2^2$, we get $R_\alpha(\beta) \xrightarrow[\alpha \to 0]{} \|\beta\|_{2/D}$ (also observed by [Arora Cohen Hu Luo 2019])
• Also with the logistic loss: $\beta_\alpha(\infty) \propto \hat{\beta}_{2/D}$ (in direction) [Gunasekar Lee Soudry Srebro 2018]
• With the squared loss we always get $\to \ell_1$, but for deeper $D$ we get there quicker
Depth: $\beta(t) = w_+(t)^D - w_-(t)^D$, $\beta(\infty) = \arg\min Q_D(\beta / \alpha^D)$ s.t. $X\beta = y$
[Figure: $q_D(z)$ for different depths $D$, illustrated on the sparse vector $(1, 0, \dots, 0)$ vs. the dense vector $(\tfrac{1}{d}, \tfrac{1}{d}, \dots, \tfrac{1}{d})$]
Sparse Learning with Depth
$y_i = \langle \beta^*, x_i \rangle + \mathcal{N}(0, 0.01)$, with $d = 1000$, $\|\beta^*\|_0 = k$
How small does $\alpha$ need to be to get $L(\beta_\alpha(\infty)) < 0.025$?
[Figure: required $\alpha^D$ as a function of $k$, for several depths]
Deep Learning
โข Expressive Power
โข We are searching over the space of all functionsโฆ
โฆ but with what inductive bias?
โข How does this bias look in function space?
โข Is it reasonable/sensible?
โข Capacity / Generalization ability / Sample Complexity
โข Whatโs the true complexity measure (inductive bias)?
โข How does it control generalization?
โข Computation / Optimization
โข How and where does optimization bias us? Under what conditions?
โข Magic property of reality under which deep learning โworksโ