
Transcript
Page 1

Inductive Bias and Optimization in Deep Learning

Nati Srebro (TTIC)

Based on work with Behnam Neyshabur (TTIC→Google), Suriya Gunasekar (TTIC→MSR), Ryota Tomioka (TTIC→MSR), Srinadh Bhojanapalli (TTIC→Google), Blake Woodworth, Pedro Savarese, David McAllester (TTIC), Greg Ongie, Becca Willett (Chicago), Daniel Soudry, Elad Hoffer, Mor Shpigel, Itay Sofer (Technion), Ashia Wilson, Becca Roelofs, Mitchell Stern, Ben Recht (Berkeley), Russ Salakhutdinov (CMU), Jason Lee, Zhiyuan Li (Princeton), Yann LeCun (NYU/Facebook)

Page 2

Feed Forward Neural Networks
• Fix architecture (connection graph $G(V, E)$, transfer $\sigma$)

$\mathcal{H}_{G(V,E),\sigma} = \left\{\, f_w(x) = \text{output of the net with weights } w \,\right\}$

• Capacity / Generalization ability / Sample Complexity

• $\tilde{O}(|E|)$ (number of edges, i.e. number of weights) (with threshold $\sigma$, or with ReLU and finite precision; ReLU with infinite precision: $\tilde{\Theta}(|E| \cdot \mathrm{depth})$)

• Expressive Power / Approximation

• Any continuous function with a huge network

• Lots of interesting things naturally with small networks

• Any time-$T$ computable function with a network of size $\tilde{O}(T)$

• Computation / Optimization

• NP-hard to find weights even with 2 hidden units

• Even if the function is exactly representable with a single hidden layer with $\Theta(\log d)$ units, even with no noise, and even if we allow a much larger network when learning: no poly-time algorithm always works [Kearns Valiant 94; Klivans Sherstov 06; Daniely Linial Shalev-Shwartz '14]

• Magic property of reality that makes local search "work"

Page 3

Page 4

Different optimization algorithm ➔ Different bias in optimum reached

➔ Different inductive bias ➔ Different generalization properties

Need to understand optimization alg. not just as reaching some (global) optimum, but as reaching a specific optimum

All Functions

Page 5

[Figure: cross-entropy training loss, 0/1 training error, and 0/1 test error vs. epoch, Path-SGD vs. SGD, on MNIST, CIFAR-10, SVHN, and CIFAR-10 with dropout]

[Neyshabur Salakhutdinov S NIPS'15]

Page 6

SGD vs ADAM

[Plots: training error and test error (perplexity)]

Results on Penn Treebank using 3-layer LSTM

[Wilson Roelofs Stern S Recht, "The Marginal Value of Adaptive Gradient Methods in Machine Learning", NIPS'17]

Page 7

The Deep Recurrent Residual Boosting Machine
Joe Flow, DeepFace Labs

Section 1: Introduction
We suggest a new amazing architecture and loss function that is great for learning. All you have to do to learn is fit the model on your training data.

Section 2: Learning
Contribution: our model. The model class $h_w$ is amazing. Our learning method is:

$\arg\min_{w}\ \tfrac{1}{m}\sum_{i=1}^{m} \mathrm{loss}(h_w(x_i);\ y_i)$   (*)

Section 3: Optimization
This is how we solve the optimization problem (*): […]

Section 4: Experiments
It works!

Page 8

$\min_{X \in \mathbb{R}^{n\times n}} \left\| \mathrm{observed}(X) - y \right\|_2^2 \ \equiv\ \min_{U, V \in \mathbb{R}^{n\times n}} \left\| \mathrm{observed}(UV^\top) - y \right\|_2^2$

• Underdetermined, nonsensical problem: lots of useless global minima

• Since $U, V$ are full-dimensional, there is no constraint on $X$; all the same nonsensical global minima

[Illustration: a partially observed matrix $y \approx X = U \times V^\top$]

What happens when we optimize by gradient descent on $U, V$?

Unconstrained Matrix Completion

[Gunasekar Woodworth Bhojanapalli Neyshabur S 2017]

Page 9

Gradient descent on $f(U, V)$ gets to "good" global minima

[Gunasekar Woodworth Bhojanapalli Neyshabur S 2017]

Page 10

Gradient descent on $f(U, V)$ generalizes better with smaller step size

Gradient descent on $f(U, V)$ gets to "good" global minima

[Gunasekar Woodworth Bhojanapalli Neyshabur S 2017]

Page 11

[Gunasekar Woodworth Bhojanapalli Neyshabur S 2017]

[Yuanzhi Li, Hongyang Zhang, Tengyu Ma 2018] [Sanjeev Arora, Nadav Cohen, Wei Hu, Yuping Luo 2019]

Gradient descent on $U, V$ with infinitesimally small stepsize and initialization → the minimum nuclear norm solution:
$\arg\min \|X\|_*$ s.t. $\mathrm{obs}(X) = y$   (exact and rigorous only under additional conditions!)

→ good generalization if $Y$ is (approximately) low rank
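As a concrete illustration, here is a minimal numpy sketch (not the talk's code; the problem size, initialization scale, and step size are arbitrary choices) that runs gradient descent on the factorized objective with small initialization and compares the solution it reaches to the low-rank ground truth:

```python
import numpy as np

# Illustrative sketch (not the talk's code): gradient descent on the factorized objective
# ||observed(U V^T) - y||^2 with small initialization, on an underdetermined matrix-completion
# problem.
rng = np.random.default_rng(0)
n, rank, frac_obs = 30, 2, 0.3

X_star = rng.standard_normal((n, rank)) @ rng.standard_normal((rank, n))  # low-rank ground truth
mask = rng.random((n, n)) < frac_obs                                      # observed entries
y = X_star * mask                                                         # observed values (zeros elsewhere)

U = 1e-3 * rng.standard_normal((n, n))   # full-dimensional factors: no explicit rank constraint
V = 1e-3 * rng.standard_normal((n, n))
lr = 0.01
for _ in range(20_000):
    R = (U @ V.T) * mask - y             # residual on the observed entries only
    U, V = U - lr * (R @ V), V - lr * (R.T @ U)

X_gd = U @ V.T
nuc = lambda A: np.linalg.norm(A, "nuc")
print("error on observed entries  :", np.linalg.norm((X_gd - X_star) * mask))
print("error on unobserved entries:", np.linalg.norm((X_gd - X_star) * ~mask))
print("nuclear norms  X_gd, X_star:", nuc(X_gd), nuc(X_star))
```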

Page 12

[Diagram: parameter space ($U, V$) mapped to predictor space ($W$)]

Optimization Geometry and hence Inductive Bias affected by:

• Geometry of local search in parameter space

• Choice of parameterization

Page 13

• Matrix completion (also: reconstruction from linear measurements)

• $W = UV$ is an over-parametrization of all matrices $W \in \mathbb{R}^{n\times m}$

• GD on $U, V$ ➔ implicitly minimizes $\|W\|_*$

• Linear Convolutional Network:

• Complex over-parametrization of linear predictors $\beta$

• GD on the weights ➔ implicitly minimizes $\|\mathrm{DFT}(\beta)\|_p$ for $p = 2/\mathrm{depth}$ (sparsity in the frequency domain) [Gunasekar Lee Soudry S 2018]

$\min \|\beta\|_2$  s.t.  $\forall i\ y_i \langle \beta, x_i \rangle \ge 1$

$\min \|\mathrm{DFT}(\beta)\|_{2/L}$  s.t.  $\forall i\ y_i \langle \beta, x_i \rangle \ge 1$

[Plots of the resulting predictors for depth 2 and depth 5]
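A small sketch of the quantities involved (illustrative only; the network construction and sizes are assumptions, not the talk's code): it composes circular-convolution layers into an end-to-end linear predictor and evaluates the penalty $\|\mathrm{DFT}(\beta)\|_{2/L}$ that the result above says gradient descent implicitly minimizes; predictors that are sparse in the frequency domain incur a much smaller penalty:

```python
import numpy as np

# Illustrative sketch (assumed construction, not the talk's code). A deep *linear* circular-conv
# network computes x -> <beta, x>, where (up to convolution/correlation and last-layer conventions)
# beta is the circular convolution of the layers' weight vectors. The result above says GD on the
# weights implicitly minimizes ||DFT(beta)||_{2/L}, which strongly favors predictors that are
# sparse in the frequency domain.
rng = np.random.default_rng(0)
d, L = 16, 3

def circ_conv(a, b):
    # circular convolution via the convolution theorem
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

filters = [rng.standard_normal(d) for _ in range(L)]
beta_net = filters[0]
for w in filters[1:]:
    beta_net = circ_conv(beta_net, w)     # end-to-end linear predictor of the deep linear conv net

def dft_penalty(beta, p):
    c = np.abs(np.fft.fft(beta))
    return np.sum(c ** p) ** (1.0 / p)

p = 2.0 / L
freq_sparse = np.cos(2 * np.pi * np.arange(d) / d)   # supported on a single frequency (+ its conjugate)
generic = rng.standard_normal(d)
for name, b in [("conv net, random filters", beta_net),
                ("frequency-sparse", freq_sparse),
                ("generic dense", generic)]:
    b = b / np.linalg.norm(b)                        # compare at equal L2 norm
    print(f"{name:25s}  ||DFT(beta)||_(2/L) = {dft_penalty(b, p):8.2f}")
```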

Page 14

• Matrix completion (also: reconstruction from linear measurements)

• $W = UV$ is an over-parametrization of all matrices $W \in \mathbb{R}^{n\times m}$

• GD on $U, V$ ➔ implicitly minimizes $\|W\|_*$

• Linear Convolutional Network:

• Complex over-parametrization of linear predictors $\beta$

• GD on the weights ➔ implicitly minimizes $\|\mathrm{DFT}(\beta)\|_p$ for $p = 2/\mathrm{depth}$ (sparsity in the frequency domain) [Gunasekar Lee Soudry S 2018]

• Infinite Width ReLU Net:

• Parametrization of essentially all functions $h : \mathbb{R}^d \to \mathbb{R}$

• Weight decay ➔ for $d = 1$, implicitly minimize $\max\left(\int |h''|\, dx,\ \left|h'(-\infty) + h'(+\infty)\right|\right)$ (see the numerical check below)

[Savarese Evron Soudry S 2019]

• For $d > 1$, implicitly minimize $\int \left|\partial_b^{\,d+1}\, \mathrm{Radon}\{h\}\right|$

(roughly speaking; need to define more carefully to handle non-smoothness, extra correction term for linear part) [Ongie Willett Soudry S 2019]
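A quick numerical check of the $d = 1$ quantity above (an illustrative toy setup, not the talk's code): for a one-hidden-layer ReLU net with distinct kinks, $\int |h''|\,dx$ equals $\sum_i |a_i w_i|$, which is the weight-decay cost $\tfrac12\sum_i (a_i^2 + w_i^2)$ of that representation once each unit is balanced so that $|a_i| = |w_i|$:

```python
import numpy as np

# Illustrative numerical check (assumed toy setup, not the talk's code) of the d = 1 statement above:
# for h(x) = sum_i a_i relu(w_i x + b_i) with distinct kinks, the total variation of h'
# (i.e. integral |h''(x)| dx) equals sum_i |a_i w_i|, which is the weight-decay cost
# 0.5 * sum_i (a_i^2 + w_i^2) of this representation after balancing each unit so |a_i| = |w_i|.
rng = np.random.default_rng(0)
k = 10
kinks = rng.uniform(-5, 5, k)                     # place each unit's kink at a known location
w = rng.choice([-1.0, 1.0], k) * rng.uniform(0.5, 2.0, k)
b = -w * kinks
a = rng.standard_normal(k)

def h(x):
    # x: (n,) grid of inputs; returns the network outputs, shape (n,)
    return np.maximum(np.outer(x, w) + b, 0.0) @ a

grid = np.linspace(-10.0, 10.0, 200_001)          # fine grid covering all the kinks
slopes = np.diff(h(grid)) / np.diff(grid)         # h' is piecewise constant between kinks
tv_h_prime = np.sum(np.abs(np.diff(slopes)))      # numerical estimate of  integral |h''| dx

print("integral |h''| dx (numerical):", tv_h_prime)
print("sum_i |a_i w_i|              :", np.sum(np.abs(a * w)))
```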

Page 15

[Diagram: parameter space ($w$) mapped to the space of all functions ($h$)]

Optimization Geometry and hence Inductive Bias affected by:

• Geometry of local search in parameter space

• Choice of parameterization

Page 16

Doesn't it all boil down to the NTK?

Page 17

Is it all just a Kernel?

$f(w, x) \approx f(w(0), x) + \langle w - w(0),\ \phi_{w(0)}(x) \rangle$

Corresponding to a kernelized linear model with kernel: $K_w(x, x') = \langle \nabla_w f(w, x),\ \nabla_w f(w, x') \rangle$

Kernel regime: the 1st-order approximation about $w(0)$ remains valid throughout optimization:
$K_{w(t)} \approx K_{w(0)} = K_0$

➔ GD on squared loss converges to $\arg\min \|h\|_{K_0}$ s.t. $h(x_i) = y_i$

$\phi_{w}(x) = \nabla_w f(w, x)$;  focus on "unbiased initialization": $f(w(0), x) = 0$
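To make the linearized picture concrete, here is a minimal numpy sketch (assumed one-hidden-layer ReLU architecture and scaling; not the talk's code): it forms the gradient features $\phi_{w(0)}(x)$ at an unbiased initialization, builds the tangent-kernel Gram matrix $K_0$, and solves the minimum-RKHS-norm interpolation problem that GD on the squared loss converges to in the kernel regime:

```python
import numpy as np

# Minimal sketch (assumed one-hidden-layer ReLU architecture and scaling; not the talk's code):
# form the gradient features phi_{w(0)}(x) at an "unbiased" initialization (f(w(0), x) = 0),
# build the tangent-kernel Gram matrix K_0, and solve the min-RKHS-norm interpolation problem.
rng = np.random.default_rng(0)
d, m, n = 5, 2000, 20

# Unbiased init: duplicate each hidden unit with opposite output weight, so f(w(0), x) = 0.
U = rng.standard_normal((m // 2, d))
U = np.vstack([U, U])
v = np.concatenate([np.ones(m // 2), -np.ones(m // 2)]) / np.sqrt(m)

def features(X):
    # phi_{w(0)}(x) = grad_w f(w(0), x) for f(w, x) = sum_j v_j relu(u_j . x), w = (U, v)
    pre = X @ U.T                                   # pre-activations u_j . x          (n, m)
    act = np.maximum(pre, 0.0)                      # d f / d v_j = relu(u_j . x)       (n, m)
    gate = (pre > 0).astype(float)                  # relu'(u_j . x)
    dU = (v * gate)[:, :, None] * X[:, None, :]     # d f / d u_j = v_j relu'(.) x      (n, m, d)
    return np.concatenate([act, dU.reshape(len(X), -1)], axis=1)

X = rng.standard_normal((n, d))
y = np.sin(3 * X[:, 0])                             # arbitrary target values for illustration
Phi = features(X)
K0 = Phi @ Phi.T                                    # K_0(x_i, x_j) = <phi(x_i), phi(x_j)>

coef = np.linalg.solve(K0, y)                       # kernel interpolation: h(x_i) = y_i
X_test = rng.standard_normal((5, d))
preds = features(X_test) @ Phi.T @ coef             # h_K(x) = sum_i coef_i K_0(x, x_i)
print("interpolation residual:", np.linalg.norm(K0 @ coef - y))
print("test predictions      :", np.round(preds, 3))
```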

Page 18

Kernel Regime and Scale of Init
• For a $D$-homogeneous model, $f(cw, x) = c^D f(w, x)$, consider gradient flow with

$\dot{w}_\alpha = -\nabla L_S(w_\alpha)$ and $w_\alpha(0) = \alpha w_0$, with unbiased $f(w_0, x) = 0$.

We are interested in $w_\alpha(\infty) = \lim_{t\to\infty} w_\alpha(t)$.

• For squared loss, under some conditions [Chizat and Bach 18]:

$\lim_{\alpha\to\infty}\ \sup_t\ \left\| w_\alpha\!\left(\tfrac{t}{\alpha^{D-1}}\right) - w_K(t) \right\| = 0$
(where $w_K(t)$ is the gradient flow of linear least squares w.r.t. the tangent kernel $K_0$ at initialization)

and so $f(w_\alpha(\infty), x) \xrightarrow{\alpha\to\infty} h_K(x)$, where $h_K = \arg\min \|h\|_{K_0}$ s.t. $h(x_i) = y_i$

• But: when $\alpha \to 0$, we got interesting, non-RKHS inductive bias (e.g. nuclear norm, sparsity)

Page 19

Scale of Init: Kernel vs. Rich
Consider linear regression with squared parametrization:
$f(w, x) = \sum_j \left(w_+[j]^2 - w_-[j]^2\right) x[j] = \langle \beta(w), x \rangle$ with $\beta(w) = w_+^2 - w_-^2$,
and unbiased initialization $w_\alpha(0) = \alpha\mathbf{1}$ (so that $\beta(w_\alpha(0)) = 0$).

[Diagram: network over inputs $x_1, \dots, x_4$]

Matrix-factorization view: $f\big((U, V), x\big) = \left\langle UV^\top,\ \begin{bmatrix} \mathrm{diag}(x) & 0 \\ 0 & -\mathrm{diag}(x) \end{bmatrix} \right\rangle$,  $U(0) = V(0) = \alpha I$

Page 20

Scale of Init: Kernel vs. Rich
Consider linear regression with squared parametrization:
$f(w, x) = \sum_j \left(w_+[j]^2 - w_-[j]^2\right) x[j] = \langle \beta(w), x \rangle$ with $\beta(w) = w_+^2 - w_-^2$,
and unbiased initialization $w_\alpha(0) = \alpha\mathbf{1}$ (so that $\beta(w_\alpha(0)) = 0$).

What's the implicit bias of gradient flow w.r.t. the square loss $L_S(w) = \sum_i \left(f(w, x_i) - y_i\right)^2$?
$\beta_\alpha(\infty) = \lim_{t\to\infty} \beta(w_\alpha(t))$

In the Kernel Regime $\alpha \to \infty$: $K_0(x, x') = 4\langle x, x'\rangle$, and so $\beta_\alpha(\infty) \xrightarrow{\alpha\to\infty} \hat{\beta}_{L_2} = \arg\min_{X\beta = y} \|\beta\|_2$

In the Rich Regime $\alpha \to 0$: special case of matrix factorization with commuting measurements, and $\beta_\alpha(\infty) \xrightarrow{\alpha\to 0} \hat{\beta}_{L_1} = \arg\min_{X\beta = y} \|\beta\|_1$

For any $\alpha$:  $\beta_\alpha(\infty) = ???$
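One way to see the answer empirically (an illustrative simulation, not the talk's experiment; sizes and step sizes are arbitrary): run gradient descent on $(w_+, w_-)$ from initialization $\alpha\mathbf{1}$ on a sparse regression problem and watch the interpolating solution move from the min-$L_2$ solution (large $\alpha$) toward the sparse, min-$L_1$-like solution (small $\alpha$):

```python
import numpy as np

# Illustrative simulation (not the talk's experiment; sizes and step sizes are arbitrary choices):
# gradient descent on (w_plus, w_minus) with beta = w_plus**2 - w_minus**2, initialized at alpha * 1,
# on a noiseless sparse regression problem with n < d.
rng = np.random.default_rng(0)
d, n, k = 100, 40, 3
X = rng.standard_normal((n, d))
beta_star = np.zeros(d); beta_star[:k] = 1.0
y = X @ beta_star

beta_l2 = np.linalg.pinv(X) @ y                       # min ||beta||_2  s.t.  X beta = y

def gd_solution(alpha, iters=200_000):
    lr = 2e-4 / (1 + alpha) ** 2                      # shrink the step with alpha to keep GD stable
    wp = np.full(d, alpha); wm = np.full(d, alpha)    # unbiased init: beta(0) = 0
    for _ in range(iters):
        g = X.T @ (X @ (wp**2 - wm**2) - y)           # gradient of 1/2 ||X beta - y||^2 w.r.t. beta
        wp -= lr * 2 * g * wp                         # chain rule through beta = wp^2 - wm^2
        wm += lr * 2 * g * wm
    return wp**2 - wm**2

for alpha in [10.0, 1.0, 0.01]:
    b = gd_solution(alpha)
    print(f"alpha={alpha:5}:  ||b||_1={np.linalg.norm(b, 1):5.2f}"
          f"  ||b - beta*||={np.linalg.norm(b - beta_star):.3f}"
          f"  residual={np.linalg.norm(X @ b - y):.1e}")
print(f"min-L2 soln:  ||b||_1={np.linalg.norm(beta_l2, 1):5.2f}"
      f"  ||b - beta*||={np.linalg.norm(beta_l2 - beta_star):.3f}")
```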

Page 21

$\beta(t) = w_+(t)^2 - w_-(t)^2$,   $L = \|X\beta - y\|_2^2$,   $r(t) = X\beta(t) - y$

$\dot{w}_+(t) = -\nabla_{w_+} L = -2X^\top r(t) \circ 2 w_+(t)$
$\dot{w}_-(t) = -\nabla_{w_-} L = +2X^\top r(t) \circ 2 w_-(t)$

$w_+(t) = w_+(0) \circ \exp\!\left(-2X^\top \int_0^t r(\tau)\, d\tau\right)$
$w_-(t) = w_-(0) \circ \exp\!\left(+2X^\top \int_0^t r(\tau)\, d\tau\right)$

$\beta(t) = \alpha^2\left(e^{-4X^\top \int_0^t r(\tau) d\tau} - e^{+4X^\top \int_0^t r(\tau) d\tau}\right)$

With $s = 4\int_0^\infty r(\tau)\, d\tau \in \mathbb{R}^m$:
$\beta(\infty) = \alpha^2\left(e^{-X^\top s} - e^{X^\top s}\right) = 2\alpha^2 \sinh(X^\top s)$,   $X\beta(\infty) = y$

Page 22

$\min Q(\beta)$ s.t. $X\beta = y$   has optimality conditions   $\nabla Q(\beta^*) = X^\top \nu$,  $X\beta^* = y$

Compare with what gradient flow reached:
$\beta(\infty) = \alpha^2\left(e^{-X^\top s} - e^{X^\top s}\right) = 2\alpha^2 \sinh(X^\top s)$,   $X\beta(\infty) = y$

Page 23

$\min Q(\beta)$ s.t. $X\beta = y$   has optimality conditions   $\nabla Q(\beta^*) = X^\top \nu$,  $X\beta^* = y$

Take $\nabla Q(\beta) = \sinh^{-1}\!\left(\frac{\beta}{2\alpha^2}\right)$ (elementwise); then $\beta(\infty)$ satisfies exactly these conditions:
$\sinh^{-1}\!\left(\frac{\beta(\infty)}{2\alpha^2}\right) = X^\top s$,   $X\beta(\infty) = y$

$Q(\beta) = \sum_i \int \sinh^{-1}\!\left(\frac{\beta[i]}{2\alpha^2}\right) d\beta[i] = \alpha^2 \sum_i \left[\frac{\beta[i]}{\alpha^2}\,\sinh^{-1}\!\left(\frac{\beta[i]}{2\alpha^2}\right) - \sqrt{4 + \left(\frac{\beta[i]}{\alpha^2}\right)^2}\,\right]$

Page 24

Scale of Init: Kernel vs. Rich
Consider linear regression with squared parametrization:
$f(w, x) = \sum_j \left(w_+[j]^2 - w_-[j]^2\right) x[j] = \langle \beta(w), x \rangle$ with $\beta(w) = w_+^2 - w_-^2$,
and unbiased initialization $w_\alpha(0) = \alpha\mathbf{1}$ (so that $\beta(w_\alpha(0)) = 0$).

What's the implicit bias of gradient flow w.r.t. the square loss $L_S(w) = \sum_i \left(f(w, x_i) - y_i\right)^2$?
$\beta_\alpha(\infty) = \lim_{t\to\infty} \beta(w_\alpha(t))$

In the Kernel Regime $\alpha \to \infty$: $K_0(x, x') = 4\langle x, x'\rangle$, and so $\beta_\alpha(\infty) \xrightarrow{\alpha\to\infty} \hat{\beta}_{L_2} = \arg\min_{X\beta = y} \|\beta\|_2$

In the Rich Regime $\alpha \to 0$: special case of matrix factorization with commuting measurements, and $\beta_\alpha(\infty) \xrightarrow{\alpha\to 0} \hat{\beta}_{L_1} = \arg\min_{X\beta = y} \|\beta\|_1$

For any $\alpha$:  $\beta_\alpha(\infty) = \arg\min_{X\beta = y} Q_\alpha(\beta)$,  where $Q_\alpha(\beta) = \sum_j q\!\left(\frac{\beta[j]}{\alpha^2}\right)$  and  $q(b) = 2 - \sqrt{4 + b^2} + b\,\sinh^{-1}\!\left(\frac{b}{2}\right)$

Page 25

$\beta_\alpha(\infty) = \arg\min_{X\beta = y} Q_\alpha(\beta)$,  where $Q_\alpha(\beta) = \sum_j q\!\left(\frac{\beta[j]}{\alpha^2}\right)$  and  $q(b) = 2 - \sqrt{4 + b^2} + b\,\sinh^{-1}\!\left(\frac{b}{2}\right)$

Induced dynamics:  $\dot{\beta}_\alpha = -\sqrt{\beta_\alpha^2 + 4\alpha^4} \odot \nabla L_S(\beta_\alpha)$
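To see how the penalty $q$ above interpolates between the two regimes, here is a tiny numerical look (illustrative): for small $b$ it is approximately quadratic, $q(b) \approx b^2/4$ (so large $\alpha$, where $\beta/\alpha^2$ stays small, gives an $L_2$-type bias), while for large $b$ it grows like $|b|(\log|b| - 1)$ (so small $\alpha$ gives an $L_1$-type bias up to log factors):

```python
import numpy as np

# Tiny numerical look (illustrative) at the penalty q above: approximately quadratic for small b
# (L2-type bias, kernel regime where beta/alpha^2 stays small), and growing like |b| (log|b| - 1)
# for large b (L1-type bias up to log factors, rich regime).
def q(b):
    return 2.0 - np.sqrt(4.0 + b**2) + b * np.arcsinh(b / 2.0)

for b in [1e-3, 1e-2, 1e-1]:
    print(f"b={b:7.0e}   q(b)={q(b):.3e}   b^2/4={b**2 / 4:.3e}")
for b in [1e2, 1e4, 1e6]:
    print(f"b={b:7.0e}   q(b)={q(b):.3e}   b*(log(b)-1)={b * (np.log(b) - 1):.3e}")
```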

Page 26

Sparse Learning
$y_i = \langle \beta^*, x_i \rangle + N(0, 0.01)$,   $d = 1000$,   $\|\beta^*\|_0 = 5$,   $m = 100$

Page 27

Sparse Learning
$y_i = \langle \beta^*, x_i \rangle + N(0, 0.01)$,   $d = 1000$,   $\|\beta^*\|_0 = k$

How small does $\alpha$ need to be to get $L(\beta_\alpha(\infty)) < 0.025$?

[Plot: required $\alpha$ vs. number of samples $m$]

Page 28

More controlling parameters

• Depth

• Width

• Optimization accuracy

• Stepsize, batchsize, ??

Page 29

Depth
$\beta(t) = w_+(t)^D - w_-(t)^D$

[Diagram: network over inputs $x_1, \dots, x_8$]

Page 30

Depth
$\beta(t) = w_+(t)^D - w_-(t)^D$,   $\beta(\infty) = \arg\min Q_D\!\left(\beta / \alpha^D\right)$ s.t. $X\beta = y$

$h_D(z) = \alpha^D\left[\left(1 + \alpha^{D-2} D(D-2)\, z\right)^{\frac{-1}{D-2}} - \left(1 - \alpha^{D-2} D(D-2)\, z\right)^{\frac{-1}{D-2}}\right]$

$Q_D(\beta) = \sum_i q_D\!\left(\beta[i]\right)$,   $q_D = \int h_D^{-1}$

[Plot: $q_D(z)$]

Page 31

Depth
$\beta(t) = w_+(t)^D - w_-(t)^D$,   $\beta(\infty) = \arg\min Q_D\!\left(\beta / \alpha^D\right)$ s.t. $X\beta = y$

• For all depths $D \ge 2$:  $\beta(\infty) \xrightarrow{\alpha \to 0} \arg\min_{X\beta = y} \|\beta\|_1$

• Contrast with explicit regularization: for $R_\alpha(\beta) = \min_{\beta = w_+^D - w_-^D} \|w - \alpha\mathbf{1}\|_2^2$, we get $R_\alpha(\beta) \xrightarrow{\alpha \to 0} \|\beta\|_{2/D}$ (also observed by [Arora Cohen Hu Luo 2019])

• Also with logistic loss, as $\alpha \to 0$:  $\beta(\infty) \propto$ an SOSP of $\|\beta\|_{2/D}$ [Gunasekar Lee Soudry Srebro 2018]

• With squared loss it is always $\|\cdot\|_1$, but for deeper $D$ we get there quicker

[Plot: $q_D(z)$]
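An illustrative toy run of the depth-$D$ parametrization (not the talk's experiment; sizes, $\alpha$, and step size are arbitrary): gradient descent on $\beta = w_+^D - w_-^D$ on a sparse regression problem, at the same moderate $\alpha$ for every depth. Per the bullets above, every depth is driven toward the $\ell_1$/sparse solution, and one would expect the deeper parametrizations to get close to it already at this larger $\alpha$:

```python
import numpy as np

# Illustrative toy run (not the talk's experiment; sizes, alpha and step size are arbitrary):
# gradient descent on the depth-D parametrization beta = w_plus**D - w_minus**D, on a sparse
# regression problem, using the same moderate alpha for every depth.
rng = np.random.default_rng(0)
d, n, k, alpha, lr = 100, 40, 3, 0.05, 2e-4
X = rng.standard_normal((n, d))
beta_star = np.zeros(d); beta_star[:k] = 1.0
y = X @ beta_star

for D in [2, 3, 4]:
    wp = np.full(d, alpha); wm = np.full(d, alpha)
    for _ in range(200_000):
        g = X.T @ (X @ (wp**D - wm**D) - y)        # gradient of 1/2 ||X beta - y||^2 w.r.t. beta
        wp -= lr * D * g * wp**(D - 1)             # chain rule through beta = wp^D - wm^D
        wm += lr * D * g * wm**(D - 1)
    b = wp**D - wm**D
    print(f"D={D}:  residual={np.linalg.norm(X @ b - y):.1e}"
          f"  ||b||_1={np.linalg.norm(b, 1):.3f}"
          f"  ||b - beta*||={np.linalg.norm(b - beta_star):.3f}")
print("||beta*||_1 =", np.linalg.norm(beta_star, 1))
```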

Page 32

Depth
$\beta(t) = w_+(t)^D - w_-(t)^D$,   $\beta(\infty) = \arg\min Q_D\!\left(\beta / \alpha^D\right)$ s.t. $X\beta = y$

[Plot: comparing $Q(1, 0, \dots, 0)$ vs. $Q\!\left(\tfrac{1}{d}, \tfrac{1}{d}, \dots, \tfrac{1}{d}\right)$]

Page 33

Sparse Learning with Depth
$y_i = \langle \beta^*, x_i \rangle + N(0, 0.01)$,   $d = 1000$,   $\|\beta^*\|_0 = k$

How small does $\alpha$ need to be to get $L(\beta_\alpha(\infty)) < 0.025$?

[Plot: required $\alpha^D$ vs. number of samples $m$]

Page 34

Deep Learning

• Expressive Power

• We are searching over the space of all functions…
… but with what inductive bias?

• How does this bias look in function space?

• Is it reasonable/sensible?

• Capacity / Generalization ability / Sample Complexity

• What's the true complexity measure (inductive bias)?

• How does it control generalization?

• Computation / Optimization

• How and where does optimization bias us? Under what conditions?

• Magic property of reality under which deep learning "works"