Mobileye, an Intel Company The Hebrew University of Jerusalem · Surrogates Shai Shalev-Shwartz...


Surrogates

Shai Shalev-Shwartz

Mobileye, an Intel Company

The Hebrew University of Jerusalem

Deep Learning: Science or Alchemy. Princeton 2019

Based on joint works with Amnon Shashua, Shaked Shammah, Ohad Shamir, Jonathan Fiat, Eran Malach, and Yonatan Wexler

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 1 / 57


Learning theory pre deep learning

PAC model: expressivity, generalization, optimization

The expressivity-generalization tradeoff is well understood by VC theory

Optimization is easy (only?) for linear models

Linear models work well in practice (AdaBoost, SVM)

Machine learning theoreticians at the time were “spoiled” by the success of AdaBoost and SVM

Engineering is mainly about constructing features based on expert knowledge


Outline

1 Prelude: My Personal Journey into Deep Learning
    Phase I: Wow!
    Phase II: Hmmm ...
    Phase III: What?
    The role of theory

2 Surrogate Model: Separating Gating from Linearity
    Gated Linear Units (GaLU)
    One hidden layer GaLU networks

3 Surrogate Distributions
    SGD finds latent features
    Depth Efficiency in Fractals


Back in 2014: Deep Learning is Amazing ...


What’s Great about Deep Learning

Deep learning is essentially a “differentiable programming language”

Fast development

Accelerate computation by dedicated hardware


A differentiable programming language

[Figure: a spectrum running toward less prior knowledge and more data, with expert system, shallow learning, deep networks, and No Free Lunch marked along it]


Back in 2015: Toward self driving beyond highways ...


Driving Policy - Challenges

Defensive/Aggressive Tradeoff: balance between cautiousness when facing unexpected behavior of other drivers and not being too defensive (paranoia)

Negotiation/Communication: in dense traffic we must negotiate with other drivers/pedestrians

“The rules of breaking the rules ...”

Dealing with Uncertainty: is there a kid behind this car? Is this taxi parking or just picking up a passenger? Does this pedestrian intend to cross the street?


Deep Reinforcement Learning is Amazing ...


Sometimes things do not look good ...


Typical vs. Rare Cases ...


What’s Wrong with Deep Learning?

Sometimes we think it works, but it actually doesn’t ...

Sometimes it fails... What to do when it fails?

It doesn’t find the “shortest” explanation

It is hard to interpret what exactly the solution is

... and many more problems (e.g. Marcus’2017, Yuille & Liu’2019, Jia & Liang’2017)


“tried DL, it doesn’t work for my problem”

The most uninformative sentence a Professor/CTO can hear: “I’ve tried X but it didn’t work”

Usually, if you understand “X” you can find out what went wrong

Very difficult to understand what went wrong if you don’t have a clue how “X” works ...


What to do when things do not work...


Piecewise-linear Curves

Problem: Train a piecewise-linear curve detector

Input: $f = (f(0), f(1), \ldots, f(n-1))$ where

$$f(x) = \sum_{r=1}^{k} a_r [x - \theta_r]_+ , \qquad \theta_r \in \{0, \ldots, n-1\}$$

Output: Curve parameters $\{a_r, \theta_r\}_{r=1}^{k}$
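
To make the task concrete, here is a minimal NumPy sketch that samples such a curve and reads the parameters back from second differences. The second-difference decoder is a hand-written baseline chosen for illustration, not a method stated on the slides; the function names and the example parameters are hypothetical. It relies on $f(-1) = 0$ (true because every $\theta_r \ge 0$) and cannot see a kink at $\theta_r = n-1$, which leaves no trace in the samples.

```python
import numpy as np

def make_curve(a, theta, n):
    """Sample f(x) = sum_r a_r * [x - theta_r]_+ at x = 0, ..., n-1."""
    x = np.arange(n, dtype=float)
    f = np.zeros(n)
    for a_r, theta_r in zip(a, theta):
        f += a_r * np.maximum(x - theta_r, 0.0)
    return f

def recover_params(f, tol=1e-9):
    """Illustrative baseline: f(x+1) - 2 f(x) + f(x-1) equals a_r at x = theta_r."""
    padded = np.concatenate(([0.0], f))          # prepend f(-1) = 0
    second = padded[2:] - 2.0 * padded[1:-1] + padded[:-2]
    theta = np.flatnonzero(np.abs(second) > tol)  # positions x = 0, ..., n-2
    return second[theta], theta

# Hypothetical example parameters.
f = make_curve(a=[1.0, -2.0, 0.5], theta=[2, 5, 7], n=10)
a_hat, theta_hat = recover_params(f)
```

A closed-form decoder like this gives a reference point for what a learned detector is expected to output.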


First try: Deep AutoEncoder

Encoding network, $E_{w_1}$: Dense(500,relu)-Dense(100,relu)-Dense(2k)

Decoding network, $D_{w_2}$: Dense(100,relu)-Dense(100,relu)-Dense(n)

Squared loss: $(D_{w_2}(E_{w_1}(f)) - f)^2$

Doesn’t work well ...

[Figure: three panels, after 500 iterations, 10,000 iterations, and 50,000 iterations]
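
As a rough sketch of the architecture above, the forward pass and squared loss can be written in plain NumPy. Several things here are assumptions: Dense(h,relu) is read as a fully connected layer of width h with a ReLU, the sizes n = 100 and k = 3 are made up, the random initialization is a generic choice, and no training loop is shown.

```python
import numpy as np

def init_dense(rng, sizes):
    """Random (weight, bias) pairs for consecutive layer sizes (assumed init)."""
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, h)), np.zeros(h))
            for m, h in zip(sizes[:-1], sizes[1:])]

def forward(layers, x):
    """ReLU after every layer except the last, which stays linear."""
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x

n, k = 100, 3                                     # hypothetical sizes
rng = np.random.default_rng(0)
encoder = init_dense(rng, [n, 500, 100, 2 * k])   # plays the role of E_{w_1}
decoder = init_dense(rng, [2 * k, 100, 100, n])   # plays the role of D_{w_2}

f = rng.normal(size=(8, n))                       # stand-in batch of curves
code = forward(encoder, f)                        # 2k numbers per curve
recon = forward(decoder, code)
loss = np.mean(np.sum((recon - f) ** 2, axis=1))  # squared reconstruction loss
```

The bottleneck has 2k units, matching the 2k curve parameters $\{a_r, \theta_r\}$, but nothing forces the code to equal those parameters.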


It doesn’t find the “simplest” explanation

Can DL solve simple word math problems like: “There are 3 eggs in each box. How many eggs are in 2 boxes?”

Let’s start with a much simpler problem: how much is 3 times 2?


Learning to Multiply

Odd rows and columns only. Each cell lists the two values printed on the slide: the product, then a second value (apparently the network's output).

        1          2          3          4          5
1     1 / 1                 3 / 3                 5 / 5
2
3     3 / 3                 9 / 9                15 / 15
4
5     5 / 5                15 / 15               25 / 25

Full table, same cell layout:

        1          2          3          4          5
1     1 / 1      2 / 2      3 / 3      4 / 7      5 / 5
2     2 / 3      4 / 21     6 / 11     8 / 34    10 / 21
3     3 / 3      6 / 12     9 / 9     12 / 22    15 / 15
4     4 / 10     8 / 35    12 / 23    16 / 53    20 / 37
5     5 / 5     10 / 23    15 / 15    20 / 36    25 / 25


Learning to Multiply

[Figure: two ReLU panels, labeled 3i and 4i, over a horizontal axis with ticks at 5 and 10]


Learning to Multiply: Details

Represented input as two sequences of bits (the binary representations of the two numbers), so their product is

$$\Big(\sum_i a_i 2^i\Big)\Big(\sum_j b_j 2^j\Big) = \sum_{i,j} a_i b_j 2^{i+j} = \sum_{i,j} [a_i + b_j - 1]_+ \, 2^{i+j}$$

Can represent product with one hidden layer ReLU network

Even though a simple solution exists, SGD learns a completely different solution
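
The identity behind the "simple solution" is easy to check by hand: for bits $a, b \in \{0, 1\}$, $ab = [a + b - 1]_+$. A small sketch in plain Python, with hypothetical helper names and an assumed bit width, evaluates the one-hidden-layer formula directly:

```python
def to_bits(value, width):
    """Little-endian binary representation, bit i has weight 2**i."""
    return [(value >> i) & 1 for i in range(width)]

def relu(z):
    return max(z, 0)

def relu_product(x, y, width=8):
    """Sum over hidden units [a_i + b_j - 1]_+ with output weights 2**(i+j)."""
    a, b = to_bits(x, width), to_bits(y, width)
    return sum(relu(a[i] + b[j] - 1) * 2 ** (i + j)
               for i in range(width) for j in range(width))

print(relu_product(3, 2))  # 6
```

The hidden layer needs one unit per bit pair (i, j), so width squared units in total.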


The Role of Theory

We lack a theoretical understanding of what’s really going on:

Inductive bias: What is the inductive bias? How does it depend on the architecture, the initialization, and the optimization algorithm?

Generalization: does “understanding deep learning require re-thinking generalization” (Zhang et al.)?

Optimization: When and how does gradient-based optimization work?


Example: Inductive bias of Initialization and Optimization

Theorem (Informal version of S., Shamir, Shammah 2016)

Fix some architecture. Any initialization scheme of a neural network induces an order over the set of functions expressible by the architecture, and gradient-based optimization is blind to all functions which are far in the induced order


Plan of Attack

Plan of attack — Surrogates:

Build simpler models that behave similarly to DL on specific tasks, analyze them theoretically, and reflect on the behavior of DL, thereby gaining insight into inductive bias, generalization, and potential failure points

Build simpler distributions for which we understand the behavior of DL and which give us insight as to when and how DL works


Gated Linear Units (GaLU)

Rectified Linear Unit (ReLU):

$$f_w(x) = \max\{w^\top x, 0\} = \mathbb{1}_{w^\top x > 0} \cdot (w^\top x)$$

Gated Linear Unit (GaLU):

$$g_{w,u}(x) = \mathbb{1}_{u^\top x > 0} \cdot (w^\top x)$$
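
The two definitions differ only in which vector drives the gate. A minimal NumPy sketch (function names are illustrative) makes that explicit; setting u = w in the GaLU recovers the ReLU:

```python
import numpy as np

def relu_unit(w, x):
    """f_w(x) = max(w.x, 0): the gate and the linear part share w."""
    return max(float(w @ x), 0.0)

def galu_unit(w, u, x):
    """g_{w,u}(x) = 1[u.x > 0] * (w.x): a separate vector u drives the gate."""
    gate = 1.0 if float(u @ x) > 0 else 0.0
    return gate * float(w @ x)

rng = np.random.default_rng(0)
w, u, x = rng.normal(size=(3, 4))
same = np.isclose(galu_unit(w, w, x), relu_unit(w, x))   # u = w gives back ReLU
```

With u held fixed, `galu_unit` is linear in w, which is the property the following slides exploit.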


ReLU and GaLU on multiplication table

[Figure: four panels over a horizontal axis with ticks at 5 and 10: ReLU on 3i, ReLU on 4i, GaLU on 3i, GaLU on 4i]


ReLU, GaLU, and Linear networks on simple datasets


One hidden layer GaLU networks, single output

Optimization:

The gates are un-trainable by SGD (the gradient is always zero)

The weights are provably trainable (convex problem)

$$N(x) = \sum_j \alpha_j g_{w_j,u_j}(x) = \sum_j g_{\alpha_j w_j,u_j}(x) = \sum_j g_{\tilde{w}_j,u_j}(x) = \sum_j \mathbb{1}_{u_j^\top x > 0} \cdot (\tilde{w}_j^\top x)$$
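
Since $N(x)$ is linear in the $\tilde{w}_j$ once the gates $u_j$ are frozen, training the weights is a linear fit on gated copies of the input. The sketch below assumes squared loss (the slide only says the problem is convex) and uses hypothetical function names; it solves the fit with ordinary least squares rather than SGD.

```python
import numpy as np

def gated_features(X, U):
    """Stack k gated copies of X: block j is 1[X u_j > 0] * X. Shape (m, k*d)."""
    gates = (X @ U.T > 0).astype(X.dtype)                  # (m, k)
    return (gates[:, :, None] * X[:, None, :]).reshape(X.shape[0], -1)

def fit_galu(X, y, U):
    """Least-squares fit of the weights with the gates U held fixed."""
    w, *_ = np.linalg.lstsq(gated_features(X, U), y, rcond=None)
    return w.reshape(U.shape)                              # row j is w~_j

def predict_galu(X, W, U):
    return gated_features(X, U) @ W.reshape(-1)

rng = np.random.default_rng(0)
m, d, k = 200, 5, 4                                        # hypothetical sizes
X = rng.normal(size=(m, d))
U = rng.normal(size=(k, d))                                # random, never trained
W_true = rng.normal(size=(k, d))
y = predict_galu(X, W_true, U)
W_hat = fit_galu(X, y, U)
```

Any convex loss would do in place of squared loss; least squares is used here only because it has a closed-form solver.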


One hidden layer GaLU networks, multiple outputs

Optimization:

A convex problem with low rank constraints

Approximations are well known, but the constraints do not seem to matter


One hidden layer: Expressivity

Every ReLU network can be expressed by a GaLU network (simply set $u_j = w_j$)

But, we do not optimize over $u_j$

So, the question is: what can be expressed by a GaLU network, assuming that $u_j$ are fixed to their initialization?


GaLU networks can Memorize

A simple measure of model capacity: can we memorize a random sample?

Theorem

Assume $m$ random examples and consider a GaLU network with $k$ neurons, and let $d$ be the input dimension. Then,

$$\mathbb{E}\left[\min_w L_S(w)\right] = 1 - \frac{\operatorname{rank}(\bar{X})}{m}$$

where $\bar{X} \in \mathbb{R}^{m, kd}$ is composed of $k$ repetitions of the data matrix gated by the random GaLU filters.

The rank is approximately full, meaning that we can memorize a dataset if $kd \gtrsim m$
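
The formula can be checked numerically under explicit assumptions that the slide does not spell out: labels drawn i.i.d. with zero mean and unit variance, and $L_S$ taken as the mean squared loss. In that setting the best fit is the projection of the labels onto the column space of $\bar{X}$, so the expected residual is $1 - \operatorname{rank}(\bar{X})/m$. A Monte Carlo sketch with made-up sizes:

```python
import numpy as np

def gated_matrix(X, U):
    """X_bar: k copies of the data matrix, copy j gated by 1[X u_j > 0]."""
    gates = (X @ U.T > 0).astype(float)                   # (m, k)
    return (gates[:, :, None] * X[:, None, :]).reshape(len(X), -1)

rng = np.random.default_rng(0)
m, d, k, trials = 60, 5, 4, 2000                          # hypothetical sizes
X = rng.normal(size=(m, d))
U = rng.normal(size=(k, d))                               # random GaLU filters
X_bar = gated_matrix(X, U)
rank = np.linalg.matrix_rank(X_bar)

P = X_bar @ np.linalg.pinv(X_bar)      # projector onto the column space of X_bar
Y = rng.normal(size=(m, trials))       # assumed label model: i.i.d. N(0, 1)
residual = Y - P @ Y                   # what the best weights cannot fit
empirical = np.mean(residual ** 2)     # average of (1/m) * ||residual||^2
predicted = 1.0 - rank / m
```

Raising k until kd reaches m drives `predicted` toward zero, which is the memorization regime described above.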


GaLU vs. ReLU for memorization


Clustered Piece-wise Linear Models

GaLU networks can be shown to be optimal for highly clustered piece-wise linear distributions

But, even if the data comes exactly from such a distribution, the learned weights are a random linear combination of the “real” weights

A hint as to why DNNs are uninterpretable


Rethinking Generalization

Zhang et al.: Understanding deep learning requires re-thinking generalization

Learning theory bounds usually take the form

$$L_D(h) \le L_S(h) + \left(\frac{c(H)}{m}\right)^p$$

where $c(H)$ is some capacity measure of the hypothesis class

Zhang et al. showed that SGD on the same network architecture overfits random labels and generalizes on “true” labels


Rethinking Generalization

Not surprising for Linear Regression: need only $d$ examples to fit a linear model if there is no noise at all, and need $d/\epsilon$ examples to get $\epsilon$ error in the noisy case

Rosset and Tibshirani 2018: need roughly $\sigma^2 d/\epsilon$ examples to get $\epsilon$ error for noisy linear model

Since GaLU networks are almost linear regression models, it may be possible to derive similar bounds for GaLU networks
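
The linear regression claims are simple to see in simulation. The sketch below assumes isotropic Gaussian inputs, for which the excess risk of an estimate equals its squared distance to the true weights; the sizes, noise level, and function name are made up for illustration.

```python
import numpy as np

def excess_risk(m, d, sigma, rng):
    """Squared distance between the least-squares fit and the true weights."""
    w_true = rng.normal(size=d)
    X = rng.normal(size=(m, d))                      # assumed isotropic inputs
    y = X @ w_true + sigma * rng.normal(size=m)
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((w_hat - w_true) ** 2))

rng = np.random.default_rng(0)
d = 10
noiseless = excess_risk(m=d, d=d, sigma=0.0, rng=rng)  # d examples, no noise
small = np.mean([excess_risk(50, d, 1.0, rng) for _ in range(50)])
large = np.mean([excess_risk(1000, d, 1.0, rng) for _ in range(50)])
```

Without noise, d examples pin down the model up to rounding error; with noise, the error keeps shrinking as m grows past d.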


Surrogate Distributions

We know that in the worst case DL cannot work

On some real datasets, DL works pretty well

Since modeling real data is hard, let’s search for surrogate distributions


Surrogate Distributions

What is a good surrogate distribution:

Objective 1: “identify the set of distributions for which DL works”

too hard

Objective 2: “identify a set of distributions which contains some real distributions and for which DL provably works”

still too hard

Objective 3: “identify a set of distributions which doesn’t necessarily contain any interesting real distribution but gives us insight on when and how DL works”


Deep Generative Model for (synthetic) images

Given a label, generate a small scale image, where each “pixel” represents a semantic class (e.g. sky, grass, ...)

Given a semantic small image, generate a larger semantic image by sampling a semantic “patch” from each semantic “pixel”
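
The expansion step can be sketched as follows. Everything concrete here is hypothetical: the two-class vocabulary, the 2x2 patch size, the patch contents, and the fixed coarse image standing in for the label-conditioned first step. The point is only the mechanism, where each semantic pixel is replaced by a patch sampled from the candidates for its class.

```python
import numpy as np

def expand(image, patches, rng):
    """Replace every semantic pixel by a randomly sampled patch for its class."""
    h, w = image.shape
    s = next(iter(patches.values())).shape[-1]       # patch side length
    out = np.empty((h * s, w * s), dtype=int)
    for r in range(h):
        for c in range(w):
            options = patches[int(image[r, c])]
            choice = options[rng.integers(len(options))]
            out[r * s:(r + 1) * s, c * s:(c + 1) * s] = choice
    return out

# Hypothetical toy vocabulary: class id -> candidate 2x2 patches of class ids.
patches = {
    0: np.array([[[0, 0], [0, 1]], [[0, 0], [0, 0]]]),
    1: np.array([[[1, 1], [1, 0]], [[1, 1], [1, 1]]]),
}
rng = np.random.default_rng(0)
coarse = np.array([[0, 1], [1, 0]])   # stand-in for the image drawn given a label
fine = expand(coarse, patches, rng)   # 4x4 semantic image
finer = expand(fine, patches, rng)    # applying the step again gives 8x8
```

Repeating `expand` level after level is what makes the model deep.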


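[Figure: diagram of the generative model, showing the label $y$, the distribution $D_y$, the pixels $x^{(0)}_1, \ldots, x^{(0)}_4$, the generator $G_1$, and the pixels $x^{(1)}_1, \ldots, x^{(1)}_4$]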


SGD finds latent features

Theorem (Informal, Malach and S.)

Training a two-layer Conv net on an intermediate image using SGD will implicitly learn an embedding of the observed patches into a space such that patches from the same semantic class are close to each other, while patches from different classes are far apart.

Enables layer-by-layer reconstruction of the latent features


Depth Efficiency

Basic question: on which distributions are deeper networks much better than shallow ones?

Many recent results show depth efficiency: There exist functions which can be expressed by a small deep network but require exponential width in order to be expressed by a shallow network

Even though such functions exist, it doesn’t mean SGD can find them


Fractal Distributions

Iterated Fractal Distribution:

$K_0 = [-1, 1]^d$

$K_n = F_1(K_{n-1}) \cup \ldots \cup F_n(K_{n-1})$

The “depth” of the fractal is $n$

A “fractal distribution” is a distribution in which positive examples are sampled from the set $K_n$ and negative examples are sampled from its complement

[Figure: the sets $K_0$, $K_1$, and $K_2$, with the maps $F_1, \ldots, F_4$ labeled on $K_1$ and on $K_2$]
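
A one-dimensional instance makes the construction concrete. The sketch below assumes two affine maps chosen for illustration, $F_1(x) = (x - 2)/3$ and $F_2(x) = (x + 2)/3$, which turn $K_0 = [-1, 1]$ into a Cantor-like set; neither the maps nor the function names come from the slides. Membership uses the fact that $x \in K_n$ exactly when some $F_i^{-1}(x)$ lies in $K_{n-1}$.

```python
import random

# Assumed example maps (not from the slides) and their inverses.
FORWARD_MAPS = [lambda x: (x - 2) / 3, lambda x: (x + 2) / 3]
INVERSE_MAPS = [lambda x: 3 * x + 2, lambda x: 3 * x - 2]

def in_fractal(x, depth):
    """x is in K_depth iff some inverse map sends it into K_{depth-1}."""
    if depth == 0:
        return -1.0 <= x <= 1.0                  # K_0 = [-1, 1]
    return any(in_fractal(inv(x), depth - 1) for inv in INVERSE_MAPS)

def sample_positive(depth, rng):
    """Draw a point of K_depth: start in K_0, apply `depth` random maps."""
    x = rng.uniform(-1.0, 1.0)
    for _ in range(depth):
        x = rng.choice(FORWARD_MAPS)(x)
    return x

rng = random.Random(0)
positive = sample_positive(3, rng)
labels = [in_fractal(x, 2) for x in (0.0, 0.5, 0.6, 1.0)]
```

Negative examples would come from points where `in_fractal` returns False; how they are sampled from the complement is left open here.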


Depth Separation

If $F_i$ are affine, a network of depth $O(n)$ can express a depth-$n$ fractal, but a shallow network requires exponential width

Approximation curve: How well does a network of depth $t$ express a depth-$n$ fractal?

We show that SGD works only if the approximation curve is “nice”


Approximation Curve: coarse vs. fine


Success of SGD depends on the Approximation Curve


One Dimensional Cantor Fractals

$$N_{c_1,c_2,c_3,b_3}(x) = \big|\,\big|\,|x - c_1| - c_2\big| - c_3\big| - b_3$$
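
Each absolute value is a two-unit ReLU layer, since $|z| = [z]_+ + [-z]_+$, so this function is a small three-layer ReLU network. The sketch below evaluates it and plugs in one hand-picked parameter setting; that setting is an assumption made for illustration. With the middle-thirds maps $x \mapsto (x \mp 2)/3$ on $[-1, 1]$, the depth-2 set is exactly where the output is at most zero.

```python
def relu(z):
    return max(z, 0.0)

def abs_via_relu(z):
    """|z| written with two ReLU units."""
    return relu(z) + relu(-z)

def cantor_net(x, c1, c2, c3, b3):
    """N(x) = |||x - c1| - c2| - c3| - b3, one absolute value per layer."""
    h1 = abs_via_relu(x - c1)
    h2 = abs_via_relu(h1 - c2)
    h3 = abs_via_relu(h2 - c3)
    return h3 - b3

# Hand-picked parameters (an assumption): N(x) <= 0 on the depth-2 set
# [-1, -7/9], [-5/9, -1/3], [1/3, 5/9], [7/9, 1].
params = dict(c1=0.0, c2=2 / 3, c3=2 / 9, b3=1 / 9)
inside = cantor_net(0.5, **params)     # 0.5 lies in [1/3, 5/9]
outside = cantor_net(0.6, **params)    # 0.6 falls in the gap (5/9, 7/9)
```

The first layer folds the line around $c_1$ and each later layer folds it again, which is how depth buys the repeated structure.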


Summary

We lack a good understanding of basic properties of DL:

    How to control the inductive bias
    How does it depend on the initialization and optimization algorithm

The pursuit of good surrogates

    Surrogate algorithms
    Surrogate distributions


In practice, Deep Learning is just one important piece of a bigger puzzle.
