
Surrogates

Shai Shalev-Shwartz

Mobileye, an Intel Company

The Hebrew University of Jerusalem

Deep Learning: Science or Alchemy. Princeton 2019

Based on joint works with Amnon Shashua, Shaked Shammah, Ohad Shamir, Jonathan Fiat, Eran Malach, and Yonatan Wexler


Learning theory pre deep learning

PAC model: expressivity, generalization, optimization

The expressivity-generalization tradeoff is well understood by VC theory

Optimization is easy (only?) for linear models

Linear models work well in practice (AdaBoost, SVM)

Machine learning theoreticians at the time were “spoiled” by the success of AdaBoost and SVM

Engineering is mainly about constructing features based on expert knowledge


Outline

1 Prelude: My Personal Journey into Deep Learning
    Phase I: Wow!
    Phase II: Hmmm ...
    Phase III: What?
    The role of theory

2 Surrogate Model: Separating Gating from Linearity
    Gated Linear Units (GaLU)
    One hidden layer GaLU networks

3 Surrogate Distributions
    SGD finds latent features
    Depth Efficiency in Fractals


Back in 2014: Deep Learning is Amazing ...


What’s Great about Deep Learning

Deep learning is essentially a “differentiable programming language”

Fast development

Accelerate computation by dedicated hardware


A differentiable programming language

[Figure: an axis labeled “less prior knowledge, more data” with the points: expert system, shallow learning, deep networks, No Free Lunch]


Back in 2015: Toward self driving beyond highways ...


Driving Policy - Challenges

Defensive/Aggressive Tradeoff: balance between cautiousness when facing unexpected behavior of other drivers and not being too defensive (paranoia)

Negotiation/Communication: in dense traffic we must negotiate with other drivers/pedestrians

“The rules of breaking the rules ...”

Dealing with Uncertainty: is there a kid behind this car? Is this taxi parking or just picking up a passenger? Does this pedestrian intend to cross the street?


Deep Reinforcement Learning is Amazing ...



Sometimes things do not look good ...


You may think it works, but it doesn’t ...

Eric Schmidt: “Well, computer vision is a solved problem compared to human vision”


https://www.cnas.org/publications/transcript/eric-schmidt-keynote-address-at-the-center-for-a-new-american-security-artificial-intelligence-and-global-security-summit


Typical vs. Rare Cases ...



What’s Wrong with Deep Learning?

Sometimes we think it works, but it actually doesn’t ...

Sometimes it fails... What to do when it fails?

It doesn’t find the “shortest” explanation

It is hard to interpret what exactly the solution is

... and many more problems (e.g. Marcus 2017, Yuille & Liu 2019, Jia & Liang 2017)


“tried DL, it doesn’t work for my problem”

The most uninformative sentence a Professor/CTO can hear: “I’ve tried X but it didn’t work”

Usually, if you understand “X” you can find out what went wrong

Very difficult to understand what went wrong if you don’t have a clue how “X” works ...


What to do when things do not work...


Piecewise-linear Curves

Problem: Train a piecewise-linear curve detector

Input: $f = (f(0), f(1), \ldots, f(n-1))$ where

$f(x) = \sum_{r=1}^{k} a_r [x - \theta_r]_+ , \quad \theta_r \in \{0, \ldots, n-1\}$

Output: Curve parameters $\{a_r, \theta_r\}_{r=1}^{k}$
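The input/output pair of this problem can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, and the second-difference decoder is not taken from the talk. It relies on the kinks being distinct integers with theta_r <= n - 2 and on f(-1) = 0.

```python
import numpy as np

def piecewise_linear_curve(a, theta, n):
    """Evaluate f(x) = sum_r a_r * [x - theta_r]_+ on x = 0, ..., n-1."""
    x = np.arange(n)
    hinges = np.maximum(x[None, :] - theta[:, None], 0)  # shape (k, n)
    return a @ hinges

def decode_kinks(f):
    """Recover (a_r, theta_r) from f via discrete second differences.

    Assumption (not from the slides): kinks are distinct integers with
    theta_r <= n - 2, so f(-1) = 0 and each kink changes the slope by a_r.
    """
    g = np.concatenate(([0.0], f))           # prepend f(-1) = 0
    d2 = g[2:] - 2 * g[1:-1] + g[:-2]        # d2[x] = f(x+1) - 2 f(x) + f(x-1)
    theta = np.flatnonzero(np.abs(d2) > 1e-9)
    return d2[theta], theta

a = np.array([1.0, -2.0])
theta = np.array([2, 5])
f = piecewise_linear_curve(a, theta, n=8)
a_hat, theta_hat = decode_kinks(f)
```

The decoder is a hand-written reference solution, useful as a baseline against which a learned detector can be compared.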


First try: Deep AutoEncoder

Encoding network, $E_{w_1}$: Dense(500,relu)-Dense(100,relu)-Dense(2k)

Decoding network, $D_{w_2}$: Dense(100,relu)-Dense(100,relu)-Dense(n)

Squared Loss: $(D_{w_2}(E_{w_1}(f)) - f)^2$

Doesn’t work well ...

[Figure: results after 500, 10,000, and 50,000 iterations]


It doesn’t find the “simplest” explanation

Can DL solve simple word math problems like: “There are 3 eggs in each box. How many eggs are in 2 boxes?”

Let’s start with a much simpler problem: how much is 3 times 2?


Learning to Multiply

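First table (entries appear only in rows and columns 1, 3, 5; each value is printed twice on the slide):

      1    2    3    4    5
1     1         3         5
2
3     3         9        15
4
5     5        15        25

Second table (each cell: true product / second value printed on the slide, apparently the trained network’s output):

      1       2       3       4       5
1    1/1     2/2     3/3     4/7     5/5
2    2/3     4/21    6/11    8/34   10/21
3    3/3     6/12    9/9    12/22   15/15
4    4/10    8/35   12/23   16/53   20/37
5    5/5    10/23   15/15   20/36   25/25

[Figure: two plots labeled “3i” and “4i”, each showing a ReLU curve; x-axis ticks at 5 and 10]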


Learning to Multiply: Details

Represented input as two sequences of bits (the binary representations of the two numbers), so their product is

$\left( \sum_i a_i 2^i \right) \left( \sum_j b_j 2^j \right) = \sum_{i,j} a_i b_j 2^{i+j} = \sum_{i,j} [a_i + b_j - 1]_+ \, 2^{i+j}$

Can represent product with one hidden layer ReLU network

Even though a simple solution exists, SGD learns a completely different solution
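The identity behind the one-hidden-layer solution is that for bits a, b in {0, 1} the product a·b equals [a + b − 1]_+. A minimal sketch that checks this exhaustively on small integers (the helper names and the 8-bit width are assumptions made for illustration):

```python
def to_bits(n, width):
    """Little-endian binary representation of a non-negative integer."""
    return [(n >> i) & 1 for i in range(width)]

def relu(z):
    return max(z, 0)

def relu_product(x, y, width=8):
    """Product of x and y written as one hidden ReLU layer over their bits.

    Hidden unit (i, j) computes [a_i + b_j - 1]_+ and the output layer
    weights it by 2^(i + j).
    """
    a, b = to_bits(x, width), to_bits(y, width)
    return sum(
        relu(a[i] + b[j] - 1) * 2 ** (i + j)
        for i in range(width)
        for j in range(width)
    )

# Exhaustive check on all pairs that fit in 5 bits.
all_match = all(relu_product(x, y) == x * y for x in range(32) for y in range(32))
```

With `width` bits per input, this construction uses `width ** 2` hidden units.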



The Role of Theory

We miss a theoretical understanding of what’s really going on:

Inductive bias: What is the inductive bias? How does it depend on the architecture, the initialization, and the optimization algorithm?

Generalization: does “understanding deep learning require re-thinking generalization” (Zhang et al.)?

Optimization: When and how does gradient-based optimization work?


Example: Inductive bias of Initialization and Optimization

Theorem (Informal version of S., Shamir, Shammah 2016)

Fix some architecture. Any initialization scheme of a neural network induces an order over the set of functions expressible by the architecture, and gradient-based optimization is blind to all functions which are far in the induced order


Plan of Attack

Plan of attack — Surrogates:

Build simpler models that behave similarly to DL on specific tasks, analyze them theoretically, reflect on the behavior of DL, therefore gaining insight on inductive bias, generalization, and potential failure points

Build simpler distributions for which we understand the behavior of DL and which give us insight as to when and how DL works



Gated Linear Units (GaLU)

Rectified Linear Unit (ReLU):

$f_w(x) = \max\{w^\top x, 0\} = \mathbb{1}[w^\top x > 0] \cdot (w^\top x)$

Gated Linear Unit (GaLU):

$g_{w,u}(x) = \mathbb{1}[u^\top x > 0] \cdot (w^\top x)$
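A minimal sketch of the two units, written directly from the definitions above (the function names are hypothetical). It illustrates two properties: a GaLU with u = w is a ReLU, and for a fixed gate u the GaLU is linear in w.

```python
import numpy as np

def relu_unit(w, x):
    """ReLU: max(w.x, 0); the gate and the linear part share w."""
    return max(float(w @ x), 0.0)

def galu_unit(w, u, x):
    """GaLU: the gate 1[u.x > 0] is decoupled from the linear part w.x."""
    return float(u @ x > 0) * float(w @ x)

rng = np.random.default_rng(0)
w1, w2, u = rng.standard_normal((3, 4))
xs = rng.standard_normal((100, 4))

# Setting u = w recovers the ReLU.
same_as_relu = all(np.isclose(galu_unit(w1, w1, x), relu_unit(w1, x)) for x in xs)

# For a fixed gate u, the unit is linear in w.
linear_in_w = all(
    np.isclose(galu_unit(w1 + w2, u, x), galu_unit(w1, u, x) + galu_unit(w2, u, x))
    for x in xs
)
```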


ReLU and GaLU on multiplication table

[Figure: four plots on the multiplication table, with panels labeled “3i ReLU”, “4i ReLU”, “3i GaLU”, and “4i GaLU”]


ReLU, GaLU, and Linear networks on simple datasets


One hidden layer GaLU networks, single output

Optimization:

The gates are un-trainable by SGD (the gradient is always zero)

The weights are provably trainable (convex problem)

$N(x) = \sum_j \alpha_j g_{w_j,u_j}(x) = \sum_j g_{\alpha_j w_j, u_j}(x) = \sum_j g_{\tilde{w}_j, u_j}(x) = \sum_j \mathbb{1}[u_j^\top x > 0] \cdot (\tilde{w}_j^\top x)$
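Because the gates stay fixed, the network output is linear in the merged weights, so fitting it under squared loss is ordinary least squares on gated copies of the input. A minimal sketch, assuming squared loss and using hypothetical function names:

```python
import numpy as np

def gated_features(X, U):
    """Stack k gated copies of each input: row i is [1[u_j.x_i > 0] * x_i]_j."""
    gates = (X @ U.T > 0).astype(X.dtype)                             # (m, k)
    return (gates[:, :, None] * X[:, None, :]).reshape(len(X), -1)    # (m, k*d)

def galu_network(X, W_tilde, U):
    """N(x) = sum_j 1[u_j.x > 0] * (w~_j . x), with W_tilde of shape (k, d)."""
    return gated_features(X, U) @ W_tilde.reshape(-1)

def fit_galu(X, y, U):
    """Least-squares fit of the merged weights; the gates U are never updated."""
    w_tilde, *_ = np.linalg.lstsq(gated_features(X, U), y, rcond=None)
    return w_tilde.reshape(U.shape)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
U = rng.standard_normal((4, 5))            # random gates, fixed at initialization
W_true = rng.standard_normal((4, 5))
y = galu_network(X, W_true, U)             # labels realizable with these gates
W_hat = fit_galu(X, y, U)
```

The closed-form solve replaces SGD here only for brevity; any convex solver reaches the same training loss.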


One hidden layer GaLU networks, multiple outputs

Optimization:

A convex problem with low rank constraints

Approximations are well known, but the constraints do not seem to matter


One hidden layer: Expressivity

Every ReLU network can be expressed by a GaLU network (simply set $u_j = w_j$)

But, we do not optimize over $u_j$

So, the question is: what can be expressed by a GaLU network, assuming that the $u_j$ are fixed to their initialization?


GaLU networks can Memorize

A simple measure of model capacity: can we memorize a random sample?

Theorem

Assume $m$ random examples and consider a GaLU network with $k$ neurons, and let $d$ be the input dimension. Then,

$\mathbb{E}\left[ \min_w L_S(w) \right] = 1 - \frac{\mathrm{rank}(\bar{X})}{m}$

where $\bar{X} \in \mathbb{R}^{m \times kd}$ is composed of $k$ repetitions of the data matrix, gated by the random GaLU filters.

The rank is approximately full, meaning that we can memorize a dataset if $kd \gtrsim m$
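The formula can be checked numerically. The sketch below makes assumptions that the slide does not spell out: labels are i.i.d. standard normal, L_S is the mean squared error over the m examples, and inputs and gates are Gaussian. Function names are hypothetical.

```python
import numpy as np

def gated_design_matrix(X, U):
    """X-bar: k copies of the data matrix, copy j gated by 1[u_j.x > 0]."""
    gates = (X @ U.T > 0).astype(X.dtype)                             # (m, k)
    return (gates[:, :, None] * X[:, None, :]).reshape(len(X), -1)    # (m, k*d)

def memorization_gap(X, U, trials=2000, seed=0):
    """Return (empirical mean of min_w L_S(w), 1 - rank(X-bar) / m)."""
    rng = np.random.default_rng(seed)
    m = len(X)
    Xbar = gated_design_matrix(X, U)
    rank = np.linalg.matrix_rank(Xbar)
    # Orthonormal basis of the column space of X-bar.
    Q, _, _ = np.linalg.svd(Xbar, full_matrices=False)
    Q = Q[:, :rank]
    Y = rng.standard_normal((m, trials))       # one random labeling per column
    residual = Y - Q @ (Q.T @ Y)               # least-squares residuals
    empirical = float(np.mean(np.sum(residual ** 2, axis=0) / m))
    return empirical, 1.0 - rank / m

rng = np.random.default_rng(1)
m, d, k = 60, 5, 6                             # kd = 30 < m, so memorization is partial
X = rng.standard_normal((m, d))
U = rng.standard_normal((k, d))
empirical, predicted = memorization_gap(X, U)
```

Increasing k until kd reaches m drives the predicted loss toward zero, matching the memorization condition stated above.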


GaLU vs. ReLU for memorization


Clustered Piece-wise Linear Models

GaLU networks can be shown to be optimal for highly clustered piece-wise linear distributions

But, even if the data comes exactly from such a distribution, the learned weights are a random linear combination of the “real” weights

A hint as to why DNNs are uninterpretable


Rethinking Generalization

Zhang et al.: Understanding deep learning requires re-thinking generalization

Learning theory bounds usually take the form

$L_D(h) \le L_S(h) + \left( \frac{c(H)}{m} \right)^p$

where $c(H)$ is some capacity measure of the hypothesis class

Zhang et al. showed that SGD on the same network architecture overfits random labels yet generalizes on “true” labels



Not surprising for Linear Regression: need only $d$ examples to fit a linear model if there is no noise at all, and need $d/\epsilon$ examples to get $\epsilon$ error in the noisy case

Rosset and Tibshirani 2018: need roughly $\sigma^2 d/\epsilon$ examples to get $\epsilon$ error for a noisy linear model

Since GaLU networks are almost linear regression models, it may be possible to derive similar bounds for GaLU networks
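The linear regression claims are easy to see in a small simulation. This is a sketch under assumptions not stated in the slides: inputs are isotropic Gaussian (so the excess risk equals the squared parameter error), noise is Gaussian with standard deviation sigma, and the function names are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares."""
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_hat

def squared_param_error(d, m, sigma, seed=0):
    """||w_hat - w_true||^2 after fitting m examples in dimension d.

    For isotropic Gaussian inputs this equals the excess squared-loss risk.
    """
    rng = np.random.default_rng(seed)
    w_true = rng.standard_normal(d)
    X = rng.standard_normal((m, d))
    y = X @ w_true + sigma * rng.standard_normal(m)
    return float(np.sum((fit_linear(X, y) - w_true) ** 2))

d = 20
noiseless = squared_param_error(d, m=d, sigma=0.0)        # d examples suffice
noisy_few = squared_param_error(d, m=2 * d, sigma=1.0)
noisy_many = squared_param_error(d, m=50 * d, sigma=1.0)   # error shrinks as m grows
```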



Surrogate Distributions

We know that in the worst case DL cannot work

On some real datasets, DL works pretty well

Since modeling real data is hard, let’s search for surrogate distributions


Surrogate Distributions

What is a good surrogate distribution:

Objective 1: “identify the set of distributions for which DL works”

too hard

Objective 2: “identify a set of distributions which contains some real distributions and for which DL provably works”

still too hard

Objective 3: “identify a set of distributions which doesn’t necessarily contain any interesting real distribution but gives us insight on when and how DL works”



Deep Generative Model for (synthetic) images

Given a label, generate a small scale image, where each “pixel” represents a semantic class (e.g. sky, grass, ...)

Given a semantic small image, generate a larger semantic image by sampling a semantic “patch” from each semantic “pixel”


Deep Generative Model for (synthetic) images

[Figure: diagram of the generative process: the label y, then D_y yielding x^(0)_1, ..., x^(0)_4, then G_1 yielding x^(1)_1, ..., x^(1)_4, and so on]


SGD finds latent features

Theorem (Informal, Malach and S.)

Training a two-layer Conv net on an intermediate image using SGD will implicitly learn an embedding of the observed patches into a space such that patches from the same semantic class are close to each other, while patches from different classes are far.

Enables layer-by-layer reconstruction of the latent features



Depth Efficiency

Basic question: on which distributions are deeper networks much better than shallow ones?

Many recent results show depth efficiency: There exist functions which can be expressed by a small deep network but require exponential width in order to be expressed by a shallow network

Even though such functions exist, it doesn’t mean SGD can find them


Fractal Distributions

Iterated Fractal Distribution:

$K_0 = [-1, 1]^d$

$K_n = F_1(K_{n-1}) \cup \ldots \cup F_r(K_{n-1})$, where $F_1, \ldots, F_r$ are the maps generating the fractal

The “depth” of the fractal is n

A “fractal distribution” is a distribution in which positive examples are sampled from the set $K_n$ and negative examples are sampled from its complement

[Figure: the sets K_0, K_1, and K_2, with the images of the maps F_1, ..., F_4 marked]


Depth Separation

If the $F_i$ are affine, a network of depth $O(n)$ can express a depth $n$ fractal, but a shallow network requires exponential width

Approximation curve: How well does a network of depth $t$ express a depth $n$ fractal?

We show that SGD works only if the approximation curve is “nice”


Approximation Curve: coarse vs. fine


Success of SGD depends on the Approximation Curve


One Dimensional Cantor Fractals

$N_{c_1,c_2,c_3,b_3}(x) = |\,|\,|x - c_1| - c_2| - c_3| - b_3$
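As a concrete instance, the nested absolute values can express a one-dimensional Cantor fractal exactly. The sketch below assumes the middle-third Cantor set on [-1, 1], generated by the two maps F_1(x) = (x − 2)/3 and F_2(x) = (x + 2)/3; this choice and the parameter values c_1 = 0, c_2 = 2/3, c_3 = 2/9, b_3 = 1/9 are illustrative assumptions, not values from the talk. With them, N(x) ≤ 0 exactly on the depth-2 set K_2. Exact rational arithmetic avoids floating-point issues at interval endpoints.

```python
from fractions import Fraction

def in_cantor_ifs(x, depth):
    """Membership in K_depth, with K_0 = [-1, 1] and K_n = F_1(K_{n-1}) u F_2(K_{n-1})."""
    if depth == 0:
        return -1 <= x <= 1
    # x lies in F_i(K_{depth-1}) iff its preimage under F_i lies in K_{depth-1}.
    return in_cantor_ifs(3 * x + 2, depth - 1) or in_cantor_ifs(3 * x - 2, depth - 1)

def nested_abs_net(x, c1, c2, c3, b3):
    """N(x) = |||x - c1| - c2| - c3| - b3."""
    return abs(abs(abs(x - c1) - c2) - c3) - b3

# Assumed parameters for the middle-third Cantor set at depth 2.
C1, C2, C3, B3 = Fraction(0), Fraction(2, 3), Fraction(2, 9), Fraction(1, 9)

def in_cantor_net(x):
    """Classify x as positive when the nested-abs function is non-positive."""
    return nested_abs_net(x, C1, C2, C3, B3) <= 0

grid = [Fraction(i, 81) for i in range(-100, 101)]
agreement = all(in_cantor_net(x) == in_cantor_ifs(x, 2) for x in grid)
```

Each additional level of the fractal adds one more absolute value, which is the sense in which depth grows linearly with n.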


Summary

We lack a good understanding of basic properties of DL:

How to control the inductive bias

How does it depend on the initialization and optimization algorithm

A pursuit after good surrogates

Surrogate algorithms

Surrogate distributions


Summary

In practice, Deep Learning is just one important piece of a bigger puzzle.

