
EE-559 – Deep learning

2. Learning, capacity, basic embeddings

François Fleuret

https://fleuret.org/dlc/

[version of: June 19, 2018]

ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE


Learning from data


The general objective of machine learning is to capture regularity in data to make predictions.

In our regression example, we modeled age and blood pressure as being linearly related, to predict the latter from the former.

There are multiple types of inference that we can roughly split into three categories:

• Classification (e.g. object recognition, cancer detection, speech processing),

• regression (e.g. customer satisfaction, stock prediction, epidemiology), and

• density estimation (e.g. outlier detection, data visualization, sampling/synthesis).


The standard formalization considers a probability measure

µ_{X,Y}

over the observation/value of interest, and i.i.d. training samples

(x_n, y_n), n = 1, ..., N.


Intuitively, for classification it can often be interpreted as

µ_{X,Y}(x, y) = µ_{X|Y=y}(x) P(Y = y),

that is, draw Y first, and given its value, generate X.

Here

µ_{X|Y=y}

stands for the population of the observable signal for class y (e.g. "sound of an /e/", "image of a cat").


For regression, one would interpret the joint law more naturally as

µ_{X,Y}(x, y) = µ_{Y|X=x}(y) µ_X(x),

which would be: first, generate X, and given its value, generate Y.

In the simple case

Y = f(X) + ε,

where f is the deterministic dependency between x and y, and ε is a random noise, independent of X.


With such a model, we can more precisely define the three types of inference we introduced before:

Classification:

• (X, Y) random variables on Z = R^D × {1, ..., C},

• we want to estimate argmax_y P(Y = y | X = x).

Regression:

• (X, Y) random variables on Z = R^D × R,

• we want to estimate E(Y | X = x).

Density estimation:

• X random variable on Z = R^D,

• we want to estimate µ_X.


The boundaries between these categories are fuzzy:

• Regression allows us to do classification through class scores,

• density models allow us to do classification thanks to Bayes' law,

etc.


We call generative the classification methods that rely on an explicit data model, and discriminative the ones that bypass such a modeling.

Example: Can we predict a Brazilian basketball player's gender G from his/her height H?

Females: 190 182 188 184 196 173 180 193 179 186 185 169

Males: 192 190 183 199 200 190 195 184 190 203 205 201


In the generative approach, we model the class-conditional densities

µ_{H|G=g}(h)

[figure: estimated densities of the height H for males and females, h ranging over 140-240]

and use Bayes' law

P(G = g | H = h) = µ_{H|G=g}(h) P(G = g) / µ_H(h).

[figure: the resulting posterior P(G = g | H = h); the two curves cross at h ≃ 189.23]
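As an illustration (not from the slides), here is a minimal sketch of this generative approach in PyTorch, assuming Gaussian class-conditional densities for the heights; the Gaussian assumption and the helper names are ours:

import math
import torch

females = torch.tensor([190., 182, 188, 184, 196, 173, 180, 193, 179, 186, 185, 169])
males   = torch.tensor([192., 190, 183, 199, 200, 190, 195, 184, 190, 203, 205, 201])

def gaussian_pdf(h, mu, sigma):
    return torch.exp(-0.5 * ((h - mu) / sigma)**2) / (sigma * math.sqrt(2 * math.pi))

# Fit one Gaussian per class, and use the empirical class frequencies as priors
mu_f, sigma_f = females.mean(), females.std()
mu_m, sigma_m = males.mean(), males.std()
p_f = females.numel() / (females.numel() + males.numel())

def p_female_given_height(h):
    # Bayes' law: P(G = f | H = h) = µ_{H|G=f}(h) P(G = f) / µ_H(h)
    num = gaussian_pdf(h, mu_f, sigma_f) * p_f
    den = num + gaussian_pdf(h, mu_m, sigma_m) * (1 - p_f)
    return num / den

print(p_female_given_height(torch.tensor(185.)))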


In the discriminative approach we directly pick the threshold that works best on the data:

[figure: histograms of the training heights for the two classes; the best threshold is h = 187.5]

Note that it is harder to design a confidence indicator.
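A sketch of this discriminative procedure (our own illustration, reusing the height samples above): try every observed height as a threshold and keep the one with the lowest empirical error.

import torch

# labels: 0 = female, 1 = male
heights = torch.cat((females, males))
labels = torch.cat((torch.zeros(females.numel()), torch.ones(males.numel())))

best_err, best_t = 2.0, None
for t in heights.tolist():
    pred = (heights >= t).float()   # classify as male above the threshold
    err = (pred != labels).float().mean().item()
    if err < best_err:
        best_err, best_t = err, t
print(best_t, best_err)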


Risk, empirical risk


Learning consists of finding in a set F of functionals a "good" f* (or its parameters' values), usually defined through a loss

l : F × Z → R

such that l(f, z) increases with how wrong f is on z.

For instance:

• for classification: l(f, (x, y)) = 1_{f(x) ≠ y},

• for regression: l(f, (x, y)) = (f(x) − y)²,

• for density estimation: l(q, z) = −log q(z).

The loss may include additional terms related to f itself.
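In code, these three example losses are one-liners; a toy sketch (our own, on made-up tensors):

import torch

y, f_x = torch.tensor([1, 0, 1]), torch.tensor([1, 1, 1])
zero_one = (f_x != y).float()        # classification: 1_{f(x) ≠ y}

y, f_x = torch.tensor([0.5, -1.0]), torch.tensor([0.3, -0.5])
squared = (f_x - y)**2               # regression: (f(x) − y)²

q_z = torch.tensor([0.9, 0.01])      # model density evaluated at samples z
nll = -q_z.log()                     # density estimation: −log q(z)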


We are looking for an f with a small expected risk

R(f) = E_Z(l(f, Z)),

which means that our learning procedure would ideally choose

f* = argmin_{f ∈ F} R(f).

Although this quantity is unknown, if we have i.i.d. training samples

D = {Z_1, ..., Z_N},

we can compute an estimate, the empirical risk:

R̂(f; D) = Ê_D(l(f, Z)) = (1/N) Σ_{n=1}^{N} l(f, Z_n).


We have

E_{Z_1,...,Z_N}(R̂(f; D)) = E_{Z_1,...,Z_N}((1/N) Σ_{n=1}^{N} l(f, Z_n))

= (1/N) Σ_{n=1}^{N} E_{Z_n}(l(f, Z_n))

= (1/N) Σ_{n=1}^{N} E_Z(l(f, Z))

= E_Z(l(f, Z))

= R(f),

where the second line uses the linearity of expectation, and the third uses that the Z_n all have the law of Z.

The empirical risk is an unbiased estimator of the expected risk.
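This can also be checked numerically; a small sketch (our own), averaging the empirical risk of a fixed f over many resampled training sets:

import torch

f = lambda x: 0.5 * x                     # a fixed predictor, not chosen from the data

def empirical_risk(f, x, y):              # quadratic loss
    return ((f(x) - y)**2).mean()

risks = []
for _ in range(10000):
    x = torch.empty(100).uniform_(0, 1)   # resample D = {Z_1, ..., Z_N}
    y = x + torch.empty(100).normal_(0, 0.1)
    risks.append(empirical_risk(f, x, y).item())

print(sum(risks) / len(risks))            # converges to the expected risk R(f)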


Finally, given D, F, and l, "learning" aims at computing

f* = argmin_{f ∈ F} R̂(f; D).

• Can we bound R(f) with R̂(f; D)?

Yes, if f is not chosen using D: since the Z_n are independent, we just need to take into account the variance of R̂(f; D).

• Can we bound R(f*) with R̂(f*; D)?

⚠ Unfortunately not simply, and not without additional constraints on F. For instance if |F| = 1, we can!


Note that in practice, we call "loss" both the functional

l : F × Z → R

and the empirical risk minimized during training

L(f) = (1/N) Σ_{n=1}^{N} l(f, z_n).


Over- and under-fitting, intuition


You want to hire someone, and you evaluate candidates by asking them ten technical yes/no questions.

Would you feel confident if you interviewed one candidate and he makes a perfect score?

What about interviewing ten candidates and picking the best?

What about one thousand?


With

Q^n_k ∼ B(0.5), n = 1, ..., 1000, k = 1, ..., 10,

independent, we have

∀n, P(∀k, Q^n_k = 1) = 1/1024

and

P(∃n, ∀k, Q^n_k = 1) ≃ 0.62.

There is a 62% chance that among 1,000 candidates answering completely at random, one will score perfectly.

It is quite intuitive that selecting a candidate based on a statistical estimator biases the said estimator for that candidate.
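The 0.62 above follows from independence, and can be checked with a one-liner (our own):

p_perfect = 1 / 2**10                       # one candidate answers all ten at random
p_at_least_one = 1 - (1 - p_perfect)**1000  # P(∃n, ∀k, Q^n_k = 1)
print(p_at_least_one)                       # ≃ 0.6236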


Over- and under-fitting, capacity, K-nearest-neighbors


A simple classification procedure is the "K-nearest neighbors."

Given

(x_n, y_n) ∈ R^D × {1, ..., C}, n = 1, ..., N,

to predict the y associated with a new x, take the y_n of the closest x_n:

n*(x) = argmin_n ‖x_n − x‖,

f*(x) = y_{n*(x)}.

This recipe corresponds to K = 1, and makes the empirical training error zero.
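A minimal PyTorch sketch of this rule (our own; train_x is N × D, train_y holds integer labels, and x is a single test point). For K = 1 it implements the recipe above; the majority vote for K > 1 is discussed next:

import torch

def knn_predict(train_x, train_y, x, K=1):
    # distances from x to all training points
    d = (train_x - x).norm(dim=1)
    # indices of the K closest points
    _, idx = d.topk(K, largest=False)
    # majority vote among their labels
    return train_y[idx].bincount().argmax()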


[figure: the K = 1 nearest-neighbor prediction on a toy two-class example]

Under mild regularity assumptions on µ_{X,Y}, for N → ∞ the asymptotic error rate of the 1-NN is less than twice the (optimal!) Bayes' error rate.

The rule can be made more stable by looking at the K > 1 closest training points, and taking the majority vote.

If we also let K → ∞ "not too fast", the error rate converges to the (optimal!) Bayes' error rate.


[figures: two toy training sets, the corresponding predictions for K = 1, and, for K = 51, the per-class vote maps and the corresponding predictions]

[figure: train and test error as a function of K (log scale, K = 1 to 1000), with the overfitting (small K) and underfitting (large K) regimes indicated]

Over- and under-fitting, capacity, polynomials


Consider a polynomial model

∀x, α_0, ..., α_D ∈ R, f(x; α) = Σ_{d=0}^{D} α_d x^d,

and training points (x_n, y_n) ∈ R², n = 1, ..., N. We minimize the quadratic loss

L(α) = Σ_n (f(x_n; α) − y_n)²

= Σ_n (Σ_{d=0}^{D} α_d x_n^d − y_n)²

= ‖X α − y‖²,

where X is the N × (D+1) matrix of entries X_{n,d} = x_n^d, α = (α_0, ..., α_D)ᵀ, and y = (y_1, ..., y_N)ᵀ.

This is a standard quadratic problem, for which we have efficient algorithms.


α* = argmin_α ‖X α − y‖²

def fit_polynomial(D, x, y):
    N = x.size(0)
    X = Tensor(N, D + 1)
    for d in range(D + 1):
        X[:, d] = x.pow(d)
    Y = y.view(N, 1)
    # LAPACK's GEneralized Least-Square
    alpha, _ = torch.gels(Y, X)
    # unclear why we need narrow here (torch 0.3.1)
    return alpha.narrow(0, 0, D + 1)
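torch.gels and the bare Tensor constructor belong to the 0.3.x API the course targets; with a recent PyTorch (≥ 1.9, our assumption, not the slides') the same least-squares solve would read:

# modern equivalent of the torch.gels call
alpha = torch.linalg.lstsq(X, Y).solution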


D, N = 4, 100

x = torch.linspace(-math.pi, math.pi, N)
y = x.sin()

alpha = fit_polynomial(D, x, y)

X = Tensor(N, D + 1)
for d in range(D + 1):
    X[:, d] = x.pow(d)
z = X.mm(alpha).view(-1)

for k in range(N): print(x[k], y[k], z[k])

[figure: sin and the fitted degree-4 polynomial f* on [−π, π]]

We can use that model on a synthetic example with noisy training points, to illustrate how the prediction changes when we increase the degree or the regularization.


[figures: a noisy training set on [0, 1] and the fitted polynomial f* for degrees D = 0 to 9]

[figure: train and test error (MSE, log scale) as a function of the degree D = 0 to 9]

[figures: the true function f and the fitted polynomial f* for degrees D = 0 to 9]

We can reformulate this control of the degree with a penalty

L(α) = Σ_n (f(x_n; α) − y_n)² + Σ_d l_d(α_d)

where

l_d(α) = 0 if d ≤ D or α = 0, +∞ otherwise.

Such a penalty kills any term of degree > D.

This motivates the use of more subtle variants, for instance, to keep all of this quadratic:

L(α) = Σ_n (f(x_n; α) − y_n)² + ρ Σ_d α_d².
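This quadratic penalty (ridge regression) still has a closed-form minimizer; a sketch (our own, with X the Vandermonde matrix as before, using the modern torch.linalg API):

# ridge solution: alpha = (XᵀX + ρ I)⁻¹ Xᵀ y
def fit_polynomial_ridge(D, x, y, rho):
    X = x.view(-1, 1) ** torch.arange(D + 1).float()  # Vandermonde matrix
    A = X.t().mm(X) + rho * torch.eye(D + 1)
    b = X.t().mv(y)
    return torch.linalg.solve(A, b)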


[figures: the true function f and the fitted degree-9 polynomial f* for regularization ρ = 1e1 down to 1e-13 and ρ = 0]

[figure: train and test error (MSE, log scale) as a function of ρ]

We define the capacity of a set of predictors as its ability to model an arbitrary functional. This is a vague definition, difficult to make formal.

A mathematically precise notion is the Vapnik–Chervonenkis dimension of a set of functions which, in the binary classification case, is the cardinality of the largest set of points that can be labeled arbitrarily (Vapnik, 1995).

It is a very powerful concept, but it is poorly adapted to neural networks. We will not say more about it in this course.


Although the capacity is hard to define precisely, it is quite clear in practice how to modulate it for a given class of models.

In particular, one can control over-fitting either by

• impoverishing the space F (fewer functionals, early stopping), or

• making the choice of f* less dependent on the data (penalty on coefficients, margin maximization, ensemble methods).


Bias-variance dilemma


[figures: the true function f and the expectation E(f*) of the fit over resampled training sets, for degrees D = 0 to 9]

[Figures (16 panels): E(f∗) and the target f on [0, 1], for degree-9 fits with penalty ρ = 1e1, 1e0, 1e−1, …, 1e−13, and ρ = 0.]

Let x be fixed, and y = f(x) the “true” value associated to it.

Let Y = f∗(x) be the value we predict.

If we consider that the training set D is a random quantity, then so is f∗, and consequently so is Y.

We have

$$
\begin{aligned}
\mathbb{E}_{\mathcal{D}}\big((Y - y)^2\big)
&= \mathbb{E}_{\mathcal{D}}(Y^2 - 2Yy + y^2) \\
&= \mathbb{E}_{\mathcal{D}}(Y^2) - 2\,\mathbb{E}_{\mathcal{D}}(Y)\,y + y^2 \\
&= \underbrace{\mathbb{E}_{\mathcal{D}}(Y^2) - \mathbb{E}_{\mathcal{D}}(Y)^2}_{\mathbb{V}_{\mathcal{D}}(Y)}
 + \underbrace{\mathbb{E}_{\mathcal{D}}(Y)^2 - 2\,\mathbb{E}_{\mathcal{D}}(Y)\,y + y^2}_{(\mathbb{E}_{\mathcal{D}}(Y) - y)^2} \\
&= \underbrace{(\mathbb{E}_{\mathcal{D}}(Y) - y)^2}_{\text{Bias}} + \underbrace{\mathbb{V}_{\mathcal{D}}(Y)}_{\text{Variance}}
\end{aligned}
$$

This is the bias-variance decomposition:

• the bias term quantifies how well the model fits the data on average,

• the variance term quantifies how much the model changes across data-sets.

(Geman and Bienenstock, 1992)
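A minimal numerical sketch of this decomposition — our own setup, with an assumed sine target and a degree-3 fit — repeatedly draws training sets, fits a polynomial, and measures bias and variance of the prediction at a fixed point:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)        # the "true" function
x0, y0 = 0.3, f(0.3)                       # fixed test point
D, N, trials = 3, 20, 1000

preds = np.empty(trials)
for i in range(trials):
    x = rng.uniform(0, 1, N)               # a fresh random training set D
    y = f(x) + rng.normal(0, 0.1, N)
    coeffs = np.polyfit(x, y, D)           # least-squares polynomial fit f*
    preds[i] = np.polyval(coeffs, x0)      # Y = f*(x0) for this training set

bias2 = (preds.mean() - y0) ** 2           # (E_D(Y) - y)^2
variance = preds.var()                     # V_D(Y)
mse = ((preds - y0) ** 2).mean()           # E_D((Y - y)^2)
print(bias2, variance, bias2 + variance, mse)   # the last two values agree
```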


From this comes the bias-variance tradeoff:

[Figure: variance against bias, on log-log axes.]

Reducing the capacity makes f∗ fit the data less on average, which increases the bias term. Increasing the capacity makes f∗ vary a lot with the training data, which increases the variance term.


Is all this probabilistic?

Conceptually, model-fitting and regularization can be interpreted as Bayesian inference.

This approach consists of modeling the parameters A of the model themselves as random quantities with a prior µ_A.

By looking at the data D, we can estimate a posterior distribution for the said parameters,

µ_A(α | D = d) ∝ µ_D(d | A = α) µ_A(α),

and from that their most likely values.

So instead of a penalty term, we define a prior distribution, which is usually more intellectually satisfying.


For instance, consider the model

$$
\forall n, \quad Y_n = \sum_{d=0}^{D} A_d X_n^d + \Delta_n,
$$

where

$$
\forall d, \; A_d \sim \mathcal{N}(0, \xi), \qquad \forall n, \; X_n \sim \mu_X, \; \Delta_n \sim \mathcal{N}(0, \sigma),
$$

all independent.

For clarity, let $A = (A_0, \dots, A_D)$ and $\alpha = (\alpha_0, \dots, \alpha_D)$.

Remember that $\mathcal{D} = \{(X_1, Y_1), \dots, (X_N, Y_N)\}$ is the (random) training set and $d = \{(x_1, y_1), \dots, (x_N, y_N)\}$ is a realization.


$$
\begin{aligned}
\log \mu_A(\alpha \mid \mathcal{D} = d)
&= \log \frac{\mu_{\mathcal{D}}(d \mid A = \alpha)\,\mu_A(\alpha)}{\mu_{\mathcal{D}}(d)} \\
&= \log \mu_{\mathcal{D}}(d \mid A = \alpha) + \log \mu_A(\alpha) - \log Z \\
&= \log \prod_n \mu(x_n, y_n \mid A = \alpha) + \log \mu_A(\alpha) - \log Z \\
&= \log \prod_n \mu(y_n \mid X_n = x_n, A = \alpha)\,\underbrace{\mu(x_n \mid A = \alpha)}_{=\,\mu(x_n)} + \log \mu_A(\alpha) - \log Z \\
&= \log \prod_n \mu(y_n \mid X_n = x_n, A = \alpha) + \log \mu_A(\alpha) - \log Z' \\
&= \underbrace{-\frac{1}{2\sigma^2} \sum_n \Big( y_n - \sum_d \alpha_d x_n^d \Big)^2}_{\text{Gaussian noise on the } Y\text{s}}
 \; \underbrace{-\,\frac{1}{2\xi^2} \sum_d \alpha_d^2}_{\text{Gaussian prior on the } A\text{s}} - \log Z''.
\end{aligned}
$$

Taking ρ = σ²/ξ² gives the penalty term of the previous slides.

Regularization seen through that prism is intuitive: the stronger the prior, the more evidence you need to deviate from it.
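A minimal numerical check of this equivalence — our own sketch; the sine target, σ, ξ, and the helper names are assumptions, not from the slides. The negative log-posterior equals the ρ-penalized least-squares loss divided by 2σ² (up to the dropped log Z''), so the two objectives have the same minimizer:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, xi, D, N = 0.1, 1.0, 5, 30
rho = sigma ** 2 / xi ** 2                 # the equivalence rho = sigma^2 / xi^2

x = rng.uniform(0, 1, N)
y = np.sin(2 * np.pi * x) + rng.normal(0, sigma, N)
X = np.stack([x ** d for d in range(D + 1)], axis=1)

def neg_log_posterior(a):                  # -log mu_A(a | D = d), up to log Z''
    return ((y - X @ a) ** 2).sum() / (2 * sigma ** 2) \
         + (a ** 2).sum() / (2 * xi ** 2)

def penalized_loss(a):                     # sum_n (y_n - sum_d a_d x_n^d)^2 + rho ||a||^2
    return ((y - X @ a) ** 2).sum() + rho * (a ** 2).sum()

a, b = rng.normal(size=D + 1), rng.normal(size=D + 1)
# Differences between the two objectives agree up to the factor 2 sigma^2:
print(neg_log_posterior(a) - neg_log_posterior(b),
      (penalized_loss(a) - penalized_loss(b)) / (2 * sigma ** 2))

# Their common minimizer is the MAP / ridge solution:
alpha_map = np.linalg.solve(X.T @ X + rho * np.eye(D + 1), X.T @ y)
print(alpha_map)
```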


Proper evaluation protocols

Learning algorithms, in particular deep-learning ones, require the tuning of many meta-parameters.

These parameters have a strong impact on the performance, resulting in a “meta” over-fitting through experiments.

We must be extra careful with performance estimation.

Running the same experiment 100 times on MNIST, with randomized weights, we get:

Worst    Median    Best
1.3%     1.0%      0.82%


The ideal development cycle is

Write code → Train → Test → Paper

or in practice something like

Write code ⇄ Train (many rounds) → Test → Paper

There may be over-fitting, but it does not bias the final performance evaluation.


Unfortunately, it often looks like

Write code ⇄ Train ⇄ Test (many rounds) → Paper

⚠ This should be avoided at all costs. The standard strategy is to have a separate validation set for the tuning (a split sketch is given below):

Write code ⇄ Train ⇄ Validation (many rounds) → Test → Paper

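A minimal sketch of such a protocol — our own code, with a synthetic data set and the polynomial degree standing in for a meta-parameter: tuning uses the validation set only, and the test set is touched once at the end.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 1000)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 1000)

# Split once: 60% train, 20% validation, 20% test.
perm = rng.permutation(1000)
tr, va, te = perm[:600], perm[600:800], perm[800:]

def validation_mse(deg):
    coeffs = np.polyfit(x[tr], y[tr], deg)            # fit on train only
    return ((np.polyval(coeffs, x[va]) - y[va]) ** 2).mean()

# Tune the meta-parameter (here the degree) on the validation set...
best_deg = min(range(1, 10), key=validation_mse)

# ...and report the test error once, with the selected degree.
coeffs = np.polyfit(x[tr], y[tr], best_deg)
print(best_deg, ((np.polyval(coeffs, x[te]) - y[te]) ** 2).mean())
```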

When data is scarce, one can use cross-validation: average through multiple random splits of the data into a training and a validation set.

There is no unbiased estimator of the variance of cross-validation valid under all distributions (Bengio and Grandvalet, 2004).

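A minimal sketch of the standard K-fold variant — our own code, again with a synthetic data set and the polynomial degree as the meta-parameter:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)                 # scarce data
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 100)

def cv_error(deg, K=5):
    folds = np.array_split(rng.permutation(100), K)
    errs = []
    for k in range(K):
        va = folds[k]                      # one fold for validation...
        tr = np.concatenate([folds[j] for j in range(K) if j != k])
        coeffs = np.polyfit(x[tr], y[tr], deg)   # ...the rest for training
        errs.append(((np.polyval(coeffs, x[va]) - y[va]) ** 2).mean())
    return np.mean(errs)                   # average over the K splits

print({d: round(cv_error(d), 4) for d in range(1, 8)})
```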

Some data-sets (MNIST!) have been used by thousands of researchers, over millions of experiments, in hundreds of papers.

The global overall process looks more like

Write code ⇄ Train ⇄ Test (iterated across the whole community) → Paper


“Cheating” in machine learning, from bad to “are you kidding?”:

• “early evaluation stopping”,

• meta-parameter (over-)tuning,

• data-set selection,

• algorithm data-set specific clauses,

• seed selection.

Top-tier conferences are demanding regarding experiments, and are biased against “complicated” pipelines.

The community pushes toward accessible implementations, reference data-sets, leader boards, and constant upgrades of benchmarks.


Standard clustering and embedding

Deep learning models combine embeddings and dimension reduction operations.

They parametrize and re-parametrize the input signal multiple times, under more invariant and interpretable forms.

To get an intuition of how this is possible, we consider here two standard algorithms:

• K-means, and

• Principal Component Analysis (PCA).

We will illustrate these methods on our two favorite data-sets.


MNIST data-set

28 × 28 grayscale images, 60k train samples, 10k test samples.

CIFAR10 data-set

32 × 32 color images, 50k train samples, 10k test samples.

(Krizhevsky, 2009, chap. 3)

Given

$$x_n \in \mathbb{R}^D, \quad n = 1, \dots, N,$$

and a fixed number of clusters K > 0, K-means tries to find K “centroids” that span the training population uniformly.

Given a point, the index of its closest centroid is a good coding.


Formally, [Lloyd's algorithm for] K-means (approximately) solves

$$
\operatorname*{argmin}_{c_1, \dots, c_K \in \mathbb{R}^D} \; \sum_n \min_k \| x_n - c_k \|^2.
$$

This is achieved with a random initialization of $c_1^0, \dots, c_K^0$, followed by repeating until convergence:

$$
\forall n, \; k_n^t = \operatorname*{argmin}_k \, \| x_n - c_k^t \| \qquad (1)
$$

$$
\forall k, \; c_k^{t+1} = \frac{1}{\left|\{ n : k_n^t = k \}\right|} \sum_{n : k_n^t = k} x_n \qquad (2)
$$

At every iteration, (1) each sample is associated to its closest centroid's cluster, and (2) each centroid is updated to the average of its cluster.

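A minimal numpy sketch of Lloyd's algorithm — our own code; initializing the centroids from random training points is an assumption, and empty clusters are not handled:

```python
import numpy as np

def k_means(x, K, steps=100, seed=0):
    rng = np.random.default_rng(seed)
    c = x[rng.choice(len(x), K, replace=False)]      # random initial centroids
    for _ in range(steps):
        # (1) assign each sample to its closest centroid
        k = np.argmin(((x[:, None] - c[None]) ** 2).sum(-1), axis=1)
        # (2) move each centroid to the average of its cluster
        new_c = np.stack([x[k == j].mean(0) for j in range(K)])
        if np.allclose(new_c, c):                     # converged
            break
        c = new_c
    return c, k

# Three well-separated 2d Gaussian blobs as a toy example:
x = np.concatenate([np.random.randn(100, 2) + m
                    for m in ([0, 0], [5, 5], [0, 5])])
centroids, labels = k_means(x, K=3)
print(centroids)
```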

[Figures (12 panels): successive K-means iterations on a 2d point cloud in [−1, 1]², showing the samples and the centroids.]

We can apply that algorithm to images from MNIST (28 × 28 × 1) or CIFAR (32 × 32 × 3) by considering them as vectors from R⁷⁸⁴ and R³⁰⁷², respectively.

Centroids can similarly be visualized as images.

Clustering can be done per class, or with all the classes mixed.

[Figures: centroids visualized as images for K = 1, 2, 4, 8, 16 and K = 32, on MNIST, then on CIFAR10.]

Principal Component Analysis (PCA) also aims at extracting information, in an L2 sense. Instead of clusters, it looks for an “affine subspace”, i.e. a point and a basis, that spans the data.

Given data-points

$$x_n \in \mathbb{R}^D, \quad n = 1, \dots, N,$$

(A) compute the average and center the data

$$
\bar{x} = \frac{1}{N} \sum_n x_n, \qquad \forall n, \; x_n^{(0)} = x_n - \bar{x},
$$

and then, for t = 1, …, D,

(B) pick the direction and project the data

$$
v_t = \operatorname*{argmax}_{\|v\|=1} \sum_n \left( v \cdot x_n^{(t-1)} \right)^2, \qquad
\forall n, \; x_n^{(t)} = x_n^{(t-1)} - \left( v_t \cdot x_n^{(t-1)} \right) v_t.
$$


Although this is a simple way to envision PCA, standard implementations rely on an eigendecomposition. With

$$
X = \begin{pmatrix} \text{—}\; x_1 \;\text{—} \\ \vdots \\ \text{—}\; x_N \;\text{—} \end{pmatrix}
$$

we have

$$
\sum_n (v \cdot x_n)^2
= \left\| \begin{pmatrix} v \cdot x_1 \\ \vdots \\ v \cdot x_N \end{pmatrix} \right\|_2^2
= \left\| v X^T \right\|_2^2
= (v X^T)(v X^T)^T
= v (X^T X) v^T.
$$

From this we can derive that $v_1, v_2, \dots, v_D$ are the eigenvectors of $X^T X$, ranked according to [the absolute values of] their eigenvalues.

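A minimal numpy sketch of PCA via this eigendecomposition — our own code; `np.linalg.eigh` is used since XᵀX is symmetric:

```python
import numpy as np

def pca(x):
    x_bar = x.mean(0)
    X = x - x_bar                          # centered data, one row per sample
    w, V = np.linalg.eigh(X.T @ X)         # eigendecomposition of X^T X
    order = np.argsort(w)[::-1]            # rank by decreasing eigenvalue
    return x_bar, V[:, order].T            # v_1, ..., v_D as rows

# A correlated 2d point cloud as a toy example:
x = np.random.randn(500, 2) @ np.array([[3.0, 0.0], [1.0, 0.5]])
x_bar, V = pca(x)
print(V[0])                                # leading principal direction
```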

[Figures (4 panels): a 2d point cloud in [−2, 2]², its mean, and the PCA basis drawn at the mean.]

As for K-means, we can apply that algorithm to images from MNIST or CIFAR by considering them as vectors.

For any sample x and any T, we can compute a reconstruction using T vectors from the PCA basis, i.e.

$$
\bar{x} + \sum_{t=1}^{T} (v_t \cdot x)\, v_t.
$$
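A minimal sketch of such a truncated reconstruction — our own code, with random vectors standing in for flattened images, and the sample centered before projecting, which matches step (A):

```python
import numpy as np

def pca(x):                                 # same helper as in the earlier sketch
    x_bar = x.mean(0)
    X = x - x_bar
    w, V = np.linalg.eigh(X.T @ X)
    return x_bar, V[:, np.argsort(w)[::-1]].T

def reconstruct(x, x_bar, V, T):
    c = x - x_bar                           # center the sample as in step (A)
    # keep only the T leading components: x_bar + sum_t (v_t . c) v_t
    return x_bar + sum((V[t] @ c) * V[t] for t in range(T))

x_data = np.random.randn(200, 50)           # stand-in for flattened images
x_bar, V = pca(x_data)
x0 = x_data[0]
for T in (1, 10, 50):
    # the reconstruction error decreases as T grows, reaching ~0 at T = D
    print(T, np.linalg.norm(reconstruct(x0, x_bar, V, T) - x0))
```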

[Figures (4 slides): image grids with columns labeled x, v1, v2, …, v12.]


These results show that even crude embeddings capture something meaningful: changes in pixel intensity, as expected, but also deformations in the “indexing” space (i.e. the image plane).

However, translations and deformations damage the representation badly, and “composition” (e.g. an object on a background) is not handled at all.


These strengths and shortcomings provide an intuitive motivation for “deep neural networks”, and the rest of this course.

We would like

• to use many encodings “of these sorts” for small local structures with limited variability,

• to have different “channels” for different components,

• to process at multiple scales.

Computationally, we would like to deal with large signals and large training sets, so we need to avoid super-linear cost in one or the other.


Practical session:

https://fleuret.org/dlc/dlc-practical-2.pdf


The end

