
Transcript of 2. Learning, capacity, basic embeddings (EE-559 – Deep learning, François Fleuret)

EE-559 – Deep learning

2. Learning, capacity, basic embeddings

François Fleuret

https://fleuret.org/dlc/

[version of: June 14, 2018]

ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE

Learning from data

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 2 / 77

The general objective of machine learning is to capture regularity in data to make predictions.

In our regression example, we modeled age and blood pressure as being linearly related, to predict the latter from the former.

There are multiple types of inference that we can roughly split into three categories:

• classification (e.g. object recognition, cancer detection, speech processing),

• regression (e.g. customer satisfaction, stock prediction, epidemiology), and

• density estimation (e.g. outlier detection, data visualization, sampling/synthesis).

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 3 / 77

The standard formalization considers a measure of probability

µ_{X,Y}

over the observation/value of interest, and i.i.d. training samples

(x_n, y_n), n = 1, …, N.

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 4 / 77

Intuitively, for classification it can often be interpreted as

µ_{X,Y}(x, y) = µ_{X|Y=y}(x) P(Y = y)

that is, draw Y first, and given its value, generate X.

Here µ_{X|Y=y} stands for the population of the observable signal for class y (e.g. "sound of an /e/", "image of a cat").

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 5 / 77

For regression, one would interpret the joint law more naturally as

µ_{X,Y}(x, y) = µ_{Y|X=x}(y) µ_X(x)

which would be: first, generate X, and given its value, generate Y.

In the simple case

Y = f(X) + ε

where f is the deterministic dependency between x and y, and ε is a random noise, independent of X.

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 6 / 77

With such a model, we can more precisely define the three types of inference we introduced before:

Classification:

• (X, Y) random variables on Z = R^D × {1, …, C},
• we want to estimate argmax_y P(Y = y | X = x).

Regression:

• (X, Y) random variables on Z = R^D × R,
• we want to estimate E(Y | X = x).

Density estimation:

• X random variable on Z = R^D,
• we want to estimate µ_X.

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 7 / 77

The boundaries between these categories are fuzzy:

• regression allows us to do classification through class scores,

• density models allow us to do classification thanks to Bayes' law,

etc.

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 8 / 77

We call generative the classification methods that rely on an explicit data model, and discriminative the ones that bypass such a modeling.

Example: can we predict a Brazilian basketball player's gender G from his/her height H?

Females: 190 182 188 184 196 173 180 193 179 186 185 169

Males: 192 190 183 199 200 190 195 184 190 203 205 201

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 9 / 77

In the generative approach, we model µ_{H|G=g}(h)

[figure: estimated height densities for males and females, over heights 140 to 240]

and use Bayes' law

P(G = g | H = h) = µ_{H|G=g}(h) P(G = g) / µ_H(h)

[figure: the posterior probability as a function of h, with the resulting decision threshold at h ≈ 189.23]
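As an illustration (not in the original slides), here is a minimal sketch of that generative rule, assuming Gaussian class-conditional densities fitted to the heights above; the Gaussian assumption and all variable names are ours:

import math
import torch

# Heights listed above; one Gaussian per class is an assumption of this sketch.
females = torch.tensor([190., 182, 188, 184, 196, 173, 180, 193, 179, 186, 185, 169])
males = torch.tensor([192., 190, 183, 199, 200, 190, 195, 184, 190, 203, 205, 201])

def gaussian_pdf(h, mu, sigma):
    return torch.exp(-0.5 * ((h - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posterior_male(h):
    # mu_{H|G=g} estimated by each class's empirical mean/std, P(G=g) by 1/2 each
    p_m = gaussian_pdf(h, males.mean(), males.std()) * 0.5
    p_f = gaussian_pdf(h, females.mean(), females.std()) * 0.5
    return p_m / (p_m + p_f)

print(posterior_male(torch.tensor(185.0)))  # posterior probability of "male" at 185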

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 10 / 77

In the discriminative approach we directly pick the threshold that works the best on the data:

[figure: height histograms of the two classes, with the selected threshold at 187.5]

Note that it is harder to design a confidence indicator.
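Continuing the hypothetical sketch above, the discriminative threshold can be picked by direct minimization of the empirical error (again our own illustration, not the slides' code):

heights = torch.cat((females, males))
labels = torch.cat((torch.zeros(12), torch.ones(12)))  # 0 = female, 1 = male

def error_rate(t):
    # classify as "male" when the height is at least t
    predictions = (heights >= t).float()
    return (predictions != labels).float().mean().item()

best_t = min(heights.tolist(), key=error_rate)
print(best_t, error_rate(best_t))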

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 11 / 77

Risk, empirical risk

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 12 / 77

Learning consists of finding in a set F of functionals a "good" f* (or its parameters' values), usually defined through a loss

l : F × Z → R

such that l(f, z) increases with how wrong f is on z. For instance

• for classification: l(f, (x, y)) = 1_{f(x) ≠ y},

• for regression: l(f, (x, y)) = (f(x) − y)²,

• for density estimation: l(q, z) = − log q(z).

The loss may include additional terms related to f itself.
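For concreteness, a minimal sketch of these three losses in PyTorch (our own illustration; f and q are hypothetical callables):

import torch

def loss_classification(f, x, y):
    return (f(x) != y).float()      # 0-1 loss

def loss_regression(f, x, y):
    return (f(x) - y) ** 2          # quadratic loss

def loss_density(q, z):
    return -torch.log(q(z))         # negative log-likelihood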

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 13 / 77

We are looking for an f with a small expected risk

R(f) = E_Z(l(f, Z)),

which means that our learning procedure would ideally choose

f* = argmin_{f ∈ F} R(f).

Although this quantity is unknown, if we have i.i.d. training samples

D = {Z_1, …, Z_N},

we can compute an estimate, the empirical risk:

R(f; D) = E_D(l(f, Z)) = 1/N ∑_{n=1}^{N} l(f, Z_n).
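A sketch of that estimate as code (our own, hypothetical signature, with the training set given as a list of samples z_n):

def empirical_risk(loss, f, data):
    # average of the per-sample losses over the training set
    return sum(loss(f, z) for z in data) / len(data)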

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 14 / 77

We have

E_{Z_1,…,Z_N}(R(f; D)) = E_{Z_1,…,Z_N}( 1/N ∑_{n=1}^{N} l(f, Z_n) )

= 1/N ∑_{n=1}^{N} E_{Z_n}(l(f, Z_n))

= 1/N ∑_{n=1}^{N} E_Z(l(f, Z))

= E_Z(l(f, Z))

= R(f).

The empirical risk is an unbiased estimator of the expected risk.

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 15 / 77

Finally, given D, F, and l, "learning" aims at computing

f* = argmin_{f ∈ F} R(f; D).

• Can we bound R(f) with R(f; D)?

Yes, if f is not chosen using D. Since the Z_n are independent, we just need to take into account the variance of R(f; D).

• Can we bound R(f*) with R(f*; D)?

⚠ Unfortunately not simply, and not without additional constraints on F.

For instance, if |F| = 1, we can!

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 16 / 77

Note that in practice, we call "loss" both the functional

l : F × Z → R

and the empirical risk minimized during training

L(f) = 1/N ∑_{n=1}^{N} l(f, z_n).

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 17 / 77

Over and under-fitting, intuition

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 18 / 77

You want to hire someone, and you evaluate candidates by asking them ten technical yes/no questions.

Would you feel confident if you interviewed one candidate and he makes a perfect score?

What about interviewing ten candidates and picking the best?

What about one thousand?

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 19 / 77

With Q^n_k ∼ B(0.5), n = 1, …, 1000, k = 1, …, 10, all independent, we have

∀n, P(∀k, Q^n_k = 1) = 1/1024

and

P(∃n, ∀k, Q^n_k = 1) ≈ 0.62.

There is a 62% chance that, among 1,000 candidates answering completely at random, one will score perfectly.

It is quite intuitive that selecting a candidate based on a statistical estimator biases that estimator for the selected candidate.
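A one-line numerical check of that 0.62 (our own, not from the slides):

p = 1 - (1 - 1 / 1024) ** 1000
print(p)  # ~0.62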

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 20 / 77

Over and under-fitting, capacity, K-nearest-neighbors

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 21 / 77

A simple classification procedure is the "K-nearest neighbors."

Given

(x_n, y_n) ∈ R^D × {1, …, C}, n = 1, …, N,

to predict the y associated to a new x, take the y_n of the closest x_n:

n*(x) = argmin_n ‖x_n − x‖

f*(x) = y_{n*(x)}.

This recipe corresponds to K = 1, and makes the empirical training error zero.
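A minimal K-NN sketch in PyTorch (our own illustration; x_train is an N×D tensor, y_train a tensor of integer labels):

import torch

def knn_predict(x_train, y_train, x, k=1):
    # squared distances from x to every training point
    d = (x_train - x).pow(2).sum(dim=1)
    # indices of the k closest points, then majority vote among their labels
    _, idx = d.topk(k, largest=False)
    return y_train[idx].mode().values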

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 22 / 77

[figure: nearest-neighbor prediction with K = 1]

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 23 / 77

Under mild regularity assumptions on µ_{X,Y}, for N → ∞ the asymptotic error rate of the 1-NN is less than twice the (optimal!) Bayes error rate.

It can be made more stable by looking at the K > 1 closest training points, and taking the majority vote.

If we also let K → ∞ "not too fast", the error rate converges to the (optimal!) Bayes error rate.

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 24 / 77

[figure: a training set and the corresponding prediction map with K = 1]

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 25 / 77

[figure: a second training set and the corresponding prediction map with K = 1]

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 26 / 77

[figure: a training set, the vote counts, and the prediction map with K = 51]

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 27 / 77

[figure: a second training set, the vote counts, and the prediction map with K = 51]

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 28 / 77

[figure: train and test error as a function of K (log scale, K from 1 to 1000), with the under-fitting and over-fitting regimes indicated]

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 29 / 77

Over and under-fitting, capacity, polynomials

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 30 / 77

Consider a polynomial model

∀x, α_0, …, α_D ∈ R, f(x; α) = ∑_{d=0}^{D} α_d x^d,

and training points (x_n, y_n) ∈ R², n = 1, …, N. Minimize the quadratic loss

L(α) = ∑_n (f(x_n; α) − y_n)²
     = ∑_n ( ∑_{d=0}^{D} α_d x_n^d − y_n )²
     = ‖ X α − y ‖²

where X is the N × (D+1) matrix with entries X_{n,d} = x_n^d, α = (α_0, …, α_D)ᵀ, and y = (y_1, …, y_N)ᵀ.

This is a standard quadratic problem, for which we have efficient algorithms.

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 31 / 77

argmin_α ‖ X α − y ‖²   (with X, α, y as above)

import torch

def fit_polynomial(D, x, y):
    N = x.size(0)
    X = torch.Tensor(N, D + 1)
    for d in range(D + 1):
        X[:, d] = x.pow(d)
    Y = y.view(N, 1)
    # LAPACK's GEneralized Least-Squares
    alpha, _ = torch.gels(Y, X)
    # torch.gels returns a max(N, D+1) x 1 tensor whose first D+1 rows hold
    # the solution, hence the narrow (torch 0.3.1)
    return alpha.narrow(0, 0, D + 1)
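torch.gels was removed in later PyTorch releases; an equivalent sketch with a current API (our own, assuming torch ≥ 1.9) would be:

def fit_polynomial(D, x, y):
    # Vandermonde design matrix, one column per power of x
    X = torch.stack([x.pow(d) for d in range(D + 1)], dim=1)
    # least-squares solution of X alpha ≈ y
    return torch.linalg.lstsq(X, y.view(-1, 1)).solution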

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 32 / 77

import math

D, N = 4, 100
x = torch.linspace(-math.pi, math.pi, N)
y = x.sin()

alpha = fit_polynomial(D, x, y)

X = torch.Tensor(N, D + 1)
for d in range(D + 1):
    X[:, d] = x.pow(d)
z = X.mm(alpha).view(-1)

for k in range(N):
    print(x[k], y[k], z[k])

[figure: the resulting degree-4 fit f* plotted against sin over [−π, π]]

We can use that model on a synthetic example with noisy training points, to illustrate how the prediction changes when we increase the degree or the regularization.

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 33 / 77

[figure: a noisy synthetic training set, and the fitted polynomial f* for degrees D = 0 through 9]

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 34 / 77

[figure: for degrees D = 0 through 9, the fitted polynomial f* compared to the true function f]

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 35 / 77

We can reformulate this control of the degree with a penalty

L(α) = ∑_n (f(x_n; α) − y_n)² + ∑_d l_d(α_d)

where

l_d(α) = 0 if d ≤ D or α = 0, and +∞ otherwise.

Such a penalty kills any term of degree > D.

This motivates the use of more subtle variants. For instance, to keep all this quadratic,

L(α) = ∑_n (f(x_n; α) − y_n)² + ρ ∑_d α_d².
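A minimal sketch of this quadratic ("ridge") penalty in closed form, solving (XᵀX + ρI)α = Xᵀy (our own illustration, assuming torch ≥ 1.9):

import torch

def fit_polynomial_ridge(D, x, y, rho):
    X = torch.stack([x.pow(d) for d in range(D + 1)], dim=1)
    # regularized normal equations: (X^T X + rho I) alpha = X^T y
    A = X.t().mm(X) + rho * torch.eye(D + 1)
    return torch.linalg.solve(A, X.t().mv(y))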

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 36 / 77

[figure: the degree-9 fit f* compared to the true f for ρ = 1e1, 1e0, 1e-1, …, 1e-13, and ρ = 0]

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 37 / 77

We define the capacity of a set of predictors as its ability to model an arbitrary functional. This is a vague definition, difficult to make formal.

A mathematically precise notion is the Vapnik–Chervonenkis dimension of a set of functions, which, in the binary classification case, is the cardinality of the largest set that can be labeled arbitrarily (Vapnik, 1995).

It is a very powerful concept, but is poorly adapted to neural networks. We will not say more about it in this course.

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 38 / 77

Although the capacity is hard to define precisely, it is quite clear in practice how to modulate it for a given class of models.

In particular, one can control over-fitting either by

• impoverishing the space F (fewer functionals, early stopping), or

• making the choice of f* less dependent on the data (penalty on coefficients, margin maximization, ensemble methods).

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 39 / 77

Bias-variance dilemma

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 40 / 77

[figure: for degrees D = 0 through 9, the expectation E(f*) of the fit over random training sets, compared to the true f]

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 41 / 77

[figure: for D = 9 and ρ = 1e1 down to 1e-13 and ρ = 0, the expectation E(f*) of the fit over random training sets, compared to the true f]

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 42 / 77


Let x be fixed, and y = f (x) the “true” value associated to it.

Let Y = f ∗(x) be the value we predict.

If we consider that the training set D is a random quantity, so is f*, and consequently Y.

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 43 / 77

We have

E_D((Y − y)²) = E_D(Y² − 2Yy + y²)

= E_D(Y²) − 2 E_D(Y) y + y²

= [E_D(Y²) − E_D(Y)²] + [E_D(Y)² − 2 E_D(Y) y + y²]

= (E_D(Y) − y)²   [bias]   + V_D(Y)   [variance]

where V_D(Y) = E_D(Y²) − E_D(Y)² is the first bracket and (E_D(Y) − y)² the second.

This is the bias-variance decomposition:

• the bias term quantifies how much the model fits the data on average,

• the variance term quantifies how much the model changes across data-sets.

(Geman and Bienenstock, 1992)
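As an illustration, a sketch estimating both terms by Monte Carlo, under a hypothetical noisy-sine setup similar to the one used above, re-using the fit_polynomial defined earlier (all names and constants are ours):

def bias_variance_at(x0, D, n_sets=1000, N=20):
    preds = torch.empty(n_sets)
    for i in range(n_sets):
        # a fresh random training set for each fit
        x = torch.empty(N).uniform_(-math.pi, math.pi)
        y = x.sin() + 0.1 * torch.randn(N)
        alpha = fit_polynomial(D, x, y).view(-1)
        preds[i] = sum(alpha[d].item() * x0 ** d for d in range(D + 1))
    bias2 = (preds.mean().item() - math.sin(x0)) ** 2
    return bias2, preds.var().item()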

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 44 / 77

From this comes the bias-variance tradeoff:

[figure: variance versus bias, both on log scales]

Reducing the capacity makes f* fit the data less on average, which increases the bias term. Increasing the capacity makes f* vary a lot with the training data, which increases the variance term.

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 45 / 77

Is all this probabilistic?

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 46 / 77

Conceptually, model-fitting and regularization can be interpreted as Bayesian inference.

This approach consists of modeling the parameters A of the model themselves as random quantities with a prior µ_A.

By looking at the data D, we can estimate a posterior distribution for these parameters,

µ_A(α | D = d) ∝ µ_D(d | A = α) µ_A(α),

and from that their most likely values.

So instead of a penalty term, we define a prior distribution, which is usually more intellectually satisfying.

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 47 / 77

For instance, consider the model

∀n, Y_n = ∑_{d=0}^{D} A_d X_n^d + Δ_n,

where

∀d, A_d ∼ N(0, ξ), ∀n, X_n ∼ µ_X, Δ_n ∼ N(0, σ),

all independent.

For clarity, let A = (A_0, …, A_D) and α = (α_0, …, α_D).

Remember that D = {(X_1, Y_1), …, (X_N, Y_N)} is the (random) training set and d = {(x_1, y_1), …, (x_N, y_N)} is a realization.

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 48 / 77

log µ_A(α | D = d)

= log [ µ_D(d | A = α) µ_A(α) / µ_D(d) ]

= log µ_D(d | A = α) + log µ_A(α) − log Z

= log ∏_n µ(x_n, y_n | A = α) + log µ_A(α) − log Z

= log ∏_n µ(y_n | X_n = x_n, A = α) µ(x_n | A = α) + log µ_A(α) − log Z      [µ(x_n | A = α) = µ(x_n)]

= log ∏_n µ(y_n | X_n = x_n, A = α) + log µ_A(α) − log Z'

= − 1/(2σ²) ∑_n ( y_n − ∑_d α_d x_n^d )²   [Gaussian noise on the Y's]   − 1/(2ξ²) ∑_d α_d²   [Gaussian prior on the A's]   − log Z''.

Taking ρ = σ²/ξ² gives the penalty term of the previous slides.

Regularization seen through that prism is intuitive: the stronger the prior, the more evidence you need to deviate from it.

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 49 / 77

Proper evaluation protocols

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 50 / 77

Learning algorithms, in particular deep-learning ones, require the tuning of many meta-parameters.

These parameters have a strong impact on the performance, resulting in a "meta" over-fitting through experiments.

We must be extra careful with performance estimation.

Running the same experiment 100 times on MNIST, with randomized weights, we get:

Worst    Median    Best
1.3%     1.0%      0.82%

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 51 / 77

The ideal development cycle is

Write code → Train → Test → Paper

or in practice something like

(Write code → Train, iterated) → Test → Paper

There may be over-fitting, but it does not bias the final performance evaluation.

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 52 / 77

Unfortunately, it often looks like

(Write code → Train → Test, iterated) → Paper

⚠ This should be avoided at all costs. The standard strategy is to have a separate validation set for the tuning.

(Write code → Train → Validation, iterated) → Test → Paper

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 53 / 77

When data is scarce, one can use cross-validation: average through multiple random splits of the data into a training and a validation set.

There is no unbiased estimator of the variance of cross-validation valid under alldistributions (Bengio and Grandvalet, 2004).
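A minimal sketch of such random-split averaging (our own illustration; train_model and evaluate are hypothetical callables):

import torch

def cross_validate(x, y, train_model, evaluate, n_splits=10, val_fraction=0.2):
    N = x.size(0)
    n_val = int(val_fraction * N)
    scores = []
    for _ in range(n_splits):
        perm = torch.randperm(N)
        val, train = perm[:n_val], perm[n_val:]
        model = train_model(x[train], y[train])
        scores.append(evaluate(model, x[val], y[val]))
    return sum(scores) / len(scores)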

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 54 / 77

Some data-sets (MNIST!) have been used by thousands of researchers, over millions of experiments, in hundreds of papers.

The global overall process looks more like

(Write code → Train → Test, iterated across the whole community) → Paper

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 55 / 77

“Cheating” in machine learning, from bad to “are you kidding?”:

• “Early evaluation stopping”,

• meta-parameter (over-)tuning,

• data-set selection,

• algorithm data-set specific clauses,

• seed selection.

Top-tier conferences are demanding regarding experiments, and are biased against "complicated" pipelines.

The community pushes toward accessible implementations, reference data-sets, leader boards, and constant upgrades of benchmarks.

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 56 / 77


Standard clustering and embedding

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 57 / 77

Deep learning models combine embeddings and dimension reduction operations.

They re-parametrize the input signal multiple times into more invariant and interpretable forms.

To get an intuition of how this is possible, we consider here two standard algorithms:

• K -means, and

• Principal Component Analysis (PCA).

We will illustrate these methods on our two favorite data-sets.

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 58 / 77


MNIST data-set

28× 28 grayscale images, 60k train samples, 10k test samples.

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 59 / 77

CIFAR10 data-set

32× 32 color images, 50k train samples, 10k test samples.

(Krizhevsky, 2009, chap. 3)

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 60 / 77

Given

x_n ∈ R^D, n = 1, …, N,

and a fixed number of clusters K > 0, K-means tries to find K "centroids" that span uniformly the training population.

Given a point, the index of its closest centroid is a good coding.

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 61 / 77

Formally, [Lloyd's algorithm for] K-means (approximately) solves

argmin_{c_1,…,c_K ∈ R^D} ∑_n min_k ‖x_n − c_k‖².

This is achieved with a random initialization of c^0_1, …, c^0_K followed by repeating until convergence:

∀n, k^t_n = argmin_k ‖x_n − c^t_k‖    (1)

∀k, c^{t+1}_k = 1/|{n : k^t_n = k}| ∑_{n : k^t_n = k} x_n    (2)

At every iteration, (1) each sample is associated to its closest centroid's cluster, and (2) each centroid is updated to the average of its cluster.
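A minimal PyTorch sketch of these two steps (our own illustration, assuming x is an N×D tensor and torch ≥ 1.1 for torch.cdist):

import torch

def kmeans(x, K, n_iter=100):
    # random initialization: K training points picked as initial centroids
    c = x[torch.randperm(x.size(0))[:K]].clone()
    for _ in range(n_iter):
        # (1) assign every sample to its closest centroid
        assignment = torch.cdist(x, c).argmin(dim=1)
        # (2) move every non-empty centroid to the mean of its cluster
        for k in range(K):
            if (assignment == k).any():
                c[k] = x[assignment == k].mean(dim=0)
    return c, assignment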

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 62 / 77

[figure: successive iterations of K-means on a 2-d toy data set in [−1, 1]²]

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 63 / 77

We can apply that algorithm to images from MNIST (28 × 28 × 1) or CIFAR (32 × 32 × 3) by considering them as vectors from R^784 and R^3072 respectively.

Centroids can similarly be visualized as images.

Clustering can be done per class, or with all the classes mixed.

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 64 / 77

[figure: K-means centroids for K = 1, 2, 4, 8, 16, and 32, visualized as images]

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 65 / 77

[figure: K-means centroids for K = 1, 2, 4, 8, 16, and 32, visualized as images]

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 66 / 77

The Principal Component Analysis (PCA) also aims at extracting information in an L2 sense. Instead of clusters, it looks for an "affine subspace", i.e. a point and a basis, that spans the data.

Given data points

x_n ∈ R^D, n = 1, …, N,

(A) compute the average and center the data

x̄ = 1/N ∑_n x_n

∀n, x^(0)_n = x_n − x̄

and then for t = 1, …, D,

(B) pick the direction and project the data

v_t = argmax_{‖v‖=1} ∑_n ( v · x^(t−1)_n )²

∀n, x^(t)_n = x^(t−1)_n − ( v_t · x^(t−1)_n ) v_t.

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 67 / 77

Although this is a simple way to envision PCA, standard implementations rely on an eigendecomposition. With

X the N × D matrix whose rows are x_1, …, x_N,

we have

∑_n (v · x_n)² = ‖ (v · x_1, …, v · x_N) ‖²₂ = ‖ v Xᵀ ‖²₂ = (v Xᵀ)(v Xᵀ)ᵀ = v (Xᵀ X) vᵀ.

From this we can derive that v_1, v_2, …, v_D are the eigenvectors of Xᵀ X ranked according to [the absolute values of] their eigenvalues.
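A minimal sketch of that eigendecomposition route (our own illustration, assuming torch ≥ 1.8 for torch.linalg.eigh):

import torch

def pca_basis(x):
    # center the data, then eigendecompose the symmetric matrix X^T X
    x_bar = x.mean(dim=0)
    x0 = x - x_bar
    eigenvalues, eigenvectors = torch.linalg.eigh(x0.t().mm(x0))
    # eigh returns eigenvalues in ascending order; re-order so v_1 comes first
    order = eigenvalues.argsort(descending=True)
    return x_bar, eigenvectors[:, order].t()  # rows are v_1, v_2, ...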

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 68 / 77

[figure: a 2-d toy data set, its mean, and the two PCA basis vectors]

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 69 / 77

As for K-means, we can apply that algorithm to images from MNIST or CIFAR by considering them as vectors.

For any sample x and any T, we can compute a reconstruction using T vectors from the PCA basis, i.e.

x̄ + ∑_{t=1}^{T} (v_t · x) v_t.
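A sketch of that reconstruction using the pca_basis helper defined above (our own convention: the sample is centered before projection):

def pca_reconstruct(x, x_bar, v, T):
    # coefficients of the centered sample on the first T basis vectors
    coeffs = v[:T].mv(x - x_bar)
    return x_bar + coeffs @ v[:T]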

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 70 / 77

[figure: x and v1 … v12, shown as images]

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 71 / 77

[figure: x and v1 … v12, shown as images]

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 72 / 77

[figure: x and v1 … v12, shown as images]

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 73 / 77

[figure: x and v1 … v12, shown as images]

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 74 / 77

These results show that even crude embeddings capture something meaningful: changes in pixel intensity, as expected, but also deformations in the "indexing" space (i.e. the image plane).

However, translations and deformations damage the representation badly. "Composition" (e.g. an object on a background) is not handled at all.

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 75 / 77

These strengths and shortcomings provide an intuitive motivation for "deep neural networks", and the rest of this course.

We would like

• to use many encodings "of these sorts" for small local structures with limited variability,

• to have different "channels" for different components,

• to process at multiple scales.

Computationally, we would like to deal with large signals and large training sets, so we need to avoid super-linear cost in one or the other.

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 76 / 77


Practical session:

https://fleuret.org/dlc/dlc-practical-2.pdf

Francois Fleuret EE-559 – Deep learning / 2. Learning, capacity, basic embeddings 77 / 77

References

Y. Bengio and Y. Grandvalet. No unbiased estimator of the variance of k-fold cross-validation. Journal of Machine Learning Research (JMLR), 5:1089–1105, 2004.

S. Geman and E. Bienenstock. Neural networks and the bias/variance dilemma. Neural Computation, 4:1–58, 1992.

A. Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.

V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.