Statistical Machine Learning, Part I: Statistical Learning Theory
Transcript of lecture slides (marcocuturi.net/Teaching/KU/2016/FIS/Lec4.pdf)
Previous Lecture: Classification
• Classification: mapping objects onto S where |S| < ∞.
• Binary classification: answers to yes/no questions
• Linear classification algorithms: split the yes/no zones with a hyperplane
Yes = {c^T x + b ≥ 0}, No = {c^T x + b < 0}
• How to select c, b given a dataset?
◦ Linear Discriminant Analysis (multivariate Gaussians)
◦ Logistic Regression (classification from a linear regression viewpoint)
◦ Perceptron rule (iterative, random update rule)
◦ brief introduction to Support Vector Machines (optimal margin classifier)
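As a minimal illustration of the decision rule above (the weight vector c, offset b, and test point are made-up values, not from the lecture):

```python
import numpy as np

# Hypothetical parameters of a linear classifier (illustrative values only).
c = np.array([1.5, -0.5])  # weight vector c
b = -1.0                   # offset b

def linear_classify(x):
    """Answer 'Yes' if c^T x + b >= 0, 'No' otherwise."""
    return "Yes" if c @ x + b >= 0 else "No"

print(linear_classify(np.array([2.0, 1.0])))  # c^T x + b = 1.5 -> 'Yes'
```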
Today

• Usual steps when using ML algorithms:

◦ Define the problem (classification? regression? multi-class?)
◦ Gather data
◦ Choose a representation for the data to build a database
◦ Choose a method/algorithm
◦ Choose/estimate parameters based on the training set
◦ Run the algorithm on new points, collect results
◦ ... did I overfit?
Probabilistic Framework
General Framework
• Couples of observations (x, y) appear in nature.

• These observations are x ∈ R^d, y ∈ S.

• S ⊂ R, that is, S could be R, R+, {1, 2, 3, …, L}, or {0, 1}.
• Sometimes only x is visible. We want to guess the most likely y for that x.
• Example 1. x: Height ∈ R, y: Gender ∈ {M, F}.

X is 164cm tall; is X male or female?

• Example 2. x: Height ∈ R, y: Weight ∈ R.

X is 164cm tall; how many kilos does X weigh?
Estimating the relationship between x and y

• To provide a guess ⇔ estimate a function f : R^d → S such that

f(x) ≈ y.

• Ideally, f(x) ≈ y should apply both to

◦ couples (x, y) we have observed in the training set
◦ couples (x, y) we will observe... (guess y from x)
Probabilistic Framework

• We assume that each observation (x, y) arises as an

◦ independent,
◦ identically distributed,

random sample from the same probability law.

• This probability P on R^d × S has a density,

p(X = x, Y = y).

• This also provides us with the marginal probabilities for x and y:

p(Y = y) = ∫_{R^d} p(X = x, Y = y) dx

p(X = x) = ∫_S p(X = x, Y = y) dy
• Assuming that this density p exists is fundamental in statistical learning theory.

• What happens to learning problems if we know p? (In practice this will never happen: we never know p.)

• If we know p, learning problems become trivial

(≈ running a marathon on a motorbike).
Example 1: S = {M,F}, Height vs Gender
[Figure: surface plot of the joint density p(Height, Gender) = p(X, Y), over heights of roughly 160–200 cm for the two genders F and M; density values around 0.005–0.025.]
Example 2: S = R+, Height vs Weight
[Figure: surface plot of the joint density p(Height, Weight), over heights 0–200 cm and weights 0–150 kg; density values on the order of 10^−5.]
Probabilistic Framework
Conditional probability (or density):

p(A, B) = p(A|B) p(B)

• Suppose:

p(X = 184cm, y = M) = 0.015
p(y = M) = 0.5

What is p(X = 184cm | y = M)?

◦ 1. 0.15
◦ 2. 0.03
◦ 3. 0.5
◦ 4. 0.0075
◦ 5. 0.2
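For reference, the answer follows from the product rule above:

p(X = 184cm | y = M) = p(X = 184cm, y = M) / p(y = M) = 0.015 / 0.5 = 0.03, i.e. choice 2.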
Bayes Rule:

p(A|B) = p(B|A) p(A) / p(B)

• Suppose:

p(X = 184cm | y = M) = 0.03
p(y = M) = 0.5
p(X = 184) = 0.02

What is p(y = M | X = 184)?

◦ 1. 0.6
◦ 2. 0.04
◦ 3. 0.75
◦ 4. 0.8
◦ 5. 0.2
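For reference, plugging the numbers into Bayes rule:

p(y = M | X = 184) = p(X = 184 | y = M) p(y = M) / p(X = 184) = (0.03 × 0.5) / 0.02 = 0.75, i.e. choice 3.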
Loss, Risk and Bayes Decision
Building Blocks: Loss (1)
• A loss is a function l : S × R → R+ designed to quantify mistakes:

how good is the prediction f(x) given that the true answer is y?
⇔
how small is l(y, f(x))?

Examples

• S = {0, 1}

◦ 0/1 loss: l(a, b) = δ_{a ≠ b} = 1 if a ≠ b, 0 if a = b

• S = R

◦ squared Euclidean distance: l(a, b) = (a − b)²
◦ q-norm: l(a, b) = ‖a − b‖_q, 0 ≤ q ≤ ∞
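A minimal sketch of these two losses in code (function names are mine, for illustration):

```python
def zero_one_loss(y, y_pred):
    """0/1 loss: 1 if the prediction differs from the truth, 0 otherwise."""
    return 0.0 if y == y_pred else 1.0

def squared_loss(y, y_pred):
    """Squared Euclidean loss for regression (S = R)."""
    return (y - y_pred) ** 2

print(zero_one_loss(1, 0))       # 1.0: wrong class
print(squared_loss(70.0, 65.0))  # 25.0: prediction off by 5
```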
Building Blocks: Risk (2)
• The risk of a predictor f with respect to loss l is

R(f) = E_p[l(Y, f(X))] = ∫_{R^d × S} l(y, f(x)) p(x, y) dx dy

• Risk = average loss of f on all possible couples (x, y), weighted by the probability density:

Risk(f) measures the performance of f w.r.t. l and p.

• Remark: a function f with low risk can make very big mistakes for some x, as long as the probability p(x) of x is small.
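To make the definition concrete, here is a Monte Carlo approximation of R(f) under an assumed law p (the Gaussian model and the predictor below are my made-up choices; in practice p is unknown):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical known law p: X ~ N(170, 10^2), Y = 0.9 X - 80 + N(0, 5^2).
def sample_pair():
    x = rng.normal(170.0, 10.0)
    y = 0.9 * x - 80.0 + rng.normal(0.0, 5.0)
    return x, y

f = lambda x: 0.9 * x - 80.0  # a candidate predictor

# Average squared loss over many draws approximates R(f) (= 25 here).
losses = [(y - f(x)) ** 2 for x, y in (sample_pair() for _ in range(100_000))]
print(np.mean(losses))
```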
A lower bound on the Risk? Bayes Risk
• Since l ≥ 0, R(f) ≥ 0.
• Consider all possible functions from R^d to S; this set is usually written S^{R^d}.

• The Bayes risk is the quantity

R∗ = inf_{f ∈ S^{R^d}} R(f) = inf_{f ∈ S^{R^d}} E_p[l(Y, f(X))]

• An ideal classifier would achieve the Bayes risk.
Bayes Classifier: S = {0, 1}, l is the 0/1 loss

Let us write η(x) = p(Y = 1 | X = x).

• Define the following rule:

g_B(x) = 1 if η(x) ≥ 1/2, 0 otherwise.

The Bayes classifier achieves the Bayes risk:

Theorem 1. R(g_B) = R∗.
Bayes Classifier: S = {0, 1}, l is the 0/1 loss

• Chain rule of conditional probability: p(A, B) = p(B) p(A|B)

• Bayes rule:

p(A|B) = p(B|A) p(A) / p(B)

• A simple way to compute η:

η(x) = p(Y = 1 | X = x) = p(Y = 1, X = x) / p(X = x)
     = p(X = x | Y = 1) p(Y = 1) / p(X = x)
     = p(X = x | Y = 1) p(Y = 1) / [p(X = x | Y = 1) p(Y = 1) + p(X = x | Y = 0) p(Y = 0)]
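A numerical sketch of this computation, assuming (my illustrative choice, not the lecture's data) Gaussian class-conditional heights:

```python
from scipy.stats import norm

# Hypothetical class-conditional densities of height (cm); illustrative only.
p_x_given_1 = norm(178.0, 7.0).pdf  # p(X = x | Y = 1), "male"
p_x_given_0 = norm(165.0, 7.0).pdf  # p(X = x | Y = 0), "female"
p1, p0 = 0.5, 0.5                   # priors p(Y = 1), p(Y = 0)

def eta(x):
    """eta(x) = p(Y = 1 | X = x), computed via Bayes rule."""
    num = p_x_given_1(x) * p1
    return num / (num + p_x_given_0(x) * p0)

def g_B(x):
    """Bayes classifier: predict 1 iff eta(x) >= 1/2."""
    return int(eta(x) >= 0.5)

print(eta(164.0), g_B(164.0))  # eta well below 1/2 here, so g_B predicts 0
```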
Bayes Classifier: S = {0, 1}, l is the 0/1 loss

[Figure: the two class-conditional height densities, Male: p(X | y = 1) and Female: p(X | y = 0), over roughly 150–200 cm; density values up to about 0.05.]

In addition, p(Y = 1) = 0.4871; as a consequence, p(Y = 0) = 1 − 0.4871 = 0.5129.
Bayes Classifier: S = {0, 1}, l is the 0/1 loss

[Figure: the posterior η(x) plotted against height over 140–210 cm, together with the 1/2 threshold; g_B predicts 1 wherever η(x) ≥ 1/2.]
Bayes Estimator: S = R, l is the squared loss

• Consider the following rule:

g_B(x) = E[Y | X = x] = ∫_R y p(Y = y | X = x) dy

Here again, the Bayes estimator achieves the Bayes risk:

Theorem 2. R(g_B) = R∗.
Bayes Estimator: S = R, l is the squared loss

• Using Bayes rule again,

f⋆(x) = E[Y | X = x] = ∫_R y p(Y = y | X = x) dy

= ∫_R y p(X = x | Y = y) p(Y = y) / p(X = x) dy

= ∫_R y p(X = x | Y = y) p(Y = y) / [∫_R p(X = x | Y = u) p(Y = u) du] dy

= [∫_R y p(X = x | Y = y) p(Y = y) dy] / [∫_R p(X = x | Y = y) p(Y = y) dy]
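A numerical sketch of this last formula under an assumed joint model (my illustrative choice: Gaussian prior on weight and Gaussian likelihood for height given weight), with the integrals approximated on a grid:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical model: Y ~ N(70, 10^2) (weight, kg); X | Y = y ~ N(100 + y, 5^2) (height, cm).
prior = norm(70.0, 10.0)

def p_x_given_y(x, y):
    return norm(100.0 + y, 5.0).pdf(x)

def bayes_estimator(x):
    """E[Y | X = x], with both integrals computed on a grid of y values."""
    y_grid = np.linspace(20.0, 120.0, 2001)
    w = p_x_given_y(x, y_grid) * prior.pdf(y_grid)  # integrand weights
    return np.sum(y_grid * w) / np.sum(w)

print(bayes_estimator(164.0))  # ~65.2 kg for a height of 164 cm
```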
In practice: No p, Only Finite Samples
What can we do?
• If we knew the probability p, the Bayes estimator would be impossible to beat.

• In practice, the only thing we can use is a training set,

{(x_i, y_i)}_{i=1,…,n}.

• For instance, a list of heights and genders:

163.0000 F
170.0000 F
175.3000 M
184.0000 M
175.0000 M
172.5000 F
153.5000 F
164.0000 M
163.0000 M
Approximating Risk
• For any function f, we cannot compute its true risk R(f),

R(f) = E_p[l(Y, f(X))],

because we do not know p.

• Instead, we can consider the empirical risk R^emp_n, defined as

R^emp_n(f) = (1/n) Σ_{i=1}^n l(y_i, f(x_i))

• The law of large numbers tells us that for any given f,

R^emp_n(f) → R(f).
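A minimal sketch of the empirical risk with the 0/1 loss (the toy dataset reuses the height/gender list above, with 1 = M, 0 = F; the threshold rule is a made-up candidate):

```python
# Empirical risk: average 0/1 loss over the training set.
def empirical_risk(f, data):
    return sum(1.0 for x, y in data if f(x) != y) / len(data)

# The (height, gender) training pairs from the earlier slide.
train = [(163.0, 0), (170.0, 0), (175.3, 1), (184.0, 1), (175.0, 1),
         (172.5, 0), (153.5, 0), (164.0, 1), (163.0, 1)]

rule = lambda x: int(x >= 170.0)    # classify as M above 170 cm
print(empirical_risk(rule, train))  # 0.444...: 4 of the 9 points are misclassified
```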
Relying on the empirical risk
As the sample size n grows, the empirical risk behaves like the real risk.
• It may thus seem like a good idea to minimize directly the empirical risk.
• The intuition is that

◦ since a function f such that R(f) is low is desirable,
◦ since R^emp_n(f) converges to R(f) as n → ∞,

why not look directly for any function f such that R^emp_n(f) is low?

• Typically, in the context of classification with the 0/1 loss, find a function such that

R^emp_n(f) = (1/n) Σ_{i=1}^n δ_{y_i ≠ f(x_i)}

...is low.
A flawed intuition
• However, focusing only on R^emp_n is not viable.
• Many ways this can go wrong...
• Consider the function defined as

h(x) = y_1 if x = x_1, y_2 if x = x_2, …, y_n if x = x_n, and 0 otherwise.

• Since R^emp_n(h) = (1/n) Σ_{i=1}^n δ_{y_i ≠ h(x_i)} = (1/n) Σ_{i=1}^n δ_{y_i ≠ y_i} = 0, h minimizes R^emp_n.

• However, h always answers 0, except for a few points.

• In practice, we can expect R(h) to be much higher, equal to P(Y = 1) in fact.
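A minimal sketch of this memorizing function h (the stored pairs are made up):

```python
# h memorizes the training pairs and answers 0 everywhere else.
train = {163.0: 1, 175.3: 1, 153.5: 0, 172.5: 0}  # made-up x -> y pairs

def h(x):
    return train.get(x, 0)

print(sum(h(x) != y for x, y in train.items()))  # 0: zero empirical risk
print(h(164.0))  # 0: every unseen x gets the default answer
```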
Here is what this function would predict on the Height/Gender problem:

[Figure: predictions of h along the height axis (150–200 cm): the training points get their memorized labels M or F, every other height gets the default answer.]
Overfitting is probably the most frequent mistake made by ML practitioners.
Ideas to Avoid Overfitting
• Our criterion R^emp_n(g) only considers a finite set of points.

• A function g defined on R^d is defined on an infinite set of points.

A few approaches to control overfitting:

• Restrict the set of candidates:

min_{g ∈ G} R^emp_n(g).

• Penalize "undesirable" functions:

min_{g ∈ G} R^emp_n(g) + λ‖g‖².
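As a concrete sketch of the penalized approach, here is least-squares regression with a squared-norm penalty (ridge regression is my choice of example, with made-up data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 1-D regression data.
x = rng.uniform(150.0, 200.0, size=30)
y = 0.9 * x - 80.0 + rng.normal(0.0, 5.0, size=30)

# min_w (1/n) ||Xw - y||^2 + lam ||w||^2 has the closed-form solution
# w = (X^T X / n + lam I)^{-1} (X^T y / n).
X = np.column_stack([x, np.ones_like(x)])  # feature + intercept column
n, lam = len(x), 0.1
w = np.linalg.solve(X.T @ X / n + lam * np.eye(2), X.T @ y / n)
print(w)  # penalized slope and intercept
```

Larger λ shrinks w toward 0, trading a little empirical risk for a less overfitting-prone set of candidate functions.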
Are there theoretical tools which justify such approaches?
Bounds
Flow of a learning process in Machine Learning

• Assumption 1: existence of a probability density p for (X, Y).

• Assumption 2: points are observed i.i.d. following this probability density.

Roadmap

• Get a random training sample {(x_i, y_i)}_{i=1,…,n} (training set)

• Choose a class of functions G (method or model)

• Choose g_n in G such that R^emp_n(g_n) is low (estimation algorithm)

Next... use g_n in practice.
Yet, you may want to have a partial answer to these questions:

• How good would g_B be if we knew the real probability p?

• What about R(g_n)?

• What's the gap between them, R(g_n) − R(g_B)?

• Is the estimation algorithm reliable? How big is R^emp_n(g_n) − inf_{g∈G} R^emp_n(g)?

• How big is R^emp_n(g_n) − inf_{g∈G} R(g)?
Excess Risk

• In the general case g_B ∉ G.

• Hence, introducing g⋆ as a function achieving the lowest risk in G,

R(g⋆) = inf_{g ∈ G} R(g),

we decompose

R(g_n) − R(g_B) = [R(g_n) − R(g⋆)] + [R(g⋆) − R(g_B)],

where the first bracket is the Estimation Error and the second the Approximation Error.

• The estimation error is random; the approximation error is fixed.

• In the following we focus on the estimation error.
Types of Bounds

Error Bounds:

R(g_n) ≤ R^emp_n(g_n) + C(n, G).

Error Bounds Relative to Best in Class:

R(g_n) ≤ R(g⋆) + C(n, G).

Error Bounds Relative to the Bayes Risk:

R(g_n) ≤ R(g_B) + C(n, G).
Error Bounds / Generalization Bounds
R(g_n) − R^emp_n(g_n)
What is Overfitting?

• Overfitting is the idea that,

◦ given n training points sampled randomly,
◦ given a function g_n estimated from these points,
◦ we may have...

R(g_n) ≫ R^emp_n(g_n).

• Question of interest:

P[R(g_n) − R^emp_n(g_n) > ε] = ?

• From now on, we consider the classification case, namely functions g : R^d → {0, 1} in G.
Alleviating Notations

• It is more convenient to see a couple (x, y) as a realization of Z, namely

z_i = (x_i, y_i), Z = (X, Y).

• We define the loss class

F = {f : z = (x, y) ↦ δ_{g(x) ≠ y}, g ∈ G},

• with the additional notations

Pf = E[f(X, Y)], P_n f = (1/n) Σ_{i=1}^n f(x_i, y_i),

where we recover

P_n f = R^emp_n(g), Pf = R(g).
Empirical Processes
For each f ∈ F, P_n f is a random variable which depends on n realizations of Z.

• If we consider all possible functions f ∈ F, we obtain:

The set of random variables {P_n f}_{f ∈ F} is called an empirical measure indexed by F.

• A branch of mathematics explicitly studies the convergence of {Pf − P_n f}_{f ∈ F}:

This branch is known as empirical process theory.
Hoeffding’s Inequality
• Recall that for a given g and corresponding f,

R(g) − R^emp_n(g) = Pf − P_n f = E[f(Z)] − (1/n) Σ_{i=1}^n f(z_i),

which is simply the difference between the expectation and the empirical average of f(Z).

• The strong law of large numbers says that

P( lim_{n→∞} [ E[f(Z)] − (1/n) Σ_{i=1}^n f(z_i) ] = 0 ) = 1.
Hoeffding’s Inequality
• A more detailed result is:

Theorem 3 (Hoeffding). Let Z_1, …, Z_n be n i.i.d. random variables with f(Z) ∈ [a, b]. Then, ∀ε > 0,

P[|P_n f − Pf| > ε] ≤ 2 exp(−2nε² / (b − a)²).
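For the 0/1 loss we have f(Z) ∈ [0, 1], so (b − a)² = 1, and the bound can be inverted to size the training set (a standard consequence, worked out here for illustration): requiring 2 exp(−2nε²) ≤ δ gives

n ≥ ln(2/δ) / (2ε²),

e.g. for ε = 0.05 and δ = 0.05, n ≥ ln(40)/0.005 ≈ 738 samples suffice.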