Transcript of "Bias and Variance in Machine Learning", slides by Pierre Geurts (transcript posted 2005-12-01).
Bias and Variance in Machine Learning

Pierre Geurts, Université de Liège

October 2002
Content of the presentation
• Bias and variance definitions
• Parameters that influence bias and variance
• Variance reduction techniques
• Decision tree induction
Content of the presentation
• Bias and variance definitions:
  – A simple regression problem with no input
  – Generalization to full regression problems
  – A short discussion about classification
• Parameters that influence bias and variance
• Variance reduction techniques
• Decision tree induction
Regression problem - no input
• Goal: predict as well as possible the height of a Belgian male adult
• More precisely:
  – Choose an error measure, for example the squared error.
  – Find an estimation ŷ such that the expectation
    E = E_y{(y − ŷ)²}
    over the whole population of Belgian male adults is minimized.
(Figure: distribution of heights, centered around 180 cm.)
Regression problem - no input
• The estimation that minimizes the error can be computed by setting the derivative to zero:
  ∂E/∂ŷ = −2·E_y{y − ŷ} = 0 ⇒ ŷ = E_y{y}
• So, the estimation which minimizes the error is E_y{y}. In machine learning, it is called the Bayes model.
• But in practice, we cannot compute the exact value of E_y{y} (this would require measuring the height of every Belgian male adult).
Learning algorithm
• As p(y) is unknown, find an estimation ŷ from a sample of individuals, LS = {y1, y2, …, yN}, drawn from the Belgian male adult population.
• Examples of learning algorithms (the exact formulas are illegible in the transcript; these are plausible reconstructions consistent with the later slides):
  – ŷ1 = (1/N)·Σi yi, the sample mean
  – ŷ2 = (Σi yi + λ·180)/(N + λ), the sample mean shrunk toward 180
    (if we know that the height is close to 180)
Good learning algorithm
• As LS is randomly drawn, the prediction ŷ will also be a random variable, with some distribution p_LS(ŷ)
• A good learning algorithm should be good not only on one learning sample but on average over all learning samples (of size N) ⇒ we want to minimize:
  E = E_LS{E_y{(y − ŷ)²}}
• Let us analyse this error in more detail
Bias/variance decomposition (1)
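The equations of this slide are illegible in the transcript; a standard reconstruction of the first step, term-by-term consistent with the two-term form on the next slide, is:

```latex
\begin{aligned}
E &= E_{LS}\{E_y\{(y-\hat{y})^2\}\}\\
  &= E_{LS}\{E_y\{\big((y-E_y\{y\}) + (E_y\{y\}-\hat{y})\big)^2\}\}\\
  &= E_y\{(y-E_y\{y\})^2\}
   + 2\,\underbrace{E_y\{y-E_y\{y\}\}}_{=0}\,E_{LS}\{E_y\{y\}-\hat{y}\}
   + E_{LS}\{(E_y\{y\}-\hat{y})^2\}\\
  &= E_y\{(y-E_y\{y\})^2\} + E_{LS}\{(E_y\{y\}-\hat{y})^2\}
\end{aligned}
```

The cross term vanishes because y does not depend on LS and E_y{y − E_y{y}} = 0.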
Bias/variance decomposition (2)
E = E_y{(y − E_y{y})²} + E_LS{(E_y{y} − ŷ)²}
(Figure: distribution of y around E_y{y}, with spread var_y{y}.)
The first term is the residual error: the minimal attainable error, equal to var_y{y}.
Bias/variance decomposition (3)
Bias/variance decomposition (4)
E = var_y{y} + (E_y{y} − E_LS{ŷ})² + …
E_LS{ŷ} = average model (over all LS)
bias² = (E_y{y} − E_LS{ŷ})² = error between the Bayes model and the average model
(Figure: E_y{y} and E_LS{ŷ} on the y axis, separated by bias².)
Bias/variance decomposition (5)
E = var_y{y} + bias² + E_LS{(ŷ − E_LS{ŷ})²}
var_LS{ŷ} = estimation variance = consequence of over-fitting
(Figure: distribution of ŷ around E_LS{ŷ}, with spread var_LS{ŷ}.)
Bias/variance decomposition (6)
E = var_y{y} + bias² + var_LS{ŷ}
(Figure: both distributions on the y axis: y around E_y{y} with spread var_y{y}, ŷ around E_LS{ŷ} with spread var_LS{ŷ}, the two centers separated by bias².)
Our simple example
• ŷ1 = (1/N)·Σi yi (the formulas below are plausible reconstructions; the originals are illegible in the transcript):
  – E_LS{ŷ1} = E_y{y} ⇒ bias = 0
  – var_LS{ŷ1} = var_y{y}/N
  From statistics, ŷ1 is the best unbiased estimate.
• ŷ2 = (Σi yi + λ·180)/(N + λ):
  – shrunk toward 180 ⇒ bias ≠ 0 in general
  – var_LS{ŷ2} = var_y{y}·N/(N + λ)² < var_LS{ŷ1}
  So, the first one may not be the best estimator because of variance (there is a bias/variance tradeoff w.r.t. λ).
Bayesian approach (1)
• Hypotheses (the distributions are illegible in the transcript; σ_p and σ below are assumed symbols):
  – The average height µ is close to 180 cm: P(µ) = N(180, σ_p²)
  – The height of one individual is Gaussian around the mean: P(y|µ) = N(µ, σ²)
• What is the most probable value of µ after having seen the learning sample, i.e. arg max_µ P(µ|LS)?
Bayesian approach (2)
P(µ|LS) = P(LS|µ)·P(µ)/P(LS) ∝ P(LS|µ)·P(µ)
(Bayes theorem, and P(LS) is constant)
P(LS|µ) = Πi P(yi|µ)
(independence of the learning cases)
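Carrying the two lines above through (with the assumed prior N(180, σ_p²) and likelihood N(µ, σ²)), the maximum a posteriori estimate comes out as a shrunk mean:

```latex
\hat{\mu} = \arg\max_{\mu}\,\log P(\mu|LS)
          = \arg\max_{\mu}\Big[-\frac{(\mu-180)^2}{2\sigma_p^2}
            - \sum_{i=1}^{N}\frac{(y_i-\mu)^2}{2\sigma^2}\Big]
\;\Rightarrow\;
\hat{\mu} = \frac{\sigma_p^2\sum_i y_i + \sigma^2\cdot 180}{N\sigma_p^2 + \sigma^2}
```

This interpolates between the sample mean and the prior guess 180: the same shrinkage family as the estimator ŷ2 of the simple example.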
Regression problem – full (1)
• Actually, we want to find a function ŷ(x) of several inputs ⇒ average over the whole input space:
  E = E_x{E_y|x{(y − ŷ(x))²}}
• The error at a point x becomes:
  Err(x) = E_y|x{(y − ŷ(x))²}
• Over all learning sets:
  E_LS{Err(x)} = E_LS{E_y|x{(y − ŷ(x))²}}
Regression problem – full (2)
E_LS{E_y|x{(y − ŷ(x))²}} = Noise(x) + Bias²(x) + Variance(x)
• Noise(x) = E_y|x{(y − h_B(x))²}:
  Quantifies how much y varies from h_B(x) = E_y|x{y}, the Bayes model.
• Bias²(x) = (h_B(x) − E_LS{ŷ(x)})²:
  Measures the error between the Bayes model and the average model.
• Variance(x) = E_LS{(ŷ(x) − E_LS{ŷ(x)})²}:
  Quantifies how much ŷ(x) varies from one learning sample to another.
Illustration (1)
• Problem definition:
  – One input x, uniform random variable in [0,1]
  – y = h(x) + ε, where ε ∼ N(0,1)
(Figure: sample points (x, y) scattered around h(x) = E_y|x{y}.)
Illustration (2)
• Small variance, high bias method
(Figure: the fitted models across samples and their average E_LS{ŷ(x)}.)
Illustration (3)
• Small bias, high variance method
(Figure: the fitted models across samples and their average E_LS{ŷ(x)}.)
Classification problem (1)
err(ĉ, c) = 1(ĉ ≠ c) ⇒ PE = E_LS{E_c{1(ĉ ≠ c)}}
Bayes model: c_B = arg max_c P(c)
Residual error: 1 − P(c_B)
Average model: c_LS = arg max_c P_LS(c)
bias = 1(c_B ≠ c_LS)
(Figure: the true class distribution P over c1, c2, c3 and the distribution P_LS of the predictions over learning samples.)
Classification problem (2)
• Important difference with regression: a more unstable classifier may be beneficial in biased cases (such that c_B ≠ c_LS)
• Example: method 2 is better than method 1 although more variable
(Figure: true distribution P over c1, c2; method 1 always predicts c1, while method 2 predicts c1 or c2 depending on the learning sample.)
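A tiny numeric version of this point; the class probabilities and the two methods' behaviour below are illustrative assumptions, not the figures from the slide:

```python
# Hypothetical two-class problem: the numbers below are illustrative assumptions
p = {"c1": 0.4, "c2": 0.6}                 # true class probabilities
bayes = max(p, key=p.get)                  # Bayes model c_B = c2, residual error 0.4

def expected_error(pred_dist):
    # Expected misclassification rate when the LS-dependent prediction follows pred_dist
    return sum(q * (1.0 - p[c]) for c, q in pred_dist.items())

method1 = {"c1": 1.0}                      # stable but biased: always predicts c1
method2 = {"c1": 0.5, "c2": 0.5}           # unstable: predicts the Bayes class half the time

e1, e2 = expected_error(method1), expected_error(method2)
print(bayes, e1, e2)  # the more variable method 2 has the lower expected error
```

Method 1 has zero variance but its bias costs 0.6 error; method 2 hits the Bayes class half the time and averages 0.5, so instability helps here.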
Content of the presentation
• Bias and variance definitions
• Parameters that influence bias and variance:
  – Complexity of the model
  – Complexity of the Bayes model
  – Noise
  – Learning sample size
  – Learning algorithm
• Variance reduction techniques
• Decision tree induction
Illustrative problem
• Artificial problem with 10 inputs, all uniform random variables in [0,1]
• The true function depends only on 5 inputs:
  y(x) = 10·sin(π·x1·x2) + 20·(x3 − 0.5)² + 10·x4 + 5·x5 + ε,
  where ε is a N(0,1) random variable
• Experimentation:
  – E_LS ⇒ average over 50 learning sets of size 500
  – E_x,y ⇒ average over 2000 cases
  ⇒ estimate variance and bias (+ residual error)
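The protocol above can be sketched end-to-end. A plain k-NN regressor stands in for the learners of the slides (an assumption), and the sample counts are scaled down from 50 × 500 / 2000 so the sketch runs quickly:

```python
import math
import random

random.seed(1)

def friedman(x):
    # True function from the slides: only 5 of the 10 inputs matter
    return (10 * math.sin(math.pi * x[0] * x[1]) + 20 * (x[2] - 0.5) ** 2
            + 10 * x[3] + 5 * x[4])

def draw_ls(n):
    # One learning sample: inputs uniform in [0,1]^10, output = friedman(x) + N(0,1) noise
    xs = [[random.random() for _ in range(10)] for _ in range(n)]
    ys = [friedman(x) + random.gauss(0, 1) for x in xs]
    return xs, ys

def knn_predict(xs, ys, x, k=10):
    # Plain k-NN regression as a stand-in learner
    order = sorted(range(len(xs)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(xs[i], x)))
    return sum(ys[i] for i in order[:k]) / k

N_LS, N_TEST, LS_SIZE = 20, 50, 100   # scaled down from the slides' 50 / 2000 / 500
test_x = [[random.random() for _ in range(10)] for _ in range(N_TEST)]
preds = []
for _ in range(N_LS):                 # E_LS: one model per learning sample
    xs, ys = draw_ls(LS_SIZE)
    preds.append([knn_predict(xs, ys, x) for x in test_x])

bias2 = var = 0.0
for j, x in enumerate(test_x):        # E_x: average the pointwise decomposition
    col = [preds[r][j] for r in range(N_LS)]
    avg = sum(col) / N_LS
    bias2 += (friedman(x) - avg) ** 2 / N_TEST     # Bayes model h_B(x) = friedman(x)
    var += sum((p - avg) ** 2 for p in col) / N_LS / N_TEST

print(bias2, var)  # the residual error adds a further 1.0 (the N(0,1) noise variance)
```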
Complexity of the model
Usually, the bias is a decreasing function of the complexity, while the variance is an increasing function of the complexity.
(Schematic: bias² decreases and variance increases with complexity; the total error E = bias² + var is minimized at an intermediate complexity.)
Complexity of the model – neural networks
• Error, bias, and variance w.r.t. the number of neurons in the hidden layer
(Plot: error, bias, and variance versus the number of hidden perceptrons, 0–12; the error axis runs from 0 to 8.)
Complexity of the model – regression trees
• Error, bias, and variance w.r.t. the number of test nodes
(Plot: error, bias, and variance versus the number of test nodes, 0–50; the error axis runs from 0 to 20.)
Complexity of the model – k-NN
• Error, bias, and variance w.r.t. k, the number of neighbors
(Plot: error, bias, and variance versus k, 0–30; the error axis runs from 0 to 18.)
Learning problem
• Complexity of the Bayes model:
  – At fixed model complexity, bias increases with the complexity of the Bayes model. However, the effect on variance is difficult to predict.
• Noise:
  – Variance increases with noise and bias is mostly unaffected.
  – E.g. with regression trees:
(Plot: error, noise, bias, and variance versus the noise standard deviation, 0–6; the error axis runs from 0 to 70.)
Learning sample size (1)
• At fixed model complexity, bias remains constant and variance decreases with the learning sample size. E.g. linear regression:
(Plot: error, bias, and variance versus LS size, 0–2000; the error axis runs from 0 to 10.)
Learning sample size (2)
• When the complexity of the model depends on the learning sample size, both bias and variance decrease with the learning sample size. E.g. regression trees:
(Plot: error, bias, and variance versus LS size, 0–2000; the error axis runs from 0 to 20.)
Learning algorithms – linear regression
• Very few parameters: small variance
• The goal function is not linear: high bias

The results table is flattened in the transcript; reconstructed so that each row satisfies Err² = Bias²+Noise + Variance:

| Method | Err² | Bias²+Noise | Variance |
|---|---|---|---|
| Linear regr. | 7.0 | 6.8 | 0.2 |
| k-NN (k=1) | 15.4 | 5.0 | 10.4 |
| k-NN (k=10) | 8.5 | 7.2 | 1.3 |
| MLP (10) | 2.0 | 1.2 | 0.8 |
| MLP (10 – 10) | 4.6 | 1.4 | 3.2 |
| Regr. tree | 10.2 | 3.5 | 6.7 |
Learning algorithms – k-NN
• Small k: high variance and moderate bias
• High k: smaller variance but higher bias
(Same results table as for linear regression, with the k-NN rows highlighted: k-NN (k=1): Err² 15.4, Variance 10.4; k-NN (k=10): Err² 8.5, Variance 1.3.)
Learning algorithms - MLP
• Small bias
• Variance increases with the model complexity
(Same results table, with the MLP rows highlighted: MLP (10): Err² 2.0, Variance 0.8; MLP (10 – 10): Err² 4.6, Variance 3.2.)
Learning algorithms – regression trees
• Small bias: a (complex enough) tree can approximate any nonlinear function
• High variance (see later)
(Same results table, with the regression tree row highlighted: Regr. tree: Err² 10.2, Bias²+Noise 3.5, Variance 6.7.)
Content of the presentation
• Bias and variance definitions
• Parameters that influence bias and variance
• Variance reduction techniques:
  – Introduction
  – Dealing with the bias/variance tradeoff of one algorithm
  – Averaging techniques
• Decision tree induction
Variance reduction techniques
• In the context of a given method:
  – Adapt the learning algorithm to find the best trade-off between bias and variance.
  – Not a panacea, but the least we can do.
  – Examples: pruning, weight decay.
• Averaging techniques:
  – Change the bias/variance trade-off.
  – Universal, but destroy some features of the initial method.
  – Example: bagging.
Variance reduction: 1 model (1)
• General idea: reduce the ability of the learning algorithm to over-fit the LS
  – Pruning: reduces the model complexity explicitly
  – Early stopping: reduces the amount of search
  – Regularization: reduces the size of the hypothesis space
Variance reduction: 1 model (2)
• Bias² ≈ error on the learning set; E ≈ error on an independent test set
• Selection of the optimal level of fitting:
  – a priori (not optimal)
  – by cross-validation (less efficient)
(Schematic: bias² and variance versus the degree of fitting; the optimal fitting minimizes E = bias² + var.)
Variance reduction: 1 model (3)
• Examples:
  – Post-pruning of regression trees
  – Early stopping of MLPs by cross-validation

The results table is flattened in the transcript; reconstructed so that each row satisfies E = Bias + Variance:

| Method | E | Bias | Variance |
|---|---|---|---|
| Full regr. tree (488) | 10.2 | 3.5 | 6.7 |
| Pr. regr. tree (93) | 9.1 | 4.3 | 4.8 |
| Fully learned MLP | 4.6 | 1.4 | 3.2 |
| Early stopped MLP | 3.8 | 1.5 | 2.3 |

• As expected, this reduces variance and increases bias.
Variance reduction: bagging (1)
E_LS{Err(x)} = E_y|x{(y − h_B(x))²} + (h_B(x) − E_LS{ŷ(x)})² + E_LS{(ŷ(x) − E_LS{ŷ(x)})²}
• Idea: the average model E_LS{ŷ(x)} has the same bias as the original method but zero variance
• Bagging (Bootstrap AGGregatING):
  – To compute E_LS{ŷ(x)}, we should draw an infinite number of LS (of size N)
  – Since we have only one single LS, we simulate sampling from nature by bootstrap sampling from the given LS
  – Bootstrap sampling = sampling with replacement of N objects from LS (N is the size of LS)
Variance reduction: bagging (2)
(Schematic: from the given LS, draw bootstrap samples LS1, LS2, …, LSk; learn one model ŷ1(x), ŷ2(x), …, ŷk(x) on each; predict with the average:)
ŷ(x) = (1/k)·(ŷ1(x) + ŷ2(x) + … + ŷk(x))
Variance reduction: bagging (3)
• Application to regression trees

The results table is flattened in the transcript; reconstructed so that each row satisfies E = Bias + Variance:

| Method | E | Bias | Variance |
|---|---|---|---|
| Full regr. tree | 10.2 | 3.5 | 6.7 |
| Bagged (full trees) | 5.3 | 3.8 | 1.5 |
| 3-test regr. tree | 14.8 | 11.1 | 3.7 |
| Bagged (3-test trees) | 11.7 | 10.7 | 1.0 |

• Strong variance reduction without increasing the bias (although the model is much more complex than a single tree)
Variance reduction: averaging techniques
• Perturb and combine paradigm:
  – Perturb the learning algorithm to obtain several models.
  – Combine the predictions of these models.
• Examples:
  – Bagging: perturb learning sets.
  – Random trees: choose tests at random (see later).
  – Random initial weights for neural networks.
  – …
Averaging techniques: how do they work?
• Intuitively:
(Schematic: a perturbed model carries the variance due to the LS plus the variance due to the perturbation; combining several perturbed models reduces both.)
• The effect of the perturbation is difficult to predict.
Dual idea of bagging (1)
• Instead of perturbing learning sets to obtain several predictions, directly perturb the test case at the prediction stage
• Given a model ŷ(·) and a test case x:
  – Form k attribute vectors by adding Gaussian noise to x: {x+ε1, x+ε2, …, x+εk}.
  – Average the predictions of the model at these points to get the prediction at point x: (1/k)·(ŷ(x+ε1) + ŷ(x+ε2) + … + ŷ(x+εk)).
• The noise level (the variance of the Gaussian noise) is selected by cross-validation.
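A sketch of the dual scheme on an assumed step-function model (a caricature of a tree's piecewise-constant output), showing how averaging over noisy copies of x smooths the prediction:

```python
import random

random.seed(3)

def step_model(x):
    # Piecewise-constant model, a caricature of a regression tree's output
    return 1.0 if x >= 0.5 else 0.0

def dual_pc(model, x, k=1000, sigma=0.1):
    # Dual perturb and combine: average the model over k noisy copies of x
    return sum(model(x + random.gauss(0, sigma)) for _ in range(k)) / k

# The averaged prediction turns the hard step at 0.5 into a smooth ramp
ramp = [dual_pc(step_model, x / 10) for x in range(11)]
print(ramp)
```

The hard jump becomes a sigmoid-like ramp whose width grows with the noise level, which is exactly the smoothing (and the bias risk) discussed on the next slide.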
Dual idea of bagging (2)
• With regression trees:

The results table is flattened in the transcript; reconstructed so that each row satisfies E = Bias + Variance:

| Noise level | E | Bias | Variance |
|---|---|---|---|
| 0.0 | 10.2 | 3.5 | 6.7 |
| 0.2 | 6.3 | 3.5 | 2.8 |
| 0.5 | 5.3 | 4.4 | 0.9 |
| 2.0 | 13.3 | 13.1 | 0.2 |

• Smooths the function ŷ(·).
• Too much noise increases bias: there is a (new) trade-off between bias and variance.
Conclusion
• Variance reduction is a very important topic:
  – Reducing bias is easy; keeping variance low is not as easy.
  – Especially in the context of new applications of machine learning to very complex domains: temporal data, biological data, Bayesian network learning, …
• Interpretability of the model and efficiency of the method are difficult to preserve if we want to reduce variance significantly.
• Other approaches to variance reduction: Bayesian approaches, support vector machines.
Content of the presentation
• Bias and variance definitions
• Parameters that influence bias and variance
• Variance reduction techniques
• Decision tree induction:
  – Induction algorithm
  – Study of decision tree variance
  – Variance reduction methods for decision trees
Decision trees: model family
(Figure: a partition of the [0,1]² input space on attributes A1 and A2, and the corresponding tree: A2<0.33 leads to a 'good' leaf; otherwise the test A1<0.91 leads to further tests (A1<0.23, A2<0.91, A2<0.49, A2<0.75, A2<0.65) with leaves labelled 'good' or 'bad'.)
![Page 52: Bias and Variance in Machine Learning · 2005-12-01 · Bias/variance decomposition (6) E= var y {y} + bias2 + var LS {y} E y y ... Learning algorithms – linear regression • Very](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f616e1c692b5a45b05e235d/html5/thumbnails/52.jpg)
52
Decision trees: induction algorithm
(Figure sequence: the tree is grown top-down on the learning sample. The best root split, A2<0.33, is selected by scanning candidate thresholds (e.g. 0.25, 0.45) and keeping the best-scoring one. The left subset becomes a 'good' leaf; the right subset is split again on A1<0.91, and so on until the final tree of the previous slide is obtained.)
Impact of variance on the error
• Estimation of the error on 7 different problems
• The impact of variance is measured by the bias/variance decomposition
(Pie chart: average error 21.62%, split into residual error 17%, bias 28%, variance 55%.)
Impact of variance on the error
• Sources of variance = the choices that depend on the sample (e.g. the test A1<3.45 at a node)
(Pie chart: the variance is decomposed by source: choice of the tested attributes 21%, choice of the thresholds 58%, predictions at the leaves 21%.)
Parameter variance
• Experiments to highlight how much the parameters vary with the sample
• For example, the choice of the threshold: on different samples the same node is split at A1<0.48, A1<0.61, A1<0.82
⇒ The parameters are highly variable
⇒ This calls the interpretability of the method into question
Summary

| Method | Accuracy | Interpretability | Efficiency |
|---|---|---|---|
| Full trees | Fair | Good | Very good |
Variance reduction methods
Three approaches:
• Improve interpretability first:
  – Pruning
  – Parameter stabilization
• Improve accuracy first:
  – Bagging
  – Random trees
• Improve both (if possible):
  – Dual perturb and combine
Pruning
• Determine the appropriate size of the tree using a set independent from the learning set
(Schematic: error on the growing sample (GS) and on the pruning sample (PS) versus tree complexity; growing moves from underfitting to overfitting, and pruning selects the final tree at the minimum of the PS error.)
62

Parameter stabilization

• Several techniques exist to stabilize the choice of the discretization thresholds and of the tested attributes
• One example technique to stabilize the threshold: average the n best thresholds

[Figure: score of candidate thresholds on [0, 1] — the single optimal threshold versus the stabilized threshold (0.45)]
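A minimal sketch of this stabilization idea (names and scoring function are illustrative, assuming a 1-D regression setting): instead of keeping only the single best candidate threshold, average the n best-scoring ones.

```python
import numpy as np

def variance_reduction(x, y, t):
    """Score of threshold t: drop in variance of y when splitting x at t."""
    left, right = y[x < t], y[x >= t]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    total = np.var(y) * len(y)
    return (total - np.var(left) * len(left) - np.var(right) * len(right)) / len(y)

def stabilized_threshold(x, y, n=5):
    """Average the n best candidate thresholds instead of keeping the best one."""
    xs = np.sort(np.unique(x))
    candidates = (xs[:-1] + xs[1:]) / 2
    scores = np.array([variance_reduction(x, y, t) for t in candidates])
    return candidates[np.argsort(scores)[-n:]].mean()

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = (x < 0.5).astype(float) + rng.normal(0, 0.3, 200)  # true cut near 0.5
print(stabilized_threshold(x, y))
```

Averaging damps the sampling noise in the threshold choice, at the price of a small bias when the score profile is asymmetric around its peak.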
63

Parameter stabilization

• Strong effect on interpretability:
  – pruning reduces complexity by ..%
  – stabilization reduces the variance of the threshold by 60%
• Limited effect on the error: variance ↓ but bias ↑

| Method | Error |
|---|---|
| Full tree | 21.61% |
| Pruned tree | 20.65% |
| Pruning + stabilization | 20.05% |
64

Summary

| Method | Accuracy | Interpretability | Efficiency |
|---|---|---|---|
| Pruning + stabilization | Average | Very good | Very good |
| Full trees | Average | Good | Very good |
65

Model aggregation

Example: bagging uses resampling

[Figure: several trees are built from resamples of the learning set ("perturb") and aggregated into a single predictor ("combine"); the individual decision boundaries are labelled good/bad]
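The perturb-and-combine scheme of bagging can be sketched as follows (a minimal illustration with hypothetical names, using a single-split regression stump as the base learner instead of a full tree): each model is fit on a bootstrap resample, and predictions are averaged.

```python
import numpy as np

def fit_stump(x, y):
    """Best single-threshold regressor: predicts the mean of y on each side."""
    xs = np.sort(np.unique(x))
    best, best_err = None, np.inf
    for t in (xs[:-1] + xs[1:]) / 2:
        left, right = y[x < t], y[x >= t]
        err = np.var(left) * len(left) + np.var(right) * len(right)
        if err < best_err:
            best, best_err = (t, left.mean(), right.mean()), err
    return best

def bagged_predict(models, x):
    """Combine: average the predictions of all bootstrap models."""
    preds = [np.where(x < t, lo, hi) for (t, lo, hi) in models]
    return np.mean(preds, axis=0)

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 300)
y = (x < 0.5).astype(float) + rng.normal(0, 0.3, 300)

# Perturb: fit one stump per bootstrap resample of the learning set.
models = []
for _ in range(25):
    idx = rng.integers(0, len(x), len(x))
    models.append(fit_stump(x[idx], y[idx]))

y_hat = bagged_predict(models, x)
```

Averaging over resamples smooths out the sample-dependent part of each model, which is exactly the variance component of the error.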
66

Random trees: induction

• "Imitates" the very large variance of trees by drawing an attribute and a threshold at random

[Figure: on the same sample, the optimal test A2 < 0.33 versus a random test A1 < 0.25]

⇒ Several random trees are then aggregated
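The randomized test choice can be sketched as follows (a minimal illustration, names hypothetical): instead of searching for the optimal attribute and threshold, both are drawn at random, the threshold uniformly within the attribute's observed range.

```python
import numpy as np

def random_split(X, rng):
    """Draw the node test at random: one attribute, one threshold in its range."""
    a = rng.integers(0, X.shape[1])   # random attribute index
    lo, hi = X[:, a].min(), X[:, a].max()
    t = rng.uniform(lo, hi)           # random threshold within its range
    return a, t

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, (100, 2))
a, t = random_split(X, rng)
```

Each individual random tree is much worse than an optimally grown one, but the randomization decorrelates the trees, so their aggregate has low variance.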
67

Random trees: evaluation

• Effect on accuracy:

| Method | Error |
|---|---|
| Full tree | 21.61% |
| Bagging | 14.02% |
| 25 random trees | 11.84% |

• The decrease in error is essentially due to a decrease in variance
68

Summary

| Method | Accuracy | Interpretability | Efficiency |
|---|---|---|---|
| Bagging | Very good | Poor | Average |
| Random trees | Very good | Poor | Very good |
| Pruning + stabilization | Average | Very good | Very good |
| Full trees | Average | Good | Very good |
69

Dual Perturb and Combine

• Perturbation at prediction time, with a single model
• Independent Gaussian noise is added to each coordinate of the test point

[Figure: a test point is perturbed several times; the tree's predictions (good/bad) on the noisy copies are combined]
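A minimal sketch of this prediction-time scheme (names and the fixed stump parameters are illustrative): a single fitted model is queried on several noisy copies of the test point, and the predictions are averaged.

```python
import numpy as np

def stump_predict(x, t=0.5, lo=1.0, hi=0.0):
    """A fixed single-threshold model (parameters chosen for illustration)."""
    return np.where(x < t, lo, hi)

def dual_pc_predict(x, sigma=0.1, n_copies=25, rng=None):
    """Perturb the test point with Gaussian noise, combine the predictions."""
    if rng is None:
        rng = np.random.default_rng(3)
    noisy = x[None, :] + rng.normal(0, sigma, (n_copies, len(x)))
    return stump_predict(noisy).mean(axis=0)

x = np.array([0.45, 0.55, 0.9])
print(dual_pc_predict(x))
```

Points far from the threshold keep a crisp prediction; points near it get a graded average, which is what smooths the decision boundary and reduces variance.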
70

Dual Perturb and Combine

• Bias/variance trade-off as a function of the noise level
• The optimal noise level is determined on an independent sample

[Figure: decision boundaries on the same scatter of points with no noise, with the optimal noise level, and with too much noise]
71

Dual Perturb and Combine

• Results in terms of accuracy:

| Method | Error |
|---|---|
| Full tree | 21.61% |
| Dual P&C | 17.26% |
| 25 random trees | 11.84% |

• Impact: essentially a reduction of variance
• Falls between full trees and random trees
72

Dual P&C = fuzzy trees

• A test A1 < a1 becomes soft: with noise e, a case goes left with probability P(A1 + e < a1) and right with probability P(A1 + e ≥ a1)

[Figure: the step function of the crisp test A1 < a1 becomes the smooth curve P(A1 + e < a1) as a function of A1]
73

Dual P&C = fuzzy trees

[Figure: two-level tree; the test A1 < a1 sends a case left with probability P(A1 + e < a1) = 0.7 and right with probability 0.3; the left child tests A2 < a2 with probabilities P(A2 + e < a2) = 0.6 and 0.4; the leaves receive P(A1 + e < a1) · P(A2 + e < a2) = 0.42 (bad), P(A1 + e < a1) · P(A2 + e ≥ a2) = 0.28 (good) and P(A1 + e ≥ a1) = 0.3 (good)]

⇒ Predicted class: good (0.58 vs. 0.42)
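The propagation above can be sketched as follows (a minimal illustration using the slide's branch probabilities directly; `p_left` shows how such a probability arises from the Gaussian noise CDF, with sigma chosen arbitrarily):

```python
import math

def p_left(value, threshold, sigma):
    """P(value + e < threshold) for Gaussian noise e ~ N(0, sigma^2)."""
    return 0.5 * (1 + math.erf((threshold - value) / (sigma * math.sqrt(2))))

# Slide example: branch probabilities 0.7/0.3 at A1 < a1, 0.6/0.4 at A2 < a2.
p1 = 0.7   # P(A1 + e < a1)
p2 = 0.6   # P(A2 + e < a2)

# Each leaf receives the product of the branch probabilities on its path.
class_prob = {
    "bad":  p1 * p2,                   # 0.42
    "good": p1 * (1 - p2) + (1 - p1),  # 0.28 + 0.30 = 0.58
}
print(class_prob)  # class 'good' wins, 0.58 vs. 0.42
```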
74

Summary

| Method | Accuracy | Interpretability | Efficiency |
|---|---|---|---|
| Bagging | Very good | Poor | Average |
| Dual P&C | Good | Good | Good |
| Random trees | Very good | Poor | Very good |
| Pruning + stabilization | Average | Very good | Very good |
| Full trees | Average | Good | Very good |