Posted on 04-Sep-2020
CZECH TECHNICAL UNIVERSITY IN PRAGUE
Faculty of Electrical Engineering, Department of Cybernetics
P. Pošík © 2015 Artificial Intelligence – 1 / 13

Bias-Variance Trade-off. Crossvalidation. Regularization.
Petr Pošík
How to evaluate a predictive model?
Model evaluation

Fundamental question: What is a good measure of "model quality" from the machine-learning standpoint?

■ We have various measures of model error:
  ■ For regression tasks: MSE, MAE, . . .
  ■ For classification tasks: misclassification rate, measures based on confusion matrix, . . .
■ Some of them can be regarded as finite approximations of the Bayes risk.
■ Are these measures good approximations when evaluated on the data the models were trained on?

[Figure: two models fitted to the same data points: f(x) = x and f(x) = x³ − 3x² + 3x.]
Using MSE only, both models are equivalent!

[Figure: a linear and a cubic model fitted to the same noisy data: f(x) = −0.09 + 0.99x and f(x) = 0.00 − 0.31x + 1.67x² − 0.51x³.]
Using MSE only, the cubic model is better than the linear one!

A basic method of evaluation is model validation on a different, independent data set from the same source, i.e. on testing data.
Validation on testing data

Example: Polynomial regression with varying degree:
X ∼ U(−1, 3)
Y ∼ X² + N(0, 1)

[Figure: polynomial models of degrees 0, 1, 2, 3, 5, and 9 fitted to the training data, with the testing data overlaid.]

Degree | Training error | Testing error
   0   |     8.319      |     6.901
   1   |     2.013      |     2.841
   2   |     0.647      |     0.925
   3   |     0.645      |     0.919
   5   |     0.611      |     0.979
   9   |     0.545      |     1.067
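This experiment can be reproduced with a short numpy sketch. The sample size and random seed are my assumptions, so the exact errors will differ from the table above, but the pattern (training error falling with degree, testing error bottoming out near degree 2) should match:

```python
import numpy as np

rng = np.random.default_rng(0)

# Data from the slide's model: X ~ U(-1, 3), Y ~ X^2 + N(0, 1)
n = 30
x_tr = rng.uniform(-1, 3, n)
y_tr = x_tr**2 + rng.normal(0, 1, n)
x_te = rng.uniform(-1, 3, n)
y_te = x_te**2 + rng.normal(0, 1, n)

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

errors = {}
for deg in [0, 1, 2, 3, 5, 9]:
    w = np.polyfit(x_tr, y_tr, deg)              # least-squares polynomial fit
    errors[deg] = (mse(y_tr, np.polyval(w, x_tr)),
                   mse(y_te, np.polyval(w, x_te)))

for deg, (tr, te) in errors.items():
    print(f"deg {deg}: train MSE {tr:.3f}, test MSE {te:.3f}")
```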
Training and testing error
[Figure: training and testing MSE as a function of polynomial degree.]
■ The training error decreases with increasing model flexibility.
■ The testing error is minimal at a certain degree of model flexibility.
Overfitting

Definition of overfitting:
■ Let H be a hypothesis space.
■ Let h ∈ H and h′ ∈ H be two different hypotheses from this space.
■ Let ErrTr(h) be the error of hypothesis h measured on the training dataset (training error).
■ Let ErrTst(h) be the error of hypothesis h measured on the testing dataset (testing error).
■ We say that h is overfitted if there is another h′ for which

ErrTr(h) < ErrTr(h′) ∧ ErrTst(h) > ErrTst(h′)

[Figure: model error vs. model flexibility for training and testing data.]

■ "When overfitted, the model works well for the training data, but fails for new (testing) data."
■ Overfitting is a general phenomenon affecting all kinds of inductive learning.

We want models and learning algorithms with a good generalization ability, i.e.
■ we want models that encode only the patterns valid in the whole domain, not the specifics of the training data,
■ we want algorithms able to find only the patterns valid in the whole domain and to ignore the specifics of the training data.
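The definition above can be stated as a tiny predicate. The numbers in the usage line are the degree-9 and degree-2 errors from the earlier polynomial-regression table:

```python
def is_overfitted(err_tr_h, err_tst_h, err_tr_h2, err_tst_h2):
    """h is overfitted if some other hypothesis h' has a higher
    training error but a lower testing error."""
    return err_tr_h < err_tr_h2 and err_tst_h > err_tst_h2

# Degree-9 model (h) vs. degree-2 model (h') from the polynomial example
print(is_overfitted(0.545, 1.067, 0.647, 0.925))   # True
```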
Bias vs Variance

[Figure: polynomial models of degree 1, 2, and 9 fitted to the same data.
Degree 1 (tr. err. 2.013, test err. 2.841) — high bias: model not flexible enough (underfit).
Degree 2 (tr. err. 0.647, test err. 0.925) — "just right" (good fit).
Degree 9 (tr. err. 0.545, test err. 1.067) — high variance: model flexibility too high (overfit).]

High bias problem:
■ ErrTr(h) is high
■ ErrTst(h) ≈ ErrTr(h)

High variance problem:
■ ErrTr(h) is low
■ ErrTst(h) ≫ ErrTr(h)
Crossvalidation

Simple crossvalidation:
■ Split the data into training and testing subsets.
■ Train the model on the training data.
■ Evaluate the model error on the testing data.

K-fold crossvalidation:
■ Split the data into k folds (k is usually 5 or 10).
■ In each iteration:
  ■ Use k − 1 folds to train the model.
  ■ Use the remaining fold to test the model, i.e. to measure the error.

Iter. 1: Training | Training | Testing
Iter. 2: Training | Testing | Training
...
Iter. k: Testing | Training | Training

■ Aggregate (average) the k error measurements to get the final error estimate.
■ Train the model on the whole data set.

Leave-one-out (LOO) crossvalidation:
■ k = |T|, i.e. the number of folds is equal to the training set size.
■ Time-consuming for large |T|.
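The k-fold procedure above can be sketched as follows, here applied to polynomial regression on synthetic data from the earlier example (sample size and seed are my choice):

```python
import numpy as np

def kfold_cv_mse(x, y, degree, k=5, seed=0):
    """Estimate the test MSE of a polynomial model by k-fold crossvalidation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errs = []
    for i in range(k):
        test_idx = folds[i]                                   # 1 fold for testing
        train_idx = np.concatenate(
            [f for j, f in enumerate(folds) if j != i])       # k-1 folds for training
        w = np.polyfit(x[train_idx], y[train_idx], degree)
        pred = np.polyval(w, x[test_idx])
        errs.append(np.mean((y[test_idx] - pred) ** 2))
    return float(np.mean(errs))    # aggregate the k error measurements

rng = np.random.default_rng(1)
x = rng.uniform(-1, 3, 50)
y = x**2 + rng.normal(0, 1, 50)
print(kfold_cv_mse(x, y, degree=2))
```

After the estimate is computed, the final model would be trained once more on the whole data set, as the slide notes.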
How to determine a suitable model flexibility

Simply test models of varying complexity and choose the one with the best testing error, right?
■ The testing data are used here to tune a meta-parameter of the model.
■ The testing data are thus used to train (a part of) the model, and essentially become part of the training data.
■ The error on the testing data is then no longer an unbiased estimate of the model error; it underestimates it.
■ A new, separate data set is needed to estimate the model error.

Using simple crossvalidation:
1. Training data: use ca. 50 % of the data for model building.
2. Validation data: use ca. 25 % of the data to search for the suitable model flexibility.
3. Train the chosen model on the training + validation data.
4. Testing data: use ca. 25 % of the data for the final estimate of the model error.

Using k-fold crossvalidation:
1. Training data: use ca. 75 % of the data to find and train a suitable model using crossvalidation.
2. Testing data: use ca. 25 % of the data for the final estimate of the model error.

The ratios are not set in stone; other splits, e.g. 60:20:20, are possible as well.
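The simple-crossvalidation variant (50/25/25) might look like this sketch; the candidate degree range, sample size, and seed are my assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 3, 200)
y = x**2 + rng.normal(0, 1, 200)

# 50 % / 25 % / 25 % split into training, validation, and testing data
idx = rng.permutation(200)
tr, val, te = idx[:100], idx[100:150], idx[150:]

def mse(degree, fit_idx, eval_idx):
    w = np.polyfit(x[fit_idx], y[fit_idx], degree)
    return float(np.mean((y[eval_idx] - np.polyval(w, x[eval_idx])) ** 2))

# Steps 1-2: choose the flexibility (degree) with the lowest validation error
best_deg = min(range(10), key=lambda d: mse(d, tr, val))

# Step 3: retrain on training + validation data
# Step 4: final error estimate on the held-out testing data
final_err = mse(best_deg, np.concatenate([tr, val]), te)
print(best_deg, round(final_err, 3))
```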
How to prevent overfitting?
1. Reduce the number of features.
■ Manually select which features to keep.
■ Try to identify a suitable subset of features during the learning phase.
2. Regularization.
■ Keep all the features, but reduce the magnitude of the parameters w.
■ Works well if we have a lot of features, each of which contributes a bit to predicting y.
Regularization
Ridge regularization (a.k.a. Tikhonov regularization)

Ridge regularization penalizes the size of the model coefficients:
■ Modification of the optimization criterion:

J(w) = (1/|T|) ∑_{i=1}^{|T|} ( y^(i) − h_w(x^(i)) )² + α ∑_{d=1}^{D} w_d²

■ The solution is given by a modified normal equation:

w* = (XᵀX + αI)⁻¹ Xᵀy

■ As α → 0, w_ridge → w_OLS.
■ As α → ∞, w_ridge → 0.
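The modified normal equation has a direct numpy translation. This is a minimal sketch under my own choices (data, α values, a degree-9 polynomial feature matrix); for simplicity it penalizes all coefficients, including the intercept:

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution: w = (X^T X + alpha * I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

rng = np.random.default_rng(3)
x = rng.uniform(-1, 3, 50)
y = x**2 + rng.normal(0, 1, 50)
X = np.vander(x, 10, increasing=True)     # polynomial features up to degree 9

w_ols = ridge_fit(X, y, alpha=1e-6)       # alpha -> 0: close to ordinary least squares
w_reg = ridge_fit(X, y, alpha=10.0)       # coefficients shrink toward zero
print(np.linalg.norm(w_ols), np.linalg.norm(w_reg))
```

The shrinking norm of the coefficient vector as α grows is exactly what the coefficient-path figure below illustrates.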
Training and testing errors as functions of the regularization parameter:
[Figure: training and testing MSE vs. regularization factor, 10⁻⁸ to 10⁸ on a log scale.]

The values of the coefficients as functions of the regularization parameter:
[Figure: coefficient sizes vs. regularization factor, 10⁻⁸ to 10⁸ on a log scale.]
Lasso regularization

Lasso regularization penalizes the size of the model coefficients:
■ Modification of the optimization criterion:

J(w) = (1/|T|) ∑_{i=1}^{|T|} ( y^(i) − h_w(x^(i)) )² + α ∑_{d=1}^{D} |w_d|

■ The solution is usually found by quadratic programming.
■ As α → ∞, Lasso regularization decreases the number of non-zero coefficients.
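Since Lasso has no closed-form solution, a numerical solver is needed. One standard approach (my illustrative choice; the slides only mention quadratic programming) is cyclic coordinate descent with soft-thresholding, which makes the sparsity-inducing behavior explicit:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, alpha, n_iter=200):
    """Lasso by cyclic coordinate descent, minimizing
    (1/n) * ||y - X w||^2 + alpha * sum_d |w_d|."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(d):
            r = y - X @ w + X[:, j] * w[j]        # residual excluding feature j
            rho = X[:, j] @ r
            # Exact coordinate-wise minimizer: shrink toward zero
            w[j] = soft_threshold(rho, n * alpha / 2) / col_sq[j]
    return w

rng = np.random.default_rng(4)
x = rng.uniform(-1, 3, 50)
y = x**2 + rng.normal(0, 1, 50)
X = np.vander(x, 6, increasing=True)

w_small = lasso_cd(X, y, alpha=1e-4)
w_big = lasso_cd(X, y, alpha=100.0)
print(np.sum(np.abs(w_small) > 1e-8), np.sum(np.abs(w_big) > 1e-8))
```

Increasing α zeroes out more coefficients, matching the bullet above and the coefficient-path figure below.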
Training and testing errors as functions of the regularization parameter:
[Figure: training and testing MSE vs. regularization factor, 10⁻⁸ to 10² on a log scale.]

The values of the coefficients as functions of the regularization parameter:
[Figure: coefficient sizes vs. regularization factor, 10⁻⁸ to 10² on a log scale.]