Posted on 04-Sep-2020
CZECH TECHNICAL UNIVERSITY IN PRAGUE
Faculty of Electrical Engineering, Department of Cybernetics
P. Pošík © 2015 Artificial Intelligence – 1 / 13

Bias-Variance Trade-off. Crossvalidation. Regularization.
Petr Pošík
How to evaluate a predictive model?
Model evaluation

Fundamental question: What is a good measure of "model quality" from the machine-learning standpoint?

■ We have various measures of model error:
  ■ For regression tasks: MSE, MAE, . . .
  ■ For classification tasks: misclassification rate, measures based on confusion matrix, . . .
■ Some of them can be regarded as finite approximations of the Bayes risk.
■ Are these measures good approximations when evaluated on the data the models were trained on?

[Figure: two models fitted to the same data points: f(x) = x and f(x) = x³ − 3x² + 3x.]
Using MSE only, both models are equivalent!

[Figure: a linear and a cubic model fitted to the same noisy data: f(x) = −0.09 + 0.99x and f(x) = 0.00 − 0.31x + 1.67x² − 0.51x³.]
Using MSE only, the cubic model is better than the linear one!

A basic method of evaluation is model validation on a different, independent data set from the same source, i.e. on testing data.
Validation on testing data

Example: Polynomial regression with varying degree:
X ∼ U(−1, 3)
Y ∼ X² + N(0, 1)

[Figure: polynomial models of degrees 0, 1, 2, 3, 5, and 9 fitted to the training data, with the testing data overlaid.]

Degree | Training error | Testing error
   0   |     8.319      |     6.901
   1   |     2.013      |     2.841
   2   |     0.647      |     0.925
   3   |     0.645      |     0.919
   5   |     0.611      |     0.979
   9   |     0.545      |     1.067
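This experiment can be reproduced with a short numpy sketch. The sample size and random seed are my assumptions, so the exact errors will differ from the table above, but the pattern (training error falling with degree, testing error bottoming out near degree 2) should match:

```python
import numpy as np

rng = np.random.default_rng(0)

# Data from the slide's model: X ~ U(-1, 3), Y ~ X^2 + N(0, 1)
n = 30
x_tr = rng.uniform(-1, 3, n)
y_tr = x_tr**2 + rng.normal(0, 1, n)
x_te = rng.uniform(-1, 3, n)
y_te = x_te**2 + rng.normal(0, 1, n)

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

errors = {}
for deg in [0, 1, 2, 3, 5, 9]:
    w = np.polyfit(x_tr, y_tr, deg)              # least-squares polynomial fit
    errors[deg] = (mse(y_tr, np.polyval(w, x_tr)),
                   mse(y_te, np.polyval(w, x_te)))

for deg, (tr, te) in errors.items():
    print(f"deg {deg}: train MSE {tr:.3f}, test MSE {te:.3f}")
```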
Training and testing error
[Figure: training and testing MSE as a function of polynomial degree.]
■ The training error decreases with increasing model flexibility.
■ The testing error is minimal at a certain degree of model flexibility.
Overfitting

Definition of overfitting:
■ Let H be a hypothesis space.
■ Let h ∈ H and h′ ∈ H be two different hypotheses from this space.
■ Let ErrTr(h) be the error of hypothesis h measured on the training dataset (training error).
■ Let ErrTst(h) be the error of hypothesis h measured on the testing dataset (testing error).
■ We say that h is overfitted if there is another h′ for which

ErrTr(h) < ErrTr(h′) ∧ ErrTst(h) > ErrTst(h′)

[Figure: model error vs. model flexibility for training and testing data.]

■ "When overfitted, the model works well for the training data, but fails for new (testing) data."
■ Overfitting is a general phenomenon affecting all kinds of inductive learning.

We want models and learning algorithms with a good generalization ability, i.e.
■ we want models that encode only the patterns valid in the whole domain, not the specifics of the training data,
■ we want algorithms able to find only the patterns valid in the whole domain and to ignore the specifics of the training data.
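The definition above can be stated as a tiny predicate. The numbers in the usage line are the degree-9 and degree-2 errors from the earlier polynomial-regression table:

```python
def is_overfitted(err_tr_h, err_tst_h, err_tr_h2, err_tst_h2):
    """h is overfitted if some other hypothesis h' has a higher
    training error but a lower testing error."""
    return err_tr_h < err_tr_h2 and err_tst_h > err_tst_h2

# Degree-9 model (h) vs. degree-2 model (h') from the polynomial example
print(is_overfitted(0.545, 1.067, 0.647, 0.925))   # True
```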
Bias vs Variance

[Figure: polynomial models of degree 1, 2, and 9 fitted to the same data.
Degree 1 (tr. err. 2.013, test err. 2.841) — high bias: model not flexible enough (underfit).
Degree 2 (tr. err. 0.647, test err. 0.925) — "just right" (good fit).
Degree 9 (tr. err. 0.545, test err. 1.067) — high variance: model flexibility too high (overfit).]

High bias problem:
■ ErrTr(h) is high
■ ErrTst(h) ≈ ErrTr(h)

High variance problem:
■ ErrTr(h) is low
■ ErrTst(h) ≫ ErrTr(h)
Crossvalidation

Simple crossvalidation:
■ Split the data into training and testing subsets.
■ Train the model on the training data.
■ Evaluate the model error on the testing data.

K-fold crossvalidation:
■ Split the data into k folds (k is usually 5 or 10).
■ In each iteration:
  ■ Use k − 1 folds to train the model.
  ■ Use the remaining fold to test the model, i.e. to measure the error.

Iter. 1: Training | Training | Testing
Iter. 2: Training | Testing | Training
...
Iter. k: Testing | Training | Training

■ Aggregate (average) the k error measurements to get the final error estimate.
■ Train the model on the whole data set.

Leave-one-out (LOO) crossvalidation:
■ k = |T|, i.e. the number of folds is equal to the training set size.
■ Time-consuming for large |T|.
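The k-fold procedure above can be sketched as follows, here applied to polynomial regression on synthetic data from the earlier example (sample size and seed are my choice):

```python
import numpy as np

def kfold_cv_mse(x, y, degree, k=5, seed=0):
    """Estimate the test MSE of a polynomial model by k-fold crossvalidation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errs = []
    for i in range(k):
        test_idx = folds[i]                                   # 1 fold for testing
        train_idx = np.concatenate(
            [f for j, f in enumerate(folds) if j != i])       # k-1 folds for training
        w = np.polyfit(x[train_idx], y[train_idx], degree)
        pred = np.polyval(w, x[test_idx])
        errs.append(np.mean((y[test_idx] - pred) ** 2))
    return float(np.mean(errs))    # aggregate the k error measurements

rng = np.random.default_rng(1)
x = rng.uniform(-1, 3, 50)
y = x**2 + rng.normal(0, 1, 50)
print(kfold_cv_mse(x, y, degree=2))
```

After the estimate is computed, the final model would be trained once more on the whole data set, as the slide notes.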
How to determine a suitable model flexibility

Simply test models of varying complexity and choose the one with the best testing error, right?
■ The testing data are used here to tune a meta-parameter of the model.
■ The testing data are thus used to train (a part of) the model, and essentially become part of the training data.
■ The error on the testing data is then no longer an unbiased estimate of the model error; it underestimates it.
■ A new, separate data set is needed to estimate the model error.

Using simple crossvalidation:
1. Training data: use ca. 50 % of the data for model building.
2. Validation data: use ca. 25 % of the data to search for the suitable model flexibility.
3. Train the chosen model on the training + validation data.
4. Testing data: use ca. 25 % of the data for the final estimate of the model error.

Using k-fold crossvalidation:
1. Training data: use ca. 75 % of the data to find and train a suitable model using crossvalidation.
2. Testing data: use ca. 25 % of the data for the final estimate of the model error.

The ratios are not set in stone; other splits, e.g. 60:20:20, are possible as well.
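The simple-crossvalidation variant (50/25/25) might look like this sketch; the candidate degree range, sample size, and seed are my assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 3, 200)
y = x**2 + rng.normal(0, 1, 200)

# 50 % / 25 % / 25 % split into training, validation, and testing data
idx = rng.permutation(200)
tr, val, te = idx[:100], idx[100:150], idx[150:]

def mse(degree, fit_idx, eval_idx):
    w = np.polyfit(x[fit_idx], y[fit_idx], degree)
    return float(np.mean((y[eval_idx] - np.polyval(w, x[eval_idx])) ** 2))

# Steps 1-2: choose the flexibility (degree) with the lowest validation error
best_deg = min(range(10), key=lambda d: mse(d, tr, val))

# Step 3: retrain on training + validation data
# Step 4: final error estimate on the held-out testing data
final_err = mse(best_deg, np.concatenate([tr, val]), te)
print(best_deg, round(final_err, 3))
```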
How to prevent overfitting?
1. Reduce the number of features.
■ Manually select which features to keep.
■ Try to identify a suitable subset of features during the learning phase.
2. Regularization.
■ Keep all the features, but reduce the magnitude of the parameters w.
■ Works well if we have a lot of features, each of which contributes a bit to predicting y.
Regularization
Ridge regularization (a.k.a. Tikhonov regularization)

Ridge regularization penalizes the size of the model coefficients:
■ Modification of the optimization criterion:

J(w) = (1/|T|) ∑_{i=1}^{|T|} ( y^(i) − h_w(x^(i)) )² + α ∑_{d=1}^{D} w_d²

■ The solution is given by a modified normal equation:

w* = (XᵀX + αI)⁻¹ Xᵀy

■ As α → 0, w_ridge → w_OLS.
■ As α → ∞, w_ridge → 0.
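The modified normal equation has a direct numpy translation. This is a minimal sketch under my own choices (data, α values, a degree-9 polynomial feature matrix); for simplicity it penalizes all coefficients, including the intercept:

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution: w = (X^T X + alpha * I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

rng = np.random.default_rng(3)
x = rng.uniform(-1, 3, 50)
y = x**2 + rng.normal(0, 1, 50)
X = np.vander(x, 10, increasing=True)     # polynomial features up to degree 9

w_ols = ridge_fit(X, y, alpha=1e-6)       # alpha -> 0: close to ordinary least squares
w_reg = ridge_fit(X, y, alpha=10.0)       # coefficients shrink toward zero
print(np.linalg.norm(w_ols), np.linalg.norm(w_reg))
```

The shrinking norm of the coefficient vector as α grows is exactly what the coefficient-path figure below illustrates.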
Training and testing errors as functions of the regularization parameter:
[Figure: training and testing MSE vs. regularization factor, 10⁻⁸ to 10⁸ on a log scale.]

The values of the coefficients as functions of the regularization parameter:
[Figure: coefficient sizes vs. regularization factor, 10⁻⁸ to 10⁸ on a log scale.]
Lasso regularization

Lasso regularization penalizes the size of the model coefficients:
■ Modification of the optimization criterion:

J(w) = (1/|T|) ∑_{i=1}^{|T|} ( y^(i) − h_w(x^(i)) )² + α ∑_{d=1}^{D} |w_d|

■ The solution is usually found by quadratic programming.
■ As α → ∞, Lasso regularization decreases the number of non-zero coefficients.
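Since Lasso has no closed-form solution, a numerical solver is needed. One standard approach (my illustrative choice; the slides only mention quadratic programming) is cyclic coordinate descent with soft-thresholding, which makes the sparsity-inducing behavior explicit:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, alpha, n_iter=200):
    """Lasso by cyclic coordinate descent, minimizing
    (1/n) * ||y - X w||^2 + alpha * sum_d |w_d|."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(d):
            r = y - X @ w + X[:, j] * w[j]        # residual excluding feature j
            rho = X[:, j] @ r
            # Exact coordinate-wise minimizer: shrink toward zero
            w[j] = soft_threshold(rho, n * alpha / 2) / col_sq[j]
    return w

rng = np.random.default_rng(4)
x = rng.uniform(-1, 3, 50)
y = x**2 + rng.normal(0, 1, 50)
X = np.vander(x, 6, increasing=True)

w_small = lasso_cd(X, y, alpha=1e-4)
w_big = lasso_cd(X, y, alpha=100.0)
print(np.sum(np.abs(w_small) > 1e-8), np.sum(np.abs(w_big) > 1e-8))
```

Increasing α zeroes out more coefficients, matching the bullet above and the coefficient-path figure below.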
Training and testing errors as functions of the regularization parameter:
[Figure: training and testing MSE vs. regularization factor, 10⁻⁸ to 10² on a log scale.]

The values of the coefficients as functions of the regularization parameter:
[Figure: coefficient sizes vs. regularization factor, 10⁻⁸ to 10² on a log scale.]