COMP 551 - Applied Machine Learning, Lecture 6: Evaluation and Regularization


Transcript of COMP 551 - Applied Machine Learning, Lecture 6: Evaluation and Regularization (wlh/comp551/slides/06-eval_and_reg.pdf)

Page 1:

COMP 551 - Applied Machine Learning
Lecture 6: Evaluation and Regularization
William L. Hamilton (with slides and content from Joelle Pineau)
* Unless otherwise noted, all material posted for this course is copyright of the instructor, and cannot be reused or reposted without the instructor’s written permission.


Page 2:

Clarifications on generative classification
§ The proper pdf for the multivariate Gaussian is:

p(x) = (2π)−m/2 |Σ|−1/2 exp( −(x − μ)T Σ−1 (x − μ) / 2 )

(for x in ℝm, with mean μ and covariance Σ)

§ For continuous variables, we are not computing proper probabilities but instead we are comparing the probability densities of different solutions (via pdfs).
§ The “probability” of a specific point in a continuous distribution is always zero!

§ But we can compare the relative likelihoods of different points and model parameter settings by comparing the pdf values.


(The normalization constant cancels out in the LDA computation, but it is still important to include!)
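To make the density-comparison point concrete, here is a minimal Python sketch using scipy.stats.multivariate_normal; the means, shared covariance, and query point are made-up values, not from the slides.

import numpy as np
from scipy.stats import multivariate_normal

mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])  # class means (hypothetical)
Sigma = np.eye(2)                                      # shared covariance, as in LDA

x = np.array([1.2, 0.4])                               # query point

# The probability of x being exactly this point is zero; these are densities.
p0 = multivariate_normal.pdf(x, mean=mu0, cov=Sigma)
p1 = multivariate_normal.pdf(x, mean=mu1, cov=Sigma)

# But comparing them is meaningful: with equal priors, predict the class
# whose density at x is larger.
print("predict class", int(p1 > p0))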

Page 3:

Classification in the real world

[Figure: spam filtering example, with an incoming email (“?”) to be routed to “Spam” or “Not spam”.]

Page 4:

Multi-class classification


§ Generally two options:

1. Learn a single classifier that can produce 20 distinct output values.

2. Learn 20 different 1-vs-all binary classifiers.


Page 5:

Multi-class classification


§ Generally two options:
1. Learn a single classifier that can produce 20 distinct output values.
2. Learn 20 different 1-vs-all binary classifiers.

§ Option 1 assumes you have a multi-class version of the classifier.

§ For Naïve Bayes, compute P(y|x) for each class, and select the class with the highest probability.


Page 6:

Multi-class classification


§ Generally two options:
1. Learn a single classifier that can produce 20 distinct output values.
2. Learn 20 different 1-vs-all binary classifiers.

§ Option 1 assumes you have a multi-class version of the classifier.

§ For Naïve Bayes, compute P(y|x) for each class, and select the class with the highest probability.

§ Option 2 applies to all binary classifiers, so it is more flexible. But it is often slower (we need to learn many classifiers), and it creates an imbalance problem (the target class has relatively few data points compared to the aggregation of the other classes). A minimal sketch of this strategy follows below.
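As a concrete illustration of option 2, here is a minimal one-vs-all sketch in Python; scikit-learn's LogisticRegression stands in for any binary classifier, and the function names are mine, not from the course code.

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_one_vs_all(X, y, n_classes):
    # One binary classifier per class: class k vs. everything else.
    classifiers = []
    for k in range(n_classes):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X, (y == k).astype(int))  # relabeling creates the imbalance noted above
        classifiers.append(clf)
    return classifiers

def predict_one_vs_all(classifiers, X):
    # Pick the class whose binary classifier is most confident.
    scores = np.column_stack([clf.predict_proba(X)[:, 1] for clf in classifiers])
    return scores.argmax(axis=1)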


Page 7:

Classification in the real world


§ Different objectives:
§ Selecting the right model for a problem.

§ Testing performance of a new algorithm.

§ Evaluating impact on a new application.


Page 8:

Minimizing the error

§ Find the low point in the validation error. (The figure and excerpt below are from Chapter 7, "Model Assessment and Selection", of Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning.)

[Figure 7.1: prediction error vs. model complexity (df), annotated from high bias / low variance at low complexity to low bias / high variance at high complexity.]

FIGURE 7.1. Behavior of test sample and training sample error as the model complexity is varied. The light blue curves show the training error err, while the light red curves show the conditional test error ErrT for 100 training sets of size 50 each, as the model complexity is increased. The solid curves show the expected test error Err and the expected training error E[err].

Test error, also referred to as generalization error, is the prediction error over an independent test sample

ErrT = E[L(Y, f̂(X)) | T]    (7.2)

where both X and Y are drawn randomly from their joint distribution (population). Here the training set T is fixed, and test error refers to the error for this specific training set. A related quantity is the expected prediction error (or expected test error)

Err = E[L(Y, f̂(X))] = E[ErrT].    (7.3)

Note that this expectation averages over everything that is random, including the randomness in the training set that produced f̂.

Figure 7.1 shows the prediction error (light red curves) ErrT for 100 simulated training sets each of size 50. The lasso (Section 3.4.2) was used to produce the sequence of fits. The solid red curve is the average, and hence an estimate of Err.

Estimation of ErrT will be our goal, although we will see that Err is more amenable to statistical analysis, and most methods effectively estimate the expected error. It does not seem possible to estimate conditional …

(Slide annotations on the figure: the train error and validation error curves.)


Page 9:

Not all errors are created equal


§ Not all errors have equal impact!

§ Different types of mistakes, particularly with classification tasks.

§ Incorrectly classifying a tumor is very different from accidentally allowing a spam email through the filter!


Page 10:

Datasets can be imbalanced


§ Most real-world data is not “balanced”, meaning that the classes are not equally distributed.

§ It is easy to get high accuracy on imbalanced data!

§ E.g., if 99% of emails are not spam, then we can get 99% accuracy by simply predicting that everything is not spam!


Page 11:

Example 1

§ Real-world study on predicting seizures from EEG signals.

§ Claims the technique is “clinically ready for mass screening”.


Page 12:

Example 1

Accuracy = (True positives + True negatives) / Total number of examples
Sensitivity = True positives / Total number of actual positives
Specificity = True negatives / Total number of actual negatives

§ Real-world study on predicting seizures from EEG signals.

§ Claims the technique is “clinically ready for mass screening”.


Page 13:

Performance metrics for classification
§ Not all errors have equal impact!
§ There are different types of mistakes, particularly in the classification setting.
§ E.g., consider the diagnosis of a disease. Two types of misdiagnosis:
§ A patient does not have the disease but receives a positive diagnosis (Type I error);
§ A patient has the disease but it is not detected (Type II error).
§ E.g., consider the problem of spam classification:
§ A message that is not spam is assigned to the spam folder (Type I error);
§ A message that is spam appears in the regular folder (Type II error).

§ Question: How many Type I errors are you willing to tolerate, for a reasonable rate of Type II errors?


Page 14:

Example 2

§ Another real-world study on predicting seizures from EEG signals.

§ >85% sensitivity.

§ “Warning rate” of 0.35/hr.


Page 15:

Example 3

§ Another real-world study on predicting seizures from EEG signals.

§ 95% sensitivity.

§ “False alarm rate” of 0.15/hr.


Page 16:

Terminology

§ Types of classification outputs:
§ True positive (m11): Example of class 1 predicted as class 1.

§ False positive (m01): Example of class 0 predicted as class 1. Type I error.

§ True negative (m00): Example of class 0 predicted as class 0.

§ False negative (m10): Example of class 1 predicted as class 0. Type II error.

§ Total number of instances: m = m00 + m01 + m10 + m11


Page 17:

Terminology

§ Types of classification outputs:
§ True positive (m11): Example of class 1 predicted as class 1.

§ False positive (m01): Example of class 0 predicted as class 1. Type I error.

§ True negative (m00): Example of class 0 predicted as class 0.

§ False negative (m10): Example of class 1 predicted as class 0. Type II error.

§ Total number of instances: m = m00 + m01 + m10 + m11

§ Error rate: (m01 + m10) / m
§ If the classes are imbalanced (e.g., 10% from class 1, 90% from class 0), one can achieve low error (e.g., 10%) by classifying everything as coming from class 0!


Page 18:

Confusion matrix

§ Many software packages output a confusion matrix:


Page 19:

Confusion matrix

§ Many software packages output this matrix:

§ Be careful! Sometimes the format is slightly different, e.g.,

Confusion matrix (embedded from COMP-652):

            predicted 0   predicted 1
actual 0       m00           m01
actual 1       m10           m11

• A confusion matrix gives more information than the error rate alone.
• Many software packages (e.g., Weka) output this matrix.
• Varying the parameter of the algorithm produces a curve.


Page 20:

Common measures
§ Accuracy = (TP + TN) / (TP + FP + FN + TN)
§ Precision = True positives / Total number of declared positives = TP / (TP + FP)
§ Recall = True positives / Total number of actual positives = TP / (TP + FN)


Page 21:

Common measures
§ Accuracy = (TP + TN) / (TP + FP + FN + TN)
§ Precision = True positives / Total number of declared positives = TP / (TP + FP)
§ Recall = True positives / Total number of actual positives = TP / (TP + FN)
§ Sensitivity is the same as recall.
§ Specificity = True negatives / Total number of actual negatives = TN / (FP + TN)

(Precision and recall are the usual terms in text classification; sensitivity and specificity in medicine.)


Page 22:

Common measures
§ Accuracy = (TP + TN) / (TP + FP + FN + TN)
§ Precision = True positives / Total number of declared positives = TP / (TP + FP)
§ Recall = True positives / Total number of actual positives = TP / (TP + FN)
§ Sensitivity is the same as recall.
§ Specificity = True negatives / Total number of actual negatives = TN / (FP + TN)
§ False positive rate = FP / (FP + TN)



Page 23:

Common measures
§ Accuracy = (TP + TN) / (TP + FP + FN + TN)
§ Precision = True positives / Total number of declared positives = TP / (TP + FP)
§ Recall = True positives / Total number of actual positives = TP / (TP + FN)
§ Sensitivity is the same as recall.
§ Specificity = True negatives / Total number of actual negatives = TN / (FP + TN)
§ False positive rate = FP / (FP + TN)
§ F1 measure = 2 · Precision · Recall / (Precision + Recall), i.e., the harmonic mean of precision and recall.

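These definitions translate directly into code. A small sketch computing all of the above from the four confusion-matrix counts; the function name and example numbers are hypothetical.

def classification_metrics(tp, fp, tn, fn):
    accuracy    = (tp + tn) / (tp + fp + fn + tn)
    precision   = tp / (tp + fp)       # of declared positives, how many are real
    recall      = tp / (tp + fn)       # a.k.a. sensitivity
    specificity = tn / (fp + tn)
    fpr         = fp / (fp + tn)       # false positive rate = 1 - specificity
    f1          = 2 * precision * recall / (precision + recall)
    return dict(accuracy=accuracy, precision=precision, recall=recall,
                specificity=specificity, fpr=fpr, f1=f1)

print(classification_metrics(tp=8, fp=2, tn=85, fn=5))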


Page 24:

Trade-off: False positives and false negatives
§ Often there is a trade-off between false positives and false negatives.
§ E.g., consider 30 different classifiers trained on a class. Classify a new sample as positive if K classifiers output positive. Vary K between 0 and 30.

Example: Tree bagging (embedded from COMP-652)
• 30 decision trees; classify an example as positive if K trees classify it as positive.
• Vary K between 0 and 30.

Precision-recall (embedded from COMP-652)
• Similar concept to AUC curves, but used in retrieval tasks.
• Precision is true positives / total number of documents retrieved.
• Recall is true positives / all positives.
• In medical applications we use instead sensitivity and selectivity, which are the recall for the two classes.
• Plot recall on the horizontal axis, precision on the vertical axis, and vary the threshold for making positive predictions (or vary K).


Page 25:

Receiver operating characteristic (ROC)
§ Consider simple 2-class classification (blue = negative class; red = positive class).

§ Consider the decision boundary (vertical line in the left figure), where you predict Negative on the left of the boundary and predict Positive on the right.

§ Changing that boundary defines the ROC curve on the right.

[Figure: left, the two class distributions with the decision boundary (predict positive / predict negative); right, the resulting ROC curve.]


Page 26:

Understanding the ROC curve
§ Consider simple 2-class classification (blue = negative class; red = positive class).

§ Consider the decision boundary (vertical line in the left figure), where you predict Negative on the left of the boundary and predict Positive on the right.

§ Changing that boundary defines the ROC curve on the right.

[Figure: same setup as the previous slide.]

§ The diagonal line corresponds to a random classifier.


Page 27:

Building the ROC curve
§ In many domains, the empirical ROC curve will be non-convex (red line). Take the convex hull of the points (blue line).

§ To build the ROC curve:
1. Train a classifier.
2. Vary the decision boundary.
3. Compute the FP and TP rates for each different decision boundary.
(A minimal sketch of this procedure appears after the convex-hull notes below.)

ROC convex hull (embedded from COMP-652)
• Suppose we have two hypotheses h1 and h2 along the ROC curve.
• We can always use h1 with probability p and h2 with probability (1 − p), and the performance will interpolate between the two.
• So we can always match any point on the convex hull of an empirical ROC curve.

[Figure axes: FP rate (horizontal) vs. TP rate (vertical).]
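A minimal sketch of the three-step recipe above, assuming the trained classifier outputs a real-valued score per example and class 1 is positive; the names are mine.

import numpy as np

def roc_curve_points(scores, labels):
    # Sweep the decision boundary over every distinct score (step 2),
    # computing the FP and TP rates at each boundary (step 3).
    thresholds = np.sort(np.unique(scores))[::-1]
    n_pos = np.sum(labels == 1)
    n_neg = np.sum(labels == 0)
    points = [(0.0, 0.0)]
    for t in thresholds:
        pred = scores >= t                       # predict positive above the boundary
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        points.append((fp / n_neg, tp / n_pos))  # (FP rate, TP rate)
    return points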


Page 28:

Using the ROC curve
§ To compare 2 algorithms over a range of classification thresholds, consider the Area Under the (ROC) Curve (AUC).
§ A perfect algorithm has AUC = 1.
§ A random algorithm has AUC = 0.5.
§ Higher AUC doesn’t mean all performance measures are better.



Page 29:

Using the ROC curve
§ To compare 2 algorithms over a range of classification thresholds, consider the Area Under the (ROC) Curve (AUC).
§ A perfect algorithm has AUC = 1.
§ A random algorithm has AUC = 0.5.
§ Higher AUC doesn’t mean all performance measures are better.


Intuition: AUC is equivalent to the probability of ranking a random positive example higher than a random negative example!
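That ranking interpretation can be computed directly. A sketch, assuming binary labels in {0, 1} and real-valued scores; counting ties as one half is a common convention, not something the slide specifies.

import numpy as np

def auc_by_ranking(scores, labels):
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Fraction of (positive, negative) pairs where the positive is ranked higher.
    wins = np.sum(pos[:, None] > neg[None, :]) + 0.5 * np.sum(pos[:, None] == neg[None, :])
    return wins / (len(pos) * len(neg))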


Page 30:

What about regression?
§ Evaluation for regression is generally more straightforward than classification, but there are still choices!

§ Common examples:

§ Mean Squared Error: MSE = (1/n) ∑i=1:n (yi − ŷi)2

§ Root Mean Squared Error: RMSE = √( ∑i=1:n (yi − ŷi)2 / n )

§ Mean Absolute Error: MAE = (1/n) ∑i=1:n |yi − ŷi|

§ (R)MSE penalizes larger errors more (e.g., being off by 10 is more than twice as bad as being off by 5).
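The three formulas in NumPy, as a minimal sketch with hypothetical inputs:

import numpy as np

def regression_errors(y_true, y_pred):
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(err))
    return mse, rmse, mae

# Being off by 10 on one point adds 100 to the squared-error sum,
# four times the 25 added by being off by 5 (hence "more than twice as bad").
print(regression_errors(np.array([1.0, 2.0, 3.0]), np.array([1.5, 1.0, 3.2])))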

Page 31:

Our goal: Minimize validation error
(The figure and excerpt below are from Chapter 2, "Overview of Supervised Learning", of Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning.)

[Figure 2.11: prediction error vs. model complexity (low to high), with training-sample and test-sample curves; high bias / low variance at low complexity, low bias / high variance at high complexity.]

FIGURE 2.11. Test and training error as a function of model complexity.

be close to f(x0). As k grows, the neighbors are further away, and then anything can happen.

The variance term is simply the variance of an average here, and decreases as the inverse of k. So as k varies, there is a bias-variance tradeoff.

More generally, as the model complexity of our procedure is increased, the variance tends to increase and the squared bias tends to decrease. The opposite behavior occurs as the model complexity is decreased. For k-nearest neighbors, the model complexity is controlled by k.

Typically we would like to choose our model complexity to trade bias off with variance in such a way as to minimize the test error. An obvious estimate of test error is the training error (1/N) ∑i (yi − ŷi)2. Unfortunately training error is not a good estimate of test error, as it does not properly account for model complexity.

Figure 2.11 shows the typical behavior of the test and training error, as model complexity is varied. The training error tends to decrease whenever we increase the model complexity, that is, whenever we fit the data harder. However, with too much fitting, the model adapts itself too closely to the training data, and will not generalize well (i.e., have large test error). In that case the predictions f̂(x0) will have large variance, as reflected in the last term of expression (2.46). In contrast, if the model is not complex enough, it will underfit and may have large bias, again resulting in poor generalization. In Chapter 7 we discuss methods for estimating the test error of a prediction method, and hence estimating the optimal amount of model complexity for a given prediction method and training set.

§ Overly simple model: high training error and high test error.
§ Overly complex model: low training error but high test error.


Page 32:

Our goal: Minimize validation error

[Same Figure 2.11 and textbook excerpt as the previous page.]

§ Overly simple model: high bias.
§ Overly complex model: high variance.


Page 33:

Bias vs variance in machine learning
§ Assume that y = f(x) + ε, where ε has zero mean and finite variance σ2.
§ Then for any learned model f̂ we can decompose the expected error as:

E[(y − f̂(x))2] = (Bias[f̂(x)])2 + Var[f̂(x)] + σ2

§ where

Bias[f̂(x)] = E[f̂(x)] − f(x)
Var[f̂(x)] = E[f̂(x)2] − E[f̂(x)]2


Page 34:

Bias vs variance in machine learning
§ Assume that y = f(x) + ε, where ε has zero mean and finite variance σ2.
§ Then for any learned model f̂ we can decompose the expected error as:

E[(y − f̂(x))2] = (Bias[f̂(x)])2 + Var[f̂(x)] + σ2

§ where

Bias[f̂(x)] = E[f̂(x)] − f(x)
Var[f̂(x)] = E[f̂(x)2] − E[f̂(x)]2

Bias: how far off is our mean prediction?
Variance: how wildly do our predictions vary?
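A simulation sketch of the decomposition at a single test point, under toy assumptions of my own (f = sin, Gaussian noise, polynomial fits); averaging over many training sets estimates E[f̂(x0)] and Var[f̂(x0)].

import numpy as np

rng = np.random.default_rng(0)
f = np.sin            # true function (assumed)
sigma = 0.3           # noise std; sigma^2 is the irreducible error
x0 = 1.0              # test point where we evaluate the decomposition
degree, n_train, n_trials = 3, 20, 2000

preds = np.empty(n_trials)
for t in range(n_trials):
    x = rng.uniform(0, np.pi, n_train)
    y = f(x) + rng.normal(0, sigma, n_train)   # y = f(x) + eps
    coef = np.polyfit(x, y, degree)            # one learned model f_hat per training set
    preds[t] = np.polyval(coef, x0)

bias = preds.mean() - f(x0)                    # E[f_hat(x0)] - f(x0)
var = preds.var()                              # Var[f_hat(x0)]
print(bias**2, var, sigma**2)                  # the three terms of the expected error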


Page 35:

Bias vs variance in machine learning

§ Overly simple model: high bias → underfitting.
§ Overly complex model: high variance → overfitting.


Page 36:

Overfitting, bias, and variance

§ What to do when we are overfitting?

§ Remove some features.

§ Simplify the model.

§ Reduce the variance at the cost of some bias!

§ This is called regularization.


Page 37:

Overfitting, bias, and variance

Which models have higher bias, which have higher variance?


Page 38:

Ridge regression (aka L2-regularization)
§ Constrains the weights by imposing a penalty on their size:

ŵridge = argminw { ∑i=1:n (yi − wTxi)2 + λ ∑j=0:m wj2 }

where λ can be selected manually, or by cross-validation.

§ Do a little algebra to get the closed-form solution: ŵridge = (XTX + λI)−1XTY (see the sketch below).

§ The ridge solution is not equivariant under scaling of the data, so we typically need to normalize the inputs first.

§ Ridge gives a smooth solution, effectively shrinking the weights, but it drives few weights exactly to 0.
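The closed-form solution is essentially one line of NumPy. A minimal sketch, assuming the inputs have already been normalized; solving the linear system rather than forming the inverse explicitly is a standard numerical choice, not something the slide prescribes.

import numpy as np

def ridge_fit(X, y, lam):
    # w = (X^T X + lam * I)^(-1) X^T Y, computed via a linear solve.
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)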


Page 39:

Ridge regression (aka L2-regularization)
§ Constrains the weights by imposing a penalty on their size:

ŵridge = argminw { ∑i=1:n (yi − wTxi)2 + λ ∑j=0:m wj2 }

where λ can be selected manually, or by cross-validation.

§ I.e., Err(w) = ∑i=1:n (yi − wTxi)2 + λ(||w||2)2
§ Can also simply add the penalty to gradient descent:

∂Err(w)/∂w = 2(XTXw − XTY) + 2λw

where 2(XTXw − XTY) is the usual gradient for linear regression and 2λw is the gradient of the L2-penalty term.


Page 40:

Lasso regression (aka L1-regularization)
§ Constrains the weights by penalizing the absolute value of their size:

ŵlasso = argminw { ∑i=1:n (yi − wTxi)2 + λ ∑j=1:m |wj| }

§ Now the solution is non-linear in the output y, and there is no closed-form expression. We need to solve a quadratic programming problem instead.

§ More computationally expensive than Ridge regression.

§ Effectively sets the weights of less relevant input features to zero.


Page 41:

Lasso regression (aka L1-regularization)
§ Constrains the weights by penalizing the absolute value of their size:

ŵlasso = argminw { ∑i=1:n (yi − wTxi)2 + λ ∑j=1:m |wj| }

§ I.e., Err(w) = ∑i=1:n (yi − wTxi)2 + λ||w||1
§ Can also simply add the penalty to gradient descent:

∂Err(w)/∂w = 2(XTXw − XTY) + λ sign(w)

where 2(XTXw − XTY) is the usual gradient for linear regression and λ sign(w) is the (sub)gradient of the L1-penalty term, with sign(x) = 1 if x > 0, sign(x) = −1 if x < 0, and sign(x) = 0 if x = 0.
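A minimal sketch of the subgradient update above; the learning rate and step count are arbitrary choices. Note that plain subgradient steps rarely land weights exactly on 0, which is why practical lasso solvers prefer proximal (soft-thresholding) methods.

import numpy as np

def lasso_subgradient_descent(X, y, lam, lr=1e-4, n_steps=10000):
    # Minimize sum_i (y_i - w^T x_i)^2 + lam * ||w||_1 by (sub)gradient descent.
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = 2 * (X.T @ X @ w - X.T @ y) + lam * np.sign(w)
        w -= lr * grad
    return w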


Page 42:

Comparing Ridge and Lasso

[Figure: the classic two-parameter comparison, one panel for Ridge and one for Lasso. Each shows contours of equal regression error (ellipses around the unregularized optimum w*) and contours of equal model-complexity penalty: a circle for Ridge (wTw ≤ η) and a diamond for Lasso (|w1| + |w2| ≤ η).]

Embedded notes from COMP-652:

Solving L1 regularization
• The optimization problem is a quadratic program.
• There is one constraint for each possible sign of the weights (2^n constraints for n weights).
• For example, with two weights:

min over w1, w2 of ∑j=1:m (yj − w1 xj1 − w2 xj2)2
such that  w1 + w2 ≤ η,  w1 − w2 ≤ η,  −w1 + w2 ≤ η,  −w1 − w2 ≤ η

• Solving this program directly can be done for problems with a small number of inputs.

Visualizing L1 regularization
• If λ is big enough, the circle is very likely to intersect the diamond at one of the corners.
• This makes L1 regularization much more likely to make some weights exactly 0.

L2 regularization for linear models revisited
• Optimization problem: minimize the error while keeping the norm of the weights bounded:

min over w of JD(w) = (Φw − y)T(Φw − y), such that wTw ≤ η

• The Lagrangian is: L(w, λ) = JD(w) − λ(η − wTw) = (Φw − y)T(Φw − y) + λwTw − λη
• For a fixed λ, and η = λ−1, the best w is the same as obtained by weight decay: w* = (ΦTΦ + λI)−1ΦTy

A quick look at evaluation functions (embedded from COMP-598)
• We call L(Y, fw(x)) the loss function.
  – Least-squares / mean squared-error (MSE) loss: L(Y, fw(X)) = ∑i=1:n (yi − wTxi)2
• Other loss functions?
  – Absolute error loss: L(Y, fw(X)) = ∑i=1:n |yi − wTxi|
  – 0-1 loss for classification: L(Y, fw(X)) = ∑i=1:n I(yi ≠ fw(xi))
• Different loss functions make different assumptions.
  – Squared error loss assumes the data can be approximated by a global linear model.

§ Ridge regression will tend to lower all weights.
§ Lasso will tend to set some weights to 0.
§ Combining the Lasso and Ridge penalties is called the “elastic net”. (A small comparison sketch follows below.)
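A small sketch of these takeaways using scikit-learn, on synthetic data where only two features matter; the alpha values are arbitrary.

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[:2] = [3.0, -2.0]                      # only two relevant features
y = X @ w_true + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)
print("ridge:", np.round(ridge.coef_, 2))     # all weights shrunk, few exactly 0
print("lasso:", np.round(lasso.coef_, 2))     # irrelevant weights driven to exactly 0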


Page 43:

Bias, variance, and regularization


§ Regularization (e.g., Ridge and Lasso) decreases variance at the expense of some bias.

§ Regularization reduces overfitting but can cause underfitting.

§ The L2 and L1 penalties can be added to any model (e.g., logistic regression). Easy to incorporate with gradient descent.


Page 44:

The concept of inductive bias


§ Statistical “bias” is not a bad thing in ML!
§ It is all about having the right inductive bias.
§ ML is all about coming up with different models that have different inductive biases.
§ Optional reading: “The Need for Biases in Learning Generalizations” (Tom Mitchell, 1980).
