COMP 551 - Applied Machine Learning, Lecture 6: Evaluation and Regularization


Transcript of COMP 551 - Applied Machine Learning, Lecture 6: Evaluation and Regularization (wlh/comp551/slides/06-eval_and_reg.pdf)

Page 1:

COMP 551 - Applied Machine Learning
Lecture 6: Evaluation and Regularization
William L. Hamilton (with slides and content from Joelle Pineau)
* Unless otherwise noted, all material posted for this course is copyright of the instructor, and cannot be reused or reposted without the instructor’s written permission.


Page 2:

Clarifications on generative classification
§ The proper pdf for the multivariate Gaussian is:

p(x) = (2π)−m/2 |Σ|−1/2 exp( −(x − μ)T Σ−1 (x − μ) / 2 )

(for x in ℝm, with mean μ and covariance Σ)

§ For continuous variables, we are not computing proper probabilities but instead we are comparing the probability densities of different solutions (via pdfs).
§ The “probability” of a specific point in a continuous distribution is always zero!

§ But we can compare the relative likelihoods of different points and model parameter settings by comparing the pdf values.


(The normalization constant cancels out in the LDA computation, but it is still important to include!)
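To make the density-comparison point concrete, here is a minimal Python sketch using scipy.stats.multivariate_normal; the means, shared covariance, and query point are made-up values, not from the slides.

import numpy as np
from scipy.stats import multivariate_normal

mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])  # class means (hypothetical)
Sigma = np.eye(2)                                      # shared covariance, as in LDA

x = np.array([1.2, 0.4])                               # query point

# The probability of x being exactly this point is zero; these are densities.
p0 = multivariate_normal.pdf(x, mean=mu0, cov=Sigma)
p1 = multivariate_normal.pdf(x, mean=mu1, cov=Sigma)

# But comparing them is meaningful: with equal priors, predict the class
# whose density at x is larger.
print("predict class", int(p1 > p0))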

Page 3:

Classification in the real world

[Figure: spam filtering example, with an incoming email (“?”) to be routed to “Spam” or “Not spam”.]

Page 4:

Multi-class classification


§ Generally two options:

1. Learn a single classifier that can produce 20 distinct output values.

2. Learn 20 different 1-vs-all binary classifiers.


Page 5:

Multi-class classification


§ Generally two options:
1. Learn a single classifier that can produce 20 distinct output values.
2. Learn 20 different 1-vs-all binary classifiers.

§ Option 1 assumes you have a multi-class version of the classifier.

§ For Naïve Bayes, compute P(y|x) for each class, and select the class with the highest probability.


Page 6:

Multi-class classification


§ Generally two options:
1. Learn a single classifier that can produce 20 distinct output values.
2. Learn 20 different 1-vs-all binary classifiers.

§ Option 1 assumes you have a multi-class version of the classifier.

§ For Naïve Bayes, compute P(y|x) for each class, and select the class with the highest probability.

§ Option 2 applies to all binary classifiers, so it is more flexible. But it is often slower (we need to learn many classifiers), and it creates an imbalance problem (the target class has relatively few data points compared to the aggregation of the other classes). A minimal sketch of this strategy follows below.
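As a concrete illustration of option 2, here is a minimal one-vs-all sketch in Python; scikit-learn's LogisticRegression stands in for any binary classifier, and the function names are mine, not from the course code.

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_one_vs_all(X, y, n_classes):
    # One binary classifier per class: class k vs. everything else.
    classifiers = []
    for k in range(n_classes):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X, (y == k).astype(int))  # relabeling creates the imbalance noted above
        classifiers.append(clf)
    return classifiers

def predict_one_vs_all(classifiers, X):
    # Pick the class whose binary classifier is most confident.
    scores = np.column_stack([clf.predict_proba(X)[:, 1] for clf in classifiers])
    return scores.argmax(axis=1)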


Page 7:

Classification in the real world


§ Different objectives:
§ Selecting the right model for a problem.

§ Testing performance of a new algorithm.

§ Evaluating impact on a new application.


Page 8:

Minimizing the error

§ Find the low point in the validation error. (The figure and excerpt below are from Chapter 7, "Model Assessment and Selection", of Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning.)

[Figure 7.1: prediction error vs. model complexity (df), annotated from high bias / low variance at low complexity to low bias / high variance at high complexity.]

FIGURE 7.1. Behavior of test sample and training sample error as the model complexity is varied. The light blue curves show the training error err, while the light red curves show the conditional test error ErrT for 100 training sets of size 50 each, as the model complexity is increased. The solid curves show the expected test error Err and the expected training error E[err].

Test error, also referred to as generalization error, is the prediction error over an independent test sample

ErrT = E[L(Y, f̂(X)) | T]    (7.2)

where both X and Y are drawn randomly from their joint distribution (population). Here the training set T is fixed, and test error refers to the error for this specific training set. A related quantity is the expected prediction error (or expected test error)

Err = E[L(Y, f̂(X))] = E[ErrT].    (7.3)

Note that this expectation averages over everything that is random, including the randomness in the training set that produced f̂.

Figure 7.1 shows the prediction error (light red curves) ErrT for 100 simulated training sets each of size 50. The lasso (Section 3.4.2) was used to produce the sequence of fits. The solid red curve is the average, and hence an estimate of Err.

Estimation of ErrT will be our goal, although we will see that Err is more amenable to statistical analysis, and most methods effectively estimate the expected error. It does not seem possible to estimate conditional …

(Slide annotations on the figure: the train error and validation error curves.)


Page 9:

Not all errors are created equal


§ Not all errors have equal impact!

§ Different types of mistakes, particularly with classification tasks.

§ Incorrectly classifying a tumor is very different from accidentally allowing a spam email through the filter!


Page 10:

Datasets can be imbalanced


§ Most real-world data is not “balanced”, meaning that the classes are not equally distributed.

§ It is easy to get high accuracy on imbalanced data!

§ E.g., if 99% of emails are not spam, then we can get 99% accuracy by simply predicting that everything is not spam!


Page 11:

Example 1

§ Real-world study on predicting seizures from EEG signals.

§ Claims the technique is “clinically ready for mass screening”.


Page 12:

Example 1

Accuracy = (True positives + True negatives) / Total number of examples
Sensitivity = True positives / Total number of actual positives
Specificity = True negatives / Total number of actual negatives

§ Real-world study on predicting seizures from EEG signals.

§ Claims the technique is “clinically ready for mass screening”.


Page 13:

Performance metrics for classification
§ Not all errors have equal impact!
§ There are different types of mistakes, particularly in the classification setting.
§ E.g., consider the diagnosis of a disease. Two types of misdiagnosis:
§ A patient does not have the disease but receives a positive diagnosis (Type I error);
§ A patient has the disease but it is not detected (Type II error).
§ E.g., consider the problem of spam classification:
§ A message that is not spam is assigned to the spam folder (Type I error);
§ A message that is spam appears in the regular folder (Type II error).

§ Question: How many Type I errors are you willing to tolerate, for a reasonable rate of Type II errors?


Page 14:

Example 2

§ Another real-world study on predicting seizures from EEG signals.

§ >85% sensitivity.

§ “Warning rate” of 0.35/hr.


Page 15:

Example 3

§ Another real-world study on predicting seizures from EEG signals.

§ 95% sensitivity.

§ “False alarm rate” of 0.15/hr.


Page 16:

Terminology

§ Types of classification outputs:
§ True positive (m11): Example of class 1 predicted as class 1.

§ False positive (m01): Example of class 0 predicted as class 1. Type I error.

§ True negative (m00): Example of class 0 predicted as class 0.

§ False negative (m10): Example of class 1 predicted as class 0. Type II error.

§ Total number of instances: m = m00 + m01 + m10 + m11


Page 17:

Terminology

§ Types of classification outputs:
§ True positive (m11): Example of class 1 predicted as class 1.

§ False positive (m01): Example of class 0 predicted as class 1. Type I error.

§ True negative (m00): Example of class 0 predicted as class 0.

§ False negative (m10): Example of class 1 predicted as class 0. Type II error.

§ Total number of instances: m = m00 + m01 + m10 + m11

§ Error rate: (m01 + m10) / m
§ If the classes are imbalanced (e.g., 10% from class 1, 90% from class 0), one can achieve low error (e.g., 10%) by classifying everything as coming from class 0!


Page 18:

Confusion matrix

§ Many software packages output a confusion matrix:


Page 19:

Confusion matrix

§ Many software packages output this matrix:

§ Be careful! Sometimes the format is slightly different, e.g.,

Confusion matrix (embedded from COMP-652):

            predicted 0   predicted 1
actual 0       m00           m01
actual 1       m10           m11

• A confusion matrix gives more information than the error rate alone.
• Many software packages (e.g., Weka) output this matrix.
• Varying the parameter of the algorithm produces a curve.


Page 20:

Common measures
§ Accuracy = (TP + TN) / (TP + FP + FN + TN)
§ Precision = True positives / Total number of declared positives = TP / (TP + FP)
§ Recall = True positives / Total number of actual positives = TP / (TP + FN)


Page 21:

Common measures
§ Accuracy = (TP + TN) / (TP + FP + FN + TN)
§ Precision = True positives / Total number of declared positives = TP / (TP + FP)
§ Recall = True positives / Total number of actual positives = TP / (TP + FN)
§ Sensitivity is the same as recall.
§ Specificity = True negatives / Total number of actual negatives = TN / (FP + TN)

(Precision and recall are the usual terms in text classification; sensitivity and specificity in medicine.)


Page 22:

Common measures
§ Accuracy = (TP + TN) / (TP + FP + FN + TN)
§ Precision = True positives / Total number of declared positives = TP / (TP + FP)
§ Recall = True positives / Total number of actual positives = TP / (TP + FN)
§ Sensitivity is the same as recall.
§ Specificity = True negatives / Total number of actual negatives = TN / (FP + TN)
§ False positive rate = FP / (FP + TN)



Page 23:

Common measures
§ Accuracy = (TP + TN) / (TP + FP + FN + TN)
§ Precision = True positives / Total number of declared positives = TP / (TP + FP)
§ Recall = True positives / Total number of actual positives = TP / (TP + FN)
§ Sensitivity is the same as recall.
§ Specificity = True negatives / Total number of actual negatives = TN / (FP + TN)
§ False positive rate = FP / (FP + TN)
§ F1 measure = 2 · Precision · Recall / (Precision + Recall), i.e., the harmonic mean of precision and recall.

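These definitions translate directly into code. A small sketch computing all of the above from the four confusion-matrix counts; the function name and example numbers are hypothetical.

def classification_metrics(tp, fp, tn, fn):
    accuracy    = (tp + tn) / (tp + fp + fn + tn)
    precision   = tp / (tp + fp)       # of declared positives, how many are real
    recall      = tp / (tp + fn)       # a.k.a. sensitivity
    specificity = tn / (fp + tn)
    fpr         = fp / (fp + tn)       # false positive rate = 1 - specificity
    f1          = 2 * precision * recall / (precision + recall)
    return dict(accuracy=accuracy, precision=precision, recall=recall,
                specificity=specificity, fpr=fpr, f1=f1)

print(classification_metrics(tp=8, fp=2, tn=85, fn=5))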


Page 24:

Trade-off: False positives and false negatives
§ Often there is a trade-off between false positives and false negatives.
§ E.g., consider 30 different classifiers trained on a class. Classify a new sample as positive if K classifiers output positive. Vary K between 0 and 30.

Example: Tree bagging (embedded from COMP-652)
• 30 decision trees; classify an example as positive if K trees classify it as positive.
• Vary K between 0 and 30.

Precision-recall (embedded from COMP-652)
• Similar concept to AUC curves, but used in retrieval tasks.
• Precision is true positives / total number of documents retrieved.
• Recall is true positives / all positives.
• In medical applications we use instead sensitivity and selectivity, which are the recall for the two classes.
• Plot recall on the horizontal axis, precision on the vertical axis, and vary the threshold for making positive predictions (or vary K).


Page 25:

Receiver operating characteristic (ROC)
§ Consider simple 2-class classification (blue = negative class; red = positive class).

§ Consider the decision boundary (vertical line in the left figure), where you predict Negative on the left of the boundary and predict Positive on the right.

§ Changing that boundary defines the ROC curve on the right.

[Figure: left, the two class distributions with the decision boundary (predict positive / predict negative); right, the resulting ROC curve.]


Page 26:

Understanding the ROC curve
§ Consider simple 2-class classification (blue = negative class; red = positive class).

§ Consider the decision boundary (vertical line in the left figure), where you predict Negative on the left of the boundary and predict Positive on the right.

§ Changing that boundary defines the ROC curve on the right.

[Figure: same setup as the previous slide.]

§ The diagonal line corresponds to a random classifier.


Page 27:

Building the ROC curve
§ In many domains, the empirical ROC curve will be non-convex (red line). Take the convex hull of the points (blue line).

§ To build the ROC curve:
1. Train a classifier.
2. Vary the decision boundary.
3. Compute the FP and TP rates for each different decision boundary.
(A minimal sketch of this procedure appears after the convex-hull notes below.)

ROC convex hull (embedded from COMP-652)
• Suppose we have two hypotheses h1 and h2 along the ROC curve.
• We can always use h1 with probability p and h2 with probability (1 − p), and the performance will interpolate between the two.
• So we can always match any point on the convex hull of an empirical ROC curve.

[Figure axes: FP rate (horizontal) vs. TP rate (vertical).]
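A minimal sketch of the three-step recipe above, assuming the trained classifier outputs a real-valued score per example and class 1 is positive; the names are mine.

import numpy as np

def roc_curve_points(scores, labels):
    # Sweep the decision boundary over every distinct score (step 2),
    # computing the FP and TP rates at each boundary (step 3).
    thresholds = np.sort(np.unique(scores))[::-1]
    n_pos = np.sum(labels == 1)
    n_neg = np.sum(labels == 0)
    points = [(0.0, 0.0)]
    for t in thresholds:
        pred = scores >= t                       # predict positive above the boundary
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        points.append((fp / n_neg, tp / n_pos))  # (FP rate, TP rate)
    return points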


Page 28:

Using the ROC curve
§ To compare 2 algorithms over a range of classification thresholds, consider the Area Under the (ROC) Curve (AUC).
§ A perfect algorithm has AUC = 1.
§ A random algorithm has AUC = 0.5.
§ Higher AUC doesn’t mean all performance measures are better.



Page 29:

Using the ROC curve
§ To compare 2 algorithms over a range of classification thresholds, consider the Area Under the (ROC) Curve (AUC).
§ A perfect algorithm has AUC = 1.
§ A random algorithm has AUC = 0.5.
§ Higher AUC doesn’t mean all performance measures are better.


Intuition: AUC is equivalent to the probability of ranking a random positive example higher than a random negative example!
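That ranking interpretation can be computed directly. A sketch, assuming binary labels in {0, 1} and real-valued scores; counting ties as one half is a common convention, not something the slide specifies.

import numpy as np

def auc_by_ranking(scores, labels):
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Fraction of (positive, negative) pairs where the positive is ranked higher.
    wins = np.sum(pos[:, None] > neg[None, :]) + 0.5 * np.sum(pos[:, None] == neg[None, :])
    return wins / (len(pos) * len(neg))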


Page 30:

What about regression?
§ Evaluation for regression is generally more straightforward than classification, but there are still choices!

§ Common examples:

§ Mean Squared Error: MSE = (1/n) ∑i=1:n (yi − ŷi)2

§ Root Mean Squared Error: RMSE = √( ∑i=1:n (yi − ŷi)2 / n )

§ Mean Absolute Error: MAE = (1/n) ∑i=1:n |yi − ŷi|

§ (R)MSE penalizes larger errors more (e.g., being off by 10 is more than twice as bad as being off by 5).
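The three formulas in NumPy, as a minimal sketch with hypothetical inputs:

import numpy as np

def regression_errors(y_true, y_pred):
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(err))
    return mse, rmse, mae

# Being off by 10 on one point adds 100 to the squared-error sum,
# four times the 25 added by being off by 5 (hence "more than twice as bad").
print(regression_errors(np.array([1.0, 2.0, 3.0]), np.array([1.5, 1.0, 3.2])))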

Page 31:

Our goal: Minimize validation error
(The figure and excerpt below are from Chapter 2, "Overview of Supervised Learning", of Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning.)

[Figure 2.11: prediction error vs. model complexity (low to high), with training-sample and test-sample curves; high bias / low variance at low complexity, low bias / high variance at high complexity.]

FIGURE 2.11. Test and training error as a function of model complexity.

be close to f(x0). As k grows, the neighbors are further away, and then anything can happen.

The variance term is simply the variance of an average here, and decreases as the inverse of k. So as k varies, there is a bias-variance tradeoff.

More generally, as the model complexity of our procedure is increased, the variance tends to increase and the squared bias tends to decrease. The opposite behavior occurs as the model complexity is decreased. For k-nearest neighbors, the model complexity is controlled by k.

Typically we would like to choose our model complexity to trade bias off with variance in such a way as to minimize the test error. An obvious estimate of test error is the training error (1/N) ∑i (yi − ŷi)2. Unfortunately training error is not a good estimate of test error, as it does not properly account for model complexity.

Figure 2.11 shows the typical behavior of the test and training error, as model complexity is varied. The training error tends to decrease whenever we increase the model complexity, that is, whenever we fit the data harder. However, with too much fitting, the model adapts itself too closely to the training data, and will not generalize well (i.e., have large test error). In that case the predictions f̂(x0) will have large variance, as reflected in the last term of expression (2.46). In contrast, if the model is not complex enough, it will underfit and may have large bias, again resulting in poor generalization. In Chapter 7 we discuss methods for estimating the test error of a prediction method, and hence estimating the optimal amount of model complexity for a given prediction method and training set.

§ Overly simple model: high training error and high test error.
§ Overly complex model: low training error but high test error.


Page 32:

Our goal: Minimize validation error

[Same Figure 2.11 and textbook excerpt as the previous page.]

§ Overly simple model: high bias.
§ Overly complex model: high variance.


Page 33:

Bias vs variance in machine learning
§ Assume that y = f(x) + ε, where ε has zero mean and finite variance σ2.
§ Then for any learned model f̂ we can decompose the expected error as:

E[(y − f̂(x))2] = (Bias[f̂(x)])2 + Var[f̂(x)] + σ2

§ where

Bias[f̂(x)] = E[f̂(x)] − f(x)
Var[f̂(x)] = E[f̂(x)2] − E[f̂(x)]2


Page 34:

Bias vs variance in machine learning
§ Assume that y = f(x) + ε, where ε has zero mean and finite variance σ2.
§ Then for any learned model f̂ we can decompose the expected error as:

E[(y − f̂(x))2] = (Bias[f̂(x)])2 + Var[f̂(x)] + σ2

§ where

Bias[f̂(x)] = E[f̂(x)] − f(x)
Var[f̂(x)] = E[f̂(x)2] − E[f̂(x)]2

Bias: how far off is our mean prediction?
Variance: how wildly do our predictions vary?
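A simulation sketch of the decomposition at a single test point, under toy assumptions of my own (f = sin, Gaussian noise, polynomial fits); averaging over many training sets estimates E[f̂(x0)] and Var[f̂(x0)].

import numpy as np

rng = np.random.default_rng(0)
f = np.sin            # true function (assumed)
sigma = 0.3           # noise std; sigma^2 is the irreducible error
x0 = 1.0              # test point where we evaluate the decomposition
degree, n_train, n_trials = 3, 20, 2000

preds = np.empty(n_trials)
for t in range(n_trials):
    x = rng.uniform(0, np.pi, n_train)
    y = f(x) + rng.normal(0, sigma, n_train)   # y = f(x) + eps
    coef = np.polyfit(x, y, degree)            # one learned model f_hat per training set
    preds[t] = np.polyval(coef, x0)

bias = preds.mean() - f(x0)                    # E[f_hat(x0)] - f(x0)
var = preds.var()                              # Var[f_hat(x0)]
print(bias**2, var, sigma**2)                  # the three terms of the expected error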


Page 35:

Bias vs variance in machine learning

§ Overly simple model: high bias → underfitting.
§ Overly complex model: high variance → overfitting.


Page 36:

Overfitting, bias, and variance

§ What to do when we are overfitting?

§ Remove some features.

§ Simplify the model.

§ Reduce the variance at the cost of some bias!

§ This is called regularization.


Page 37:

Overfitting, bias, and variance

Which models have higher bias, which have higher variance?


Page 38:

Ridge regression (aka L2-regularization)
§ Constrains the weights by imposing a penalty on their size:

ŵridge = argminw { ∑i=1:n (yi − wTxi)2 + λ ∑j=0:m wj2 }

where λ can be selected manually, or by cross-validation.

§ Do a little algebra to get the closed-form solution: ŵridge = (XTX + λI)−1XTY (see the sketch below).

§ The ridge solution is not equivariant under scaling of the data, so we typically need to normalize the inputs first.

§ Ridge gives a smooth solution, effectively shrinking the weights, but it drives few weights exactly to 0.
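The closed-form solution is essentially one line of NumPy. A minimal sketch, assuming the inputs have already been normalized; solving the linear system rather than forming the inverse explicitly is a standard numerical choice, not something the slide prescribes.

import numpy as np

def ridge_fit(X, y, lam):
    # w = (X^T X + lam * I)^(-1) X^T Y, computed via a linear solve.
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)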


Page 39:

Ridge regression (aka L2-regularization)
§ Constrains the weights by imposing a penalty on their size:

ŵridge = argminw { ∑i=1:n (yi − wTxi)2 + λ ∑j=0:m wj2 }

where λ can be selected manually, or by cross-validation.

§ I.e., Err(w) = ∑i=1:n (yi − wTxi)2 + λ(||w||2)2
§ Can also simply add the penalty to gradient descent:

∂Err(w)/∂w = 2(XTXw − XTY) + 2λw

where 2(XTXw − XTY) is the usual gradient for linear regression and 2λw is the gradient of the L2-penalty term.


Page 40:

Lasso regression (aka L1-regularization)
§ Constrains the weights by penalizing the absolute value of their size:

ŵlasso = argminw { ∑i=1:n (yi − wTxi)2 + λ ∑j=1:m |wj| }

§ Now the solution is non-linear in the output y, and there is no closed-form expression. We need to solve a quadratic programming problem instead.

§ More computationally expensive than Ridge regression.

§ Effectively sets the weights of less relevant input features to zero.


Page 41:

Lasso regression (aka L1-regularization)
§ Constrains the weights by penalizing the absolute value of their size:

ŵlasso = argminw { ∑i=1:n (yi − wTxi)2 + λ ∑j=1:m |wj| }

§ I.e., Err(w) = ∑i=1:n (yi − wTxi)2 + λ||w||1
§ Can also simply add the penalty to gradient descent:

∂Err(w)/∂w = 2(XTXw − XTY) + λ sign(w)

where 2(XTXw − XTY) is the usual gradient for linear regression and λ sign(w) is the (sub)gradient of the L1-penalty term, with sign(x) = 1 if x > 0, sign(x) = −1 if x < 0, and sign(x) = 0 if x = 0.
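A minimal sketch of the subgradient update above; the learning rate and step count are arbitrary choices. Note that plain subgradient steps rarely land weights exactly on 0, which is why practical lasso solvers prefer proximal (soft-thresholding) methods.

import numpy as np

def lasso_subgradient_descent(X, y, lam, lr=1e-4, n_steps=10000):
    # Minimize sum_i (y_i - w^T x_i)^2 + lam * ||w||_1 by (sub)gradient descent.
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = 2 * (X.T @ X @ w - X.T @ y) + lam * np.sign(w)
        w -= lr * grad
    return w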


Page 42:

Comparing Ridge and Lasso

[Figure: the classic two-parameter comparison, one panel for Ridge and one for Lasso. Each shows contours of equal regression error (ellipses around the unregularized optimum w*) and contours of equal model-complexity penalty: a circle for Ridge (wTw ≤ η) and a diamond for Lasso (|w1| + |w2| ≤ η).]

Embedded notes from COMP-652:

Solving L1 regularization
• The optimization problem is a quadratic program.
• There is one constraint for each possible sign of the weights (2^n constraints for n weights).
• For example, with two weights:

min over w1, w2 of ∑j=1:m (yj − w1 xj1 − w2 xj2)2
such that  w1 + w2 ≤ η,  w1 − w2 ≤ η,  −w1 + w2 ≤ η,  −w1 − w2 ≤ η

• Solving this program directly can be done for problems with a small number of inputs.

Visualizing L1 regularization
• If λ is big enough, the circle is very likely to intersect the diamond at one of the corners.
• This makes L1 regularization much more likely to make some weights exactly 0.

L2 regularization for linear models revisited
• Optimization problem: minimize the error while keeping the norm of the weights bounded:

min over w of JD(w) = (Φw − y)T(Φw − y), such that wTw ≤ η

• The Lagrangian is: L(w, λ) = JD(w) − λ(η − wTw) = (Φw − y)T(Φw − y) + λwTw − λη
• For a fixed λ, and η = λ−1, the best w is the same as obtained by weight decay: w* = (ΦTΦ + λI)−1ΦTy

A quick look at evaluation functions (embedded from COMP-598)
• We call L(Y, fw(x)) the loss function.
  – Least-squares / mean squared-error (MSE) loss: L(Y, fw(X)) = ∑i=1:n (yi − wTxi)2
• Other loss functions?
  – Absolute error loss: L(Y, fw(X)) = ∑i=1:n |yi − wTxi|
  – 0-1 loss for classification: L(Y, fw(X)) = ∑i=1:n I(yi ≠ fw(xi))
• Different loss functions make different assumptions.
  – Squared error loss assumes the data can be approximated by a global linear model.

§ Ridge regression will tend to lower all weights.
§ Lasso will tend to set some weights to 0.
§ Combining the Lasso and Ridge penalties is called the “elastic net”. (A small comparison sketch follows below.)
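A small sketch of these takeaways using scikit-learn, on synthetic data where only two features matter; the alpha values are arbitrary.

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[:2] = [3.0, -2.0]                      # only two relevant features
y = X @ w_true + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)
print("ridge:", np.round(ridge.coef_, 2))     # all weights shrunk, few exactly 0
print("lasso:", np.round(lasso.coef_, 2))     # irrelevant weights driven to exactly 0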


Page 43:

Bias, variance, and regularization


§ Regularization (e.g., Ridge and Lasso) decreases variance at the expense of some bias.

§ Regularization reduces overfitting but can cause underfitting.

§ The L2 and L1 penalties can be added to any model (e.g., logistic regression). Easy to incorporate with gradient descent.


Page 44:

The concept of inductive bias


§ Statistical “bias” is not a bad thing in ML!
§ It is all about having the right inductive bias.
§ ML is all about coming up with different models that have different inductive biases.
§ Optional reading: “The Need for Biases in Learning Generalizations” (Tom Mitchell, 1980).
