Transcript of COMP 551 – Applied Machine Learning, Lecture 3: Linear regression (cont’d)

Page 1:

COMP 551 – Applied Machine Learning
Lecture 3: Linear regression (cont’d)

Instructor: Herke van Hoof ([email protected])

Slides mostly by: Joelle Pineau

Class web page: www.cs.mcgill.ca/~hvanho2/comp551

Unless otherwise noted, all material posted for this course is copyright of the instructors, and cannot be reused or reposted without the instructors’ written permission.

Page 2:

What we saw last time

COMP-551: Applied Machine Learning

• Definition and characteristics of a supervised learning problem.

• Linear regression (hypothesis class, cost function, algorithm).

• Closed-form least-squares solution method (algorithm, computational complexity, stability issues).

• Alternative optimization methods via gradient descent.

Page 3:

This function looks complicated, and a linear hypothesis does not seem very good.

What should we do?

COMP-551: Applied Machine Learning

Predicting recurrence time from tumor size

Page 4:

This function looks complicated, and a linear hypothesis does not seem very good.

What should we do?

• Pick a better function?

• Use more features?

• Get more data?

COMP-551: Applied Machine Learning

Predicting recurrence time from tumor size

Page 5:

Input variables for linear regression

• Original quantitative variables X1, …, Xm

• Transformations of variables, e.g. Xm+1 = log(Xi)

• Basis expansions, e.g. Xm+1 = Xi², Xm+2 = Xi³, …

• Interaction terms, e.g. Xm+1 = Xi Xj

• Numeric coding of qualitative variables, e.g. Xm+1 = 1 if Xi is true and 0 otherwise.

In all cases, we can add Xm+1, …, Xm+k to the list of original variables and perform the linear regression (see the sketch below).

COMP-551: Applied Machine Learning
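For illustration, here is a minimal numpy sketch of such a feature expansion (not from the slides; the toy data and column choices are made up). The key point is that after building the expanded design matrix, the fit is still ordinary least squares.

```python
import numpy as np

# Toy data: two quantitative inputs and one qualitative (boolean) input.
x1 = np.array([0.5, 1.2, 2.0, 3.1, 4.4])
x2 = np.array([10.0, 8.0, 6.5, 5.0, 3.0])
x3 = np.array([True, False, True, True, False])
y  = np.array([1.1, 1.9, 3.2, 4.8, 6.1])

# Expanded design matrix: original variables, a log transformation,
# a basis-expansion (squared) term, an interaction term, a 0/1 coding
# of the qualitative variable, and a column of ones for the intercept.
X = np.column_stack([
    x1, x2,              # original quantitative variables
    np.log(x2),          # transformation of a variable
    x1 ** 2,             # basis expansion
    x1 * x2,             # interaction term
    x3.astype(float),    # numeric coding of a qualitative variable
    np.ones_like(x1),    # intercept
])

# The regression itself is unchanged: ordinary linear least squares.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)
```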

Page 6:

Example of linear regression with polynomial terms

Answer: Polynomial regression

• Given data: (x1, y1), (x2, y2), . . . , (xm, ym).

• Suppose we want a degree-d polynomial fit.

• Let Y be as before and let

X = [ x1^d  …  x1^2  x1  1
      x2^d  …  x2^2  x2  1
        ⋮           ⋮   ⋮
      xm^d  …  xm^2  xm  1 ]

• Solve the linear regression Xw ≈ Y.

Example of quadratic regression: Data matrices

X = [ 0.75   0.86  1
      0.01   0.09  1
      0.73  −0.85  1
      0.76   0.87  1
      0.19  −0.44  1
      0.18  −0.43  1
      1.22  −1.10  1
      0.16   0.40  1
      0.93  −0.96  1
      0.03   0.17  1 ]

Y = [ 2.49, 0.83, −0.25, 3.10, 0.87, 0.02, −0.12, 1.81, −0.83, 0.43 ]ᵀ

COMP-551: Applied Machine Learning

fw(x) = w0 + w1 x + w2 x²   (columns of X: x², x, 1)

Page 7:

Solving the problem

COMP-551: Applied Machine Learning

Solving for w

w = (XᵀX)⁻¹XᵀY

    ⎡  4.11  −1.64   4.95 ⎤⁻¹ ⎡ 3.60 ⎤   ⎡ 0.68 ⎤
  = ⎢ −1.64   4.95  −1.39 ⎥   ⎢ 6.49 ⎥ = ⎢ 1.74 ⎥
    ⎣  4.95  −1.39  10.00 ⎦   ⎣ 8.34 ⎦   ⎣ 0.73 ⎦

So the best order-2 polynomial is y = 0.68x² + 1.74x + 0.73.

Order-2 fit


Is this a better fit to the data?


Compared to y = 1.6x + 1.05 for the order-1 polynomial.
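To make the computation concrete, here is a minimal numpy sketch (not from the slides) that rebuilds the quadratic design matrix for the ten points above and solves the normal equations; it should give coefficients close to the slide's w = (0.68, 1.74, 0.73).

```python
import numpy as np

# The ten (x, y) points from the quadratic-regression example above.
x = np.array([0.86, 0.09, -0.85, 0.87, -0.44, -0.43, -1.10, 0.40, -0.96, 0.17])
y = np.array([2.49, 0.83, -0.25, 3.10, 0.87, 0.02, -0.12, 1.81, -0.83, 0.43])

def poly_design(x, d):
    """Design matrix with columns [x^d, ..., x^2, x, 1], as on the slide."""
    return np.vander(x, d + 1)  # np.vander uses decreasing powers by default

X = poly_design(x, 2)
# Closed-form least-squares solution: w = (X^T X)^{-1} X^T Y
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # roughly [0.68, 1.74, 0.73]
```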

Page 8:

Order-3 fit: Is this better?

Page 9:

Order-4 fit

Page 10:

Order-5 fit

Page 11:

Order-6 fit

Page 12:

Order-7 fit

Page 13:

Order-8 fit

Page 14:

Order-9 fit

Page 15:

This is overfitting!

Page 16:

This is overfitting!

• We can find a hypothesis that perfectly explains the training data, but does not generalize well to new data.

• In this example: we have a lot of parameters (weights), so the hypothesis matches the data points exactly, but is wild everywhere else.

• A very important problem in machine learning.

Page 17:

Overfitting

• Every hypothesis has a true error measured on all possible data items we could ever encounter (e.g. fw(xi) - yi).

• Since we don’t have all possible data, in order to decide what is a good hypothesis, we measure error over the training set.

• Formally: Suppose we compare hypotheses f1 and f2.

– Assume f1 has lower error on the training set.

– If f2 has lower true error, then our algorithm is overfitting.

Page 18:

Overfitting

• Which hypothesis has the lowest true error?

COMP-551: Applied Machine Learning

[Figure: polynomial fits of degree d = 1 through d = 8 on the same data]

Page 19:

Cross-Validation

• Partition your data into a Training Set and a Validation Set.

– The proportions in each set can vary.

• Use the Training Set to find the best hypothesis in the class.

• Use the Validation Set to evaluate the true prediction error.
  – Compare across different hypothesis classes (different order polynomials).

(Repeated from Page 6: the degree-d polynomial design matrix and the quadratic-regression data matrices X and Y, here marked with Train: and Validate: portions to illustrate the split.)

Page 20:

k-fold Cross-Validation

• Consider k partitions of the data (usually of equal size).
• Train with k−1 subsets, validate on the kth subset. Repeat k times.
• Average the prediction error over the k rounds/folds.

Source: http://stackoverflow.com/questions/31947183/how-to-implement-walk-forward-testing-in-sklearn

Page 21:

k-fold Cross-Validation

• Consider k partitions of the data (usually of equal size).
• Train with k−1 subsets, validate on the kth subset. Repeat k times.
• Average the prediction error over the k rounds/folds.

• Computation time is increased by factor of k.

Source: http://stackoverflow.com/questions/31947183/how-to-implement-walk-forward-testing-in-sklearn
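A minimal numpy sketch of k-fold cross-validation for a least-squares fit (an illustration, not code from the course; the function name is made up):

```python
import numpy as np

def kfold_mse(X, y, k=5, seed=0):
    """Estimate prediction error of least squares by k-fold cross-validation."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)           # k partitions, roughly equal size
    errors = []
    for i in range(k):
        val = folds[i]                       # hold out the i-th fold
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errors.append(np.mean((y[val] - X[val] @ w) ** 2))
    return np.mean(errors)                   # average over the k folds
```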

Page 22:

• Let k = n, the size of the training set

• For each order-d hypothesis class,

  – Repeat n times:
    • Set aside one instance <xi, yi> from the training set.
    • Use all other data points to find w (optimization).
    • Measure prediction error on the held-out <xi, yi>.

– Average the prediction error over all n subsets.

• Choose the d with lowest estimated true prediction error.

Leave-one-out cross-validation
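A possible numpy sketch of this procedure (an assumed implementation, not the course's code), using the leave-one-out error to choose the polynomial degree for the ten-point dataset above; the later slides report d = 2 as the optimal choice for this data.

```python
import numpy as np

def loocv_error(x, y, d):
    """Leave-one-out cross-validation error for a degree-d polynomial fit."""
    n = len(y)
    errs = []
    for i in range(n):
        mask = np.arange(n) != i                   # leave out point i
        X_train = np.vander(x[mask], d + 1)
        w, *_ = np.linalg.lstsq(X_train, y[mask], rcond=None)
        pred_i = np.vander(x[i:i + 1], d + 1) @ w  # predict the held-out point
        errs.append((y[i] - pred_i[0]) ** 2)
    return np.mean(errs)

x = np.array([0.86, 0.09, -0.85, 0.87, -0.44, -0.43, -1.10, 0.40, -0.96, 0.17])
y = np.array([2.49, 0.83, -0.25, 3.10, 0.87, 0.02, -0.12, 1.81, -0.83, 0.43])
# Choose the degree with the lowest estimated true prediction error.
best_d = min(range(1, 9), key=lambda d: loocv_error(x, y, d))
print(best_d)
```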

Page 23:

Estimating true error for d=1

   x      y
  0.86   2.49
  0.09   0.83
 −0.85  −0.25
  0.87   3.10
 −0.44   0.87
 −0.43   0.02
 −1.10  −0.12
  0.40   1.81
 −0.96  −0.83
  0.17   0.43

[Panels: Data (table above); Cross-validation results (plot)]

Page 24:

Cross-validation results

• Optimal choice: d=2. Overfitting for d > 2.

Page 25:

Evaluation

• We use cross-validation for model selection.

• Available labeled data is split into two parts:

– Training set is used to select a hypothesis f from a class of hypotheses F (e.g. regression of a given degree).

– Validation set is used to compare the best f from each hypothesis class across different classes (e.g. different degree regression).

• Must be untouched during the process of looking for f within a class F.

Page 26:

Evaluation

• After adapting the weights to minimize the error on the training set, the weights could be exploiting particularities of the training set:
  – We have to use the validation set as a proxy for the true error.

• After choosing the hypothesis class to minimize error on the validation set, the hypothesis class could be adapted to particularities of the validation set:
  – The validation set is no longer a good proxy for the true error!

COMP-551: Applied Machine Learning

Page 27:

Evaluation

• We use cross-validation for model selection.

• Available labeled data is split into parts:

– Training set is used to select a hypothesis f from a class of hypotheses F (e.g. regression of a given degree).

– Validation set is used to compare the best f from each hypothesis class across different classes (e.g. different degree regression).

• Must be untouched during the process of looking for f within a class F.

• Test set: ideally, a separate set of (labeled) data is withheld to get a true estimate of the generalization error.
  – It cannot be touched during the process of selecting F.
  (Often the “validation set” is called the “test set”, without distinction.)

Page 28:

Validation vs Train error

[Figure 2.11 from Hastie et al.: test and training error as a function of model complexity. Training error decreases steadily as model complexity grows; test error first decreases, then rises again once the model overfits. Low complexity corresponds to high bias / low variance, high complexity to low bias / high variance.]

COMP-551: Applied Machine Learning

[From Hastie et al. textbook]

Page 29:

Understanding the error

Given a set of examples <X, Y>, assume that y = f(x) + ∊, where ∊ is Gaussian noise with zero mean and standard deviation σ.

COMP-551: Applied Machine Learning

Polynomial regression wrap-up

• We can add to the training data all cross-product terms up to some degree d.
• Fitting parameters is no different from linear regression.
• We can use cross-validation to choose the best order of polynomial to fit our data.
• Because the number of parameters explodes with the degree of the polynomial (see homework 1), we will often use only select cross-terms or higher-order powers (based on domain knowledge).
• Always use cross-validation, and report results on testing data (completely untouched in the training process).

The anatomy of the error

• Suppose we have examples ⟨x, y⟩ where y = f(x) + ∊ and ∊ is Gaussian noise with zero mean and standard deviation σ.
• Reminder: the normal (Gaussian) distribution N(x | µ, σ²)
  [Figure: Gaussian density centered at µ with width 2σ; see Bishop Ch. 2 for review]

Page 30:

Understanding the error

• Consider standard linear regression solution:

Err(w) = ∑i=1:n ( yi - wTxi)2

• If we consider only the class of linear hypotheses, we have systematic prediction error, called bias, whenever the data is generated by a non-linear function.

• Depending on what dataset we observed, we may get different solutions. Thus we can also have error due to this variance.
  – This occurs even if the data is generated from the class of linear functions.

COMP-551: Applied Machine Learning

Page 31:

An example (from Tom Dietterich)

• The circles are data points. X is drawn uniformly randomly. Y is generated by the function y = 2sin(0.5x) + ∊.

COMP-551: Applied Machine Learning

The anatomy of the error: Linear regression

• In linear regression, given a set of examples ⟨xi, yi⟩, i = 1…m, we fit a linear hypothesis h(x) = wᵀx so as to minimize the sum-squared error over the training data: ∑i=1:m (yi − h(xi))²
• Because of the hypothesis class that we chose (hypotheses linear in the parameters), for some target functions f we will have a systematic prediction error.
• Even if f were truly from the hypothesis class we picked, depending on the data set we have, the parameters w that we find may be different; this variability due to the specific data set on hand is a different source of error.

An example (Tom Dietterich)

• The sine is the true function.
• The circles are data points (x drawn uniformly randomly, y given by the formula).
• The straight line is the linear regression fit (see lecture 1).

Page 32:

An example (from Tom Dietterich)

• With different sets of 20 points, we get different lines.

COMP-551: Applied Machine Learning

Example continued

With different sets of 20 points, we get different lines.

Bias-variance analysis

• Given a new data point x, what is the expected prediction error?
• Assume that the data points are drawn independently and identically distributed (i.i.d.) from a unique underlying probability distribution P(⟨x, y⟩) = P(x)P(y|x).
• The goal of the analysis is to compute, for an arbitrary given point x,
      E_P[ (y − h(x))² | x ]
  where y is the value of x in a data set, and the expectation is over all training sets of a given size, drawn according to P.
• For a given hypothesis class, we can also compute the true error, which is the expected error over the input distribution:
      ∑_x E_P[ (y − h(x))² | x ] P(x)
  (if x is continuous, the sum becomes an integral with appropriate conditions).
• We will decompose this expectation into three components.

Page 33:

Bias-variance decomposition

• Compare f(x) (true value at x) to the prediction h(x).

• Error: E[(h(x) − f(x))²]

• Bias: f(x) − h̄(x), where h̄(x) = E[h(x)] is the average estimate
  – How far is the average estimate from the true function?

• Variance: E[(h(x) − h̄(x))²]
  – How far are estimates (on average) from the average estimate?

• Notes:
  – Expectations are over all possible datasets (noise realizations).

COMP-551: Applied Machine Learning
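One way to see these quantities concretely (an illustration, not from the slides) is to simulate: repeatedly sample a training set, fit a model, and record its prediction at a fixed point x0. The setup below assumes the y = 2sin(0.5x) + ∊ example from the Dietterich slides above.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 2 * np.sin(0.5 * x)            # true function (Dietterich example)

def fit_and_predict(x0, degree, n=20, sigma=1.0):
    """Sample one training set, fit a degree-`degree` polynomial, predict at x0."""
    x = rng.uniform(-5, 5, n)
    y = f(x) + rng.normal(0, sigma, n)
    w, *_ = np.linalg.lstsq(np.vander(x, degree + 1), y, rcond=None)
    return (np.vander(np.array([x0]), degree + 1) @ w)[0]

x0 = 2.0
preds = np.array([fit_and_predict(x0, degree=1) for _ in range(1000)])
bias = f(x0) - preds.mean()      # how far the average estimate is from f(x0)
variance = preds.var()           # spread of the estimates around their average
print(bias ** 2, variance)
```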

Page 34:

Bias-variance decomposition

• Can show that: Error = bias² + variance
  (e.g. https://web.engr.oregonstate.edu/~tgd/classes/534/slides/part9.pdf)

• What happens if we apply a 1st-order model to a quadratic dataset?
  – No matter how much data we have, we can’t fit the dataset.
  – This means we have bias (underfitting).

• What happens if we apply a 2nd-order model to a linear dataset?
  – The model can definitely fit the dataset! (no bias)
  – Spurious parameters make the model sensitive to noise.
  – Higher variance than necessary (overfitting if the dataset is small).

COMP-551: Applied Machine Learning

Page 35:

Bias-variance decomposition

• Can show that: Error = bias² + variance
  (e.g. https://web.engr.oregonstate.edu/~tgd/classes/534/slides/part9.pdf)

• What happens if we apply a 1st-order model to a quadratic dataset?
  – No matter how much data we have, we can’t fit the dataset.
  – This means we have bias (underfitting).

• What happens if we apply a 2nd-order model to a linear dataset?
  – The model can definitely fit the dataset! (no bias)
  – Spurious parameters make the model sensitive to noise.
  – Higher variance than necessary (overfitting if the dataset is small).

COMP-551: Applied Machine Learning

Page 36:

Validation vs Train error

[Figure 2.11 from Hastie et al.: test and training error as a function of model complexity. Training error decreases steadily as model complexity grows; test error first decreases, then rises again once the model overfits. Low complexity corresponds to high bias / low variance, high complexity to low bias / high variance.]

COMP-551: Applied Machine Learning

[From Hastie et al. textbook]

Page 37:

Bias-variance decomposition

• Can show that: Error = bias² + variance
  (e.g. https://web.engr.oregonstate.edu/~tgd/classes/534/slides/part9.pdf)

• What happens if we apply a 1st-order model to a quadratic dataset?
  – No matter how much data we have, we can’t fit the dataset.
  – This means we have bias (underfitting).

• What happens if we apply a 2nd-order model to a linear dataset?
  – The model can definitely fit the dataset! (no bias)
  – Spurious parameters make the model sensitive to noise.
  – Higher variance than necessary (overfitting if the dataset is small).

• What happens if we apply a 1st-order model to a linear dataset?

COMP-551: Applied Machine Learning

Page 38:

Gauss-Markov Theorem

• Main result:

The least-squares estimates of the parameters w have the smallest variance among all linear unbiased estimates.

COMP-551: Applied Machine Learning

Page 39:

Gauss-Markov Theorem

• Main result:

The least-squares estimates of the parameters w have the smallest variance among all linear unbiased estimates.

• Understanding the statement:

– Real parameters are denoted: w

– Estimate of the parameters is denoted: ŵ

– Error of the estimator: Err(ŵ) = E[(ŵ − w)²] = Var(ŵ) + (E[ŵ − w])²

– Unbiased estimator means: E(ŵ-w)=0

– There may exist an estimator that has lower error, but some bias.

COMP-551: Applied Machine Learning

Page 40:

Bias vs Variance

• Gauss-Markov Theorem says:

The least-squares estimates of the parameters w have the smallest variance among all linear unbiased estimates.

• Insight: Find lower variance solution, at the expense of some bias.

COMP-551: Applied Machine Learning

Page 41:

Bias vs Variance

• Gauss-Markov Theorem says:

The least-squares estimates of the parameters w have the smallest variance among all linear unbiased estimates.

• Insight: Find lower variance solution, at the expense of some bias.

• E.g. fix low-relevance weights to 0

COMP-551: Applied Machine Learning

Page 42:

Recall our prostate cancer example

• The Z-score measures the effect of dropping that feature from the linear regression:
      zj = ŵj / sqrt(σ² vj)
  where ŵj is the estimated weight of the jth feature, and vj is the jth diagonal element of (XᵀX)⁻¹.

COMP-551: Applied Machine Learning

TABLE 3.2 (Hastie et al.): Linear model fit to the prostate cancer data, after standardizing the predictors and applying least squares to a training set of size 67. The Z score is the coefficient divided by its standard error; roughly, a Z score larger than two in absolute value is significantly nonzero at the p = 0.05 level.

Term        Coefficient   Std. Error   Z Score
Intercept       2.46          0.09       27.60
lcavol          0.68          0.13        5.37
lweight         0.26          0.10        2.75
age            −0.14          0.10       −1.40
lbph            0.21          0.10        2.06
svi             0.31          0.12        2.47
lcp            −0.29          0.15       −1.87
gleason        −0.02          0.15       −0.15
pgg45           0.27          0.15        1.74

The predictor lcavol shows the strongest effect, with lweight and svi also strong; lcp is not significant once lcavol is in the model.

[From Hastie et al. textbook]
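A small numpy sketch (an illustration, not the course's code) of how such Z-scores could be computed from a least-squares fit, following the formula zj = ŵj / sqrt(σ² vj) above:

```python
import numpy as np

def z_scores(X, y):
    """Z-scores of least-squares coefficients: z_j = w_j / sqrt(sigma^2 * v_j)."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    w = XtX_inv @ X.T @ y                      # least-squares estimate of the weights
    residuals = y - X @ w
    sigma2 = residuals @ residuals / (n - p)   # unbiased estimate of the noise variance
    v = np.diag(XtX_inv)                       # v_j: j-th diagonal element of (X^T X)^{-1}
    return w / np.sqrt(sigma2 * v)
```

As in the table, a |zj| above roughly 2 suggests the jth feature is significantly nonzero at about the 5% level.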

Page 43:

Subset selection

• Idea: Keep only a small set of features with non-zero weights.

• Goal: Find lower variance solution, at the expense of some bias.

• There are many different methods for choosing subsets.

(More on this later… )

• Least-squares regression can be used to estimate the weights of the selected features.

• Bias arises because the true model might rely on the discarded features!

COMP-551: Applied Machine Learning

Page 44:

Bias vs Variance

• Find lower variance solution, at the expense of some bias.

• Force some weights to 0

• E.g. include a penalty for model complexity in the error to reduce overfitting:

  Err(w) = ∑i=1:n ( yi - wTxi)2 + λ |model_size|

λ is a hyper-parameter that controls penalty size.

COMP-551: Applied Machine Learning

Page 45:

Ridge regression (aka L2-regularization)

• Constrains the weights by imposing a penalty on their size:

ŵridge = argminw { ∑i=1:n( yi - wTxi)2 + λ∑j=0:mwj2 }

where λ can be selected manually, or by cross-validation.

COMP-551: Applied Machine Learning

Page 46:

Ridge regression (aka L2-regularization)

• Constrains the weights by imposing a penalty on their size:

ŵridge = argminw { ∑i=1:n( yi - wTxi)2 + λ∑j=0:mwj2 }

where λ can be selected manually, or by cross-validation.

• Do a little algebra to get the solution: ŵridge = (XTX+λI)-1XTY

COMP-551: Applied Machine Learning

Page 47:

Ridge regression (aka L2-regularization)

• Re-write in matrix notation: fw (X) = Xw

Err(w) = (Y – Xw)T(Y – Xw) + λwTw

• To minimize, take the derivative w.r.t. w:

∂Err(w)/∂w = -2 XT (Y–Xw) + 2λw = 0

• Try a little algebra:

COMP-551: Applied Machine Learning

Page 48:

Ridge regression (aka L2-regularization)

• Re-write in matrix notation: fw (X) = Xw

Err(w) = (Y – Xw)T(Y – Xw) + λwTw

• To minimize, take the derivative w.r.t. w:

∂Err(w)/∂w = -2 XT (Y–Xw) + 2λw = 0

• Try a little algebra: XT Y = (XT X + Iλ) w

ŵ = (XTX + Iλ)-1 XT Y

COMP-551: Applied Machine Learning
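A minimal numpy sketch of this closed-form ridge solution (illustrative, not the course's code); the only change from ordinary least squares is the λI term added to XᵀX:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Usage on the quadratic design matrix from the earlier example:
x = np.array([0.86, 0.09, -0.85, 0.87, -0.44, -0.43, -1.10, 0.40, -0.96, 0.17])
y = np.array([2.49, 0.83, -0.25, 3.10, 0.87, 0.02, -0.12, 1.81, -0.83, 0.43])
X = np.vander(x, 3)                  # columns x^2, x, 1
w_ridge = ridge_fit(X, y, lam=1.0)   # weights shrunk towards zero vs. lam = 0
print(w_ridge)
```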

Page 49:

Ridge regression (aka L2-regularization)

• Constrains the weights by imposing a penalty on their size:

ŵridge = argminw { ∑i=1:n( yi - wTxi)2 + λ∑j=0:mwj2 }

where λ can be selected manually, or by cross-validation.

• Do a little algebra to get the solution: ŵridge = (XTX+λI)-1XTY

– The ridge solution is not equivariant under scaling of the data, so we typically need to normalize the inputs first.

– Ridge gives a smooth solution, effectively shrinking the weights, but it rarely drives weights exactly to 0.

COMP-551: Applied Machine Learning

Page 50:

Lasso regression (aka L1-regularization)

• Constrains the weights by penalizing the absolute value of their size:

  ŵlasso = argminw { ∑i=1:n( yi - wTxi)2 + λ∑j=1:m|wj| }

COMP-551: Applied Machine Learning

Page 51:

Lasso regression (aka L1-regularization)

• Constrains the weights by penalizing the absolute value of their size:

  ŵlasso = argminw { ∑i=1:n( yi - wTxi)2 + λ∑j=1:m|wj| }

• Now there is no closed-form solution. Need to solve a quadratic programming problem instead.

COMP-551: Applied Machine Learning

Page 52:

Lasso regression (aka L1-regularization)

• Constrains the weights by penalizing the absolute value of their size:

  ŵlasso = argminw { ∑i=1:n( yi - wTxi)2 + λ∑j=1:m|wj| }

• Now there is no closed-form solution. Need to solve a quadratic programming problem instead.

– More computationally expensive than Ridge regression.

– Effectively sets the weights of less relevant input features to zero.

COMP-551: Applied Machine Learning
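There is no closed form, but standard solvers exist. A small sketch using scikit-learn (one possible tool, not prescribed by the slides; the synthetic data is made up) illustrates the sparsity effect compared with ridge:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, m = 100, 10
X = rng.normal(size=(n, m))
# Only the first two features actually matter in this synthetic dataset.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

ridge = Ridge(alpha=1.0).fit(X, y)   # shrinks all weights, few are exactly zero
lasso = Lasso(alpha=0.1).fit(X, y)   # drives irrelevant weights exactly to zero

print(np.round(ridge.coef_, 3))
print(np.round(lasso.coef_, 3))      # most of these entries should be exactly 0.0
```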

Page 53:

Comparing Ridge and Lasso

COMP-551: Applied Machine Learning

Complexity term: Ridge: ∑j=0:m wj²    Lasso: λ∑j=1:m|wj|

• Note that for lasso, reducing any weight by, say, 0.1 reduces the complexity by the same amount.
  – So, we’d prefer reducing less relevant features.

• In ridge regression, reducing high weights reduces complexity more than reducing a low weight.
  – So, there is a trade-off between reducing less relevant features and reducing larger weights; ridge tends not to reduce weights that are already small.

Page 54:

Comparing Ridge and Lasso

COMP-551: Applied Machine Learning

Ridge Lasso


Solving L1 regularization

• The optimization problem is a quadratic program.
• There is one constraint for each possible sign of the weights (2ⁿ constraints for n weights).
• For example, with two weights:
      min over w1, w2 of  ∑j=1:m ( yj − w1 xj1 − w2 xj2 )²
      such that    w1 + w2 ≤ η
                   w1 − w2 ≤ η
                  −w1 + w2 ≤ η
                  −w1 − w2 ≤ η
  (η is the bound implied by the regularization strength)
• Solving this program directly can be done for problems with a small number of inputs.

Visualizing L1 regularization

[Figure: contours of the squared-error term (circles) and the L1 constraint region (a diamond) in the (w1, w2) plane, with the optimum w*]

• If λ is big enough, the circle is very likely to intersect the diamond at one of the corners.
• This makes L1 regularization much more likely to make some weights exactly 0.

L2 regularization for linear models revisited

• Optimization problem: minimize the error while keeping the norm of the weights bounded:
      min_w J_D(w) = min_w (Φw − y)ᵀ(Φw − y)   such that wᵀw ≤ η
• The Lagrangian is:
      L(w, λ) = J_D(w) − λ(η − wᵀw) = (Φw − y)ᵀ(Φw − y) + λwᵀw − λη
• For a fixed λ, and η = λ⁻¹, the best w is the same as obtained by weight decay.

Visualizing regularization (2 parameters)

[Figure: error contours and the circular L2 constraint region in the (w1, w2) plane, with the optimum w*]

      w* = (ΦᵀΦ + λI)⁻¹Φᵀy


[Figure labels: contours of equal regression error; contours of equal model complexity penalty]

Page 55:

A quick look at evaluation functions

• We call L(Y,fw(x)) the loss function.

– Least-square / Mean squared-error (MSE) loss:

L(Y, fw(X)) = ∑i=1:n ( yi - wTxi)2

• Other loss functions?

COMP-551: Applied Machine Learning

Page 56:

A quick look at evaluation functions

COMP-551: Applied Machine Learning

• We call L(Y,fw(x)) the loss function.

– Least-square / Mean squared-error (MSE) loss:

L(Y, fw(X)) = ∑i=1:n ( yi - wTxi)2

• Other loss functions?

– Absolute error loss: L(Y, fw(X)) = ∑i=1:n | yi – wTxi |

– 0-1 loss (for classification): L(Y, fw(X)) = ∑i=1:n I ( yi ≠ fw(xi) )

Page 57:

A quick look at evaluation functions

• We call L(Y,fw(x)) the loss function.

– Least-square / Mean squared-error (MSE) loss:

L(Y, fw(X)) = ∑i=1:n ( yi - wTxi)2

• Other loss functions?

– Absolute error loss: L(Y, fw(X)) = ∑i=1:n | yi – wTxi |

– 0-1 loss (for classification): L(Y, fw(X)) = ∑i=1:n I ( yi ≠ fw(xi) )

• Different loss functions make different assumptions.

– Squared error loss assumes the data can be approximated by a global linear model with Gaussian noise.

– The loss function is chosen independently of the complexity penalty (L1 or L2).

COMP-551: Applied Machine Learning
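A tiny numpy sketch (for illustration only) of these three loss functions applied to a linear predictor:

```python
import numpy as np

def squared_error_loss(y, X, w):
    """Least-squares / MSE-style loss: sum of squared residuals."""
    return np.sum((y - X @ w) ** 2)

def absolute_error_loss(y, X, w):
    """Absolute-error loss: sum of absolute residuals."""
    return np.sum(np.abs(y - X @ w))

def zero_one_loss(y, y_pred):
    """0-1 loss for classification: number of misclassified examples."""
    return np.sum(y != y_pred)
```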

Page 58:

A quick look at evaluation functions

• Assume data generated as

COMP-551: Applied Machine Learning

p(y | x; w) = N(xᵀw, σ²)
            = 1/√(2πσ²) · exp( −(y − xᵀw)² / (2σ²) )

Page 59:

A quick look at evaluation functions

• Assume data generated as

• Reasonable to try to maximize the probability of generating the labels y.

COMP-551: Applied Machine Learning

argmax_w p(y | x, w) = argmax_w log p(y | x, w)
                     = argmax_w [ const − (y − xᵀw)² / (2σ²) ]

p(y | x; w) = N(xᵀw, σ²) = 1/√(2πσ²) · exp( −(y − xᵀw)² / (2σ²) )
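Spelling this out for the whole training set (a standard derivation, not verbatim from the slides): with i.i.d. Gaussian noise of fixed variance, maximizing the log-likelihood is exactly minimizing the sum of squared errors, which is why the least-squares loss is the natural choice under this noise assumption.

```latex
\begin{aligned}
\arg\max_w \; \log \prod_{i=1}^{n} p(y_i \mid x_i; w)
  &= \arg\max_w \sum_{i=1}^{n} \Big[ -\tfrac{1}{2}\log(2\pi\sigma^2)
       - \frac{(y_i - x_i^\top w)^2}{2\sigma^2} \Big] \\
  &= \arg\min_w \sum_{i=1}^{n} (y_i - x_i^\top w)^2 .
\end{aligned}
```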

Page 60:

Tutorial

• Programming with Python

• This Friday, 6-7 pm

• Stewart Biology building S3/3

COMP-551: Applied Machine Learning

Page 61:

What you should know

• Overfitting (when it happens, how to avoid it).

• Cross-validation (how and why we use it).

• Ridge regression (L2 regularization), Lasso (L1 regularization).

Page 62:

MSURJ announcement

COMP-551: Applied Machine Learning