COMP 551 – Applied Machine LearningLecture 3: Linear regression (cont’d)
Instructor: Herke van Hoof ([email protected])
Slides mostly by: Joelle Pineau
Class web page: www.cs.mcgill.ca/~hvanho2/comp551
Unless otherwise noted, all material posted for this course is copyright of the instructors, and cannot be reused or reposted without the instructors' written permission.
Joelle Pineau2
What we saw last time
COMP-551: Applied Machine Learning
• Definition and characteristics of a supervised learning problem.
• Linear regression (hypothesis class, cost function, algorithm).
• Closed-form least-squares solution method (algorithm,
computational complexity, stability issues).
• Alternative optimization methods via gradient descent.
Predicting recurrence time from tumor size

This function looks complicated, and a linear hypothesis does not seem very good. What should we do?
• Pick a better function?
• Use more features?
• Get more data?
Input variables for linear regression
• Original quantitative variables X1, …, Xm
• Transformations of variables, e.g. Xm+1 = log(Xi)
• Basis expansions, e.g. Xm+1 = Xi², Xm+2 = Xi³, …
• Interaction terms, e.g. Xm+1 = Xi Xj
• Numeric coding of qualitative variables, e.g. Xm+1 = 1 if Xi is true and 0 otherwise.
In all cases, we can add Xm+1, …, Xm+k to the list of original variables and perform the linear regression.
Example of linear regression with polynomial terms

Answer: Polynomial regression
• Given data: (x1, y1), (x2, y2), …, (xm, ym).
• Suppose we want a degree-d polynomial fit.
• Let Y be as before and let

    X = [ x1^d  …  x1²  x1  1
          x2^d  …  x2²  x2  1
           ⋮          ⋮    ⋮   ⋮
          xm^d  …  xm²  xm  1 ]

• Solve the linear regression Xw ≈ Y.
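As a sketch of this construction (the helper names are mine, not from the course code), the design matrix and least-squares fit can be written in NumPy:

```python
import numpy as np

def poly_design_matrix(x, d):
    """Columns [x^d, ..., x^2, x, 1], matching the layout on the slide."""
    return np.vander(x, d + 1)  # np.vander uses decreasing powers by default

def fit_polynomial(x, y, d):
    """Least-squares solution of X w ~ Y for a degree-d polynomial."""
    X = poly_design_matrix(x, d)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w  # coefficients in decreasing-power order
```

Note that fitting the polynomial is still ordinary linear regression: the model is linear in w, only the features are nonlinear in x.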
COMP-652, Lecture 1 - September 6, 2012 44
Example of quadratic regression: Data matrices

fw(x) = w0 + w1 x + w2 x²   (columns of X: x², x, 1)

  X = [ 0.75   0.86  1
        0.01   0.09  1
        0.73  −0.85  1
        0.76   0.87  1
        0.19  −0.44  1
        0.18  −0.43  1
        1.22  −1.10  1
        0.16   0.40  1
        0.93  −0.96  1
        0.03   0.17  1 ]

  Y = [ 2.49, 0.83, −0.25, 3.10, 0.87, 0.02, −0.12, 1.81, −0.83, 0.43 ]ᵀ
Solving the problem

Solving for w:

  w = (XᵀX)⁻¹XᵀY
    = [  4.11  −1.64   4.95
        −1.64   4.95  −1.39
         4.95  −1.39  10    ]⁻¹ [ 3.60
                                  6.49
                                  8.34 ]
    = [ 0.68
        1.74
        0.73 ]

So the best order-2 polynomial is y = 0.68x² + 1.74x + 0.73.
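Assuming the data matrices shown above were extracted faithfully, the computation can be reproduced numerically (variable names are mine):

```python
import numpy as np

# Data from the quadratic regression example: columns are [x^2, x, 1].
X = np.array([
    [0.75,  0.86, 1], [0.01,  0.09, 1], [0.73, -0.85, 1], [0.76,  0.87, 1],
    [0.19, -0.44, 1], [0.18, -0.43, 1], [1.22, -1.10, 1], [0.16,  0.40, 1],
    [0.93, -0.96, 1], [0.03,  0.17, 1],
])
Y = np.array([2.49, 0.83, -0.25, 3.10, 0.87, 0.02, -0.12, 1.81, -0.83, 0.43])

# Solve the normal equations (X^T X) w = X^T Y.
w = np.linalg.solve(X.T @ X, X.T @ Y)
```

On these (rounded) numbers, w should land within rounding error of the slide's (0.68, 1.74, 0.73).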
Order-2 fit

Is this a better fit to the data? Compared to y = 1.6x + 1.05 for the order-1 polynomial.
Order-3 fit: Is this better?
Order-4 fit
Order-5 fit
Order-6 fit
Order-7 fit
Order-8 fit
Order-9 fit
This is overfitting!
• We can find a hypothesis that explains the training data perfectly, but does not generalize well to new data.
• In this example: we have a lot of parameters (weights), so the hypothesis matches the data points exactly, but is wild everywhere else.
• This is a very important problem in machine learning.
Overfitting
• Every hypothesis has a true error measured on all possible data
items we could ever encounter (e.g. fw(xi) - yi ).
• Since we don’t have all possible data, in order to decide what is
a good hypothesis, we measure error over the training set.
• Formally: Suppose we compare hypotheses f1 and f2.
– Assume f1 has lower error on the training set.
– If f2 has lower true error, then our algorithm is overfitting.
Overfitting
• Which hypothesis has the lowest true error?
[Figure: polynomial fits for d = 1 through d = 8.]
Cross-Validation
• Partition your data into a Training Set and a Validation Set.
  – The proportions in each set can vary.
• Use the Training Set to find the best hypothesis in the class.
• Use the Validation Set to evaluate the true prediction error.
  – Compare across different hypothesis classes (different-order polynomials).
[The quadratic regression data matrices are shown again here, with their rows split into a Training set (Train:) and a Validation set (Validate:).]
k-fold Cross-Validation
• Consider k partitions of the data (usually of equal size).
• Train with k−1 subsets, validate on the kth subset. Repeat k times.
• Average the prediction error over the k rounds/folds.
Source: http://stackoverflow.com/questions/31947183/how-to-implement-walk-forward-testing-in-sklearn
• Computation time is increased by a factor of k.
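A minimal NumPy sketch of the procedure (the function name and the choice of polynomial fitting as the hypothesis class are mine):

```python
import numpy as np

def kfold_mse(x, y, d, k=5, seed=0):
    """k-fold CV error of a degree-d polynomial: train on k-1 folds,
    validate on the held-out fold, average over the k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errs = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w, *_ = np.linalg.lstsq(np.vander(x[train], d + 1), y[train], rcond=None)
        pred = np.vander(x[val], d + 1) @ w
        errs.append(np.mean((y[val] - pred) ** 2))
    return np.mean(errs)
```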
Leave-one-out cross-validation
• Let k = n, the size of the training set.
• For each order-d hypothesis class:
  – Repeat n times:
    • Set aside one instance ⟨xi, yi⟩ from the training set.
    • Use all other data points to find w (optimization).
    • Measure the prediction error on the held-out ⟨xi, yi⟩.
  – Average the prediction error over all n subsets.
• Choose the d with the lowest estimated true prediction error.
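The leave-one-out loop can be sketched as follows (names are mine; a polynomial fit stands in for a generic order-d hypothesis class):

```python
import numpy as np

def loocv_error(x, y, d):
    """Leave-one-out CV error for a degree-d polynomial fit."""
    n = len(x)
    errs = []
    for i in range(n):
        mask = np.arange(n) != i                 # hold out instance i
        X = np.vander(x[mask], d + 1)
        w, *_ = np.linalg.lstsq(X, y[mask], rcond=None)
        errs.append((y[i] - np.polyval(w, x[i])) ** 2)
    return np.mean(errs)

def best_degree(x, y, degrees):
    """Pick the degree with the lowest estimated true prediction error."""
    return min(degrees, key=lambda d: loocv_error(x, y, d))
```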
Estimating true error for d=1

Data:
    x       y
   0.86    2.49
   0.09    0.83
  −0.85   −0.25
   0.87    3.10
  −0.44    0.87
  −0.43    0.02
  −1.10   −0.12
   0.40    1.81
  −0.96   −0.83
   0.17    0.43

[Figure: cross-validation results.]
Cross-validation results
• Optimal choice: d=2. Overfitting for d > 2.
Evaluation
• We use cross-validation for model selection.
• Available labeled data is split into two parts:
– Training set is used to select a hypothesis f from a class of hypotheses F (e.g. regression of a given degree).
– Validation set is used to compare the best f from each hypothesis class across different classes (e.g. different degree regression).
• Must be untouched during the process of looking for f within a class F.
Evaluation
• After adapting the weights to minimize the error on the training set, the weights could be exploiting particularities of the training set:
  – we have to use the validation set as a proxy for the true error.
• After choosing the hypothesis class to minimize error on the validation set, the hypothesis class could be adapted to particularities of the validation set:
  – the validation set is no longer a good proxy for the true error!
Evaluation
• Test set: Ideally, a separate set of (labeled) data is withheld to get a true estimate of the generalization error.
  – It cannot be touched during the process of selecting F.
  (Often the "validation set" is called the "test set", without distinction.)
Validation vs Train error

FIGURE 2.11. Test and training error as a function of model complexity. [Axes: Prediction Error vs. Model Complexity (Low to High); curves for the Training Sample and the Test Sample; low complexity gives High Bias / Low Variance, high complexity gives Low Bias / High Variance.]

As k grows, the neighbors are further away, and then anything can happen. The variance term is simply the variance of an average here, and decreases as the inverse of k. So as k varies, there is a bias–variance tradeoff. More generally, as the model complexity of our procedure is increased, the variance tends to increase and the squared bias tends to decrease. The opposite behavior occurs as the model complexity is decreased. For k-nearest neighbors, the model complexity is controlled by k.

Typically we would like to choose our model complexity to trade bias off with variance in such a way as to minimize the test error. An obvious estimate of test error is the training error (1/N) ∑i (yi − ŷi)². Unfortunately training error is not a good estimate of test error, as it does not properly account for model complexity.

Figure 2.11 shows the typical behavior of the test and training error, as model complexity is varied. The training error tends to decrease whenever we increase the model complexity, that is, whenever we fit the data harder. However with too much fitting, the model adapts itself too closely to the training data, and will not generalize well (i.e., have large test error). In that case the predictions f̂(x0) will have large variance, as reflected in the last term of expression (2.46). In contrast, if the model is not complex enough, it will underfit and may have large bias, again resulting in poor generalization. In Chapter 7 we discuss methods for estimating the test error of a prediction method, and hence estimating the optimal amount of model complexity for a given prediction method and training set.
[From Hastie et al. textbook]
Understanding the error
Given a set of examples ⟨X, Y⟩, assume that y = f(x) + ε, where ε is Gaussian noise with zero mean and standard deviation σ.
Polynomial regression wrap-up
• We can add to the training data all cross-product terms up to some degree d.
• Fitting parameters is no different from linear regression.
• We can use cross-validation to choose the best order of polynomial to fit our data.
• Because the number of parameters explodes with the degree of the polynomial (see homework 1), we will often use only select cross-terms or higher-order powers (based on domain knowledge).
• Always use cross-validation, and report results on testing data (completely untouched in the training process).
The anatomy of the error
• Suppose we have examples ⟨x, y⟩ where y = f(x) + ε and ε is Gaussian noise with zero mean and standard deviation σ.
• Reminder: the normal (Gaussian) distribution N(x | µ, σ²).

  [Figure: Gaussian density centered at µ with width 2σ.]

  (See Bishop Ch. 2 for review.)
Understanding the error
• Consider standard linear regression solution:
Err(w) = ∑i=1:n (yi − wᵀxi)²
• If we consider only the class of linear hypotheses, we have
systematic prediction error, called bias, whenever the data is
generated by a non-linear function.
• Depending on what dataset we observed, we may get different
solutions. Thus we can also have error due to this variance.
– This occurs even if data is generated from class of linear functions.
An example (from Tom Dietterich)
• The circles are data points. X is drawn uniformly randomly. Y is
generated by the function y = 2sin(0.5x) + ∊.
The anatomy of the error: Linear regression
• In linear regression, given a set of examples ⟨xi, yi⟩, i = 1…m, we fit a linear hypothesis h(x) = wᵀx, such as to minimize the sum-squared error over the training data:

  ∑i=1:m (yi − h(xi))²

• Because of the hypothesis class that we chose (hypotheses linear in the parameters), for some target functions f we will have a systematic prediction error.
• Even if f were truly from the hypothesis class we picked, depending on the data set we have, the parameters w that we find may be different; this variability due to the specific data set on hand is a different source of error.
An example (Tom Dietterich)
• The sine is the true function.
• The circles are data points (x drawn uniformly randomly, y given by the formula).
• The straight line is the linear regression fit (see lecture 1).
An example (from Tom Dietterich)
• With different sets of 20 points, we get different lines.
Bias-variance analysis
• Given a new data point x, what is the expected prediction error?
• Assume that the data points are drawn independently and identically distributed (i.i.d.) from a unique underlying probability distribution P(⟨x, y⟩) = P(x)P(y|x).
• The goal of the analysis is to compute, for an arbitrary given point x,

  EP[ (y − h(x))² | x ]

  where y is the value of x in a data set, and the expectation is over all training sets of a given size, drawn according to P.
• For a given hypothesis class, we can also compute the true error, which is the expected error over the input distribution:

  ∑x EP[ (y − h(x))² | x ] P(x)

  (if x is continuous, the sum becomes an integral with appropriate conditions).
• We will decompose this expectation into three components.
Bias-variance decomposition
• Compare f(x) (the true value at x) to the prediction h(x).
• Error: E[(h(x) − f(x))²]
• Bias: f(x) − h̄(x)
  – How far is the average estimate from the true function?
• Variance: E[(h(x) − h̄(x))²]
  – How far are estimates (on average) from the average estimate?
• Notes:
  – Expectations are over all possible datasets (noise realizations).
  – h̄(x) = E[h(x)]
Bias-variance decomposition
• Can show that: Error = bias² + variance (e.g. https://web.engr.oregonstate.edu/~tgd/classes/534/slides/part9.pdf)
• What happens if we apply a 1st-order model to a quadratic dataset?
  – No matter how much data we have, we can't fit the dataset.
  – This means we have bias (underfitting).
• What happens if we apply a 2nd-order model to a linear dataset?
  – The model can definitely fit the dataset! (no bias)
  – Spurious parameters make the model sensitive to noise.
  – Higher variance than necessary (overfitting if the dataset is small).
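The decomposition can be checked by simulation, using the 2 sin(0.5x) function from the Dietterich example above as the true function (the sampling range, noise level, and function name below are my assumptions):

```python
import numpy as np

def bias_variance_at(x0, degree, n_train=20, n_sets=500, sigma=0.1, seed=0):
    """Monte-Carlo estimate of bias^2 and variance of a degree-`degree`
    polynomial fit at the query point x0, for f(x) = 2 sin(0.5 x)."""
    rng = np.random.default_rng(seed)
    f = lambda x: 2.0 * np.sin(0.5 * x)
    preds = []
    for _ in range(n_sets):                      # one fit per simulated dataset
        x = rng.uniform(-5.0, 5.0, n_train)
        y = f(x) + rng.normal(0.0, sigma, n_train)
        w, *_ = np.linalg.lstsq(np.vander(x, degree + 1), y, rcond=None)
        preds.append(np.polyval(w, x0))
    preds = np.array(preds)
    bias2 = (preds.mean() - f(x0)) ** 2          # (average estimate - truth)^2
    variance = preds.var()                       # spread around the average
    return bias2, variance
```

On this sine target, a 1st-order model should show much larger bias than a 3rd-order model, illustrating the underfitting case above.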
Bias-variance decomposition
• What happens if we apply a 1st-order model to a linear dataset?
Gauss-Markov Theorem
• Main result:
The least-squares estimates of the parameters w have the smallest variance among all linear unbiased estimates.
• Understanding the statement:
  – The real parameters are denoted: w
  – The estimate of the parameters is denoted: ŵ
  – Error of the estimator: Err(ŵ) = E[(ŵ − w)²] = Var(ŵ) + (E[ŵ − w])²
  – An unbiased estimator means: E[ŵ − w] = 0
  – There may exist an estimator that has lower error, but some bias.
Bias vs Variance
• Gauss-Markov Theorem says:
The least-squares estimates of the parameters w have the smallest
variance among all linear unbiased estimates.
• Insight: Find lower variance solution, at the expense of some bias.
• E.g. fix low-relevance weights to 0
Recall our prostate cancer example
• The Z-score measures the effect of dropping that feature from the linear regression:

  zj = ŵj / sqrt(σ² vj)

  where ŵj is the estimated weight of the jth feature, and vj is the jth diagonal element of (XᵀX)⁻¹.
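The formula can be computed directly; a sketch (`z_scores` is my name, and the residual-variance estimate is the usual unbiased one):

```python
import numpy as np

def z_scores(X, y):
    """z_j = w_j / sqrt(sigma^2 * v_j), where v_j is the j-th diagonal
    entry of (X^T X)^{-1} and sigma^2 the unbiased residual variance."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    w = XtX_inv @ X.T @ y                 # least-squares weights
    resid = y - X @ w
    sigma2 = resid @ resid / (n - p)      # unbiased noise-variance estimate
    return w / np.sqrt(sigma2 * np.diag(XtX_inv))
```

A feature with a strong true effect should get a Z-score well above 2 in absolute value, while an irrelevant feature typically stays near 0.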
TABLE 3.1. Correlations of predictors in the prostate cancer data.

          lcavol lweight    age   lbph    svi    lcp gleason
lweight    0.300
age        0.286   0.317
lbph       0.063   0.437  0.287
svi        0.593   0.181  0.129 −0.139
lcp        0.692   0.157  0.173 −0.089  0.671
gleason    0.426   0.024  0.366  0.033  0.307  0.476
pgg45      0.483   0.074  0.276 −0.030  0.481  0.663  0.757

TABLE 3.2. Linear model fit to the prostate cancer data. The Z score is the coefficient divided by its standard error (3.12). Roughly a Z score larger than two in absolute value is significantly nonzero at the p = 0.05 level.

Term       Coefficient  Std. Error  Z Score
Intercept      2.46        0.09      27.60
lcavol         0.68        0.13       5.37
lweight        0.26        0.10       2.75
age           −0.14        0.10      −1.40
lbph           0.21        0.10       2.06
svi            0.31        0.12       2.47
lcp           −0.29        0.15      −1.87
gleason       −0.02        0.15      −0.15
pgg45          0.27        0.15       1.74
For example, both lcavol and lcp show a strong relationship with the response lpsa, and with each other. We need to fit the effects jointly to untangle the relationships between the predictors and the response.

We fit a linear model to the log of prostate-specific antigen, lpsa, after first standardizing the predictors to have unit variance. We randomly split the dataset into a training set of size 67 and a test set of size 30. We applied least squares estimation to the training set, producing the estimates, standard errors and Z-scores shown in Table 3.2. The Z-scores are defined in (3.12), and measure the effect of dropping that variable from the model. A Z-score greater than 2 in absolute value is approximately significant at the 5% level. (For our example, we have nine parameters, and the 0.025 tail quantiles of the t67−9 distribution are ±2.002!) The predictor lcavol shows the strongest effect, with lweight and svi also strong. Notice that lcp is not significant, once lcavol is in the model (when used in a model without lcavol, lcp is strongly significant). We can also test for the exclusion of a number of terms at once, using the F-statistic (3.13). For example, we consider dropping all the non-significant terms in Table 3.2, namely age, …
[From Hastie et al. textbook]
Subset selection
• Idea: Keep only a small set of features with non-zero weights.
• Goal: Find lower variance solution, at the expense of some bias.
• There are many different methods for choosing subsets.
(More on this later… )
• Least-squares regression can be used to estimate the weights of
the selected features.
• This introduces bias, as the true model might rely on the discarded features!
Bias vs Variance
• Find lower variance solution, at the expense of some bias.
• Force some weights to 0
• E.g. Include penalty for model complexity in error to reduce
overfitting.
Err(w) = ∑i=1:n ( yi - wTxi)2 + λ |model_size|
λ is a hyper-parameter that controls penalty size.
Ridge regression (aka L2-regularization)
• Constrains the weights by imposing a penalty on their size:

  ŵridge = argminw { ∑i=1:n (yi − wᵀxi)² + λ ∑j=0:m wj² }

  where λ can be selected manually, or by cross-validation.
• Do a little algebra to get the solution: ŵridge = (XᵀX + λI)⁻¹XᵀY
Ridge regression (aka L2-regularization)
• Re-write in matrix notation: fw(X) = Xw
  Err(w) = (Y − Xw)ᵀ(Y − Xw) + λwᵀw
• To minimize, take the derivative w.r.t. w:
  ∂Err(w)/∂w = −2Xᵀ(Y − Xw) + 2λw = 0
• A little algebra gives: XᵀY = (XᵀX + λI)w, so
  ŵ = (XᵀX + λI)⁻¹XᵀY
Ridge regression (aka L2-regularization)
– The ridge solution is not equivariant under scaling of the data, so we typically need to normalize the inputs first.
– Ridge gives a smooth solution, effectively shrinking the weights, but drives few weights to 0.
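The closed-form ridge solution in NumPy (a sketch; `ridge_fit` is my name):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge solution w = (X^T X + lam I)^{-1} X^T y.
    In practice, normalize the inputs first: ridge is not
    equivariant under scaling of the data."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

With lam = 0 this reduces to ordinary least squares (when XᵀX is invertible); increasing lam shrinks the norm of the weights.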
Lasso regression (aka L1-regularization)
• Constrains the weights by penalizing the absolute value of their size:

  ŵlasso = argminw { ∑i=1:n (yi − wᵀxi)² + λ ∑j=1:m |wj| }

• Now there is no closed-form solution; we need to solve a quadratic programming problem instead.
  – More computationally expensive than ridge regression.
  – Effectively sets the weights of less relevant input features to zero.
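The slides only say that a quadratic program must be solved; one common solver choice (my assumption, not from the slides) is cyclic coordinate descent with soft-thresholding, which minimizes (1/2)‖y − Xw‖² + λ‖w‖₁:

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward 0 by t; exactly 0 when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_fit(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/2)||y - Xw||^2 + lam * ||w||_1."""
    w = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r = y - X @ w + X[:, j] * w[j]      # residual excluding feature j
            w[j] = soft_threshold(X[:, j] @ r, lam) / col_sq[j]
    return w
```

The soft-threshold step is what produces exact zeros for less relevant features, in line with the bullet above.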
Comparing Ridge and Lasso
Complexity penalties: Ridge uses ∑j=0:m wj²; Lasso uses λ ∑j=1:m |wj|.
• Note that for lasso, reducing any weight by, say, 0.1 reduces the complexity penalty by the same amount.
  – So we'd prefer reducing less relevant features.
• In ridge regression, reducing a high weight reduces the complexity penalty more than reducing a low weight.
  – So there is a trade-off between reducing less relevant features and larger weights; ridge tends not to reduce weights that are already small.
Comparing Ridge and Lasso

[Figure panels: Ridge | Lasso constraint regions.]
Solving L1 regularization
• The optimization problem is a quadratic program.
• There is one constraint for each possible sign of the weights (2ⁿ constraints for n weights).
• For example, with two weights:

  min over w1, w2 of ∑j=1:m (yj − w1 xj1 − w2 xj2)²
  such that   w1 + w2 ≤ η
              w1 − w2 ≤ η
             −w1 + w2 ≤ η
             −w1 − w2 ≤ η

• Solving this program directly can be done for problems with a small number of inputs.
Visualizing L1 regularization

[Figure: sum-squared-error contours (circles) in the (w1, w2) plane, the diamond-shaped L1 constraint region, and the unconstrained optimum w*.]

• If the penalty is big enough, the circle is very likely to intersect the diamond at one of the corners.
• This makes L1 regularization much more likely to make some weights exactly 0.
L2 Regularization for linear models revisited
• Optimization problem: minimize the error while keeping the norm of the weights bounded:

  minw JD(w) = minw (Φw − y)ᵀ(Φw − y)   such that   wᵀw ≤ η

• The Lagrangian is:

  L(w, λ) = JD(w) − λ(η − wᵀw) = (Φw − y)ᵀ(Φw − y) + λwᵀw − λη

• For a fixed λ, and η = λ⁻¹, the best w is the same as obtained by weight decay.
Visualizing regularization (2 parameters)

[Figure: error contours in the (w1, w2) plane with the circular L2 constraint region and the regularized optimum.]

  w* = (ΦᵀΦ + λI)⁻¹Φᵀy
[Figure labels: contours of equal regression error; contours of equal model-complexity penalty.]
A quick look at evaluation functions
• We call L(Y, fw(X)) the loss function.
  – Least-squares / mean squared error (MSE) loss: L(Y, fw(X)) = ∑i=1:n (yi − wᵀxi)²
• Other loss functions?
  – Absolute error loss: L(Y, fw(X)) = ∑i=1:n |yi − wᵀxi|
  – 0-1 loss (for classification): L(Y, fw(X)) = ∑i=1:n I(yi ≠ fw(xi))
• Different loss functions make different assumptions.
  – Squared error loss assumes the data can be approximated by a global linear model with Gaussian noise.
  – The loss function is independent of the complexity penalty (L1 or L2).
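These three losses are one-liners (function names are mine):

```python
import numpy as np

def squared_loss(y, pred):
    """Least-squares / MSE-style loss: sum of squared residuals."""
    return np.sum((y - pred) ** 2)

def absolute_loss(y, pred):
    """Absolute error loss: sum of |residuals|."""
    return np.sum(np.abs(y - pred))

def zero_one_loss(y, pred):
    """0-1 loss for classification: count of disagreements."""
    return np.sum(y != pred)
```

Note how a single large residual dominates the squared loss but contributes only linearly to the absolute loss, which is why absolute loss is more robust to outliers.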
A quick look at evaluation functions
• Assume the data is generated as

  p(y|x; w) = N(xᵀw, σ²) = (1/√(2πσ²)) exp( −(y − xᵀw)² / (2σ²) )

• It is then reasonable to try to maximize the probability of generating the labels y:

  argmaxw p(y|x, w) = argmaxw log p(y|x, w) = argmaxw [ const − (y − xᵀw)² ]
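A quick numerical check that the least-squares solution also maximizes this Gaussian log-likelihood (the setup below is my own, with σ² fixed to 1):

```python
import numpy as np

def log_likelihood(w, X, y, sigma2=1.0):
    """Gaussian log-likelihood of labels y under p(y|x; w) = N(x^T w, sigma2)."""
    resid = y - X @ w
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - resid @ resid / (2 * sigma2)
```

Because the log-likelihood is (a constant minus) the sum of squared residuals, any perturbation of the least-squares weights can only lower it.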
Tutorial
• Programming with Python
• This Friday, 6-7 pm
• Stewart Biology building S3/3
What you should know
• Overfitting (when it happens, how to avoid it).
• Cross-validation (how and why we use it).
• Ridge regression (l2 regularisation), Lasso (l1 regularisation)
MSURJ announcement