CSE 446: Machine Learning
Emily Fox, University of Washington
January 13, 2017
©2017 Emily Fox

Ridge Regression: Regulating overfitting when using many features
Training, true, & test error vs. model complexity
[Figure: training error, true error, and test error vs. model complexity, with example low- and high-complexity fits. Prompt: Overfitting if …]
Error vs. amount of data
[Figure: error vs. # data points in training set]
Overfitting of polynomial regression
Flexibility of high-order polynomials
[Figure: price ($) vs. square feet (sq.ft.), with fitted polynomial curves fŵ of increasing order]
yi = w0 + w1 xi + w2 xi^2 + … + wp xi^p + εi
Symptom of overfitting
Often, overfitting is associated with very large estimated parameters ŵ.
Overfitting of linear regression models more generically
Overfitting with many features
Not unique to polynomial regression; it also occurs with lots of inputs (d large)
Or, generically, lots of features (D large)
yi = Σj wj hj(xi) + εi
- Square feet
- # bathrooms
- # bedrooms
- Lot size
- Year built
- …
How does # of observations influence overfitting?
Few observations (N small): rapidly overfit as model complexity increases
Many observations (N very large): harder to overfit
[Figure: price ($) vs. square feet (sq.ft.); the same high-complexity fit fŵ with few observations vs. many observations]
How does # of inputs influence overfitting?
1 input (e.g., sq.ft.): Data must include representative examples of all possible (sq.ft., $) pairs to avoid overfitting
[Figure: price ($) vs. square feet (sq.ft.), with fitted curve fŵ]
d inputs (e.g., sq.ft., #bath, #bed, lot size, year, …):
Data must include examples of all possible (sq.ft., #bath, #bed, lot size, year, …, $) combos to avoid overfitting
[Figure: price (y) vs. two inputs x[1] (sq.ft.) and x[2], with fitted surface fŵ]
Adding term to cost-of-fit to prefer small coefficients
Desired total cost format
Want to balance:
i. How well function fits data
ii. Magnitude of coefficients
Total cost = measure of fit + measure of magnitude of coefficients
- measure of fit (quality of fit): small # = good fit to training data
- measure of magnitude of coefficients: small # = not overfit
- want to balance the two
Measure of fit to training data
RSS(w) = Σi (yi − h(xi)^T w)^2
[Figure: price ($, y) vs. inputs x[1] (sq.ft.) and x[2]]
Measure of magnitude of regression coefficients

What summary # is indicative of the size of the regression coefficients?
- Sum?
- Sum of absolute values?
- Sum of squares (L2 norm)
Consider specific total cost
Total cost = measure of fit + measure of magnitude of coefficients
           = RSS(w) + ||w||_2^2
Consider resulting objective
What if ŵ is selected to minimize
RSS(w) + λ||w||_2^2
where the tuning parameter λ balances fit and magnitude?

If λ=0: reduces to minimizing RSS(w) alone (the old least squares objective)
If λ=∞: any w ≠ 0 has infinite cost, so ŵ = 0
If λ in between: trades off between fit and magnitude of the coefficients
Minimizing RSS(w) + λ||w||_2^2 is ridge regression (a.k.a. L2 regularization); the tuning parameter λ balances fit and magnitude.
Bias-variance tradeoff
Large λ:
high bias, low variance
(e.g., ŵ=0 for λ=∞)
Small λ:
low bias, high variance
(e.g., standard least squares (RSS) fit of high-order polynomial for λ=0)
In essence, λ controls model complexity.
Revisit polynomial fit demo
What happens if we refit our high-order polynomial, but now using ridge regression?
Will consider a few settings of λ …
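As a rough stand-in for that demo (not the course notebook), here is a minimal numpy sketch that refits a high-order polynomial using the ridge closed-form solution for a few λ values and reports the largest coefficient magnitude; the synthetic data, polynomial degree, and λ grid are assumptions.

```python
import numpy as np

# Minimal sketch (not the course demo): synthetic 1-d data, degree-15 polynomial features.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(4 * x) + rng.normal(scale=0.3, size=x.shape)

H = np.vander(x, N=16, increasing=True)        # columns 1, x, x^2, ..., x^15

# Tiny lambda behaves like least squares (coefficients can blow up);
# larger lambda shrinks the coefficients.
for lam in [1e-10, 1e-4, 1e-1, 1e2]:
    # Ridge closed form: w_hat = (H^T H + lam I)^-1 H^T y
    w_hat = np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)
    print(f"lambda = {lam:g}: max |w_j| = {np.abs(w_hat).max():.3g}")
```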
Coefficient path
[Figure: coefficient path, plotting each estimated coefficient ŵj against λ]
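One way such a path could be computed, sketched with the closed-form ridge solution over a log-spaced λ grid (the feature matrix, true weights, and grid below are made-up stand-ins):

```python
import numpy as np

def ridge_path(H, y, lambdas):
    """Stack the ridge solutions w_hat(lambda) for each lambda into one array (rows = lambdas)."""
    D1 = H.shape[1]
    return np.array([np.linalg.solve(H.T @ H + lam * np.eye(D1), H.T @ y) for lam in lambdas])

# Example usage on synthetic data: each column j of `path` traces w_hat_j as lambda grows,
# shrinking toward 0 (the coefficient path shown on the slide).
rng = np.random.default_rng(1)
H = rng.normal(size=(100, 6))
y = H @ np.array([3.0, -2.0, 0.5, 0.0, 1.5, -1.0]) + rng.normal(scale=0.5, size=100)
path = ridge_path(H, y, np.logspace(-4, 4, 20))
```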
Fitting the ridge regression model (for given λ value)
Step 1: Rewrite total cost in matrix notation
Recall matrix form of RSS
Model for all N observations together
y = Hw + ε
RSS(w) = Σi (yi − h(xi)^T w)^2
       = (y − Hw)^T (y − Hw)
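A quick numerical check of this identity on random data (a sketch of my own, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
H = rng.normal(size=(50, 4))     # rows of H are the feature vectors h(x_i)^T
y = rng.normal(size=50)
w = rng.normal(size=4)

rss_sum = sum((y[i] - H[i] @ w) ** 2 for i in range(len(y)))   # sum-over-observations form
rss_mat = (y - H @ w) @ (y - H @ w)                            # matrix form (y - Hw)^T (y - Hw)
assert np.isclose(rss_sum, rss_mat)
```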
Rewrite magnitude of coefficients in vector notation
||w||_2^2 = w0^2 + w1^2 + w2^2 + … + wD^2 = w^T w
Putting it all together
In matrix form, ridge regression cost is:
RSS(w) + λ||w||_2^2 = (y − Hw)^T (y − Hw) + λ w^T w
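A minimal sketch of this matrix-form cost (the function name and arguments are my own):

```python
import numpy as np

def ridge_cost(H, y, w, lam):
    """Ridge regression total cost: (y - Hw)^T (y - Hw) + lam * w^T w."""
    resid = y - H @ w
    return resid @ resid + lam * (w @ w)
```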
Step 2: Compute the gradient
Gradient of ridge regression cost

∇[RSS(w) + λ||w||_2^2] = ∇[(y − Hw)^T (y − Hw) + λ w^T w]
                       = ∇[(y − Hw)^T (y − Hw)] + λ ∇[w^T w]
                       = −2H^T (y − Hw) + λ (2w)

Why? By analogy to the 1d case: w^T w is analogous to w^2, and the derivative of w^2 is 2w.
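A sketch of this gradient, spot-checked against a central finite difference (helper names are assumptions):

```python
import numpy as np

def ridge_gradient(H, y, w, lam):
    """Gradient of (y - Hw)^T (y - Hw) + lam * w^T w with respect to w."""
    return -2 * H.T @ (y - H @ w) + 2 * lam * w

# Central-difference check of one coordinate on random data.
rng = np.random.default_rng(3)
H = rng.normal(size=(30, 5)); y = rng.normal(size=30); w = rng.normal(size=5)
lam, eps, j = 0.7, 1e-6, 2

def cost(v):
    r = y - H @ v
    return r @ r + lam * (v @ v)

w_plus, w_minus = w.copy(), w.copy()
w_plus[j] += eps; w_minus[j] -= eps
assert np.isclose(ridge_gradient(H, y, w, lam)[j], (cost(w_plus) - cost(w_minus)) / (2 * eps))
```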
Step 3, Approach 1: Set the gradient = 0
Ridge closed-form solution
∇cost(w) = −2H^T (y − Hw) + 2λIw = 0
Solve for w: (H^T H + λI) w = H^T y, so ŵ = (H^T H + λI)^-1 H^T y
Interpreting ridge closed-form solution
ŵ = (H^T H + λI)^-1 H^T y
If λ=0: ŵ = (H^T H)^-1 H^T y (the old least squares solution)
If λ=∞: ŵ = 0
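A minimal sketch of this closed form; a linear solve is used rather than forming the inverse explicitly (names are mine):

```python
import numpy as np

def ridge_closed_form(H, y, lam):
    """Solve (H^T H + lam * I) w_hat = H^T y for w_hat."""
    D1 = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(D1), H.T @ y)

# With lam = 0 (and H^T H invertible) this reduces to the least squares solution;
# as lam grows, w_hat is pulled toward 0.
```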
Recall discussion on previous closed-form solution
ŵ = (H^T H)^-1 H^T y
Invertible if: in general, N > D (N = # linearly independent observations)
Complexity of inverse: O(D^3)
Discussion of ridge closed-form solution
ŵ = (H^T H + λI)^-1 H^T y
Invertible if: always, as long as λ > 0 (even if N < D)
Complexity of inverse: O(D^3) … big for large D!
Step 3, Approach 2: Gradient descent
Elementwise ridge regression gradient descent algorithm
∇cost(w) = −2H^T (y − Hw) + 2λw

Update to jth feature weight:
wj^(t+1) ← wj^(t) − η [ −2 Σi hj(xi)(yi − ŷi(w^(t))) + 2λ wj^(t) ]
Recall previous algorithm
init w^(1) = 0 (or randomly, or smartly), t = 1
while ||∇RSS(w^(t))|| > ε:
  for j = 0, …, D:
    partial[j] = −2 Σi hj(xi)(yi − ŷi(w^(t)))
    wj^(t+1) ← wj^(t) − η · partial[j]
  t ← t + 1
Summary of ridge regression algorithm
init w^(1) = 0 (or randomly, or smartly), t = 1
while ||∇RSS(w^(t))|| > ε:
  for j = 0, …, D:
    partial[j] = −2 Σi hj(xi)(yi − ŷi(w^(t)))
    wj^(t+1) ← (1 − 2ηλ) wj^(t) − η · partial[j]
  t ← t + 1
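A vectorized numpy sketch of this algorithm; the step size, tolerance, and data below are assumptions, and the stopping check uses the full ridge-cost gradient so the loop can actually terminate at the ridge optimum:

```python
import numpy as np

def ridge_gradient_descent(H, y, lam, eta=1e-3, eps=1e-6, max_iter=100_000):
    """Minimize RSS(w) + lam * ||w||_2^2 by gradient descent, mirroring the slide's update."""
    w = np.zeros(H.shape[1])                          # init w = 0
    for _ in range(max_iter):
        grad_rss = -2 * H.T @ (y - H @ w)             # all partial[j] terms at once
        if np.linalg.norm(grad_rss + 2 * lam * w) < eps:
            break
        w = (1 - 2 * eta * lam) * w - eta * grad_rss  # the (1 - 2*eta*lam) shrinkage update
    return w

# Spot check against the closed-form solution on small random data.
rng = np.random.default_rng(4)
H = rng.normal(size=(50, 5)); y = rng.normal(size=50)
w_gd = ridge_gradient_descent(H, y, lam=2.0)
w_cf = np.linalg.solve(H.T @ H + 2.0 * np.eye(5), H.T @ y)
assert np.allclose(w_gd, w_cf, atol=1e-4)
```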
How to choose λ
The regression/ML workflow
1. Model selection: Need to choose tuning parameters λ controlling model complexity
2. Model assessment: Having selected a model, assess generalization error
Hypothetical implementation
1. Model selection: For each considered λ:
   i. Estimate parameters ŵλ on training data
   ii. Assess performance of ŵλ on test data
   iii. Choose λ* to be the λ with lowest test error
2. Model assessment: Compute test error of ŵλ* (fitted model for selected λ*) to approximate generalization error
[Data split: Training set | Test set]
Overly optimistic!
Issue: This is just like fitting ŵ and assessing its performance both on training data:
• λ* was selected to minimize test error (i.e., λ* was fit on test data)
• If the test data is not representative of the whole world, then ŵλ* will typically perform worse than the test error indicates
Practical implementation
Solution: Create two “test” sets!
1. Select λ* such that ŵλ* minimizes error on validation set
2. Approximate generalization error of ŵλ* using test set
[Data split: Training set | Validation set | Test set]
[Data split: Training set (fit ŵλ) | Validation set (test performance of ŵλ to select λ*) | Test set (assess generalization error of ŵλ*)]
Typical splits
Training set | Validation set | Test set
    80%      |      10%       |   10%
    50%      |      25%       |   25%
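A sketch of the whole selection procedure with an 80/10/10 split (the split percentages come from the slide; the data, λ grid, and helper names are assumptions):

```python
import numpy as np

def fit_ridge(H, y, lam):
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)

def mse(H, y, w):
    return np.mean((y - H @ w) ** 2)

# Synthetic data standing in for real features and prices.
rng = np.random.default_rng(5)
H = rng.normal(size=(1000, 8))
y = H @ rng.normal(size=8) + rng.normal(scale=0.5, size=1000)

# 80% training / 10% validation / 10% test.
idx = rng.permutation(len(y))
train, valid, test = idx[:800], idx[800:900], idx[900:]

lambdas = np.logspace(-4, 4, 30)
# 1. Select lambda* minimizing error on the validation set.
val_errs = [mse(H[valid], y[valid], fit_ridge(H[train], y[train], lam)) for lam in lambdas]
lam_star = lambdas[int(np.argmin(val_errs))]
# 2. Approximate generalization error of w_hat(lambda*) using the test set.
w_star = fit_ridge(H[train], y[train], lam_star)
print("lambda* =", lam_star, " test MSE =", mse(H[test], y[test], w_star))
```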
How to handle the intercept
Recall multiple regression model
Model: yi = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) + εi
          = Σj wj hj(xi) + εi

feature 1 = h0(x) … often 1 (constant)
feature 2 = h1(x) … e.g., x[1]
feature 3 = h2(x) … e.g., x[2]
…
feature D+1 = hD(x) … e.g., x[d]
If constant feature…
yi = w0 + w1 h1(xi) + … + wD hD(xi) + εi
In matrix notation for N observations: w0 multiplies a column of 1s (the first column of H is all 1s).
Do we penalize intercept?
Standard ridge regression cost:
RSS(w) + λ||w||_2^2   (λ = strength of penalty)

Encourages intercept w0 to also be small.
Do we want a small intercept? Conceptually, it is not indicative of overfitting…
Option 1: Don’t penalize intercept
Modified ridge regression cost:
RSS(w0, wrest) + λ||wrest||_2^2

How to implement this in practice?
Option 1: Don’t penalize intercept – Closed-form solution
ŵ = (H^T H + λ I_mod)^-1 H^T y, where I_mod is the identity matrix with a 0 in the entry corresponding to w0 (so the intercept is not penalized).
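A sketch of this modified closed form, assuming the first column of H is the constant feature so it is the [0, 0] entry of the identity that gets zeroed:

```python
import numpy as np

def ridge_no_intercept_penalty(H, y, lam):
    """w_hat = (H^T H + lam * I_mod)^{-1} H^T y, with I_mod = identity except a 0 for w_0."""
    I_mod = np.eye(H.shape[1])
    I_mod[0, 0] = 0.0                     # do not penalize the intercept w_0
    return np.linalg.solve(H.T @ H + lam * I_mod, H.T @ y)
```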
Option 1: Don’t penalize intercept – Gradient descent algorithm
while ||∇RSS(w^(t))|| > ε:
  for j = 0, …, D:
    partial[j] = −2 Σi hj(xi)(yi − ŷi(w^(t)))
    if j == 0:
      w0^(t+1) ← w0^(t) − η · partial[j]
    else:
      wj^(t+1) ← (1 − 2ηλ) wj^(t) − η · partial[j]
  t ← t + 1
Option 2: Center data first
If the data are first centered about 0, then favoring a small intercept is not so worrisome
Step 1: Transform y to have 0 mean
Step 2: Run ridge regression as normal (closed-form or gradient algorithms)
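One minimal way Option 2 could look in numpy (names are mine; predictions on new data add the mean back):

```python
import numpy as np

def ridge_centered(H, y, lam):
    """Step 1: center y; Step 2: run the standard ridge closed form on the centered target."""
    y_mean = y.mean()
    D1 = H.shape[1]
    w_hat = np.linalg.solve(H.T @ H + lam * np.eye(D1), H.T @ (y - y_mean))
    return w_hat, y_mean

# Predict on new features H_new with: y_hat = H_new @ w_hat + y_mean
```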
Summary for ridge regression
What you can do now…
• Describe what happens to the magnitude of estimated coefficients when the model is overfit
• Motivate form of ridge regression cost function
• Describe what happens to estimated coefficients of ridge regression as tuning parameter λ is varied
• Interpret coefficient path plot
• Estimate ridge regression parameters:
  - In closed form
  - Using an iterative gradient descent algorithm
• Use a validation set to select the ridge regression tuning parameter λ