Ridge Regression - University of Washington


CSE 446: Machine Learning
Emily Fox
University of Washington
January 13, 2017

©2017 Emily Fox

Ridge Regression: Regulating overfitting when using many features


Training, true, & test error vs. model complexity


[Figure: training, true, and test error curves vs. model complexity (x-axis: model complexity, y-axis: error), with x-y sketches of an underfit and an overfit model. Overfitting if: training error is low but true error is high.]


Error vs. amount of data


[Figure: error vs. # data points in training set (x-axis: # data points in training set, y-axis: error).]


Overfitting of polynomial regression



Flexibility of high-order polynomials


[Figure: two plots of price ($) vs. square feet (sq.ft.), each showing data and a fitted function fŵ, illustrating how flexible high-order polynomial fits can be.]

Model: yi = w0 + w1 xi + w2 xi² + … + wp xiᵖ + εi


Symptom of overfitting

Often, overfitting is associated with very large estimated parameters ŵ



Overfitting of linear regression models more generically


Overfitting with many features

Not unique to polynomial regression, but also arises with lots of inputs (d large)

Or, generically, lots of features (D large):

  yi = Σj wj hj(xi) + εi


- Square feet

- # bathrooms

- # bedrooms

- Lot size

- Year built

- …


How does # of observations influence overfitting?

Few observations (N small): rapidly overfit as model complexity increases

Many observations (N very large): harder to overfit


[Figure: two plots of price ($) vs. square feet (sq.ft.) with fitted function fŵ: one fit on few observations, one on many.]


How does # of inputs influence overfitting?

1 input (e.g., sq.ft.): Data must include representative examples of all possible (sq.ft., $) pairs to avoid overfitting


[Figure: price ($) vs. square feet (sq.ft.) with fitted function fŵ.]


How does # of inputs influence overfitting?

d inputs (e.g., sq.ft., #bath, #bed, lot size, year, …):

Data must include examples of all possible (sq.ft., #bath, #bed, lot size, year, …, $) combos to avoid overfitting


[Figure: price ($, y) vs. two inputs x[1] (square feet) and x[2], with fitted surface.]


Adding term to cost-of-fit to prefer small coefficients



Desired total cost format

Want to balance:

i. How well function fits data

ii. Magnitude of coefficients

Total cost =
  measure of fit + measure of magnitude of coefficients

- measure of fit: small # = good fit to training data
- measure of magnitude of coefficients: small # = not overfit


Measure of fit to training data


RSS(w) = Σi (yi − h(xi)ᵀw)²

[Figure: price ($, y) vs. inputs x[1] (square feet) and x[2], with fitted surface.]


Measure of magnitude of regression coefficients

What summary # is indicative of the size of the regression coefficients?

- Sum? (no: large positive and negative coefficients cancel)
- Sum of absolute values? (the L1 norm)
- Sum of squares (the L2 norm, squared: ||w||₂²)



Consider specific total cost

Total cost = measure of fit + measure of magnitude of coefficients
           = RSS(w) + ||w||₂²


Consider resulting objective

What if ŵ is selected to minimize

  RSS(w) + λ||w||₂²

where the tuning parameter λ balances fit and magnitude?

If λ=0: the penalty vanishes, so this reduces to the standard least squares (RSS) fit

If λ=∞: any nonzero w has infinite cost, so ŵ = 0

If λ in between: balances fit to the data against the magnitude of the coefficients


Consider resulting objective

What if ŵ is selected to minimize

  RSS(w) + λ||w||₂²

This is ridge regression (a.k.a. L2 regularization); the tuning parameter λ balances fit and magnitude.


Bias-variance tradeoff

Large λ: high bias, low variance (e.g., ŵ=0 for λ=∞)

Small λ: low bias, high variance (e.g., standard least squares (RSS) fit of high-order polynomial for λ=0)

In essence, λ controls model complexity.


Revisit polynomial fit demo

What happens if we refit our high-order polynomial, but now using ridge regression?

Will consider a few settings of λ …
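For concreteness, a minimal sketch of such a demo (synthetic data; scikit-learn's Ridge, whose alpha parameter plays the role of λ, and the degree-16 polynomial are illustrative assumptions, not the lecture's actual demo):

```python
# Refit a high-order polynomial under a few settings of lambda and watch
# the maximum coefficient magnitude shrink as lambda grows.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(4 * x) + rng.normal(scale=0.3, size=30)

for lam in [1e-9, 1e-4, 1e-1, 1e2]:          # 1e-9 ~ nearly unregularized
    model = make_pipeline(PolynomialFeatures(degree=16), Ridge(alpha=lam))
    model.fit(x[:, None], y)
    w = model.named_steps["ridge"].coef_
    print(f"lambda={lam:g}: max |w_j| = {np.abs(w).max():.3g}")
```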


Coefficient path

[Figure: coefficient path; coefficients ŵj (y-axis) plotted against λ (x-axis).]
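A hedged sketch of how such a plot could be produced (synthetic data; again scikit-learn's alpha stands in for λ):

```python
# Fit ridge across a grid of lambda values and plot every coefficient's path.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=50)

lambdas = np.logspace(-2, 4, 60)
paths = np.array([Ridge(alpha=lam).fit(X, y).coef_ for lam in lambdas])

plt.plot(lambdas, paths)      # one curve per coefficient w_j
plt.xscale("log")
plt.xlabel("lambda")
plt.ylabel("coefficients")
plt.title("Ridge coefficient path")
plt.show()
```

As λ grows, every path is pulled toward 0, matching the bias-variance discussion above.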


Fitting the ridge regression model (for given λ value)


Step 1: Rewrite total cost in matrix notation


Recall matrix form of RSS

Model for all N observations together:

  y = Hw + ε


RSS(w) = Σi (yi − h(xi)ᵀw)²
       = (y − Hw)ᵀ(y − Hw)


Rewrite magnitude of coefficients in vector notation


||w||₂² = w0² + w1² + w2² + … + wD² = wᵀw


Putting it all together


In matrix form, ridge regression cost is:

  RSS(w) + λ||w||₂² = (y − Hw)ᵀ(y − Hw) + λwᵀw
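A minimal NumPy sketch of this matrix-form cost (names like H and lam are illustrative):

```python
import numpy as np

def ridge_cost(w, H, y, lam):
    """Ridge total cost: (y - Hw)^T (y - Hw) + lambda * w^T w."""
    r = y - H @ w
    return r @ r + lam * (w @ w)
```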


Step 2: Compute the gradient


Gradient of ridge regression cost

∇[RSS(w) + λ||w||₂²] = ∇[(y − Hw)ᵀ(y − Hw) + λwᵀw]
                     = ∇[(y − Hw)ᵀ(y − Hw)] + λ∇[wᵀw]
                     = −2Hᵀ(y − Hw) + 2λw

Why is ∇[wᵀw] = 2w? By analogy to the 1d case: wᵀw is analogous to w², and the derivative of w² is 2w.
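The same gradient as a one-line NumPy sketch (names illustrative):

```python
import numpy as np

def ridge_gradient(w, H, y, lam):
    """Gradient of the ridge cost: -2 H^T (y - Hw) + 2 * lambda * w."""
    return -2 * H.T @ (y - H @ w) + 2 * lam * w
```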


Step 3, Approach 1: Set the gradient = 0


Ridge closed-form solution


∇cost(w) = −2Hᵀ(y − Hw) + 2λIw = 0

Solve for w:

  −2Hᵀy + 2HᵀHw + 2λIw = 0
  (HᵀH + λI)w = Hᵀy
  ŵ = (HᵀH + λI)⁻¹Hᵀy


Interpreting ridge closed-form solution


ŵ = (HᵀH + λI)⁻¹Hᵀy

If λ=0: ŵ = (HᵀH)⁻¹Hᵀy, the standard least squares solution

If λ=∞: ŵ = 0
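A minimal sketch of this closed form; using np.linalg.solve instead of an explicit inverse is an assumption of good practice, not something the slides prescribe:

```python
import numpy as np

def ridge_closed_form(H, y, lam):
    """w_hat = (H^T H + lambda * I)^{-1} H^T y, computed via a linear solve."""
    D = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ y)
```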


Recall discussion on previous closed-form solution

ŵ = (HᵀH)⁻¹Hᵀy

Invertible if: in general, N > D (with N counted as # of linearly independent observations)

Complexity of inverse: O(D³)


Discussion of ridge closed-form solution

ŵ = (HᵀH + λI)⁻¹Hᵀy

Invertible if: always, for λ > 0 (even if N < D)

Complexity of inverse: O(D³) … big for large D!


Step 3, Approach 2: Gradient descent


Elementwise ridge regression gradient descent algorithm


∇cost(w) = −2Hᵀ(y − Hw) + 2λw

Update to jth feature weight:

  wj(t+1) ← wj(t) − η · [ −2 Σi hj(xi)(yi − ŷi(w(t))) + 2λ wj(t) ]


Recall previous algorithm


init w(1) = 0 (or randomly, or smartly), t = 1
while ||∇RSS(w(t))|| > ε:
    for j = 0, …, D:
        partial[j] = −2 Σi hj(xi)(yi − ŷi(w(t)))
        wj(t+1) ← wj(t) − η · partial[j]
    t ← t + 1


Summary of ridge regression algorithm


init w(1) = 0 (or randomly, or smartly), t = 1
while ||∇RSS(w(t))|| > ε:
    for j = 0, …, D:
        partial[j] = −2 Σi hj(xi)(yi − ŷi(w(t)))
        wj(t+1) ← (1 − 2ηλ) wj(t) − η · partial[j]
    t ← t + 1
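A runnable sketch of this summary algorithm; the per-feature loop is vectorized, and the step size η, tolerance ε, and iteration cap are illustrative:

```python
import numpy as np

def ridge_gradient_descent(H, y, lam, eta=1e-4, eps=1e-3, max_iter=100_000):
    w = np.zeros(H.shape[1])                 # init w(1) = 0
    for _ in range(max_iter):
        grad_rss = -2 * H.T @ (y - H @ w)    # partial[j], all j at once
        if np.linalg.norm(grad_rss) <= eps:  # while ||grad RSS(w(t))|| > eps
            break
        w = (1 - 2 * eta * lam) * w - eta * grad_rss
    return w
```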


How to choose λ


The regression/ML workflow

1. Model selection: Need to choose tuning parameters, such as λ, controlling model complexity

2. Model assessment: Having selected a model, assess generalization error


Hypothetical implementation

1. Model selection: For each considered λ:
   i. Estimate parameters ŵλ on training data
   ii. Assess performance of ŵλ on test data
   iii. Choose λ* to be the λ with lowest test error

2. Model assessment: Compute test error of ŵλ* (fitted model for selected λ*) to approximate generalization error

[Data split: training set | test set]

Overly optimistic!


Hypothetical implementation


Issue: Just like fitting ŵ and assessing its performance both on training data:

- λ* was selected to minimize test error (i.e., λ* was fit on test data)
- If test data is not representative of the whole world, then ŵλ* will typically perform worse than test error indicates


Practical implementation


Solution: Create two “test” sets!

1. Select λ* such that ŵλ* minimizes error on validation set

2. Approximate generalization error of ŵλ* using test set

[Data split: training set | validation set | test set]


Practical implementation


[Data split: training set | validation set | test set]

- Training set: fit ŵλ
- Validation set: test performance of ŵλ to select λ*
- Test set: assess generalization error of ŵλ*


Typical splits


Training set | Validation set | Test set
80%          | 10%            | 10%
50%          | 25%            | 25%
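An end-to-end sketch of this workflow on synthetic data with an 80/10/10 split (every name and number here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(1000, 10))
y = H @ rng.normal(size=10) + rng.normal(size=1000)

idx = rng.permutation(len(y))
train, valid, test = idx[:800], idx[800:900], idx[900:]   # 80% / 10% / 10%

def fit(H, y, lam):
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)

def mse(H, y, w):
    return np.mean((y - H @ w) ** 2)

# 1. Select lambda* minimizing error on the validation set
lambdas = np.logspace(-3, 3, 25)
lam_star = min(lambdas,
               key=lambda lam: mse(H[valid], y[valid], fit(H[train], y[train], lam)))

# 2. Approximate generalization error of the selected model on the test set
w_star = fit(H[train], y[train], lam_star)
print(f"lambda* = {lam_star:.4g}, test MSE = {mse(H[test], y[test], w_star):.4g}")
```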


How to handle the intercept


Recall multiple regression model


Model: yi = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) + εi
          = Σj wj hj(xi) + εi

- feature 1 = h0(x) … often 1 (constant)
- feature 2 = h1(x) … e.g., x[1]
- feature 3 = h2(x) … e.g., x[2]
- …
- feature D+1 = hD(x) … e.g., x[d]
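A small sketch of building the feature matrix H for this model, with the constant feature h0(x) = 1 in the first column (the helper name and raw-input layout are illustrative assumptions):

```python
import numpy as np

def build_H(X):
    """Prepend the constant feature h0(x) = 1 to raw inputs X of shape (N, d)."""
    return np.hstack([np.ones((X.shape[0], 1)), X])
```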


If constant feature…


yi = w0 + w1 h1(xi) + … + wD hD(xi) + εi

In matrix notation for N observations: w0 multiplies a column of all 1s, the first column of H.


Do we penalize intercept?

Standard ridge regression cost:

  RSS(w) + λ||w||₂²   (λ = strength of penalty)

Encourages intercept w0 to also be small.

Do we want a small intercept? Conceptually, it is not indicative of overfitting…


Option 1: Don’t penalize intercept

Modified ridge regression cost:

  RSS(w0, wrest) + λ||wrest||₂²

How to implement this in practice?


Option 1: Don’t penalize intercept (closed-form solution)


ŵ = (HᵀH + λImod)⁻¹Hᵀy

where Imod is the identity matrix with its first diagonal entry (the one multiplying w0) set to 0, so the intercept is not penalized.
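A minimal sketch of this modified closed form, assuming the constant feature occupies the first column of H:

```python
import numpy as np

def ridge_unpenalized_intercept(H, y, lam):
    """w_hat = (H^T H + lambda * I_mod)^{-1} H^T y with I_mod[0, 0] = 0."""
    D = H.shape[1]
    I_mod = np.eye(D)
    I_mod[0, 0] = 0.0        # do not penalize the intercept weight w0
    return np.linalg.solve(H.T @ H + lam * I_mod, H.T @ y)
```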


Option 1: Don’t penalize intercept (gradient descent algorithm)


while ||∇RSS(w(t))|| > ε:
    for j = 0, …, D:
        partial[j] = −2 Σi hj(xi)(yi − ŷi(w(t)))
        if j == 0:
            w0(t+1) ← w0(t) − η · partial[j]
        else:
            wj(t+1) ← (1 − 2ηλ) wj(t) − η · partial[j]
    t ← t + 1
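A runnable sketch of this variant: the earlier gradient descent with the (1 − 2ηλ) shrinkage disabled for w0 (names illustrative):

```python
import numpy as np

def ridge_gd_free_intercept(H, y, lam, eta=1e-4, eps=1e-3, max_iter=100_000):
    w = np.zeros(H.shape[1])
    shrink = np.full(H.shape[1], 1 - 2 * eta * lam)
    shrink[0] = 1.0                          # j == 0: leave w0 unshrunk
    for _ in range(max_iter):
        grad_rss = -2 * H.T @ (y - H @ w)
        if np.linalg.norm(grad_rss) <= eps:
            break
        w = shrink * w - eta * grad_rss
    return w
```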


Option 2: Center data first

If data are first centered about 0, then favoring a small intercept is not so worrisome.

Step 1: Transform y to have 0 mean

Step 2: Run ridge regression as normal (closed-form or gradient algorithms)
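A minimal sketch of Option 2; adding the mean of y back at prediction time is an assumption about the usual convention, not something stated on the slide:

```python
import numpy as np

def ridge_centered(H, y, lam):
    y_mean = y.mean()                        # Step 1: transform y to 0 mean
    D = H.shape[1]
    w = np.linalg.solve(H.T @ H + lam * np.eye(D),
                        H.T @ (y - y_mean))  # Step 2: ridge as normal
    return w, y_mean                         # predict with H @ w + y_mean
```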


Summary for ridge regression


What you can do now…

• Describe what happens to magnitude of estimated coefficients when model is overfit
• Motivate form of ridge regression cost function
• Describe what happens to estimated coefficients of ridge regression as tuning parameter λ is varied
• Interpret coefficient path plot
• Estimate ridge regression parameters:
  - In closed form
  - Using an iterative gradient descent algorithm
• Use a validation set to select the ridge regression tuning parameter λ
