
Page 1

Regularization

Jia-Bin Huang

Virginia Tech, Spring 2019, ECE-5424G / CS-5824

Page 2

Administrative

• Women in Data Science Blacksburg
• Location: Holtzman Alumni Center

• Welcome, 3:30 - 3:40, Assembly hall

• Keynote Speaker: Milinda Lakkam, "Detecting automation on LinkedIn's platform," 3:40 - 4:05, Assembly hall

• Career Panel, 4:05 - 5:00, Assembly hall

• Break, 5:00 - 5:20, Grand hall

• Keynote Speaker: Sally Morton, "Bias," 5:20 - 5:45, Assembly hall

• Dinner with breakout discussion groups, 5:45 - 7:00, Museum

• Introductory track tutorial: Jennifer Van Mullekom, "Data Visualization," 7:00 - 8:15, Assembly hall

• Advanced track tutorial: Cheryl Danner, "Focal-loss-based Deep Learning for Object Detection," 7:00 - 8:15, 2nd floor board room

Page 3

k-NN (Classification/Regression)

โ€ข Model๐‘ฅ 1 , ๐‘ฆ 1 , ๐‘ฅ 2 , ๐‘ฆ 2 , โ‹ฏ , ๐‘ฅ ๐‘š , ๐‘ฆ ๐‘š

โ€ข Cost function

None

โ€ข Learning

Do nothing

โ€ข Inference

เทœ๐‘ฆ = โ„Ž ๐‘ฅtest = ๐‘ฆ(๐‘˜), where ๐‘˜ = argmin๐‘– ๐ท(๐‘ฅtest, ๐‘ฅ(๐‘–))

Page 4

Linear regression (Regression)

โ€ข Modelโ„Ž๐œƒ ๐‘ฅ = ๐œƒ0 + ๐œƒ1๐‘ฅ1 + ๐œƒ2๐‘ฅ2 +โ‹ฏ+ ๐œƒ๐‘›๐‘ฅ๐‘› = ๐œƒโŠค๐‘ฅ

โ€ข Cost function

๐ฝ ๐œƒ =1

2๐‘š

๐‘–=1

๐‘š

โ„Ž๐œƒ ๐‘ฅ ๐‘– โˆ’ ๐‘ฆ ๐‘– 2

โ€ข Learning

1) Gradient descent: Repeat {๐œƒ๐‘— โ‰” ๐œƒ๐‘— โˆ’ ๐›ผ1

๐‘šฯƒ๐‘–=1๐‘š โ„Ž๐œƒ ๐‘ฅ ๐‘– โˆ’ ๐‘ฆ ๐‘– ๐‘ฅ๐‘—

๐‘–}

2) Solving normal equation ๐œƒ = (๐‘‹โŠค๐‘‹)โˆ’1๐‘‹โŠค๐‘ฆ

โ€ข Inferenceเทœ๐‘ฆ = โ„Ž๐œƒ ๐‘ฅtest = ๐œƒโŠค๐‘ฅtest

Page 5

Naïve Bayes (Classification)

โ€ข Modelโ„Ž๐œƒ ๐‘ฅ = ๐‘ƒ(๐‘Œ|๐‘‹1, ๐‘‹2, โ‹ฏ , ๐‘‹๐‘›) โˆ ๐‘ƒ ๐‘Œ ฮ ๐‘–๐‘ƒ ๐‘‹๐‘– ๐‘Œ)

โ€ข Cost functionMaximum likelihood estimation: ๐ฝ ๐œƒ = โˆ’ log ๐‘ƒ Data ๐œƒMaximum a posteriori estimation :๐ฝ ๐œƒ = โˆ’ log ๐‘ƒ Data ๐œƒ ๐‘ƒ ๐œƒ

โ€ข Learning๐œ‹๐‘˜ = ๐‘ƒ(๐‘Œ = ๐‘ฆ๐‘˜)

(Discrete ๐‘‹๐‘–) ๐œƒ๐‘–๐‘—๐‘˜ = ๐‘ƒ(๐‘‹๐‘– = ๐‘ฅ๐‘–๐‘—๐‘˜|๐‘Œ = ๐‘ฆ๐‘˜)

(Continuous ๐‘‹๐‘–) mean ๐œ‡๐‘–๐‘˜, variance ๐œŽ๐‘–๐‘˜2 , ๐‘ƒ ๐‘‹๐‘– ๐‘Œ = ๐‘ฆ๐‘˜) = ๐’ฉ(๐‘‹๐‘–|๐œ‡๐‘–๐‘˜ , ๐œŽ๐‘–๐‘˜

2 )

โ€ข Inference๐‘Œ โ† argmax

๐‘ฆ๐‘˜

๐‘ƒ ๐‘Œ = ๐‘ฆ๐‘˜ ฮ ๐‘–๐‘ƒ ๐‘‹๐‘–test ๐‘Œ = ๐‘ฆ๐‘˜)

Page 6

Logistic regression (Classification)

โ€ข Modelโ„Ž๐œƒ ๐‘ฅ = ๐‘ƒ ๐‘Œ = 1 ๐‘‹1, ๐‘‹2, โ‹ฏ , ๐‘‹๐‘› =

1

1+๐‘’โˆ’๐œƒโŠค๐‘ฅ

โ€ข Cost function

๐ฝ ๐œƒ =1

๐‘š

๐‘–=1

๐‘š

Cost(โ„Ž๐œƒ(๐‘ฅ๐‘– ), ๐‘ฆ(๐‘–))) Cost(โ„Ž๐œƒ ๐‘ฅ , ๐‘ฆ) = เต

โˆ’log โ„Ž๐œƒ ๐‘ฅ if ๐‘ฆ = 1

โˆ’log 1 โˆ’ โ„Ž๐œƒ ๐‘ฅ if ๐‘ฆ = 0

โ€ข LearningGradient descent: Repeat {๐œƒ๐‘— โ‰” ๐œƒ๐‘— โˆ’ ๐›ผ

1

๐‘šฯƒ๐‘–=1๐‘š โ„Ž๐œƒ ๐‘ฅ ๐‘– โˆ’ ๐‘ฆ ๐‘– ๐‘ฅ๐‘—

๐‘–}

โ€ข Inference๐‘Œ = โ„Ž๐œƒ ๐‘ฅtest =

1

1 + ๐‘’โˆ’๐œƒโŠค๐‘ฅtest

Page 7

Logistic Regression

• Hypothesis representation

• Cost function

• Logistic regression with gradient descent

• Regularization

• Multi-class classification

$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$

$\text{Cost}(h_\theta(x), y) = \begin{cases} -\log h_\theta(x) & \text{if } y = 1 \\ -\log\left( 1 - h_\theta(x) \right) & \text{if } y = 0 \end{cases}$

$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

Page 8

How about MAP?

• Maximum conditional likelihood estimate (MCLE)

$\theta_{\text{MCLE}} = \arg\max_\theta \prod_{i=1}^{m} P_\theta\left( y^{(i)} \mid x^{(i)} \right)$

• Maximum conditional a posteriori estimate (MCAP)

$\theta_{\text{MCAP}} = \arg\max_\theta \prod_{i=1}^{m} P_\theta\left( y^{(i)} \mid x^{(i)} \right) P(\theta)$

Page 9

Prior ๐‘ƒ(๐œƒ)

• Common choice of $P(\theta)$: Normal distribution, zero mean, identity covariance

• "Pushes" parameters toward zero

• Corresponds to regularization

• Helps avoid very large weights and overfitting

Slide credit: Tom Mitchell

Page 10

MLE vs. MAP

• Maximum conditional likelihood estimate (MCLE)

$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

• Maximum conditional a posteriori estimate (MCAP)

$\theta_j := \theta_j - \alpha \lambda \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

Page 11

Logistic Regression

• Hypothesis representation

• Cost function

• Logistic regression with gradient descent

• Regularization

• Multi-class classification

Page 12

Multi-class classification

• Email foldering/tagging: Work, Friends, Family, Hobby

• Medical diagnosis: Not ill, Cold, Flu

• Weather: Sunny, Cloudy, Rain, Snow

Slide credit: Andrew Ng

Page 13

Binary classification vs. multiclass classification

[Two scatter plots over features $x_1$, $x_2$: one with two classes, one with three.]

Page 14

One-vs-all (one-vs-rest)

[Scatter plot over $x_1$, $x_2$ with three classes: Class 1, Class 2, Class 3.]

$h_\theta^{(i)}(x) = P(y = i \mid x; \theta) \quad (i = 1, 2, 3)$

[Three per-class plots over $x_1$, $x_2$, one for each of $h_\theta^{(1)}(x)$, $h_\theta^{(2)}(x)$, $h_\theta^{(3)}(x)$, each separating one class from the rest.]

Slide credit: Andrew Ng

Page 15

One-vs-all

• Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$

• Given a new input $x$, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$

Slide credit: Andrew Ng
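
A sketch of one-vs-all on top of any binary trainer, such as the `fit_logistic` sketch earlier (helper names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_one_vs_all(X, y, classes, fit_binary):
    """Train one classifier h_theta^(i) per class i, relabeling y == i as 1."""
    return {i: fit_binary(X, (y == i).astype(float)) for i in classes}

def predict_one_vs_all(thetas, x):
    """Pick the class i that maximizes h_theta^(i)(x)."""
    return max(thetas, key=lambda i: sigmoid(x @ thetas[i]))
```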

Page 16

Generative Approach (Ex: Naïve Bayes)

Estimate $P(Y)$ and $P(X \mid Y)$

Prediction: $\hat{y} = \arg\max_y P(Y = y)\, P(X = x \mid Y = y)$

Discriminative Approach (Ex: Logistic regression)

Estimate $P(Y \mid X)$ directly
(Or a discriminant function: e.g., SVM)

Prediction: $\hat{y} = \arg\max_y P(Y = y \mid X = x)$

Page 17

Further readings

• Tom M. Mitchell, Generative and discriminative classifiers: Naïve Bayes and Logistic Regression. http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf

• Andrew Ng, Michael Jordan, On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. http://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes.pdf

Page 18

Regularization

• Overfitting

• Cost function

• Regularized linear regression

• Regularized logistic regression

Page 19

Regularization

• Overfitting

• Cost function

• Regularized linear regression

• Regularized logistic regression

Page 20

Example: Linear regression (price, $ in 1000's, vs. size in feet²)

[Three fits of price vs. size:]

$h_\theta(x) = \theta_0 + \theta_1 x$ (underfitting)

$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$ (just right)

$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 + \cdots$ (overfitting)

Slide credit: Andrew Ng

Page 21

Overfitting

• If we have too many features (i.e., a complex model), the learned hypothesis may fit the training set very well,

$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \approx 0,$

but fail to generalize to new examples (e.g., predicting prices for new examples).

Slide credit: Andrew Ng

Page 22

Example: Linear regression (price, $ in 1000's, vs. size in feet²)

[The same three fits of price vs. size:]

$h_\theta(x) = \theta_0 + \theta_1 x$ (underfitting: high bias)

$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$ (just right)

$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 + \cdots$ (overfitting: high variance)

Slide credit: Andrew Ng

Page 23

Bias-Variance Tradeoff

• Bias: difference between what you expect to learn and the truth
• Measures how well you expect to represent the true solution
• Decreases with a more complex model

• Variance: difference between what you expect to learn and what you learn from a particular dataset
• Measures how sensitive the learner is to a specific dataset
• Increases with a more complex model

Page 24

[Figure: 2 × 2 grid illustrating the four combinations of low/high bias and low/high variance.]

Page 25

Bias-variance decomposition

• Training set $\{ (x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n) \}$

• $y = f(x) + \varepsilon$

• We want $\hat{f}(x)$ that minimizes $E\left[ \left( y - \hat{f}(x) \right)^2 \right]$

$E\left[ \left( y - \hat{f}(x) \right)^2 \right] = \text{Bias}\left[ \hat{f}(x) \right]^2 + \text{Var}\left[ \hat{f}(x) \right] + \sigma^2$

$\text{Bias}\left[ \hat{f}(x) \right] = E\left[ \hat{f}(x) \right] - f(x)$

$\text{Var}\left[ \hat{f}(x) \right] = E\left[ \hat{f}(x)^2 \right] - E\left[ \hat{f}(x) \right]^2$

https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
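
The decomposition can be checked numerically at a single point $x_0$ by refitting a model on many resampled training sets; a Monte-Carlo sketch under an assumed toy $f$ and noise level:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x)     # assumed true function
sigma = 0.3                 # noise std; sigma^2 is the irreducible error
x0, n, trials = 1.0, 20, 5000

preds = []
for _ in range(trials):
    x = rng.uniform(0.0, 2.0, n)
    y = f(x) + rng.normal(0.0, sigma, n)
    coef = np.polyfit(x, y, deg=1)      # deliberately simple (high-bias) model
    preds.append(np.polyval(coef, x0))
preds = np.array(preds)

bias2 = (preds.mean() - f(x0)) ** 2
var = preds.var()
# Expected squared error at x0 should be close to bias^2 + var + sigma^2
print(bias2, var, bias2 + var + sigma ** 2)
```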

Page 26

Overfitting

[Three classifiers on tumor data, features $x_1$ = Tumor Size, $x_2$ = Age:]

$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$ (underfitting)

$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2)$ (just right)

$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2 + \theta_6 x_1^3 x_2 + \theta_7 x_1 x_2^3 + \cdots)$ (overfitting)

Slide credit: Andrew Ng

Page 27

Addressing overfitting

โ€ข ๐‘ฅ1 = size of house

โ€ข ๐‘ฅ2 = no. of bedrooms

โ€ข ๐‘ฅ3 = no. of floors

โ€ข ๐‘ฅ4 = age of house

โ€ข ๐‘ฅ5 = average income in neighborhood

โ€ข ๐‘ฅ6 = kitchen size

โ€ข โ‹ฎ

โ€ข ๐‘ฅ100

Price ($)in 1000โ€™s

Size in feet^2

Slide credit: Andrew Ng

Page 28

Addressing overfitting

• 1. Reduce the number of features.
• Manually select which features to keep.
• Model selection algorithm (later in course).

• 2. Regularization.
• Keep all the features, but reduce the magnitude/values of the parameters $\theta_j$.
• Works well when we have a lot of features, each of which contributes a bit to predicting $y$.

Slide credit: Andrew Ng

Page 29

Overfitting Thriller

• https://www.youtube.com/watch?v=DQWI1kvmwRg

Page 30

Regularization

• Overfitting

• Cost function

• Regularized linear regression

• Regularized logistic regression

Page 31

Intuition

• Suppose we penalize and make $\theta_3$, $\theta_4$ really small.

$\min_\theta J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + 1000\, \theta_3^2 + 1000\, \theta_4^2$

[Two fits of price ($ in 1000's) vs. size in feet²:]

$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$

$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$

Slide credit: Andrew Ng

Page 32

Regularization

• Small values for parameters $\theta_1, \theta_2, \cdots, \theta_n$
• "Simpler" hypothesis
• Less prone to overfitting

• Housing:
• Features: $x_1, x_2, \cdots, x_{100}$
• Parameters: $\theta_0, \theta_1, \theta_2, \cdots, \theta_{100}$

$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2$

Slide credit: Andrew Ng

Page 33

Regularization

๐ฝ ๐œƒ =1

2๐‘š

๐‘–=1

๐‘š

โ„Ž๐œƒ ๐‘ฅ ๐‘– โˆ’ ๐‘ฆ ๐‘– 2+ ๐œ†

๐‘—=1

๐‘›

๐œƒ๐‘—2

min๐œƒ

๐ฝ(๐œƒ)

Price ($)in 1000โ€™s

Size in feet^2

๐œ†: Regularization parameter

Slide credit: Andrew Ng

Page 34

Question

๐ฝ ๐œƒ =1

2๐‘š

๐‘–=1

๐‘š

โ„Ž๐œƒ ๐‘ฅ ๐‘– โˆ’ ๐‘ฆ ๐‘– 2+ ๐œ†

๐‘—=1

๐‘›

๐œƒ๐‘—2

What if ๐œ† is set to an extremely large value (say ๐œ† = 1010)?

1. Algorithm works fine; setting to be very large canโ€™t hurt it

2. Algorithm fails to eliminate overfitting.

3. Algorithm results in underfitting. (Fails to fit even training data well).

4. Gradient descent will fail to converge.

Slide credit: Andrew Ng

Page 35

Question

๐ฝ ๐œƒ =1

2๐‘š

๐‘–=1

๐‘š

โ„Ž๐œƒ ๐‘ฅ ๐‘– โˆ’ ๐‘ฆ ๐‘– 2+ ๐œ†

๐‘—=1

๐‘›

๐œƒ๐‘—2

What if ๐œ† is set to an extremely large value (say ๐œ† = 1010)?Price ($)in 1000โ€™s

Size in feet^2

โ„Ž๐œƒ ๐‘ฅ = ๐œƒ0 + ๐œƒ1๐‘ฅ1 + ๐œƒ2๐‘ฅ2 +โ‹ฏ+ ๐œƒ๐‘›๐‘ฅ๐‘› = ๐œƒโŠค๐‘ฅSlide credit: Andrew Ng

Page 36

Regularization

• Overfitting

• Cost function

• Regularized linear regression

• Regularized logistic regression

Page 37

Regularized linear regression

๐ฝ ๐œƒ =1

2๐‘š

๐‘–=1

๐‘š

โ„Ž๐œƒ ๐‘ฅ ๐‘– โˆ’ ๐‘ฆ ๐‘– 2+ ๐œ†

๐‘—=1

๐‘›

๐œƒ๐‘—2

min๐œƒ

๐ฝ(๐œƒ)

๐‘›: Number of features

๐œƒ0 is not panelizedSlide credit: Andrew Ng

Page 38

Gradient descent (Previously)

Repeat {

$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$  (for $j = 0$)

$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$  (for $j = 1, 2, 3, \cdots, n$)

}

Slide credit: Andrew Ng

Page 39

Gradient descent (Regularized)

Repeat {

$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$

$\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]$

}

Equivalently, $\theta_j := \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

Slide credit: Andrew Ng
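
One regularized update in NumPy, written in the weight-decay form above; a sketch that leaves $\theta_0$ unpenalized, as on the slide:

```python
import numpy as np

def step_ridge(theta, X, y, alpha, lam):
    """theta_j := theta_j*(1 - alpha*lam/m) - alpha*(1/m)*sum_i (h(x)-y) x_j."""
    m = len(y)
    grad = (X.T @ (X @ theta - y)) / m
    decay = np.full_like(theta, 1.0 - alpha * lam / m)
    decay[0] = 1.0          # theta_0 (bias) is not decayed
    return decay * theta - alpha * grad
```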

Page 40

Comparison

Regularized linear regression

$\theta_j := \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

Un-regularized linear regression

$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

$1 - \alpha \frac{\lambda}{m} < 1$: Weight decay

Page 41

Normal equation

โ€ข ๐‘‹ =

๐‘ฅ 1 โŠค

๐‘ฅ 2 โŠค

โ‹ฎ

๐‘ฅ ๐‘š โŠค

โˆˆ ๐‘…๐‘šร—(๐‘›+1) ๐‘ฆ =

๐‘ฆ(1)

๐‘ฆ(2)

โ‹ฎ๐‘ฆ(๐‘š)

โˆˆ ๐‘…๐‘š

โ€ข min๐œƒ

๐ฝ(๐œƒ)

โ€ข ๐œƒ = ๐‘‹โŠค๐‘‹ + ๐œ†

0 0 โ‹ฏ 00 1 0 0โ‹ฎ โ‹ฎ โ‹ฑ โ‹ฎ0 0 0 1

โˆ’1

๐‘‹โŠค๐‘ฆ

(๐‘› + 1 ) ร— (๐‘› + 1) Slide credit: Andrew Ng

Page 42

Regularization

• Overfitting

• Cost function

• Regularized linear regression

• Regularized logistic regression

Page 43

Regularized logistic regression

• Cost function:

$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left( 1 - h_\theta(x^{(i)}) \right) \right] + \frac{\lambda}{2} \sum_{j=1}^{n} \theta_j^2$

[Decision boundary on tumor data, features $x_1$ = Tumor Size, $x_2$ = Age.]

$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2 + \theta_6 x_1^3 x_2 + \theta_7 x_1 x_2^3 + \cdots)$

Slide credit: Andrew Ng

Page 44

Gradient descent (Regularized)

Repeat {

$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$

$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} - \alpha \lambda \theta_j$

}

where $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$ and the update direction is $\frac{\partial}{\partial \theta_j} J(\theta)$.

Slide credit: Andrew Ng
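
A sketch of the full regularized fit, with the $\lambda \theta_j$ term added for $j \geq 1$ only (so $\theta_0$ stays unpenalized), matching the update above:

```python
import numpy as np

def fit_logistic_l2(X, y, alpha=0.1, lam=0.01, iters=1000):
    """Gradient descent on the regularized cost; theta_0 is not penalized."""
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))
        grad = (X.T @ (h - y)) / m
        grad[1:] += lam * theta[1:]    # + lambda * theta_j for j >= 1
        theta -= alpha * grad
    return theta
```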

Page 45

$\|\theta\|_1$: Lasso regularization

$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} |\theta_j|$

LASSO: Least Absolute Shrinkage and Selection Operator

Page 46

Single predictor: Soft Thresholding

• $\underset{\theta}{\text{minimize}} \; \frac{1}{2m} \sum_{i=1}^{m} \left( x^{(i)} \theta - y^{(i)} \right)^2 + \lambda |\theta|$

$\theta = \begin{cases} \frac{1}{m} \langle \boldsymbol{x}, \boldsymbol{y} \rangle - \lambda & \text{if } \frac{1}{m} \langle \boldsymbol{x}, \boldsymbol{y} \rangle > \lambda \\ 0 & \text{if } \frac{1}{m} \left| \langle \boldsymbol{x}, \boldsymbol{y} \rangle \right| \leq \lambda \\ \frac{1}{m} \langle \boldsymbol{x}, \boldsymbol{y} \rangle + \lambda & \text{if } \frac{1}{m} \langle \boldsymbol{x}, \boldsymbol{y} \rangle < -\lambda \end{cases}$

$\theta = S_\lambda\!\left( \frac{1}{m} \langle \boldsymbol{x}, \boldsymbol{y} \rangle \right)$

Soft thresholding operator: $S_\lambda(x) = \operatorname{sign}(x) \left( |x| - \lambda \right)_+$
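
The operator and the single-predictor solution in NumPy; a sketch, assuming the predictor is standardized so that $\frac{1}{m} \langle \boldsymbol{x}, \boldsymbol{x} \rangle = 1$ (which the closed form above implicitly requires):

```python
import numpy as np

def soft_threshold(x, lam):
    """S_lambda(x) = sign(x) * max(|x| - lambda, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def lasso_single_predictor(x, y, lam):
    """Closed-form lasso for one predictor: theta = S_lambda((1/m) <x, y>),
    assuming (1/m) <x, x> = 1."""
    m = len(y)
    return soft_threshold(x @ y / m, lam)
```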

Page 47

Multiple predictors: Cyclic Coordinate Descent

• $\underset{\theta}{\text{minimize}} \; \frac{1}{2m} \sum_{i=1}^{m} \left( x_j^{(i)} \theta_j + \sum_{k \neq j} x_k^{(i)} \theta_k - y^{(i)} \right)^2 + \lambda \sum_{k \neq j} |\theta_k| + \lambda |\theta_j|$

For each $j$, update $\theta_j$ with

$\underset{\theta_j}{\text{minimize}} \; \frac{1}{2m} \sum_{i=1}^{m} \left( x_j^{(i)} \theta_j - r_j^{(i)} \right)^2 + \lambda |\theta_j|$

where $r_j^{(i)} = y^{(i)} - \sum_{k \neq j} x_k^{(i)} \theta_k$

Page 48

L1 and L2 balls

Image credit: https://web.stanford.edu/~hastie/StatLearnSparsity_files/SLS.pdf

Page 49

Terminology

| Regularization function | Name | Solver |
|---|---|---|
| $\|\theta\|_2^2 = \sum_{j=1}^{n} \theta_j^2$ | Tikhonov regularization / Ridge regression | Closed form |
| $\|\theta\|_1 = \sum_{j=1}^{n} |\theta_j|$ | LASSO regression | Proximal gradient descent, least angle regression |
| $\alpha \|\theta\|_1 + (1 - \alpha) \|\theta\|_2^2$ | Elastic net regularization | Proximal gradient descent |

Page 50

Things to remember

• Overfitting

• Cost function

• Regularized linear regression

• Regularized logistic regression