Regularization
Jia-Bin Huang
ECE-5424G / CS-5824, Virginia Tech, Spring 2019
Administrative

• Women in Data Science Blacksburg
• Location: Holtzman Alumni Center
  • Welcome, 3:30 - 3:40, Assembly hall
  • Keynote Speaker: Milinda Lakkam, "Detecting automation on LinkedIn's platform," 3:40 - 4:05, Assembly hall
  • Career Panel, 4:05 - 5:00, Assembly hall
  • Break, 5:00 - 5:20, Grand hall
  • Keynote Speaker: Sally Morton, "Bias," 5:20 - 5:45, Assembly hall
  • Dinner with breakout discussion groups, 5:45 - 7:00, Museum
  • Introductory track tutorial: Jennifer Van Mullekom, "Data Visualization," 7:00 - 8:15, Assembly hall
  • Advanced track tutorial: Cheryl Danner, "Focal-loss-based Deep Learning for Object Detection," 7:00 - 8:15, 2nd floor board room
k-NN (Classification/Regression)

• Model
  Training set $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)})$
• Cost function
  None
• Learning
  Do nothing
• Inference
  $\hat{y} = h(x_{\text{test}}) = y^{(k)}$, where $k = \arg\min_i D(x_{\text{test}}, x^{(i)})$
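The inference rule above can be sketched directly; this is a minimal 1-NN illustration (the function name `knn_predict` and the Euclidean choice for $D$ are mine, not from the slides):

```python
import numpy as np

def knn_predict(X_train, y_train, x_test):
    """1-NN: return the label of the training point closest to x_test."""
    # D(x_test, x^(i)): Euclidean distance to every training point
    dists = np.linalg.norm(X_train - x_test, axis=1)
    k = np.argmin(dists)          # k = argmin_i D(x_test, x^(i))
    return y_train[k]             # y_hat = y^(k)

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_train = np.array([0, 0, 1])
print(knn_predict(X_train, y_train, np.array([4.5, 5.2])))  # nearest point is [5, 5]
```

As the slide says, all the work happens at inference time; "learning" stores the data and does nothing else.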
Linear regression (Regression)

• Model
  $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^\top x$
• Cost function
  $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
• Learning
  1) Gradient descent: Repeat { $\theta_j \leftarrow \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ }
  2) Solving the normal equation: $\theta = (X^\top X)^{-1} X^\top y$
• Inference
  $\hat{y} = h_\theta(x_{\text{test}}) = \theta^\top x_{\text{test}}$
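Both learning routes above can be sketched in a few lines; this is an illustrative toy (function names and the step size/iteration count are mine), assuming $X$ already contains a bias column:

```python
import numpy as np

def fit_normal_equation(X, y):
    # theta = (X^T X)^{-1} X^T y  (solve instead of explicit inverse)
    return np.linalg.solve(X.T @ X, X.T @ y)

def fit_gradient_descent(X, y, alpha=0.1, iters=5000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = (X @ theta - y) @ X / m   # (1/m) sum_i (h(x^(i)) - y^(i)) x_j^(i)
        theta -= alpha * grad
    return theta

# Noise-free toy data: y = 1 + 2x
x = np.linspace(0, 1, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x
theta_ne = fit_normal_equation(X, y)
theta_gd = fit_gradient_descent(X, y)
print(theta_ne, theta_gd)   # both should recover [1, 2]
```

On this well-conditioned toy problem the two routes agree; in general the normal equation is exact but $O(n^3)$, while gradient descent scales to large $n$.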
Naïve Bayes (Classification)

• Model
  $h_\theta(x) = P(Y \mid X_1, X_2, \cdots, X_n) \propto P(Y) \prod_i P(X_i \mid Y)$
• Cost function
  Maximum likelihood estimation: $J(\theta) = -\log P(\text{Data} \mid \theta)$
  Maximum a posteriori estimation: $J(\theta) = -\log P(\text{Data} \mid \theta) P(\theta)$
• Learning
  $\pi_k = P(Y = y_k)$
  (Discrete $X_i$) $\theta_{ijk} = P(X_i = x_{ij} \mid Y = y_k)$
  (Continuous $X_i$) mean $\mu_{ik}$, variance $\sigma_{ik}^2$: $P(X_i \mid Y = y_k) = \mathcal{N}(X_i \mid \mu_{ik}, \sigma_{ik}^2)$
• Inference
  $y \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i^{\text{test}} \mid Y = y_k)$
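The continuous-feature (Gaussian) case above can be sketched as follows; a toy illustration (function names are mine), working in log space to avoid underflow:

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate priors pi_k and per-class feature means/variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == k) for k in classes])   # pi_k = P(Y = y_k)
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    vars_ = np.array([X[y == k].var(axis=0) for k in classes])
    return classes, priors, means, vars_

def predict_gaussian_nb(x, classes, priors, means, vars_):
    # log P(Y = y_k) + sum_i log N(x_i | mu_ik, sigma_ik^2)
    log_post = np.log(priors) - 0.5 * np.sum(
        np.log(2 * np.pi * vars_) + (x - means) ** 2 / vars_, axis=1)
    return classes[np.argmax(log_post)]

X = np.array([[1.0], [1.2], [0.8], [5.0], [5.2], [4.8]])
y = np.array([0, 0, 0, 1, 1, 1])
params = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(np.array([4.9]), *params))
```

Learning here is just counting and moment estimation per class, which is why Naïve Bayes trains in a single pass over the data.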
Logistic regression (Classification)

• Model
  $h_\theta(x) = P(Y = 1 \mid X_1, X_2, \cdots, X_n) = \frac{1}{1 + e^{-\theta^\top x}}$
• Cost function
  $J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \text{Cost}(h_\theta(x^{(i)}), y^{(i)})$, where
  $\text{Cost}(h_\theta(x), y) = \begin{cases} -\log h_\theta(x) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
• Learning
  Gradient descent: Repeat { $\theta_j \leftarrow \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ }
• Inference
  $\hat{y} = h_\theta(x_{\text{test}}) = \frac{1}{1 + e^{-\theta^\top x_{\text{test}}}}$
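The update above has the same form as linear regression but with a sigmoid hypothesis; a minimal sketch on a separable 1-D toy set (function names, learning rate, and iteration count are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.5, iters=2000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X @ theta)                  # h_theta(x) = 1 / (1 + e^{-theta^T x})
        theta -= alpha * (h - y) @ X / m        # same update form as linear regression
    return theta

X = np.array([[1, 0.0], [1, 1.0], [1, 3.0], [1, 4.0]])  # bias column + one feature
y = np.array([0, 0, 1, 1])
theta = fit_logistic(X, y)
p = sigmoid(X @ theta)
print(np.round(p).astype(int))   # thresholded predictions on the training set
```

The identical update rule is not a coincidence: both are gradient steps on a generalized linear model's negative log-likelihood, differing only in $h_\theta$.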
Logistic Regression

• Hypothesis representation
  $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$
• Cost function
  $\text{Cost}(h_\theta(x), y) = \begin{cases} -\log h_\theta(x) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
• Logistic regression with gradient descent
  $\theta_j \leftarrow \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
• Regularization
• Multi-class classification
How about MAP?

• Maximum conditional likelihood estimate (MCLE)
  $\theta_{\text{MCLE}} = \arg\max_\theta \prod_{i=1}^{m} P_\theta(y^{(i)} \mid x^{(i)})$
• Maximum conditional a posteriori estimate (MCAP)
  $\theta_{\text{MCAP}} = \arg\max_\theta \prod_{i=1}^{m} P_\theta(y^{(i)} \mid x^{(i)}) \, P(\theta)$
Prior $P(\theta)$

• Common choice of $P(\theta)$:
  • Normal distribution, zero mean, identity covariance
  • "Pushes" parameters toward zero
• Corresponds to regularization
  • Helps avoid very large weights and overfitting

Slide credit: Tom Mitchell
MLE vs. MAP

• Maximum conditional likelihood estimate (MCLE)
  $\theta_j \leftarrow \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
• Maximum conditional a posteriori estimate (MCAP)
  $\theta_j \leftarrow \theta_j - \alpha \lambda \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
Logistic Regression

• Hypothesis representation
• Cost function
• Logistic regression with gradient descent
• Regularization
• Multi-class classification
Multi-class classification

• Email foldering/tagging: Work, Friends, Family, Hobby
• Medical diagnosis: Not ill, Cold, Flu
• Weather: Sunny, Cloudy, Rain, Snow

Slide credit: Andrew Ng
Binary classification vs. multiclass classification

(Scatter plots in the $x_1$-$x_2$ feature space: two classes on the left, three classes on the right)
One-vs-all (one-vs-rest)

$h_\theta^{(i)}(x) = P(y = i \mid x; \theta) \quad (i = 1, 2, 3)$

(Figure: the three-class data set in the $x_1$-$x_2$ plane is split into three binary problems, one per class, giving classifiers $h_\theta^{(1)}(x)$, $h_\theta^{(2)}(x)$, $h_\theta^{(3)}(x)$)

Slide credit: Andrew Ng
One-vs-all

• Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$
• Given a new input $x$, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$

Slide credit: Andrew Ng
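The two bullets above can be sketched directly: train one binary logistic classifier per class, then take the argmax at prediction time. A toy illustration (helper names, data, and hyperparameters are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.5, iters=2000):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= alpha * (sigmoid(X @ theta) - y) @ X / X.shape[0]
    return theta

def one_vs_all(X, y, classes):
    # one classifier h^(i) per class: "class i" vs. "rest"
    return np.array([fit_logistic(X, (y == c).astype(float)) for c in classes])

def predict_ova(Theta, x):
    # pick the class whose classifier reports the highest probability
    return int(np.argmax(sigmoid(Theta @ x)))

# Three well-separated clusters; first column is the bias feature.
X = np.array([[1, 0.0, 0.0], [1, 1.0, 0.0],
              [1, 5.0, 0.0], [1, 6.0, 0.0],
              [1, 0.0, 5.0], [1, 0.0, 6.0]])
y = np.array([0, 0, 1, 1, 2, 2])
Theta = one_vs_all(X, y, classes=[0, 1, 2])
print(predict_ova(Theta, np.array([1, 5.5, 0.2])))
```

For $K$ classes this trains $K$ classifiers; comparing raw sigmoid outputs works because each one estimates $P(y = i \mid x)$ on the same scale.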
Generative approach (Ex: Naïve Bayes)

Estimate $P(Y)$ and $P(X \mid Y)$
Prediction: $\hat{y} = \arg\max_y P(Y = y) \, P(X = x \mid Y = y)$

Discriminative approach (Ex: Logistic regression)

Estimate $P(Y \mid X)$ directly
(Or a discriminant function: e.g., SVM)
Prediction: $\hat{y} = \arg\max_y P(Y = y \mid X = x)$
Further readings

• Tom M. Mitchell, "Generative and discriminative classifiers: Naïve Bayes and Logistic Regression," http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
• Andrew Ng and Michael Jordan, "On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes," http://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes.pdf
Regularization

• Overfitting
• Cost function
• Regularized linear regression
• Regularized logistic regression
Example: Linear regression

(Plots of housing price, $ in 1000's, vs. size in feet^2)

• $h_\theta(x) = \theta_0 + \theta_1 x$ (Underfitting)
• $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$ (Just right)
• $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 + \cdots$ (Overfitting)

Slide credit: Andrew Ng
Overfitting

• If we have too many features (i.e., a complex model), the learned hypothesis may fit the training set very well,

$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \approx 0$

but fail to generalize to new examples (e.g., predict prices for new houses).

Slide credit: Andrew Ng
Example: Linear regression

(Plots of housing price, $ in 1000's, vs. size in feet^2)

• $h_\theta(x) = \theta_0 + \theta_1 x$ (Underfitting: high bias)
• $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$ (Just right)
• $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 + \cdots$ (Overfitting: high variance)

Slide credit: Andrew Ng
Bias-Variance Tradeoff

• Bias: difference between what you expect to learn and the truth
  • Measures how well you expect to represent the true solution
  • Decreases with more complex models
• Variance: difference between what you expect to learn and what you learn from a particular dataset
  • Measures how sensitive the learner is to a specific dataset
  • Increases with more complex models
![Page 24: Regularization - Virginia Techjbhuang/teaching/ECE5424-CS5824/sp19/...ย ยท Regularization. โขKeep all the features, but reduce magnitude/values of parameters ๐ . โขWorks well](https://reader034.fdocuments.in/reader034/viewer/2022052000/6012dfffc0bf144fc62e63ef/html5/thumbnails/24.jpg)
Low variance High variance
Low bias
High bias
Bias-variance decomposition

• Training set $\{(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)\}$
• $y = f(x) + \varepsilon$
• We want $\hat{f}(x)$ that minimizes $E\left[ (y - \hat{f}(x))^2 \right]$

$E\left[ (y - \hat{f}(x))^2 \right] = \text{Bias}\left[ \hat{f}(x) \right]^2 + \text{Var}\left[ \hat{f}(x) \right] + \sigma^2$

$\text{Bias}\left[ \hat{f}(x) \right] = E\left[ \hat{f}(x) \right] - f(x)$

$\text{Var}\left[ \hat{f}(x) \right] = E\left[ \hat{f}(x)^2 \right] - E\left[ \hat{f}(x) \right]^2$

https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
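The decomposition above can be checked numerically: repeatedly draw a training set, fit a deliberately simple (biased) model, and compare bias² + variance + σ² against the measured squared error at one query point. A simulation sketch (the target function, noise level, and model degree are my choices):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)   # true function
sigma = 0.3                           # noise std: y = f(x) + eps
x0 = 0.2                              # query point for the decomposition

preds = []
for _ in range(2000):
    # fresh training set each trial
    x = rng.uniform(0, 1, 30)
    y = f(x) + rng.normal(0, sigma, 30)
    # degree-1 polynomial: too simple for a sinusoid, hence biased
    a, b = np.polyfit(x, y, 1)
    preds.append(a * x0 + b)

preds = np.array(preds)
bias2 = (preds.mean() - f(x0)) ** 2
var = preds.var()
# Monte Carlo estimate of E[(y - f_hat(x0))^2] with fresh noise draws
mse = np.mean((f(x0) + rng.normal(0, sigma, preds.size) - preds) ** 2)
print(bias2 + var + sigma**2, mse)   # the two quantities should be close
```

Swapping the degree-1 fit for a high-degree polynomial moves error mass from the bias² term into the variance term, which is exactly the tradeoff the previous slides describe.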
Overfitting

(Decision boundaries in the tumor size vs. age plane)

• $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$ (Underfitting)
• $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2)$
• $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2 + \theta_6 x_1^3 x_2 + \theta_7 x_1 x_2^3 + \cdots)$ (Overfitting)

Slide credit: Andrew Ng
Addressing overfitting

• $x_1$ = size of house
• $x_2$ = no. of bedrooms
• $x_3$ = no. of floors
• $x_4$ = age of house
• $x_5$ = average income in neighborhood
• $x_6$ = kitchen size
• ⋮
• $x_{100}$

(Plot of housing price, $ in 1000's, vs. size in feet^2)

Slide credit: Andrew Ng
Addressing overfitting

• 1. Reduce the number of features.
  • Manually select which features to keep.
  • Model selection algorithm (later in course).
• 2. Regularization.
  • Keep all the features, but reduce the magnitude/values of the parameters $\theta_j$.
  • Works well when we have a lot of features, each of which contributes a bit to predicting $y$.

Slide credit: Andrew Ng
Overfitting Thriller
โข https://www.youtube.com/watch?v=DQWI1kvmwRg
Regularization

• Overfitting
• Cost function
• Regularized linear regression
• Regularized logistic regression
Intuition

• Suppose we penalize and make $\theta_3$, $\theta_4$ really small:

$\min_\theta \; \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + 1000 \, \theta_3^2 + 1000 \, \theta_4^2$

• $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$ vs. $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$

(Plots of housing price, $ in 1000's, vs. size in feet^2: with $\theta_3, \theta_4 \approx 0$, the quartic fit behaves like the quadratic one)

Slide credit: Andrew Ng
Regularization

• Small values for parameters $\theta_1, \theta_2, \cdots, \theta_n$
  • "Simpler" hypothesis
  • Less prone to overfitting
• Housing:
  • Features: $x_1, x_2, \cdots, x_{100}$
  • Parameters: $\theta_0, \theta_1, \theta_2, \cdots, \theta_{100}$

$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$

Slide credit: Andrew Ng
Regularization

$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$

$\min_\theta J(\theta)$

$\lambda$: Regularization parameter

(Plot of housing price, $ in 1000's, vs. size in feet^2)

Slide credit: Andrew Ng
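The regularized cost can be computed in one line per term; a minimal sketch (the function name is mine), following the convention that $\theta_0$ is excluded from the penalty:

```python
import numpy as np

def ridge_cost(theta, X, y, lam):
    """J(theta) = (1/2m) * [ sum_i (h(x^(i)) - y^(i))^2 + lam * sum_{j>=1} theta_j^2 ]."""
    m = X.shape[0]
    residual = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)   # theta_0 is not penalized
    return (residual @ residual + penalty) / (2 * m)

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 1.0, 2.0])
theta = np.array([0.0, 1.0])                 # fits this data exactly
print(ridge_cost(theta, X, y, lam=0.0))      # 0.0: no error, no penalty
print(ridge_cost(theta, X, y, lam=6.0))      # 6 * 1^2 / (2 * 3) = 1.0
```

Note how a perfect fit still pays a nonzero cost once $\lambda > 0$: the penalty trades training error against parameter magnitude.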
Question

$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$

What if $\lambda$ is set to an extremely large value (say $\lambda = 10^{10}$)?

1. Algorithm works fine; setting $\lambda$ to be very large can't hurt it
2. Algorithm fails to eliminate overfitting
3. Algorithm results in underfitting (fails to fit even the training data well)
4. Gradient descent will fail to converge

Slide credit: Andrew Ng
Question

$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$

What if $\lambda$ is set to an extremely large value (say $\lambda = 10^{10}$)? All of $\theta_1, \cdots, \theta_n$ are driven to $\approx 0$, leaving $h_\theta(x) \approx \theta_0$: a flat line that underfits.

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^\top x$

(Plot of housing price, $ in 1000's, vs. size in feet^2)

Slide credit: Andrew Ng
Regularization

• Overfitting
• Cost function
• Regularized linear regression
• Regularized logistic regression
Regularized linear regression

$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$

$\min_\theta J(\theta)$

$n$: Number of features; $\theta_0$ is not penalized

Slide credit: Andrew Ng
Gradient descent (Previously)

Repeat {
  $\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$   $(j = 0)$
  $\theta_j \leftarrow \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$   $(j = 1, 2, 3, \cdots, n)$
}

Slide credit: Andrew Ng
Gradient descent (Regularized)

Repeat {
  $\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
  $\theta_j \leftarrow \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]$
}

Equivalently:
$\theta_j \leftarrow \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

Slide credit: Andrew Ng
Comparison

Regularized linear regression:

$\theta_j \leftarrow \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

Un-regularized linear regression:

$\theta_j \leftarrow \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

$1 - \alpha \frac{\lambda}{m} < 1$: Weight decay
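The weight-decay form of the update can be implemented literally: scale each non-bias parameter by $1 - \alpha\lambda/m$, then take an ordinary gradient step. A toy sketch (function name, data, and hyperparameters are mine):

```python
import numpy as np

def ridge_gd_step(theta, X, y, alpha, lam):
    """One step: theta_j <- theta_j * (1 - alpha*lam/m) - alpha * (1/m) sum (h - y) x_j.
    The bias theta_0 is excluded from the decay factor."""
    m = X.shape[0]
    grad = (X @ theta - y) @ X / m
    decay = np.ones_like(theta)
    decay[1:] -= alpha * lam / m          # weight-decay factor 1 - alpha*lam/m < 1
    return theta * decay - alpha * grad

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])             # exactly y = 1 + 2x
theta = np.zeros(2)
for _ in range(5000):
    theta = ridge_gd_step(theta, X, y, alpha=0.1, lam=1.0)
print(theta)   # theta_1 shrunk below the unregularized slope of 2
```

On this data the unregularized fit is $[1, 2]$; with $\lambda = 1$ the iteration converges to $[5/3, 4/3]$ instead, showing the slope pulled toward zero while the bias compensates.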
Normal equation

• $X = \begin{bmatrix} (x^{(1)})^\top \\ (x^{(2)})^\top \\ \vdots \\ (x^{(m)})^\top \end{bmatrix} \in \mathbb{R}^{m \times (n+1)}$, $\quad y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \in \mathbb{R}^m$

• $\min_\theta J(\theta)$

• $\theta = \left( X^\top X + \lambda \begin{bmatrix} 0 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} \right)^{-1} X^\top y$, where the added matrix is $(n+1) \times (n+1)$

Slide credit: Andrew Ng
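The regularized normal equation above translates directly to code; a minimal sketch (the function name is mine), using a diagonal matrix with a zero in the top-left so the bias is not penalized:

```python
import numpy as np

def ridge_normal_equation(X, y, lam):
    """theta = (X^T X + lam * D)^{-1} X^T y, with D = diag(0, 1, ..., 1)."""
    n = X.shape[1]
    D = np.eye(n)
    D[0, 0] = 0.0                          # do not penalize the bias theta_0
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])              # exactly y = 1 + 2x
print(ridge_normal_equation(X, y, lam=0.0))   # exact fit [1, 2]
print(ridge_normal_equation(X, y, lam=3.0))   # slope shrunk below 2
```

A side benefit of the $\lambda D$ term: for $\lambda > 0$ the matrix $X^\top X + \lambda D$ is invertible even when $X^\top X$ alone is singular (e.g., more features than examples).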
Regularization

• Overfitting
• Cost function
• Regularized linear regression
• Regularized logistic regression
Regularized logistic regression

• Cost function:

$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log \left( 1 - h_\theta(x^{(i)}) \right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$

(Decision boundary in the tumor size vs. age plane)

$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2 + \theta_6 x_1^3 x_2 + \theta_7 x_1 x_2^3 + \cdots)$

Slide credit: Andrew Ng
Gradient descent (Regularized)

Repeat {
  $\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
  $\theta_j \leftarrow \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]$
}

Same update rule as regularized linear regression, but now with $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$; the bracketed term is $\frac{\partial}{\partial \theta_j} J(\theta)$.

Slide credit: Andrew Ng
$\ell_1$: Lasso regularization

$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} |\theta_j|$

LASSO: Least Absolute Shrinkage and Selection Operator
Single predictor: Soft Thresholding

• $\text{minimize}_\theta \; \frac{1}{2m} \sum_{i=1}^{m} \left( x^{(i)} \theta - y^{(i)} \right)^2 + \lambda |\theta|$

Assuming the predictor is standardized so that $\frac{1}{m} \langle x, x \rangle = 1$, the solution is:

$\theta = \begin{cases} \frac{1}{m} \langle x, y \rangle - \lambda & \text{if } \frac{1}{m} \langle x, y \rangle > \lambda \\ 0 & \text{if } \frac{1}{m} |\langle x, y \rangle| \leq \lambda \\ \frac{1}{m} \langle x, y \rangle + \lambda & \text{if } \frac{1}{m} \langle x, y \rangle < -\lambda \end{cases}$

$\theta = S_\lambda \left( \frac{1}{m} \langle x, y \rangle \right)$

Soft thresholding operator: $S_\lambda(x) = \operatorname{sign}(x) \left( |x| - \lambda \right)_+$
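The soft thresholding operator $S_\lambda$ is a one-liner; a sketch (the function name is mine) showing its two effects, zeroing small values and shrinking large ones by exactly $\lambda$:

```python
import numpy as np

def soft_threshold(x, lam):
    """S_lambda(x) = sign(x) * max(|x| - lambda, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

print(soft_threshold(np.array([-2.0, -0.3, 0.0, 0.4, 3.0]), 0.5))
# values with |x| <= 0.5 are set to 0; the rest move 0.5 toward zero
```

This hard zeroing of small coefficients is what makes the lasso a *selection* operator, unlike the ridge penalty, which shrinks but never exactly zeroes.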
Multiple predictors: Cyclic Coordinate Descent

• $\text{minimize}_{\theta_j} \; \frac{1}{2m} \sum_{i=1}^{m} \left( x_j^{(i)} \theta_j + \sum_{k \neq j} x_k^{(i)} \theta_k - y^{(i)} \right)^2 + \lambda \sum_{k \neq j} |\theta_k| + \lambda |\theta_j|$

For each $j$, update $\theta_j$ with

$\text{minimize}_{\theta_j} \; \frac{1}{2m} \sum_{i=1}^{m} \left( x_j^{(i)} \theta_j - r_j^{(i)} \right)^2 + \lambda |\theta_j|$

where $r_j^{(i)} = y^{(i)} - \sum_{k \neq j} x_k^{(i)} \theta_k$ is the partial residual.
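Putting the two previous slides together gives the full lasso solver: cycle over coordinates, form the partial residual, and soft-threshold the single-predictor solution. A sketch (function name, data, and the number of sweeps are mine), assuming columns are standardized so $\frac{1}{m}\langle x_j, x_j \rangle = 1$:

```python
import numpy as np

def lasso_cd(X, y, lam, sweeps=200):
    """Cyclic coordinate descent for (1/2m)||X theta - y||^2 + lam * ||theta||_1.
    Assumes each column satisfies (1/m) <x_j, x_j> = 1."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(sweeps):
        for j in range(n):
            # partial residual: remove every predictor except j
            r = y - X @ theta + X[:, j] * theta[j]
            rho = X[:, j] @ r / m                  # (1/m) <x_j, r_j>
            theta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0)  # soft threshold
    return theta

rng = np.random.default_rng(0)
m = 200
X = rng.normal(size=(m, 5))
X /= np.sqrt((X ** 2).mean(axis=0))        # enforce (1/m) <x_j, x_j> = 1
true_theta = np.array([3.0, -2.0, 0.0, 0.0, 0.0])
y = X @ true_theta + 0.1 * rng.normal(size=m)
theta = lasso_cd(X, y, lam=0.2)
print(np.round(theta, 2))   # the three irrelevant coefficients are shrunk toward 0
```

Each coordinate update is the single-predictor problem from the previous slide, which is why the whole algorithm needs nothing beyond the soft thresholding operator.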
L1 and L2 balls
Image credit: https://web.stanford.edu/~hastie/StatLearnSparsity_files/SLS.pdf
Terminology

| Regularization function | Name | Solver |
|---|---|---|
| $\lVert\theta\rVert_2^2 = \sum_{j=1}^{n} \theta_j^2$ | Tikhonov regularization / Ridge regression | Closed form |
| $\lVert\theta\rVert_1 = \sum_{j=1}^{n} \lvert\theta_j\rvert$ | LASSO regression | Proximal gradient descent, least angle regression |
| $\alpha \lVert\theta\rVert_1 + (1 - \alpha) \lVert\theta\rVert_2^2$ | Elastic net regularization | Proximal gradient descent |
Things to remember

• Overfitting
• Cost function
• Regularized linear regression
• Regularized logistic regression