
Lecture 5: MACHINE LEARNING

Bruce E. Hansen

Summer School in Economics and Econometrics, University of Crete, July 22-26, 2019


Learning Machines: Daleks?


Everyone ready!


Today’s Schedule

Principal Component Analysis

Ridge Regression

Lasso

Regression Trees

Bagging

Random Forests

Ensembling


References

Hastie, Tibshirani, and Friedman (2008) The Elements of Statistical Learning: Data Mining, Inference, and Prediction
- Today's lecture is extracted from this textbook

James, Witten, Hastie, and Tibshirani (2013) An Introduction to Statistical Learning: with Applications in R
- Undergraduate level

Efron and Hastie (2017) Computer Age Statistical Inference: Algorithms, Evidence, and Data Science
- Also introductory


PRINCIPAL COMPONENT ANALYSIS


Principal Component Analysis

Large number p of regressors xi

Do we need all p regressors? If many regressors are very similar to one another, and highly correlated, maybe the question is not "which" regressor is important, but instead "which linear combination" is important

For example, if you have a data set with a large number of test scores taken by students, and you are trying to predict some outcome (grades in future classes, success at a university, future wages), what might be the best predictor is an average, or linear combination, of the test scores

This leads to the concept of latent factors and factor analysis


Single Factor Model

xi = h fi + ui

- h and ui are p × 1
- fi is 1 × 1
- fi is the factor
- h holds the factor loadings
- ui holds the idiosyncratic errors

In this model, the factor fi affects all regressors xji

- But the magnitude is specific to the regressor and captured by h

Scaling
- The scale of h and fi is not separately identified, nor is their sign
- One option: h′h = 1
- Second option: var(fi) = 1


Testscore Example

xi is a set of test scores for an individual student

fi is the student’s latent ability

h is how ability affects the different test scores
- Some tests may be highly related to ability
- Some tests may be less related
- Some may be unrelated (random?)


Regressor Covariance Matrix

Σx = var(xi) = h h′ σf² + Ip σu²

Σx h = (h h′ σf² + Ip σu²) h = h (σf² + σu²)

- Thus h is an eigenvector of Σx with eigenvalue σf² + σu²
- All other eigenvectors have eigenvalue σu²
- Thus h is the eigenvector of Σx associated with the largest eigenvalue

Estimation
- Σx = sample covariance matrix of xi
- h = eigenvector associated with the largest eigenvalue of Σx
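A minimal R sketch of this estimation step on simulated data (the design and variable names below are illustrative, not the lecture's dataset): form the sample covariance matrix and take the eigenvector associated with its largest eigenvalue as the estimated loading vector.

    # Simulated single-factor data: p scores driven by one latent factor
    set.seed(1)
    n <- 500; p <- 7
    f <- rnorm(n)                                   # latent factor f_i
    h <- runif(p, 0.5, 1.5)                         # true loadings
    x <- outer(f, h) + matrix(rnorm(n * p), n, p)   # x_i = h f_i + u_i

    Sigma_hat <- cov(x)                  # sample covariance matrix of x_i
    eig <- eigen(Sigma_hat)              # eigen-decomposition
    h_hat <- eig$vectors[, 1]            # eigenvector with the largest eigenvalue
    round(eig$values, 2)                 # the first eigenvalue should stand out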


The Largest Principal Component Captures the Greatest Variation

Figure: Principal components of some input data points


Multiple Factor Model

xi = ∑_{m=1}^r hm fmi + ui = H′ fi + ui

Interpretation in the test score example:
- There is more than one form of "ability"
- i.e. literary and mathematical
- In labor economics, a distinction between cognitive and non-cognitive ability has been hypothesized, which has been very useful in explaining wage patterns (some jobs require one or the other, and some both, e.g. surgeon)

Loadings normalized to be orthonormal, factors uncorrelated

Σx = H′ Σf H + Ip σu²

The factor loadings hm are the eigenvectors of Σx associated with the largest r eigenvalues

Estimation: H = eigenvectors of Σx associated with the largest r eigenvalues


Identification of the number of factors

A difficult practical problem

Standard practice is to examine the eigenvalues of Σx and look for breaks

The theory suggests there should be r "large" eigenvalues and the remainder "small"
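Continuing the simulated-data sketch above (prcomp is base R; the eigenvalue proportions and scree plot are an informal diagnostic, not a formal test):

    pc <- prcomp(x, scale. = TRUE)               # principal components of the regressors
    ev <- pc$sdev^2                              # eigenvalues of the correlation matrix
    round(ev / sum(ev), 2)                       # proportion of variance per component
    plot(ev, type = "b", ylab = "Eigenvalue")    # scree plot: look for a break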


Illustration

Data from Duflo, Dupas and Kremer (2011, AER)

Observations are test scores for first grade students in Kenya

The authors used the total test score, but the data set includes scores on subsections of the test

- Four literacy and three mathematics

How many factors explain performance?


    Component   Eigenvalue   Proportion
    1           4.02         0.57
    2           1.04         0.15
    3           0.57         0.08
    4           0.52         0.08
    5           0.37         0.05
    6           0.29         0.04
    7           0.19         0.03

                    First Factor   Second Factor
    words           0.41           −0.32
    sentences       0.32           −0.49
    letters         0.40           −0.13
    spelling        0.43           −0.28
    addition        0.38            0.41
    subtraction     0.35            0.52
    multiplication  0.33            0.36


Implications

There appear to be two large eigenvalues relative to the others

The first eigenvector has similar loadings for all seven subjects
- A general ability/achievement factor

The second factor loads negatively on all four literacy subjects, and positively on all math subjects

- A "math vs literacy" factor!

Appears in first grade exams!


School Factors


Estimation of Factors

Factor Model
- xi = H′ fi + ui
- If you knew H, an estimator of fi would be

  f̂i = H′ xi = fi + vi,   vi = H′ ui

- The error is mean zero, so f̂i is unbiased for fi
- With estimated loadings: f̂i = Ĥ′ xi
- As p, n → ∞, f̂i →p fi


Factor-Augmented Regression

yi = fi′ β + ei

xi = H′ fi + ui

Estimated in multiple steps
- Estimate the loadings H from the covariance matrix of xi
- Estimate the factors fi
- Estimate the coefficient β by least squares of yi on the estimated factors (a sketch follows below)

Generated Regressors
- Problem diminishes as n, p → ∞
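A minimal R sketch of these steps, continuing the simulated data above (the toy outcome and the choice r = 2 are illustrative; the generated-regressor issue is ignored here):

    # Factor-augmented regression: y on the first r principal components of x
    r <- 2
    pc <- prcomp(x, scale. = TRUE)
    f_hat <- pc$x[, 1:r]                                  # estimated factors (component scores)
    y <- drop(x %*% rep(0.3, ncol(x))) + rnorm(nrow(x))   # toy outcome for illustration
    fit <- lm(y ~ f_hat)                                  # least squares of y on estimated factors
    summary(fit)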


RIDGE REGRESSION


Linear Model

yi = xi′ β + ei

Assume all variables demeaned

xi is p× 1

Classical OLS: βols = (X′X)⁻¹ (X′y)

- Unstable if X′X is near singular
- Unstable if X′X is near collinear
- These problems are not typical when p is small, but are typical when p is large and variables are correlated
- Infeasible if p > n
  - Many current applications (high dimension regression)


Wisconsin Regression


Ridge Regression

Classical solution to near singularity

βridge = (X′X + λIp)⁻¹ (X′y)

λ is a tuning parameter

Classical motivation
- Solves multicollinearity
- Solves singularity
- Stabilizes estimator

Modern motivation
- Penalized least squares
- Regularized least squares


Penalized Least Squares

Penalized criterion:
- S2(β) = (y − Xβ)′(y − Xβ) + λ β′β

SSE plus an L2 penalty on the coefficients

βridge = argmin_β S2(β)

F.O.C.
- 0 = −2X′(y − Xβ) + 2λβ

Solution
- βridge = (X′X + λIp)⁻¹ (X′y)

Thus the ridge regression estimator is a penalized LS estimator

If the OLS estimator is “large”, the penalty pushes it towards zero
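A minimal R sketch of this closed-form solution under stated assumptions (data demeaned, λ fixed by the user; the function name ridge_fit is illustrative, and in practice one would use a package such as glmnet, discussed later):

    # Ridge regression by the closed-form formula
    ridge_fit <- function(X, y, lambda) {
      X <- scale(X, center = TRUE, scale = FALSE)              # demean regressors
      y <- y - mean(y)                                         # demean outcome
      solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)   # (X'X + lambda*I)^(-1) X'y
    }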


Dual Problem

The dual of a penalized minimization problem is a Lagrangian problem

Consider the problem
- min (y − Xβ)′(y − Xβ) subject to β′β ≤ τ

This minimizes the SSE over a ball around zero with radius √τ

Lagrangian: L(β, λ) = (y − Xβ)′(y − Xβ) + λ (β′β − τ)

F.O.C. for β

- 0 = −2X′(y − Xβ) + 2λβ
- The same as the penalized least squares problem
- They have the same solution

There is a mapping between λ and τ


Dual (Lagrangian) Constrained Minimization


Shrinkage Interpretation

Suppose X ′X = Ip

Then βridge = (1 + λ)⁻¹ βols

Shrinks OLS towards zero

Similar to Stein estimator

βridge is biased for β but has lower variance than least squares

MSE is not easy to characterize


Selection of Ridge Parameter

Mallows
- µridge = X (X′X + λIp)⁻¹ X′y is linear in y
- Penalty is 2σ² tr((X′X + λIp)⁻¹ X′X)

Cross-Validation

- ei(λ) = yi − xi′ βridge,−i(λ)
- CV(λ) = ∑_{i=1}^n ei(λ)²

Theory: Li (1986, Annals of Statistics)
- Mallows/CV selection asymptotically equivalent to the infeasible optimal λ


Interpretation via the SVD

The singular value decomposition of a matrix is X = UDV′ where U and V are orthonormal and D is diagonal with the singular values of X on the diagonal

Apply to the regressor matrix: X = UDV′

X′X = VDU′UDV′ = VD²V′

(X′X)⁻¹ = VD⁻²V′

X′y = VDU′y

βols = (X′X)⁻¹ X′y = VD⁻²V′VDU′y = VD⁻²DU′y = VD⁻¹U′y

µols = X βols = UDV′VD⁻¹U′y = UU′y = ∑_{j=1}^p uj uj′ y

The least squares estimator can be seen as a projection on the orthonormal basis uj


Interpretation via the SVD

Let Λ = λIp

X′X + Λ = VD²V′ + Λ = V(D² + Λ)V′

(X′X + Λ)⁻¹ = V(D² + Λ)⁻¹V′

βridge = (X′X + Λ)⁻¹ X′y = V(D² + Λ)⁻¹V′VDU′y = V(D² + Λ)⁻¹DU′y

µridge = X βridge = UDV′V(D² + Λ)⁻¹DU′y = UD(D² + Λ)⁻¹DU′y = ∑_{j=1}^p [dj² / (dj² + λ)] uj uj′ y

A shrunken projection

Each basis uj is shrunk according to dj² / (dj² + λ)

- Smaller dj² implies greater shrinkage
- Small singular values of X receive the greatest shrinkage
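A short R check of this representation under stated assumptions (demeaned X, a fixed λ, simulated data; all names illustrative): the ridge fitted values built from the shrinkage factors dj²/(dj² + λ) match the direct closed-form solution.

    # Verify the SVD representation of ridge fitted values
    set.seed(1)
    n <- 100; p <- 5; lambda <- 2
    X <- scale(matrix(rnorm(n * p), n, p), scale = FALSE)
    y <- rnorm(n)

    s <- svd(X)                                             # X = U D V'
    shrink <- s$d^2 / (s$d^2 + lambda)                      # d_j^2 / (d_j^2 + lambda)
    mu_svd <- s$u %*% (shrink * crossprod(s$u, y))          # sum_j shrink_j * u_j u_j' y
    mu_direct <- X %*% solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
    max(abs(mu_svd - mu_direct))                            # numerically zero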


Ridge Coefficients varying with Lambda


Features of Ridge Estimator

Well defined even if p > n

All p coefficients are estimated with non-zero estimates

Does not perform simultaneous selection

Shrinkage greatest on small singular values of regressors


Computation of Ridge Regression

R: package glmnet, glmnet(x,y,alpha=0)
- Selects λ by cross-validation

MATLAB: ridge(y,X,λ)


LASSO (Least Absolute Shrinkage and Selection Operator)


Lasso

Penalized criterion:
- S1(β) = (y − Xβ)′(y − Xβ) + λ ‖β‖₁
- ‖β‖₁ = ∑_{j=1}^p |βj|

SSE plus an L1 penalty on the coefficients

βlasso = argmin_β S1(β)

The minimizer has no closed-form solution

F.O.C. for βj

- 0 = −2Xj′(y − Xβ) + λ sgn(βj)


Dual Problem

min (y − Xβ)′(y − Xβ) subject to ‖β‖₁ ≤ τ

This minimizes the SSE over a constraint set which looks like a square on its edge (a cross-polytope)

Lagrangian
- L(β, λ) = (y − Xβ)′(y − Xβ) + λ (‖β‖₁ − τ)

F.O.C. for βj

- 0 = −2Xj′(y − Xβ) + λ sgn(βj)
- The same as the penalized problem
- The solution is identical


Lasso vs Ridge

From The Elements of Statistical Learning, Table 3.4: estimators of βj in the case of orthonormal columns of X. M and λ are constants chosen by the corresponding techniques; sign denotes the sign of its argument (±1), and x+ denotes the "positive part" of x.

    Estimator              Formula
    Best subset (size M)   βj · I(|βj| ≥ |β(M)|)
    Ridge                  βj / (1 + λ)
    Lasso                  sign(βj) (|βj| − λ)+

Figure 3.11 (ESL): Estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions |β1| + |β2| ≤ t and β1² + β2² ≤ t², respectively, while the red ellipses are the contours of the least squares error function.


Lasso solution tends to hit a corner

Since the constraint region has corners, the lasso tends to hit one

In contrast, the ridge minimizer tends to hit an interior point on the smooth constraint region

The corners represent parameters where some coefficients equal zero

Hence the lasso solution sets some coefficients to zero

However, if the constraint is relaxed enough then Lasso = OLS

- If τ ≥ ‖βols‖₁ then βlasso = βols


As Constraint Set Decreases, Lasso Estimates Shrink to Zero

Figure 3.10 (ESL): Profiles of lasso coefficients (for the prostate-cancer predictors lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45), as the tuning parameter t is varied. Coefficients are plotted versus s = t / ∑_{j=1}^p |βj|. A vertical line is drawn at s = 0.36, the value chosen by cross-validation. Compare Figure 3.8 on page 65; the lasso profiles hit zero, while those for ridge do not. The profiles are piece-wise linear, and so are computed only at the points displayed; see Section 3.4.4 for details.


Effect of changing Lasso Parameter

For λ = 0, Lasso = OLS

As λ increases, the estimates shrink together

At some point, one estimate hits zero. It remains zero

As λ increases further, the estimates continue to shrink at a new rate

One by one, the estimates hit and stick at zero

When λ is sufficiently large, all coefficients equal 0


Selection of Lasso Parameter

Most commonly by K-fold CV

Standard algorithms use CV by default or an option

Theoretical justification in development
- Chernozhukov, Chetverikov, and Liao, working paper


Nesting Selection, Lasso and Ridge

βq = argmin_β (y − Xβ)′(y − Xβ) + λ ‖β‖q^q

‖β‖q^q = ∑_{j=1}^p |βj|^q

(q = 0) variable subset selection
(q = 1) lasso
(q = 2) ridge

(Excerpt from The Elements of Statistical Learning, Section 3.4, starting mid-sentence:)

... region for ridge regression is the disk β1² + β2² ≤ t, while that for lasso is the diamond |β1| + |β2| ≤ t. Both methods find the first point where the elliptical contours hit the constraint region. Unlike the disk, the diamond has corners; if the solution occurs at a corner, then it has one parameter βj equal to zero. When p > 2, the diamond becomes a rhomboid, and has many corners, flat edges and faces; there are many more opportunities for the estimated parameters to be zero.

We can generalize ridge regression and the lasso, and view them as Bayes estimates. Consider the criterion

    β = argmin_β { ∑_{i=1}^N (yi − β0 − ∑_{j=1}^p xij βj)² + λ ∑_{j=1}^p |βj|^q }    (3.53)

for q ≥ 0. The contours of constant value of ∑_j |βj|^q are shown in Figure 3.12, for the case of two inputs.

Thinking of |βj|^q as the log-prior density for βj, these are also the equi-contours of the prior distribution of the parameters. The value q = 0 corresponds to variable subset selection, as the penalty simply counts the number of nonzero parameters; q = 1 corresponds to the lasso, while q = 2 to ridge regression. Notice that for q ≤ 1, the prior is not uniform in direction, but concentrates more mass in the coordinate directions. The prior corresponding to the q = 1 case is an independent double exponential (or Laplace) distribution for each input, with density (1/2τ) exp(−|β|/τ) and τ = 1/λ. The case q = 1 (lasso) is the smallest q such that the constraint region is convex; non-convex constraint regions make the optimization problem more difficult.

In this view, the lasso, ridge regression and best subset selection are Bayes estimates with different priors. Note, however, that they are derived as posterior modes, that is, maximizers of the posterior. It is more common to use the mean of the posterior as the Bayes estimate. Ridge regression is also the posterior mean, but the lasso and best subset selection are not.

Looking again at the criterion (3.53), we might try using other values of q besides 0, 1, or 2. Although one might consider estimating q from the data, our experience is that it is not worth the effort for the extra variance incurred. Values of q ∈ (1, 2) suggest a compromise between the lasso and ridge regression. Although this is the case, with q > 1, |βj|^q is differentiable at 0, and so does not share the ability of lasso (q = 1) for setting coefficients exactly to zero.

Figure 3.12 (ESL): Contours of constant value of ∑_j |βj|^q for q = 4, 2, 1, 0.5, 0.1.


Elastic Net —Compromise between Lasso and Ridge

βnet = argmin_β (y − Xβ)′(y − Xβ) + λ (α ‖β‖₂² + (1 − α) ‖β‖₁)

α = 0 is Lasso
α = 1 is Ridge
0 < α < 1 mixes Lasso and Ridge penalties

Figure 3.13 (ESL): Contours of constant value of ∑_j |βj|^q for q = 1.2 (left plot), and the elastic-net penalty ∑_j (α βj² + (1 − α)|βj|) for α = 0.2 (right plot). Although visually very similar, the elastic-net has sharp (non-differentiable) corners, while the q = 1.2 penalty does not.

(Excerpt from The Elements of Statistical Learning, Section 3.4, continued:) Partly for this reason as well as for computational tractability, Zou and Hastie (2005) introduced the elastic-net penalty

    λ ∑_{j=1}^p (α βj² + (1 − α)|βj|),    (3.54)

a different compromise between ridge and lasso. Figure 3.13 compares the Lq penalty with q = 1.2 and the elastic-net penalty with α = 0.2; it is hard to detect the difference by eye. The elastic-net selects variables like the lasso, and shrinks together the coefficients of correlated predictors like ridge. It also has considerable computational advantages over the Lq penalties. We discuss the elastic-net further in Section 18.4.

3.4.4 Least Angle Regression

Least angle regression (LAR) is a relative newcomer (Efron et al., 2004), and can be viewed as a kind of "democratic" version of forward stepwise regression (Section 3.3.2). As we will see, LAR is intimately connected with the lasso, and in fact provides an extremely efficient algorithm for computing the entire lasso path as in Figure 3.10.

Forward stepwise regression builds a model sequentially, adding one variable at a time. At each step, it identifies the best variable to include in the active set, and then updates the least squares fit to include all the active variables.

Least angle regression uses a similar strategy, but only enters "as much" of a predictor as it deserves. At the first step it identifies the variable most correlated with the response. Rather than fit this variable completely, LAR moves the coefficient of this variable continuously toward its least-squares value (causing its correlation with the evolving residual to decrease in absolute value). As soon as another variable "catches up" in terms of correlation with the residual, the process is paused. The second variable then joins the active set, and their coefficients are moved together in a way that keeps their correlations tied and decreasing. This process is continued ...


Minimum Distance Representation

When p < n

- (y − Xβ)′(y − Xβ) = ê′ê + (βols − β)′ X′X (βols − β)   (algebraic trick)

Thus

- βlasso = argmin_β (βols − β)′ X′X (βols − β) + λ ‖β‖₁

βlasso minimizes the weighted Euclidean distance to βols, with penalty


Thresholding Representation

Suppose p < n and X′X = Ip

Then βridge = (1/(1 + λ)) βols evenly shrinks OLS towards zero

Selection using a critical value c (e.g. c = 1.96² σ²)

- βtest,j = βols,j · 1(βols,j² ≥ c)
- This is called a "hard thresholding" rule


Thresholding Representation

Lasso criterion under X′X = Ip

- ∑_{j=1}^p ((βols,j − βj)² + λ |βj|)

F.O.C. is

- −2(βols,j − βj) + λ sgn(βj) = 0

Solution

- βlasso,j = βols,j − λ/2   if βols,j > λ/2
           = 0             if |βols,j| ≤ λ/2
           = βols,j + λ/2   if βols,j < −λ/2
- This is called a "soft thresholding" rule
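A minimal R sketch of this rule (the function name is illustrative):

    # Soft thresholding: the lasso solution when X'X = I_p
    soft_threshold <- function(b_ols, lambda) {
      sign(b_ols) * pmax(abs(b_ols) - lambda / 2, 0)
    }
    soft_threshold(c(-2, -0.3, 0.1, 1.5), lambda = 1)   # -1.5, 0, 0, 1.0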


Selection, Ridge Regression and the Lasso

(This slide repeats Table 3.4 and Figure 3.11 from The Elements of Statistical Learning, reproduced earlier on the "Lasso vs Ridge" slide: under orthonormal X, best subset hard-thresholds, ridge shrinks proportionally by 1/(1 + λ), and the lasso soft-thresholds.)


Scaling

The Lasso criterion (y − Xβ)′(y − Xβ) + λ ‖β‖₁ is not invariant to re-scaling the regressors

- The penalty λ ∑_{j=1}^p |βj| is identical for each coefficient
- If you rescale a regressor (e.g. change units of measurement) then the penalty has a completely different meaning
- Hence the scale matters
- In practice, it is common to rescale the regressors so that all are mean zero and have the same variance
  - Unless variables are already scaled similarly (e.g. interest rates)


Which Regressors

The Lasso criterion (y − Xβ)′(y − Xβ) + λ ‖β‖₁ is not invariant to linear transformations of the regressors

Suppose you have X1 and X2

OLS on (X1, X2) is the same as OLS on (X1 − X2, X2)

Lasso on (X1, X2) is different than Lasso on (X1 − X2, X2)

Orthogonality
- Much theoretical insight arises from the case of orthogonal regressors
- It may therefore be useful to start with transformed regressors which are near orthogonal
- e.g. differences between interest rates (spreads) rather than levels

Getting the right zeros
- Many theoretical results concern sparsity (more on this later)
- This occurs when the true regression has many 0 coefficients
- It is therefore useful to start with transformed regressors which are most likely to have many zero coefficients


Grouped Lasso

We can penalize groups of coefficients so that they are included/excluded as a group

Grouped Lasso criterion

- (y − ∑_{ℓ=1}^L Xℓ βℓ)′ (y − ∑_{ℓ=1}^L Xℓ βℓ) + λ ∑_{ℓ=1}^L √pℓ ‖βℓ‖₂
- pℓ = group size
- Note the penalty is ‖βℓ‖₂ = (∑_j βℓj²)^(1/2)


Statistical Properties

There are asymptotic results for the Lasso allowing for p >> n

The results rely on a sparsity assumption: the true regression has p0 non-zero coefficients, where p0 is fixed

- This assumption can be relaxed in some respects, but some form of sparsity lies at the core of current theory

Under regularity conditions, Lasso estimation identifies the true predictors with high probability

- Consistent model selection
- Similar to BIC selection

The non-zero coefficients, however, are not consistently estimated but are biased

Proposals to eliminate the bias
- Least squares after Lasso selection
- SCAD (smoothly clipped absolute deviation)
- Adaptive Lasso


Sparsity

Sparsity is all the fashion in this literature

It seems to be an assumption used to justify theory which can be proved, not an assumption based on reality

People talk about "imposing sparsity" as if the theorist can influence the world

The world is the way it is.

What does sparsity mean in practical econometrics?
- That a few coefficients are "big", the remainder zero or small
- In a series regression, only a few coefficients are non-zero, the remainder zero or small
- This does not make much sense
- More reasonable: all coefficients are non-zero, and all are small

This is a challenge for Lasso-type theory


Computation via LAR algorithm

Least Angle Regression (LAR)

A modification of forward stepwise regression
- Start with all coefficients equal to zero
- Find xj most correlated with y
- Increase the coefficient βj in the direction of its correlation with y
  - Take residuals along the way
  - Stop when some other xℓ has the same correlation with the residual as xj
- Increase (βj, βℓ) in their joint least squares direction, until some other xm has as much correlation with the residual
- Continue until all predictors are in the model

This algorithm gives the Lasso expansion path

Used to produce an ordering, this yields Least Angle Regression Selection (LARS)

- Alternative to stepwise regression


Least Angle Regression Solution Path

Figure 3.14 (ESL): Progression of the absolute correlations during each step of the LAR procedure, using a simulated data set with six predictors. The labels at the top of the plot indicate which variables enter the active set at each step. The step lengths are measured in units of L1 arc length.

Figure 3.15 (ESL): Left panel shows the LAR coefficient profiles on the simulated data, as a function of the L1 arc length. The right panel shows the Lasso profile. They are identical until the dark-blue coefficient crosses zero at an arc length of about 18.


LAR modification to make equal to Lasso

Modified LAR algorithm
- Start with all coefficients equal to zero
- Find xj most correlated with y
- Increase the coefficient βj in the direction of its correlation with y
  - Take residuals along the way
  - Stop when some other xℓ and xj have the same correlation with the residual
  - If a non-zero coefficient hits zero, drop it from the active set of variables and recompute the joint least squares direction
- Increase (βj, βℓ) in their joint least squares direction, until some other xm has as much correlation with the residual
- Continue until all predictors are in the model


Comparison of performance of methods

Figure 3.16 (ESL): Comparison of LAR and lasso with forward stepwise, forward stagewise (FS) and incremental forward stagewise (FS0) regression. The setup is the same as in Figure 3.6, except N = 100 here rather than 300. Here the slower FS regression ultimately outperforms forward stepwise. LAR and lasso show similar behavior to FS and FS0. Since the procedures take different numbers of steps (across simulation replicates and methods), the MSE E‖β̂(k) − β‖² is plotted as a function of the fraction of total L1 arc-length toward the least-squares fit.

(Excerpt from ESL, continued, starting mid-sentence:) ... adaptively fitted to the training data. This definition is motivated and discussed further in Sections 7.4-7.6. Now for a linear regression with k fixed predictors, it is easy to show that df(ŷ) = k. Likewise for ridge regression, this definition leads to the closed-form expression (3.50) on page 68: df(ŷ) = tr(Sλ). In both these cases, (3.60) is simple to evaluate because the fit ŷ = Hλy is linear in y. If we think about definition (3.60) in the context of a best subset selection of size k, it seems clear that df(ŷ) will be larger than k, and this can be verified by estimating Cov(ŷi, yi)/σ² directly by simulation. However there is no closed form method for estimating df(ŷ) for best subset selection. For LAR and lasso, something magical happens. These techniques are adaptive in a smoother way than best subset selection, and hence estimation of degrees of freedom is more tractable. Specifically it can be shown that after the kth step of the LAR procedure, the effective degrees of freedom of the fit vector is exactly k. Now for the lasso, the (modified) LAR procedure ...


Computation

Dual representation of Lasso and Elastic Net is a quadratic programming problem

Efficient when we have a fixed λ

- Numerically fast

The LARS algorithm provides the entire path as a function of the tuning parameter

- Useful for cross-validation


Computation

R (recommended)
- package glmnet
  - cv.glmnet(x,y)
  - Selects λ by cross-validation
- For ridge or elastic net
  - cv.glmnet(x,y,alpha=a)
  - Set a = 0 for ridge, a = 1 for Lasso, 0 < a < 1 for elastic net
- package lars
  - lars(x,y,type="lasso")
  - lars(x,y,type="lar")

MATLAB
- lasso(X,y)
- lasso(X,y,'CV',K)


Computation in R

library(glmnet)

mLasso <- cv.glmnet(X,y,family="gaussian",nfolds=200)

- beta <- coef(mLasso, s = mLasso$lambda.min)
- Useful to specify the number of folds (default is 10)
- More folds reduces instability, but takes longer
- If you do not specify "lambda.min" the package will use "lambda.1se", which is different from the minimizer

mRidge <- cv.glmnet(X, y, alpha=0, family="gaussian", nfolds=200)

mElastic <- cv.glmnet(X, y, alpha=1/2, family="gaussian", nfolds=200)


Illustration

cps wage regression using a subsample of Asian women (n = 1149)

Regressors:

- education (linear), and dummies for education equalling 12, 13, 14, 16, 18, and 20
- experience in powers from 1 to 9
- marriage dummies (6 of 7 categories), 3 region dummies, union dummy

Lasso, with λ selected by minimizing 200-fold CV

Selected regressors:
- Education dummies
- Experience powers 1, 2, 3, 5, 6
- All remaining dummies

Coefficient estimates: most shrunk about 10% relative to least squares


REGRESSION TREES


Regression Tree

Partition regressor space into rectangles
- Split based on whether a regressor is below or exceeds a threshold
- Split again
- Split again

Each split point is a node

Each subset is a branch

On each branch, fit a simple model (see the R sketch below)

- Often just the sample mean of yi
- Or linear regression
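A minimal R sketch using the rpart package (the simulated data and names are illustrative):

    # Fit a regression tree; each terminal node predicts the branch mean of y
    library(rpart)
    set.seed(1)
    n <- 500
    df <- data.frame(x1 = runif(n), x2 = runif(n))
    df$y <- ifelse(df$x1 < 0.5, 1, 3) + rnorm(n)               # step function in x1
    tree <- rpart(y ~ x1 + x2, data = df, method = "anova")    # recursive binary splits
    tree                                                       # prints the estimated split points (nodes)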


Figure 9.2 (ESL, Chapter 9): Partitions and CART. Top right panel shows a partition of a two-dimensional feature space by recursive binary splitting, as used in CART, applied to some fake data. Top left panel shows a general partition that cannot be obtained from recursive binary splitting. Bottom left panel shows the tree corresponding to the partition in the top right panel, and a perspective plot of the prediction surface appears in the bottom right panel.


Figure 9.5 (ESL, Chapter 9): The pruned tree for the spam example. The split variables are shown in blue on the branches, and the classification (spam vs email) is shown in every node. The numbers under the terminal nodes indicate misclassification rates on the test data.


Estimation of nodes

Regression: minimizing SSE

This is the same as for threshold models in econometrics
- Similar to structural change estimation

Potential split points equal n (at each observation point)

Estimate up to n regressions, one for each possible split point

Find the split point which minimizes the SSE

This is the least squares estimator of the node
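A minimal R sketch of this split-point search for a single regressor (illustrative only; tree software repeats it recursively over all regressors and branches). It reuses the simulated df from the rpart sketch above.

    # Find the split point on one regressor minimizing the SSE of the two branch means
    best_split <- function(x, y) {
      cand <- sort(unique(x))
      cand <- cand[-length(cand)]              # drop the largest value (empty right branch)
      sse <- sapply(cand, function(t) {
        left <- y[x <= t]; right <- y[x > t]
        sum((left - mean(left))^2) + sum((right - mean(right))^2)
      })
      cand[which.min(sse)]                     # least squares estimate of the node
    }
    best_split(df$x1, df$y)                    # close to the true threshold 0.5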


Tree Estimation

Given the nodes (the split points) you estimate the simple model (mean or linear regression) on each branch

The regression model is the tree structure plus the models for each branch

Prediction (estimation of the conditional mean at a point)
- Given the regressor, which branch are we on?
- Compute the conditional mean for that branch

Take, for example, a wage regression
- splits might be based on sex, race, region, education levels, experience levels, etc
- Each split is binary
- A branch is a set of characteristics. The estimate (typically) is the mean for this group


How Many Nodes?

First fit (grow) a large tree, based on a pre-specified maximum number of nodes

Then prune back by minimizing a cost criterion
- T = tree
- |T| = number of terminal nodes
- ŷi = fitted value (mean or regression fit within each branch)
- êi = yi − ŷi
- C = ∑_{i=1}^n êi² + α |T|
- Penalized sum of squared errors, AIC-like

Penalty term α selected by CV
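In the rpart package the complexity parameter cp plays the role of α (up to scaling), and the cross-validation is done internally; a short sketch continuing the tree fitted above (names illustrative):

    # Cost-complexity pruning with the CV-chosen complexity parameter
    printcp(tree)                                            # CV error for each value of cp
    best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
    pruned <- prune(tree, cp = best_cp)                      # prune the large tree back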


Comments on Trees

Flexible non-parametric approach

Typically used for prediction

Can be useful for decisions

Consider a doctor deciding on a treatment:
- What is your gender?
- Is your cholesterol above 200?
- Is your blood pressure above 130?
- Is your age above 60?
- Given this information, we recommend you take the BLUE pill


BAGGING (BOOTSTRAP AGGREGATION)


Bagging

Bootstrap averaging for an estimator m̂(x) of the conditional mean m(x)
- Example: Regression tree

Generate B random samples of size n by sampling with replacement from the data

- On each bootstrap sample, fit the estimator m̂b(x)

Average across the bootstrap samples (see the sketch below)
- m̂bag(x) = B⁻¹ ∑_{b=1}^B m̂b(x)
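A minimal R sketch of bagged regression trees, reusing the rpart example above (B and names are illustrative):

    # Bagging: average tree predictions across B bootstrap samples
    B <- 100
    preds <- replicate(B, {
      idx <- sample(nrow(df), replace = TRUE)                     # bootstrap sample of size n
      tb <- rpart(y ~ x1 + x2, data = df[idx, ], method = "anova")
      predict(tb, newdata = df)                                   # m_b(x) at the original points
    })
    m_bag <- rowMeans(preds)                                      # m_bag(x) = B^(-1) sum_b m_b(x)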


Bagging

If m̂(x) is linear then (with large B) m̂bag(x) ≈ m̂(x)

If m̂(x) is nonlinear they are different

- Bagging reduces variance (but adds bias)
- Averaging smooths the estimator, and smoothing reduces variance

Use m̂bag(x) for prediction


Bagging and Bias

If m̂(x) is biased then bagging increases the bias

A simple bootstrap estimator of the bias is

- bias(m̂) = B⁻¹ ∑_{b=1}^B m̂b(x) − m̂(x) = m̂bag(x) − m̂(x)

Thus a bias-corrected estimator of m(x) is

- m̂bc(x) = m̂(x) − bias(m̂) = m̂(x) − (m̂bag(x) − m̂(x)) = 2m̂(x) − m̂bag(x)
- Not m̂bag(x)!

Bagging does not reduce bias, but accentuates bias

Bagging is best applied to low-bias estimators

Goal is to reduce variance


A little intuition

θ̂ ∼ N(θ, Ip)

Thresholded estimator: θ̃ = θ̂ · 1(θ̂′θ̂ > c)

Bootstrap

- θ̂* ∼ N(θ̂, Ip)
- θ̃* = θ̂* · 1(θ̂*′ θ̂* > c)

Bagging estimator

- θ̃bag = B⁻¹ ∑_{b=1}^B θ̃* ≈ µ(θ̂), where µ(θ) = E(θ̃)

θ̃ is a non-smooth function of θ̂

µ(θ) is a smooth function of θ

θ̃bag ≈ µ(θ̂) is a smooth function of θ̂


When to use Bagging

Bagging is ideal for methods such as regression trees

Deep regression trees have low bias but high variance

Regression trees use hard thresholding rules

Bagging smooths the hard thresholding into smooth thresholding

Bagging reduces variance

Bagging is not useful for high-bias, smooth, or low-variance procedures


RANDOM FORESTS


Random Forests

Similar to bagging, but with an adjustment to reduce variance

When you do bagging, you are averaging identically distributed but correlated bootstrapped trees

The correlation means that the averaging does not reduce the variance as much as if the bootstrapped trees were uncorrelated

Random forests try to de-correlate the bootstrapped trees


Random Forest Algorithm for Regression

For b = 1, ..., B
- Draw a random sample of size n from the data set
- Grow a random forest tree Tb on the bootstrapped data, by recursively repeating the following steps for each terminal node of the tree until the minimum node size nmin is reached (recommended nmin = 5)
  - Select m variables at random from the p variables (recommended m = p/3)
  - Pick the best variable/split point among the m
  - Split the node into two daughter nodes

m̂(x) = B⁻¹ ∑_{b=1}^B Tb(x)
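A minimal R sketch using the randomForest package, reusing the simulated df from the tree example above (for regression the package defaults already follow these recommendations, roughly mtry = p/3 and nodesize = 5; the settings below just make them explicit):

    # Random forest for regression
    library(randomForest)
    rf <- randomForest(y ~ x1 + x2, data = df,
                       ntree = 500,       # number of bootstrap trees B
                       mtry = 1,          # m variables tried at each split (about p/3)
                       nodesize = 5)      # minimum terminal node size
    m_hat <- predict(rf, newdata = df)    # averaged tree predictions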


Why Random Forest?

The randomization over the variables means that the bootstrapped regression trees are less correlated than under standard bagging

Very popular

Numerical studies show that it works well in many applications


Out-of-Bag Samples

Random forests have an evaluation device similar to cross-validation, called the out-of-bag (OOB) sample

Recall that a random forest predictor is calculated by averaging over B bootstrap samples

The probability that a given observation i is in a given bootstrap sample is about 63%

For the OOB sample (a sketch follows below)
- For each observation i
  - Construct its random forest predictor by averaging only over the (approximately) 37% of bootstrap samples where observation i does not appear
  - Compute the OOB prediction error
- Take the sum of squared errors
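In the randomForest package these OOB predictions are returned automatically; a short sketch continuing the fit above:

    # Out-of-bag evaluation: each fitted value uses only trees where i was not drawn
    oob_pred <- rf$predicted                 # OOB predictions for each observation
    mean((df$y - oob_pred)^2)                # OOB estimate of the prediction error (MSE)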


ENSEMBLING


Ensembling Econometricians


Ensembling is Averaging

Suppose you have several estimators
- AIC-selected
- JMA model averaged
- Stein shrinkage
- Principal components regression
- Ridge
- Lasso
- Elastic Net
- Regression Tree
- Random Forest

What do you do?

Let’s look at our favorite models again


Model 1: Kendall Jenner


Model 2: Fabio


Model 3: Einstein


Ensembling is Averaging the Estimators

Weight Selection Methods:

Method 1: Elements of Statistical Learning recommends penalized regression (Ridge or Lasso penalty)

- Regress yi on predictions from the models, with penalty
- Regularization (penalty) is essential
- Unclear how to select λ

Method 2: Select weights by cross-validation

Considerable evidence indicates that ensembling (averaging) is better than using just one method
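A minimal R sketch of Method 1 under stated assumptions (toy data; the candidate models and the use of in-sample fitted values are illustrative, and in practice the stacked predictions should be computed out-of-sample or by cross-validation):

    # Ensemble weights by penalized regression of y on the candidate predictions
    library(glmnet)
    set.seed(1)
    n <- 200; x <- matrix(rnorm(n * 5), n, 5)
    y <- drop(x %*% c(1, 0.5, 0, 0, 0)) + rnorm(n)

    p_ols   <- fitted(lm(y ~ x))                                                # OLS predictions
    p_ridge <- predict(cv.glmnet(x, y, alpha = 0), newx = x, s = "lambda.min")  # ridge predictions
    p_lasso <- predict(cv.glmnet(x, y, alpha = 1), newx = x, s = "lambda.min")  # lasso predictions
    P <- cbind(p_ols, p_ridge, p_lasso)                                         # matrix of predictions

    ens <- cv.glmnet(P, y, alpha = 0, lower.limits = 0)    # ridge penalty, non-negative weights
    coef(ens, s = "lambda.min")                            # estimated ensemble weights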


What’s Next?Last time I taught here, my wife and I went to Gavdos


Gavdos


Assignment # 5

Use the cps dataset from before, but now use ALL observations

Create a large set of regressors that you believe are appropriate to potentially model wages

Estimate a regression for log(wage) using the following methods
- OLS
- Ridge Regression
- Lasso
- Elastic Net with α = 1/2

Report your coefficient estimates in a table

Comment on your findings


That’s It!
