
Lecture 5: MACHINE LEARNING

Bruce E. Hansen

Summer School in Economics and Econometrics, University of Crete, July 22-26, 2019


Learning Machines: Daleks?


Everyone ready!


Today’s Schedule

Principal Component Analysis

Ridge Regression

Lasso

Regression Trees

Bagging

Random Forests

Ensembling


References

Hastie, Tibshirani, and Friedman (2008) The Elements of Statistical Learning: Data Mining, Inference, and Prediction
- Today's lecture is extracted from this textbook

James, Witten, Hastie, and Tibshirani (2013) An Introduction to Statistical Learning: with Applications in R
- Undergraduate level

Efron and Hastie (2017) Computer Age Statistical Inference: Algorithms, Evidence, and Data Science
- Also introductory


PRINCIPAL COMPONENT ANALYSIS


Principal Component Analysis

Large number p of regressors xi

Do we need all p regressors? If many regressors are very similar to one another, and highly correlated, maybe the question is not "which" regressor is important, but instead "which linear combination" is important

For example, if you have a data set with a large number of test scores taken by students, and you are trying to predict some outcome (grades in future classes, success at a university, future wages), what might be the best predictor is an average, or linear combination, of the test scores

This leads to the concept of latent factors and factor analysis


Single Factor Model

xi = h fi + ui

- h and ui are p × 1
- fi is 1 × 1
- fi is the factor
- h holds the factor loadings
- ui holds the idiosyncratic errors

In this model, the factor fi affects all regressors xji

- But the magnitude is specific to the regressor and captured by h

Scaling
- The scale of h and fi is not separately identified, nor is their sign
- One option: h′h = 1
- Second option: var(fi) = 1


Testscore Example

xi is a set of test scores for an individual student

fi is the student’s latent ability

h is how ability affects the different test scores
- Some tests may be highly related to ability
- Some tests may be less related
- Some may be unrelated (random?)


Regressor Covariance Matrix

Σx = var(xi) = h h′ σf² + Ip σu²

Σx h = (h h′ σf² + Ip σu²) h = h (σf² + σu²)

- Thus h is an eigenvector of Σx with eigenvalue σf² + σu²
- All other eigenvectors have eigenvalue σu²
- Thus h is the eigenvector of Σx associated with the largest eigenvalue

Estimation
- Σx = sample covariance matrix of xi
- h = eigenvector associated with the largest eigenvalue of Σx
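A minimal R sketch of this estimation step on simulated data (the design and variable names below are illustrative, not the lecture's dataset): form the sample covariance matrix and take the eigenvector associated with its largest eigenvalue as the estimated loading vector.

    # Simulated single-factor data: p scores driven by one latent factor
    set.seed(1)
    n <- 500; p <- 7
    f <- rnorm(n)                                   # latent factor f_i
    h <- runif(p, 0.5, 1.5)                         # true loadings
    x <- outer(f, h) + matrix(rnorm(n * p), n, p)   # x_i = h f_i + u_i

    Sigma_hat <- cov(x)                  # sample covariance matrix of x_i
    eig <- eigen(Sigma_hat)              # eigen-decomposition
    h_hat <- eig$vectors[, 1]            # eigenvector with the largest eigenvalue
    round(eig$values, 2)                 # the first eigenvalue should stand out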


The Largest Principal Component Captures the Greatest Variation

Figure: Principal components of some input data points


Multiple Factor Model

xi = ∑_{m=1}^r hm fmi + ui = H′ fi + ui

Interpretation in the test score example:
- There is more than one form of "ability"
- i.e. literary and mathematical
- In labor economics, a distinction between cognitive and non-cognitive ability has been hypothesized, which has been very useful in explaining wage patterns (some jobs require one or the other, and some both, e.g. surgeon)

Loadings normalized to be orthonormal, factors uncorrelated

Σx = H′ Σf H + Ip σu²

The factor loadings hm are the eigenvectors of Σx associated with the largest r eigenvalues

Estimation: H = eigenvectors of Σx associated with the largest r eigenvalues


Identification of the number of factors

A difficult practical problem

Standard practice is to examine the eigenvalues of Σx and look for breaks

The theory suggests there should be r "large" eigenvalues and the remainder "small"
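Continuing the simulated-data sketch above (prcomp is base R; the eigenvalue proportions and scree plot are an informal diagnostic, not a formal test):

    pc <- prcomp(x, scale. = TRUE)               # principal components of the regressors
    ev <- pc$sdev^2                              # eigenvalues of the correlation matrix
    round(ev / sum(ev), 2)                       # proportion of variance per component
    plot(ev, type = "b", ylab = "Eigenvalue")    # scree plot: look for a break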


Illustration

Data from Duflo, Dupas and Kremer (2011, AER)

Observations are test scores for first grade students in Kenya

The authors used the total test score, but the data set includes scores on subsections of the test

- Four literacy and three mathematics

How many factors explain performance?


    Component   Eigenvalue   Proportion
    1           4.02         0.57
    2           1.04         0.15
    3           0.57         0.08
    4           0.52         0.08
    5           0.37         0.05
    6           0.29         0.04
    7           0.19         0.03

                    First Factor   Second Factor
    words           0.41           −0.32
    sentences       0.32           −0.49
    letters         0.40           −0.13
    spelling        0.43           −0.28
    addition        0.38            0.41
    subtraction     0.35            0.52
    multiplication  0.33            0.36


Implications

There appear to be two large eigenvalues relative to the others

The first eigenvector has similar loadings for all seven subjects
- A general ability/achievement factor

The second factor loads negatively on all four literacy subjects, and positively on all math subjects

- A "math vs literacy" factor!

Appears in first grade exams!


School Factors


Estimation of Factors

Factor Model
- xi = H′ fi + ui
- If you knew H, an estimator of fi would be

  f̂i = H′ xi = fi + vi,   vi = H′ ui

- The error is mean zero, so f̂i is unbiased for fi
- With estimated loadings: f̂i = Ĥ′ xi
- As p, n → ∞, f̂i →p fi


Factor-Augmented Regression

yi = fi′ β + ei

xi = H′ fi + ui

Estimated in multiple steps
- Estimate the loadings H from the covariance matrix of xi
- Estimate the factors fi
- Estimate the coefficient β by least squares of yi on the estimated factors (a sketch follows below)

Generated Regressors
- Problem diminishes as n, p → ∞
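A minimal R sketch of these steps, continuing the simulated data above (the toy outcome and the choice r = 2 are illustrative; the generated-regressor issue is ignored here):

    # Factor-augmented regression: y on the first r principal components of x
    r <- 2
    pc <- prcomp(x, scale. = TRUE)
    f_hat <- pc$x[, 1:r]                                  # estimated factors (component scores)
    y <- drop(x %*% rep(0.3, ncol(x))) + rnorm(nrow(x))   # toy outcome for illustration
    fit <- lm(y ~ f_hat)                                  # least squares of y on estimated factors
    summary(fit)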


RIDGE REGRESSION


Linear Model

yi = xi′ β + ei

Assume all variables demeaned

xi is p× 1

Classical OLS: βols = (X′X)⁻¹ (X′y)

- Unstable if X′X is near singular
- Unstable if X′X is near collinear
- These problems are not typical when p is small, but are typical when p is large and variables are correlated
- Infeasible if p > n
  - Many current applications (high dimension regression)


Wisconsin Regression


Ridge Regression

Classical solution to near singularity

βridge = (X′X + λIp)⁻¹ (X′y)

λ is a tuning parameter

Classical motivation
- Solves multicollinearity
- Solves singularity
- Stabilizes estimator

Modern motivation
- Penalized least squares
- Regularized least squares


Penalized Least Squares

Penalized criterion:
- S2(β) = (y − Xβ)′(y − Xβ) + λ β′β

SSE plus an L2 penalty on the coefficients

βridge = argmin_β S2(β)

F.O.C.
- 0 = −2X′(y − Xβ) + 2λβ

Solution
- βridge = (X′X + λIp)⁻¹ (X′y)

Thus the ridge regression estimator is a penalized LS estimator

If the OLS estimator is “large”, the penalty pushes it towards zero
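A minimal R sketch of this closed-form solution under stated assumptions (data demeaned, λ fixed by the user; the function name ridge_fit is illustrative, and in practice one would use a package such as glmnet, discussed later):

    # Ridge regression by the closed-form formula
    ridge_fit <- function(X, y, lambda) {
      X <- scale(X, center = TRUE, scale = FALSE)              # demean regressors
      y <- y - mean(y)                                         # demean outcome
      solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)   # (X'X + lambda*I)^(-1) X'y
    }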


Dual Problem

The dual of a penalized minimization problem is a Lagrangian problem

Consider the problem
- min (y − Xβ)′(y − Xβ) subject to β′β ≤ τ

This minimizes the SSE over a ball around zero with radius √τ

Lagrangian: L(β, λ) = (y − Xβ)′(y − Xβ) + λ (β′β − τ)

F.O.C. for β

- 0 = −2X′(y − Xβ) + 2λβ
- The same as the penalized least squares problem
- They have the same solution

There is a mapping between λ and τ


Dual (Lagrangian) Constrained Minimization


Shrinkage Interpretation

Suppose X ′X = Ip

Then βridge = (1 + λ)⁻¹ βols

Shrinks OLS towards zero

Similar to Stein estimator

βridge is biased for β but has lower variance than least squares

MSE is not easy to characterize


Selection of Ridge Parameter

Mallows
- µridge = X (X′X + λIp)⁻¹ X′y is linear in y
- Penalty is 2σ² tr((X′X + λIp)⁻¹ X′X)

Cross-Validation

- ei(λ) = yi − xi′ βridge,−i(λ)
- CV(λ) = ∑_{i=1}^n ei(λ)²

Theory: Li (1986, Annals of Statistics)
- Mallows/CV selection asymptotically equivalent to the infeasible optimal λ


Interpretation via the SVD

The singular value decomposition of a matrix is X = UDV′ where U and V are orthonormal and D is diagonal with the singular values of X on the diagonal

Apply to the regressor matrix: X = UDV′

X′X = VDU′UDV′ = VD²V′

(X′X)⁻¹ = VD⁻²V′

X′y = VDU′y

βols = (X′X)⁻¹ X′y = VD⁻²V′VDU′y = VD⁻²DU′y = VD⁻¹U′y

µols = X βols = UDV′VD⁻¹U′y = UU′y = ∑_{j=1}^p uj uj′ y

The least squares estimator can be seen as a projection on the orthonormal basis uj


Interpretation via the SVD

Let Λ = λIp

X′X + Λ = VD²V′ + Λ = V(D² + Λ)V′

(X′X + Λ)⁻¹ = V(D² + Λ)⁻¹V′

βridge = (X′X + Λ)⁻¹ X′y = V(D² + Λ)⁻¹V′VDU′y = V(D² + Λ)⁻¹DU′y

µridge = X βridge = UDV′V(D² + Λ)⁻¹DU′y = UD(D² + Λ)⁻¹DU′y = ∑_{j=1}^p [dj² / (dj² + λ)] uj uj′ y

A shrunken projection

Each basis uj is shrunk according to dj² / (dj² + λ)

- Smaller dj² implies greater shrinkage
- Small singular values of X receive the greatest shrinkage
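A short R check of this representation under stated assumptions (demeaned X, a fixed λ, simulated data; all names illustrative): the ridge fitted values built from the shrinkage factors dj²/(dj² + λ) match the direct closed-form solution.

    # Verify the SVD representation of ridge fitted values
    set.seed(1)
    n <- 100; p <- 5; lambda <- 2
    X <- scale(matrix(rnorm(n * p), n, p), scale = FALSE)
    y <- rnorm(n)

    s <- svd(X)                                             # X = U D V'
    shrink <- s$d^2 / (s$d^2 + lambda)                      # d_j^2 / (d_j^2 + lambda)
    mu_svd <- s$u %*% (shrink * crossprod(s$u, y))          # sum_j shrink_j * u_j u_j' y
    mu_direct <- X %*% solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
    max(abs(mu_svd - mu_direct))                            # numerically zero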


Ridge Coefficients varying with Lambda


Features of Ridge Estimator

Well defined even if p > n

All p coefficients are estimated with non-zero estimates

Does not perform simultaneous selection

Shrinkage greatest on small singular values of regressors


Computation of Ridge Regression

R: package glmnet, glmnet(x,y,alpha=0)
- Selects λ by cross-validation

MATLAB: ridge(y,X,λ)


LASSO (Least Absolute Shrinkage and Selection Operator)


Lasso

Penalized criterion:
- S1(β) = (y − Xβ)′(y − Xβ) + λ ‖β‖₁
- ‖β‖₁ = ∑_{j=1}^p |βj|

SSE plus an L1 penalty on the coefficients

βlasso = argmin_β S1(β)

The minimizer has no closed-form solution

F.O.C. for βj

- 0 = −2Xj′(y − Xβ) + λ sgn(βj)


Dual Problem

min (y − Xβ)′(y − Xβ) subject to ‖β‖₁ ≤ τ

This minimizes the SSE over a constraint set which looks like a square on its edge (a cross-polytope)

Lagrangian
- L(β, λ) = (y − Xβ)′(y − Xβ) + λ (‖β‖₁ − τ)

F.O.C. for βj

- 0 = −2Xj′(y − Xβ) + λ sgn(βj)
- The same as the penalized problem
- The solution is identical


Lasso vs Ridge

From The Elements of Statistical Learning, Table 3.4: estimators of βj in the case of orthonormal columns of X. M and λ are constants chosen by the corresponding techniques; sign denotes the sign of its argument (±1), and x+ denotes the "positive part" of x.

    Estimator              Formula
    Best subset (size M)   βj · I(|βj| ≥ |β(M)|)
    Ridge                  βj / (1 + λ)
    Lasso                  sign(βj) (|βj| − λ)+

Figure 3.11 (ESL): Estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions |β1| + |β2| ≤ t and β1² + β2² ≤ t², respectively, while the red ellipses are the contours of the least squares error function.


Lasso solution tends to hit a corner

Since the constraint region has corners, the lasso tends to hit one

In contrast, the ridge minimizer tends to hit an interior point on the smooth constraint region

The corners represent parameters where some coefficients equal zero

Hence the lasso solution sets some coefficients to zero

However, if the constraint is relaxed enough then Lasso = OLS

- If τ ≥ ‖βols‖₁ then βlasso = βols


As Constraint Set Decreases, Lasso Estimates Shrink to Zero

Figure 3.10 (ESL): Profiles of lasso coefficients (for the prostate-cancer predictors lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45), as the tuning parameter t is varied. Coefficients are plotted versus s = t / ∑_{j=1}^p |βj|. A vertical line is drawn at s = 0.36, the value chosen by cross-validation. Compare Figure 3.8 on page 65; the lasso profiles hit zero, while those for ridge do not. The profiles are piece-wise linear, and so are computed only at the points displayed; see Section 3.4.4 for details.


Effect of changing Lasso Parameter

For λ = 0, Lasso = OLS

As λ increases, the estimates shrink together

At some point, one estimate hits zero. It remains zero

As λ increases further, the estimates continue to shrink at a new rate

One by one, the estimates hit and stick at zero

When λ is sufficiently large, all coefficients equal 0


Selection of Lasso Parameter

Most commonly by K-fold CV

Standard algorithms use CV by default or an option

Theoretical justification in development
- Chernozhukov, Chetverikov, and Liao, working paper


Nesting Selection, Lasso and Ridge

βq = argmin_β (y − Xβ)′(y − Xβ) + λ ‖β‖q^q

‖β‖q^q = ∑_{j=1}^p |βj|^q

(q = 0) variable subset selection
(q = 1) lasso
(q = 2) ridge

(Excerpt from The Elements of Statistical Learning, Section 3.4, starting mid-sentence:)

... region for ridge regression is the disk β1² + β2² ≤ t, while that for lasso is the diamond |β1| + |β2| ≤ t. Both methods find the first point where the elliptical contours hit the constraint region. Unlike the disk, the diamond has corners; if the solution occurs at a corner, then it has one parameter βj equal to zero. When p > 2, the diamond becomes a rhomboid, and has many corners, flat edges and faces; there are many more opportunities for the estimated parameters to be zero.

We can generalize ridge regression and the lasso, and view them as Bayes estimates. Consider the criterion

    β = argmin_β { ∑_{i=1}^N (yi − β0 − ∑_{j=1}^p xij βj)² + λ ∑_{j=1}^p |βj|^q }    (3.53)

for q ≥ 0. The contours of constant value of ∑_j |βj|^q are shown in Figure 3.12, for the case of two inputs.

Thinking of |βj|^q as the log-prior density for βj, these are also the equi-contours of the prior distribution of the parameters. The value q = 0 corresponds to variable subset selection, as the penalty simply counts the number of nonzero parameters; q = 1 corresponds to the lasso, while q = 2 to ridge regression. Notice that for q ≤ 1, the prior is not uniform in direction, but concentrates more mass in the coordinate directions. The prior corresponding to the q = 1 case is an independent double exponential (or Laplace) distribution for each input, with density (1/2τ) exp(−|β|/τ) and τ = 1/λ. The case q = 1 (lasso) is the smallest q such that the constraint region is convex; non-convex constraint regions make the optimization problem more difficult.

In this view, the lasso, ridge regression and best subset selection are Bayes estimates with different priors. Note, however, that they are derived as posterior modes, that is, maximizers of the posterior. It is more common to use the mean of the posterior as the Bayes estimate. Ridge regression is also the posterior mean, but the lasso and best subset selection are not.

Looking again at the criterion (3.53), we might try using other values of q besides 0, 1, or 2. Although one might consider estimating q from the data, our experience is that it is not worth the effort for the extra variance incurred. Values of q ∈ (1, 2) suggest a compromise between the lasso and ridge regression. Although this is the case, with q > 1, |βj|^q is differentiable at 0, and so does not share the ability of lasso (q = 1) for setting coefficients exactly to zero.

Figure 3.12 (ESL): Contours of constant value of ∑_j |βj|^q for q = 4, 2, 1, 0.5, 0.1.


Elastic Net —Compromise between Lasso and Ridge

βnet = argmin_β (y − Xβ)′(y − Xβ) + λ (α ‖β‖₂² + (1 − α) ‖β‖₁)

α = 0 is Lasso
α = 1 is Ridge
0 < α < 1 mixes Lasso and Ridge penalties

Figure 3.13 (ESL): Contours of constant value of ∑_j |βj|^q for q = 1.2 (left plot), and the elastic-net penalty ∑_j (α βj² + (1 − α)|βj|) for α = 0.2 (right plot). Although visually very similar, the elastic-net has sharp (non-differentiable) corners, while the q = 1.2 penalty does not.

(Excerpt from The Elements of Statistical Learning, Section 3.4, continued:) Partly for this reason as well as for computational tractability, Zou and Hastie (2005) introduced the elastic-net penalty

    λ ∑_{j=1}^p (α βj² + (1 − α)|βj|),    (3.54)

a different compromise between ridge and lasso. Figure 3.13 compares the Lq penalty with q = 1.2 and the elastic-net penalty with α = 0.2; it is hard to detect the difference by eye. The elastic-net selects variables like the lasso, and shrinks together the coefficients of correlated predictors like ridge. It also has considerable computational advantages over the Lq penalties. We discuss the elastic-net further in Section 18.4.

3.4.4 Least Angle Regression

Least angle regression (LAR) is a relative newcomer (Efron et al., 2004), and can be viewed as a kind of "democratic" version of forward stepwise regression (Section 3.3.2). As we will see, LAR is intimately connected with the lasso, and in fact provides an extremely efficient algorithm for computing the entire lasso path as in Figure 3.10.

Forward stepwise regression builds a model sequentially, adding one variable at a time. At each step, it identifies the best variable to include in the active set, and then updates the least squares fit to include all the active variables.

Least angle regression uses a similar strategy, but only enters "as much" of a predictor as it deserves. At the first step it identifies the variable most correlated with the response. Rather than fit this variable completely, LAR moves the coefficient of this variable continuously toward its least-squares value (causing its correlation with the evolving residual to decrease in absolute value). As soon as another variable "catches up" in terms of correlation with the residual, the process is paused. The second variable then joins the active set, and their coefficients are moved together in a way that keeps their correlations tied and decreasing. This process is continued ...


Minimum Distance Representation

When p < n

- (y − Xβ)′(y − Xβ) = ê′ê + (βols − β)′ X′X (βols − β)   (algebraic trick)

Thus

- βlasso = argmin_β (βols − β)′ X′X (βols − β) + λ ‖β‖₁

βlasso minimizes the weighted Euclidean distance to βols, with penalty


Thresholding Representation

Suppose p < n and X′X = Ip

Then βridge = (1/(1 + λ)) βols evenly shrinks OLS towards zero

Selection using a critical value c (e.g. c = 1.96² σ²)

- βtest,j = βols,j · 1(βols,j² ≥ c)
- This is called a "hard thresholding" rule


Thresholding Representation

Lasso criterion under X′X = Ip

- ∑_{j=1}^p ((βols,j − βj)² + λ |βj|)

F.O.C. is

- −2(βols,j − βj) + λ sgn(βj) = 0

Solution

- βlasso,j = βols,j − λ/2   if βols,j > λ/2
           = 0             if |βols,j| ≤ λ/2
           = βols,j + λ/2   if βols,j < −λ/2
- This is called a "soft thresholding" rule
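A minimal R sketch of this rule (the function name is illustrative):

    # Soft thresholding: the lasso solution when X'X = I_p
    soft_threshold <- function(b_ols, lambda) {
      sign(b_ols) * pmax(abs(b_ols) - lambda / 2, 0)
    }
    soft_threshold(c(-2, -0.3, 0.1, 1.5), lambda = 1)   # -1.5, 0, 0, 1.0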


Selection, Ridge Regression and the Lasso

(This slide repeats Table 3.4 and Figure 3.11 from The Elements of Statistical Learning, reproduced earlier on the "Lasso vs Ridge" slide: under orthonormal X, best subset hard-thresholds, ridge shrinks proportionally by 1/(1 + λ), and the lasso soft-thresholds.)


Scaling

The Lasso criterion (y − Xβ)′(y − Xβ) + λ ‖β‖₁ is not invariant to re-scaling the regressors

- The penalty λ ∑_{j=1}^p |βj| is identical for each coefficient
- If you rescale a regressor (e.g. change units of measurement) then the penalty has a completely different meaning
- Hence the scale matters
- In practice, it is common to rescale the regressors so that all are mean zero and have the same variance
  - Unless variables are already scaled similarly (e.g. interest rates)


Which Regressors

The Lasso criterion (y − Xβ)′(y − Xβ) + λ ‖β‖₁ is not invariant to linear transformations of the regressors

Suppose you have X1 and X2

OLS on (X1, X2) is the same as OLS on (X1 − X2, X2)

Lasso on (X1, X2) is different than Lasso on (X1 − X2, X2)

Orthogonality
- Much theoretical insight arises from the case of orthogonal regressors
- It may therefore be useful to start with transformed regressors which are near orthogonal
- e.g. differences between interest rates (spreads) rather than levels

Getting the right zeros
- Many theoretical results concern sparsity (more on this later)
- This occurs when the true regression has many 0 coefficients
- It is therefore useful to start with transformed regressors which are most likely to have many zero coefficients


Grouped Lasso

We can penalize groups of coefficients so that they are included/excluded as a group

Grouped Lasso criterion

- (y − ∑_{ℓ=1}^L Xℓ βℓ)′ (y − ∑_{ℓ=1}^L Xℓ βℓ) + λ ∑_{ℓ=1}^L √pℓ ‖βℓ‖₂
- pℓ = group size
- Note the penalty is ‖βℓ‖₂ = (∑_j βℓj²)^(1/2)


Statistical Properties

There are asymptotic results for the Lasso allowing for p >> n

The results rely on a sparsity assumption: the true regression has p0 non-zero coefficients, where p0 is fixed

- This assumption can be relaxed in some respects, but some form of sparsity lies at the core of current theory

Under regularity conditions, Lasso estimation identifies the true predictors with high probability

- Consistent model selection
- Similar to BIC selection

The non-zero coefficients, however, are not consistently estimated but are biased

Proposals to eliminate the bias
- Least squares after Lasso selection
- SCAD (smoothly clipped absolute deviation)
- Adaptive Lasso


Sparsity

Sparsity is all the fashion in this literature

It seems to be an assumption used to justify theory which can be proved, not an assumption based on reality

People talk about "imposing sparsity" as if the theorist can influence the world

The world is the way it is.

What does sparsity mean in practical econometrics?
- That a few coefficients are "big", the remainder zero or small
- In a series regression, only a few coefficients are non-zero, the remainder zero or small
- This does not make much sense
- More reasonable: all coefficients are non-zero, and all are small

This is a challenge for Lasso-type theory


Computation via LAR algorithm

Least Angle Regression (LAR)

A modification of forward stepwise regression
- Start with all coefficients equal to zero
- Find xj most correlated with y
- Increase the coefficient βj in the direction of its correlation with y
  - Take residuals along the way
  - Stop when some other xℓ has the same correlation with the residual as xj
- Increase (βj, βℓ) in their joint least squares direction, until some other xm has as much correlation with the residual
- Continue until all predictors are in the model

This algorithm gives the Lasso expansion path

Used to produce an ordering, this yields Least Angle Regression Selection (LARS)

- Alternative to stepwise regression


Least Angle Regression Solution Path

Figure 3.14 (ESL): Progression of the absolute correlations during each step of the LAR procedure, using a simulated data set with six predictors. The labels at the top of the plot indicate which variables enter the active set at each step. The step lengths are measured in units of L1 arc length.

Figure 3.15 (ESL): Left panel shows the LAR coefficient profiles on the simulated data, as a function of the L1 arc length. The right panel shows the Lasso profile. They are identical until the dark-blue coefficient crosses zero at an arc length of about 18.


LAR modification to make equal to Lasso

Modified LAR algorithm
- Start with all coefficients equal to zero
- Find xj most correlated with y
- Increase the coefficient βj in the direction of its correlation with y
  - Take residuals along the way
  - Stop when some other xℓ and xj have the same correlation with the residual
  - If a non-zero coefficient hits zero, drop it from the active set of variables and recompute the joint least squares direction
- Increase (βj, βℓ) in their joint least squares direction, until some other xm has as much correlation with the residual
- Continue until all predictors are in the model


Comparison of performance of methods

Figure 3.16 (ESL): Comparison of LAR and lasso with forward stepwise, forward stagewise (FS) and incremental forward stagewise (FS0) regression. The setup is the same as in Figure 3.6, except N = 100 here rather than 300. Here the slower FS regression ultimately outperforms forward stepwise. LAR and lasso show similar behavior to FS and FS0. Since the procedures take different numbers of steps (across simulation replicates and methods), the MSE E‖β̂(k) − β‖² is plotted as a function of the fraction of total L1 arc-length toward the least-squares fit.

(Excerpt from ESL, continued, starting mid-sentence:) ... adaptively fitted to the training data. This definition is motivated and discussed further in Sections 7.4-7.6. Now for a linear regression with k fixed predictors, it is easy to show that df(ŷ) = k. Likewise for ridge regression, this definition leads to the closed-form expression (3.50) on page 68: df(ŷ) = tr(Sλ). In both these cases, (3.60) is simple to evaluate because the fit ŷ = Hλy is linear in y. If we think about definition (3.60) in the context of a best subset selection of size k, it seems clear that df(ŷ) will be larger than k, and this can be verified by estimating Cov(ŷi, yi)/σ² directly by simulation. However there is no closed form method for estimating df(ŷ) for best subset selection. For LAR and lasso, something magical happens. These techniques are adaptive in a smoother way than best subset selection, and hence estimation of degrees of freedom is more tractable. Specifically it can be shown that after the kth step of the LAR procedure, the effective degrees of freedom of the fit vector is exactly k. Now for the lasso, the (modified) LAR procedure ...


Computation

Dual representation of Lasso and Elastic Net is a quadratic programming problem

Efficient when we have a fixed λ

- Numerically fast

The LARS algorithm provides the entire path as a function of the tuning parameter

- Useful for cross-validation


Computation

R (recommended)
- package glmnet
  - cv.glmnet(x,y)
  - Selects λ by cross-validation
- For ridge or elastic net
  - cv.glmnet(x,y,alpha=a)
  - Set a = 0 for ridge, a = 1 for Lasso, 0 < a < 1 for elastic net
- package lars
  - lars(x,y,type="lasso")
  - lars(x,y,type="lar")

MATLAB
- lasso(X,y)
- lasso(X,y,'CV',K)


Computation in R

library(glmnet)

mLasso <- cv.glmnet(X,y,family="gaussian",nfolds=200)

- beta <- coef(mLasso, s = mLasso$lambda.min)
- Useful to specify the number of folds (default is 10)
- More folds reduces instability, but takes longer
- If you do not specify "lambda.min" the package will use "lambda.1se", which is different from the minimizer

mRidge <- cv.glmnet(X, y, alpha=0, family="gaussian", nfolds=200)

mElastic <- cv.glmnet(X, y, alpha=1/2, family="gaussian", nfolds=200)


Illustration

cps wage regression using a subsample of Asian women (n = 1149)

Regressors:

- education (linear), and dummies for education equalling 12, 13, 14, 16, 18, and 20
- experience in powers from 1 to 9
- marriage dummies (6 of 7 categories), 3 region dummies, union dummy

Lasso, with λ selected by minimizing 200-fold CV

Selected regressors:
- Education dummies
- Experience powers 1, 2, 3, 5, 6
- All remaining dummies

Coefficient estimates: most shrunk about 10% relative to least squares


REGRESSION TREES


Regression Tree

Partition regressor space into rectangles
- Split based on whether a regressor is below or exceeds a threshold
- Split again
- Split again

Each split point is a node

Each subset is a branch

On each branch, fit a simple model (see the R sketch below)

- Often just the sample mean of yi
- Or linear regression
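A minimal R sketch using the rpart package (the simulated data and names are illustrative):

    # Fit a regression tree; each terminal node predicts the branch mean of y
    library(rpart)
    set.seed(1)
    n <- 500
    df <- data.frame(x1 = runif(n), x2 = runif(n))
    df$y <- ifelse(df$x1 < 0.5, 1, 3) + rnorm(n)               # step function in x1
    tree <- rpart(y ~ x1 + x2, data = df, method = "anova")    # recursive binary splits
    tree                                                       # prints the estimated split points (nodes)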


Figure 9.2 (ESL, Chapter 9): Partitions and CART. Top right panel shows a partition of a two-dimensional feature space by recursive binary splitting, as used in CART, applied to some fake data. Top left panel shows a general partition that cannot be obtained from recursive binary splitting. Bottom left panel shows the tree corresponding to the partition in the top right panel, and a perspective plot of the prediction surface appears in the bottom right panel.


Figure 9.5 (ESL, Chapter 9): The pruned tree for the spam example. The split variables are shown in blue on the branches, and the classification (spam vs email) is shown in every node. The numbers under the terminal nodes indicate misclassification rates on the test data.


Estimation of nodes

Regression: minimizing SSE

This is the same as for threshold models in econometrics
- Similar to structural change estimation

Potential split points equal n (at each observation point)

Estimate up to n regressions, one for each possible split point

Find the split point which minimizes the SSE

This is the least squares estimator of the node
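A minimal R sketch of this split-point search for a single regressor (illustrative only; tree software repeats it recursively over all regressors and branches). It reuses the simulated df from the rpart sketch above.

    # Find the split point on one regressor minimizing the SSE of the two branch means
    best_split <- function(x, y) {
      cand <- sort(unique(x))
      cand <- cand[-length(cand)]              # drop the largest value (empty right branch)
      sse <- sapply(cand, function(t) {
        left <- y[x <= t]; right <- y[x > t]
        sum((left - mean(left))^2) + sum((right - mean(right))^2)
      })
      cand[which.min(sse)]                     # least squares estimate of the node
    }
    best_split(df$x1, df$y)                    # close to the true threshold 0.5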


Tree Estimation

Given the nodes (the split points) you estimate the simple model (mean or linear regression) on each branch

The regression model is the tree structure plus the models for each branch

Prediction (estimation of the conditional mean at a point)
- Given the regressor, which branch are we on?
- Compute the conditional mean for that branch

Take, for example, a wage regression
- splits might be based on sex, race, region, education levels, experience levels, etc
- Each split is binary
- A branch is a set of characteristics. The estimate (typically) is the mean for this group


How Many Nodes?

First fit (grow) a large tree, based on a pre-specified maximum number of nodes

Then prune back by minimizing a cost criterion
- T = tree
- |T| = number of terminal nodes
- ŷi = fitted value (mean or regression fit within each branch)
- êi = yi − ŷi
- C = ∑_{i=1}^n êi² + α |T|
- Penalized sum of squared errors, AIC-like

Penalty term α selected by CV
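In the rpart package the complexity parameter cp plays the role of α (up to scaling), and the cross-validation is done internally; a short sketch continuing the tree fitted above (names illustrative):

    # Cost-complexity pruning with the CV-chosen complexity parameter
    printcp(tree)                                            # CV error for each value of cp
    best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
    pruned <- prune(tree, cp = best_cp)                      # prune the large tree back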


Comments on Trees

Flexible non-parametric approach

Typically used for prediction

Can be useful for decisions

Consider a doctor deciding on a treatment:
- What is your gender?
- Is your cholesterol above 200?
- Is your blood pressure above 130?
- Is your age above 60?
- Given this information, we recommend you take the BLUE pill


BAGGING (BOOTSTRAP AGGREGATION)


Bagging

Bootstrap averaging for an estimator m̂(x) of the conditional mean m(x)
- Example: Regression tree

Generate B random samples of size n by sampling with replacement from the data

- On each bootstrap sample, fit the estimator m̂b(x)

Average across the bootstrap samples (see the sketch below)
- m̂bag(x) = B⁻¹ ∑_{b=1}^B m̂b(x)
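A minimal R sketch of bagged regression trees, reusing the rpart example above (B and names are illustrative):

    # Bagging: average tree predictions across B bootstrap samples
    B <- 100
    preds <- replicate(B, {
      idx <- sample(nrow(df), replace = TRUE)                     # bootstrap sample of size n
      tb <- rpart(y ~ x1 + x2, data = df[idx, ], method = "anova")
      predict(tb, newdata = df)                                   # m_b(x) at the original points
    })
    m_bag <- rowMeans(preds)                                      # m_bag(x) = B^(-1) sum_b m_b(x)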


Bagging

If m̂(x) is linear then (with large B) m̂bag(x) ≈ m̂(x)

If m̂(x) is nonlinear they are different

- Bagging reduces variance (but adds bias)
- Averaging smooths the estimator, and smoothing reduces variance

Use m̂bag(x) for prediction


Bagging and Bias

If m̂(x) is biased then bagging increases the bias

A simple bootstrap estimator of the bias is

- bias(m̂) = B⁻¹ ∑_{b=1}^B m̂b(x) − m̂(x) = m̂bag(x) − m̂(x)

Thus a bias-corrected estimator of m(x) is

- m̂bc(x) = m̂(x) − bias(m̂) = m̂(x) − (m̂bag(x) − m̂(x)) = 2m̂(x) − m̂bag(x)
- Not m̂bag(x)!

Bagging does not reduce bias, but accentuates bias

Bagging is best applied to low-bias estimators

Goal is to reduce variance


A little intuition

θ̂ ∼ N(θ, Ip)

Thresholded estimator: θ̃ = θ̂ · 1(θ̂′θ̂ > c)

Bootstrap

- θ̂* ∼ N(θ̂, Ip)
- θ̃* = θ̂* · 1(θ̂*′ θ̂* > c)

Bagging estimator

- θ̃bag = B⁻¹ ∑_{b=1}^B θ̃* ≈ µ(θ̂), where µ(θ) = E(θ̃)

θ̃ is a non-smooth function of θ̂

µ(θ) is a smooth function of θ

θ̃bag ≈ µ(θ̂) is a smooth function of θ̂


When to use Bagging

Bagging is ideal for methods such as regression trees

Deep regression trees have low bias but high variance

Regression trees use hard thresholding rules

Bagging smooths the hard thresholding into smooth thresholding

Bagging reduces variance

Bagging is not useful for high-bias, smooth, or low-variance procedures


RANDOM FORESTS


Random Forests

Similar to bagging, but with an adjustment to reduce variance

When you do bagging, you are averaging identically distributed but correlated bootstrapped trees

The correlation means that the averaging does not reduce the variance as much as if the bootstrapped trees were uncorrelated

Random forests try to de-correlate the bootstrapped trees


Random Forest Algorithm for Regression

For b = 1, ..., B
- Draw a random sample of size n from the data set
- Grow a random forest tree Tb on the bootstrapped data, by recursively repeating the following steps for each terminal node of the tree until the minimum node size nmin is reached (recommended nmin = 5)
  - Select m variables at random from the p variables (recommended m = p/3)
  - Pick the best variable/split point among the m
  - Split the node into two daughter nodes

m̂(x) = B⁻¹ ∑_{b=1}^B Tb(x)
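A minimal R sketch using the randomForest package, reusing the simulated df from the tree example above (for regression the package defaults already follow these recommendations, roughly mtry = p/3 and nodesize = 5; the settings below just make them explicit):

    # Random forest for regression
    library(randomForest)
    rf <- randomForest(y ~ x1 + x2, data = df,
                       ntree = 500,       # number of bootstrap trees B
                       mtry = 1,          # m variables tried at each split (about p/3)
                       nodesize = 5)      # minimum terminal node size
    m_hat <- predict(rf, newdata = df)    # averaged tree predictions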


Why Random Forest?

The randomization over the variables means that the bootstrapped regression trees are less correlated than under standard bagging

Very popular

Numerical studies show that it works well in many applications


Out-of-Bag Samples

Random forests have an evaluation device similar to cross-validation, called the out-of-bag (OOB) sample

Recall that a random forest predictor is calculated by averaging over B bootstrap samples

The probability that a given observation i is in a given bootstrap sample is about 63%

For the OOB sample (a sketch follows below)
- For each observation i
  - Construct its random forest predictor by averaging only over the (approximately) 37% of bootstrap samples where observation i does not appear
  - Compute the OOB prediction error
- Take the sum of squared errors
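In the randomForest package these OOB predictions are returned automatically; a short sketch continuing the fit above:

    # Out-of-bag evaluation: each fitted value uses only trees where i was not drawn
    oob_pred <- rf$predicted                 # OOB predictions for each observation
    mean((df$y - oob_pred)^2)                # OOB estimate of the prediction error (MSE)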


ENSEMBLING


Ensembling Econometricians


Ensembling is Averaging

Suppose you have several estimators
- AIC-selected
- JMA model averaged
- Stein shrinkage
- Principal components regression
- Ridge
- Lasso
- Elastic Net
- Regression Tree
- Random Forest

What do you do?

Let’s look at our favorite models again


Model 1: Kendall Jenner


Model 2: Fabio


Model 3: Einstein


Ensembling is Averaging the Estimators

Weight Selection Methods:

Method 1: Elements of Statistical Learning recommends penalized regression (Ridge or Lasso penalty)

- Regress yi on predictions from the models, with penalty
- Regularization (penalty) is essential
- Unclear how to select λ

Method 2: Select weights by cross-validation

Considerable evidence indicates that ensembling (averaging) is better than using just one method
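A minimal R sketch of Method 1 under stated assumptions (toy data; the candidate models and the use of in-sample fitted values are illustrative, and in practice the stacked predictions should be computed out-of-sample or by cross-validation):

    # Ensemble weights by penalized regression of y on the candidate predictions
    library(glmnet)
    set.seed(1)
    n <- 200; x <- matrix(rnorm(n * 5), n, 5)
    y <- drop(x %*% c(1, 0.5, 0, 0, 0)) + rnorm(n)

    p_ols   <- fitted(lm(y ~ x))                                                # OLS predictions
    p_ridge <- predict(cv.glmnet(x, y, alpha = 0), newx = x, s = "lambda.min")  # ridge predictions
    p_lasso <- predict(cv.glmnet(x, y, alpha = 1), newx = x, s = "lambda.min")  # lasso predictions
    P <- cbind(p_ols, p_ridge, p_lasso)                                         # matrix of predictions

    ens <- cv.glmnet(P, y, alpha = 0, lower.limits = 0)    # ridge penalty, non-negative weights
    coef(ens, s = "lambda.min")                            # estimated ensemble weights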


What’s Next?Last time I taught here, my wife and I went to Gavdos


Gavdos


Assignment # 5

Use the cps dataset from before, but now use ALL observations

Create a large set of regressors that you believe are appropriate to potentially model wages

Estimate a regression for log(wage) using the following methods
- OLS
- Ridge Regression
- Lasso
- Elastic Net with α = 1/2

Report your coefficient estimates in a table

Comment on your findings


That’s It!
