Reading the Lasso 1996 paper by Robert Tibshirani
READING SEMINAR ON CLASSICS
Regression Shrinkage and Selection via the LASSO
By Robert Tibshirani
Presented by Ulcinaite Agne
November 4, 2012
Presented by Ulcinaite Agne LASSO November 4, 2012 1 / 41
Outline

1 Introduction
2 OLS estimates
  OLS criticisms
  Standard improving techniques
3 LASSO
  Definition
  Motivation for LASSO
  Orthonormal design case
  Function forms
  Example of prostate cancer
  Prediction error and estimation of t
4 Algorithm for finding LASSO solutions
5 Simulation
6 Conclusions
Introduction
The Article
Regression Shrinkage and Selection via the LASSO, by Robert Tibshirani
Published in 1996 in the Journal of the Royal Statistical Society, Series B (Methodological), Vol. 58, No. 1
OLS estimates
We consider the usual regression situation. The data are $(x_i, y_i)$, $i = 1, \ldots, N$, where $x_i = (x_{i1}, \ldots, x_{ip})^T$ are the regressors and $y_i$ is the response for the ith observation.

The ordinary least squares (OLS) estimates minimize the residual sum of squares (RSS):

$$\mathrm{RSS} = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2$$
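The OLS criterion can be minimized with a single linear-algebra call; the following is a minimal NumPy sketch (the data are synthetic and purely illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N observations, p regressors (illustrative only)
N, p = 50, 3
X = rng.normal(size=(N, p))
beta_true = np.array([2.0, 0.0, -1.0])
y = 1.5 + X @ beta_true + rng.normal(scale=0.1, size=N)

# Minimize RSS = sum_i (y_i - beta_0 - sum_j x_ij beta_j)^2 by appending
# an intercept column of ones and solving the least squares problem.
X1 = np.column_stack([np.ones(N), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
beta0_hat, beta_hat = coef[0], coef[1:]
rss = np.sum((y - beta0_hat - X @ beta_hat) ** 2)
```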
OLS criticisms

The two reasons why data analysts are often not satisfied with OLS estimates:

Prediction accuracy: OLS estimates have low bias but large variance

Interpretation: with a large number of predictors, we would prefer to determine a smaller subset that exhibits the strongest effects
Standard improving techniques

Subset selection: small changes in the data can result in very different models

Ridge regression:

$$\hat\beta^{\mathrm{ridge}} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_j \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_j \beta_j^2 \le t$$

Ridge regression does not set any of the coefficients to 0 and hence does not give an easily interpretable model
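For illustration, ridge regression is usually computed in its equivalent penalized (Lagrangian) form, where each penalty λ ≥ 0 corresponds to some bound t in the constrained form above. A minimal NumPy sketch on synthetic, centred data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, centred data (illustrative): with centred X and y the
# intercept is handled separately, so we solve for beta only.
N, p = 60, 4
X = rng.normal(size=(N, p))
X = X - X.mean(axis=0)
beta_true = np.array([1.0, -2.0, 0.5, 0.0])
y = X @ beta_true + rng.normal(scale=0.1, size=N)
y = y - y.mean()

def ridge(X, y, lam):
    # Penalized form: solve (X'X + lam I) beta = X'y. Each lam >= 0
    # corresponds to some bound t in the constrained form.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

beta_ols = ridge(X, y, 0.0)    # lam = 0 recovers OLS
beta_r = ridge(X, y, 10.0)     # coefficients shrunk towards 0
```

Note that the shrunk coefficients are smaller but none is exactly 0, which is precisely the interpretability criticism on the slide.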
Definition

We consider the same data as in the OLS estimation case: $(x_i, y_i)$, $i = 1, \ldots, N$, where $x_i = (x_{i1}, \ldots, x_{ip})^T$.

The LASSO (Least Absolute Shrinkage and Selection Operator) estimate $(\hat\alpha, \hat\beta)$ is defined by

$$(\hat\alpha, \hat\beta) = \arg\min \sum_{i=1}^{N} \Big( y_i - \alpha - \sum_j \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_j |\beta_j| \le t$$
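The paper states the constrained form; a common equivalent is the penalized form min Σᵢ(yᵢ − Σⱼβⱼxᵢⱼ)² + λΣⱼ|βⱼ|, which can be solved by coordinate descent with soft thresholding. This is not the paper's own algorithm, and the data below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data with centred y and standardized columns
N, p = 80, 5
X = rng.normal(size=(N, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
beta_true = np.array([3.0, 0.0, 0.0, -1.5, 0.0])
y = X @ beta_true + rng.normal(scale=0.2, size=N)
y = y - y.mean()

def soft_threshold(z, g):
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for the penalized LASSO
    min_beta sum_i (y_i - sum_j x_ij beta_j)^2 / ... + lam * sum_j |beta_j|.
    Each lam >= 0 corresponds to some bound t in the constrained form."""
    N, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual
            beta[j] = soft_threshold(X[:, j] @ r, lam) / col_ss[j]
    return beta

beta_hat = lasso_cd(X, y, lam=30.0)
```

Unlike the ridge fit, some coefficients come out exactly 0, which is the selection property the paper emphasizes.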
Definition
The amount of shrinkage is controlled by the parameter t ≥ 0, which is applied to the estimates.

Let $\hat\beta^o_j$ be the full least squares estimates and let $t_0 = \sum_j |\hat\beta^o_j|$.

Values $t < t_0$ shrink the solutions towards 0, setting some coefficients exactly equal to 0.

For example, taking $t = t_0/2$ has an effect roughly similar to finding the best subset of size p/2.
Motivation for LASSO

LASSO came from the proposal of Breiman (1993). Breiman's non-negative garotte minimizes

$$\sum_{i=1}^{N} \Big( y_i - \alpha - \sum_j c_j \hat\beta^o_j x_{ij} \Big)^2 \quad \text{subject to} \quad c_j \ge 0, \; \sum_j c_j \le t$$
Orthonormal design case
Let X be the n × p design matrix with ijth entry $x_{ij}$, and suppose $X^T X = I$. The solution of the previous minimization problem is the soft-thresholding rule

$$\hat\beta_j = \mathrm{sign}(\hat\beta^o_j)\,(|\hat\beta^o_j| - \gamma)^+$$

For comparison:

Best subset selection (of size k): keep the k largest coefficients $\hat\beta^o_j$ in absolute value

Ridge regression solutions: $\hat\beta^o_j / (1 + \gamma)$

Garotte estimates: $(1 - \gamma/\hat\beta^{o\,2}_j)^+ \hat\beta^o_j$
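The shrinkage rules in the orthonormal case are one-liners; a small NumPy sketch (the vector of OLS estimates below is invented for illustration):

```python
import numpy as np

def lasso_shrink(b, g):
    # Soft thresholding: sign(b) * (|b| - g)_+
    return np.sign(b) * np.maximum(np.abs(b) - g, 0.0)

def ridge_shrink(b, g):
    # Proportional shrinkage: b / (1 + g)
    return b / (1.0 + g)

def garotte_shrink(b, g):
    # Garotte: (1 - g / b^2)_+ * b
    return np.maximum(1.0 - g / b ** 2, 0.0) * b

b = np.array([-3.0, -0.5, 0.2, 1.0, 4.0])   # OLS estimates (illustrative)
g = 1.0
```

Applying the three rules to the same `b` shows the qualitative difference on the slide: LASSO and the garotte zero out small coefficients, ridge only rescales them.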
Function forms
Figure: functional forms of (a) subset regression, (b) ridge regression, (c) the LASSO and (d) the garotte
Figure: estimation picture for (a) the LASSO and (b) ridge regression
Example of prostate cancer
Data examined: from a study by Stamey et al. (1989). The factors:

log(cancer volume) lcavol
log(prostate weight) lweight
age
log(benign prostatic hyperplasia amount) lbph
seminal vesicle invasion svi
log(capsular penetration) lcp
Gleason score gleason
percentage of Gleason scores 4 or 5 pgg45

A linear model is fitted to log(prostate specific antigen) lpsa
Statistics of the example
Estimated coefficients and test error results for different subset and shrinkage methods applied to the prostate data. The blank entries correspond to variables omitted.
Prediction error and estimation of t
Methods for the estimation of the LASSO parameter t:
Cross-validation
Generalized cross-validation
Analytical unbiased estimate of risk
Strictly speaking, the first two methods are applicable in the 'X-random' case, and the third method applies to the X-fixed case.
Prediction error and estimation of t
Suppose that

$$Y = \eta(X) + \varepsilon$$

where $E(\varepsilon) = 0$ and $\mathrm{var}(\varepsilon) = \sigma^2$. For a fitted model $\hat\eta(X)$, the mean-squared error (ME) and prediction error (PE) are

$$\mathrm{ME} = E\{\hat\eta(X) - \eta(X)\}^2$$

$$\mathrm{PE} = E\{Y - \hat\eta(X)\}^2 = \mathrm{ME} + \sigma^2$$
Cross-validation
The prediction error (PE) is estimated by fivefold cross-validation. The LASSO is indexed in terms of the normalised parameter $s = t / \sum_j |\hat\beta^o_j|$, and PE is estimated over a grid of values of s from 0 to 1 inclusive:

Create a 5-fold partition of the dataset

For each fold, all but one of the chunks are used for training and the remaining chunk for testing

Repeat 5 times so that each chunk is used once for testing

The value $\hat s$ yielding the lowest estimated PE is selected
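A sketch of the fivefold cross-validation idea. For simplicity it tunes the penalized-form parameter λ rather than the normalised parameter s used in the paper (each λ corresponds to some s), reuses a small coordinate-descent LASSO solver, and runs on synthetic, centred data:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative data (centred by construction: no intercept term)
N, p = 100, 5
X = rng.normal(size=(N, p))
beta_true = np.array([3.0, 0.0, 0.0, 1.5, 0.0])
y = X @ beta_true + rng.normal(scale=0.5, size=N)

def lasso_cd(X, y, lam, n_iter=100):
    # Penalized-form LASSO by coordinate descent (Lagrangian of the
    # constrained form in the paper; each lam maps to some bound t).
    N, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ r
            beta[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_ss[j]
    return beta

def cv_lasso(X, y, lams, k=5):
    # Fivefold cross-validation of the prediction error over a grid:
    # each chunk is held out once, the model is fit on the rest.
    N = X.shape[0]
    folds = np.array_split(rng.permutation(N), k)
    pe = []
    for lam in lams:
        errs = []
        for f in folds:
            train = np.setdiff1d(np.arange(N), f)
            b = lasso_cd(X[train], y[train], lam)
            errs.append(np.mean((y[f] - X[f] @ b) ** 2))
        pe.append(np.mean(errs))
    return np.array(pe)

lams = np.array([0.1, 1.0, 10.0, 100.0, 1000.0])
pe = cv_lasso(X, y, lams)
best_lam = lams[np.argmin(pe)]
```

The largest penalty shrinks everything to 0 and is penalized by a large estimated PE, so the grid search picks a smaller value.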
Generalized Cross-validation
The constraint is re-written as $\sum_j \beta_j^2 / |\beta_j| \le t$, so the constrained solution $\tilde\beta$ can be expressed as a ridge regression estimator

$$\tilde\beta = (X^T X + \lambda W^-)^{-1} X^T y$$

where $W = \mathrm{diag}(|\tilde\beta_j|)$ and $W^-$ denotes a generalized inverse. The number of effective parameters in the constrained fit $\tilde\beta$ may be approximated by

$$p(t) = \mathrm{tr}\{ X (X^T X + \lambda W^-)^{-1} X^T \}$$

The generalised cross-validation style statistic is

$$\mathrm{GCV}(t) = \frac{1}{N} \frac{\mathrm{RSS}(t)}{\{1 - p(t)/N\}^2}$$
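A sketch of the GCV computation, assuming a LASSO fit is already available (here faked by soft-thresholding the OLS estimate). Taking $W^-$ as the diagonal generalized inverse with zero entries where the coefficient is 0 is this sketch's reading of the formula, not a detail fixed by the slide:

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative data
N, p = 60, 4
X = rng.normal(size=(N, p))
beta_true = np.array([2.0, 0.0, -1.0, 0.0])
y = X @ beta_true + rng.normal(scale=0.3, size=N)

def gcv_statistic(X, y, beta, lam):
    # GCV(t) = (1/N) * RSS(t) / {1 - p(t)/N}^2 with
    # p(t) = tr{X (X'X + lam W^-)^{-1} X'}, W = diag(|beta_j|).
    N = X.shape[0]
    W_minus = np.diag([1.0 / abs(b) if abs(b) > 1e-10 else 0.0 for b in beta])
    H = X @ np.linalg.pinv(X.T @ X + lam * W_minus) @ X.T
    p_t = np.trace(H)
    rss = np.sum((y - X @ beta) ** 2)
    return (rss / N) / (1.0 - p_t / N) ** 2

# Stand-in for a LASSO fit: soft-thresholded OLS (illustrative only)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
beta = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - 0.1, 0.0)
gcv = gcv_statistic(X, y, beta, lam=1.0)
```

In practice one would evaluate this statistic over a grid of t (or λ) values and pick the minimizer, as with ordinary cross-validation.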
Unbiased estimate of risk
This method is based on Stein's (1981) unbiased estimate of risk. Denote the estimated standard error of $\hat\beta^o_j$ by $\hat\tau = \hat\sigma / \sqrt{N}$, where $\hat\sigma^2 = \sum_i (y_i - \hat y_i)^2 / (N - p)$. Then the formula

$$R\{\hat\beta(\gamma)\} \approx \hat\tau^2 \Big( p - 2\,\#(j : |\hat\beta^o_j/\hat\tau| < \gamma) + \sum_{j=1}^{p} \max(|\hat\beta^o_j/\hat\tau|, \gamma)^2 \Big)$$

is derived as an approximately unbiased estimate of the risk. Hence an estimate of $\gamma$ can be obtained as the minimizer of $R\{\hat\beta(\gamma)\}$:

$$\hat\gamma = \arg\min_{\gamma \ge 0} \big[ R\{\hat\beta(\gamma)\} \big]$$

From this we obtain an estimate of the LASSO parameter t:

$$\hat t = \sum_j (|\hat\beta^o_j| - \hat\gamma)^+$$
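A sketch of the Stein-based selection of $\hat\gamma$ and $\hat t$ on invented numbers. One detail is this sketch's assumption: γ is searched on the standardized scale $|\hat\beta^o_j/\hat\tau|$, so it is rescaled by $\hat\tau$ when forming $\hat t$:

```python
import numpy as np

# Illustrative OLS estimates and estimated error scale
# (numbers invented for the sketch, not from the paper)
N = 100
beta_ols = np.array([3.0, 0.1, -2.0, 0.05, 0.0])
sigma_hat = 0.5
tau = sigma_hat / np.sqrt(N)          # tau = sigma / sqrt(N)

def risk_estimate(gamma):
    # R{beta(gamma)} ~ tau^2 * (p - 2*#{j: |b_j/tau| < gamma}
    #                             + sum_j max(|b_j/tau|, gamma)^2)
    z = np.abs(beta_ols) / tau
    p = len(beta_ols)
    return tau ** 2 * (p - 2 * np.sum(z < gamma)
                       + np.sum(np.maximum(z, gamma) ** 2))

# Minimize the risk estimate over a grid of gamma >= 0
grid = np.linspace(0.0, np.abs(beta_ols).max() / tau, 2001)
risks = [risk_estimate(g) for g in grid]
gamma_hat = grid[int(np.argmin(risks))]

# LASSO parameter: t = sum_j (|b_j| - gamma)_+, gamma back on the raw scale
t_hat = np.sum(np.maximum(np.abs(beta_ols) - gamma_hat * tau, 0.0))
```

The minimizer sits just above the standardized size of the small coefficients, so they are thresholded away while the two large coefficients survive almost intact.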
Algorithm for finding LASSO solutions
We fix t ≥ 0. The minimization problem of

$$\sum_{i=1}^{N} \Big( y_i - \sum_j \beta_j x_{ij} \Big)^2$$

subject to $\sum_j |\beta_j| \le t$ can be seen as a least squares problem with $2^p$ inequality constraints.

Denote by G an $m \times p$ matrix corresponding to the m linear inequality constraints on the p-vector β; for our problem, $m = 2^p$.

Denote $g(\beta) = \sum_{i=1}^{N} (y_i - \sum_j \beta_j x_{ij})^2$.

The set E is the equality set, corresponding to those constraints which are exactly met.
Algorithm for finding LASSO solutions

Outline of the algorithm

1 Start with E = {i0}, where δ_{i0} = sign(β̂0), the sign vector of the overall least squares estimate β̂0
2 Find β̂ to minimize g(β) subject to G_E β ≤ t1
3 While ∑_j |β̂_j| > t,
4 add i to the set E, where δ_i = sign(β̂), and find β̂ to minimize g(β) = ∑_{i=1}^N (y_i − ∑_j β_j x_{ij})² subject to G_E β ≤ t1

This procedure must converge in a finite number of steps, since one element is added to the set E at each step and there are only 2^p distinct sign vectors in total.
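A compact sketch of this active-set scheme (NumPy and SciPy assumed; the generic `SLSQP` solver stands in for the constrained least squares routine used in the paper, so this is an illustration rather than the original implementation):

```python
import numpy as np
from scipy.optimize import minimize

def lasso_active_set(X, y, t, tol=1e-8):
    """Minimize g(b) = sum_i (y_i - sum_j b_j x_ij)^2 s.t. sum_j |b_j| <= t,
    adding violated sign constraints delta^T b <= t one at a time."""
    g = lambda b: np.sum((y - X @ b) ** 2)
    b = np.linalg.lstsq(X, y, rcond=None)[0]          # start from the OLS fit
    E = [np.sign(b)]                                  # E = {i0}, delta_i0 = sign(OLS)
    for _ in range(2 ** X.shape[1]):                  # at most 2^p sign constraints
        if np.abs(b).sum() <= t + tol:                # l1 constraint satisfied: done
            break
        cons = [{"type": "ineq", "fun": lambda b, d=d: t - d @ b} for d in E]
        b = minimize(g, b, method="SLSQP", constraints=cons).x
        E.append(np.sign(b))                          # add the new sign constraint
    return b

# Tiny worked example: OLS gives b = (2, 1), so sum|b| = 3 exceeds t = 1.5
# and one constrained solve shrinks it onto the constraint boundary.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([2.0, 1.0, 3.0])
b = lasso_active_set(X, y, t=1.5)
```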
Least angle regression algorithm (Efron 2004)

Least Angle Regression Algorithm

1 Standardize the predictors to have mean zero and unit norm. Start with the residual r = y − ȳ and β1, . . . , βp = 0.
2 Find the predictor xj most correlated with r.
3 Move βj from 0 towards its least-squares coefficient ⟨xj, r⟩, until some other competitor xk has as much correlation with the current residual as does xj.
4 Move βj and βk in the direction defined by their joint least-squares coefficient of the current residual on (xj, xk), until some other competitor xl has as much correlation with the current residual.
5 If a non-zero coefficient hits zero, drop its variable from the active set of variables and recompute the current joint least-squares direction.
6 Continue in this way until all p predictors have been entered. After min(N − 1, p) steps, we arrive at the full least-squares solution.
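The steps above are implemented in scikit-learn's `lars_path`; a quick illustration on synthetic data (scikit-learn assumed):

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([3.0, 1.5, 0.0, 2.0]) + rng.normal(size=50)

# method="lasso" adds the drop-a-variable modification of step 5,
# so the computed path coincides with the LASSO solution path.
alphas, active, coefs = lars_path(X, y, method="lasso")
# `active`: order in which predictors entered the model;
# `coefs`: one column of coefficients per breakpoint of the path.
```

Since here N > p, the last column of `coefs` is the full least-squares solution, matching step 6.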
Simulation
In the example, 50 data sets, each consisting of 20 observations, were simulated from the model

y = β^T x + σε,

where β = (3, 1.5, 0, 0, 2, 0, 0, 0)^T and ε is standard normal.

[Figure: mean-squared errors over 200 simulations from the model]
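A rough re-creation of this setup (scikit-learn assumed; i.i.d. standard-normal predictors and an arbitrary penalty α = 1.0, whereas the paper uses correlated predictors and a cross-validated bound t, so the numbers are only qualitatively comparable):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
sigma = 3.0                                   # noise s.d. (sigma = 3 in the paper's example)
rng = np.random.default_rng(1)
err_ols, err_lasso = [], []
for _ in range(50):                           # 50 simulated data sets of 20 observations
    X = rng.normal(size=(20, 8))              # simplified i.i.d. design
    y = X @ beta + sigma * rng.normal(size=20)
    b_ols = LinearRegression(fit_intercept=False).fit(X, y).coef_
    b_l1 = Lasso(alpha=1.0, fit_intercept=False).fit(X, y).coef_
    err_ols.append(np.sum((b_ols - beta) ** 2))
    err_lasso.append(np.sum((b_l1 - beta) ** 2))

print(np.mean(err_ols), np.mean(err_lasso))   # average coefficient estimation error
```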
Simulation
[Tables: most frequent models selected by the LASSO, and most frequent models selected by subset regression]
Conclusions
The LASSO is a worthy competitor to subset selection and ridge regression.

Performance in different scenarios:

Small number of large effects: subset selection does best, the LASSO not quite as well, and ridge regression quite poorly.
Small to moderate number of moderate-size effects: the LASSO does best, followed by ridge regression and then subset selection.
Large number of small effects: ridge regression does best, followed by the LASSO and then subset selection.
References
Robert Tibshirani (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B 58(1), 267–288.

Bradley Efron, Trevor Hastie, Iain Johnstone, Robert Tibshirani (2004). Least Angle Regression. The Annals of Statistics 32(2), 407–499.

Trevor Hastie, Robert Tibshirani, Jerome Friedman (2008). The Elements of Statistical Learning. Springer-Verlag, 57–73.

Abhimanyu Das, David Kempe. Algorithms for Subset Selection in Linear Regression.

Yizao Wang (2007). A Note on the LASSO in Model Selection.
The End