Abstract

Key words: $l_1$-NORM CONSTRAINT; LASSO; VARIABLE SELECTION; SUBSET SELECTION; BAYESIAN LOGISTIC REGRESSION

LASSO is an innovative variable selection method for regression. Variable selection in regression is extremely important when we have a large collection of possible covariates from which we hope to select a parsimonious set for the efficient prediction of a response variable. LASSO minimizes the residual sum of squares subject to the sum of the absolute values of the coefficients being less than a constant. LASSO not only helps to improve prediction accuracy when dealing with multicollinear data, but also carries several nice properties such as interpretability and numerical stability. This project also includes a couple of numerical approaches for solving the LASSO. Case studies and simulations are presented for both the linear and logistic LASSO models and are implemented in R; the source code is available in the appendix.

Acknowledgements

I would like to express my special thanks to my advisor, Dr. Kang James. She found this interesting topic for me and helped me overcome the difficulties throughout this research. She taught me how to do research and what attitude and characteristics will lead me to success and happiness. She also helped me develop my presentation skills and was very encouraging. I thank Drs. Barry James and Yongcheng Qi for being my committee members and helping me with my thesis defense. I would like to thank my friend Lindsey Dietz and my boyfriend Brad Jannsen for their help in proofreading my thesis. I would also like to acknowledge the support of my parents in China, who always motivated me to work hard.

Table of Contents

Acknowledgements
Abstract
Chapter 1 Introduction
Chapter 2 LASSO application in the Linear Regression Model
  2.1 The principle of the generalized linear LASSO model
  2.2 Geometrical Interpretation of LASSO
  2.3 Prediction Error and Mean Square Error
  2.4 CV/GCV methods and Estimate of LASSO parameter
  2.5 Standard Error of LASSO Estimators
  2.6 Case Study: Diabetes data
  2.7 Simulation
Chapter 3 LASSO application in the Logistic Regression Model
  3.1 Logistic LASSO model and Maximum likelihood estimate
  3.2 Maximum likelihood estimate and Maximum a posteriori estimate
  3.3 Case study: Kyphosis data
  3.4 Simulation
Chapter 4 Computation of LASSO
  4.1 Osborne's dual algorithm
  4.2 Least Angle Regression algorithm
References
Appendix

Chapter 1 Introduction

A "lasso" is usually recognized as a loop of rope that is designed to be thrown around a target and tightened when pulled. It is a well-known tool of the American cowboy. In this context, it is fittingly used as a metaphor for the $l_1$ constraint applied to the linear model. Coincidentally, LASSO is also an acronym for Least Absolute Shrinkage and Selection Operator.

Consider the usual linear regression model with data $(x_{i1}, x_{i2}, \ldots, x_{ip}, y_i)$, $i = 1, \ldots, n$, where the $x_{ij}$'s are the regressors and $y_i$ is the response variable of the $i$th observation. The ordinary least squares (OLS) regression method finds the unbiased linear combination of the $x_{ij}$'s that minimizes the residual sum of squares. However, if $p$ is large or the regressors are highly correlated (multicollinearity), OLS may yield estimates with large variance, which reduces the accuracy of the prediction. Widely known methods to address this problem are ridge regression and subset selection. As an alternative to these techniques, Robert Tibshirani (1996) presented the LASSO, which minimizes the residual sum of squares subject to the sum of the absolute values of the coefficients being less than a constant:

$$\hat{\beta}^L = \arg\min_{\beta} \left\{ \sum_{i=1}^{N} \Big( y_i - \alpha - \sum_j \beta_j x_{ij} \Big)^2 \right\} \quad (1.1)$$

subject to

$$\sum_{j=1}^{p} \big|\hat{\beta}_j^L\big| \le t \quad \text{(a constant)}. \quad (1.2)$$

If $t > \sum_{j=1}^{p} |\hat{\beta}_j^o|$, then the LASSO algorithm yields the same estimate as OLS. However, if $0 < t < \sum_{j=1}^{p} |\hat{\beta}_j^o|$, then the problem is equivalent to

$$\hat{\beta}^L = \arg\min_{\beta} \left( \sum_{i=1}^{N} \Big( y_i - \alpha - \sum_j \beta_j x_{ij} \Big)^2 + \lambda \sum_j |\beta_j| \right), \quad (1.3)$$

where $\lambda > 0$. It will be shown later that the relation between $\lambda$ and the LASSO parameter $t$ is one-to-one. Due to the nature of the constraint, LASSO tends to set some coefficients exactly to zero. Compared to OLS, whose estimator $\hat{\beta}^o$ is an unbiased estimator of $\beta$, both ridge regression and LASSO sacrifice a little bias to reduce the variance of the predicted values and improve the overall prediction accuracy.

In this project, I focus on two main aspects of this topic. In Chapter 2, definitions and principles of the LASSO in both the generalized linear case and the orthogonal case are discussed. In Chapter 3, I illustrate the principles of the logistic LASSO model and give a couple of examples. Chapter 4 introduces existing algorithms for computing LASSO estimates, and I study two main numerical algorithms.

Chapter 2 Linear LASSO Model

2.1 The principle of the generalized linear LASSO model

Let

$$G(\beta, \lambda, X, Y) = \sum_i \Big( y_i - \sum_j \beta_j x_{ij} \Big)^2 + \lambda \sum_j |\beta_j|;$$

G can also be written in matrix form as

$$G(\beta, \lambda, X, Y) = (Y - X\beta)^T (Y - X\beta) + \lambda \sum_{j=1}^{p} |\beta_j|. \quad (2.1.1)$$

Minimizing G gives the LASSO estimate of $\beta$, denoted $\hat{\beta}^L$.

Let $I_p$ denote the $p \times p$ identity matrix, let $|\tilde{\beta}|$ be the diagonal matrix with $j$th diagonal element $|\beta_j|$, and let $X$ be the design matrix:

$$|\tilde{\beta}| = \mathrm{diag}\big(|\beta_j|\big) = \begin{bmatrix} |\beta_1| & 0 & \cdots & 0 \\ 0 & |\beta_2| & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & |\beta_p| \end{bmatrix}, \qquad |\tilde{\beta}|^{-1} = \mathrm{diag}\big(|\beta_j|^{-1}\big) = \begin{bmatrix} |\beta_1|^{-1} & 0 & \cdots & 0 \\ 0 & |\beta_2|^{-1} & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & |\beta_p|^{-1} \end{bmatrix}.$$

We will use this notation throughout the rest of this paper.

First, we note that (2.1.1) can be written as

$$G(\beta, \lambda, X, Y) = Y^T Y - \beta^T X^T Y - Y^T X \beta + \beta^T X^T X \beta + \lambda \sum_{j=1}^{p} |\beta_j|. \quad (2.1.2)$$

Taking the partial derivative of (2.1.2) with respect to $\beta$ gives

$$\frac{\partial G(\beta, \lambda, X, Y)}{\partial \beta} = -2 X^T Y + 2 X^T X \beta + \lambda\, \mathrm{sign}(\beta), \quad (2.1.3)$$

where $\mathrm{sign}(\beta) = \big( \mathrm{sign}(\beta_1), \ldots, \mathrm{sign}(\beta_p) \big)^T$.

Set

$$\frac{\partial G(\beta, \lambda, X, Y)}{\partial \beta} = 0 \quad (2.1.4)$$

and solve for $\beta$. For simplicity, we assume $X$ is orthonormal, $X^T X = I$.

Let $\hat{\beta}_j^o$ denote the OLS estimate; for the multivariate case, $\hat{\beta}^o = (X^T X)^{-1} X^T Y = X^T Y$.

By solving (2.1.4), we obtain the following result:

$$\hat{\beta}_j^L = \hat{\beta}_j^o - \frac{\lambda}{2}\, \mathrm{sign}\big(\hat{\beta}_j^L\big). \quad (2.1.5)$$

Lemma 2.1. If $y = x - \frac{\lambda}{2}\,\mathrm{sign}(y)$ with $\lambda > 0$, then $x$ and $y$ share the same sign.

Proof. $y = x - \frac{\lambda}{2}\,\mathrm{sign}(y) \iff x = y + \frac{\lambda}{2}\,\mathrm{sign}(y)$. If $y$ is positive, then $y + \frac{\lambda}{2}\,\mathrm{sign}(y)$ must be positive, so $x$ is positive. Conversely, if $y$ is negative, then $x$ is negative. Thus $x$ and $y$ share the same sign. ■

According to Lemma 2.1 and (2.1.5), $\mathrm{sign}\big(\hat{\beta}_j^L\big) = \mathrm{sign}\big(\hat{\beta}_j^o\big)$. Then (2.1.5) becomes

$$\hat{\beta}_j^L = \hat{\beta}_j^o - \frac{\lambda}{2}\,\mathrm{sign}\big(\hat{\beta}_j^o\big) = \begin{cases} \hat{\beta}_j^o - \frac{\lambda}{2} & \text{if } \hat{\beta}_j^o \ge 0, \\ \hat{\beta}_j^o + \frac{\lambda}{2} & \text{if } \hat{\beta}_j^o < 0, \end{cases} \;=\; \Big( \hat{\beta}_j^o - \frac{\lambda}{2} \Big) I\big[\hat{\beta}_j^o \ge 0\big] + \Big( \hat{\beta}_j^o + \frac{\lambda}{2} \Big) I\big[\hat{\beta}_j^o < 0\big]. \quad (2.1.6)$$

It follows that

$$\hat{\beta}_j^L = \mathrm{sign}\big(\hat{\beta}_j^o\big) \Big( \big|\hat{\beta}_j^o\big| - \frac{\lambda}{2} \Big)^{+}, \quad j = 1, \ldots, p, \quad (2.1.7)$$

where $(\cdot)^{+}$ denotes the positive part of the expression inside the parentheses.
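As a concrete illustration, (2.1.7) is just soft thresholding and can be written in a few lines of R. This is a sketch only; the function name soft_threshold and the numerical values below are illustrative and not part of the thesis code.

# Soft-thresholding rule (2.1.7) for the orthonormal case:
# beta^L_j = sign(beta^o_j) * (|beta^o_j| - lambda/2)^+
soft_threshold <- function(beta_ols, lambda) {
  sign(beta_ols) * pmax(abs(beta_ols) - lambda / 2, 0)
}

beta_ols <- c(3.0, -1.5, 0.4, -0.2, 2.0)   # illustrative OLS estimates
soft_threshold(beta_ols, lambda = 1)       # entries with |beta^o_j| < lambda/2 = 0.5 become exactly 0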

Fig 2.1 Form of coefficient shrinkage of LASSO in the orthonormal case. LASSO tends to produce zero coefficients.

Intuitively, $\lambda/2$ is the threshold for nonzero estimates; in other words, if the corresponding $|\hat{\beta}_j^o|$ is less than the threshold, we set $\hat{\beta}_j^L$ equal to zero.

The parameter $\lambda$ can be computed by solving the equations

$$\hat{\beta}_j^L = \mathrm{sign}\big(\hat{\beta}_j^o\big) \Big( \big|\hat{\beta}_j^o\big| - \frac{\lambda}{2} \Big)^{+}, \quad (2.1.8)$$

$$\sum_j \big|\hat{\beta}_j^L\big| = t, \quad (2.1.9)$$

where $\lambda$ is chosen so that $\sum_j |\hat{\beta}_j^L| = t$. Therefore, each $\lambda$ value corresponds to a unique $t$ value. This is illustrated by solving the simplest case, $p = 2$. Without loss of generality, assume that the least squares estimates $\hat{\beta}_j^o$ are both positive. Then

$$\begin{cases} \hat{\beta}_1^L = \Big( \hat{\beta}_1^o - \frac{\lambda}{2} \Big)^{+}, \\ \hat{\beta}_2^L = \Big( \hat{\beta}_2^o - \frac{\lambda}{2} \Big)^{+}, \end{cases} \quad (2.1.10) \qquad \text{and} \qquad \hat{\beta}_1^L + \hat{\beta}_2^L = t. \quad (2.1.11)$$

First solving for $\lambda$, I get

$$\lambda = \hat{\beta}_1^o + \hat{\beta}_2^o - t. \quad (2.1.12)$$

Substituting (2.1.12) into (2.1.10), I obtain the LASSO estimates

$$\begin{cases} \hat{\beta}_1^L = \Big( \hat{\beta}_1^o - \frac{\hat{\beta}_1^o + \hat{\beta}_2^o - t}{2} \Big)^{+} = \Big( \frac{t}{2} + \frac{\hat{\beta}_1^o - \hat{\beta}_2^o}{2} \Big)^{+}, \\ \hat{\beta}_2^L = \Big( \hat{\beta}_2^o - \frac{\hat{\beta}_1^o + \hat{\beta}_2^o - t}{2} \Big)^{+} = \Big( \frac{t}{2} - \frac{\hat{\beta}_1^o - \hat{\beta}_2^o}{2} \Big)^{+}. \end{cases} \quad (2.1.13)$$

When extended to the general case (with all LASSO coefficients nonzero),

$$t = \sum_{j=1}^{p} \mathrm{sign}\big(\hat{\beta}_j^o\big)\, \hat{\beta}_j^o - \frac{p\lambda}{2} = \sum_{j=1}^{p} \big|\hat{\beta}_j^o\big| - \frac{p\lambda}{2}. \quad (2.1.14)$$

Now it can be seen that the relationship between $\lambda$ and the LASSO parameter $t$ is one-to-one. Unfortunately, the LASSO estimate does not usually have a closed-form solution. However, statisticians have developed several numerical approximation algorithms, which will be discussed in Chapter 4.

2.2 Geometrical Interpretation of LASSO

Fig 2.2.1 A geometrical interpretation of LASSO in two and three dimensions. The left panel is from Tibshirani (1996); the right panel is from Meinshausen (2008).

Fig 2.2.1 illustrates the geometry of LASSO regression in the two- and three-dimensional cases. In the left panel, the center point of the ellipses is $\hat{\beta}^o$ (the OLS estimate), and each elliptical contour corresponds to a specific value of the residual sum of squares. The area inside the square around the origin satisfies the LASSO restriction: any $\beta = (\beta_1, \ldots, \beta_p)^T$ inside the black square satisfies the constraint $\sum_j |\beta_j| \le t$. Minimizing the residual sum of squares subject to the constraint therefore corresponds to finding the first contour that touches the constraint region. The LASSO solution is the first place the contours touch the square; this will sometimes occur at a corner (due to the shape of the constraint region), corresponding to a zero coefficient. The same happens in three dimensions: in the right panel of Fig 2.2.1, the contour touches a face of the diamond-shaped constraint region lying in the $(\beta_1, \beta_2)$ plane, so the variable $\beta_3$ is assigned the value zero.

2.3 Prediction Error and Mean Square Error

Suppose that $Y = \eta(x) + \varepsilon$, where $\hat{\eta}(x) = X\hat{\beta}$ is an estimate of $\eta(x) = X\beta$, and $\varepsilon$ is a normally distributed random error with $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$.

The mean squared error of an estimate $\hat{\eta}(x)$ is defined by

$$MSE = E\big[\hat{\eta}(x) - \eta(x)\big]^2. \quad (2.3.1)$$

It is hard to estimate MSE because $\eta(x) = X\beta$ is unknown. However, the prediction error is easier to estimate and is closely related to MSE:

$$PSE = E\big[Y - \hat{\eta}(x)\big]^2 = MSE + \sigma^2. \quad (2.3.2)$$

Lemma 2.3. $PSE = MSE + \sigma^2$.

Proof.

$$PSE = E\big[Y - \hat{\eta}(x)\big]^2 = E\big[Y - \eta(x) + \eta(x) - \hat{\eta}(x)\big]^2 = E\big[\hat{\eta}(x) - \eta(x)\big]^2 + 2\,E\big[(\eta(x) - \hat{\eta}(x))(Y - \eta(x))\big] + E\big[Y - \eta(x)\big]^2.$$

The first and third terms are $E\big[\hat{\eta}(x) - \eta(x)\big]^2 = MSE$ and $E\big[Y - \eta(x)\big]^2 = \sigma^2$, and the middle term is $E\big[(\eta(x) - \hat{\eta}(x))(Y - \eta(x))\big] = E\big[(\eta(x) - \hat{\eta}(x))\,\varepsilon\big] = 0$.

Thus $PSE = MSE + \sigma^2$. ■ Minimizing PSE is therefore equivalent to minimizing MSE. In the next section, I introduce how to obtain the optimal LASSO parameter by minimizing an estimate of the prediction error.

2.4 CV/GCV methods and Estimate of the LASSO parameter

2.4.1 Relative Bound and Absolute Bound

The tuning parameter $t$ is called the LASSO parameter and is also referred to as the absolute bound: $t = \sum_{j=1}^{p} |\hat{\beta}_j^L|$. Here I define another parameter, $s$, the relative bound:

$$s = \frac{\sum_{j=1}^{p} \big|\hat{\beta}_j^L\big|}{\sum_{j=1}^{p} \big|\hat{\beta}_j^o\big|}, \quad s \in [0, 1]. \quad (2.4.1)$$

The relative bound can be seen as a normalized version of the LASSO parameter. Two methods are mentioned in Tibshirani (1996) to compute the best $s$: N-fold cross-validation and generalized cross-validation (GCV).

2.4.2 N-fold Cross-validation

Cross-validation is a general procedure that can be applied to estimate tuning parameters in a wide variety of problems. The bias in the residual sum of squares (RSS) is a result of using the same data for model fitting and model evaluation. CV reduces this bias by splitting the data into two subsamples: a training (calibration) sample for model fitting and a test (validation) sample for model evaluation. The idea behind cross-validation is to recycle the data by switching the roles of the training and test samples.

Denote the optimal $s$ by $\hat{s}$. The prediction error of the LASSO procedure can be estimated by ten-fold cross-validation: the LASSO is indexed in terms of $s$, and the prediction error is estimated over a grid of values of $s$ from 0 to 1 inclusive. We wish to predict with small variance, so we choose the constraint $s$ as small as we can; the value $\hat{s}$ that achieves the minimum estimated prediction error of $\hat{\eta}(x)$ is selected (Tibshirani 1996).

For example, consider the acetylene data from Marquardt et al. (1975). The response variable y is the percentage of conversion of n-heptane to acetylene. The three explanatory variables are:

T = reactor temperature (°C);
H = ratio of H2 to n-heptane (mole ratio);
C = contact time (sec).

T and C are highly correlated; the correlation matrix is

       T          H          C
T      1          0.223628   -0.9582
H      0.223628   1          -0.24023
C      -0.9582    -0.24023   1

Accordingly, OLS yields very large variances for the estimates. If we use the LASSO method instead, we obtain the GCV score plot in Fig 2.4.1 and choose $\hat{s} = 0.25$.

Fig 2.4.1 Acetylene data: GCV score versus s. (GCV is described in the next section.)

However, sometimes the data may yield the minimum prediction error at $s = 1$, which means the OLS estimate is also the LASSO estimate. In that case, we choose the elbow of the prediction error curve as the corresponding $\hat{s}$.

Fig 2.4.2 Prediction error as a function of the relative bound s, from the diabetes example.

An N-fold cross-validation selects a model as follows (a minimal R sketch follows the list).

1. Split the whole data set into N disjoint subsamples $S_1, \ldots, S_N$.
2. For $v = 1, \ldots, N$, fit the model to the training sample $\cup_{i \ne v} S_i$ and compute the discrepancy $d_v(s)$ using the test sample $S_v$: $d_v(s) = \big[ PE_{\cup_{i \ne v} S_i} - PE_{S_v} \big]^2$.
3. Find the optimal $s$ as the minimizer of the overall discrepancy $d(s) = \sum_{v=1}^{N} d_v(s)$.
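A minimal sketch of this procedure using the cv.lars function from the lars package (the same call pattern as in Appendix B); the simulated x and y below are illustrative only.

library(lars)

# Ten-fold cross-validation of the LASSO over a grid of relative bounds s in [0, 1].
set.seed(1)
x <- matrix(rnorm(100 * 8), ncol = 8)
y <- drop(x %*% c(3, 1.5, 0, 0, 2, 0, 0, 0)) + rnorm(100, sd = 3)

s.grid <- seq(0, 1, length = 40)
cv.out <- cv.lars(x, y, K = 10, fraction = s.grid)   # estimated prediction error at each s
s.hat  <- s.grid[which.min(cv.out$cv)]               # s with the smallest estimated prediction error
s.hat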

The drawback of N-fold cross-validation is its lack of efficiency. Suppose I try to optimize $s$ over a grid of 40 values ranging from 0 to 1. Using five-fold cross-validation, I need to run the LASSO computation procedure 200 times, which takes a long time. The inefficiency becomes especially significant when the sample size and the number of variables are large. For this reason, Tibshirani (1996) adopted another criterion, generalized cross-validation (GCV), which has a significant computational advantage over N-fold cross-validation.

2.4.3 Generalized Cross-validation

The generalized cross-validation (GCV) criterion was first proposed by Craven and Wahba (1979).

We have already proved in the previous section that

$$\hat{\beta}_j^L = \mathrm{sign}\big(\hat{\beta}_j^o\big) \Big( \big|\hat{\beta}_j^o\big| - \frac{\lambda}{2} \Big)^{+}, \quad j = 1, \ldots, p. \quad (2.1.7)$$

Rewriting this in matrix form,

$$\hat{\beta}^L = \Big( X^T X + \frac{\lambda}{2}\, |\tilde{\beta}|^{-1} \Big)^{-1} X^T Y. \quad (2.4.2)$$

Therefore the effective number of parameters (Appendix A) in the model is

$$d(\lambda) = \mathrm{Tr}\left[ X \Big( X^T X + \frac{\lambda}{2}\, |\tilde{\beta}|^{-1} \Big)^{-1} X^T \right]. \quad (2.4.3)$$

We construct the generalized cross-validation statistic for the LASSO as

$$GCV(\lambda) = \frac{1}{n}\, \frac{\sum_{i=1}^{n} \big( y_i - x_i \hat{\beta}^L \big)^2}{\big[ 1 - d(\lambda)/n \big]^2}. \quad (2.4.4)$$
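As a sanity check, (2.4.2)-(2.4.4) can be evaluated directly with a few matrix operations. The sketch below is illustrative only (the function name gcv_lasso is not part of the thesis code) and assumes the current LASSO estimate beta_L has no zero components, so that $|\tilde{\beta}|^{-1}$ exists.

# GCV statistic (2.4.4), computed from (2.4.2)-(2.4.3).
# X: n x p design matrix; y: response; beta_L: current LASSO estimate (no zeros);
# lambda: penalty value corresponding to the chosen bound.
gcv_lasso <- function(X, y, beta_L, lambda) {
  n     <- nrow(X)
  W_inv <- diag(1 / abs(beta_L), nrow = length(beta_L))   # |beta~|^{-1}
  A     <- solve(t(X) %*% X + (lambda / 2) * W_inv)       # (X'X + (lambda/2)|beta~|^{-1})^{-1}
  d     <- sum(diag(X %*% A %*% t(X)))                    # effective number of parameters (2.4.3)
  rss   <- sum((y - X %*% beta_L)^2)
  (rss / n) / (1 - d / n)^2                               # GCV(lambda)
}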

Compared to N-fold cross-validation, GCV is more efficient. Suppose I again evaluate $s$ over a grid of 40 values ranging from 0 to 1; using GCV, the LASSO computation procedure only needs to run 40 times. GCV has a drawback as well: a major difficulty lies in evaluating the criterion, which requires the trace of an inverse matrix. This becomes especially costly when dealing with large-scale data. However, that is another area of study which will not be discussed here.

2.5 Standard Error of LASSO Estimators

The LASSO estimate (2.1.5) can also be written as $\hat{\beta}^L = \hat{\beta}^o - \frac{\lambda}{2}\,\mathrm{sign}(\hat{\beta}^o)$, so that

$$\hat{\beta}^L = \hat{\beta}^o - \frac{\lambda}{2}\, |\tilde{\beta}^o|^{-1} \hat{\beta}^o = \Big( X^T X + \frac{\lambda}{2}\, |\tilde{\beta}|^{-1} \Big)^{-1} X^T Y. \quad (2.5.1)$$

Hence

$$\mathrm{Var}\big(\hat{\beta}^L\big) = \mathrm{Var}\left[ \Big( X^T X + \frac{\lambda}{2}\, |\tilde{\beta}|^{-1} \Big)^{-1} X^T Y \right] = \Big( X^T X + \frac{\lambda}{2}\, |\tilde{\beta}|^{-1} \Big)^{-1} X^T\, \mathrm{Var}(Y)\, X \Big( X^T X + \frac{\lambda}{2}\, |\tilde{\beta}|^{-1} \Big)^{-1} = \sigma^2 \Big( X^T X + \frac{\lambda}{2}\, |\tilde{\beta}|^{-1} \Big)^{-1} X^T X \Big( X^T X + \frac{\lambda}{2}\, |\tilde{\beta}|^{-1} \Big)^{-1}, \quad (2.5.2)$$

where $\mathrm{Var}(Y) = \sigma^2 I$, and $\sigma^2$ is replaced in practice by an estimate of the error variance. In my research, I usually use the bootstrap (repeated resampling from the original data set with replacement) to estimate the standard errors of the LASSO estimates.
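A minimal sketch of this bootstrap, condensed from the resampling loop in Appendix B (l1ce is from the lasso2 package; dat stands for a data frame whose response column is named y, and bound = 0.4 is the value used in the diabetes example):

library(lasso2)

# Case-resampling bootstrap for the standard errors of the LASSO coefficients.
boot_se <- function(dat, bound = 0.4, nboot = 500) {
  coefs <- replicate(nboot, {
    idx <- sample(nrow(dat), replace = TRUE)               # resample rows with replacement
    coefficients(l1ce(y ~ ., data = dat[idx, ], bound = bound))
  })
  apply(coefs, 1, sd)                                      # SD of each coefficient across resamples
}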

2.6 Case Study: Diabetes data

This data set was originally used by Efron, Hastie, Johnstone and Tibshirani (2003). Table 2.6.1 shows a small portion of the data. There are ten baseline variables: age, sex, body mass index (bmi), average blood pressure (BP), and six blood serum measurements. N = 442 diabetes patients were measured. The response variable y is a quantitative measure of disease progression one year after baseline.

Obs   AGE   SEX   BMI    BP      S1    S2     S3   S4     S5      S6    Y
1     59    2     32.1   101     157   93.2   38   4      4.8598  87    151
2     48    1     21.6   87      183   103.2  70   3      3.8918  69    75
3     72    2     30.5   93      156   93.6   41   4      4.6728  85    141
4     24    1     25.3   84      198   131.4  40   5      4.8903  89    206
5     50    1     23     101     192   125.4  52   4      4.2905  80    135
6     23    1     22.6   89      139   64.8   61   2      4.1897  68    97
…     …     …     …      …       …     …      …    …      …       …     …
438   60    2     28.2   112     185   113.8  42   4      4.9836  93    178
439   47    2     24.9   75      225   166    42   5      4.4427  102   104
440   60    2     24.9   99.67   162   106.6  43   3.77   4.1271  95    132
441   36    1     30     95      201   125.2  42   4.79   5.1299  85    220
442   36    1     19.6   71      250   133.2  97   3      4.5951  92    57

Table 2.6.1 Data structure of the diabetes data

Ideally, the model would produce accurate baseline predictions of the response for future patients, and the form of the model would also suggest which covariates are important factors in diabetes progression.

First, I evaluate the LASSO model on a grid of 40 values of $s$. The optimal LASSO parameter $\hat{s}$ can be read off the GCV scores (see Fig 2.6.1).

Fig 2.6.1 GCV score as a function of the relative bound s; $\hat{s} = 0.4$.

From Fig 2.6.1, one can see that the GCV score decreases rapidly until $s = 0.4$. Thus I pick 0.4 as the optimized $s$ value.

To see more clearly how LASSO shrinks and selects the coefficients, the LASSO estimates are plotted as a function of the relative bound $s$. Intuitively, every coefficient is squeezed to zero as $s$ goes to zero.

1: age; 2: sex; 3: BMI; 4: BP; 5-10: the six blood serum measurements

Fig 2.6.2 LASSO coefficient shrinkage in the diabetes example: each curve represents a coefficient as a function of the relative bound s. The vertical lines show the s value at which each coefficient shrinks to zero. The covariates enter the regression equation sequentially as s increases, in the order i = 3, 9, 4, 7, 2, 10, 5, 8, 6, 1. If s = 0.4 as chosen by GCV (the vertical red line), only the covariates that have entered to the left of the red line, 3 (BMI), 9 (S5), 4 (BP), 7 (S3) and 2 (sex), are assigned nonzero coefficients.

Second, I evaluate the diabetes data at $\hat{s} = 0.4$. The table below shows the LASSO estimates and the OLS estimates. The standard errors (SE) were estimated by bootstrap resampling of residuals from the original data set. LASSO chooses sex, bmi, BP, S3 and S5. Notice that LASSO yields smaller SEs for sex, bmi, BP, S3 and S5 than OLS does, which shows that LASSO estimates these coefficients with more precision. The table also shows a tendency for the LASSO estimates to be smaller in magnitude than the OLS estimates; this is due to the nature of the constraint, under which the estimates are shrunk toward zero by a threshold value. Also, the Cp statistic of OLS is 11 while the Cp statistic of LASSO is 5. The Cp results illustrate that LASSO also has some of the good properties of subset selection.

Predictor    LASSO Coef   LASSO SE    LASSO Z-score   LASSO Pr(>|t|)   OLS Coef    OLS SE     OLS Z-score   OLS Pr(>|t|)
Intercept    2.6461       0.30737     8.608846        0                152.133     2.576      59.061        < 2e-16 ***
age          0            51.6555     0               1                -10.012     59.749     -0.168        0.867
sex          -52.5341     56.18816    -0.93497        0.174903         -239.819    61.222     -3.917        0.000104 ***
bmi          509.6485     57.26192    8.900304        0                519.84      66.534     7.813         4.30E-14 ***
BP           221.3422     55.14417    4.013882        0.00003          324.39      65.422     4.958         1.02E-06 ***
S1           0            106.5671    0               1                -792.184    416.684    -1.901        0.057947 .
S2           0            89.25398    0               1                476.746     339.035    1.406         0.160389
S3           -153.097     69.5406     -2.20155        0.013848         101.045     212.533    0.475         0.634721
S4           0            77.67842    0               1                177.064     161.476    1.097         0.273456
S5           447.3803     73.6392     6.075301        0                751.279     171.902    4.37          1.56E-05 ***
S6           0            45.80782    0               1                67.625      65.984     1.025         0.305998

Significance codes for p-values: "***" 0, "**" 0.001, "*" 0.01, "." 0.05.
Table 2.6.2 Results from the diabetes data example

2.7 Simulation Study

2.7.1 Autoregressive model

We use an AR(1) model to generate correlated covariates. The AR(1) model is $x_t = \phi_0 + \phi_1 x_{t-1} + \varepsilon_t$, where $\varepsilon_1, \varepsilon_2, \ldots$ is an i.i.d. sequence with mean 0 and variance $\sigma^2$. Assuming weak stationarity, it is easy to see that $E(x_t) = \phi_0 + \phi_1 E(x_{t-1})$. Substituting $\mu$ for $E(x_t)$ gives $\mu = \phi_0 + \phi_1 \mu$, so

$$\mu = \frac{\phi_0}{1 - \phi_1}. \quad (2.7.1)$$

The autocovariance can be derived as follows:

$$x_t = (1 - \phi_1)\mu + \phi_1 x_{t-1} + \varepsilon_t \iff x_t - \mu = \phi_1 (x_{t-1} - \mu) + \varepsilon_t,$$

$$r_l = \mathrm{Cov}(x_t, x_{t-l}) = E\big[(x_t - \mu)(x_{t-l} - \mu)\big] = E\big[\big(\phi_1 (x_{t-1} - \mu) + \varepsilon_t\big)(x_{t-l} - \mu)\big] = \phi_1 E\big[(x_{t-1} - \mu)(x_{t-l} - \mu)\big] = \phi_1 r_{l-1} = \phi_1^l r_0.$$

Since $\rho_{ij}$ denotes the correlation between $x_i$ and $x_j$, writing $\phi = \phi_1$ leads to

$$\rho_{i, i-l} = \frac{r_l}{r_0} = \phi^l. \quad (2.7.2)$$

The signal-to-noise ratio (SNR) is defined as $SNR = \beta^T X / \sigma$. To achieve a given SNR, I set the initial value of the sequence to $x_0 = \mu$; then $\mu$ and $x_0$ can easily be calculated. The details of the calculation are shown in the next section.

The details of calculation will be shown in the next section.

2.7.2 Simulation I

The model is $y = X\beta + \varepsilon$.

I generated 1000 observations with 8 correlated variables. The parameters are designed to be $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)^T$. $\varepsilon$ is a random error with normal distribution $N(0, \sigma^2)$. The correlation between $x_i$ and $x_j$ is $\phi^{|i-j|}$ with $\phi = 0.5$ and $\sigma = 3$; the generated data yield an SNR of approximately 5.7. In this example, $\phi = 0.5$, $l = |i - j|$, and $\rho_{ij} = \rho_l = \phi^{|i-j|}$.

According to $x_t = (1 - \phi_1)\mu + \phi_1 x_{t-1} + \varepsilon_t$, taking expectations we have

$$E(x_1) = 0.5\mu + 0.5 x_0,$$
$$E(x_2) = 0.5\mu + 0.5 E(x_1) = 0.5\mu + 0.5(0.5\mu + 0.5 x_0) = 0.75\mu + 0.25 x_0,$$
$$\vdots$$
$$E(x_5) = 0.9375\mu + 0.0625 x_0.$$

If I wish to obtain SNR = 5.7 with $\sigma = 3$, then $\beta^T X = 5.7 \times 3 \approx 17$. Note that, in expectation,

$$\beta^T X = 3 x_1 + 1.5 x_2 + 2 x_5 = 3(0.5\mu + 0.5 x_0) + 1.5(0.75\mu + 0.25 x_0) + 2(0.9375\mu + 0.0625 x_0),$$

where $x_0$ is the initial value of the sequence, which can be set equal to $\mu$. Solving for $\mu$ gives $\mu = x_0 = 2.6153$.

To generalize the procedure,

$$SNR \times \sigma = \Big( \sum_{j=1}^{p} \beta_j \Big) \times \mu, \qquad \mu = x_0 = \frac{SNR \times \sigma}{\sum_{j=1}^{p} \beta_j}. \quad (2.7.3)$$
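The following condensed R sketch of this data-generating scheme parallels (but simplifies) the Generator function in Appendix B; the variable names and the seed are illustrative only.

# Generate n observations with p = 8 AR(1)-correlated covariates and
# response y = X beta + sigma * error, following (2.7.1)-(2.7.3).
set.seed(1)
n     <- 1000
beta  <- c(3, 1.5, 0, 0, 2, 0, 0, 0)
phi   <- 0.5
sigma <- 3
SNR   <- 5.7
mu    <- SNR * sigma / sum(beta)        # equation (2.7.3): mu = x0 = SNR*sigma / sum(beta_j)
p     <- length(beta)

X <- matrix(0, n, p)
for (i in 1:n) {
  x_prev <- mu                          # initial value x0 = mu
  for (j in 1:p) {
    x_prev  <- (1 - phi) * mu + phi * x_prev + rnorm(1)
    X[i, j] <- x_prev
  }
}
y <- drop(X %*% beta) + sigma * rnorm(n)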

All initial values and means of the other simulation examples are determined by the same method; the other two simulations follow the same procedure. The result of Simulation I is shown in Fig 2.7.1.

Fig 2.7.1 LASSO coefficient shrinkage in Simulation I: each curve represents a coefficient as a function of the relative bound s. The covariates enter the regression equation sequentially as s increases, in the order i = 1, 5, 2, 6, …, 7. If s = 0.6 (the vertical red line), as chosen by GCV (shown in the right panel), variables 1, 5 and 2 are nonzero, which is consistent with $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)^T$.

2.7.3 Simulation II

I generated 1000 observations with 8 correlated variables. The parameters are designed to be $\beta = (0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85)^T$. The correlation between $x_i$ and $x_j$ remains $\phi^{|i-j|}$ with $\phi = 0.5$, and the standard deviation of the error is $\sigma = 3$. The generated data yield a signal-to-noise ratio (SNR) of approximately 1.8. The result of Simulation II is shown in Fig 2.7.2.

Fig 2.7.2 LASSO coefficient shrinkage in Simulation II: the GCV score plot does not have an obvious elbow; if one must be chosen, the optimal s is about 0.8. The covariates enter the regression equation sequentially as s increases, in the order i = 5, 7, 3, …, 8. If s = 0.8 (the vertical red line), all variables are nonzero, which is consistent with the assigned $\beta = (0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85)^T$.

2.7.4 Simulation III

I generated 1000 observations with 8 correlated variables. The parameters are designed to be $\beta = (6, 0, 0, 0, 0, 0, 0, 0)^T$. The correlation between $x_i$ and $x_j$ is still $\phi^{|i-j|}$ with $\phi = 0.5$, and the standard deviation of the error is $\sigma = 2$. The generated data yield a signal-to-noise ratio (SNR) of approximately 7. The result of Simulation III is shown in Fig 2.7.3.

Fig 2.7.3 LASSO coefficient shrinkage in Simulation III: the GCV score plot in the right panel has an obvious elbow, and the optimal s is approximately 0.6. The first variable enters the regression equation first; the order of entry of the other variables is hard to determine. If s = 0.6, only the first variable is nonzero, which is consistent with the assigned $\beta = (6, 0, 0, 0, 0, 0, 0, 0)^T$.

In Fig 2.7.3, because one of the variables is clearly more significant than the others, the GCV score plot shows a more pronounced elbow. In general, the relative significance of the variables determines how distinct the elbow is.

Chapter 3 Logistic LASSO model

3.1 Logistic LASSO model and Maximum likelihood estimate

3.1.1 Logistic LASSO model

The $l_1$-norm constraint of the LASSO can also be applied to logistic regression models (Genkin et al., 2004) by minimizing the corresponding negative log-likelihood instead of the residual sum of squares. The logistic LASSO model is similar to the linear LASSO model except that the $Y_i$'s can only take two possible values, 0 and 1. I assume that the response variable $Y_i$ is a Bernoulli random variable with probability

$$P(Y_i = 1 \mid x_i, \beta) = \pi_i, \quad P(Y_i = 0 \mid x_i, \beta) = 1 - \pi_i. \quad (3.1.1)$$

It is easily shown that

$$E(Y_i) = \pi_i \quad (3.1.2)$$

and

$$\sigma^2_{Y_i} = \pi_i (1 - \pi_i). \quad (3.1.3)$$

Generally, when the response variable is binary, a monotonically increasing (or decreasing) S-shaped function is employed. This function is called the logistic response function, or logistic function:

$$P(Y_i = 1 \mid x_i, \beta) = E(Y_i) = \frac{1}{1 + \exp(-x_i^T \beta)}. \quad (3.1.4)$$

Fig 3.1.1 Logistic Response function

Writing $\eta = x_i^T \beta$ for the linear predictor, we have

$$\eta = \ln \frac{\pi}{1 - \pi}. \quad (3.1.5)$$

3.1.2 Laplace Priors & LASSO Logistic Regression

We extend the ordinary logistic model to the LASSO logistic regression model by imposing an $l_1$ constraint on the parameters: $\sum_j |\beta_j| \le t$.

Tibshirani (1996) first suggested that the Bayesian view of the LASSO involves a Laplace prior distribution on $\beta$. For the LASSO logistic regression model, Genkin et al. (2004) assumed that $\beta_j$ arises from a normal distribution with mean 0 and variance $\tau_j$, that is,

$$p(\beta_j \mid \tau_j) = \frac{1}{\sqrt{2\pi\tau_j}} \exp\left( -\frac{\beta_j^2}{2\tau_j} \right). \quad (3.1.6)$$

The assumption of mean 0 reflects our belief that $\beta_j$ is close to zero. The variances $\tau_j$ are positive constants: a small value of $\tau_j$ represents a prior belief that $\beta_j$ is close to zero, while a large value of $\tau_j$ represents a less informative prior belief. In the simplest case, assume $\tau_j = \tau$ for all $j$ and that the components of $\beta$ are independent, so that the overall prior of $\beta$ is the product of the priors of the individual components $\beta_j$. Tibshirani (1996) suggested that $\tau_j$ arises from an exponential hyperprior with density

$$p(\tau_j \mid \gamma_j) = \frac{\gamma_j}{2} \exp\left( -\frac{\gamma_j \tau_j}{2} \right). \quad (3.1.7)$$

Integrating out $\tau_j$ gives the distribution of $\beta_j$ (Lemma 3.1):

$$p(\beta_j \mid \zeta_j) = \frac{\zeta_j}{2} \exp\big( -\zeta_j |\beta_j| \big), \quad \zeta_j = \sqrt{\gamma_j}. \quad (3.1.8)$$

Lemma 3.1. If $\beta$ arises from a normal distribution with mean 0 and variance $\tau$, and $\tau$ arises from an exponential prior with parameter $\gamma$, then

$$p(\beta \mid \gamma) = \frac{\sqrt{\gamma}}{2} \exp\big( -\sqrt{\gamma}\, |\beta| \big).$$

Proof.

$$p(\beta \mid \gamma) = \int_0^{\infty} p(\beta \mid \tau)\, p(\tau \mid \gamma)\, d\tau = \int_0^{\infty} \frac{1}{\sqrt{2\pi\tau}} \exp\left( -\frac{\beta^2}{2\tau} \right) \frac{\gamma}{2} \exp\left( -\frac{\gamma\tau}{2} \right) d\tau.$$

Substituting $x^2 = \frac{\gamma\tau}{2}$, so that $d\tau = \frac{2\sqrt{2\tau}}{\sqrt{\gamma}}\, dx$ and $\frac{\beta^2}{2\tau} = \frac{\beta^2 \gamma}{4 x^2}$, leads to

$$p(\beta \mid \gamma) = \frac{\sqrt{\gamma}}{\sqrt{\pi}} \int_0^{\infty} \exp\left[ -\left( x^2 + \frac{\beta^2 \gamma}{4 x^2} \right) \right] dx.$$

We know that $\int_0^{\infty} \exp\left[ -\left( x^2 + \frac{a^2}{x^2} \right) \right] dx = \frac{\sqrt{\pi}}{2} \exp(-2a)$, where here $a = \frac{|\beta|\sqrt{\gamma}}{2}$. We thus reach the designated form

$$p(\beta \mid \gamma) = \frac{\sqrt{\gamma}}{2} \exp\big( -\sqrt{\gamma}\, |\beta| \big).$$

Substituting $\zeta = \sqrt{\gamma}$, we have $p(\beta \mid \zeta) = \frac{\zeta}{2} \exp\big( -\zeta |\beta| \big)$. ■

This is the density of the double exponential (Laplace) distribution. Fig 3.1.2 plots this density together with a normal density.

Fig 3.1.2 Double exponential density function (black) and normal density function (red).

From Fig 3.1.2, one can see that the Laplace prior (double exponential distribution) favors values around 0, which indicates that LASSO favors zero as the estimate for some variables.
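Fig 3.1.2 can be reproduced with a few lines of R; the sketch below is illustrative, with the Laplace parameter ζ = 1 and a normal density matched to the same variance.

# Double exponential (Laplace) density (zeta/2)*exp(-zeta*|beta|) versus a
# normal density with the same variance (scales chosen for illustration only).
beta    <- seq(-4, 4, length = 400)
zeta    <- 1
laplace <- (zeta / 2) * exp(-zeta * abs(beta))
normal  <- dnorm(beta, mean = 0, sd = sqrt(2) / zeta)   # Laplace variance is 2/zeta^2
plot(beta, laplace, type = "l", col = "black", xlab = expression(beta), ylab = "density")
lines(beta, normal, col = "red")
# The Laplace prior puts more mass near zero, which is why LASSO favors zero estimates.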

3.2 Maximum likelihood and Maximum a Posteriori estimate

3.2.1 Maximum Likelihood (ML) Estimate

First, recall the ordinary logistic regression model; we use maximum likelihood to estimate the parameters in $x_i^T \beta$. The density function of each sample observation is

$$f_i(y_i) = \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}. \quad (3.2.1)$$

Since the observations are assumed independent, the likelihood function is

$$L(\beta, Y) = \prod_{i=1}^{n} f_i(y_i) = \prod_{i=1}^{n} \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}. \quad (3.2.2)$$

The log-likelihood function is

$$\ln L(\beta, Y) = \ln \prod_{i=1}^{n} f_i(y_i) = \sum_{i=1}^{n} \left[ y_i \ln\left( \frac{\pi_i}{1 - \pi_i} \right) + \ln(1 - \pi_i) \right] = \sum_{i=1}^{n} y_i x_i^T \beta - \sum_{i=1}^{n} \ln\big[ 1 + \exp(x_i^T \beta) \big].$$

Setting

$$\frac{d}{d\beta} \ln L(\beta, Y) = \sum_i y_i x_i^T - \sum_i \frac{\exp(x_i^T \beta)}{1 + \exp(x_i^T \beta)}\, x_i^T = 0, \quad (3.2.4)$$

we can solve (3.2.4) to obtain an estimate of $\beta$.

There is no closed-form solution for this likelihood equation. The most common optimization approach in statistical software such as SAS and S-PLUS/R is the multidimensional Newton-Raphson algorithm implemented via iteratively reweighted least squares (IRLS) (Dennis and Schnabel 1989; Hastie and Pregibon 1992). The Newton-Raphson method has the advantage of converging after very few iterations given a good initial value. For details, please refer to Introduction to Linear Regression Analysis by D. Montgomery (4th edition), Appendix C.14.
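In R, this unconstrained ML fit is what glm performs (its default fitting algorithm is IRLS). A minimal sketch with simulated data; the data and names are illustrative only.

# Ordinary (unconstrained) logistic regression fitted by IRLS, as implemented in glm().
set.seed(1)
dat   <- data.frame(x1 = rnorm(40), x2 = rnorm(40))
dat$y <- rbinom(40, 1, plogis(3 * dat$x1 + 1.5 * dat$x2))

fit <- glm(y ~ ., data = dat, family = binomial())
coef(fit)      # ML estimates, i.e. the solution of the likelihood equations (3.2.4)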

3.2.2 Maximum A Posteriori (MAP) Estimate

A detailed treatment of maximum a posteriori (MAP) estimation is given by Sorenson (1980). The MAP estimate can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. It is closely related to Fisher's method of maximum likelihood (ML), but it incorporates a prior distribution on the parameters, which allows us to impose a constraint on the problem; conversely, the ML estimate can be seen as the special case of the MAP estimate with a uniform prior. Assume that I want to estimate an unobserved population parameter $\theta$ on the basis of observations $x$. Let $f$ be the sampling distribution of $X$, so that $f(x \mid \theta)$ is the probability of $x$ when the underlying population parameter is $\theta$. Then the function $f(x \mid \theta)$ is known as the likelihood function, and the estimate

$$\hat{\theta}_{ML}(x) = \arg\max_{\theta} f(x \mid \theta) \quad (3.2.5)$$

is the maximum likelihood estimate of $\theta$.

Now assume that $\theta$ has a prior distribution $g$; we treat $\theta$ as a random variable, as in Bayesian statistics. The posterior distribution of $\theta$ is then

$$\theta \mid x \;\sim\; \frac{f(x \mid \theta)\, g(\theta)}{\int_{\Theta} f(x \mid \theta')\, g(\theta')\, d\theta'}, \quad (3.2.6)$$

where $\Theta$ is the domain of $g$.

The MAP method then estimates $\theta$ as the argument that maximizes the posterior density of this random variable:

$$\hat{\theta}_{MAP}(x) = \arg\max_{\theta} \frac{f(x \mid \theta)\, g(\theta)}{\int_{\Theta} f(x \mid \theta')\, g(\theta')\, d\theta'} = \arg\max_{\theta} f(x \mid \theta)\, g(\theta). \quad (3.2.7)$$

The denominator of the posterior distribution does not depend on $\theta$, so maximizing the posterior is equivalent to maximizing the numerator. Observe that the MAP estimate of $\theta$ coincides with the ML estimate when the prior $g$ is uniform; the MAP estimate is the Bayes estimator under a 0-1 loss function.

When MAP is applied to the LASSO, $\hat{\beta}_{MAP} = \arg\max_{\beta} f(x \mid \beta)\, p(\beta)$, which is equivalent to maximizing the log posterior:

$$\hat{\beta}_{MAP} = \arg\max_{\beta} \big[ \ln L(\beta, Y) + \ln p(\beta) \big] = \arg\max_{\beta} \left\{ \sum_{i=1}^{n} y_i x_i^T \beta - \sum_{i=1}^{n} \ln\big[ 1 + \exp(x_i^T \beta) \big] - \sum_{j=1}^{p} \big( \ln 2 - \ln \lambda_j + \lambda_j |\beta_j| \big) \right\}. \quad (3.2.8)$$

Let

$$f(\beta, x, y) = \sum_{i=1}^{n} y_i x_i^T \beta - \sum_{i=1}^{n} \ln\big[ 1 + \exp(x_i^T \beta) \big] - \sum_{j=1}^{p} \big( \ln 2 - \ln \lambda_j + \lambda_j |\beta_j| \big). \quad (3.2.9)$$

Taking the derivative of (3.2.9) with respect to $\beta$ and setting it to zero gives

$$\sum_{i=1}^{n} y_i x_i^T - \sum_{i=1}^{n} \frac{\exp(x_i^T \beta)}{1 + \exp(x_i^T \beta)}\, x_i^T - \sum_{j=1}^{p} \lambda_j\, \mathrm{sign}(\beta_j) = 0. \quad (3.2.10)$$

There is no closed-form solution for equation (3.2.10); however, a variety of alternative optimization approaches have been explored for the MAP problem, especially when p (the number of variables) is large. For further details, refer to Yuan and Lin (2006) and Genkin, Lewis and Madigan (2004). Fortunately, some numerical approaches have been implemented in statistical software, such as the R package lasso2, which will be introduced in Chapter 4.
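The logistic LASSO fit itself is available through gl1ce in the lasso2 package, which is how the kyphosis example in the next section and the Appendix B code are computed. A minimal usage sketch (the bound 0.5 is illustrative only):

library(lasso2)
library(rpart)                                   # provides the kyphosis data set
data(kyphosis)

# Recode the binary response to 0/1 as in Appendix B, then fit the
# l1-constrained logistic regression with gl1ce.
kyph <- kyphosis
kyph$Kyphosis <- as.numeric(kyph$Kyphosis == "present")
fit <- gl1ce(Kyphosis ~ Age + Number + Start, data = kyph,
             family = binomial(), bound = 0.5)
coefficients(fit)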

3.3 Case Study: Kyphosis Data

The kyphosis data set is very popular in logistic regression analysis. The data frame has 81 rows, representing 81 children who have had corrective spinal surgery. The outcome Kyphosis is a binary variable; the other three variables (columns) are numeric. The data were first published by Chambers and Hastie (1992).

Kyphosis: a factor indicating whether a postoperative deformity (kyphosis) is "present" or "absent".
Age: the age of the child (in months).
Number: the number of vertebrae involved in the operation.
Start: the beginning of the range of vertebrae involved in the operation.

The data structure looks like:

index   Kyphosis   Age   Number   Start
1       absent     71    3        5
2       absent     158   3        14
3       present    128   4        5
4       absent     2     5        1
…       …          …     …        …
74      absent     1     4        15
75      absent     168   3        18
76      absent     1     3        16
78      absent     78    6        15
79      absent     175   5        13
80      absent     80    5        16
81      absent     27    4        9

Table 3.3.1 Data structure of the kyphosis data

Fig 3.3.1 Boxplots of the kyphosis data. Age and Number do not show a strong location shift; it turns out that quadratic terms in Age and Number should be added to the model.

After adding quadratic terms for Age, Number and Start, we analyze the data using the logistic LASSO model. The results are shown in Fig 3.3.2 and Table 3.3.2.

1: intercept; 2: Age; 3: Number; 4: Start; 5: Age^2; 6: Number^2; 7: Start^2

Fig 3.3.2 LASSO coefficient shrinkage in the kyphosis example: the GCV score plot is discrete because the error of the logistic model is not normally distributed. Each curve represents a coefficient as a function of the relative bound s. The covariates enter the regression equation sequentially as s increases, in the order i = 4, 3, 5, 2, 7. If s = 0.55 (the vertical red line), as chosen by GCV (shown in the right panel), variables 4 (Start), 3 (Number), 5 (Age^2) and 2 (Age) are nonzero. We can see from Table 3.3.2 that, as expected, LASSO yields much smaller standard errors than the OLS fit.

Predictor    LASSO Coef   LASSO SE    LASSO Z-score   OLS Coef    OLS SE    OLS Z-score   OLS Pr(>|t|)
Intercept    -1.0054      0.445053    -2.25906        -0.1942     0.6391    -0.304        0.76125
Age          0.1425       0.178466    0.798473        1.1128      0.5728    1.943         0.05204 .
Number       0.2765       0.258829    1.068271        1.0031      0.5817    1.725         0.08461 .
Start        -0.6115      0.291552    -2.09739        -2.6504     0.9361    -2.831        0.00463 **
Age^2        -0.5987      0.379761    -1.57652        -1.4006     0.664     -2.109        0.03491 *
Number^2     0            0.043329    0               -0.3076     0.2398    -1.283        0.19955
Start^2      0            0.047823    0               -1.1143     0.5281    -2.11         0.03486 *

Table 3.3.2 Results of the kyphosis data

3.4 Simulation Study

The model is

$$\pi_i = \frac{\exp(x_i^T \beta + \varepsilon_i)}{1 + \exp(x_i^T \beta + \varepsilon_i)}, \quad i = 1, 2, \ldots, n.$$

We again use the autoregressive model to generate correlated data. $\varepsilon$ is a random error with normal distribution $N(0, \sigma^2)$. There are 5 variables and a total of 40 observations. The correlation between $x_i$ and $x_j$ is $\phi^{|i-j|}$ with $\phi = 0.3$ and $\sigma = 1$. The assigned coefficients are $\beta = (3, 1.5, 0, 0, 2)^T$.

index   y   x1         x2         x3         x4         x5
1       1   1.002997   1.339722   1.556792   -1.06295   -0.98428
2       0   -1.55593   -0.61825   -0.02163   -0.02956   -3.10966
3       0   -0.25711   0.944475   0.59679    -1.21193   -0.62585
4       1   0.884903   0.558206   0.828492   -0.01518   0.285952
5       0   -1.67065   0.497213   -0.24751   1.258528   0.670419
…       …   …          …          …          …          …
37      0   -0.31754   -0.6843    -0.41027   -0.08336   -0.08773
38      0   -0.7015    0.404193   -0.58057   0.235259   -1.67983
39      0   0.305101   1.483059   0.467948   2.321507   -1.00605
40      1   -0.22636   -0.37299   -0.94524   -1.30565   1.511724

Table 3.4.1 Data structure of the simulation data

The LASSO analysis procedure can be seen clearly in Fig 3.4.1. The left panel shows how the GCV score changes with the relative bound; there is an evident elbow at approximately s = 0.6. The right panel shows how the LASSO coefficients change with the relative bound. If I choose s to be 0.6 and draw a vertical line through the diagram, I can pick out 2 (x1), 3 (x2) and 6 (x5); the black line represents the intercept.

Fig 3.4.1 LASSO shrinkage of the coefficients in the simulation example: the left panel shows an obvious elbow at 0.6, so $\hat{s} = 0.6$. The right panel shows how LASSO shrinks the coefficients as s shrinks.

Now I fix s at 0.6 and analyze the data set using both the ordinary logistic regression model and the logistic LASSO model. The table below shows how LASSO helps reduce the standard errors. The standard errors of the LASSO and OLS estimates are computed by bootstrapping from the original data set. As expected, LASSO yields much smaller standard errors than the OLS method.

Predictor    LASSO Coef   LASSO SE    LASSO Z-score   OLS Coef    OLS SE     OLS Z-score   OLS Pr(>|t|)
Intercept    1.127088     0.354877    3.175998        1.49703     0.96905    1.545         0.1224
x1           2.625973     0.546351    4.806383        3.961       1.78794    2.215         0.0267 **
x2           0.05362      0.207421    0.258509        0.97223     0.93206    1.043         0.2969 *
x3           0            0.069338    0               -0.90678    0.97318    -0.932        0.3515
x4           0            0.183042    0               -0.08364    0.70519    -0.119        0.9056
x5           0.543464     0.232676    2.335708        1.51508     0.9877     1.534         0.125 *

Table 3.4.2 Results from the simulation study

Chapter 4 Computation of the LASSO

Tibshirani (1996) first noted that the computation of the LASSO solution is a quadratic programming problem, which does not necessarily have a closed-form solution. However, numerous algorithms developed by statisticians can approach the solution numerically. The algorithm proposed by Tibshirani (1996) is adequate for moderate values of p, but it is not the most efficient possible: it is inefficient when the number of variables is large, and it is unusable when the number of variables exceeds the number of observations. Thus, statisticians have explored more efficient ways to estimate the parameters for years. A few significant results are the dual algorithm (Osborne, Presnell and Turlach, 2000); L2-boosting approaches (Meinshausen, Rocha and Yu, 2007); the shooting method (Fu, 1998); and Least Angle Regression (LARS) (Efron, Hastie, Johnstone and Tibshirani, 2002). LARS is the most widely used algorithm; it exploits the special structure of the LASSO problem and provides an efficient way to compute the solutions simultaneously for all values of s. In this chapter, I briefly introduce Osborne's dual algorithm and Efron's LARS algorithm.

4.1 Osborne’s dual algorithm

Osborne et al. present a fast-converging algorithm in their paper On the LASSO and its Dual (2000). They first introduce a general algorithm based on the duality theory described in that paper; the algorithm can be used to compute the LASSO estimates in any setting. For details, please refer to Osborne et al. (2000). Here I will only illustrate a simpler case, the orthogonal design case.

I have already shown that the solution of (1.1) subject to (1.2) is (2.1.7); substituting $\gamma = \lambda/2$, (2.1.7) becomes

$$\hat{\beta}_j^L = \mathrm{sign}\big(\hat{\beta}_j^o\big) \Big( \big|\hat{\beta}_j^o\big| - \gamma \Big)^{+}, \quad j = 1, \ldots, p. \quad (4.1.1)$$

Suppose that $\sum_j |\hat{\beta}_j^o| = t_o$ and $\sum_j |\hat{\beta}_j^L| = t$. Then

$$t_o - t = \sum_j \left\{ \big|\hat{\beta}_j^o\big| - \Big( \big|\hat{\beta}_j^o\big| - \gamma \Big)^{+} \right\} = \sum_j \big|\hat{\beta}_j^o\big|\, I\big( \big|\hat{\beta}_j^o\big| \le \gamma \big) + \sum_j \gamma\, I\big( \big|\hat{\beta}_j^o\big| > \gamma \big) = \sum_{j \le K} b_j + \gamma\,(p - K),$$

where $p$ is the number of variables, $b_1 \le b_2 \le \cdots \le b_p$ are the order statistics of $\big|\hat{\beta}_1^o\big|, \big|\hat{\beta}_2^o\big|, \ldots, \big|\hat{\beta}_p^o\big|$, and $K = \max\{ j : b_j \le \gamma \}$. Since usually $t < t_o$, we have $K < p$ and $b_K \le \gamma \le b_{K+1}$.

The orthogonal-case algorithm is then as follows (an R transcription is given after the list).

• Start with $c_0 = 0$.
• For $j$ from 1 to $p$, compute $c_j = \sum_{i=1}^{j} b_i + b_j (p - j)$, so that $0 = c_0 \le c_1 \le \cdots \le c_p = t_o$.
• Let $K = \max\{ i : c_i \le t_o - t \}$, which is easily computed once $t$ has been chosen by GCV.
• The corresponding threshold is

$$\gamma = \frac{ (t_o - t) - \sum_{i=1}^{K} b_i }{ p - K }. \quad (4.1.2)$$

• Obtain all the LASSO estimates from $\hat{\beta}_j^L = \mathrm{sign}\big(\hat{\beta}_j^o\big) \Big( \big|\hat{\beta}_j^o\big| - \gamma \Big)^{+}$, $j = 1, \ldots, p$.
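The following R sketch is a direct transcription of these steps for the orthogonal case (the function name lasso_orthogonal is mine, not from the thesis; it assumes the OLS estimates are available and that 0 < t < t_o):

# Orthogonal-design LASSO via the thresholding steps above.
# beta_ols: OLS estimates; t: absolute bound (chosen, e.g., by GCV).
lasso_orthogonal <- function(beta_ols, t) {
  p   <- length(beta_ols)
  b   <- sort(abs(beta_ols))                              # ordered |beta_ols|
  t_o <- sum(b)
  if (t >= t_o) return(beta_ols)                          # constraint inactive: LASSO = OLS
  c_j   <- cumsum(b) + b * (p - seq_len(p))               # c_j = sum_{i<=j} b_i + b_j (p - j)
  K     <- sum(c_j <= t_o - t)                            # K = max{ i : c_i <= t_o - t }
  gamma <- ((t_o - t) - sum(b[seq_len(K)])) / (p - K)     # equation (4.1.2)
  sign(beta_ols) * pmax(abs(beta_ols) - gamma, 0)         # shrink each estimate by gamma
}

For example, lasso_orthogonal(c(1, 2, 3), t = 2) returns (0, 0.5, 1.5), whose absolute values sum to the chosen bound t = 2.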

Based on Osborne’s algorithm, Justin Lokhorst, Bill Venables and Berwin Turlach developed an R package “lasso2”. The package and the manual can be downloaded from the following link: http://www.maths.uwa.edu.au/~berwin/software/lasso.html

4.2 Least angle regression algorithm

Least Angle Regression (LARS) is a model selection algorithm introduced in detail in a paper by Brad Efron, Trevor Hastie, Iain Johnstone and Rob Tibshirani. In the paper, they establish the theory behind the algorithm and then show how LARS relates to the LASSO and to forward stepwise regression. One advantage of LARS is that it implements both techniques. The modification that turns LARS into the LASSO is: if a nonzero coefficient hits zero, remove it from the active set of predictors and recompute the joint direction (Hastie 2002). The algorithm is shown below:

• Start with all coefficients $\beta_j$ equal to zero.
• Find the predictor $x_j$ most correlated with $y$.
• Increase the coefficient $\beta_j$ in the direction of the sign of its correlation with $y$, taking residuals $r = y - \hat{y}$ along the way. Stop when some other predictor $x_k$ has as much correlation with $r$ as $x_j$ has.
• Increase $(\beta_j, \beta_k)$ in their joint least squares direction until some other predictor $x_m$ has as much correlation with the residual $r$.
• Continue in this way until all $p$ predictors have been entered. After $p$ steps, one arrives at the full least squares solution.

Based on this algorithm, Trevor Hastie and Brad Efron developed an R package called "lars". The package and the manual can be downloaded from the following link: http://www-stat.stanford.edu/~hastie/Papers/#LARS
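A minimal usage sketch of the lars package on the diabetes data, mirroring the Appendix B calls (the value s = 0.4 is the relative bound chosen in Chapter 2):

library(lars)
data(diabetes)                                   # contains the predictor matrix x and response y

# Compute the full LASSO path with LARS, then read off the coefficients
# at a chosen relative bound s.
fit <- lars(diabetes$x, diabetes$y, type = "lasso")
plot(fit)                                        # coefficient paths versus s
coef(fit, s = 0.4, mode = "fraction")            # coefficients at s = 0.4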


References

Chambers, J. and Hastie, T. (1992). Statistical Models in S. Wadsworth and Brooks, Pacific Grove, CA, p. 200.

Craven, P. and Wahba, G. (1979). Smoothing noisy data with spline functions. Numerische Mathematik, 31, 377-403.

Efron, B., Johnstone, I., Hastie, T. and Tibshirani, R. (2002). Least angle regression. Annals of Statistics.

Fu, W. (1998). Penalized regressions: the bridge versus the LASSO. Journal of Computational and Graphical Statistics, 7(3), 397-416.

Genkin, A., Lewis, D. D. and Madigan, D. (2004). Large-scale Bayesian logistic regression for text categorization.

Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, New York.

Hastie, T., Taylor, J., Tibshirani, R. and Walther, G. Forward stagewise regression and the monotone LASSO.

Meinshausen, N., Rocha, G. and Yu, B. (2007). A tale of three cousins: LASSO, L2-boosting and Dantzig. Submitted to the Annals of Statistics, 2008. UC Berkeley.

Osborne, M., Presnell, B. and Turlach, B. (2000). On the LASSO and its dual. Journal of Computational and Graphical Statistics, 9, 319-337.

Sorenson, H. W. (1980). Parameter Estimation: Principles and Problems. Marcel Dekker, New York.

Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B, 58(1), 267-288.

Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68, 49-67.

Appendix A: The Effective Number of Parameters

The effective number of parameters was introduced in Hastie, Tibshirani and Friedman (2001). The concept of "number of parameters" can be generalized, especially to models where regularization is used in the fitting. Write

$$\hat{Y} = H Y,$$

where $Y$ is the vector of outcomes $y_1, \ldots, y_n$, $\hat{Y}$ is the vector of fitted values $\hat{y}_1, \ldots, \hat{y}_n$, and $H$ is an $n \times n$ matrix depending on the input vectors $X$. The effective number of parameters is defined as

$$d(H) = \mathrm{trace}(H).$$

Appendix B: R Code for Examples and Simulations

1. Acetylene data

library(lasso2)

library(lars)

plot.new()

P<-as.vector(c(49,50.2,50.5,48.5,47.5,44.5,28,31.5,34.5,35,38,38.5,15,17,20.5,29.5),mode="numeric")

Temperature<-as.vector(c(1300,1300,1300,1300,1300,1300,1200,1200,1200,1200,1200,1200,1100,1100,1100,1100),mode="numeric")

H2<-as.vector(c(7.5,9,11,13.5,17,23,5.3,7.5,11,13.5,17,23,5.3,7.5,11,17),mode="numeric")

ContactTime<-as.vector(c(0.012,0.012,0.0115,0.013,0.0135,0.012,0.04,0.038,0.032,0.026,0.034,0.041,0.084,0.098,0.092,0.086),mode="numeric")

acetylene<-data.frame(Temperature=Temperature,H2=H2,ContactTime=ContactTime)

#plot(Temperature,ContactTime,xlab="Reactor Temperature",ylab="Contact time")

cor(acetylene)

TH<-Temperature*H2


TC<-Temperature*ContactTime

HC<-H2*ContactTime

T_2<-Temperature^2

H2_2<-H2^2

C_2<-ContactTime^2

acetylene<-data.frame(T=Temperature,H=H2,C=ContactTime,TC=TC,TH=TH,HC=HC,T2=T_2,H2=H2_2,C2=C_2)

acetylene

a.mean<-apply(acetylene,2,mean)

acetylene<-sweep(acetylene,2,a.mean,"-")

a.var<-apply(acetylene,2,var)

acetylene<-sweep(acetylene,2,sqrt(a.var),"/")

xstd<-as.matrix(acetylene)

acetylene<-data.frame(y=P,acetylene)

a.lm<-lm(y~.,data=acetylene)

summary(a.lm)

anova(a.lm)

#ace_step<-step(a.lm)

#summary(ace_step)

ace_lasso2<-l1ce(y~.,data=acetylene,bound=(1:40)/40)

ystd<-as.vector(P,mode="numeric")

ace_lars <- lars(xstd,ystd,type="lasso",intercept=TRUE)

#plot(ace_lars,main="shrinkage of coefficients Acetylene")

############plot the GCV vs relative bound graph#############

lassogcv<-gcv(ace_lasso2)

lassogcv

lgcv<-matrix(lassogcv,ncol=4)

plot(lgcv[,1],lgcv[,4],type="l",main="Acetylene:GCV score vs s",xlab="relative bound",ylab="GCV")

ace<-l1ce(y~.,data=acetylene,bound=0.23)

summary(ace)


2. Linear LASSO Simulation Example 1

library(lasso2)

library(lars)

par(mfrow=c(1,2))

###########generating the desired data

Generator<-function(mean,Ro,beta,sigma,ermean=0,erstd=1,dim)

{

x<-matrix(c(rep(0,dim*9)),ncol=9)

y<-matrix(c(rep(0,dim)),ncol=1)

er<-matrix(c(rnorm(8*dim,0,1)),ncol=8)

error<-matrix(c(rnorm(dim,mean=ermean,sd=erstd)),ncol=1)

for(j in 1:dim)

{

x[j,1]<-mean

for(i in 2:9)

{

x[j,i]<-(1-Ro)*mean+Ro*x[j,i-1]+er[j,i-1]

}

y[j]<-x[j,2:9]%*%beta+sigma*error[j]

}

signal<-x[,2:9]%*%beta

SNR<-mean(signal)/erstd

return(list(x=x[,2:9],y=y,SNR=SNR))

}

#########################################

SNR<-5.7 #Signal to Noise ratio

sigma<-3 #standard deviation of white noise

n<-100 #data set size

beta<-matrix(c(3,1.5,0,0,2,0,0,0),ncol=1) #designated coefficients

Ro<-0.5 #Correlation

signalmean<-SNR*sigma/sum(beta)

data<-Generator(signalmean,Ro,beta,sigma=sigma,dim=n)   # Generator's sample-size argument is named dim


data

############Standardized the data############################

x.mean<-apply(data$x,2,mean) #mean for each parameter values

xros<-sweep(data$x,2,x.mean,"-") #every element minus its corresponding mean

x.std<-apply(xros,2,var) #variance of each parameter values

xstd<-sweep(xros,2,sqrt(x.std),"/") #every centered value divided by its standard deviation

ystd<-as.vector(data$y,mode="numeric")

############Analysis of the data with LASSO##################

#Using "Lars" Package

plres <- lars(xstd,ystd,type="lasso",intercept=TRUE)

plot(plres,main="shrinkage of coefficients Example 1")

#Using "Lasso" Package

l1c.P <- l1ce(ystd~xstd,xros, bound=(1:40)/40)

l1c.P

betaLasso<-matrix(coef(l1c.P),ncol=8)

############plot the GCV vs relative bound graph#############

lassogcv<-gcv(l1c.P)

lassogcv

lgcv<-matrix(lassogcv,ncol=4)

plot(lgcv[,1],lgcv[,4],type="l",main="Simulation 1: GCV score vs s",xlab="relative bound",ylab="GCV")

3. Diabetes data

library(lasso2)

library(lars)

data(diabetes)

diabetes_x<-diabetes[,1]

diabetes_x[1:10,]

diabetes_y<-diabetes[,2]

diab<-data.frame(age=diabetes_x[,1],sex=diabetes_x[,2],bmi=diabetes_x[,3],


BP=diabetes_x[,4],S1=diabetes_x[,5],S2=diabetes_x[,6],

S3=diabetes_x[,7],S4=diabetes_x[,8],S5=diabetes_x[,9],

S6=diabetes_x[,10],y=diabetes_y) # the response keeps the original (unstandardized) data

#####OLS########

diabetes_OLS<-lm(y~.,data=diab)

diabetes_OLS

summary(diabetes_OLS)

anova(diabetes_OLS)

#####Cross validation procedure to decide tuning parameter t

cv.diabetes<-cv.lars(diabetes_x,diabetes_y,K=10,fraction=seq(from=0,to=1,length=40))

cv.diabetes

title("10-fold Cross Validation and Standard error")

lars_diabetes<-lars(diabetes_x,diabetes_y,type="lasso",intercept=TRUE)

plot(lars_diabetes)

#Using "Lasso" Package

l1c.diabetes<- l1ce(y ~ .,diab, bound=(1:40)/40)

l1c.diabetes

anova(l1c.diabetes)

betaLasso<-matrix(coef(l1c.diabetes),ncol=9)

lassogcv<-gcv(l1c.diabetes)

lassogcv

lgcv<-matrix(lassogcv,ncol=4)

plot(lgcv[,1],lgcv[,4],type="l",main="Diabetes:GCV vs s",xlab="s",ylab="GCV")

########bootstrap to get the standard error

####################

resample<-function(data,m)

{

dim<-9

res<-data


r<-ceiling(runif(m)*m)

for(i in 1:m)

{

for(j in 1:dim) res[i,j]<-data[r[i],j]

}

return(list(res=res))

}

###########################

l1c.diabetes<-l1ce(y~.,diab, bound=0.4)

summary(l1c.diabetes)

nboot<-500

diab_res<-resample(diab,442)

l1c.diabetes<-l1ce(y~.,diab_res$res, bound=0.4)

sum(residuals(l1c.diabetes)^2)

summary(l1c.diabetes)

l1c.C<-coefficients(l1c.diabetes)

for(m in 2:nboot)

{

diab_res<-resample(diab,442)

l1c.diabetes<-l1ce(y~.,diab_res$res, bound=0.4)

l1c.C<-cbind(l1c.C,coefficients(l1c.diabetes))

}

sqrt(var(l1c.C["(Intercept)",]))

sqrt(var(l1c.C["age",]))

sqrt(var(l1c.C["sex",]))

sqrt(var(l1c.C["bmi",]))

sqrt(var(l1c.C["BP",]))

sqrt(var(l1c.C["S1",]))

sqrt(var(l1c.C["S2",]))

sqrt(var(l1c.C["S3",]))


sqrt(var(l1c.C["S4",]))

sqrt(var(l1c.C["S5",]))

sqrt(var(l1c.C["S6",]))

4. Kyphosis data

library(lasso2)

library(lars)

library(rpart)

data(kyphosis)

############Transform Kyphosis to numeric mode##############

n<-length(kyphosis$Kyphosis)

n

y<-matrix(c(rep(0,n)),ncol=1)

for(i in 1:n)

{

if(kyphosis$Kyphosis[i]=="absent") y[i]<-0

if(kyphosis$Kyphosis[i]=="present") y[i]<-1

}

############Standardized the data###########################

Temp<-kyphosis[,2:4]

k.mean<-apply(Temp,2,mean) #mean for each parameter values

kyph<-sweep(Temp,2,k.mean,"-") #every element minus its corresponding mean

k.var<-apply(kyph,2,var) #variance of each parameter values

kyph<-sweep(kyph,2,sqrt(k.var),"/") #every centered value divided by its standard deviation

kyph[,"Kyphosis"]<-y # the response keeps the original (unstandardized) data

kyph<-as.data.frame(kyph)

ageSq<-(kyph$Age)^2

numSq<-(kyph$Number)^2


startSq<-(kyph$Start)^2

kyph<-data.frame(Age=kyph$Age, Number=kyph$Number, Start=kyph$Start,ageSq=ageSq,

numSq=numSq, startSq=startSq,Kyphosis=y)

#Linear logistic fitted model

glm.K<-glm(Kyphosis~.,data=kyph,family=binomial())

beta_logistic<-as.vector(coefficients(glm.K),mode="numeric")

beta_logistic

###logistic lasso with varying relative bound

gl1c.Coef<-matrix(c(rep(0,7*40)),ncol=7)

for(i in 1:40)

{

temp<-gl1ce(Kyphosis~.,data=kyph,family=binomial(),bound=i/40)

gl1c.Coef[i,]<-coefficients(temp)

}

gl1c.Coef<-data.frame(Intercept=gl1c.Coef[,1],Age=gl1c.Coef[,2],Number=gl1c.Coef[,3],Start=gl1c.Coef[,4],AgeSq=gl1c.Coef[,5],NumberSq=gl1c.Coef[,6],StartSq=gl1c.Coef[,7])