
A STATISTICAL SHRINKAGE MODEL

AND ITS APPLICATIONS

Wenjiang J. Fu

A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy

Graduate Department of Public Health Sciences
University of Toronto

© Copyright by Wenjiang J. Fu, 1998


National Library of Canada, Acquisitions and Bibliographic Services
395 Wellington Street, Ottawa ON K1A 0N4, Canada

The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.


A Statistical Shrinkage Model and Its Applications

Doctor of Philosophy 1998

Wenjiang J. Fu

Department of Public Health Sciences

University of Toronto

Abstract

Bridge regression, a special type of penalized regression with penalty function λ Σ_j |β_j|^γ, γ ≥ 1, is considered. The Bridge estimator is obtained by solving the penalized score equations via the modified Newton-Raphson method for γ > 1 or the Shooting method for γ = 1. The Bridge estimator yields small variance with a little sacrifice of bias, and thus achieves small mean squared error and small prediction error when collinearity is present among the regressors in a linear regression model. The concept of penalization is generalized via the penalized score equations, which allow the implementation of penalization regardless of the existence of joint likelihood functions. Penalization is then applied to generalized linear models and generalized estimating equations (GEE). The penalty parameter γ and the tuning parameter λ are selected via generalized cross-validation (GCV). A quasi-GCV is developed to select the parameters for the penalized GEE. Simulation studies show that the Bridge estimator performs well compared to the estimators of ridge regression (γ = 2) and the Lasso (γ = 1). Several data sets from public health studies are analyzed using the Bridge penalty model in the statistical settings of a linear regression model, a logistic regression model and a GEE model for binary outcomes.


To my parents,

my sister Shufen,

my wife Qi,

and my daughter Martina.


Acknowledgements

I am indebted to my supervisor, Professor R. Tibshirani, for introducing this very interesting research topic to me, and for his encouragement, support and supervision during my Ph.D. study.

I am grateful to my committee members, Professors P. Corey, J. Hsieh and D. Tritchler, and to my internal and external examiners, Professors K. Knight and D. Hamilton, for their valuable suggestions and critiques.

I appreciate the valuable discussions with Professors R. Neal and J. Hsieh, which led to some very interesting points in my thesis.

I would like to thank Professor P. Corey for providing the environmental health data. I also would like to thank my friend Rafal for his help with some programming techniques.

I am also grateful to my parents-in-law, Shouyin and Peiyu, who took care of my daughter many late nights and weekends while I was studying in McMurrich Building.


Contents

Abstract

1 Introduction
  1.1 Introduction
  1.2 Some Background of Shrinkage Models
  1.3 Problems

2 Bridge Regressions
  2.1 Introduction
  2.2 Structure of the Bridge Estimators
  2.3 Algorithms for the Bridge and Lasso Estimators
  2.4 Variance of the Bridge Estimator
  2.5 Illustration of the Shrinkage Effect
  2.6 Bridge Regression for Orthonormal Matrix
  2.7 Bridge Penalty as Bayesian Prior
  2.8 Relation between Tuning Parameters λ and t

3 Penalized Score Equations
  3.1 Introduction
  3.2 Generalized Linear Models and Likelihood
  3.3 Quasi-Likelihood and Quasi-Score Functions
  3.4 Penalized Score Equations
  3.5 Algorithms for Penalized Score Equations

4 Penalized GEE
  4.1 Introduction
  4.2 Generalized Estimating Equations
  4.3 Penalized GEE

5 Selection of Shrinkage Parameters
  5.1 Introduction
  5.2 Cross-Validation and Generalized Cross-Validation
  5.3 Selection of Parameters λ and γ via the GCV
  5.4 Quasi-GCV for Penalized GEE

6 Simulation Studies
  6.1 A Linear Regression Model
  6.2 A Logistic Regression Model
  6.3 A Generalized Estimating Equations Model
  6.4 A Complicated Linear Regression Model

7 Applications: Analyses of Health Data
  7.1 Analysis of Prostate Cancer Data
  7.2 Analysis of Kyphosis Data
  7.3 Analysis of Environmental Health Data

8 Discussions and Future Studies
  8.1 Discussion
  8.2 Future Studies

References

A A FORTRAN Subroutine of the Shooting Method for the Lasso

B Mathematical Proof


List of Figures

1.1 Constrained areas of Bridge regressions
2.1 Solution of equation (2.1)
2.2 Algorithms for the Bridge estimators
2.3 Shrinkage effect of Bridge regressions for fixed λ > 0
2.4 Bridge penalty as a Bayesian prior with λ = 1
2.5 Bridge penalty as a Bayesian prior with λ = 0.5
2.6 Bridge penalty as a Bayesian prior with λ = 10
2.7 Relation between shrinkage parameters λ and t
5.1 Selection of parameters λ and γ via GCV
5.2 Selection of parameters λ and γ via quasi-GCV
6.1 Simulation with true β generated from the Bridge prior with γ = 1
6.2 Simulation with true β generated from the Bridge prior with γ = 1.3
6.3 Simulation with true β generated from the Bridge prior with γ = 2
6.4 Simulation with true β generated from the Bridge prior with γ = 3
6.5 Simulation with true β generated from the Bridge prior with γ = 4
7.1 Selection of parameters λ and γ for the prostate cancer data
7.2 Selection of parameters λ and γ for the kyphosis data
7.3 Comparison of prediction errors on test data by box plots
7.4 Selection of parameters λ and γ for the environmental health data


List of Tables

2.1 Bridge estimators and standard errors for orthonormal X
2.2 Bridge estimators and standard errors for non-orthonormal X
6.1 Model comparison for a linear regression model
6.2 Model comparison for a logistic regression model
6.3 Model comparison for a GEE model
6.4 Means and SEs of MSE and PSE for different γ
7.1 Estimates of the prostate cancer data
7.2 Comparison in model selection
7.3 Estimates of the kyphosis data
7.4 Comparison of prediction errors on test data over 100 random splits
7.5 Estimates of the environmental health data


Chapter 1

Introduction


1.1 Introduction

In many applied science or public health studies, the investigators are interested in relations between response variables and explanatory variables. For example, in a breast cancer study, it is of interest to know whether the probability of developing cancer in a population depends on some potential risk factors, such as a patient's diet, age, height and weight. Statistical analysis provides a scientific tool to investigate such relationships using data obtained from previous or current studies. The aim of statistical analysis is to identify the risk factors that contribute significantly to the presence or the occurrence of the event under investigation. Very often, the analysis is conducted through a statistical procedure called regression, which is based on probability theory and statistical modelling. Regression analysis provides information on the significance of the contribution of these risk factors to the event, and thus helps the investigators to make scientific decisions.

In some studies, certain explanatory variables present a linear relation, i.e. some variables depend linearly on some others. Such a phenomenon is called collinearity. Since the presence of collinearity among explanatory variables induces large variation and uncertainty in the regression models, the estimates of the model parameters have large variance, and prediction based on the models may perform very poorly. Therefore the models may not serve the needs of the investigators.

In this thesis, I investigate this collinearity problem and propose a method using a statistical technique: Bridge penalization. I also demonstrate through statistical simulations that this method works well in terms of estimation and prediction. Finally, I apply this method to several data sets from public health studies to achieve good statistical results.

1.2 Some Background of Shrinkage Models

Consider a linear regression problem

y = Xβ + ε,

where y is an n-vector of random responses, X is an n × p design matrix, β is a p-vector of regression parameters and ε is an n-vector of independently identically distributed (iid) random errors. Ordinary least-squares regression (OLS), or least-squares regression as it is known in the literature, minimizes the residual sum of squares

RSS(β) = (y − Xβ)^T (y − Xβ)

and yields an unbiased estimator β̂_ols = (X^T X)^{-1} X^T y if the matrix X is of full rank, with expected value

E(β̂_ols) = β

and variance

Var(β̂_ols) = (X^T X)^{-1} σ²,

where σ² is the common variance of each individual random error term ε_i. Since both the estimator β̂_ols and its variance are of simple form, computation of these quantities is very simple and can easily be done even without the help of computers if the number of regressors is small. In addition, the variance of β̂_ols is minimal among all linear unbiased


estimators, i.e. for any linear unbiased estimator β̃, where β̃ = Ay and E(β̃) = β, one has

Var(β̂_ols) ≤ Var(β̃).

Hence β̂_ols is usually referred to as the Best Linear Unbiased Estimator (BLUE) under the Gauss-Markov conditions; see Sen and Srivastava (1990).

However, despite its simplicity, unbiasedness and minimum variance, β̂_ols is not always satisfactory, for the following reasons.

1) The estimator is not unique if the regression matrix X is less than full rank. In fact, there are infinitely many estimates which attain the minimum residual sum of squares.

2) The variance Var(β̂_ols) = (X^T X)^{-1} σ² becomes large if the regression matrix X is close to collinear. Hence the mean squared error (MSE) is large, since

MSE(β̂) = E(β̂ − β)^T (β̂ − β) = tr Var(β̂) + ‖bias(β̂)‖²

and

MSE(β̂_ols) = σ² tr[(X^T X)^{-1}].

For example, consider a simple linear regression problem with two regressors,

y = β₁x₁ + β₂x₂ + ε,

where ε has a normal distribution N(0, σ²). To illustrate the effect of collinearity between the regressors, we standardize the regression vectors x₁ and x₂ by setting x̄_j = 0 and ‖x_j‖ = 1 for j = 1, 2, and set σ² = 1 for simplicity. Then the sample correlation coefficient is r = x₁^T x₂, and

X^T X = [ 1  r ; r  1 ].


The variance-covariance matrix of the OLS estimator β̂_ols = (β̂₁, β̂₂)^T is thus

Var(β̂_ols) = (X^T X)^{-1} = (1/(1 − r²)) [ 1  −r ; −r  1 ],

and Var(β̂_i) = 1/(1 − r²) for i = 1, 2. If the regressors x₁ and x₂ are uncorrelated, i.e. r = 0, then Var(β̂_i) = 1 for i = 1, 2. However, if x₁ and x₂ are correlated, then Var(β̂_i) can be very large, as shown below; for example, Var(β̂_i) = 10.26 for r = 0.95.

[Table: increase of variance with the correlation coefficient r. Entries not recovered in transcription.]
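The inflation factor 1/(1 − r²) is easy to tabulate directly. The following is a minimal sketch (mine, not part of the thesis) that evaluates Var(β̂_i) = 1/(1 − r²) for σ² = 1 over a few correlations:

```python
# Variance of each OLS coefficient for two standardized regressors
# with sample correlation r (sigma^2 = 1): Var(beta_hat_i) = 1/(1 - r^2).
for r in (0.0, 0.5, 0.9, 0.95, 0.99):
    print(f"r = {r:4.2f}   Var(beta_hat_i) = {1.0 / (1.0 - r * r):6.2f}")
```

At r = 0.95 this prints 10.26, the value quoted above.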

Since the mean squared error reflects the overall accuracy of estimation, and a large MSE means poor estimation, predictions based on β̂_ols may perform very poorly if collinearity is present in X. For example, consider the prediction squared error (PSE) in the two-regressor case. The expectation of the prediction error at an arbitrary point (x*^T, y*) under the OLS estimator β̂ is

E(PSE) = E(y* − x*^T β̂)² = σ²[1 + x*^T (X^T X)^{-1} x*],

where ε* is the random error at the prediction point and σ² is the variance of the random error. Thus the PSE depends on the location of the vector x* in the feature space.


Take a special case with high collinearity, X^T X = diag(1, 0.001); then E(PSE) = σ²[1 + x₁*² + 1000 x₂*²]. If |x₂*| ≪ max{1, |x₁*|}, then the prediction error is moderate. Otherwise, it is inflated largely by the factor x₂*² due to the high collinearity. Detailed discussions of multi-collinearity can be found in Seber (1977), Sen and Srivastava (1990), Hocking (1996), Lawson and Hanson (1974), Hoerl and Kennard (1970a, 1970b) and Frank and Friedman (1993).

To achieve better prediction, Hoerl and Kennard (1970a, 1970b) introduced ridge regression,

min_β (y − Xβ)^T (y − Xβ), subject to Σ_j β_j² ≤ t.

While ridge regression may shrink the OLS estimator β̂_ols towards 0 and yields a biased estimator

β̂_rdg = (X^T X + λI)^{-1} X^T y,

where λ = λ(t) is a function of t, the variance is smaller than that of β̂_ols for λ > 0:

Var(β̂_rdg) = (X^T X + λI)^{-1} X^T X (X^T X + λI)^{-1} σ² < Var(β̂_ols).

Therefore, better estimation can be achieved on average in terms of MSE, with a little sacrifice of bias. This is well known as the bias-variance trade-off.

To illustrate the shrinkage effect of ridge regression, consider the linear regression problem with two regressors in the above example. The variance of the ridge estimator is

Var(β̂_rdg,i) = (1/2)[(1 + r)/(1 + λ + r)² + (1 − r)/(1 + λ − r)²], i = 1, 2,

and the bias is

bias(β̂_rdg) = E(β̂_rdg) − β = −λ (X^T X + λI)^{-1} β.

[Table: variance, bias² and MSE of the ridge estimator for λ = 0, 1, 5 and 10; bias² and MSE are computed with true β = (1, 1)^T. Entries not recovered in transcription.]

If x₁ and x₂ are uncorrelated, i.e. r = 0, then Var(β̂_rdg,i) = 1/(1 + λ)² = 0.25 for λ = 1, smaller than Var(β̂_ols,i) = 1 at λ = 0. If x₁ and x₂ are correlated, for example r = 0.9, then Var(β̂_rdg,i) = 0.15 for λ = 1, much smaller than Var(β̂_ols,i) = 5.26 at λ = 0. However, the squared bias increases with λ, as shown in the above table. The squared bias is computed with bias(β̂_rdg,i) = −λβ/(1 + λ + r) for the special case β₁ = β₂ = β = 1.

It can be observed from the above table that the variance of the ridge estimator decreases with λ, while the squared bias increases. The MSE presents a trade-off between the bias and the variance: it decreases to a small value from λ = 0 to λ = 1, and increases from λ = 1 to λ = 5 or 10.
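The trade-off can be recomputed from the formulas above. A minimal NumPy sketch (my code, not the thesis's), using true β = (1, 1)^T and X^T X = [[1, r], [r, 1]]:

```python
import numpy as np

def ridge_var_bias2_mse(lam, r, beta=(1.0, 1.0), sigma2=1.0):
    """Total variance, squared bias and MSE of the ridge estimator
    for two standardized regressors with correlation r."""
    beta = np.asarray(beta)
    A = np.array([[1.0, r], [r, 1.0]])            # X^T X
    M = np.linalg.inv(A + lam * np.eye(2))
    V = M @ A @ M * sigma2                        # Var(beta_rdg), sandwich form
    bias = -lam * M @ beta                        # E(beta_rdg) - beta
    return np.trace(V), bias @ bias, np.trace(V) + bias @ bias

for lam in (0.0, 1.0, 5.0, 10.0):                 # the lambda grid of the table
    var, b2, mse = ridge_var_bias2_mse(lam, r=0.9)
    print(f"lambda = {lam:4.1f}  var = {var:.3f}  bias^2 = {b2:.3f}  MSE = {mse:.3f}")
```

For r = 0.9 this reproduces the per-coordinate values quoted above (e.g. total variance 2 × 0.15 at λ = 1) and shows the MSE dipping near λ = 1 before rising again.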


Frank and Friedman (1993) introduced Bridge regression,

min_β (y − Xβ)^T (y − Xβ), subject to Σ_j |β_j|^γ ≤ t.

It includes ridge regression (γ = 2) and subset selection (γ = 0) as special cases. For other values of γ > 0, it constrains the estimates to different regions around the origin in the parameter space, as shown in Figure 1.1 for t = 1. While Frank and Friedman did not solve for the estimator of Bridge regression for any given γ > 0, they pointed out that optimizing the value of γ was desirable.

[Figure 1.1: Constrained areas of Bridge regressions in two-dimensional parameter space; panels for γ > 2, γ = 2, γ = 1 and γ < 1.]

Tibshirani (1996) introduced the Least Absolute Shrinkage and Selection Operator (Lasso),

min_β (y − Xβ)^T (y − Xβ), subject to Σ_j |β_j| ≤ t,

as a special case of the Bridge with γ = 1. The special shape of the constrained region Σ|β_j| ≤ t, as pointed out in Tibshirani (1996), allows the Lasso estimator to be attained at a corner of the region, and thus makes β̂_j = 0 for some j. Therefore, the Lasso may shrink the OLS estimator β̂_ols towards 0 and potentially may set some β̂_j = 0. This can be seen clearly from the following formula of the Lasso estimator for orthonormal X:

β̂_j = sign(β̂_ols,j) (|β̂_ols,j| − C(t))₊,

where C(t) is a positive constant depending on t but independent of j, and (·)₊ denotes the positive part. Clearly, for orthonormal X, the Lasso shrinks large coordinates of the OLS estimator by a constant and small ones to 0. Thus it performs as a variable selection operator.
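For orthonormal X this rule is one line of code. A minimal sketch (my notation; the threshold constant is written c in place of C(t)):

```python
import numpy as np

def soft_threshold(beta_ols, c):
    """Lasso estimator for orthonormal X: shrink large OLS coordinates
    by the constant c and set small ones exactly to zero."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - c, 0.0)

print(soft_threshold(np.array([3.0, -0.4, 1.2]), c=0.5))   # [2.5, -0.0, 0.7]
```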

Tibshirani (1996) used a combined quadratic programming method to solve for the Lasso estimator, observing that the Lasso constraint Σ|β_j| ≤ t is equivalent to the combination of the 2^p linear constraints Σ w_j β_j ≤ t with w_j = ±1. The parameter t was optimized via generalized cross-validation (GCV). It was also shown, through some intensive simulation studies, that the Lasso not only shrinks the OLS estimator towards 0 and potentially achieves better estimation and prediction, but also selects variables in a continuous way. Such a variable selection process is more stable than the discrete process of adding one variable to, or deleting one from, the model.

Since the Bridge shrinks the OLS estimator towards 0, it is referred to in general as a shrinkage model, and the parameter t as the shrinkage parameter.

1.3 Problems

Although both ridge regression and the Lasso perform much better than OLS regression when collinearity is present in X, as shown in the last section and in Frank and Friedman (1993) and Tibshirani (1996), the fact that the Lasso outperforms the ridge in some cases and the ridge outperforms the Lasso in some others (Tibshirani 1996) raises the following questions: What is the optimal value of γ that performs best? How to select the optimal value of γ? How to solve for the Bridge estimator for any fixed γ > 0 in general?

To answer these questions, we need to develop some techniques that will allow the optimal value of γ to be selected based on the data itself, rather than on some subjective decision such as selecting γ = 1 (the Lasso) or γ = 2 (the ridge).

In this thesis, I attempt to answer these questions by considering Bridge regression as a whole family, which includes the ridge and the Lasso as special members. Specifically, I study

min_β (y − Xβ)^T (y − Xβ), subject to Σ_j |β_j|^γ ≤ t, with γ ≥ 1.

In Chapter 2, I study the structure of the Bridge estimators and develop algorithms to solve for the Bridge estimator for any fixed γ ≥ 1. Particularly, I develop a new algorithm for the Lasso to make the computation much simpler and easier. The variance of the Bridge estimator is derived. The shrinkage effect of Bridge regression is illustrated through a simple example of linear regression, and is examined theoretically for the orthonormal regression matrix case. The Bridge penalty function is also studied as a Bayesian prior. In Chapter 3, I review generalized linear models, likelihood functions and quasi-likelihood. I extend Bridge regression to generalized linear models, and I further generalize penalization to be independent of joint likelihood functions by introducing the penalized score equations. Algorithms solving the penalized score equations are also developed. In Chapter 4, I review generalized estimating equations (GEE) in longitudinal studies, and apply penalization to the GEE via the penalized score equations. In Chapter 5, I review the cross-validation and generalized cross-validation (GCV) methods. The shrinkage parameter γ and the tuning parameter λ are selected via the GCV for generalized linear models. A quasi-GCV is derived to select γ and λ for the penalized GEE. In Chapter 6, I compare the Bridge model with some other shrinkage models (no shrinkage, the Lasso and the ridge) through simulation studies. In Chapter 7, I analyze several data sets from public health studies using the Bridge penalty model. Chapter 8 gives general discussions and plans for some future studies. Appendix A gives a FORTRAN subroutine to compute the Lasso estimator via the Shooting method, and Appendix B gives the mathematical proofs.


Chapter 2

Bridge Regressions


2.1 Introduction

In Chapter 1, I briefly introduced regressions and shrinkage models, particularly Bridge regressions. Although Bridge regression was proposed, its estimators had not yet been studied. As Frank and Friedman (1993) pointed out, it is desirable to study how to select the optimal value of γ to achieve the best results.

In this chapter, I study Bridge regression and its estimators. I propose an algorithm, the modified Newton-Raphson method, to solve for the Bridge estimator for any fixed γ > 1. I also propose a new algorithm, the Shooting method, to solve for the Lasso estimator. The variance of the Bridge estimator is obtained via the delta method. The shrinkage effect is demonstrated through a simple example and is examined theoretically for the orthonormal regression matrix case. The Bridge penalty function is also studied as a Bayesian prior.

2.2 Structure of the Bridge Estimators

To solve Bridge regression for any given γ ≥ 1, we consider the following two problems:

(P1) min_β (y − Xβ)^T (y − Xβ), subject to Σ_j |β_j|^γ ≤ t;

(P2) min_β [(y − Xβ)^T (y − Xβ) + λ Σ_j |β_j|^γ].

Problems (P1) and (P2) are equivalent, i.e. for given λ ≥ 0 there exists a t ≥ 0 such that the two problems share the same solution, and vice versa. We refer to (P2) as a penalized regression with penalty Σ|β_j|^γ and tuning parameter λ.


Consider problem (P2). Let G(β, X, y, λ, γ) = RSS + λ Σ_j |β_j|^γ. Then G → +∞ as the Euclidean norm ‖β‖ → +∞. Thus the function G can be minimized, i.e. there exists a β̂ such that

β̂ = arg min_β G(β, X, y, λ, γ).

Since the function |β_j| is not differentiable at β_j = 0, one can only take partial derivatives of G with respect to β_j at β_j ≠ 0, j = 1, . . . , p. Denote

S_j(β, X, y) = ∂RSS/∂β_j = 2x_j^T Xβ − 2x_j^T y

and

d(β_j, λ, γ) = λγ |β_j|^{γ−1} sign(β_j).

Setting ∂G/∂β_j = 0 leads to

(P3) S_j(β, X, y) = −d(β_j, λ, γ), j = 1, . . . , p.

Problem (P2) can then be solved through (P3), as shown in the next section.

To illustrate how to solve problem (P3), we consider a simple example of linear regression,

y = β₁x₁ + β₂x₂ + ε.

The residual sum of squares is RSS = Σ_i (y_i − β₁x_{i1} − β₂x_{i2})². Taking partial derivatives of the function G with respect to β_j leads to the equations as in (P3),

2 Σ_i x_{ij}(β₁x_{i1} + β₂x_{i2}) − 2 Σ_i x_{ij} y_i = −λγ |β_j|^{γ−1} sign(β_j), j = 1, 2.


If the regressors x₁ and x₂ are uncorrelated, i.e. Σ_i x_{i1}x_{i2} = 0, each individual equation can be solved independently. If x₁ and x₂ are correlated, i.e. Σ_i x_{i1}x_{i2} ≠ 0, the equations can be solved iteratively, as shown in the next section.

To develop an algorithm for solving (P3) in general, one needs to know the structure of the solutions. We study the structure through (P3) in this section and provide algorithms in the next section. We have the following theorems on (P3) for a more general function S_j.

Let β be a vector in the p-dimensional parameter space B, X an n × p matrix, and y a vector in an n-dimensional sample space R^n. For fixed X, y, λ ≥ 0 and γ ≥ 1, define the following real functions:

S_j(·, X, y): B → R, β ↦ S_j(β, X, y), j = 1, . . . , p;

F(·, X, y): B → R, β ↦ F(β, X, y), non-negative; and

d(β_j, λ, γ) = λγ |β_j|^{γ−1} sign(β_j).

Denote S = (S₁, . . . , S_p)^T, and define the problem

(P2′) Given γ ≥ 1 and λ ≥ 0, min_β (F(β, X, y) + λ Σ_j |β_j|^γ).

We have the following results for problem (P3).

Theorem 1. Given γ > 1 and λ > 0, if the function S defined above is continuously differentiable with respect to β and the Jacobian matrix (∂S/∂β) is positive semi-definite, then

1. (P3) has a unique solution β̂(λ, γ), which is continuous in (λ, γ);

2. The limit of the unique solution β̂(λ, γ) exists as γ → 1⁺. Denote the limit solution by β̂(λ, 1⁺).


Theorem 2. Given γ > 1 and λ > 0, if there exists a non-negative convex function F(β) with ∂F/∂β = S, and the Jacobian matrix ∂S/∂β is positive definite, then

1. The unique solution of (P3) is equal to the unique solution of (P2′);

2. The limit of the unique solution of (P3), β̂(λ, 1⁺), is equal to the unique solution of (P2′) with γ = 1;

3. Particularly, if there exists a joint likelihood function L(β) and F = −2 log(L(β)), the unique solution of (P3) is equal to the Bridge estimator of (P2′), and the limit of the unique solution of (P3) is equal to the Lasso estimator of (P2′). For the Gaussian distribution, the solution of (P3) is equal to the Bridge estimator of (P2), and the limit of the solution of (P3) is equal to the Lasso estimator of (P2).

While the existence and uniqueness of the estimator of (P2′) are guaranteed by the convexity of the function F(β), which can be inferred from the Jacobian condition on S, Theorems 1 and 2 provide theoretical support for (P3), which yields a rather general approach to this penalization problem, as we shall see in Chapter 4. Particularly, for the Gaussian distribution F(β) = RSS(β), problem (P2′) simplifies to (P2), and its unique solution can be solved through (P3), as shown in the next section.

2.3 Algorithms for the Bridge and Lasso Estimators

To solve Bridge regression for any given γ ≥ 1 and λ > 0, we start with problem (P3). Although we only demonstrate our method below for Gaussian response variables, our algorithm applies to many other types of responses via the iteratively reweighted least-squares (IRLS) procedure.


Denote β by (β_j, β_{−j}^T)^T, where β_{−j} is a (p − 1)-vector consisting of the β_k's other than β_j. We study the j-th equation of (P3):

2x_j^T x_j β_j + 2x_j^T X_{−j} β_{−j} − 2x_j^T y = −λγ |β_j|^{γ−1} sign(β_j).   (2.1)

The left hand side of (2.1), LHS = 2x_j^T x_j β_j + 2x_j^T X_{−j} β_{−j} − 2x_j^T y, is, for fixed β_{−j}, a linear function of β_j with positive slope 2x_j^T x_j. The right hand side of (2.1), RHS = −λγ |β_j|^{γ−1} sign(β_j), is nonlinear in β_j. The function RHS has a different shape for different values of γ, as shown in Figure 2.1: it is continuous, differentiable and monotonically decreasing for γ > 1, except that it is non-differentiable at β_j = 0 for 1 < γ < 2, and it is a heavy-side function with a jump of height 2λ at β_j = 0 for γ = 1. Therefore, equation (2.1) has a unique solution for γ > 1, and a unique solution or no solution for γ = 1.

[Figure 2.1: The functions in equation (2.1), with panels for γ > 2, γ = 2, 1 < γ < 2 and γ = 1. Solid is the function −d; dashed is S_j. The vertical axis in each bottom panel has a scale of λ.]

To compute the Bridge estimator for γ > 1, the Newton-Raphson method can be used. However, since the function d is not differentiable at β_j = 0 for γ < 2, modification is needed to achieve convergence to the solution. We develop the following modified Newton-Raphson method for γ > 1 in general, solving iteratively for the unique solution of the j-th equation of (P3).

Modified Newton-Raphson (M-N-R) Algorithm for the Bridge (γ > 1)

(1). Start with β̂₀ = β̂_ols = (β̂₁, . . . , β̂_p)^T.

(2). At step m, for each j = 1, . . . , p, let S₀ = S_j(0, β̂_{−j}, X, y). Set β̂_j = 0 if S₀ = 0. Otherwise, if γ ≥ 2, apply the Newton-Raphson method to solve for the unique solution β̂_j of equation (2.1); if γ < 2, modify the function −d by changing one part to its tangent line at a certain point between the origin and the solution (the intersection of −d and S_j), as shown in


Figure 2.2 (upper left panel). Such a point can easily be found using the bisection method.

Then the Newton-Raphson method is applied to equation (2.1) with the modified function −d to solve for the unique solution β̂_j. Form a new estimator β̂_m = (β̂₁, . . . , β̂_p)^T after updating all β̂_j.

(3). Repeat (2) until β̂_m converges.

Remarks

1. To initialize β̂₀, the OLS estimator β̂_ols is always available. Even when p > n and X is less than full rank, any general estimate can be used for the initialization of β̂₀.

2. From the modified Newton-Raphson algorithm, one can see that if the Bridge estimator satisfies β̂_j = 0 for some j, then β̂_{−j} must satisfy S_j(0, β̂_{−j}, X, y) = 0. This implies that the (p − 1)-dimensional vector β̂_{−j} lies in a (p − 2)-dimensional manifold, which has zero measure. Therefore, one can conclude that β̂_j is almost surely non-zero.
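As a sketch of the coordinate update in step (2), the tangent-line modification can be replaced by a bracketing root-finder on the same interval; both converge to the unique root of (2.1). This substitution is mine, assuming Gaussian responses, and is not the thesis's M-N-R implementation:

```python
import numpy as np
from scipy.optimize import brentq

def bridge_coordinate_update(j, beta, X, y, lam, gamma):
    """Solve the j-th equation of (P3) for fixed beta_{-j} (gamma > 1):
    2 x_j'X beta - 2 x_j'y = -lam*gamma*|b|^(gamma-1)*sign(b)."""
    xj = X[:, j]
    b = beta.copy()
    b[j] = 0.0
    S0 = 2.0 * xj @ (X @ b) - 2.0 * xj @ y        # S_j(0, beta_{-j}, X, y)
    if S0 == 0.0:
        return 0.0                                # the root is exactly zero
    def f(t):                                     # LHS minus RHS of equation (2.1)
        return (2.0 * (xj @ xj) * t + S0
                + lam * gamma * abs(t) ** (gamma - 1.0) * np.sign(t))
    r = abs(S0) / (2.0 * (xj @ xj))               # unpenalized root brackets the solution
    lo, hi = (0.0, r) if S0 < 0.0 else (-r, 0.0)
    return brentq(f, lo, hi)                      # unique root of (2.1)
```

Cycling this update over j = 1, . . . , p until convergence mirrors step (2) of the M-N-R algorithm.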

To compute the Lasso estimator for any given λ > 0, one can use Theorem 1, which implies that the limit of the Bridge estimator, lim_{γ→1⁺} β̂(λ, γ), is equal to the Lasso estimator. However, taking the limit numerically is not recommended in practice, for the following reasons. From the computational point of view, it is obviously time-consuming, since the M-N-R algorithm has to be run many times, once for each single γ_i > 1 in a series {γ_i} with γ_i → 1⁺. From the theoretical point of view, it is misleading: assume a sequence of γ_i tends to 1⁺, and the corresponding estimate sequence {β̂_i} of one coordinate is (0.1, . . . , 10⁻ᵐ, . . .); numerically one cannot determine whether the limit of β̂_i is equal to 0. However, taking the limit theoretically leads to a new algorithm for the Lasso, which is simple, straightforward and fast, as shown below.


[Figure 2.2: The algorithms. Solid is the function −d; dashed is S_j. The vertical axis in each bottom panel and the upper right panel has a scale of λ. Upper left (M-N-R): the dotted line represents the modification of −d to its tangent. Upper right (Shooting): S₀ > λ; the dotted line indicates the solution. Lower left (Shooting): |S₀| ≤ λ. Lower right (Shooting): S₀ < −λ; the dotted line indicates the solution.]


We introduce a new algorithm for the Lasso: the Shooting method.

(1). p = 1. (P3) reduces to the single equation

2x^T x β − 2x^T y = −λ sign(β).   (2.2)

Start with an initial estimate β̂₀, the OLS estimator. Shoot in the direction of slope 2x^T x from the point (β̂₀, 0) on the horizontal axis, as shown in Figure 2.2. If a point on the ceiling (−d = λ) is hit, as in the upper right panel, or a point on the floor (−d = −λ) is hit, as in the lower right panel, then equation (2.2) has a unique solution, which has a simple closed form and is equal to the Lasso estimator. If no point is hit, i.e. the shot passes through the window as in the lower left panel, equation (2.2) has no solution. One can take the limit of the Bridge estimator theoretically; it is easy to prove that lim_{γ→1⁺} β̂(λ, γ) = 0. Therefore, set β̂ = 0 for the Lasso estimator.

(2). p > 1. Start with an initial value β̂₀, the OLS estimator. At step m, compute β̂_m by updating β̂_j for fixed β̂_{−j} using (1), j = 1, . . . , p. Iterate until β̂_m converges. We

summarize the method as follows.

Shooting Algorithm for the Lasso

(1). Start with β̂₀ = β̂_ols = (β̂₁, . . . , β̂_p)^T.

(2). At step m, for each j = 1, . . . , p, let S₀ = S_j(0, β̂_{−j}, X, y) and set

β̂_j = (λ − S₀)/(2x_j^T x_j) if S₀ > λ;
β̂_j = (−λ − S₀)/(2x_j^T x_j) if S₀ < −λ;
β̂_j = 0 if |S₀| ≤ λ;

where x_j is the j-th column vector of X. Set β̂_m = (β̂₁, . . . , β̂_p)^T after updating all β̂_j.

(3). Repeat (2) until β̂_m converges.
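The Shooting algorithm is only a few lines in a modern array language. The following is a minimal NumPy sketch written from the description above (the thesis's own implementation is the FORTRAN subroutine of Appendix A):

```python
import numpy as np

def shooting_lasso(X, y, lam, max_iter=1000, tol=1e-8):
    """Shooting algorithm for the Lasso: coordinate-wise closed-form
    updates, started at the OLS estimator."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]     # beta_0 = OLS
    for _ in range(max_iter):
        beta_old = beta.copy()
        for j in range(X.shape[1]):
            xj = X[:, j]
            b = beta.copy()
            b[j] = 0.0
            S0 = 2.0 * xj @ (X @ b) - 2.0 * xj @ y  # S_j(0, beta_{-j}, X, y)
            if S0 > lam:                            # line hits the ceiling
                beta[j] = (lam - S0) / (2.0 * xj @ xj)
            elif S0 < -lam:                         # line hits the floor
                beta[j] = (-lam - S0) / (2.0 * xj @ xj)
            else:                                   # shot through the window
                beta[j] = 0.0
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta
```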


The convergence of the M-N-R algorithm and the Shooting algorithm is guaranteed by the following theorem.

Theorem 3 (Convergence of the Algorithms). Given fixed λ > 0 and γ ≥ 1, β̂_m in the modified Newton-Raphson algorithm (M-N-R) converges to the Bridge estimator of (P2), and β̂_m in the Shooting algorithm converges to the Lasso estimator of (P2).

Our experience tells us that both the M-N-R and the Shooting algorithms converge very fast, as can be perceived through the mechanism of the convergence in the mathematical proof.

2.4 Variance of the Bridge Estimator

Since the Bridge estimator (γ > 1) is the unique solution of problem (P3) and is almost surely non-zero, its variance

Var(β̂) = (∂β̂/∂y) Var(y) (∂β̂/∂y)^T

can be derived as follows from (P3) using the delta method, where the derivative ∂β̂/∂y is evaluated at an arbitrary fixed point y₀ in the sample space. The variance estimate can be obtained by plugging in β̂ for β̂(y₀) and replacing Var(y) with its estimate.

Denote F = (F₁, . . . , F_p)^T, where F_j = S_j(β̂, X, y) + d(β̂_j, λ, γ). Hence F = 0 by (P3). For the Gaussian distribution, ∂F/∂y = −2X^T and ∂F/∂β = 2X^T X + 2D(β), where D(β) is the diagonal matrix with entries λγ(γ − 1)|β_j|^{γ−2}/2. By the Implicit Function Theorem, ∂β̂/∂y = −(∂F/∂β)^{-1} ∂F/∂y. Therefore,

Var(β̂) = (X^T X + D(β̂))^{-1} X^T Var(y) X (X^T X + D(β̂))^{-1}.   (2.3)

Two special cases are worth mentioning.

1. OLS regression, i.e. λ = 0. The function D(β̂) becomes a zero matrix; thus

Var(β̂) = (X^T X)^{-1} X^T Var(y) X (X^T X)^{-1},

which is equal to Var(β̂_ols), the variance of the OLS estimator.

2. Ridge regression, i.e. γ = 2. The function D(β) = λI, where I is the identity matrix. Thus

Var(β̂) = (X^T X + λI)^{-1} X^T Var(y) X (X^T X + λI)^{-1},

which is equal to the variance of the ridge estimator.

Since the Lasso may set some β̂_j = 0, the delta method does not apply. However, the bootstrap or the jackknife method (Shao and Tu 1995) can be used to compute the variance. A good variance estimator for the non-zero β̂_j of the Lasso estimator can be found in Tibshirani (1996).
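Formula (2.3) with Var(y) = σ²I takes only a few lines to evaluate. A minimal sketch (mine), assuming a Bridge estimate beta with no zero coordinates has already been computed:

```python
import numpy as np

def bridge_variance(X, beta, lam, gamma, sigma2=1.0):
    """Delta-method variance (2.3) of the Bridge estimator (gamma > 1):
    (X'X + D)^{-1} X' Var(y) X (X'X + D)^{-1}, with
    D = diag(lam * gamma * (gamma - 1) * |beta_j|^(gamma - 2) / 2)."""
    D = np.diag(0.5 * lam * gamma * (gamma - 1.0) * np.abs(beta) ** (gamma - 2.0))
    M = np.linalg.inv(X.T @ X + D)
    return M @ (X.T @ X) @ M * sigma2              # with Var(y) = sigma2 * I
```

Setting λ = 0 recovers Var(β̂_ols), and γ = 2 gives D = λI, the ridge case, as noted above.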


2.5 Illustration of the Shrinkage Effect

Sections 2.2 and 2.3 give the estimator and the algorithms of Bridge regression, and Section 2.4 gives the variance of the Bridge estimator. In this section, we demonstrate how to solve for the Bridge (Lasso) estimator, and illustrate the shrinkage effect of Bridge regression through simple examples.

An example with orthonormal matrix X.

We consider a simple linear regression,

y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + ε,

with 40 observations, where the random error term ε has a normal distribution N(0, σ²). To make the regression matrix X orthonormal, we standardize the column vectors x_j of X by setting Σ_i x_{ij} = 0, j = 1, . . . , p, and x_j^T x_k = 0 for j ≠ k with ‖x_j‖ = 1. For simplicity, we set β₀ = 0 and σ² = 1. Forty observations of the response Y are generated from the model with true values β₁ = 1, β₂ = −2 and β₃ = 5. Since shrinkage has no impact on the intercept, the intercept is removed by centering Σ_i y_i = 0. For a fixed pair of λ > 0 and γ ≥ 1, each individual equation of (P3),

2x_j^T Xβ − 2x_j^T y = −λγ |β_j|^{γ−1} sign(β_j),

reduces to

2β_j − 2x_j^T y = −λγ |β_j|^{γ−1} sign(β_j)


for j = 1, . . . , p. The solution is then computed via the modified Newton-Raphson method for γ > 1 or the Shooting method for γ = 1. The standard errors are computed following the variance formula (2.3) in Section 2.4 for γ > 1, while the bootstrapping method (Efron and Tibshirani 1993) is used to compute the standard errors for γ = 1.

The estimates and standard errors for the different shrinkage functions are shown in Table 2.1. The standard errors of the Lasso estimator (γ = 1) are computed via 10000 bootstrap samples. In general, the parameter estimate and its standard error shrink monotonically with increasing λ for fixed γ. However, for the Lasso (γ = 1), the standard error of one coordinate does not show a monotonically decreasing trend with λ: it equals 0.163 at λ = 0 and 0.157 at λ = 10, but 0.354 at λ = 100. This is because the Lasso standard errors for λ > 0 are computed with a semi-parametric bootstrap method.

An example with non-orthogonal matrix X.

We consider a similar linear regression,

y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + ε,

with 40 observations, where the random error term ε has a normal distribution N(0, σ²). The regression matrix X is not orthonormal; it has the correlation coefficient matrix given in the thesis. [Matrix entries not recovered in transcription.] We standardize the column vectors x_j of X by setting Σ_i x_{ij} = 0 and Σ_i x_{ij}² = 1. For simplicity, we set β₀ = 0 and σ² = 1. Forty observations of the response Y are generated from the model with true values β₁ = 2, β₂ = 3 and β₃ = −1.


Table 2.1: The Bridge estimators and standard errors for orthonormal X. [Table entries not recovered in transcription.]


Table 2.2: The Bridge estimators and standard errors for non-orthogonal X. [Table entries not recovered in transcription.]


Since shrinkage has no impact on the intercept, the intercept is removed by centering Σ_i y_i = 0. For each pair of λ > 0 and γ ≥ 1, each individual equation of (P3) is

2x_j^T Xβ − 2x_j^T y = −λγ |β_j|^{γ−1} sign(β_j)

for j = 1, . . . , p. The solution is then computed iteratively by the modified Newton-Raphson method for γ > 1 or the Shooting method for γ = 1. The standard errors are computed following the variance formula (2.3) in Section 2.4 for γ > 1, while the bootstrapping method (Efron and Tibshirani 1993) is used to compute the standard errors for γ = 1.

The estimates and standard errors for the different shrinkage functions are shown in Table 2.2. The standard errors of the Lasso estimator (γ = 1) are computed via 10000 bootstrap samples. It can be observed that the monotonicity of the parameter estimate and its standard error does not hold in general for this case, as can be seen from individual coordinate estimates and their standard errors.

2.6 Bridge Regression for Orthonormal Matrix

In the last section, an example of Bridge regression with an orthonormal regression matrix X was given to illustrate the shrinkage effect. In this section, we study Bridge regression for an orthonormal regression matrix theoretically, and show the different shrinkage effects of different values of γ.

For an orthonormal matrix X = (x_{ij}),

X^T X = I.


It can be seen that problem (P3) simplifies to the p independent equations

2β_j − 2x_j^T y = −λγ |β_j|^{γ−1} sign(β_j),   (2.4)

for j = 1, . . . , p. The solution is then computed via the modified Newton-Raphson method for γ > 1 or via the Shooting method for γ = 1. To study the shrinkage effect of different values of γ, we compare the Bridge estimator, the solution of each single equation of (2.4), with the OLS estimator. Without causing any confusion, we omit the subscript j of β_j and x_{ij} for simplicity.

Notice that equation (2.4) can be written as

β̂ = x^T y − (λγ/2) |β̂|^{γ−1} sign(β̂).

The first term on the right hand side is equal to the OLS estimator; the second term is due to the shrinkage and thus reflects the shrinkage effect. Therefore,

β̂_brg = β̂_ols − (λγ/2) |β̂_brg|^{γ−1} sign(β̂_brg).

To show the shrinkage effect of Bridge regression, we plot the absolute value of the Bridge estimator β̂_brg and compare it with the OLS estimator, whose absolute value is plotted on the diagonal, as shown in Figure 2.3. It shows clearly that the Lasso (γ = 1) shrinks small OLS estimates to zero and large ones by a constant; ridge regression (γ = 2) shrinks the OLS estimates proportionally; Bridge regression with 1 < γ < 2 shrinks small OLS estimates at a large rate and large ones at a small rate; and Bridge regression with γ > 2 shrinks small OLS estimates at a small rate and large ones at a large rate. In summary, Bridge regression with a large value of γ (γ ≥ 2) tends to retain small parameters, while a small value of γ (γ < 2) tends to shrink small parameters to zero.


[Figure 2.3: Shrinkage effect of Bridge regressions for fixed λ > 0; panels for γ = 1, 1 < γ < 2, γ = 2 and γ > 2, with |β̂_ols| on the horizontal axis. Solid: the Bridge estimator; dashed: the OLS estimator.]


Therefore, it can be implied that if the true model includes many small but non-zero regression parameters, the Lasso will perform poorly while the Bridge with a large γ value will perform well. If the true model includes many zero parameters, the Lasso will perform well while the Bridge with a large γ value will perform poorly. Tibshirani (1996) obtained similar results by comparing the Lasso with the ridge through intensive simulation studies.

2.7 Bridge Penalty as Bayesian Prior

In this section, we study the Bridge penalty function λ Σ_j |β_j|^γ as a Bayesian prior distribution of the parameter β = (β₁, . . . , β_p)^T.

From the Bayesian point of view, Bridge regression,

min_β [RSS + λ Σ_j |β_j|^γ],

can be regarded as maximizing the log posterior distribution

log P(β | y) = −(1/2)[RSS + λ Σ_j |β_j|^γ] + C,

where C is a constant (taking σ² = 1 for simplicity). Thus the Bridge penalty λ Σ_j |β_j|^γ can be regarded as the logarithm of the prior distribution

π(β) = C₀ exp(−λ Σ_j |β_j|^γ / 2)

of the parameter β = (β₁, . . . , β_p)^T, where C₀ > 0 is a normalization constant. Since the exponent of the prior is a summation, the parameters β₁, . . . , β_p are mutually independent and identically distributed. We thus omit the subscript j and study the prior C exp(−λ|β|^γ/2) of β only.

By simple algebra,

∫ exp(−λ|β|^γ/2) dβ = (2/γ) Γ(1/γ) (λ/2)^{−1/γ},

where Γ(·) is the gamma function. Thus the probability density function of β is

p_{λ,γ}(β) = [γ (λ/2)^{1/γ} / (2 Γ(1/γ))] exp(−λ|β|^γ/2),

where λ^{−1/γ} controls the window size of the density. In particular, when γ = 2, β has a Gaussian distribution; therefore, the posterior distribution of (β | Y) is also Gaussian if Y has a Gaussian distribution. This is a very special property of the ridge estimator for linear regressions.
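The normalizing constant above follows from the substitution u = (λ/2)|β|^γ; the computation, written out (my derivation, not displayed in the transcription):

```latex
\int_{-\infty}^{\infty} e^{-\lambda|\beta|^{\gamma}/2}\,d\beta
  = 2\int_{0}^{\infty} e^{-(\lambda/2)\beta^{\gamma}}\,d\beta
  = \frac{2}{\gamma}\left(\frac{\lambda}{2}\right)^{-1/\gamma}
    \int_{0}^{\infty} u^{1/\gamma-1} e^{-u}\,du
  = \frac{2}{\gamma}\,\Gamma\!\left(\frac{1}{\gamma}\right)
    \left(\frac{\lambda}{2}\right)^{-1/\gamma}.
```

Setting γ = 2 recovers the N(0, 1/λ) density, consistent with the Gaussian case noted above.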

To compare the penalty functions for different values of λ and γ, we plot the density function p_{λ,γ}(β), as shown in Figures 2.4 (λ = 1), 2.5 (λ = 0.5) and 2.6 (λ = 10). It can be observed that, for fixed small values of λ as in Figures 2.4 and 2.5, small values of γ put much mass on the tails, so the density has a large window size and tends to be flat, while large values of γ put much mass in the center around β = 0, so the density has a small window size and is less spread out. However, for fixed large values of λ, as shown in Figure 2.6, the window size does not change much, since λ^{−1/γ} is less than 1 and converges to 1 very fast as γ increases. While small values of γ put much mass very close to β = 0, with a peak at β = 0, large values of γ tend to distribute the mass evenly in the window of the density. When γ = 2, the density is a Gaussian density.

[Figure 2.4: Bridge penalty as a Bayesian prior with λ = 1; panels for γ = 1, 1.5, 2 and 4.]

[Figure 2.5: Bridge penalty as a Bayesian prior with λ = 0.5; panels for γ = 1, 1.5, 2 and 4.]

[Figure 2.6: Bridge penalty as a Bayesian prior with λ = 10; panels for γ = 1, 1.5, 2 and 4.]

It can thus be implied that for a fixed small value of λ, the Bridge penalty with a small γ value favors models with large values of regression parameters, while the Bridge penalty with a large γ value favors models with small but non-zero values of regression parameters. For a fixed large value of λ, the Bridge penalty with a small γ value favors models with many zero regression parameters, while the Bridge penalty with a large γ value favors models with small but non-zero values of regression parameters. Especially, the Lasso (γ = 1) with large λ favors models with many zero parameters, and the Lasso with small λ favors models with large parameters. This result agrees with the conclusion for the orthonormal regression matrix in the last section.

2.8 Relation between Tuning Parameters λ and t

In Section 2.2, we claimed that problems (P1) and (P2) are equivalent, i.e. for given λ ≥ 0 there exists a t ≥ 0 such that (P1) and (P2) share the same solution, and vice versa. In this section, we study this relationship between λ and t for the special case of an orthonormal matrix X.

Notice that for fixed γ ≥ 1, the constrained area of (P1) is convex, as shown in Figure 1.1. Hence the Bridge estimator is attained on the boundary of the constraint, which implies that t(λ) = Σ_j |β̂_j(λ, γ)|^γ for fixed λ ≥ 0.

With an orthonormal matrix X, (P3) simplifies to the p independent equations

2β_j − 2 Σ_i x_{ij} y_i = −λγ |β_j|^{γ−1} sign(β_j),

as shown in Section 2.6. Since Σ_i x_{ij} y_i = β̂_ols,j, the j-th coordinate of the OLS estimator, the Bridge estimator β̂ = (β̂₁, . . . , β̂_p)^T satisfies

β̂_j + (λγ/2) |β̂_j|^{γ−1} sign(β̂_j) = β̂_ols,j.   (2.5)

Denoting c_j = β̂_ols,j and s_j = β̂_j / c_j, the ratio of the Bridge estimate to the OLS estimate, one has

t(λ) = Σ_j |β̂_j|^γ = Σ_j |c_j|^γ s_j^γ,

where the s_j can be determined by solving the equations

s_j + (λγ/2) |c_j|^{γ−2} s_j^{γ−1} = 1, j = 1, . . . , p,

derived from (2.5). Therefore, t(λ) can be computed by substituting the s_j into the formula above. For the special case where c_j = c, a constant independent of j, s_j = s is also independent of j, and

t(λ) = (2p/(λγ)) c² s(1 − s).

Figure 2.7 shows the function t(λ), computed for the special case where c_j = 1 with p = 2, for different γ values: γ = 1, 1.5, 2, 10. It demonstrates the one-to-one correspondence between t and λ. For this case, the threshold of t is t₀ = Σ_j |β̂_ols,j|^γ = p = 2, such that any t ≥ t₀ yields β̂(t) = β̂_ols. The threshold value of λ for the Lasso to set all β̂_j = 0 is λ₀ = 2, such that any λ ≥ λ₀ yields β̂_j(λ) = 0. It can be seen clearly from Figure 2.7 that t(λ) is a monotonically decreasing function of λ for fixed γ ≥ 1. For γ > 1, λ has to go to infinity in order to shrink all β̂_j to 0. However, for γ = 1, any λ with λ ≥ λ₀ = 2 shrinks all β̂_j to 0; consequently, t(λ) = 0.

Page 49: STATISTICAL SHRINKAGE MODEL AND ITS APPLICATIONS · The aim of statistical analysis is to identify the risk factors that contribute significantly to the presence or the occurrence

[Figure 2.7: Relation between shrinkage parameters λ and t for orthonormal matrix X; plot of the function t(λ).]
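The curve t(λ) is easy to recompute from the equations above. A minimal sketch (mine) for the case c_j = c: solve s + (λγ/2)|c|^{γ−2} s^{γ−1} = 1 for s ∈ (0, 1], then evaluate t = p|c|^γ s^γ:

```python
import numpy as np
from scipy.optimize import brentq

def t_of_lambda(lam, gamma, p=2, c=1.0):
    """t(lambda) for orthonormal X with equal OLS coordinates c_j = c."""
    if gamma == 1.0 and lam >= 2.0 * abs(c):   # Lasso past lambda_0: all beta_j = 0
        return 0.0
    f = lambda s: s + 0.5 * lam * gamma * abs(c) ** (gamma - 2.0) * s ** (gamma - 1.0) - 1.0
    s = brentq(f, 1e-12, 1.0)                  # ratio of Bridge to OLS estimate
    return p * abs(c) ** gamma * s ** gamma

for lam in (0.1, 0.5, 1.0, 2.0, 5.0):
    row = [round(t_of_lambda(lam, g), 3) for g in (1.0, 1.5, 2.0, 10.0)]
    print(f"lambda = {lam:3.1f}   t = {row}")
```

For γ = 1 the curve hits t = 0 at the threshold λ₀ = 2|c|, while for γ > 1 it only approaches 0 as λ → ∞, matching the behaviour described above.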

Page 50: STATISTICAL SHRINKAGE MODEL AND ITS APPLICATIONS · The aim of statistical analysis is to identify the risk factors that contribute significantly to the presence or the occurrence

Chapter 3

Penalized Score Equations

Page 51: STATISTICAL SHRINKAGE MODEL AND ITS APPLICATIONS · The aim of statistical analysis is to identify the risk factors that contribute significantly to the presence or the occurrence

3.1 Introduction

In Chapter 2, I obtained some theoretical results on the Bridge estimators through Theorems 1 and 2, and developed a general approach to solve for the Bridge estimators via (P3), i.e. the modified Newton-Raphson method for γ > 1 and the Shooting method for γ = 1. In this chapter, I proceed further in theory, introduce penalized score equations, and thus generalize the concept of penalization. The algorithms for the penalized score equations are given by the modified Newton-Raphson method and the Shooting method via the iteratively reweighted least-squares (IRLS) procedure. First, I review generalized linear models, likelihood functions and quasi-likelihood.

3.2 Generalized Linear Models and Likelihood

In many applied sciences, the response of interest is not a continuous variable ranging from negative to positive, like the temperature in Celsius. The response can be a proportion (a fraction between 0 and 1), a number of subjects (a positive integer), the presence or absence of an event (dichotomous), a degree of pain: none, mild, moderate, severe (polytomous), etc. Since the response is not continuous, or the range of the response is not (−∞, +∞), a linear model like

Y = β₀ + β₁x₁ + ··· + β_p x_p + ε

may not be appropriate.

Nelder and Wedderburn (1972) introduced generalized linear models, a natural extension of the linear regression model to a more general class of response variables that

Page 52: STATISTICAL SHRINKAGE MODEL AND ITS APPLICATIONS · The aim of statistical analysis is to identify the risk factors that contribute significantly to the presence or the occurrence

have a distribution of the exponential family

A generalized linear model (GLM) has three components:

1. The random component: the components of $Y = (Y_1, \ldots, Y_n)^T$ are mutually independent and have an identical distribution in the exponential family with mean $E(Y) = \mu$ and variance $V(\mu)$.

2. The systematic component: covariates $x_1, x_2, \ldots, x_p$ produce a linear predictor $\eta = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$.

3. The link between the random and systematic components: $\eta = g(\mu)$, where $g(\cdot)$ is a monotone differentiable function called the link function. Hence a GLM can be written as

$$g(\mu) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p.$$

The most popular types of responses and their canonical link functions are the Gaussian response with identity link $g(\mu) = \mu$, the binomial response with logit link $g(\mu) = \log\{\mu/(1-\mu)\}$, and Poisson counts with log link $g(\mu) = \log(\mu)$, etc. Inference on the parameter $\beta = (\beta_1, \ldots, \beta_p)^T$ is based on the likelihood function $L(\beta)$

and the maximum likelihood estimator (MLE) $\hat\beta_{mle}$, which is defined as

$$\hat\beta_{mle} = \arg\max_\beta L(\beta).$$

The MLE $\hat\beta_{mle}$ can be computed via the Newton-Raphson method, the Fisher scoring method or the iteratively reweighted least-squares method, described below.

By large sample theory, the MLE $\hat\beta_{mle}$ is asymptotically consistent under regularity conditions (McCullagh and Nelder 1989),

$$\sqrt{n}\,(\hat\beta_{mle} - \beta) \to N\!\left(0,\; I(\beta)^{-1}\right),$$

where $I(\beta)$ is the Fisher information matrix defined as

$$I(\beta) = -E\left[\frac{\partial^2 l(\beta)}{\partial\beta\,\partial\beta^T}\right],$$

and $l(\beta) = \log L(\beta)$, the log-likelihood function.

To solve for the MLE $\hat\beta_{mle}$, we take the partial derivatives of the log-likelihood function $l(\beta)$ with respect to $\beta$; $\hat\beta_{mle}$ must satisfy the following equations

$$\frac{\partial l(\beta)}{\partial\beta_j} = 0, \qquad j = 1, \ldots, p.$$

$\partial l/\partial\beta_j$ is called the score function of the likelihood $L(\beta)$.

Newton-Raphson method

Taking a Taylor expansion of the score functions $\partial l(\beta)/\partial\beta$ around the current estimate $\beta^{(m)}$ and ignoring the quadratic term, one has a linear approximation to the score equations, and $\hat\beta_{mle}$ can be computed by the iterative formula

$$\beta^{(m+1)} = \beta^{(m)} - \left[\frac{\partial^2 l(\beta)}{\partial\beta\,\partial\beta^T}\right]^{-1}\frac{\partial l(\beta)}{\partial\beta}\,\Bigg|_{\beta = \beta^{(m)}}. \qquad (3.2)$$

The iteration continues until the convergence of the estimate $\beta^{(m)}$ or of the deviance

$$D(y, \hat\mu) = 2\,\left[\,l(\mu_{sat};\, y) - l(\hat\mu;\, y)\,\right],$$

where $\mu_{sat}$ is the mean of the response under the saturated model and is usually equal to the observed response $y$.

Fisher scoring method

Replacing the observed information matrix

$$-\frac{\partial^2 l(\beta)}{\partial\beta\,\partial\beta^T}$$

in (3.2) of the Newton-Raphson method with the expected information matrix

$$E_\theta\left[-\frac{\partial^2 l(\beta)}{\partial\beta\,\partial\beta^T}\right],$$

where $\theta$ is assumed to be the true value of the parameter, one obtains the Fisher scoring method to solve for the MLE $\hat\beta_{mle}$, where $E_\theta(\cdot)$ depends on $\beta$ only through $\theta$. This simplifies the computation. The observed and expected Fisher information matrices are identical if $Y$ follows a distribution in the exponential family with a canonical link function. Therefore, the Fisher scoring method coincides with the Newton-Raphson method (McCullagh and Nelder 1989, Hastie and Tibshirani 1990).

Iteratively reweighted least-squares (IRLS) method

Green (1984) introduced the following IRLS method to compute the MLE by taking a linear expansion of the link function (McCullagh and Nelder 1989). An adjusted dependent variable $z = \eta + (y - \mu)/V(\mu)$ is then defined for canonical links, where $\eta$ is the linear predictor and $V(\mu)$, a function of the mean $\mu$, is the variance of $Y$. The MLE can then be computed by regressing $z$ on the matrix $X$ with weights $V(\mu)$. The IRLS procedure can be outlined as follows.

The IRLS procedure

1. Start with an initial estimate $\hat\beta_0$;
2. Compute $\eta = X\hat\beta$ and weights $V(\mu) = \mathrm{diag}(V_1(\mu_1), \ldots, V_n(\mu_n))$;
3. Define the adjusted dependent variable $z = \eta + [V(\mu)]^{-1}(y - \mu)$;
4. Regress $z$ on $X$ with weights $V(\mu)$ to obtain a new estimate $\hat\beta$;
5. Iterate steps 2-4 until convergence is achieved.

An advantage of the IRLS procedure over the Newton-Raphson method or the Fisher scoring method is that it can be implemented through a weighted least-squares procedure with no extra effort, since weighted least-squares is a standard procedure and is easy to implement in most statistical software.
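For concreteness, the following is a minimal sketch of the five IRLS steps above for logistic regression (binomial response with the canonical logit link, where $V(\mu) = \mu(1-\mu)$). The code is illustrative Python, not part of the thesis, and the function name is hypothetical.

```python
import numpy as np

def irls_logistic(X, y, tol=1e-8, max_iter=50):
    # IRLS for logistic regression with the canonical logit link.
    n, p = X.shape
    beta = np.zeros(p)                      # step 1: initial estimate
    for _ in range(max_iter):
        eta = X @ beta                      # step 2: linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))     # inverse logit
        v = mu * (1.0 - mu)                 # weights V(mu)
        z = eta + (y - mu) / v              # step 3: adjusted dependent variable
        W = np.diag(v)                      # step 4: weighted least squares
        beta_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ z)
        if np.max(np.abs(beta_new - beta)) < tol:   # step 5: convergence check
            return beta_new
        beta = beta_new
    return beta
```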


3.3 Quasi-Likelihood and Quasi-Score Functions

In the last section, we briefly reviewed generalized linear models and the distributions of the exponential family. Very often, when a probability function is specified, the likelihood function can be constructed and the MLE can be computed easily. However, in certain cases, it is either not necessary to specify the entire probability distribution, and thus the joint likelihood function, or not possible to specify the joint likelihood function.

Wedderburn (1974) introduced quasi-likelihood, which extends generalized linear models beyond the exponential family of distributions. A quasi-likelihood requires only that the variance of the random variable be a known function of the mean, $V(\mu)$, without specifying that the distribution is from the exponential family. First, a quasi-score of one dimension is defined as

$$U(\mu, y) = \frac{y - \mu}{V(\mu)}. \qquad (3.5)$$

$U(\mu, y)$ satisfies the three fundamental properties of the ordinary score functions of a likelihood function:

$$E[U(\mu, Y)] = 0,$$

$$\mathrm{Var}[U(\mu, Y)] = \frac{1}{V(\mu)},$$

and

$$-E\left[\frac{\partial U(\mu, Y)}{\partial\mu}\right] = \frac{1}{V(\mu)}.$$

Hence the integral

$$Q(\mu, y) = \int_y^\mu \frac{y - t}{V(t)}\, dt, \qquad (3.6)$$

if it exists, has similar properties to those of a log-likelihood function. For example, with $V(\mu) = \mu$, $Q(\mu, y) = y\log\mu - \mu$ up to a function of $y$ alone, the Poisson log-likelihood kernel.


We study the quasi-likelihood for the following two cases.

1. Independent observations

Since the observations are independent, the variance-covariance matrix is diagonal,

$$V(\mu) = \mathrm{diag}(V_1(\mu_1), \ldots, V_n(\mu_n)),$$

where the functions $V_1, \ldots, V_n$ are identical. The quasi-score in (3.5) is well defined, and so is the quasi-likelihood function in (3.6). The quasi-likelihood function $Q(\mu, y)$ plays the same role as the ordinary log-likelihood function in generalized linear models. Inference can be made based on the quasi-likelihood estimator satisfying the quasi-score equations $\partial Q(\mu, y)/\partial\beta_j = 0$, $j = 1, \ldots, p$. Similar to the MLE of generalized linear models, the quasi-likelihood estimator can be computed through the Fisher scoring method. This estimator is also asymptotically consistent under regularity conditions.

2. Dependent observations

Since the observations are dependent, the variance-covariance matrix $V(\mu)$ is no longer diagonal. In general, the quasi-score $U = (U_1, \ldots, U_p)^T$ has asymmetric partial derivatives, $\partial U_j/\partial\mu_k \neq \partial U_k/\partial\mu_j$, which implies that the vector field defined by the quasi-score $U(\mu, y)$ is path dependent.

be the quasi-scores as if it existed. Therefore, the integral Q(p. y) in (3.6) is path de-

pendent and is not weI1 defined. In such a case. inference can not b e made based on the

function Q(pt y). One would rather use the quasi-score function U ( p . y): which satisfies

the t hree fundamental properties of the log-likelihood functions as pointed out previ-

ously. The asymptotic consistency also holds under some rather complicated conditions

( McCullagh 199 1 ).

Since the matrix of expected values of the partial derivatives of the quasi-score function $U(\mu, y)$ is symmetric while the matrix of partial derivatives itself is not, McCullagh (1991) pointed out the possibility of a decomposition of $U(\mu, y)$ into two terms: a main term with symmetric partial derivatives and a small "noise" term with asymmetric partial derivatives. Such a decomposition allows the study of the quasi-scores $U(\mu, y)$ via the quasi-likelihood of the first term without losing much information. Li and McCullagh (1994) studied potential functions and conservative estimating functions. They projected the estimating functions onto a subspace of conservative estimating functions, in which the estimating functions have symmetric partial derivatives and thus admit a quasi-likelihood function. This quasi-likelihood is named the potential function of the estimating function.

The estimating functions are a broad class of functions whose equations yield the parameter estimators. The quasi-score functions are a special class of estimating functions: they are linear in $y$ and yield an asymptotically consistent estimator. The potential functions have properties asymptotically similar to those of the ordinary log-likelihood functions, as pointed out by Li and McCullagh, and thus may help to determine the desired solution from among possible multiple solutions of the quasi-score equations.

3.4 Penalized Score Equations

In previous sections, I reviewed generalized linear models, likelihood functions, score functions and quasi-likelihood. As a generalization of the likelihood function, quasi-likelihood focuses on the first two moments and the relation between them without specifying the entire likelihood. Similarly, one can generalize penalization via penalized score equations without specifying the entire likelihood. In order to introduce penalized score equations, we consider the results of Theorems 1 and 2 in Chapter 2. First, I give some remarks.

Remarks

1. Problem (P3) and its solution are independent of joint likelihood functions. Notice that no assumption is made in Theorem 1 on joint likelihood functions.

2. Theorems 1 and 2 apply to all distributions that have a concave joint likelihood function, particularly the most popular Gaussian, Poisson and binomial distributions in the exponential family. The Jacobian condition is satisfied by the minus gradient of the concave joint likelihood functions.

3. Theorem 2 implies that if the Bridge (Lasso) estimator for $\gamma \ge 1$ is defined to be the unique solution of (P3), it does not conflict with the Bridge estimator of (P2'). Therefore, one can regard the unique solution of (P3) as the Bridge estimator.

It can be implied from the above remarks that if a joint likelihood function exists, (P3) can be solved to obtain the Bridge estimator of (P2'); if no joint likelihood function exists, (P3) can still be solved to obtain the unique solution as long as the Jacobian condition is satisfied, although problem (P2') does not apply in such a case. Hence, one can always start with problem (P3) to solve for the estimator regardless of the existence of a joint likelihood function. Therefore, the concept of penalization and its estimator are generalized to be independent of joint likelihood functions. We now introduce penalized score equations.

Consider the equations

$$S_j(\beta) + \lambda\gamma\,|\beta_j|^{\gamma-1}\,\mathrm{sign}(\beta_j) = 0, \qquad j = 1, \ldots, p. \qquad (3.8)$$

Definition 1 (Penalized Score Equations)

Equation (3.8), with the function $S = (S_1, \ldots, S_p)^T$ satisfying the Jacobian condition that $\partial S/\partial\beta$ is positive-semi-definite, is called the penalized score equations with the Bridge penalty $\lambda\sum_j |\beta_j|^\gamma$.

Definition 2 (Bridge Estimator)

Given $\lambda > 0$ and $\gamma > 1$, the Bridge estimator is defined to be $\hat\beta(\lambda, \gamma)$, the unique solution of Equation (3.8); the Lasso estimator is defined to be $\hat\beta(\lambda, 1)$, the limit of $\hat\beta(\lambda, \gamma)$ as $\gamma \to 1^+$.

Through Definitions 1 and 2, the concept of penalization and its estimator are generalized. In fact, they can be further generalized for different penalty functions as follows.


Remarks

1. The concept of penalized score equations can be extended in general to a penalty of $\sum_j g(\beta_j)$, where $g$ is a smooth convex function. One can define penalized score equations of the form (3.8) with the partial derivatives of different penalty functions.

2. The Bridge (Lasso) estimator defined in Definition 2 is independent of joint likelihood functions. It can thus be applied to cases in which no joint likelihood function exists.

The penalized score equations approach is broad compared to the classical approach to penalization, which minimizes the deviance, i.e. $-2\log(Lik)$, plus a penalty function. Such a generalization is crucial to circumvent the difficulty of the non-existence of joint likelihood functions in regression problems where penalization is desirable due to highly correlated regressors. One major application is to apply this method to the GEE, for which no joint likelihood function exists in general. By solving the penalized GEE for the Bridge (Lasso) estimator, one can achieve better predictions overall when collinearity is present among regressors; see Chapter 4 for the algorithm and Chapter 6 for simulation results.

3.5 Algorithms for Penalized Score Equations

In Section 3.4, penalized score equations were introduced theoretically. To solve for the Bridge estimators, the modified Newton-Raphson algorithm and the Shooting algorithm were developed in Section 2.3. For Gaussian responses, the above algorithms can be applied directly. For non-Gaussian responses, the methods have to be applied via the IRLS procedure as follows.


Algorithm for the Bridge (Lasso) Estimator via the IRLS Procedure

1). Start with an initial value $\hat\beta_0$;
2). Define the adjusted dependent variable $z$ based on the current estimate $\hat\beta$: $z = X\hat\beta + [V(\mu)]^{-1}(y - \mu)$;
3). Apply the M-N-R (Shooting) method to the linear regression of $Wz$ on $WX$ to update $\hat\beta$, where $W = V^{-1/2}$;
4). Repeat steps 2) and 3) till convergence of $\hat\beta$ is achieved.
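As a concrete illustration of the Shooting step in 3), the following is a minimal sketch (in Python; the function name is hypothetical) of the coordinate-wise Shooting update for the Lasso case $\gamma = 1$, applied to a Gaussian working regression minimizing $\|y - X\beta\|^2 + \lambda\sum_j|\beta_j|$.

```python
import numpy as np

def shooting_lasso(X, y, lam, tol=1e-8, max_iter=1000):
    # Coordinate-wise Shooting sketch for the Lasso (gamma = 1):
    # minimize ||y - X b||^2 + lam * sum_j |b_j|.
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # start from the (min-norm) OLS fit
    xtx = np.sum(X ** 2, axis=0)                  # x_j' x_j for each column
    for _ in range(max_iter):
        beta_old = beta.copy()
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]    # residual excluding x_j
            s0 = -2.0 * (X[:, j] @ r_j)               # penalized score at beta_j = 0
            if s0 > lam:
                beta[j] = (lam - s0) / (2.0 * xtx[j])
            elif s0 < -lam:
                beta[j] = -(lam + s0) / (2.0 * xtx[j])
            else:
                beta[j] = 0.0                         # soft-thresholded to zero
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta
```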

Here, I would like to point out that even if no joint likelihood function exists, one can still apply the modified Newton-Raphson method or the Shooting method to obtain the Bridge (Lasso) estimator as long as the Jacobian condition is satisfied. The convergence of the above algorithm is guaranteed by the following theorem.

Theorem 4 (Convergence of the Algorithms)

Given fixed $\lambda > 0$, if $\partial S/\partial\beta$ is positive-definite, then

(1) the modified Newton-Raphson algorithm converges to the Bridge estimator of (P3) for $\gamma > 1$;

(2) the Shooting algorithm converges to the Lasso estimator of (P3) for $\gamma = 1$.

As pointed out in Section 2.3, the modified Newton-Raphson and Shooting algorithms converge very fast, even when combined with the IRLS procedure.


Chapter 4

Penalized GEE


4.1 Introduction

In public health studies. investigators often observe a series of observations of interest

along time. For example, in an asthmatic study. each of the subjects in the study is

monitored for a period of time, Say one year. The subject's asthmatic status is observed

at each visit, along wit h some factors, like quality of the air in the surrounding area where

the subject lives? the season, temperature, and humidity. etc. Very often. the main interest

of the investigator is to find the relation between the response variable. like the asthmatic

status, and a set of explanatory variables, like quality of the air, humidit- temperature,

etc. Such type of studies is in a special statistical setting, called longitudinal studies. and

the goal is to identify the dependence of the time trend of the response on explanatory

variables.

During the past two decades, longitudinal studies have attracted attention from many statisticians and public health researchers, and their applications can be found in many research areas, for example, medical studies, environmental studies and psychological studies (Laird and Ware 1982, Liang, Zeger and Qaqish 1993). Statistical methods for longitudinal studies include random effects models, conditional Markov chain models, and the generalized estimating equations method, etc. (Diggle, Liang and Zeger 1993). In this chapter, I focus on the generalized estimating equations method and apply penalization via the penalized score equations approach when collinearity is present among explanatory variables.


4.2 Generalized Estimating Equations

Consider a longitudinal study of $K$ subjects. Each subject has a series of observations, the response variable $y_{it}$ and a vector of predictors $x_{it}$, where $i = 1, \ldots, K$ and $t = 1, \ldots, n_i$ for the $i$-th subject. When investigators are mainly interested in the effect of the explanatory variables on the response variable, Liang and Zeger (1986) and Zeger and Liang (1986) proposed the following generalized estimating equations (GEE) based on the marginal distribution of the response $y_{it}$, $f(y_{it}) = \exp[\{y_{it}\theta_{it} - a(\theta_{it}) + b(y_{it})\}\phi]$:

$$\sum_{i=1}^K D_i^T V_i^{-1} S_i = 0, \qquad (4.1)$$

where

$$V_i = A_i^{1/2} R_i(\alpha) A_i^{1/2}/\phi$$

is the working covariance matrix, and $D_i = \partial\{a'(\theta_i)\}/\partial\beta = A_i \Delta_i X_i$, $\Delta_i = \mathrm{diag}(d\theta_{it}/d\eta_{it})$, $\eta_{it} = x_{it}^T\beta$, $A_i = \mathrm{diag}(a''(\theta_{it}))$, and $S_i = y_i - a'(\theta_i)$. To incorporate the correlation of observations from the same subject, the GEE assumes a certain correlation structure by specifying a working correlation matrix $R(\alpha)$. It has been shown that the estimator of (4.1) is consistent as $K$ tends to infinity even if the working correlation matrix $R(\alpha)$ is specified incorrectly:

$$\sqrt{K}\,(\hat\beta - \beta) \to N(0, V),$$

where

$$V = \left(\sum_i D_i^T V_i^{-1} D_i\right)^{-1} \left(\sum_i D_i^T V_i^{-1}\,\mathrm{Cov}(y_i)\, V_i^{-1} D_i\right) \left(\sum_i D_i^T V_i^{-1} D_i\right)^{-1} \qquad (4.2)$$

is called the "sandwich estimator" of the variance. The efficiency will be improved if the correlation matrix is specified correctly. Detailed discussions can be found in Liang and Zeger (1986) and Zeger and Liang (1986).

We consider the GEE in a regression setting. As in linear regression, the potential problem of collinearity also occurs: if the explanatory variables in the GEE model are close to collinear, the variance of the estimator will be large and predictions based on the estimator may perform poorly. Therefore, penalization is desirable, as shown in the previous chapters. However, the classical approach to penalization, for example Bridge regression, requires the existence of a joint likelihood function, as discussed in Chapter 3. Since the GEE assumes a special structure of the correlation, there does not exist a joint likelihood function in general such that its score functions (the partial derivatives with respect to $\beta_j$) would be the estimating functions of the GEE (McCullagh and Nelder 1989, McCullagh 1991). Such difficulty hinders the implementation of penalization for the GEE.

The penalized score equations approach generalizes penalization and provides the techniques to handle the collinearity problem in the GEE, since the penalized score equations do not depend on joint likelihood functions and can easily be applied via the iteratively reweighted least-squares (IRLS) procedure. In the following, I apply the penalized score equations to the GEE and solve the penalized GEE to achieve better estimation and prediction.

4.3 Penalized GEE

Since it was proposed, the GEE method has been widely used in longitudinal studies. Although the GEE estimator is asymptotically consistent and efficient, one may encounter situations in which the explanatory variables of interest are collinear or close to collinear, especially when a large number of explanatory variables are involved. This raises a question about the accuracy of estimation and prediction based on the parameter estimator of (4.1).

It is well known that penalization provides techniques to handle the collinearity problem in linear regression. The classical approach to penalization is to minimize the deviance of the model plus a penalty function. For example, if the joint likelihood function is $L(\beta)$, then the penalization problem is

$$\min_\beta \left(-2\log L(\beta) + \lambda\sum_j |\beta_j|^\gamma\right)$$

for Bridge penalization.

However, there does not exist a joint likelihood function $L(\beta)$ for the GEE in general, as discussed in the last section. To apply penalization to the GEE, one needs special techniques that do not depend on joint likelihood functions. The penalized score equations approach provides such a technique and thus serves the need.

In the following, I apply the Bridge penalty to the GEE. Similarly, one can consider other types of penalty functions as discussed in Chapter 3.

Consider the following equations in the format of (P3):

$$S_j(\beta) + \phi(\beta_j, \lambda, \gamma) = 0, \qquad j = 1, \ldots, p, \qquad (4.3)$$

where the function $\phi(\beta_j, \lambda, \gamma) = \lambda\gamma|\beta_j|^{\gamma-1}\mathrm{sign}(\beta_j)$ and the $S_j$'s are the minus estimating functions of the GEE, or the minus score functions of some joint likelihood function, if it exists. Therefore, it is natural to impose the Jacobian condition on the function $S = (S_1, \ldots, S_p)^T$: $\partial S/\partial\beta$ is positive-semi-definite.

Consider the estimating functions on the left-hand side of the GEE (4.1). Take the partial derivative of the minus estimating function with respect to $\beta$ and denote the derivative by $H$,

$$H = -\sum_{i=1}^K \frac{\partial (D_i^T V_i^{-1})}{\partial\beta}\, S_i + \sum_{i=1}^K D_i^T V_i^{-1} D_i, \qquad (4.4)$$

since the partial derivative $\partial S_i/\partial\beta = -D_i$. By the regularity conditions (Liang and Zeger 1986), $\partial(D_i^T V_i^{-1})/\partial\beta$ is bounded. Since the $S_i$, $i = 1, \ldots, K$, are mutually independent with expected value $E(S_i) = 0$ and finite variance $\mathrm{Var}(S_i) \le C < \infty$, where $C$ is a large constant independent of $i$, by the Weak Law of Large Numbers (Durrett, 1991, page 29) the first term of (4.4), scaled by $1/K$, converges to 0 in $L_2$ and in probability as $K$ tends to infinity, while the second term, a positive definite matrix, scaled by $1/K$, converges to a positive-semi-definite matrix. Hence $H/K$, and thus $H$, satisfies a weak form of the Jacobian condition of Theorem 2.1 for sufficiently large values of $K$. Therefore, the existence and uniqueness of the solution of problem (4.3) are guaranteed. This implies that the Bridge estimator of (4.3) is well defined. One can penalize the GEE via the penalized score equations approach. The penalized GEE shrinks the GEE estimator towards 0 to achieve small variance and better prediction when collinearity is present among regressors.

To solve for the estimator of the penalized GEE, one follows the procedure in Liang and Zeger (1986) and applies penalization to the weighted least-squares step in the iteratively reweighted least-squares (IRLS) procedure. The algorithm is outlined as follows.

Algorithm for the Penalized GEE

(1). Start with an initial value $\hat\beta_0$.
(2). Estimate the parameters $\alpha$, $\phi$ and the working correlation matrix $R(\alpha)$ using Pearson or deviance residuals based on the current estimate $\hat\beta$.
(3). Define the adjusted dependent variable $z = D\hat\beta + S$.
(4). Update the estimator $\hat\beta$ for fixed $\lambda \ge 0$ and $\gamma \ge 1$ by applying penalization to the regression of $z$ on $X$ with weights $V$, using the M-N-R (Shooting) method.
(5). Repeat steps (2) to (4) till convergence of $\hat\beta$ is achieved.
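The following is a compact, hypothetical sketch (Python) of this cycle for binary responses with an exchangeable working correlation. For brevity it uses the closed-form ridge update (the Bridge with $\gamma = 2$) in step (4); for $\gamma = 1$, the Shooting step sketched in Section 3.5 would replace the closed-form solve. This is one reading of the algorithm, not the thesis's implementation.

```python
import numpy as np

def penalized_gee_ridge(X_blocks, y_blocks, lam, n_cycles=25):
    # X_blocks[i], y_blocks[i]: design matrix and binary responses of subject i
    p = X_blocks[0].shape[1]
    beta = np.zeros(p)                                    # step (1): initial value
    for _ in range(n_cycles):                             # step (5): repeat (2)-(4)
        # step (2): moment estimate of the exchangeable correlation
        # from Pearson residuals under the current estimate
        resids = []
        for Xi, yi in zip(X_blocks, y_blocks):
            mui = 1.0 / (1.0 + np.exp(-(Xi @ beta)))
            resids.append((yi - mui) / np.sqrt(mui * (1.0 - mui)))
        alpha = np.mean(np.concatenate(
            [np.outer(r, r)[~np.eye(len(r), dtype=bool)] for r in resids]))
        # step (3): adjusted dependent variable z = D beta + S, whitened by V^{-1/2}
        Zs, Ds = [], []
        for Xi, yi in zip(X_blocks, y_blocks):
            ni = len(yi)
            mui = 1.0 / (1.0 + np.exp(-(Xi @ beta)))
            Ai = np.diag(mui * (1.0 - mui))
            Ri = alpha * np.ones((ni, ni)) + (1.0 - alpha) * np.eye(ni)
            Vi = np.sqrt(Ai) @ Ri @ np.sqrt(Ai)           # working covariance
            Di = Ai @ Xi                                  # d mu / d beta (logit link)
            Li = np.linalg.cholesky(np.linalg.inv(Vi))    # Li Li' = Vi^{-1}
            Zs.append(Li.T @ (Di @ beta + (yi - mui)))
            Ds.append(Li.T @ Di)
        D, z = np.vstack(Ds), np.concatenate(Zs)
        # step (4): penalized working regression; ridge = Bridge with gamma = 2
        beta = np.linalg.solve(D.T @ D + lam * np.eye(p), D.T @ z)
    return beta
```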

Solving the penalized GEE for the Bridge (Lasso) estimators, one achieves better estimation, and predictions may perform better when collinearity is present among the explanatory variables, as demonstrated in Chapters 6 and 7.


Chapter 5

Selection of Shrinkage Parameters


5.1 Introduction

In regression problems, one frequently needs to select models according to the following general rules: (1) to have a good fit to the data, and (2) to maintain a simple and interpretable model. The former can usually be achieved by including as many explanatory variables as possible in the model, while the latter is achieved by excluding variables that are not statistically significant. However, if there are a large number of explanatory variables, it is hard in general to choose a good model satisfying both (1) and (2) simultaneously. Very often, it is easy to have a large model with many regressors. Then over-fitting becomes a major problem in these models.

Over-fitting occurs when models include more regressors than necessary and fit the data extremely well at all given data points. Such models perform very poorly in prediction because the pattern in the data is mis-identified due to interpolation at the given data points with many unnecessary regressors in the model. These models are misleading, and thus over-fitting should be prevented as much as possible.

5.2 Cross-Validation and Generalized Cross-Validation

To handle the over-fitting problem, the cross-validation (CV) method was introduced (Stone, 1974). It selects the model by leaving out one observation point at a time and minimizing the average prediction error at the left-out points, with the model built on the remaining data points, i.e.,

$$\min_\lambda\ \mathrm{CV}(\lambda),$$

where

$$\mathrm{CV}(\lambda) = \frac{1}{n}\sum_{i=1}^n \left(y_i - \hat y_i^{-i}\right)^2,$$

$\hat y_i^{-i} = x_i^T\hat\beta^{-i}(\lambda)$, $\hat\beta^{-i}(\lambda)$ is the estimate of the model based on the observations excluding $(x_i, y_i)$, and $\lambda$ is a tuning parameter for model selection. There are many applications of cross-validation methods in model fitting and selection. Major references can be found in Stone (1974), Hastie and Tibshirani (1990), Wahba (1990), Shao (1993) and Zhang (1992).

Craven and Wahba (1979) introduced the generalized cross-validation (GCV) for linear smoothing splines to optimize the smoothing parameter $\lambda$. It takes the form

$$\mathrm{GCV}(\lambda) = \frac{\frac{1}{n}\,\|y - A(\lambda)y\|^2}{\left[\,1 - \mathrm{tr}\{A(\lambda)\}/n\,\right]^2}$$

for the linear operator $\hat g = A(\lambda)y$ of the model $Y = g + \varepsilon$.

One advantage of the GCV is that it is not necessary to compute the estimates $n$ times, one for each single left-out data point selected for cross-validation. It suffices to compute the total deviance (RSS) of the full model, the degrees of freedom of the model and the sample size. Therefore, it is less expensive computationally and can easily be computed with an advanced programming language, such as S+.

5.3 Selection of Parameters λ and γ via the GCV

To select the penalty parameter $\gamma$ and the tuning parameter $\lambda$, we use the generalized cross-validation (GCV) method of Craven and Wahba. First, we have from (P3) that the Bridge estimator of the linear regression model satisfies

$$\hat\beta(\lambda, \gamma) = \left(X^TX + \frac{\lambda\gamma}{2}\,D\right)^{-1} X^Ty.$$

We define the effective number of parameters $p(\lambda, \gamma)$ of the model, following Craven and Wahba, to assess the penalty effect on the degrees of freedom of the model,

$$p(\lambda, \gamma) = \mathrm{tr}\left\{X\left(X^TX + \frac{\lambda\gamma}{2}\,D\right)^{-1}X^T\right\} - n_0,$$

where $D$ is a $p \times p$ diagonal matrix of elements $|\hat\beta_j|^{\gamma-2}$, and $n_0$ is the number of $j$ such that $\hat\beta_j = 0$ for $\gamma = 1$, compensating for the loss of the inverse of the zero entries on the diagonal of the matrix $D$ due to $\hat\beta_j = 0$. The GCV is defined as

$$\mathrm{GCV}(\lambda, \gamma) = \frac{\mathrm{RSS}(\lambda, \gamma)}{n\left[\,1 - p(\lambda, \gamma)/n\,\right]^2},$$

where $n$ is the sample size. It can be re-written as

$$\mathrm{GCV}(\lambda, \gamma) = \frac{n\,\mathrm{RSS}(\lambda, \gamma)}{\left[\,n - p(\lambda, \gamma)\,\right]^2}$$

and be interpreted as the average quantity of squared residual over each remaining effective degree of freedom of the model.

To select the parameters $\lambda$ and $\gamma$, we compute the GCV for each pair $(\lambda, \gamma)$ over a grid of $\lambda \ge 0$ and $\gamma \ge 1$; $\lambda$ and $\gamma$ are selected to achieve the minimal value of the GCV, as shown in Figure 5.1.
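The computation is cheap: for a given fit, only the RSS, the trace term and $n$ are needed. The sketch below (Python; the function names are hypothetical, and it follows one reading of the $p(\lambda, \gamma)$ definition above, with the zero coordinates simply dropped from the trace) evaluates the GCV for a fitted Bridge estimator; selecting $\lambda$ and $\gamma$ then amounts to evaluating it over the grid and taking the minimizer.

```python
import numpy as np

def effective_params(X, beta_hat, lam, gamma):
    # Trace of the Bridge hat matrix X (X'X + (lam*gamma/2) D)^{-1} X',
    # with coordinates shrunk exactly to zero (gamma = 1) excluded,
    # which plays the role of subtracting n0 in the definition above.
    nz = beta_hat != 0
    Xa = X[:, nz]
    D = np.diag(np.abs(beta_hat[nz]) ** (gamma - 2))
    H = Xa @ np.linalg.solve(Xa.T @ Xa + 0.5 * lam * gamma * D, Xa.T)
    return np.trace(H)

def gcv(y, X, beta_hat, lam, gamma):
    n = len(y)
    rss = np.sum((y - X @ beta_hat) ** 2)
    p_eff = effective_params(X, beta_hat, lam, gamma)
    return n * rss / (n - p_eff) ** 2
```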

Figure 5.1: Selection of parameters $\lambda$ and $\gamma$ via GCV.

For generalized linear models, the GCV must be modified, since the residual sum of squares (RSS) is no longer meaningful for non-Gaussian response variables. Instead, the deviance, $-2\log(Lik)$, can be used to replace the RSS in the GCV, where $Lik$ is the joint likelihood function of the response variable. The optimizing procedure remains the same.

We consider two special cases for the effective number of parameters $p(\lambda, \gamma)$.

1. $\lambda = 0$. No penalty is applied to the model; $p(\lambda, \gamma)$ is the trace of the projection matrix and is thus equal to $p$, the number of parameters in the linear model.

2. $\lambda \gg 1$ and $\gamma = 1$. Since the Lasso shrinks the parameters and yields $\hat\beta_j = 0$, $j = 1, \ldots, p$, for sufficiently large $\lambda$, $D = \mathrm{diag}(0)$ and $n_0 = p$. The model is thus null since all $\hat\beta_j = 0$. Hence the effective number of parameters of the model is equal to 0, which agrees with the calculation of $p(\lambda, \gamma) = p - p = 0$.

For other cases, $p(\lambda, \gamma)$ is greater than 0 and less than $p$, the number of parameters in the model.

5.4 Quasi-GCV for the Penalized GEE

The GCV method was used to select the parameters $\lambda$ and $\gamma$ for generalized linear models in the last section. However, as pointed out in Chapter 4, no joint likelihood function exists for the GEE in general. Hence, the GCV method does not apply to the penalized GEE and must be modified.

To generalize the GCV method for the penalized GEE, the correlation structure must be incorporated. By incorporating the correlation, one may achieve the same effect of the GCV as in generalized linear models. Notice that the deviance used in the GCV for generalized linear models is the sum of squares of deviance residuals. Although deviance does not have a proper meaning in the GEE due to the correlation, the deviance residuals can still be calculated at each single observation point as usual,

$$r_{kt} = \mathrm{sign}(y_{kt} - \hat\mu_{kt})\,\sqrt{d(y_{kt}, \hat\mu_{kt})},$$

where the unit deviance $d(y_{kt}, \hat\mu_{kt})$ is computed from the likelihood $L(y_{kt}, \hat\mu_{kt})$ of observation $y_{kt}$ based on its marginal distribution. A weighted deviance $D_w(\lambda, \gamma)$ for correlated observations is then formed as follows by incorporating the correlation structure into the deviance residuals, to achieve an effect similar to that of the deviance for independent observations:

$$D_w(\lambda, \gamma) = \sum_{k=1}^K r_k^T R_k(\alpha)^{-1} r_k,$$

where $r_k$ is the deviance residual vector of subject $k$ and $R_k(\alpha)$, of dimension $n_k \times n_k$, is the working correlation matrix.

We then define a quasi-GCV to be

$$\mathrm{qGCV}(\lambda, \gamma) = \frac{n\, D_w(\lambda, \gamma)}{\left[\,n - p(\lambda, \gamma)\,\right]^2}, \qquad (5.3)$$

where $n$ is the effective number of degrees of freedom of the correlated observations $y_{kt}$, $t = 1, \ldots, n_k$, defined as

$$n = \sum_{k=1}^K \frac{n_k^2}{|R_k(\alpha)|},$$

and $|R_k(\alpha)|$ is the sum of all elements $\sum_{s,t}\rho_{st}$ of $R_k(\alpha) = (\rho_{st})$. Since the correlation structure of the GEE is estimated via either Pearson residuals or deviance residuals, deviance residuals are recommended in order to incorporate the correlation structure consistently into the weighted deviance.

Figure 5.2: Selection of parameters $\lambda$ and $\gamma$ via quasi-GCV.


The parameter selection procedure remains the same as for the generalized linear models: for each fixed pair $(\lambda, \gamma)$, compute the Bridge (Lasso) estimator $\hat\beta(\lambda, \gamma)$, then compute the effective number of parameters $p(\lambda, \gamma)$. The quasi-GCV is then computed using (5.3) with the deviance residuals and the correlation matrix $R(\alpha)$ obtained from the last step of the IRLS procedure for the penalized GEE. The parameters $\lambda$ and $\gamma$ are then selected over a grid to minimize the quasi-GCV, as shown in Figure 5.2.
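A sketch of the quasi-GCV computation (Python; the function name is hypothetical), given the per-subject deviance-residual vectors and working correlation matrices from the last IRLS step:

```python
import numpy as np

def quasi_gcv(dev_resids, R_blocks, p_eff):
    # weighted deviance: sum_k r_k' R_k(alpha)^{-1} r_k
    dw = sum(r @ np.linalg.solve(R, r) for r, R in zip(dev_resids, R_blocks))
    # effective degrees of freedom: sum_k n_k^2 / |R_k|, with |R_k| the
    # sum of all elements of the working correlation matrix
    n_eff = sum(len(r) ** 2 / R.sum() for r, R in zip(dev_resids, R_blocks))
    return n_eff * dw / (n_eff - p_eff) ** 2
```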

Remarks

1. We refer to $D_w(\lambda, \gamma)$ as the weighted deviance. It reduces to the deviance when the correlation matrix $R(\alpha)$ reduces to the identity matrix for independent observations; accordingly, the quasi-GCV reduces to the GCV.

2. The effective number of degrees of freedom of correlated observations depends on the correlation coefficient matrix $R(\alpha)$. Since different values of $\lambda$ and $\gamma$ yield different estimates and different values of $R(\alpha)$, $n$ seems to vary with $\lambda$ and $\gamma$. However, since the effective number of degrees of freedom is intrinsic to the observations and the subject, $n$ must be independent of $\lambda$ and $\gamma$. Therefore, a constant value of $n$ should be used to compute the quasi-GCV for different $\lambda$ and $\gamma$. We recommend using the estimate of $n$ from $\lambda = 0$.

The weighted deviance is motivated by correlated Gaussian responses as follows. Assume $Y = (Y_1, \ldots, Y_n)^T$ are correlated responses from the model $Y = X\beta + \varepsilon$ with $\varepsilon \sim N(0, \Sigma)$, where $\Sigma$ is a non-diagonal variance-covariance matrix of $\varepsilon$. In order to apply the GCV method for independent responses, we take the transformation $Z = PY$, where $P = \Lambda^{-1/2}Q$ with $\Sigma = Q^T\Lambda Q$. Then $Z$ follows a normal distribution $N(PX\beta, I)$. Applying the GCV to $Z$, one has the residual sum of squares

$$\|Z - PX\hat\beta\|^2 = (Y - X\hat\beta)^T\,\Sigma^{-1}\,(Y - X\hat\beta),$$

i.e. the GCV is achieved by incorporating the correlation structure into the residuals. Similarly, one incorporates the correlation structure into the deviance residuals as in (5.3) to achieve the same effect for the penalized GEE.

The effective number of degrees of freedom of correlated observations is also motivated by correlated Gaussian observations. Assume $Y = (Y_1, \ldots, Y_n)^T$ follows the distribution $N(0, \sigma^2 R)$, where the matrix $R = (\rho_{ij})$ has diagonal elements $\rho_{ii} = 1$. Consider the variance of the sample mean $\bar Y$:

$$\mathrm{Var}(\bar Y) = \frac{1}{n^2}\sum_i\sum_j \mathrm{Cov}(Y_i, Y_j) = \frac{\sigma^2\,|R|}{n^2} = \frac{\sigma^2}{n^2/|R|}, \qquad (5.4)$$

where $|R| = \sum_{i,j}\rho_{ij}$ is the sum of all elements of $R$. Notice that for the special case where the $Y_i$'s are independent, $R$ is the identity matrix and $\mathrm{Var}(\bar Y) = \sigma^2/n$; the denominator $n$ is the number of degrees of freedom of the $n$ independent observations $Y_1, \ldots, Y_n$. By analogy, we define $n^2/|R|$, the denominator of (5.4), to be the effective number of degrees of freedom of the correlated observations $Y_1, \ldots, Y_n$. For example, with $n = 5$ exchangeable observations and $\rho = 0.5$, $|R| = 5 + 20 \times 0.5 = 15$, and the effective number of degrees of freedom is $25/15 \approx 1.67$. For non-negative correlation coefficients $\rho_{ij} \ge 0$, this effective number of degrees of freedom is between 1 and $n$: the former corresponds to $n$ repeats of $Y_1$ and the latter to $n$ independent observations $Y_1, \ldots, Y_n$.

There might be some problems with negative correlation. However, it is very rare in practice to have a series of observations with negative correlation. Especially in longitudinal studies, one expects positively correlated responses from the same subject. Therefore, the effective number of degrees of freedom works well for longitudinal studies in general.


Chapter 6

Simulation Studies


In this chapter, I conduct a series of statistical simulations based on true models in order to examine the shrinkage effect of Bridge regression. The Bridge penalty model is compared with the no-penalty, Lasso penalty and ridge penalty models in the settings of linear regression, logistic regression (the generalized linear model for the binomial distribution) and the GEE for binary outcomes. The standardized mean squared error (MSE) of the regression parameters,

$$\mathrm{MSE} = \mathrm{ave}\ (\hat\beta - \beta)^T (X^TX)(\hat\beta - \beta),$$

and the prediction squared error $\mathrm{PSE} = \mathrm{ave}\ \mathrm{Dev}(y, \hat\mu)$, averaged over replicates of the model random error, are computed and compared for the different penalty models. For the logistic regression and the GEE models, the misclassification error (MCE) is also computed as an average over replicates of the model random error,

$$\mathrm{MCE} = \mathrm{ave}\ I(y \neq z),$$

where $I(\cdot)$ is the indicator function and $z$ is the predicted binary response. For each replicate of the random error generated for the model, the PSE and MCE are computed as an average at some randomly selected points in the covariate space having the same correlation structure as $X$. The standard error of each quantity is also computed.
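For concreteness, the three criteria can be computed as below (Python; the function names are hypothetical, and the 0.5 cutoff for the predicted binary response is an assumption, not stated in the text):

```python
import numpy as np

def standardized_mse(beta_hat, beta_true, X):
    d = beta_hat - beta_true
    return d @ (X.T @ X) @ d          # (beta_hat - beta)' X'X (beta_hat - beta)

def binomial_deviance(y, p_hat):
    eps = 1e-12                        # guard against log(0)
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return np.mean(-2 * (y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat)))

def misclassification_error(y, p_hat, cutoff=0.5):
    return np.mean(y != (p_hat > cutoff))   # assumed cutoff for z
```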


6.1 A Linear Regression Model

We compare the Bridge model with the OLS, the Lasso and the ridge in a simulation of a simple model of 40 observations and 5 covariates,

$$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_5 x_5 + \varepsilon, \qquad (6.1)$$

where $\varepsilon \sim N(0, \sigma^2)$. The signal-to-noise ratio is calculated from $\mathrm{v\hat{a}r}(x^T\beta)/\sigma^2$, where $\mathrm{v\hat{a}r}$ is the sample variance of $(x_1^T\beta, \ldots, x_n^T\beta)$, $\beta$ is the true parameter and $x_i$ is the covariate vector of the $i$-th observation.

To examine the shrinkage effect on collinearity, we choose a regression matrix $X$ with strong linear correlation, as shown in the correlation matrix of $X$ below. The correlation coefficient between $x_4$ and $x_5$ is very large, $\rho = 0.995$. The matrix $X$ is generated as follows. First, a $40 \times 5$ matrix is generated with random numbers from the standard normal distribution $N(0, 1)$. Then the pairwise correlation coefficients of consecutive column vectors of $X$ are generated from the uniform distribution $U(-1, 1)$. The pairwise correlation of two consecutive column vectors is achieved by adding a multiple of the second column to the first. To shrink the parameters of the regressors but not the intercept, we center and scale the data columnwise, where $x_j$ is the $j$-th column vector of $X$; a sketch of this construction is given below.
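The following Python sketch illustrates the construction (the seed and exact mixing weights are hypothetical, chosen only so that consecutive columns attain the target correlations; the unit-variance scaling is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 5
X = rng.standard_normal((n, p))
rho = rng.uniform(-1.0, 1.0, size=p - 1)    # target consecutive-column correlations
for j in range(p - 1):
    # mixing column j with column j + 1 induces correlation close to rho[j]
    X[:, j] = rho[j] * X[:, j + 1] + np.sqrt(1.0 - rho[j] ** 2) * X[:, j]
X = X - X.mean(axis=0)                      # center each column
X = X / X.std(axis=0)                       # scale each column (assumed unit variance)
```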

Correlation matrix of the linear model (6.1):

        x1      x2      x3      x4      x5
x1   1.000   0.110  -0.144   0.036   0.066
x2   0.110   1.000  -0.315   0.021   0.034
x3  -0.144  -0.315   1.000  -0.118  -0.109
x4   0.036   0.021  -0.118   1.000   0.995
x5   0.066   0.034  -0.109   0.995   1.000

Since the Lasso performs well compared to the ridge if the true model has zero coefficients, but performs poorly if the true model has small but non-zero coefficients, two sets of true $\beta$ were selected to examine the shrinkage effect on models with zero coefficients and models with small but non-zero coefficients: $\beta_{true} = (0, 0, 0.5, 0, -1)^T$ with intercept $\beta_0 = 0$ for Model (a), and $\beta_{true} = (0.5, 3, -0.1, 2.5, 9)^T$ with intercept $\beta_0 = 0$ for Model (b). The response $Y$ is generated from model (6.1) with a signal-to-noise ratio equal to 6.

Table 6.1 shows the parameter estimates, the standard errors in parentheses, and the MSE and PSE of the OLS, the Bridge, the Lasso and the ridge models. The standard errors of $\hat\beta_4$ and $\hat\beta_5$ are relatively large in both Models (a) and (b) due to collinearity.

In Model (a), the Bridge and the Lasso achieve the smallest MSE = 1.104, followed by the ridge with MSE = 1.212. The OLS has the greatest MSE = 1.385 due to collinearity. The Bridge and the Lasso also achieve the smallest prediction error PSE = 2.482, followed by the ridge with PSE = 2.745. The OLS has the greatest prediction error PSE = 3.178. The reduction of MSE by the Bridge is 20% from the OLS, and the reduction of PSE by the Bridge is 22% from the OLS.

In Model (b), the ridge achieves the smallest MSE = 127.90, followed closely by the Bridge with MSE = 129.60 and the Lasso with MSE = 130.16. The OLS has the greatest MSE = 145.17. The ridge also achieves the smallest prediction error PSE = 286.42, followed by the Bridge PSE = 290.29 and the Lasso PSE = 292.10. The OLS has the greatest

Table 6.1: Model comparison by simulation of 200 runs (standard errors in parentheses)

Model (a): true beta = (0, 0, 0.5, 0, -1)^T, intercept beta_0 = 0
          OLS             Bridge          Lasso           Ridge
MSE     1.385           1.104           1.104           1.212
PSE     3.178(0.141)    2.482(0.152)    2.482(0.132)    2.745(0.127)

Model (b): true beta = (0.5, 3, -0.1, 2.5, 9)^T, intercept beta_0 = 0
          OLS             Bridge          Lasso           Ridge
MSE     145.17(5.98)    129.60(5.50)    130.16(5.52)    127.90(5.70)
PSE     329.30(14.22)   290.29(12.87)   292.10(12.90)   286.42(13.26)


prediction error PSE = 329.30. The reduction of MSE by the Bridge is 11% from the OLS, and the reduction of PSE by the Bridge is 12% from the OLS.

It is shown in the above example that Bridge regression shrinks the OLS estimators and achieves small variance, small mean squared error and small prediction error. It is also demonstrated that the Bridge estimator performs well compared to the Lasso and the ridge estimators, and performs better than the OLS estimator.

6.2 A Logistic Regression Model

We apply the Bridge penalty to a logistic regression model and compare the Bridge model with logistic regression with no penalty, with the Lasso penalty and with the ridge penalty for the following model of 20 binary responses and 3 regressors:

$$\mathrm{logit}\{P(y_i = 1)\} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3. \qquad (6.2)$$

As above, we standardize the regression matrix by centering and scaling. To examine the shrinkage effect in the presence of collinearity, we choose a regression matrix $X$ such that the covariates $x_2$ and $x_3$ are highly correlated, with correlation coefficient 0.9975. The matrix $X$ is generated as follows. First, a $20 \times 3$ matrix is generated with random numbers from the standard normal distribution $N(0, 1)$. Then a large multiple of the third column vector is added to the second in order to achieve a high correlation between the two columns.


Correlation matrix of the logistic regression model (6.2)

Since the Lasso penalty model and the ridge penalty model perform differently depending on whether the true model has zero coefficients or small non-zero coefficients, two sets of true $\beta$ are selected to examine the shrinkage effect of Bridge penalization: $\beta = (0.1, 0.1, -0.1)^T$ with intercept $\beta_0 = 0.1$ for Model (a), and $\beta = (0, 1, -1)^T$ with intercept $\beta_0 = 0$ for Model (b). The Bridge model is compared with logistic regression, the Lasso and the ridge through a simulation of 100 runs. Table 6.2 shows the estimates, the standard errors in parentheses, and the MSE, MCE and PSE averaged at 20 randomly selected points having the same correlation structure as $X$.

Overall, the standard errors of $\hat\beta_2$ and $\hat\beta_3$ are relatively large for both Models (a) and (b) due to collinearity. The logistic regression estimator has the greatest standard errors in both Models (a) and (b), which leads to poor performance in prediction with the greatest MCE and PSE.

In Model (a), the ridge achieves the smallest MSE = 1.638, followed by the Bridge MSE = 1.902 and the Lasso MSE = 1.907. The logistic regression has the greatest MSE = 3.058. The ridge has the smallest prediction error PSE = 1.569, followed by the Lasso PSE = 1.588 and the Bridge PSE = 1.590. The logistic regression has the greatest prediction error PSE = 1.782. However, the Lasso has the smallest misclassification


Table 6.2: Model comparison by simulation of 100 runs (standard errors in parentheses)

Model (a): true beta = (0.1, 0.1, -0.1)^T, intercept beta_0 = 0.1
          Logistic        Bridge          Lasso           Ridge
MSE     3.058(0.147)    1.902(0.173)    1.907           1.638(0.172)
MCE     0.519(0.012)    0.488(0.015)    0.485(0.015)    0.514(0.012)
PSE     1.782(0.047)    1.590(0.031)    1.588(0.038)    1.569(0.039)

Model (b): true beta = (0, 1, -1)^T, intercept beta_0 = 0
          Logistic        Bridge          Lasso           Ridge
MSE     3.435           2.150           2.149           1.931
MCE     0.509(0.010)    0.477(0.013)    0.480(0.013)    0.494(0.010)
PSE     1.839(0.057)    1.621(0.040)    1.620(0.039)    1.645(0.054)


error MCE = 0.485, followed closely by the Bridge MCE = 0.488, and by the ridge MCE = 0.514. The logistic regression has the greatest misclassification error MCE = 0.519. The reduction of MSE by the Bridge and the Lasso is about 38% from the logistic regression. The reduction of PSE by the Bridge and the Lasso is about 11% from the logistic regression. The reduction of MCE by the Bridge and the Lasso is about 6% from the logistic regression.

In Model (b), the ridge achieves the smallest MSE = 1.931, followed by the Lasso MSE = 2.149 and the Bridge MSE = 2.150. The logistic regression has the greatest MSE = 3.435. The Lasso achieves the smallest prediction error PSE = 1.620, followed closely by the Bridge PSE = 1.621, and by the ridge PSE = 1.645. The logistic regression has the greatest prediction error PSE = 1.839. The Bridge achieves the smallest misclassification error MCE = 0.477, followed closely by the Lasso MCE = 0.480, and by the ridge MCE = 0.494. The logistic regression has the greatest MCE = 0.509. The reduction of MSE by the Bridge is 37% from the logistic regression. The reduction of PSE by the Bridge is 12% from the logistic regression. The reduction of MCE by the Bridge is about 6% from the logistic regression.

It is shown in the above example that the Bridge penalization shrinks the estimator towards 0 and achieves small mean squared error, small prediction error and small misclassification error for the logistic regression model. Therefore the Bridge estimator performs well in prediction compared to the Lasso and the ridge estimators, and performs better than the logistic regression estimator.


6.3 A Generalized Estimating Equations Model

We apply the Bridge penalization to the GEE via the penalized score equations approach and compare the Bridge model with the GEE with no penalty, with the Lasso penalty and with the ridge penalty for the following model of 20 subjects and 3 regressors. The matrix $X$ is the same as in the logistic regression model in the last section. Five binary responses are generated for each subject with exchangeable positive correlation, and the covariates remain the same across observations within each subject. The correlated binary responses are generated using a method by Lee (1993) for an exchangeable correlation structure with positive pairwise correlation determined by a parameter $0 < \psi \le 1$. As $\psi$ tends to 0, the Kendall coefficient $\tau$ converges to 1, while $\psi = 1$ corresponds to $\tau = 0$, with independence as a special case. Here we choose $\psi = 0.2$ to generate the positively correlated responses. The MSE, MCE and PSE at some randomly selected prediction points are computed for each model. The PSE is defined to be the deviance averaged over $M = 20$ randomly selected prediction points,

$$\mathrm{PSE} = \frac{1}{M}\sum_{i=1}^M -2\,[\,y_i\log(\hat p_i) + (1 - y_i)\log(1 - \hat p_i)\,],$$

with the assumption that the prediction points are from independent subjects.

The linear component of the GEE is $\eta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$. Since the Lasso penalty model and the ridge penalty model perform differently depending on whether the true model has zero coefficients or small but non-zero coefficients, two sets of true $\beta$ are selected to examine the shrinkage effect of the Bridge penalization:


Table 6.3: Model comparison by simulation of 100 runs (standard errors in parentheses)

Model (a): true beta = (0, 2, -3)^T, intercept beta_0 = 0
          GEE             Bridge          Lasso           Ridge
MSE     5.190(0.349)    4.848(0.363)    4.845(0.363)    5.089(0.309)
MCE     0.341(0.010)    0.340(0.010)    0.340(0.010)    0.345(0.010)
PSE     1.244(0.019)    1.241(0.018)    1.241(0.018)    1.245(0.018)

Model (b): true beta = (0.1, -0.1, 0.01)^T, intercept beta_0 = 0.1
          GEE             Bridge          Lasso           Ridge
beta_0  0.185(0.242)    0.185(0.241)    0.185(0.241)    0.185(0.241)
beta_3  0.155(3.938)    0.028(2.021)    0.025(2.021)    0.040(2.482)
MSE     4.807(0.295)    3.527(0.264)    3.527(0.264)    3.713(0.294)
MCE     0.460(0.012)    0.451(0.011)    0.451(0.011)    0.461(0.011)
PSE     1.413(0.013)    1.397(0.012)    1.397(0.012)    1.398(0.013)


$\beta = (0, 2, -3)^T$ with intercept $\beta_0 = 0$ for Model (a), and $\beta = (0.1, -0.1, 0.01)^T$ with intercept $\beta_0 = 0.1$ for Model (b). Table 6.3 shows the parameter estimates, the standard errors in parentheses, and the MSE, MCE and PSE. The standard errors of $\hat\beta_2$ and $\hat\beta_3$ are relatively large in both Models (a) and (b) due to collinearity.

In Model (a), the Lasso achieves the smallest MSE = 4.845, followed closely by the Bridge MSE = 4.848, and by the ridge MSE = 5.089. The GEE model has the greatest MSE = 5.190. The Bridge and the Lasso achieve the smallest misclassification error MCE = 0.340, followed by the GEE MCE = 0.341. The ridge has the greatest misclassification error MCE = 0.345. The Bridge and the Lasso achieve the smallest prediction error PSE = 1.241, followed by the GEE PSE = 1.244 and the ridge PSE = 1.245. The reduction of MSE by the Bridge is 7% from the GEE. The reduction of MCE by the Bridge is 0.3% from the GEE. The reduction of PSE by the Bridge is 0.25% from the GEE.

In Model (b), the Bridge and the Lasso achieve the smallest MSE = 3.527, followed by the ridge MSE = 3.713. The GEE has the greatest MSE = 4.807. The Bridge and the Lasso achieve the smallest misclassification error MCE = 0.451, while the GEE has MCE = 0.460 and the ridge MCE = 0.461. The Bridge and the Lasso achieve the smallest prediction error PSE = 1.397, followed closely by the ridge PSE = 1.398. The GEE has the greatest prediction error PSE = 1.413. The reduction of MSE by the Bridge and the Lasso is 27% from the GEE. The reduction of MCE by the Bridge and the Lasso is 2% from the GEE. The reduction of PSE by the Bridge and the Lasso is 1.2% from the GEE.

It is shown in the above example that the Bridge penalization shrinks the estimator and achieves small mean squared error, misclassification error and prediction error for


the GEE model. The Bridge estimator performs well compared to the Lasso and ridge estimators, and performs better than the GEE estimator.

6.4 A Complicated Linear Regression Model

In Section 6.1, a simple linear regression model was studied, and the shrinkage effects of different penalties (the OLS, the Bridge, the Lasso and the ridge) were compared in terms of MSE and PSE with two typical sets of true parameters: one with zeros and the other with small but non-zero values. In this section, we study the shrinkage effects of different penalties on more complicated linear regression models with different correlation structures of the regressors. The true parameters are generated from the prior distribution of the Bridge penalty for different values of $\gamma$, as discussed in Section 2.7.

Model

We study a linear regression model of 10 regressors with sample size $n = 30$. Ten regression matrices $X_m$, $m = 1, \ldots, 10$, are generated from an orthonormal matrix $X$ of dimension $30 \times 10$ with different pairwise correlation coefficients $\{\rho\}$, generated from a uniform distribution $U(-1, 1)$.

Data

For each $X_m$, 30 true $\beta_k$, $k = 1, \ldots, 30$, are generated, where each component of $\beta_k$ is generated from the Bridge prior $\pi_{\lambda,\gamma}(\beta) \propto \exp(-\lambda|\beta|^\gamma)$ with $\lambda = 1$ and fixed $\gamma \ge 1$. With each $X_m$ and $\beta_k$, 30 observations are generated from $Y = X_m\beta_k + \varepsilon$ with iid normal random

error $\varepsilon_i$ from $N(0, \sigma^2)$ with a signal-to-noise ratio equal to 6. For the different penalty models (the OLS, the Bridge, the Lasso and the ridge), the MSE and

$$\mathrm{PSE} = \mathrm{ave}\ (y_t - x_t^T\hat\beta)^2$$

are computed, averaged over 20 randomly selected points $(x_t, y_t)$ generated from the same model, where $x_t$, the covariate vector of each prediction point, consists of covariates having the same correlation structure as $X_m$. Then the MSE and PSE are averaged over 50 replicates of the model random error $\varepsilon$. Hence, for each $\beta_k$ generated from the prior distribution $\pi_\gamma(\beta)$, the MSE and PSE are computed for the OLS, the Bridge, the Lasso and the ridge models. Therefore $10 \times 30 = 300$ sets of MSE and PSE are computed. The above procedure is repeated for different values of $\gamma = 1, 1.5, 2, 3, 4$.
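Sampling from this prior is straightforward, since under $\pi_{\lambda,\gamma}(\beta) \propto \exp(-\lambda|\beta|^\gamma)$ the quantity $\lambda|\beta|^\gamma$ follows a Gamma$(1/\gamma)$ distribution. A sketch (Python; the function name is hypothetical):

```python
import numpy as np

def sample_bridge_prior(size, lam=1.0, gamma=1.0, rng=None):
    # If G ~ Gamma(shape=1/gamma), then sign * (G / lam)**(1/gamma) has
    # density proportional to exp(-lam * |b|**gamma): gamma = 1 gives the
    # Laplace (Lasso) prior, gamma = 2 the Gaussian (ridge) prior.
    rng = rng or np.random.default_rng()
    g = rng.gamma(shape=1.0 / gamma, scale=1.0, size=size)
    sign = rng.choice([-1.0, 1.0], size=size)
    return sign * (g / lam) ** (1.0 / gamma)
```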

Method

Since each set of MSE and PSE for the different penalties is computed from the same $\beta_k$ generated from $\pi_\gamma(\beta)$, and their values vary over a large range with different $\beta_k$ while the differences between the models are relatively small, as shown in Figures 6.1 - 6.5, we choose to compare the relative $\mathrm{MSE}_r$ and relative $\mathrm{PSE}_r$ against the OLS by setting the OLS to be the baseline:

$$\mathrm{MSE}_r = \frac{\mathrm{MSE} - \mathrm{MSE}_{ols}}{\mathrm{MSE}_{ols}}$$

and

$$\mathrm{PSE}_r = \frac{\mathrm{PSE} - \mathrm{PSE}_{ols}}{\mathrm{PSE}_{ols}}.$$

It can be seen clearly from the plots of the MSE and PSE in the original scale that the MSEs of the different penalty models are highly correlated, and so are the PSEs. It is therefore appropriate to compare the relative MSE and PSE rather than the original MSE and PSE.

Result

For each fixed $\gamma$ value, the means and standard errors of the 300 sets of $\mathrm{MSE}_r$ and $\mathrm{PSE}_r$ are computed and reported in Table 6.4. It is shown that for $\gamma = 1$ and 1.5, the Bridge, the Lasso and the ridge achieve significant reductions of MSE and PSE from the OLS. For $\gamma = 1$, the Bridge has the greatest reduction, with $\mathrm{MSE}_r = -0.0860$ and $\mathrm{PSE}_r = -0.0021$, followed closely by the Lasso with $\mathrm{MSE}_r = -0.0841$ and $\mathrm{PSE}_r = -0.0020$, and followed by the ridge with $\mathrm{MSE}_r = -0.0595$ and $\mathrm{PSE}_r = -0.0013$. For $\gamma = 1.5$, the ridge has the greatest reduction, with $\mathrm{MSE}_r = -0.0566$ and $\mathrm{PSE}_r = -0.0017$, followed by the Bridge with $\mathrm{MSE}_r = -0.0225$ and $\mathrm{PSE}_r = -0.0009$, and followed by the Lasso with $\mathrm{MSE}_r = -0.0224$ and $\mathrm{PSE}_r = -0.0009$.

For $\gamma = 2$, 3 and 4, the ridge has a significant reduction of MSE and PSE from the OLS, with $\mathrm{MSE}_r = -0.0519$ and $\mathrm{PSE}_r = -0.0021$ for $\gamma = 2$, $\mathrm{MSE}_r = -0.0566$ and $\mathrm{PSE}_r = -0.0016$ for $\gamma = 3$, and $\mathrm{MSE}_r = -0.0577$ and $\mathrm{PSE}_r = -0.0013$ for $\gamma = 4$, while both the Bridge and the Lasso have a significant increase of MSE and no significant change of PSE from the OLS.

It is shown in Table 6.4 that the Bridge and the Lasso perform well for small γ values, but not as well for large γ values. The ridge performs well for all of the γ values considered here; it performs better than the Bridge and the Lasso for large values of γ (γ = 1.5, 2, 3 and 4), but not as well for the small γ value (γ = 1).

Table 6.4: Means and SE's of MSE_r and PSE_r for different γ (columns: Bridge, Lasso, Ridge; the individual entries are quoted in the text above)

As discussed in Sections 5.6 and 2.7, a large value of γ generates small but non-zero regression parameters β for the model, and a small value of γ generates large regression parameters β. It can thus be inferred that the Lasso performs well if the true model has large parameters, but performs poorly if the true model has many small but non-zero parameters. Such a result agrees with the results obtained in Sections 2.6 and 2.7. It also agrees with the results obtained through intensive simulation in Tibshirani (1996). The Bridge demonstrates a similar effect to the Lasso: it performs well for small γ values (γ = 1, 1.5), but does not for large γ values, even though it can potentially select the best γ value.

In Figures 6.1 - 6.5, for fixed γ = 1, 1.5, 2, 3 and 4, respectively, the right-hand side shows the box plots of the MSE_r and PSE_r, and the left-hand side shows the plots of ten randomly selected sets of MSE and PSE in the original scale, including the maximum and minimum. It is shown that the MSE's of the different penalty models are highly correlated, and so are the PSE's. The values of the MSE's and the PSE's vary over a large range. It can be concluded that the comparison of MSE_r and PSE_r between different penalty models is appropriate, rather than the comparison of the original MSE and PSE.

It is shown from the above results that Bridge regression achieves small MSE and PSE, and performs well compared to the Lasso and the ridge for linear regression models with large regression parameters, but may perform poorly if the true models have many small but non-zero parameters.

Summary of the Simulation Results

In summary, it can be concluded from the above simulation studies that the shrinkage estimators (the Bridge, the Lasso and the ridge) achieve smaller variance and better estimation than the non-shrinkage estimator by shrinking the parameters towards 0, with a little sacrifice of bias, when collinearity is present in regression problems. In different cases, the Bridge, the Lasso and the ridge estimators perform differently. In general, the Bridge estimator with a small value of γ, such as the Lasso estimator, tends to favor models with many zero parameters or models with large parameters, but does not perform well, in terms of estimation error and prediction error, on models with many small but non-zero parameters. The Bridge estimator with a large value of γ, such as the ridge estimator, tends to favor models with moderate parameters or models with many small but non-zero parameters, but does not perform as well as the Bridge and the Lasso estimators on models with many zero parameters or models with large parameters. However, the ridge estimator performs well over a wide range of γ values, including models with small but non-zero parameters and models with many zero parameters. Therefore, the ridge estimator is recommended in general to deal with the collinearity problem in regression. In practice, one does not have much knowledge of


the true models; a training-and-testing method is therefore recommended, as shown in Section 7.3 in the next chapter and sketched in symbols below. This method randomly splits the data set into a training set and a test set, and builds several penalty models on the training set. Then the prediction errors on the test set are computed. The above procedure is repeated many times, and the averaged prediction errors of the penalty models over the different random splits are compared. The model having the least prediction error is selected as the optimal model.
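In symbols, writing $PE_m^{(r)}$ for the prediction error of penalty model $m$ on the test set of the $r$-th random split (notation introduced here only for this summary), the selection rule is
\[ \overline{PE}_m = \frac{1}{R} \sum_{r=1}^{R} PE_m^{(r)}, \qquad m^{*} = \arg\min_{m} \overline{PE}_m . \]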


Figure 6.1: Simulation with true β generated from the Bridge prior with γ = 1. Left: ten randomly selected sets of MSE or PSE, including the maximum and the minimum. Right: box plots of 300 sets of the relative MSE and PSE.


Figure 6.2: Simulation with true β generated from the Bridge prior with γ = 1.5. Left: ten randomly selected sets of MSE or PSE, including the maximum and the minimum. Right: box plots of 300 sets of the relative MSE and PSE.


Figure 6.3: Simulation with true β generated from the Bridge prior with γ = 2. Left: ten randomly selected sets of MSE or PSE, including the maximum and the minimum. Right: box plots of 300 sets of the relative MSE and PSE.


Figure 6.4: Simulation with true β generated from the Bridge prior with γ = 3. Left: ten randomly selected sets of MSE or PSE, including the maximum and the minimum. Right: box plots of 300 sets of the relative MSE and PSE.


Figure 6.5: Simulation with true β generated from the Bridge prior with γ = 4. Left: ten randomly selected sets of MSE or PSE, including the maximum and the minimum. Right: box plots of 300 sets of the relative MSE and PSE.


Chapter 7

Applications: Analyses of Health Data


In this chapter, I apply the Bridge penalty model to analyze several data sets obtained from public health studies.

7.1 Analysis of Prostate Cancer Data

We apply Bridge regression to a prostate cancer data set. The data comes from a study by Stamey et al. (1989) to examine the correlation between the level of prostate-specific antigen and a number of clinical measures in men who were about to receive a radical prostatectomy. The study had a total of 95 observations of male patients aged from 41 to 79 years. The covariates are log cancer volume (lcavol), log prostate weight (lweight), age, log of benign prostatic hyperplasia amount (lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason) and percent Gleason scores 4 or 5 (pgg45). The data was later studied in Tibshirani (1996). A more detailed description of the data set can be found in either of the above papers.

Some linear correlation is present among the covariates, as shown in the correlation coefficient matrix of X. The pairwise correlation coefficients are moderate, with the largest being 0.752 between gleason and pgg45, and the next 0.675 between lcavol and lcp. No strong linear relationship can be found through an examination of the condition number of the standardized covariate matrix X, the ratio of the greatest eigenvalue to the smallest eigenvalue of the matrix X^T X, which is 16.9 for the covariates considered here.
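In symbols, the measure quoted here is
\[ \kappa(X) = \frac{\lambda_{\max}(X^{T}X)}{\lambda_{\min}(X^{T}X)}, \]
computed on the standardized covariate matrix; the value 16.9 is far from the magnitudes that would signal a near-exact linear dependence.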

Correlation matrix of the covariates for the prostate cancer data (lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45; pairwise correlation coefficients as summarized above)

Two linear regression models are fitted to the centered data, one with no penalty and the other with the Bridge penalty. Table 7.1 shows the parameter estimates and their standard errors for the OLS model and the Bridge model. The OLS model has no vanishing coefficients, though some of them are not significant; for example, lcp, gleason and pgg45 are not significant. The Bridge estimator is obtained by the M-N-R or Shooting algorithm for each pair of fixed λ ≥ 0 and γ ≥ 1. The values of λ and γ are selected by the GCV, as shown in Figure 7.1. A Lasso model with λ = 7.2 is selected. This Bridge model sets the coefficients of lcp and gleason to 0 and leaves no covariates with pairwise correlation coefficient greater than 0.6 in the model. The standard errors of the Lasso estimator were computed from 10000 bootstrap samples. The Lasso estimator shows a much smaller standard error than the OLS due to the shrinkage effect.
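These are the usual bootstrap standard errors (a sketch, with B = 10000 and $\hat\beta_j^{(b)}$ the Lasso estimate refitted to the $b$-th bootstrap sample):
\[ \widehat{SE}(\hat\beta_j) = \Bigl\{ \frac{1}{B-1} \sum_{b=1}^{B} \bigl( \hat\beta_j^{(b)} - \bar\beta_j \bigr)^{2} \Bigr\}^{1/2}, \qquad \bar\beta_j = \frac{1}{B} \sum_{b=1}^{B} \hat\beta_j^{(b)} . \]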

While the OLS model yields a significant effect of the intercept, lcavol, lweight and svi, and a marginally significant effect of age and lbph, the Bridge model yields a significant effect of the intercept, lcavol, lweight and svi, and a marginally significant effect of lbph. The effect of age becomes non-significant in the Bridge model, and two regressors, lcp and gleason, have their coefficients set to zero.


Figure 7.1: Selection of parameters λ and γ for the prostate cancer data (Bridge optimization of λ and γ via the GCV).

Table 7.1: Estimates of the prostate cancer data (estimate (SE) for the OLS and Bridge models; rows: intercept, lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45; e.g., intercept 2.478 (0.072) in both models)


Table 7.2: Comparison in model selection* (columns: Predictor, OLS, Bridge, Subset (L-B); entries Y or N)
*. Y - significant effect in model; N - non-significant effect in model.

We compare the Bridge model with the model obtained from the subset selection by the leaps and bounds (L-B) method (Furnival and Wilson 1974, Seber 1977). The subset selection chooses the best model with the covariates lcavol, lweight, lbph and svi. The covariates age and pgg45 are in the Bridge model but not in the subset selection model. However, these two covariates are not significant at all. Therefore, the Bridge model agrees with the best model from the subset selection by the leaps and bounds method, as shown in Table 7.2.


Correlation matrix of the predictors of the kyphosis data

           age    age2  number   start
age      1.000   0.946  -0.023   0.059
age2     0.946   1.000  -0.004   0.076
number  -0.023  -0.004   1.000  -0.466
start    0.059   0.076  -0.466   1.000

7.2 Analysis of Kyphosis Data

We analyze the kyphosis data from a study of multiple-level thoracic and lumbar laminectomy, a corrective spinal surgery commonly performed in children for tumor and congenital or developmental abnormalities such as syrinx, diastematomyelia and tethered cord. The study had 83 observations of children aged from 1 to 243 months. It was studied by Bell et al. (1994) and analyzed by Hastie and Tibshirani (1990) using a generalized additive model (GAM). A detailed description of this study can be found in Bell et al. (1994) or Hastie and Tibshirani (1990).

The outcome of this study is binary: either the presence (1) or the absence (0) of kyphosis. The predictors are age in months at the time of the operation, the starting vertebra level and the number of vertebra levels involved in the operation (start and number). The quadratic term age2 is also included to study the quadratic effect of age.

A strong linear relation can be observed from the correlation matrix. The correlation coefficient between age and age2 is 0.946. The condition number of this matrix is 37.1, which also indicates that there exists a strong linear relationship among the covariates.

Two logistic regression models are fitted to the data: the no-penalty model and the Bridge penalty model. A Lasso model with shrinkage parameter λ = 0.22 was selected via the GCV for the Bridge model, as shown in Figure 7.2. We compare the logistic regression model with the Bridge model in Table 7.3.

Figure 7.2: Selection of parameters λ and γ for the kyphosis data (Bridge optimization of λ and γ via the GCV).

Table 7.3 shows the parameter estimates and their standard errors of both models. The standard errors of the Bridge estimates are obtained by the jackknife method (Shao and Tu 1995). It is shown that the logistic regression model yields a very significant effect of all the predictors considered. The age has an increasing-then-decreasing quadratic effect, the number has an increasing effect and the start has a decreasing effect. However, the Bridge model yields a very different result. It shrinks the estimate of number to non-significant, and the estimates of age and age2 to marginally significant. The effects of start and the intercept remain significant.

Table 7.3: Estimates of the kyphosis data (estimate (SE) for the logistic and Bridge models, fitted to the full data and to the data with two outliers removed; rows: intercept, age, age2, number, start)
*. Two outliers removed from the model: number = 14 or age = 243.
1. A Lasso model with λ = 0.22 is selected by the GCV.
2. A Lasso model with λ = 0.24 is selected by the GCV.

Hastie and Tibshirani (1990) fitted a GAM model on the entire data and obtained a quadratic age effect, an increasing number effect and a decreasing start effect. However, after removing two outliers from the model (one with number = 14, the other with age = 243), they ended with a best model, which yields a significant start effect and a marginally significant increasing-then-decreasing quadratic age effect. The effect of number becomes non-significant in the best GAM model. The result of the Bridge model agrees with that of the best GAM model in Hastie and Tibshirani (1990), and also agrees with the result reported in Bell et al. (1994).

To examine the robustness of the Bridge model, we further fit the logistic model and the Bridge model to the data with the two outliers removed. A Lasso model with λ = 0.24 is selected by the GCV for the Bridge model. As shown in Table 7.3, the logistic model shows a marginally significant effect of number with the outliers removed, while the Bridge


model shows a non-significant effect of number. No major difference is thus observed in the model selection with the two outliers either included or excluded. The Bridge model is robust to the outliers in this data set. Therefore, it can be concluded that the Bridge penalty model performs very well for this kyphosis data.

7.3 Analysis of Environmental Health Data

In this section, we apply the Bridge penalty models to analyze a data set obtained from an environmental health study in Windsor, Ontario. The study was conducted from 1992 to 1993 to study the effect of air pollution on health. For years, there had been a concern over the air pollution in the Windsor area. The major source of the pollution is the industrial activity and municipal incineration in the urban region of Detroit, Michigan. The study was based on a population of asthmatics in Windsor, consisting of 39 subjects aged 12 years and older. Each subject had 21 records, 4 weeks apart, of asthmatic status and some variables assessing the quality of the air, for example, the ozone level, the carbon monoxide level, etc. The response of asthmatic status was recorded as the time interval in the evening during which the asthmatic suffered from the symptoms. For the purpose of analysis, we dichotomize this variable and define a new response variable: Asthma Status = 1 if the night time interval is positive, and 0 otherwise. Therefore, we have a binary response variable, whether the asthmatic suffers from the symptoms or not, and a set of independent variables: the measures assessing the quality of the air, the mean temperature and the mean humidity, etc. There were 112 out of 819 total observations which indicate that the asthmatics suffered. Since the 21 observations of each subject are correlated, the generalized estimating equations (GEE) approach is adopted to study the relation between the asthmatic status and the pollutant factors.

Correlation matrix of the environmental health data

          clmno  clmno2  clmtrs   clmoz   clmco  clmcoh  clmso2   mtemp   mhumd
clmno     1.000   0.606   0.077  -0.458   0.414   0.660   0.275  -0.281  -0.025
clmno2    0.606   1.000   0.109  -0.410   0.378   0.712   0.491  -0.160  -0.031
clmtrs    0.077   0.109   1.000   0.154  -0.028   0.079   0.040   0.065  -0.147
clmoz    -0.458  -0.410   0.154   1.000  -0.149  -0.211  -0.052   0.705  -0.230
clmco     0.414   0.378  -0.028  -0.149   1.000   0.692   0.379  -0.051   0.243
clmcoh    0.660   0.712   0.079  -0.211   0.692   1.000   0.484  -0.123   0.084
clmso2    0.275   0.491   0.040  -0.052   0.379   0.484   1.000   0.035   0.006
mtemp    -0.281  -0.160   0.065   0.705  -0.051  -0.123   0.035   1.000  -0.084
mhumd    -0.025  -0.031  -0.147  -0.230   0.243   0.084   0.006  -0.084   1.000

Included in the GEE model are the following covariates: the closest measurements of nitrogen oxide (clm.no), nitrogen dioxide (clm.no2), total reduced sulphur (clm.trs), ozone (clm.oz), carbon monoxide (clm.co), the coefficient of haze (clm.coh) and sulphur dioxide (clm.so2), together with the mean temperature (mtemp) and the mean humidity (mhumd).

First, we examine the correlations between the covariates. Some collinearity is present, as shown in the pairwise correlation coefficient matrix above. The correlation coefficient is 0.712 between clm.coh and clm.no2, 0.705 between clm.oz and mtemp, etc. The condition number of the covariate matrix is 28.92, which indicates that there exists a moderate linear relation among these variables.

To compare the different penalty models, we split the data set into two: one training set and one test set. The test set consists of the observations of 9 randomly selected subjects, i.e. 9 x 21 = 189 observations. The 630 observations of the remaining 30 subjects are included in the training set.

Four GEE models with an exchangeable working correlation structure and different penalties are fitted to the training set: no-penalty, the Bridge penalty, the Lasso penalty and the ridge penalty. For each penalty model, the shrinkage parameters are selected via the quasi-GCV. Then the prediction errors PSE and MCE of the selected model are computed at each point of the test set as

\[ PSE = \mathrm{Dev}(y, \hat\mu), \qquad MCE = I\{\, |y - \hat\mu| \ge 0.5 \,\}, \]
where Dev is the deviance based on the marginal distribution, $I$ is the indicator function and $\hat\mu$ is the fitted mean at the test point. The PSE and MCE are further averaged over the points of the test set.
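For the binary response here, the marginal deviance contribution at a test point takes the familiar Bernoulli form (a sketch, writing $\hat\mu$ for the fitted probability):
\[ \mathrm{Dev}(y, \hat\mu) = -2 \bigl\{ y \log\hat\mu + (1-y) \log(1-\hat\mu) \bigr\} . \]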

For each single split, the PSE and MCE computed as above depend on the split. To compare the different penalty models, one needs to repeat the above procedure for many different splits, and to compare the prediction errors averaged over different random splits of the data set.

To assess the performance of the different models, we repeat the above procedure for 100 random splits of the data set. Table 7.4 shows the mean prediction errors and their standard errors over the 100 random splits. The relative PSE and MCE to the baseline of the no-penalty GEE model, defined as
\[ PSE_r = \frac{PSE - PSE_{GEE}}{PSE_{GEE}} \qquad \text{and} \qquad MCE_r = \frac{MCE - MCE_{GEE}}{MCE_{GEE}}, \]
are also reported to examine the reduction of the prediction error from the no-penalty GEE model.


Table 7.4: Comparison of prediction errors on test data over 100 random splits

        GEE             Bridge          Lasso           Ridge
PSE     35.928(2.818)   34.749(2.946)   37.652(2.996)   23.008(2.430)
MCE     0.370(0.028)    0.313(0.029)    0.319(0.030)    0.196(0.019)

It is shown that the ridge penalty model has a significant reduction of both PSE and MCE from the no-penalty model, while the Bridge and the Lasso have a significant reduction of MCE but no significant change of PSE. Figure 7.3 shows the box plots of MCE and PSE in the original scale for the different penalty models, together with the relative MCE and PSE. It is clearly shown that the ridge penalty model achieves the smallest MCE and PSE and thus performs the best in terms of prediction. It is also shown that the Bridge and the Lasso penalty models achieve better prediction in terms of MCE than the no-penalty GEE model, but not in terms of PSE. Therefore, it can be concluded that the ridge penalty model achieves the best prediction for this data set.

Having studied the performance of the different penalty models in terms of prediction errors, we compare the four penalty GEE models with exchangeable correlation structure on the entire data set. Table 7.5 shows the estimates and the standard errors of the different models. Since a Lasso penalty model with λ = 0.5 is selected for the Bridge penalty model by the quasi-GCV, as shown in Figure 7.4, the Lasso model is virtually the same as the Bridge model. A ridge penalty model with λ = 0.4 is selected by the quasi-GCV.


Figure 7.3: Comparison of prediction errors on test data by box plots (MCE and PSE in the original scale, and the relative MCE and PSE to the GEE baseline, for the GEE, Bridge, Lasso and ridge models).


Figure 7.4: Selection of parameters λ and γ for the environmental health data (Bridge optimization of λ and γ via the quasi-GCV).

"sandwich" estimator (4.2): while the standard errors for the other penalty models are

computed with the jackknife method (Shao and Tu. 1995). The jackknife is applied on

the subjects rather than on the observations since the subjects are independent but not

the observations.
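A sketch of the grouped (delete-one-subject) jackknife variance in its standard form, with K = 39 subjects and $\hat\beta_{(-k)}$ the estimate computed with subject $k$ removed (the exact variant in Shao and Tu (1995) may differ in centering):
\[ \widehat{\mathrm{Var}}_{jack}(\hat\beta) = \frac{K-1}{K} \sum_{k=1}^{K} \bigl( \hat\beta_{(-k)} - \bar\beta_{(\cdot)} \bigr) \bigl( \hat\beta_{(-k)} - \bar\beta_{(\cdot)} \bigr)^{T}, \qquad \bar\beta_{(\cdot)} = \frac{1}{K} \sum_{k=1}^{K} \hat\beta_{(-k)} . \]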

The no-penalty GEE model has only one significant effect, i-e. the effect of clm.trs.

The negative effect of clm-trs is not satisfactory since it is known to be air pollutant

and is expected to have a positive effect. Therefore. the no-penalty GEE model does not

yield a meaningfui result. The Bridge and the Lasso models set the effects of clrn.no,

clm.ot, clm.co, clrn.coh, cZm.so2 and mtemp to zero. No effects in the Bridge and the

Lasso model are significant. Therefore, both models fail to explain the variation of the


Table 7.5: Estimates of the environmental health data (estimate (SE) for the GEE, Bridge¹, Lasso² and Ridge³ models; rows: intercept and the nine covariates listed above)
1. A Lasso model (γ = 1) with λ = 0.5 is selected. 2. A Lasso model with λ = 0.5 is selected. 3. A ridge model with λ = 0.4 is selected. *. Significant effect.

However, the ridge penalty model yields a very different result. The negative significant effect of the total reduced sulphur (clm.trs) in the no-penalty GEE model becomes insignificant. The ridge penalty model yields a positive significant effect of the coefficient of haze (clm.coh), which is not significant in the no-penalty GEE model. None of the other covariates has a significant effect.

Although this ridge penalty GEE model still fails to detect the effects of many pollutant factors from the data, it certainly supplies information on the significance of the contribution of the coefficient of haze to the asthmatic status. The positive significant effect means that the larger the coefficient of haze (or the more severe the haze), the more likely the asthmatics are to suffer from the polluted environment. This result is achieved only with the ridge penalty model, which has the smallest prediction errors among the different penalty GEE models, as shown in the previous comparison based on prediction errors over random splits of the data set.

Overall, the GEE model with the ridge penalty achieves better prediction by shrinking the regression parameters, and yields a more meaningful result than the no-penalty GEE model for this environmental health study. Even though further investigation is still needed for the ridge penalty model in order to capture more information on the significance of the effects of pollutants other than the coefficient of haze, the ridge penalty model captures more information than the no-penalty GEE model and makes the interpretation of the model parameters more meaningful. It is demonstrated through this analysis that the penalized GEE model is very important and is potentially a good approach to handle collinearity among covariates of the GEE models.


Chapter 8

Discussions and Future Studies


8.1 Discussion

Regression is a widely used statistical tool for quantitative analysis in scientific research. Collinearity is a problem associated with regression. It influences estimation and prediction, and thus has a large impact on research.

Although there are many methods for dealing with collinearity, for example, principal component analysis, the shrinkage model is still an important method, which yields a simple and easy-to-interpret linear or generalized linear regression model.

Bridge regression, as a special family of penalized regressions with two very important members, ridge regression and the Lasso, plays an important role in handling the collinearity problem. It yields a small variance of the estimator and achieves good estimation and prediction by shrinking the estimator towards 0.

The simple and special structure of the Bridge estimators for γ ≥ 1 makes the computation very simple. The modified Newton-Raphson method for γ > 1 and the Shooting method for γ = 1 were developed based on the theoretical results on the structure of the Bridge estimators. In particular, the Shooting method for the Lasso benefits from the theoretical result that the Lasso estimator is the limit of the Bridge estimator as γ tends to 1 from above. It has a very simple closed form at each single step, and a simple iteration leads to fast convergence. These properties make it very attractive computationally, as can be seen from the simple and concise programming code in Appendix A. In contrast, the combined quadratic programming method by Tibshirani (1996) has a finite-step (2p) convergence, and potentially has an even better convergence rate (0.5p to 0.75p steps), as pointed out by Tibshirani (1996). In addition, the combined quadratic programming method has a tuning parameter with the standardized


range [0, 1] and is easy to optimize via grid search, while the Shooting method has no such standardized range, even though it has a threshold λ0 > 0 such that any tuning parameter λ ≥ λ0 sets the Lasso estimates β̂_j = 0 for j = 1, . . . , p (Gill, Murray and Wright 1981). We believe that the Shooting method has a convergence rate of order p log(p), although a theoretical result on the order has not been obtained. It is easy to see that for orthogonal X, only p steps are required to solve the p independent equations in (P3) by the Shooting method. Both the modified Newton-Raphson method and the Shooting method can be applied to generalized linear models via the IRLS procedure without extra effort.
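For reference, the single-coordinate update iterated by the Shooting method (the closed form implemented in Appendix A, for the linear model with penalty $\lambda \sum_j |\beta_j|$) is
\[
\hat\beta_j \leftarrow
\begin{cases}
(2S_j - \lambda)\,/\,(2\,x_j^{T}x_j), & 2S_j > \lambda, \\
(2S_j + \lambda)\,/\,(2\,x_j^{T}x_j), & 2S_j < -\lambda, \\
0, & |2S_j| \le \lambda,
\end{cases}
\qquad S_j = x_j^{T}\bigl( y - X\hat\beta_{-j} \bigr),
\]
where $\hat\beta_{-j}$ is the current estimate with its $j$-th component set to 0. For orthonormal $X$, each $S_j = x_j^{T} y$ is fixed and the update reduces to soft thresholding, so a single sweep over the $p$ coordinates solves the $p$ independent equations; this is the $p$-step special case noted above.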

The classical approach to penalization depends on joint likelihood functions, and is thus limited to the cases in which a joint likelihood function exists. However, cases in which no joint likelihood function exists have a broad range of applications in many scientific studies, for example, the GEE as discussed in Chapter 4. The classical approach to penalization does not apply in these cases, even though penalization is desirable due to highly correlated regressors. The penalized score equations introduced in Chapter 5 generalize penalization to be independent of joint likelihood functions and yield a shrinkage estimator which has the same properties as the estimators of penalized regressions with joint likelihoods. Therefore the penalized score equations provide a technique which enables penalization to be applied to cases in which no joint likelihood function exists, such as the GEE.
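Concretely, in the notation of Appendix B, the penalized score equations replace the estimating equations $S_j(\beta) = 0$ by
\[ S_j(\beta, \lambda, \gamma) + \lambda\gamma\,|\beta_j|^{\gamma-1}\,\mathrm{sign}(\beta_j) = 0, \qquad j = 1, \dots, p, \]
so that only the score vector, and not a joint likelihood, is needed to define the shrinkage estimator.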

As a new concept, the penalized score equations not only provide the techniques to handle the collinearity problem in the GEE, but also suggest different types of penalization. They present penalization in a different way compared to the constrained regressions: the former emphasize the solutions of the equations of (P3), as shown in Figure 2.1, while the latter emphasize the constrained area, as shown in Figure 1.1. One can then consider many types of penalty and comprehend the structure of the estimators from the penalized score equations by studying the solutions of (P3) as in Figure 2.1.

The generalized estimating equations approach is an important statistical method in longitudinal studies. The consistency of the GEE estimator and the working covariance structure make it very attractive in longitudinal studies. However, the correlation structure may induce the non-existence of a joint likelihood function, which hinders the implementation of penalization in the classical way via joint likelihood functions. The penalized GEE is a method of applying a penalty to the GEE structure via the penalized score equations. It circumvents the difficulty of the non-existence of joint likelihood functions. Therefore, the penalized score equations provide a theoretical support to the penalized GEE.

The generalized cross-validation (GCV) method was proposed initially to optimize the tuning parameter of smoothing splines, which are linear operators. This technique is borrowed here to select the shrinkage parameters λ and γ, as suggested by Tibshirani (1996) for the Lasso. It is evidently true in the literature that the GCV method works well for linear operators, including ridge regression. The simulation results of the linear regression model in Chapter 6 show that the GCV does not always select the best value of γ for the Bridge regression model, even though Bridge regression has the potential to select γ from the wide range [1, ∞). The following facts may partially, but not completely,


explain why the GCV does not select the best γ.

1. The Bridge operator is non-linear for γ ≠ 2. This can be seen clearly from (5.1), since the matrix D in (5.1) is a function of β. The non-linearity of the Bridge operator can be seen visually in Figure 2.3 for the special case of an orthonormal matrix. Since the Bridge operator (γ ≠ 2) performs very differently from the ridge operator (γ = 2) or the OLS operator (λ = 0), the linear approximation to the Bridge operator in the GCV definition (5.2) does not yield the best γ value for the model selection.
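Indeed, from Lemma 1 in Appendix B,
\[ D(\beta, \lambda, \gamma) = \mathrm{diag}\bigl( \lambda\gamma(\gamma-1)\,|\beta_j|^{\gamma-2} \bigr), \]
which depends on β whenever γ ≠ 2; only the ridge case γ = 2 gives a constant D and hence a linear operator.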

2. The range of the γ value is limited to [1, ∞). As the simulation results show, the Bridge yields very similar results to those of the Lasso in many cases. In certain cases, the GCV achieves its minimum at γ = 1, as shown in Figure 7.1. This may be due to the truncation of the range of γ at γ = 1. A value of γ less than 1 might be selected by the GCV if the range [0, 1) of γ were also considered in the Bridge model. Hence the truncation of the range of γ at γ = 1 may contribute to the frequent selection of the Lasso (γ = 1) by the GCV.

Because of the above reasons, it is not a surprise that the Bridge model does not always perform the best in estimation and prediction compared to the other shrinkage models, the Lasso and the ridge. Therefore, new optimization techniques are desirable, especially for non-linear operators.

The quasi-GCV, motivated by correlated Gaussian responses, incorporates the working correlation structure of the correlated responses into the deviance residuals and yields a weighted deviance. The weighted deviance reduces to the deviance for independent responses, and accordingly, the quasi-GCV reduces to the GCV. By incorporating the


working correlation structure, the quasi-GCV achieves the same effect in model selection for the penalized GEE as the GCV does for generalized linear models. It selects the shrinkage parameters of the penalized GEE in a very easy and simple way, and yields good estimation and prediction. The quasi-GCV generalizes the GCV to correlated responses and performs well in model selection.

The effective number of degrees of freedom of the correlated observations, motivated by the correlated Gaussian responses, takes the correlation structure into consideration. It corrects the total degrees of freedom of the data from the total number of observations to a reduced number for positively correlated observations. It reflects the effect of the correlation structure on the observations and thus closely captures the intrinsic relationship among the observations. Since the degrees of freedom play an important role in statistical inference and model adequacy checking, the correction of the degrees of freedom by the effective number of degrees of freedom is expected to have some effect on the interpretation of the GEE model.
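As a simple illustration (this is the standard design-effect correction, not necessarily the exact definition adopted in Chapter 5): for a cluster of k exchangeably correlated observations with common positive correlation ρ, the effective number of independent observations is approximately
\[ k_{\mathrm{eff}} = \frac{k}{1 + (k-1)\rho} \le k, \]
which equals k when ρ = 0 and approaches 1 as ρ → 1.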

Due to the lack of joint likelihood functions, many statistical procedures based on joint likelihoods, for example, the likelihood ratio test, do not apply to the GEE models. Thus the standard errors of the regression estimates become more important for inference. However, the complexity of the estimates of the penalty models makes it very hard to obtain simple formulas for the standard errors of the Bridge estimators. Very often, the standard errors are computed from some semi-parametric methods, like the bootstrap or the jackknife, which rely on large sample theory. Caution must be used when the observations are correlated, or when the number of independent units is not large.


8.2 Future Studies

It has been shown theoretically that the Bridge estimator has a simple structure and can be computed via simple and efficient algorithms. It has also been demonstrated through simulation studies that the Bridge estimator performs well in terms of estimation and prediction for linear regression models, generalized linear models and GEE models. There are still many interesting aspects of the Bridge estimator and other shrinkage estimators that need further investigation. They are summarized as follows.

1. Theoretical results on the asymptotic consistency.

It is not known whether the Bridge estimator, especially that of the penalized GEE, is asymptotically consistent, although we believe the asymptotic consistency is true. It is also not known how an incorrect specification of the correlation structure influences the consistency of the Bridge estimator and the selection of the shrinkage parameters of the penalized GEE. Therefore, it is of great importance to study the asymptotic consistency of the Bridge estimator of the penalized score equations in general, and particularly to investigate how incorrect specification of the correlation structure of the GEE influences the selection of the shrinkage parameters.

2. New model selection methods, especially for non-linear operators.

As discussed in the last section, the GCV method was initially introduced to select the tuning parameters for smoothing splines, which are linear operators. Since the Bridge operator is non-linear, the GCV does not always select the best value of γ for the Bridge model. It is desirable to develop methods to select the shrinkage parameters for non-linear operators, particularly for the Bridge penalty.


3. The Bridge penalty model with γ < 1.

The Bridge penalty with γ < 1 is not considered in this thesis. It is also of great interest to investigate this case. Although some difficulties are expected, for example, multiple solutions of the penalized score equations, etc., much work needs to be done both theoretically and computationally.

4. Other types of penalties.

As discussed in Chapter 3, other types of penalties can be considered in the form of the penalized score equations. It can be expected that different types of penalties may yield different structures of the estimators, and may lead to different yet interesting models and results.

Overall, further investigation is needed in the near future to comprehend this interesting topic of penalization in statistical modelling.


References

Bell, D.F., Walker, J.L., O'Connor, G. and Tibshirani, R. (1994). Spinal deformity after multiple-level cervical laminectomy in children. Spine, 19(4), 406-411.

Craven, P. and Wahba, G. (1979). Smoothing noisy data with spline functions. Numerische Mathematik, 31, 377-403.

Diggle, P.J., Liang, K.-Y. and Zeger, S.L. (1994). Analysis of Longitudinal Data. Clarendon, Oxford.

Durrett, R. (1991). Probability: Theory and Examples. Wadsworth, Belmont.

Efron, B. and Tibshirani, R.J. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York.

Frank, I.E. and Friedman, J.H. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35, 109-148.

Furnival, G.M. and Wilson, R.W., Jr. (1974). Regressions by leaps and bounds. Technometrics, 16, 499-511.

Gill, P.E., Murray, W. and Wright, M.H. (1981). Practical Optimization. Academic Press, London.

Green, P.J. (1984). Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives (with discussion). Journal of the Royal Statistical Society B, 46, 149-192.

Hastie, T.J. and Tibshirani, R.J. (1990). Generalized Additive Models. Chapman and Hall, New York.

Hocking, R.R. (1996). Methods and Applications of Linear Models: Regression and the Analysis of Variance. Wiley, New York.

Hoerl, A.E. and Kennard, R.W. (1970a). Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1), 55-67.

Hoerl, A.E. and Kennard, R.W. (1970b). Ridge regression: applications to nonorthogonal problems. Technometrics, 12(1), 69-82.

Laird, N.M. and Ware, J.H. (1982). Random-effects models for longitudinal data. Biometrics, 38, 963-974.

Lawson, C. and Hanson, R. (1974). Solving Least Squares Problems. Prentice-Hall.

Lee, A.J. (1993). Generating random binary deviates having fixed marginal distributions and specified degrees of association. The American Statistician, 47(3), 209-213.

Li, B. and McCullagh, P. (1994). Potential functions and conservative estimating functions. The Annals of Statistics, 22(1), 340-356.

Liang, K.-Y. and Zeger, S.L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13-22.

Liang, K.-Y., Zeger, S.L. and Qaqish, B. (1992). Multivariate regression analyses for categorical data (with discussion). Journal of the Royal Statistical Society B, 54, 3-40.

McCullagh, P. (1991). Quasi-likelihood and estimating functions. In Statistical Theory and Modelling: In Honour of Sir David Cox (D.V. Hinkley, N. Reid and E.J. Snell, eds.), 265-268. Chapman and Hall, London.

McCullagh, P. and Nelder, J.A. (1989). Generalized Linear Models (2nd ed.). Chapman and Hall, London.

Nelder, J.A. and Wedderburn, R.W.M. (1972). Generalized linear models. Journal of the Royal Statistical Society A, 135, 370-384.

Seber, G.A.F. (1977). Linear Regression Analysis. Wiley, New York.

Sen, A. and Srivastava, M. (1990). Regression Analysis: Theory, Methods, and Applications. Springer, New York.

Shao, J. (1993). Linear model selection by cross-validation. Journal of the American Statistical Association, 88, 486-494.

Shao, J. and Tu, D. (1995). The Jackknife and Bootstrap. Springer, New York.

Stamey, T., Kabalin, J., McNeal, J., Johnston, I., Freiha, F., Redwine, E. and Yang, N. (1989). Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate II: radical prostatectomy treated patients. Journal of Urology, 141, 1076-1083.

Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society B, 36, 111-147.

Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society B, 58(1), 267-288.

Wahba, G. (1990). Spline Models for Observational Data. Society for Industrial and Applied Mathematics, Philadelphia.

Wedderburn, R.W.M. (1974). Quasi-likelihood functions, generalized linear models and the Gauss-Newton method. Biometrika, 61, 439-447.

Zeger, S.L. and Liang, K.-Y. (1986). Longitudinal data analysis for discrete and continuous outcomes. Biometrics, 42, 121-130.

Zhang, P. (1992). On the distributional properties of model selection criteria. Journal of the American Statistical Association, 87, 733-737.


Appendix A

A FORTRAN Subroutine of the Shooting Method for the Lasso


In this appendix, we provide a FORTRAN subroutine of the Shooting method for the Lasso estimator. This subroutine is self-contained and can be called from any FORTRAN or S+ program.

Variables called in the subroutine:

N - sample size;
P - number of regressors;
X - regression matrix of dimension n x p;
Y - response variable of dimension n x 1;
B - regression parameters of dimension p x 1;
    input: the OLS estimates; output: the Lasso estimates;
LAM - the tuning parameter λ for the Lasso penalty;
EPS - the threshold for convergence, about 1.E-12.

Matrices used for working space:

BB, BO - matrices of dimension p x 1;
XI, XB, YXB - matrices of dimension n x 1.
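A minimal calling sequence might look as follows (a sketch; the name SHOOT is our reconstruction, since the original SUBROUTINE statement was lost, and B must hold the OLS estimates on entry):

C     X, Y hold the standardized data; B holds the OLS estimates
      LAM = 7.2
      EPS = 1.E-12
      CALL SHOOT(N, P, X, Y, B, LAM, EPS, BB, BO, XI, XB, YXB)
C     on exit, B holds the Lasso estimates for the tuning parameter LAM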


C     (The SUBROUTINE statement and declarations were lost in scanning
C     and are reconstructed here from the variable list above; the
C     name SHOOT is our choice.)
      SUBROUTINE SHOOT(N, P, X, Y, B, LAM, EPS, BB, BO, XI, XB, YXB)
      INTEGER N, P, I, J, II, KK
      DOUBLE PRECISION X(N,P), Y(N,1), B(P,1), LAM, EPS
      DOUBLE PRECISION BB(P,1), BO(P,1), XI(N,1), XB(N,1), YXB(N,1)
      DOUBLE PRECISION S(1,1), JUNK, NORM2, DIST
C     outer loop: sweep over all p coordinates until convergence
      DO 1000 KK = 1, 1000
C        save the current estimate for the convergence test below
         DO 1 II = 1, P
    1    BO(II,1) = B(II,1)
         DO 10 I = 1, P
C           BB = current B with its I-th component set to zero
            DO 20 J = 1, P
               IF (J .EQ. I) THEN
                  BB(J,1) = 0.
               ELSE
                  BB(J,1) = B(J,1)
               END IF
   20       CONTINUE
C           S = XI'(Y - X*BB): the partial residual score for coordinate I
            CALL MATM(XB, X, BB, N, P, 1)
            CALL MATS(YXB, Y, XB, N, 1)
            CALL MATSUBCOL(XI, X, N, P, I)
            CALL MATTM(S, XI, YXB, 1, N, 1)
C           closed-form single-coordinate update for the Lasso
            IF (-2.*S(1,1) .LT. -LAM) THEN
               JUNK = (2.*S(1,1)-LAM)/2./NORM2(XI,N)
            ELSE IF (-2.*S(1,1) .GT. LAM) THEN
               JUNK = (2.*S(1,1)+LAM)/2./NORM2(XI,N)
            ELSE
               JUNK = 0.
            END IF
            DO 5 II = 1, P
               IF (II .NE. I) THEN
                  B(II,1) = B(II,1)
               ELSE
                  B(II,1) = JUNK
               END IF
    5       CONTINUE
   10    CONTINUE
         IF (DIST(B,BO,P) .LT. EPS) GOTO 5000
 1000 CONTINUE
 5000 RETURN
      END

      FUNCTION NORM2(X, N)
C     squared Euclidean norm of the N x 1 vector X
      INTEGER N, I
      DOUBLE PRECISION NORM2, X(N,1)
      NORM2 = 0.
      DO 50 I = 1, N
         NORM2 = NORM2 + (X(I,1))**2
   50 CONTINUE
      RETURN
      END

      FUNCTION DIST(X, Y, N)
C     Euclidean distance between the N x 1 vectors X and Y
      INTEGER N, I
      DOUBLE PRECISION DIST, X(N,1), Y(N,1)
      DIST = 0.
      DO 500 I = 1, N
         DIST = DIST + (X(I,1)-Y(I,1))**2
  500 CONTINUE
      DIST = SQRT(DIST)
      RETURN
      END

      SUBROUTINE MATM(A, B, C, N, M, P)
C     MATRIX MULTIPLICATION A = B*C
      INTEGER N, M, P, I, J, K
      DOUBLE PRECISION A(N,P), B(N,M), C(M,P)
      DO 10 I = 1, N
         DO 20 J = 1, P
            A(I,J) = 0.
            DO 30 K = 1, M
               A(I,J) = A(I,J) + B(I,K)*C(K,J)
   30       CONTINUE
   20    CONTINUE
   10 CONTINUE
      RETURN
      END

      SUBROUTINE MATTM(A, B, C, N, M, P)
C     MATRIX MULTIPLICATION A = T(B)*C, WHERE T(B) IS THE TRANSPOSE OF B
C     (the loop body was lost in scanning and is reconstructed here by
C     analogy with MATM above)
      INTEGER N, M, P, I, J, K
      DOUBLE PRECISION A(N,P), B(M,N), C(M,P)
      DO 10 I = 1, N
         DO 20 J = 1, P
            A(I,J) = 0.
            DO 30 K = 1, M
               A(I,J) = A(I,J) + B(K,I)*C(K,J)
   30       CONTINUE
   20    CONTINUE
   10 CONTINUE
      RETURN
      END

      SUBROUTINE MATS(A, B, C, N, P)
C     MATRIX SUBTRACTION A = B - C
      INTEGER N, P, I, J
      DOUBLE PRECISION A(N,P), B(N,P), C(N,P)
      DO 100 I = 1, N
         DO 200 J = 1, P
            A(I,J) = B(I,J) - C(I,J)
  200    CONTINUE
  100 CONTINUE
      RETURN
      END

      SUBROUTINE MATSUBCOL(XI, X, N, P, I)
C     COPY COLUMN I OF THE N x P MATRIX X INTO THE N x 1 VECTOR XI
C     (the loop body was lost in scanning and is reconstructed here)
      INTEGER N, P, I, J
      DOUBLE PRECISION XI(N,1), X(N,P)
      DO 10 J = 1, N
         XI(J,1) = X(J,I)
   10 CONTINUE
      RETURN
      END


Appendix B

Mathematical Proofs


In this appendix, I give the mathematical proofs of Theorems 1, 2 and 3. The proof of Theorem 4 is the same as that of Theorem 3 and is therefore omitted.

Let $F = (F_1, \dots, F_p)$, where $F_j = S_j(\beta,\lambda,\gamma) + d(\beta_j,\lambda,\gamma)$, $j = 1, \dots, p$.

Lemma 1 Given $\lambda > 0$ and $\gamma > 1$. If the Jacobian $(\partial S/\partial\beta)$ is positive semi-definite, then $(\partial F/\partial\beta)$ is positive definite at $\beta_j \ne 0$, $j = 1, \dots, p$.

Proof Observe that
\[ \frac{\partial F}{\partial \beta} = \frac{\partial S}{\partial \beta} + D(\beta, \lambda, \gamma), \]
where $D(\beta,\lambda,\gamma) = \mathrm{diag}\bigl( \lambda\gamma(\gamma-1)|\beta_j|^{\gamma-2} \bigr)$. $D$ is positive definite for $\gamma > 1$ and $\beta_j \ne 0$, $j = 1, \dots, p$. This completes the proof.

Lemma 2 Given $\lambda > 0$. The function $-d(\beta_j,\lambda,\gamma) = -\lambda\gamma|\beta_j|^{\gamma-1}\,\mathrm{sign}(\beta_j)$ converges to the Heaviside-type function $-\lambda\,\mathrm{sign}(\beta_j)$ at $\beta_j \ne 0$ as $\gamma \to 1+$.

Proof It is obvious by observing that the function $d$ is continuous in $\gamma$ at $\beta_j \ne 0$.

Proof of Theorem 1.

1. First, it is easy to prove the existence of a solution of (P3) by mathematical induction on the dimension $p$, and that the solution is almost surely non-zero. Secondly, the conditions of the Implicit Function Theorem are satisfied by Lemma 1. Therefore, there exists a unique solution $\hat\beta(\lambda,\gamma)$ satisfying (P3), and $\hat\beta(\lambda,\gamma)$ is continuous in $(\lambda,\gamma)$.

2. We prove the existence of the limit of $\hat\beta(\lambda,\gamma)$ as $\gamma \to 1+$ by mathematical induction.

(1). $p = 1$. If there is an intersection of the functions $S(\beta,\lambda,\gamma)$ and $-d(\beta,\lambda,\gamma)$, then by the continuity of these two functions and Lemma 2, $\lim_{\gamma\to 1+}\hat\beta(\lambda,\gamma)$ exists and is equal to the coordinate of the intersection. If there is no intersection of the functions $S(\beta,\lambda,\gamma)$ and $-d(\beta,\lambda,\gamma)$, it is easy to prove that $\lim_{\gamma\to 1+}\hat\beta(\lambda,\gamma) = 0$. Therefore, the result holds for $p = 1$.

(2). In the remainder of the proof, we omit λ from the expressions because it is kept constant. Assume that the result holds for all dimensions $1, \dots, (p-1)$. We prove that it also holds for dimension $p$. Consider the sub-problem formed by the first $p-1$ equations of (P3) for fixed $\beta_p$. By the assumption, $\partial S/\partial\beta > 0$, which implies that $\partial(S_1, \dots, S_{p-1})/\partial(\beta_1, \dots, \beta_{p-1}) > 0$. Then the result of Theorem 1 holds for this $(p-1)$-dimensional sub-problem for fixed $\beta_p$. Therefore, the limit of the unique solution $(\hat\beta_1(\beta_p,\gamma), \dots, \hat\beta_{p-1}(\beta_p,\gamma))$ of this sub-problem exists as $\gamma \to 1+$. Substitute this solution into the last equation of (P3):
\[ S_p\bigl( \hat\beta_1(\beta_p,\gamma), \dots, \hat\beta_{p-1}(\beta_p,\gamma), \beta_p \bigr) + d(\beta_p,\gamma) = 0. \qquad (B.1) \]
Then we need to prove that this equation has a unique solution $\hat\beta_p(\gamma)$ of which the limit exists as $\gamma \to 1+$. Denote the first term on the left-hand side of (B.1) by $L(\beta_p,\gamma)$. By the chain rule,
\[ \frac{\partial L}{\partial \beta_p} = \frac{\partial S_p}{\partial \beta_1}\frac{\partial \hat\beta_1}{\partial \beta_p} + \cdots + \frac{\partial S_p}{\partial \beta_{p-1}}\frac{\partial \hat\beta_{p-1}}{\partial \beta_p} + \frac{\partial S_p}{\partial \beta_p}. \]
By the Implicit Function Theorem applied to the $(p-1)$-dimensional sub-problem, the partial derivatives $\partial\hat\beta_j/\partial\beta_p$, $j = 1, \dots, p-1$, satisfy the linear system obtained by differentiating the first $p-1$ equations of (P3) with respect to $\beta_p$,
\[ \Bigl( \frac{\partial F_j}{\partial \beta_k} \Bigr)_{j,k \le p-1} \Bigl( \frac{\partial \hat\beta_k}{\partial \beta_p} \Bigr)_{k \le p-1} = - \Bigl( \frac{\partial S_j}{\partial \beta_p} \Bigr)_{j \le p-1}. \qquad (B.2) \]
From (B.2), one can easily show that $\partial L/\partial\beta_p \ge 0$ by a simple calculation in linear algebra. Therefore, there exists a unique solution $\hat\beta_p(\gamma)$ satisfying equation (B.1).


To prove that the limit of $\hat\beta_p(\gamma)$ exists, notice that $\partial L/\partial\beta_p \ge 0$ for any $\gamma > 1$. Similarly, one can prove that the solution of the following equation exists:
\[ S_p\bigl( \hat\beta_1(\beta_p,1+), \dots, \hat\beta_{p-1}(\beta_p,1+), \beta_p \bigr) + d(\beta_p,\gamma) = 0, \qquad (B.3) \]
where $\hat\beta_j(\beta_p,1+)$ is the limit of the solution $\hat\beta_j(\beta_p,\gamma)$ for fixed $\beta_p$, $j = 1, \dots, p-1$. Denote the solution of (B.3) by $\tilde\beta_p(\gamma)$. By the assumption of the induction, $\lim_{\gamma\to 1+}\tilde\beta_p(\gamma)$ exists.

Rewrite equation (B.1) as
\[ S_p\bigl( \hat\beta_1(\beta_p,1+), \dots, \hat\beta_{p-1}(\beta_p,1+), \beta_p \bigr) + d(\beta_p,\gamma) + \Delta(\beta_p,\gamma) = 0, \]
where the function $\Delta(\beta_p,\gamma)$ is defined as
\[ \Delta(\beta_p,\gamma) = S_p\bigl( \hat\beta_1(\beta_p,\gamma), \dots, \hat\beta_{p-1}(\beta_p,\gamma), \beta_p \bigr) - S_p\bigl( \hat\beta_1(\beta_p,1+), \dots, \hat\beta_{p-1}(\beta_p,1+), \beta_p \bigr). \]
We need to prove that the solutions of (B.1) and (B.3) have the same limit. This can be achieved by proving that
\[ |\Delta(\beta_p,\gamma)| \le \delta(\gamma), \]
where $\delta(\gamma)$ is independent of $\beta_p$ and converges to 0 as $\gamma \to 1+$. Since $S_p$ is differentiable with bounded partial derivatives $\partial S_p/\partial\beta_j$, and $\hat\beta_j(\beta_p,\gamma)$ is differentiable with bounded partial derivatives $\partial\hat\beta_j/\partial\beta_p$ by the Implicit Function Theorem, and $\hat\beta_j(\beta_p,\gamma) \to \hat\beta_j(\beta_p,1+)$ for any value of $\beta_p$, it can be shown by a standard argument in functional analysis that there exists such a function $\delta(\gamma) \to 0$ uniformly in $\beta_p$.

This completes the proof of Theorem 1.


Proof of Theorem 2.

(1). Given $\lambda > 0$ and $\gamma > 1$. Since there exists a joint likelihood function and $\partial S/\partial\beta$ is positive definite, the function $-2\log(\mathrm{Lik})$ is convex. By the same argument as in Lemma 1, the function $G(\beta,\lambda,\gamma)$ is convex and is minimized uniquely at some finite point. Therefore, the Bridge estimator is unique. Since (P3) has a unique solution $\hat\beta(\lambda,\gamma)$, which satisfies $\hat\beta_j \ne 0$ almost surely for $j = 1, \dots, p$, and the function $G$ is differentiable at $\hat\beta(\lambda,\gamma)$, $G$ is minimized at $\hat\beta(\lambda,\gamma)$. By the uniqueness of the Bridge estimator of (P2), $\hat\beta(\lambda,\gamma)$ is equal to the Bridge estimator of (P2).

(2). Given $\lambda > 0$. By Theorem 1, $\lim_{\gamma\to 1+}\hat\beta(\lambda,\gamma)$ exists. We denote the limit by $\hat\beta(\lambda,1+)$. Since $G(\beta,\lambda,\gamma)$ is continuous in $(\beta,\lambda,\gamma)$, $\lim_{\gamma\to 1+} G(\hat\beta(\lambda,\gamma),\lambda,\gamma) = G(\hat\beta(\lambda,1+),\lambda,1)$. Also notice that $\hat\beta(\lambda,\gamma)$ is the unique estimator minimizing $G(\cdot,\lambda,\gamma)$, and $\hat\beta_{lasso}(\lambda)$ is the unique estimator minimizing $G(\cdot,\lambda,1)$, since $G$ is convex for $\gamma = 1$. We prove that $\hat\beta(\lambda,1+) = \hat\beta_{lasso}(\lambda)$ by contradiction. If it were not true, then by the uniqueness of the Lasso estimator,
\[ G\bigl( \hat\beta(\lambda,1+), \lambda, 1 \bigr) > G\bigl( \hat\beta_{lasso}(\lambda), \lambda, 1 \bigr). \]
Take $\epsilon_0 > 0$ such that
\[ \epsilon_0 < \bigl| G\bigl( \hat\beta(\lambda,1+), \lambda, 1 \bigr) - G\bigl( \hat\beta_{lasso}(\lambda), \lambda, 1 \bigr) \bigr|. \]
Since $G(\hat\beta(\lambda,\gamma),\lambda,\gamma)$ and $G(\hat\beta_{lasso}(\lambda),\lambda,\gamma)$ are continuous in $\gamma$, there exists a $\gamma_0 > 1$ such that for $1 < \gamma < \gamma_0$,
\[ G\bigl( \hat\beta(\lambda,\gamma), \lambda, \gamma \bigr) > G\bigl( \hat\beta_{lasso}(\lambda), \lambda, \gamma \bigr). \]
However, this contradicts the fact that
\[ G\bigl( \hat\beta(\lambda,\gamma), \lambda, \gamma \bigr) < G(\beta, \lambda, \gamma) \]
for any $\beta \ne \hat\beta(\lambda,\gamma)$, by the uniqueness of the Bridge estimator. This completes the proof.

Proof of Theorem 3.

Notice that, by Theorem 1, the limit of the Bridge estimator as γ tends to 1+ is the Lasso estimator. Taking this limit at each step of the Modified Newton-Raphson (M-N-R) algorithm leads to the Shooting algorithm. Hence the convergence of the M-N-R algorithm implies the convergence of the Shooting algorithm, by simply taking the limit as γ tends to 1+ at each step. Therefore it suffices to prove the convergence of the M-N-R algorithm. We prove it for the following two cases.


(1). There exists a joint likelihood function. By Lemma 1, the function $G(\beta,\lambda,\gamma)$ is convex. There exists a unique solution minimizing $G$, i.e. $\hat\beta_{brg} = \arg\min G$. For $p = 1$, the M-N-R algorithm converges to the unique solution of (P3), which is the Bridge estimator by Theorem 2. Hence, it minimizes the function $G$ in $\beta$. For $p > 1$ and fixed $\beta_{-j}$, updating $\hat\beta_j$ by the M-N-R algorithm achieves the local minimum of $G$ in $\beta_j$ for fixed $\beta_{-j}$. Denote the value of $G$ by $G_{mj}$ and the updated value of $\hat\beta$ by $\hat\beta_{mj}$ after updating $\hat\beta_j$ at step $m$ of the M-N-R algorithm; one has
\[ G_{m1} \ge G_{m2} \ge \cdots \ge G_{mp} \ge G_{(m+1)1} \ge \cdots \ge \min G. \]
By the convexity of the function $G$ and the uniqueness of the Bridge estimator which minimizes $G$, $G_{mj}$ converges to the unique minimum $\min(G)$ and $\hat\beta_{mj}$ converges to the unique Bridge estimator $\hat\beta_{brg}$. Consequently, the subsequence $\hat\beta_{mp}$, which is equal to $\hat\beta_m$ by definition, converges to $\hat\beta_{brg}$.

(2). There exists no joint likelihood function. We prove that in a small neighbourhood of the Bridge estimator $\hat\beta_{brg}$ of (P3) there exists a potential function of $\beta$ such that the gradient of this potential function is equal to the vector field of $S$. Then the convergence can be proved through (1) above.

We prove that there exists an approximation of such a potential function. Since the Jacobian is positive definite, by Theorem 1 there exists a unique solution $\hat\beta_{brg}$. Denote the matrix
\[ Q = - \Bigl( \frac{\partial S}{\partial \beta^{T}} \Bigr) \Big|_{\beta = \hat\beta_{brg}}. \]
Define a real function $L(\beta) = \frac{1}{2} S^{T} Q^{-1} S$ and take the partial derivative with respect to $\beta$ in a neighbourhood of $\hat\beta_{brg}$:
\[ \frac{\partial L}{\partial \beta} = \Bigl( \frac{\partial S}{\partial \beta^{T}} \Bigr)^{T} Q^{-1} S = -S + \Delta S, \]
where $\Delta = o(1)$ by the continuity of the Jacobian $(\partial S/\partial\beta)$ at $\hat\beta_{brg}$. Therefore, the function $L(\beta)$ is an approximation to the potential function of which the gradient is equal to $S$. This completes the proof.

Remark The existence of a local potential function in a neighbourhood of some point $\beta$ of the vector field $S$ does not imply the existence of a global potential function whose gradient is equal to $S$, since $S$ may be path-dependent.

