
Computational Statistics & Data Analysis 51 (2007) 6044–6059, www.elsevier.com/locate/csda

Boosting ridge regression

Gerhard Tutz a,∗, Harald Binder b

a Institut für Statistik, Ludwig-Maximilians-Universität München, Akademiestr. 1, D-80799 München, Germany
b Institut für Medizinische Biometrie und Medizinische Informatik, Universitätsklinikum Freiburg, Germany

Received 22 May 2006; received in revised form 22 November 2006; accepted 29 November 2006. Available online 22 December 2006.

Abstract

There are several possible approaches to combining ridge regression with boosting techniques. In the simple or naive approach the ridge estimator is used to fit iteratively the current residuals, yielding an alternative to the usual ridge estimator. In partial boosting only part of the regression parameters are reestimated within one step of the iterative procedure. The technique allows to distinguish between mandatory variables that are always included in the analysis and optional variables that are chosen only if relevant. The resulting procedure selects optional variables in a similar way as the Lasso, yielding a reduced set of influential variables, while allowing for regularized estimation of the mandatory parameters. The suggested procedures are investigated within the classical framework of continuous response variables as well as in the case of generalized linear models. The performance in terms of prediction and the identification of relevant variables is compared to several competitors such as the Lasso and the more recently proposed elastic net. For the evaluation of the identification of relevant variables pseudo ROC curves are introduced.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Ridge regression; Boosting; Lasso; Mandatory variables; Pseudo ROC curves

1. Introduction

Ridge regression has been introduced by Hoerl and Kennard (1970b) to overcome problems of existence of the ordinary least squares estimator and achieve better prediction. In linear regression with y = Xβ + ε, where y is an n-vector of centered responses, X an (n × p)-design matrix, β a p-vector of parameters and ε a vector of iid random errors, the estimator is obtained by minimizing the sum of squares (y − Xβ)^T(y − Xβ) subject to a constraint ∑_j |β_j|² ≤ t. It has the explicit form β̂_R = (X^T X + λI_p)^{−1} X^T y, where the tuning parameter λ depends on t and I_p denotes the (p × p) identity matrix. The shrinkage towards zero makes β̂_R a biased estimator. However, since the variance is smaller than for the ordinary least squares estimator (λ = 0), better estimation is obtained. For details see Seber (1977), Hoerl and Kennard (1970a,b) and Frank and Friedman (1993). Alternative shrinkage estimators have been proposed by modifying the constraint. Frank and Friedman (1993) introduced bridge regression which is based on the constraint ∑ |β_j|^γ ≤ t with γ ≥ 0. Tibshirani (1996) proposed the Lasso, which results for γ = 1, and investigated its properties; for extensions see also Fu (1998) and Le Cessie and van Houwelingen (1992).
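As a small numerical illustration of the estimator above, the following NumPy sketch computes β̂_R = (X^T X + λI_p)^{−1} X^T y for simulated data; the data and the value of λ are arbitrary choices made for this example and are not taken from the paper.

import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 5, 2.0                     # sample size, number of predictors, tuning parameter (arbitrary)
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, 0.0, -1.0, 0.0, 0.5])
y = X @ beta_true + rng.normal(size=n)
y = y - y.mean()                           # centered responses, as assumed in the text

# Ridge estimator (X'X + lambda I_p)^(-1) X'y and the least squares estimator (lambda = 0)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_ridge)                          # shrunken towards zero relative to beta_ols
print(beta_ols)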

In the present paper boosted versions of the ridge estimator are investigated. Boosting has been originally developed in the machine learning community as a means to improve classification (e.g. Schapire, 1990). More recently, it has been

∗ Corresponding author. E-mail addresses: [email protected] (G. Tutz), [email protected] (H. Binder).

0167-9473/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.csda.2006.11.041


[Fig. 1 shows standardized coefficients plotted against |beta|/|GLM beta| for the variables nosec, persdis, noupto, losupto, age, prevamb, somprob, homeless and jobless.]

Fig. 1. Coefficient build-up (i.e. coefficient estimates for a varying number of boosting steps/different values of the constraint parameter plotted against their relative L1 norm) for example data (simple ridge boosting: upper left; partial boosting: upper right; the Lasso: bottom panel). Arrows indicate the models chosen by AIC (for boosting) or cross-validation (for the Lasso; repeated 10 times).

shown that boosting can be seen as the fitting of an additive structure by minimizing specific loss functions (Breiman, 1999; Friedman et al., 2000). Bühlmann and Yu (2003) and Bühlmann (2006) have proposed and investigated boosted estimators in the context of linear regression with the focus on L2 loss. In the following, linear as well as generalized linear models (GLMs) are considered. In the linear case boosting is based on L2 loss whereas in the more general case of GLMs maximum likelihood based boosting is used.

In Section 2 we introduce a partial boosting procedure. Partial boosting is a parsimonious modelling strategy like the Lasso (Tibshirani, 1996) and the more recently proposed elastic net (Zou and Hastie, 2005). These approaches are appealing since they produce sparse representations and accurate prediction. Partial boosting means that only a selection of the regression parameters is reestimated in one step. The procedure has several advantages: by estimating only components of the predictor the performance is strongly improved; important parameters are automatically selected and moreover the procedure allows to distinguish between mandatory variables, for example the treatment effect in treatment studies, and optional variables, for example covariates which might be of relevance. The inclusion of mandatory variables yields coefficient build-ups, i.e. coefficient estimates plotted against their relative L1 norm (for each boosting step), that are quite different from those obtained for the Lasso or the elastic net (when varying the constraint parameter). The top right panel of Fig. 1 shows the estimates where two variables (“persdis” and “nosec”) are included as mandatory in a binary response example (details are given in Section 5). The resulting coefficient build-up


is quite different from that obtained for simple ridge boosting and the Lasso. When all regression parameters are reestimated in each step a simple boosting procedure is obtained where the ridge estimator is used iteratively to fit residuals.

A useful property of ridge boosting is that it can be extended to GLM settings. In Section 3, it is shown that generalized ridge boosting may be constructed as likelihood based boosting, including variable selection and customized penalization techniques. The performance of ridge boosting is investigated in Section 4 by simulation techniques. It is demonstrated that componentwise ridge boosting (i.e. partial boosting with only one parameter reestimated in each step) often behaves similarly to the Lasso and the elastic net, with which it shares the property of automatic variable selection. Special attention is dedicated to the identification of influential variables, which is investigated by novel pseudo ROC curves. It is shown that for moderate correlation among covariates ridge boosting estimators perform very well. In Section 5 an application example that employs mandatory variables is given.

2. Boosting in ridge regression: continuous response

2.1. The estimator

Boosting has been described in a general form as forward stepwise additive modelling (Hastie et al., 2001). Ridge boosting as considered here means that the ridge estimator is applied iteratively to the residuals of the previous iteration.

For high-dimensional predictor space the performance can be strongly improved by updating only one component of β in one iteration. Bühlmann and Yu (2003) refer to the method as componentwise boosting and propose to choose the component to be updated with reference to the resulting improvement of the fit. This powerful procedure implicitly selects variables to be included in the predictor. What comes as an advantage may turn into a drawback when covariates which are important to the practitioner are not included. Moreover, componentwise boosting does not distinguish between continuous and categorical predictors. In our studies continuous predictors have been preferred in the selection process, probably since continuous predictors contain more information than binary variables.

Therefore in the following partial boosting of ridge estimators is proposed, where partial boosting means that in the mth iteration selected components of the parameter vector β^T_(m) = (β_(m),1, . . . , β_(m),p) are reestimated. The selection is determined by a specific structuring of the parameters (variables). Let the parameter indices V = {1, . . . , p} be partitioned into disjoint sets by V = V_c ∪ V_o1 ∪ · · · ∪ V_oq, where V_c stands for the (mandatory) parameters (variables) that have to be included in the analysis, and V_o1, . . . , V_oq represent blocks of parameters that are optional. A block V_or may refer to all the parameters which refer to a multicategorical variable, such that not only parameters but variables are evaluated. Candidates in the refitting process are all combinations V_c ∪ V_or, r = 1, . . . , q, representing combinations of necessary and optional variables. Componentwise boosting, as considered by Bühlmann and Yu (2003), is the special case where V_c = ∅, V_oj = {j}.

Let now V_m = (m_1, . . . , m_l) denote the indices of parameters to be considered for refitting in the mth step. One obtains the actual design matrix from the full design matrix X = (x_·1, . . . , x_·p) by selecting the corresponding columns, obtaining the design matrix X_{V_m} = (x_·m_1, . . . , x_·m_l). Then in iteration m ridge regression is applied to the reduced model y_i − μ̂_(m−1),i = (x_{i,m_1}, . . . , x_{i,m_l})^T β^R_{V_m} + ε_i, where μ̂_(m−1),i is the fit from the previous iteration, yielding solutions

β̂^R_{V_m} = (β̂^R_{V_m,m_1}, . . . , β̂^R_{V_m,m_l}) = (X^T_{V_m} X_{V_m} + λI)^{−1} X^T_{V_m}(y − μ̂_(m−1)).

The total parameter update is obtained from the components

β̂^R_(V_m),j = β̂^R_{V_m,j}  if j ∈ V_m,    β̂^R_(V_m),j = 0  if j ∉ V_m,

yielding β̂^R_(V_m) = (β̂^R_(V_m),1, . . . , β̂^R_(V_m),p)^T. The corresponding update of the mean is given by μ̂_(m) = μ̂_(m−1) + X_{V_m} β̂^R_{V_m} = X(β̂_(m−1) + β̂^R_(V_m)), and the new parameter vector is β̂_(m) = β̂_(m−1) + β̂^R_(V_m). In the general case where in the mth step several candidate sets V^(j)_m = (m^(j)_1, . . . , m^(j)_l), j = 1, . . . , l, are evaluated, an additional selection step is needed. Typically, only a relatively small number of candidate sets will be considered. Candidate sets containing only a single element or elements corresponding to a single (categorical) variable in addition to the mandatory elements will be the norm.
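The structuring into a mandatory block and optional blocks translates directly into the candidate sets V_c ∪ V_or evaluated in each step. A minimal sketch (index sets chosen purely for illustration) is given below; componentwise boosting corresponds to an empty mandatory block and singleton optional blocks.

# Candidate sets V_c ∪ V_or for partial boosting (illustrative index sets)
V_c = [0, 1]                     # mandatory parameters, refitted in every candidate
V_o = [[2], [3], [4, 5]]         # optional blocks, e.g. [4, 5] for a multicategorical variable

candidate_sets = [V_c + block for block in V_o]
print(candidate_sets)            # [[0, 1, 2], [0, 1, 3], [0, 1, 4, 5]]

# Componentwise boosting: V_c = [] and singleton blocks {j}
p = 6
componentwise_sets = [[j] for j in range(p)]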


In summary the algorithm has the following form:

Algorithm: PartBoostR

Step 1: Initialization. β̂_(0) = (X^T X + λI_p)^{−1} X^T y, μ̂_(0) = X β̂_(0).

Step 2: Iteration. For m = 1, 2, . . .

(a) Compute for j = 1, . . . , l the parameter updates β̂_(V^(j)_m) and the corresponding means μ̂^(j)_(m) = μ̂_(m−1) + X_{V^(j)_m} β̂_{V^(j)_m}.

(b) Determine which μ̂^(j)_(m), j = 1, . . . , l, improves the fit maximally. With V_m denoting the corresponding subset one obtains the updates β̂_(m) = β̂_(m−1) + β̂^R_(V_m), μ̂_(m) = X β̂_(m).

With S_0 = X(X^T X + λI_p)^{−1} X^T and S_m = X_{V_m}(X^T_{V_m} X_{V_m} + λI)^{−1} X^T_{V_m}, m = 1, 2, . . . , the iterations are represented as μ̂_(m) = μ̂_(m−1) + S_m(y − μ̂_(m−1)) and therefore μ̂_(m) = H_m y with the hat matrix given by H_m = ∑_{j=0}^{m} S_j ∏_{i=0}^{j−1} (I − S_i).

It is straightforward to derive the algorithm as an example of stagewise functional gradient descent (Friedman, 2001; Bühlmann and Yu, 2003), based on L2 loss. The stopping rule for the algorithm determines the amount of shrinkage that is applied. It may be chosen in the same way as for shrinkage estimators like the Lasso. Bühlmann and Yu (2003) found 5-fold cross validation for estimating the mean squared error to work well. Tutz and Binder (2006) propose the AIC criterion which is less time consuming and may be extended to GLMs. In AIC based stopping one computes AIC = Dev(μ̂_(m)) + 2 df_m, where Dev is the deviance of the model and df_m represents the effective degrees of freedom which are given by the trace of the hat matrix (see Hastie and Tibshirani, 1990). The algorithm stops if AIC increases. Alternatively, the Bayesian information criterion BIC = Dev(μ̂_(m)) + log n · df_m can be used. The performance of models chosen by either AIC or BIC for binary response examples will be evaluated in Section 4.

Simple ridge regression is equivalent to the initialization step with tuning parameter λ chosen data-adaptively, for example by cross-validation. Ridge boosting is based on an iterative fit where in each step a weak learner is applied, i.e. λ is chosen very large. Thus in each step the improvement of the fit is small. The essential tuning parameter is then the number of iterations. In the following, we first consider the simple case with only one candidate set that contains the indices of all variables. In this case one has S_j = S, and we will refer to it as simple or naive ridge boosting.
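To make the procedure concrete, the following sketch implements the componentwise special case (V_c = ∅, V_oj = {j}) for L2 loss with a fixed penalty and an AIC-type stopping rule based on the trace of the hat matrix. It is an illustration written for this text, not the authors' implementation; in particular, the Gaussian working deviance n log(RSS/n) used inside aic() is an assumption.

import numpy as np

def aic(y, mu, df):
    # AIC-type criterion Dev + 2 df with Dev = n log(RSS/n) (Gaussian working model; an assumption)
    n = len(y)
    rss = np.sum((y - mu) ** 2)
    return n * np.log(rss / n) + 2.0 * df

def part_boost_r(X, y, lam=50.0, max_steps=200):
    """Componentwise ridge boosting (PartBoostR with candidate sets {j}), AIC stopping."""
    n, p = X.shape
    I_n = np.eye(n)
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)   # Step 1: full ridge fit
    mu = X @ beta
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)      # S_0, hat matrix of the initial fit
    aic_old = aic(y, mu, np.trace(H))
    for _ in range(max_steps):
        best = None
        for j in range(p):                                       # evaluate candidate sets V_m^(j) = {j}
            x_j = X[:, j]
            b_j = x_j @ (y - mu) / (x_j @ x_j + lam)             # ridge fit to the current residuals
            mu_j = mu + b_j * x_j
            rss_j = np.sum((y - mu_j) ** 2)
            if best is None or rss_j < best[0]:
                best = (rss_j, j, b_j, mu_j)
        _, j, b_j, mu_new = best
        x_j = X[:, j]
        S_m = np.outer(x_j, x_j) / (x_j @ x_j + lam)             # hat matrix of the selected weak learner
        H_new = H + S_m @ (I_n - H)                              # mu_(m) = H_m y
        aic_new = aic(y, mu_new, np.trace(H_new))
        if aic_new > aic_old:                                    # stop when AIC increases
            break
        beta[j] += b_j
        mu, H, aic_old = mu_new, H_new, aic_new
    return beta, mu, np.trace(H)

rng = np.random.default_rng(1)
n, p = 100, 50
X = rng.normal(size=(n, p))
beta_true = np.zeros(p); beta_true[:5] = 1.0
y = X @ beta_true + rng.normal(size=n); y = y - y.mean()
beta_hat, mu_hat, df = part_boost_r(X, y)
print(np.round(beta_hat[:10], 2), round(df, 1))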

2.2. Simple ridge boosting

If all variables are in the (mandatory) candidate set the algorithm takes the simple form:

Algorithm: BoostR

Step 1: Initialization. β̂_(0) = (X^T X + λI_p)^{−1} X^T y, μ̂_(0) = X β̂_(0).

Step 2: Iteration. For m = 1, 2, . . . apply ridge regression to the model for residuals

y − μ̂_(m−1) = Xβ^R + ε,

yielding solutions β̂^R_(m) = (X^T X + λI_p)^{−1} X^T(y − μ̂_(m−1)), μ̂^R_(m) = X β̂^R_(m).

The new estimate is obtained by μ̂_(m) = μ̂_(m−1) + μ̂^R_(m).

As will be shown below, the iterative fit yields a tradeoff between bias and variance which differs from the tradeoff found in simple ridge regression. It is straightforward to derive that after m iterations of BoostR one has μ̂_(m) = X β̂_(m), where β̂_(m) = β̂_(0) + ∑_j β̂^R_(j) shows the evolution of the parameter vector by successive correction of residuals. With B = (X^T X + λI_p)^{−1} X^T and S = XB one obtains, using Proposition 1 from Bühlmann and Yu (2003), the closed form

μ̂_(m) = ∑_{j=0}^{m} S(I_n − S)^j y = (I_n − (I_n − S)^{m+1}) y,  (1)

β̂_(m) = ∑_{j=0}^{m} B(I_n − S)^j y.  (2)
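The closed form (1) is easy to check numerically: iterating the BoostR residual fits and evaluating (I_n − (I_n − S)^{m+1})y give the same fitted values. A sketch with arbitrary simulated data and an arbitrary λ:

import numpy as np

rng = np.random.default_rng(2)
n, p, lam, m_steps = 40, 8, 25.0, 30
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)
y = y - y.mean()

B = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)       # B = (X'X + lambda I_p)^(-1) X'
S = X @ B

beta = B @ y                                              # initialization, beta_(0)
mu = X @ beta
for _ in range(m_steps):                                  # BoostR: refit the ridge estimator to residuals
    beta_r = B @ (y - mu)
    beta = beta + beta_r
    mu = mu + X @ beta_r

I_n = np.eye(n)
mu_closed = (I_n - np.linalg.matrix_power(I_n - S, m_steps + 1)) @ y   # closed form (1)
print(np.allclose(mu, mu_closed))                         # True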


Thus μ̂_(m) = H_m y where H_m = I_n − (I_n − X(X^T X + λI_p)^{−1} X^T)^{m+1}. Some insight into the nature of simple ridge boosting may be obtained by using the singular value decomposition of the design matrix. The singular value decomposition of the (n × p) matrix X with rank(X) = r ≤ p is given by X = UDV^T, where the (n × p) matrix U = (u_1, . . . , u_p) spans the column space of X, U^T U = I_p, and the (p × p) matrix V spans the row space of X, V^T V = V V^T = I_p. D = diag(d_1, . . . , d_r, 0, . . . , 0) is a (p × p) diagonal matrix with entries d_1 ≥ d_2 ≥ · · · ≥ d_r ≥ 0 called the singular values of X. One obtains for λ > 0 the hat matrix H_m = I_n − (I_n − UDV^T(VDU^T UDV^T + λI_p)^{−1} VDU^T)^{m+1} = I_n − (I_n − UD̃U^T)^{m+1}, where D̃ = diag(d̃²_1, . . . , d̃²_p) with d̃²_j = d²_j/(d²_j + λ), j = 1, . . . , r, and d̃²_j = 0, j = r + 1, . . . , p.

Simple derivation shows that

H_m = U(I − (I − D̃)^{m+1}) U^T = ∑_{j=1}^{p} u_j u_j^T (1 − (1 − d̃²_j)^{m+1}).

It is seen from μ̂_(m) = H_m y = ∑_{j=1}^{p} u_j (1 − (1 − d̃²_j)^{m+1}) u_j^T y that the coordinates with respect to the orthonormal basis u_1, . . . , u_p, given by (1 − (1 − d̃²_j)^{m+1}) u_j^T y, are shrunken by the factor (1 − (1 − d̃²_j)^{m+1}). This shrinkage might be compared to shrinkage in common ridge regression. The shrinkage factor in ridge regression with tuning parameter λ is given by d²_j/(d²_j + λ). If λ ≠ 0, usually no values λ, m exist such that d²_j/(d²_j + λ) = 1 − (1 − d²_j/(d²_j + λ))^{m+1} for j = 1, . . . , p. Therefore, simple ridge boosting yields different shrinkage than usual ridge regression.

Basic properties of the ridge boosting estimator may be summarized in the following proposition (for proof see appendix).

Proposition 1. (1) The variance after m iterations is given by

cov(μ̂_(m)) = σ² U(I − (I − D̃)^{m+1})² U^T.

(2) The bias b = E(μ̂_(m) − μ) has the form b = U(I − D̃)^{m+1} U^T μ.

(3) The corresponding mean squared error is given by

MSE(BoostR_λ(m)) = (1/n)(trace(cov(H_m y)) + b^T b)
                 = (1/n) ∑_{j=1}^{p} {σ²(1 − (1 − d̃²_j)^{m+1})² + c_j(1 − d̃²_j)^{2m+2}},

where d̃²_j = d²_j/(d²_j + λ) and c_j = μ^T u_j u_j^T μ = ‖μ^T u_j‖² depends only on the underlying model.

As a consequence, the variance component of the MSE increases exponentially with (1 − (1 − d̃²_j)^{m+1})² whereas the bias component b^T b decreases exponentially with (1 − d̃²_j)^{2m+2}. By rearranging terms one obtains the decomposition

MSE(BoostR_λ(m)) = (1/n) ∑_{j=1}^{p} [σ² + (1 − d̃²_j)^{m+1}{(σ² + c_j)(1 − d̃²_j)^{m+1} − 2σ²}].

It is seen that for large λ (1 − d̃²_j close to 1) the second term is influential. With increasing iteration number m the positive term (σ² + c_j)(1 − d̃²_j)^{m+1} decreases while the negative term −2σ² is independent of m. Then for appropriately chosen m the decrease induced by −2σ² becomes dominating. The following proposition states that the obtained estimator yields a better MSE than the ML estimate (for proof see appendix).

Proposition 2. In the case where the least squares estimate exists (r ≤ p) one obtains:

(a) For m → ∞ the MSE of boosted ridge regression converges to the MSE of the least squares estimator, rσ²/n.

(b) For appropriate choice of the iteration number ridge boosting yields smaller values in MSE than the least squares estimates.


Thus if the least squares estimator exists, simple ridge boosting may be seen as an alternative (although not efficient) way of computing the least squares estimate (m → ∞).

For simple ridge regression the trade-off between variance and bias has a different form. For simple ridge regression with smoothing parameter λ one obtains

MSE(Ridge_λ) = (1/n) ∑_{j=1}^{p} {σ²(d̃²_{j,λ})² + c_j(1 − d̃²_{j,λ})²},

where d̃²_{j,λ} = d²_j/(d²_j + λ). While the trade-off in simple ridge regression is determined by the weights (d̃²_{j,λ})² and (1 − d̃²_{j,λ})² on σ² and c_j, respectively, in simple ridge boosting the corresponding weights are (1 − (1 − d̃²_j)^{m+1})² and (1 − d̃²_j)^{2m+2}, which crucially depend on the iteration number.
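The two trade-offs can be compared numerically: for given squared singular values d²_j, coefficients c_j and error variance σ², the sketch below evaluates the MSE expressions above for simple ridge boosting (fixed large λ, varying m) and for simple ridge regression (varying λ). The numbers are arbitrary and serve only to illustrate the formulas.

import numpy as np

d2 = np.array([25.0, 9.0, 4.0, 1.0, 0.25])     # squared singular values d_j^2 (illustrative)
c = np.array([2.0, 1.0, 0.5, 0.2, 0.1])        # c_j = mu' u_j u_j' mu (illustrative)
sigma2, n = 1.0, 100

def mse_boost(lam, m):
    dt2 = d2 / (d2 + lam)                                   # tilde d_j^2
    w_var = (1.0 - (1.0 - dt2) ** (m + 1)) ** 2             # weight on sigma^2
    w_bias = (1.0 - dt2) ** (2 * m + 2)                     # weight on c_j
    return np.sum(sigma2 * w_var + c * w_bias) / n

def mse_ridge(lam):
    dt2 = d2 / (d2 + lam)
    return np.sum(sigma2 * dt2 ** 2 + c * (1.0 - dt2) ** 2) / n

best_boost = min(mse_boost(100.0, m) for m in range(500))               # weak learner, vary the step number
best_ridge = min(mse_ridge(lam) for lam in np.linspace(0.01, 50.0, 500))
print(round(best_boost, 4), round(best_ridge, 4))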

2.3. Simulations and example

2.3.1. Simulation

First we consider simple ridge boosting. Fig. 2 shows the (empirical) bias and variance for example data generated from model (3), which will be introduced in detail in Section 4. The minimum MSE obtainable is slightly smaller for simple ridge boosting as compared to simple ridge regression, probably resulting from a faster drop-off of the bias component. We found this pattern prevailing in other simulated data examples, but the difference between simple ridge boosting and simple ridge regression usually is not large. Overall the fits obtainable from ridge regression and simple ridge boosting with the penalty parameter and the number of steps chosen appropriately are rather similar. The more interesting case occurs when candidate sets are specified.

Fig. 3 shows MSE (solid lines), bias (dotted lines) and variance (broken lines) for simple and componentwise ridge boosting for the example data generated from model (3). Penalties for both procedures have been chosen such that the bias curves are close in the initial steps. The fit of a linear model that incorporates all variables is indicated by a horizontal line. In the left panel of Fig. 3 there are five covariates (of 50 covariates in total) with true parameters unequal to zero and the correlation between all covariates is zero. It is seen that for data with a small number of uncorrelated informative covariates the componentwise approach results in a much smaller minimum MSE, probably due to a much slower increase of variance. When correlation among all covariates increases, the performance of simple ridge boosting

[Fig. 2 plots bias/variance/MSE against the step number.]

Fig. 2. Empirical bias and variance of ridge regression (thin lines) and simple ridge boosting (thick lines) (MSE: solid lines; bias: dotted lines; variance: broken lines) for example data generated from (3) (see Section 4) with n = 100, p = 50, b = 0.7 and signal-to-noise ratio 1. The horizontal scale of the ridge regression curves is adjusted to have approximately the same degrees of freedom as simple ridge boosting in each step.


[Fig. 3 plots bias/variance/MSE against the step number for two panels.]

Fig. 3. Empirical bias and variance of simple ridge boosting (thick lines) and componentwise ridge boosting (thin lines) (MSE: solid lines; bias: dotted lines; variance: broken lines) for example data generated from (3) with n = 100, p = 50 and uncorrelated (b = 0) (left panel) and correlated (b = 0.7) (right panel) covariates.

[Fig. 4 plots standardized coefficients against |beta|/|least squares beta| (top panels) and against degrees of freedom (bottom panels).]

Fig. 4. Coefficient build-up of simple ridge boosting (left panels), componentwise ridge boosting (center panels) and the Lasso (right panels) for the prostate cancer data plotted against standardized L2-norm of the parameter vector (top panels) and degrees of freedom (bottom panels). Arrows indicate the model chosen by 10-fold cross-validation (repeated 10 times).

comes closer to the componentwise approach (right panel of Fig. 3). This may be due to the coefficient build-up scheme of BoostR (illustrated in Fig. 4) that assigns non-zero parameters to all covariates.

The idea of a stepwise approach to regression where in each step just one predictor is updated is also found in stagewise regression (see e.g. Efron et al., 2004). The main difference is the selection criterion and the update:


the selection criterion in stagewise regression is the correlation between each variable and the current residual and the update is of fixed size ε, whereas in componentwise ridge boosting a penalized model with one predictor is fitted for each variable and any model selection criterion may be used. Since stagewise regression is closely related to the Lasso and least angle regression (Efron et al., 2004), it can be expected that componentwise ridge boosting (but not the more general partial approach!) is also similar to the latter procedures.

2.3.2. Example: prostate cancer

We applied componentwise ridge boosting and simple ridge boosting to the prostate cancer data used by Tibshirani (1996) for illustration of the Lasso. The data with n = 97 observations come from a study by Stamney et al. (1989) that examined the correlation between the (log-)level of prostate specific antigen and eight clinical measures (standardized before model fit): log(cancer volume) (lcavol), log(prostate weight) (lweight), age, log(benign prostatic hyperplasia amount) (lbph), seminal vesical invasion (svi), log(capsular penetration) (lcp), Gleason score (gleason) and percentage Gleason scores 4 or 5 (pgg45). Note that Hastie et al. (2001, Fig. 10.12) applied stagewise regression to the prostate cancer data. Due to the close connection to componentwise ridge boosting noted above the resulting coefficient build-up looks very similar.

Fig. 4 shows the coefficient build-up in the course of the boosting steps for two scalings. In the top panels the values on the abscissa indicate the L2-norm of the parameter vector relative to the least-squares estimate, whereas in the bottom panels the degrees of freedom are given. A dot on the line corresponding to a coefficient indicates an estimate; thus each dot represents one step in the algorithm. While for simple ridge boosting (left panels) each coefficient seems to increase by a small amount in each step, in the componentwise approach (center panels) non-zero coefficients are assigned only to a specific set of variables within one step, while the other coefficients remain zero. Therefore, the coefficient build-up of the componentwise approach looks much more similar to that of the Lasso procedure (right panels) (fitted by the LARS procedure, see Efron et al., 2004). One important difference to the Lasso that can be seen in this example is that for componentwise ridge boosting the size of the coefficient updates varies over boosting steps in a specific pattern, with larger changes in the early steps and only small adjustments in late steps.

Arrows indicate 10-fold cross-validation based on mean squared error which has been repeated over 10 distinct splittings to show variability. It is seen that cross-validation selects a model of rather high complexity for simple ridge boosting. When using the componentwise approach or the Lasso, more parsimonious models are selected. Partial boosting selects even more parsimonious models than the Lasso in terms of degrees of freedom.

3. Ridge boosting in generalized linear models

In GLMs the dominating estimation approach is maximum likelihood which corresponds to the use of L2 loss in the special case of normally distributed responses. Ridge regression in GLMs is therefore based on penalized maximum likelihood. Univariate GLMs are given by μ_i = h(β_0 + x_i^T β̃) = h(z_i^T β) = h(η_i), where μ_i = E(y_i|x_i), h is a given (strictly monotone) response function, and the predictor η_i = z_i^T β has linear form. Moreover, it is assumed that y_i|x_i is from the simple exponential family, including for example binary responses, Poisson distributed responses and gamma distributed responses.

The basic concept in ridge regression is to maximize the penalized log-likelihood

l_p(β) = ∑_{i=1}^{n} l_i − (λ/2) ∑_{j=1}^{p} β_j² = ∑_{i=1}^{n} l_i − (λ/2) β^T P β,

where l_i is the likelihood contribution of the ith observation and P is a block diagonal matrix with the (1 × 1) block 0 and the (p × p) block given by the identity matrix I_p. The corresponding penalized score function is given by

s_p(β) = ∑_{i=1}^{n} z_i (∂h(η_i)/∂η) σ_i^{−2} (y_i − μ_i) − λPβ,

where σ²_i = var(y_i). A simple way of computing the estimator is iterative Fisher scoring, β̂^(k+1) = β̂^(k) + F_p(β̂^(k))^{−1} s_p(β̂^(k)), where F_p(β) = E(−∂²l/∂β ∂β^T) = F(β) + λP, with F(β) denoting the usual Fisher matrix given by F(β) = X^T W(β) X, X = (x_·0, x_·1, . . . , x_·p), x_·0^T = (1, . . . , 1), W(β) = D(β) Σ(β)^{−1} D(β), Σ(β) = diag(σ²_1, . . . , σ²_n), D(β) = diag(∂h(η_1)/∂η, . . . , ∂h(η_n)/∂η).
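For a binary response with the logit link, h(η) = exp(η)/(1 + exp(η)), one has D(β) = W(β) = diag(μ_i(1 − μ_i)) and DΣ^{−1} = I, so the penalized Fisher scoring above takes a particularly simple form. The sketch below (illustrative data, arbitrary λ; not the authors' code) keeps the intercept unpenalized via the block diagonal matrix P.

import numpy as np

def ridge_logistic_fisher_scoring(X, y, lam=1.0, n_iter=25):
    """Penalized Fisher scoring for a ridge-penalized logit model (intercept unpenalized)."""
    n, p = X.shape
    Z = np.column_stack([np.ones(n), X])          # design with intercept column x_.0 = (1, ..., 1)'
    P = np.eye(p + 1)
    P[0, 0] = 0.0                                 # (1 x 1) zero block for the intercept
    beta = np.zeros(p + 1)
    for _ in range(n_iter):
        eta = Z @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))
        w = mu * (1.0 - mu)                       # diagonal of W(beta) for the logit link
        s_p = Z.T @ (y - mu) - lam * P @ beta     # penalized score
        F_p = Z.T @ (Z * w[:, None]) + lam * P    # penalized Fisher matrix F(beta) + lambda P
        beta = beta + np.linalg.solve(F_p, s_p)
    return beta

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
prob = 1.0 / (1.0 + np.exp(-(0.5 + X[:, 0] - X[:, 1])))
y = (rng.uniform(size=200) < prob).astype(float)
print(np.round(ridge_logistic_fisher_scoring(X, y), 2))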

The proposed boosting procedure is likelihood based ridge boosting based on one step of Fisher scoring (for likelihood based boosting see also Tutz and Binder, 2006). The parameter/variable set to be refitted in one iteration now includes


the intercept. Let again V^(j)_m denote subsets of indices of parameters to be considered for refitting in the mth step. Then V^(j)_m = {m_1, . . . , m_l} ⊂ {0, 1, . . . , p}, where 0 denotes the intercept. The general algorithm, including a selection step, is given in the following in matrix form μ = h(η), where μ and η denote vectors in which observations are collected.

Algorithm: GenPartBoostR/GenBoostR

Step 1: Initialization. Fit the model μ_i = h(β_0) by iterative Fisher scoring, obtaining β̂_(0) = (β̂_0, 0, . . . , 0), η̂_(0) = X β̂_(0).

Step 2: Iteration. For m = 1, 2, . . .

(a) Estimation: Estimation for candidate sets V^(j)_m corresponds to fitting of the model

μ = h(η̂_(m−1) + X_{V^(j)_m} β^R_{V^(j)_m}),

where η̂_(m−1) = X β̂_(m−1) and X_{V^(j)_m} contains only the components from V^(j)_m, i.e. it has rows x^T_{i,V^(j)_m} = (x_{i,m^(j)_1}, . . . , x_{i,m^(j)_l}). One step of Fisher scoring is given by β̂^R_{V^(j)_m} = F^{−1}_{p,V^(j)_m} s_{p,V^(j)_m}, where s_{p,V^(j)_m} = X^T_{V^(j)_m} W(η̂_(m−1)) D^{−1}(η̂_(m−1))(y − μ_(m−1)) (the penalty term −λPβ is omitted since it equals 0), F_{p,V^(j)_m} = X^T_{V^(j)_m} W(η̂_(m−1)) X_{V^(j)_m} + λI, with η̂^T_(m−1) = (η̂_{1,(m−1)}, . . . , η̂_{n,(m−1)}), y^T = (y_1, . . . , y_n), μ^T_(m−1) = (μ_{1,(m−1)}, . . . , μ_{n,(m−1)}).

(b) Selection: For candidate sets V^(j)_m, j = 1, . . . , s, the set V_m is selected that improves the fit maximally.

(c) Update: One sets

β̂^R_(m),j = β̂^R_{V_m,j}  if j ∈ V_m,    β̂^R_(m),j = 0  if j ∉ V_m,

β̂_(m) = β̂_(m−1) + β̂^R_(m), η̂_(m) = X β̂_(m) = X β̂_(m−1) + X β̂^R_(m), μ̂_(m) = h(X β̂_(m)), where h is applied componentwise.
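A minimal sketch of the componentwise special case for a binary response (logit link, so W = D = diag(μ_i(1 − μ_i)) and the weighted score reduces to X^T(y − μ)) is given below. Selection is by the binomial deviance, the number of steps is fixed for brevity (AIC, BIC or cross-validation would be used in practice), refitting of the intercept within each step is omitted, and the code is an illustration rather than the authors' implementation.

import numpy as np

def binomial_deviance(y, mu, eps=1e-12):
    mu = np.clip(mu, eps, 1.0 - eps)
    return -2.0 * np.sum(y * np.log(mu) + (1.0 - y) * np.log(1.0 - mu))

def gen_part_boost_r(X, y, lam=100.0, n_steps=50):
    """Generalized componentwise ridge boosting for a binary response (logit link)."""
    n, p = X.shape
    Z = np.column_stack([np.ones(n), X])            # column 0 is the intercept
    beta = np.zeros(p + 1)
    beta[0] = np.log(y.mean() / (1.0 - y.mean()))   # Step 1: intercept-only fit mu_i = h(beta_0)
    eta = Z @ beta
    for _ in range(n_steps):
        mu = 1.0 / (1.0 + np.exp(-eta))
        w = mu * (1.0 - mu)
        best = None
        for j in range(1, p + 1):                   # candidate sets V_m^(j) = {j}
            z_j = Z[:, j]
            b_j = z_j @ (y - mu) / (z_j @ (w * z_j) + lam)   # one penalized Fisher scoring step
            eta_j = eta + b_j * z_j
            dev_j = binomial_deviance(y, 1.0 / (1.0 + np.exp(-eta_j)))
            if best is None or dev_j < best[0]:     # select the set that improves the fit maximally
                best = (dev_j, j, b_j, eta_j)
        _, j, b_j, eta = best
        beta[j] += b_j
    return beta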

Within a GLM framework it is natural to use in the selection step (b) the improvement in deviance. With Dev(η̂) denoting the deviance given predictor values η̂, one selects in the mth step the V_m such that Dev(η̂^(j)) is minimal, where Dev(η̂^(j)) uses the predictor value η̂^(j) = η̂_(m−1) + X_{V^(j)_m} β̂^R_{V^(j)_m}.

Also stopping criteria should be based on the deviance. One option is deviance based cross-validation; an alternative choice is based on information criteria. Two of the latter, AIC and BIC, were already introduced in Section 2. They are based on degrees of freedom which can be obtained from the trace of the hat matrix. In GLMs the hat matrix is not as straightforward as in the simple linear model. The following proposition gives an approximate hat matrix (for proof see Appendix).

Proposition 3. An approximate hat matrix for which μ̂_(m) ≈ H_m y is given by

H_m = ∑_{j=0}^{m} M_j ∏_{i=0}^{j−1} (I − M_i),

where M_m = Σ_m^{1/2} W_m^{1/2} X_{V_m}(X^T_{V_m} W_m X_{V_m} + λI)^{−1} X^T_{V_m} W_m^{1/2} Σ_m^{−1/2}, with W_m = W(η̂_(m−1)), D_m = D(η̂_(m−1)) and Σ_m = Σ(η̂_(m−1)).

4. Empirical comparison

Properties of ridge boosting are investigated by use of simulated data. We generate a covariate matrix X by drawing n observations from a p-dimensional multivariate normal distribution with variance 1 and correlation between two covariates (i.e. columns of X) x_j and x_k being b^{|j−k|}. The true predictor η (i.e. the expected value of the response E(y|x) = μ = h(η), where h is the identity for a continuous response and h(η) = exp(η)/(1 + exp(η)) for binary responses) is formed by

η = X β_true,  (3)

where the true parameter vector β_true = c_stn · (β_1, . . . , β_p)^T is determined by β_j ∼ N(5, 1) for j ∈ V_info, β_j = 0 otherwise, with V_info ⊂ {1, . . . , 10} being the set (of size 5) of the randomly drawn indices of the informative


covariates. The constant c_stn is chosen such that the signal-to-noise ratio for the final response y, drawn from a normal distribution N(μ, 1) or a binomial distribution B(μ, 1), is equal to 1. The signal-to-noise ratio is determined by

signal-to-noise ratio = ∑_{i=1}^{n} (μ_i − μ̄)² / ∑_{i=1}^{n} Var(y_i),

where μ̄ = (1/n) ∑_{i=1}^{n} μ_i. For the examples presented here we used a fixed sample size n = 100 for the training data and a new sample of size n_new = 1000 for the evaluation of prediction performance.

The following comparisons of performance and identification of influential covariates of (generalized) simple ridge boosting ((Gen)BoostR) and (generalized) partial boosting ((Gen)PartBoostR) (more specifically (generalized) componentwise ridge boosting) with other procedures have been done within the statistical environment R 2.1.0 (R Development Core Team, 2005). We used intercept-only (generalized) linear models, the (generalized) Lasso (package “lasso2” 1.2-0; LARS as used in Section 2.3.2 is only available for continuous response data) and the elastic net procedure (Zou and Hastie, 2005) (package “elasticnet” 1.02) for comparison. We evaluate performance for optimal values of the tuning parameters as well as for parameters chosen by 10-fold cross-validation.

For the Lasso only one parameter (the upper bound on the L1 norm of the parameter vector) has to be chosen. We use a line search, which has to be supplied with a fixed (upper) limit for this parameter in some examples because for certain values no solution exists. Zou and Hastie (2005) also note this as a downside of the classical Lasso. The elastic net procedure has two tuning parameters, the number of steps k and the penalty λ. We chose both by evaluating a grid of steps 1, . . . , min(p, n) and penalties (0, 0.01, 0.1, 1, 10, 100) (as suggested by Zou and Hastie, 2005). For each of (Gen)BoostR and (Gen)PartBoostR we used a fixed penalty parameter λ. This parameter has been chosen in advance such that the number of steps chosen by cross-validation typically is between 50 and 200.
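The data-generating design of (3) is straightforward to reproduce. The sketch below, written from the description above (random seed and helper names are arbitrary), draws X with correlation b^{|j−k|}, places 5 informative coefficients among the first 10 covariates and rescales them so that the empirical signal-to-noise ratio equals 1 for the Gaussian case (Var(y_i) = 1).

import numpy as np

def simulate_model3(n=100, p=50, b=0.3, snr=1.0, seed=0):
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = b ** np.abs(idx[:, None] - idx[None, :])      # corr(x_j, x_k) = b^{|j-k|}, unit variances
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = np.zeros(p)
    info = rng.choice(10, size=5, replace=False)          # 5 informative covariates among the first 10
    beta[info] = rng.normal(5.0, 1.0, size=5)
    eta = X @ beta
    # rescale so that sum (eta_i - mean(eta))^2 / sum Var(y_i) = snr with Var(y_i) = 1
    beta *= np.sqrt(snr * n / np.sum((eta - eta.mean()) ** 2))
    eta = X @ beta
    y = eta + rng.normal(size=n)
    return X, y, beta

X, y, beta_true = simulate_model3(n=100, p=50, b=0.7)
print(X.shape, np.count_nonzero(beta_true))               # (100, 50) 5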

4.1. Metric response

4.1.1. Prediction performance

The boxplots in Fig. 5 show the mean squared error for continuous response data generated from (3) with a varying number of predictors and amount of correlation between the covariates for 50 repetitions per example. In Fig. 5 elastic net, the Lasso, simple ridge boosting and componentwise ridge boosting (with an intercept-only linear model as a baseline) are compared for the optimal choice of tuning parameters (in the left part of the panels) as well as for the cross-validation based estimates (in the right part).

When comparing the optimal performance of simple ridge boosting with the componentwise approach, it is seen that the latter distinctly dominates the former for a large number of predictors (with only few of them being informative) (bottom left vs. top left panel) and/or small correlations (top left vs. top right panel). A similar difference in performance is also found when comparing simple ridge boosting to the Lasso and to the elastic net procedure. This highlights the similarity of the componentwise approach to the Lasso-type procedures (as illustrated in Section 2.3.2). When the close connection of simple ridge boosting and simple ridge regression is taken into consideration this replicates the results of Tibshirani (1996) who also found that in sparse scenarios the Lasso performs well and ridge regression performs poorly. Consistently, the performance difference for a smaller number of covariates and high correlation (bottom right panel) is less pronounced. For optimal parameters no large differences between the performance of the componentwise approach, the Lasso and of elastic net are seen. Elastic net seems to have a slightly better performance for examples with higher correlation among the predictors, but this is to be expected because elastic net was specifically developed for such high-correlation scenarios (see Zou and Hastie, 2005). The decrease in performance incurred by selecting the number of steps/Lasso constraint by cross-validation instead of using the optimal values is similar for componentwise ridge boosting and the Lasso. For the elastic net procedure the performance decrease is larger, to such an extent that the performance benefit over the former procedures (with optimal parameters) is lost. This might result from the very small range of the number of steps where good performance can be achieved (due to the small overall number of steps used by elastic net). In contrast boosting procedures change very slowly in the course of the iterative process, which makes selection of appropriate shrinkage more stable (compare Fig. 4).

4.1.2. Identification of influential variables

While prediction performance is an important criterion for comparison of algorithms, the variables included into the final model are also of interest. The final model should be as parsimonious as possible, but all relevant variables should be included. For example, one of the objectives for the development of the elastic net procedure was to retain all important variables for the final model even when they are highly correlated (while the Lasso includes only some


[Fig. 5 shows boxplots of the mean squared error for the panels p = 100, b = 0.3; p = 100, b = 0.7; p = 200, b = 0.3; and p = 50, b = 0.7, with results for intercept only, lasso, elastic net, BoostR and PartBoostR (optimal tuning) and for lasso CV, elastic net CV, BoostR CV and PartBoostR CV (cross-validated tuning).]

Fig. 5. Mean squared error for continuous response data with varying number of predictors p and correlation b for the linear model including only an intercept term, elastic net, the Lasso, simple ridge boosting and componentwise ridge boosting with tuning parameters selected for optimal performance or by cross-validation (CV).

of a group of correlated variables; see Zou and Hastie, 2005). The criteria by which the performance of a procedure in the identification of influential variables can be judged are the hit rate (i.e. the proportion of correctly identified influential variables)

hit rate = (1 / ∑_{j=1}^{p} I(β_true,j ≠ 0)) · ∑_{j=1}^{p} I(β_true,j ≠ 0) · I(β̂_j ≠ 0)

and the false alarm rate (i.e. the proportion of non-influential variables dubbed influential)

false alarm rate = (1 / ∑_{j=1}^{p} I(β_true,j = 0)) · ∑_{j=1}^{p} I(β_true,j = 0) · I(β̂_j ≠ 0),

where β_true,j, j = 1, . . . , p, are the elements of the true parameter vector, β̂_j are the corresponding estimates used by the final model and I(expression) is an indicator function that takes the value 1 if “expression” is true and 0 otherwise.
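In code the two rates are one comparison each; a small sketch with arbitrary example vectors:

import numpy as np

def hit_and_false_alarm_rate(beta_true, beta_hat):
    influential = beta_true != 0
    selected = beta_hat != 0
    hit_rate = np.sum(influential & selected) / np.sum(influential)
    false_alarm_rate = np.sum(~influential & selected) / np.sum(~influential)
    return hit_rate, false_alarm_rate

beta_true = np.array([1.2, 0.0, -0.8, 0.0, 0.0])
beta_hat = np.array([0.9, 0.1, 0.0, 0.0, 0.3])
print(hit_and_false_alarm_rate(beta_true, beta_hat))      # (0.5, 0.666...)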


[Fig. 6 shows hit rate plotted against false alarm rate for the panels p = 200, b = 0.3; p = 100, b = 0.3; p = 100, b = 0.7; and p = 50, b = 0.7.]

Fig. 6. Pseudo-ROC curves for the identification of influential and non-influential covariates with metric response data for componentwise ridge boosting (circles) and elastic net (squares) with varying penalty parameters (and the optimal number of steps) and for the Lasso (triangle). Arrows indicate increasing penalty parameters. The numbers give the mean squared error of prediction for the respective estimates.

Fig. 6 shows the hit rates and false alarm rates for componentwise ridge boosting (circles), elastic net (squares) and the Lasso (triangle) for the data underlying Fig. 5. While the Lasso has only one parameter which is selected for optimal prediction performance, the componentwise approach and elastic net have two parameters, a penalty parameter and the number of steps. For evaluation of prediction performance we used a fixed penalty for the componentwise approach and the optimal penalty (with respect to prediction performance) for the elastic net procedure. For the investigation of their properties with respect to identification of influential variables we vary the penalty parameters (the number of steps still being chosen for optimal performance), thus resulting in a sequence of fits. Plotting the hit rates and false alarm rates of these fits leads to the pseudo-ROC curves shown in Fig. 6. We call them “pseudo” curves since they are not necessarily monotone!

It is seen that for a large number of covariates and a medium level of correlation (bottom left) the componentwise approach comes close to dominating the other procedures. While for a smaller number of variables and medium level of correlation (top left) higher hit rates can be achieved by using the elastic net procedure, the componentwise approach still is the only procedure which allows for a trade-off of hit rate and false alarm rate (and therefore selection of small false alarm rates) by varying the penalty parameter. Doing this for the elastic net, the false alarm rate hardly changes.


Table 1
Mean deviance (of prediction) for binary response data with varying number of predictors p and correlation b for a generalized linear model including only an intercept term (base), generalized Lasso (with optimal constraint and constraint selected by cross-validation), generalized simple ridge boosting (GenBoostR) and generalized componentwise ridge boosting (GenPartBoostR) (with optimal number of steps and number of steps selected by AIC, BIC or cross-validation (CV))

p     b     Base     Lasso              GenBoostR                    GenPartBoostR
                     opt     CV         opt     AIC     CV           opt     AIC          BIC     CV
10    0     1.397    0.875   0.896      0.885   0.900   0.896        0.873   0.904 (4)    0.899   0.893
10    0.3   1.400    0.873   0.887      0.874   0.897   0.886        0.871   0.894 (2)    0.900   0.886
10    0.7   1.390    0.845   0.856      0.834   0.849   0.844        0.844   0.858 (0)    0.858   0.854
50    0     1.398    0.935   0.963      1.098   1.371   1.143        0.931   1.058 (7)    0.953   0.969
50    0.3   1.391    0.931   0.998      1.035   1.318   1.068        0.921   1.093 (6)    0.936   0.985
50    0.7   1.395    0.888   0.910      0.924   1.034   0.937        0.883   1.051 (11)   0.891   0.905
100   0     1.395    1.019   1.042      1.241   1.281   1.265        1.014   1.296 (11)   1.029   1.064
100   0.3   1.394    0.991   1.011      1.159   1.234   1.187        0.981   1.256 (14)   0.991   1.014
100   0.7   1.395    0.925   0.962      0.996   1.052   1.057        0.918   1.209 (16)   0.936   0.986
200   0     1.395    1.075   1.110      1.307   1.350   1.330        1.071   1.332 (1)    1.095   1.119
200   0.3   1.396    1.010   1.049      1.261   1.327   1.283        1.002   1.232 (4)    1.013   1.051
200   0.7   1.402    0.950   0.972      1.113   1.164   1.141        0.944   1.278 (12)   0.963   0.976

The best result obtainable by data selected tuning parameters is marked in boldface for each example.

For the examples with high correlations between covariates (right panels) there is a clear advantage for the elastic net procedure. With penalty parameters going toward zero elastic net comes close to a Lasso solution (as it should, based on its theory). This is also where the elastic net pseudo-ROC curve (for small penalties) and the curve of the componentwise approach (for large penalties) coincide. The differences seen between elastic net/componentwise solutions with small/large penalties and the Lasso solution might result from the different type of tuning parameter (number of steps vs. constraint on the parameter vector).

4.2. Binary response

In order to evaluate generalized simple ridge boosting and generalized componentwise ridge boosting for the non-metric response case we compare them to a generalized variant of the Lasso, which is obtained by using iteratively reweighted least squares with a pseudo-response where weighted least-squares estimation is replaced by a weighted Lasso estimate (see documentation of the “lasso2” package, Lokhorst, 1999). The Lasso constraint parameter and the number of steps for ridge boosting are determined by cross-validation and we compare the resulting performance to that obtained with optimal parameter values. For the binary response examples given in the following the AIC and the BIC (see Section 3) are available as additional criteria for generalized ridge boosting.

Table 1 gives the mean deviance for binary response data generated from (3) with n = 100 and a varying number of variables and correlation for 20 repetitions per example. When using the AIC as a stopping criterion, in several instances (number indicated in parentheses) no minimum within the range of 500 boosting steps could be found and so effectively the maximum number is used. It is seen that for a small number of variables and high correlation between covariates (p = 10, b = 0.7) (generalized) simple ridge boosting and componentwise ridge boosting perform similarly; for all other examples the componentwise approach is ahead. This parallels the findings from the continuous response examples. While it seems that there is a slight advantage of the componentwise approach over the Lasso, their performance (with optimal parameters as well as with parameters selected by cross-validation) is very similar. Selection of the number of boosting steps by AIC seems to work well only for a small number of variables, as can be seen e.g. when comparing to the cross-validation results for the componentwise approach. For a larger number of covariates the use of AIC for the selection of the number of boosting steps seems to be less efficient. In contrast, BIC seems to perform quite well for a moderate to large number of covariates while being suboptimal for a smaller number of predictors.


The investigation of hit/false alarm rates yields results (not given) which are similar to the continuous response examples. The trade-off between hit rate and false alarm rate by selection of the penalty parameter seems to be feasible.

5. Application

We illustrate the application of simple ridge boosting and partial boosting with real data from 344 admissions at a psychiatric hospital. The (binary) response variable to be investigated is whether treatment is aborted by the patient against physicians’ advice (about 55% for this group of patients). From a total of 101 variables available, nine variables likely to be relevant were identified: age, number of previous admissions (“noupto”), cumulative length of stay (“losupto”) and the 0/1-variables indicating a previous ambulatory treatment (“prevamb”), no second diagnosis (“nosec”), second diagnosis “personality disorder” (“persdis”), somatic problems (“somprobl”), homelessness (“homeless”) and joblessness (“jobless”). Based on subject matter considerations the two variables which relate to the secondary diagnosis (“nosec” and “persdis”) are mandatory members of the response set and no penalty is applied to their estimates. This illustrates the effect of augmenting an unpenalized model with few mandatory variables with optional predictors. The top right panel of Fig. 1 shows the coefficient build-up in the course of the boosting steps for partial boosting, contrasted with simple ridge boosting (top left) and the Lasso (bottom panel). The arrows indicate the number of steps chosen by AIC (for partial boosting and simple ridge boosting) and 10-fold cross-validation (for the Lasso; repeated 10 times). It can be seen that while for the optional predictors the partial approach results in a build-up scheme similar to the Lasso, the mandatory components introduce a very different structure. One interesting feature is the slow decrease of the estimate for “persdis” beginning with boosting step 8. This indicates some overshooting of the initial estimate that is corrected when additional predictors are included.

To investigate identification of relevant variables we used all 101 predictors and divided the data into a training set of size 270 and a test set of size 74. The Lasso (with cross-validation) returned 13 predictors with a prediction error of 0.392. Partial boosting (using six mandatory response set elements relating to the secondary diagnosis) with penalty varying from 500 to 10 000 returned 15–19 predictors and prediction error between 0.378 and 0.392.

6. Concluding remarks

We investigated the use of boosting for the fitting of linear models. The simplest estimate we considered was the boosted ridge estimator. Although the resulting simple ridge boosting estimator yields shrinkage that differs from the simple ridge estimator, in practice the performance does not differ strongly. Strongly improved estimates are obtained for partial boosting, in particular for small correlation between covariates. The performance of partial boosting turns out to be similar to that of stagewise regression procedures, which may be seen as competitors, in particular for large numbers of covariates. In empirical comparisons it is seen that the componentwise approach is competitive with the Lasso and the elastic net procedures in terms of prediction performance as well as with respect to the identification of influential covariates. The elastic net procedure (not surprisingly) performs best in situations with highly correlated predictors. For examples with less correlation componentwise ridge boosting is competitive not only in terms of prediction performance but also with respect to covariate identification. This seems to hold not only for the continuous response case but also for binary responses. In addition to its competitive prediction performance the proposed partial boosting procedure allows for penalized estimation of mandatory variables. In contrast, for procedures such as the Lasso the only way to guarantee that variables get included into a model is to estimate the corresponding parameters without penalization. A further feature of the componentwise approach is that the trade-off between hit rate and false alarm rate can be controlled over a broad range by selecting the penalty parameter. This is a distinct advantage over the other procedures, especially in situations where a small false alarm rate is wanted (e.g. for obtaining parsimonious models).

Acknowledgments

We gratefully acknowledge support from Deutsche Forschungsgemeinschaft (Project C4, SFB 386 Statistical Analysis of Discrete Structures).


Appendix

Proof of Proposition 1. One obtains cov(μ̂_(m)) = σ² U(I_n − (I_n − D̃)^{m+1})² U^T immediately from μ̂_(m) = H_m y. The bias is obtained from b = E(μ − μ̂_(m)) = (I_n − H_m)μ = (I_n − UD̃U^T)^{m+1} μ = U(I_n − D̃)^{m+1} U^T μ. Therefore one obtains:

b^T b = μ^T U(I_n − D̃)^{m+1} U^T U(I_n − D̃)^{m+1} U^T μ
      = μ^T U diag((1 − d̃²_1)^{2m+2}, . . . , (1 − d̃²_p)^{2m+2}) U^T μ,

trace cov(H_m y) = trace(σ² U(I_n − (I_n − D̃)^{m+1})² U^T)
      = ∑_{j=1}^{r} σ² (1 − (1 − d²_j/(d²_j + λ))^{m+1})²,

MSE(BoostR(m)) = (1/n)(trace cov(H_m y) + b^T b)
      = (1/n) (∑_{j=1}^{r} σ²(1 − (1 − d̃²_j)^{m+1})² + μ^T U diag((1 − d̃²_1)^{2m+2}, . . . , (1 − d̃²_p)^{2m+2}) U^T μ). □

Proof of Proposition 2. The MSE of simple ridge boosting (λ > 0) is given by

MSE(BoostR_λ(m)) = (1/n) ∑_{j=1}^{p} {σ²(1 − (1 − d̃²_j)^{m+1})² + c_j(1 − d̃²_j)^{2m+2}},

where d̃²_j = d²_j/(d²_j + λ) and c_j = μ^T u_j u_j^T μ = ‖μ^T u_j‖² depends only on the underlying model. The MSE of the least squares estimate is given by MSE(ML) = (r/n)σ².

Since d̃²_j = c_j = 0 for j > r and 0 < d̃²_j ≤ 1 for j ≤ r, (a) follows immediately. In addition one obtains MSE(BoostR_λ(m)) ≤ MSE(ML) if

∑_{j=1}^{r} (1 − (1 − d̃²_j)^{m+1})² + (c_j/σ²)(1 − d̃²_j)^{2m+2} ≤ r,

which is equivalent to

∑_{j=1}^{r} 2(1 − d̃²_j)^{m+1} − (1 − d̃²_j)^{2(m+1)}(c_j/σ² + 1) ≥ 0.

It is enough to find λ, m such that ∑_{j=1}^{r} (1 − d̃²_j)^{m+1}(2 − (1 − d̃²_j)^{m+1}(c_j/σ² + 1)) ≥ 0. Since λ > 0 one can choose m so large that (1 − d̃²_j)^{m+1} becomes arbitrarily small. Therefore (b) holds. □

Proof of Proposition 3. In the mth iteration (after V_m has been selected) the update is given by β̂^R_{V_m} = F^{−1}_{p,V_m} s_{p,V_m} = (X^T_{V_m} W_m X_{V_m} + λI_p)^{−1} X^T_{V_m} W_m D^{−1}_m (y − μ̂_(m−1)), where W_m = W(η̂_(m−1)), D_m = D(η̂_(m−1)) are evaluated at η̂_(m−1). One has η̂_(m) − η̂_(m−1) = X β̂_(m−1) + X_{V_m} β̂_{V_m} − X β̂_(m−1) = X_{V_m} β̂_{V_m} = X_{V_m}(X^T_{V_m} W_m X_{V_m} + λI_p)^{−1} X^T_{V_m} W_m D^{−1}_m (y − μ̂_(m−1)).

By using a first order Taylor approximation h(η̂) ≈ h(η) + (∂h(η)/∂η^T)(η̂ − η) one obtains μ̂_(m) ≈ μ̂_(m−1) + D_m(η̂_(m) − η̂_(m−1)) = μ̂_(m−1) + D_m X_{V_m} β̂^R_{V_m} = μ̂_(m−1) + D_m X_{V_m}(X^T_{V_m} W_m X_{V_m} + λI_p)^{−1} X^T_{V_m} W_m D^{−1}_m (y − μ̂_(m−1)) and therefore D^{−1}_m (μ̂_(m) − μ̂_(m−1)) ≈ X_{V_m}(X^T_{V_m} W_m X_{V_m} + λI_p)^{−1} X^T_{V_m} W_m D^{−1}_m (y − μ̂_(m−1)).

Multiplication with W^{1/2}_m and using W^{1/2}_m D^{−1}_m = Σ^{−1/2}_m = diag(σ^{−1}_{(m−1),1}, . . . , σ^{−1}_{(m−1),n}) yields Σ^{−1/2}_m (μ̂_(m) − μ̂_(m−1)) ≈ H̃_m Σ^{−1/2}_m (y − μ̂_(m−1)), where H̃_m = W^{1/2}_m X_{V_m}(X^T_{V_m} W_m X_{V_m} + λI_p)^{−1} X^T_{V_m} W^{1/2}_m. Defining M_m = Σ^{1/2}_m H̃_m Σ^{−1/2}_m yields the approximation μ̂_(m) ≈ μ̂_(m−1) + M_m(y − μ̂_(m−1)) = μ̂_(m−1) + M_m((y − μ̂_(m−2)) − (μ̂_(m−1) − μ̂_(m−2))) ≈ μ̂_(m−1) + M_m((y − μ̂_(m−2)) − M_{m−1}(y − μ̂_(m−2))) = μ̂_(m−1) + M_m(I_n − M_{m−1})(y − μ̂_(m−2)). With starting value μ̂_(0) = M_0 y one obtains μ̂_(1) ≈ μ̂_(0) + M_1(y − μ̂_(0)) = M_0 y + M_1(I_n − M_0)y and more generally μ̂_(m) ≈ H_m y where H_m = ∑_{j=0}^{m} M_j ∏_{i=0}^{j−1} (I_n − M_i). □

References

Breiman, L., 1999. Prediction games and arcing algorithms. Neural Comput. 11, 1493–1517.
Bühlmann, P., 2006. Boosting for high-dimensional linear models. Ann. Statist. 34, 559–583.
Bühlmann, P., Yu, B., 2003. Boosting with the L2 loss: regression and classification. J. Amer. Statist. Assoc. 98, 324–339.
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., 2004. Least angle regression. Ann. Statist. 32, 407–499.
Frank, I.E., Friedman, J.H., 1993. A statistical view of some chemometrics regression tools. Technometrics 35, 109–135.
Friedman, J.H., 2001. Greedy function approximation: a gradient boosting machine. Ann. Statist. 29, 1189–1232.
Friedman, J.H., Hastie, T., Tibshirani, R., 2000. Additive logistic regression: a statistical view of boosting. Ann. Statist. 28, 337–407.
Fu, W.J., 1998. Penalized regressions: the bridge versus the lasso. J. Comput. Graphical Statist. 7, 397–416.
Hastie, T., Tibshirani, R., Friedman, J., 2001. The Elements of Statistical Learning. Springer, New York.
Hastie, T.J., Tibshirani, R.J., 1990. Generalized Additive Models. Chapman & Hall, London.
Hoerl, A.E., Kennard, R.W., 1970a. Ridge regression: applications to nonorthogonal problems. Technometrics 12, 69–82.
Hoerl, A.E., Kennard, R.W., 1970b. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55–67.
Le Cessie, S., van Houwelingen, J.C., 1992. Ridge estimators in logistic regression. Appl. Statist. 41, 191–201.
Lokhorst, J., 1999. The LASSO and generalised linear models. Honors project, Department of Statistics, University of Adelaide.
R Development Core Team, 2005. R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0. URL http://www.R-project.org.
Schapire, R.E., 1990. The strength of weak learnability. Mach. Learn. 5, 197–227.
Seber, G.A.F., 1977. Linear Regression Analysis. Wiley, New York.
Stamney, T., Kabalin, J., McNeal, J., Johnstone, I., Freiha, F., Redwine, E., Yang, N., 1989. Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate, II: radical prostatectomy treated patients. J. Urology 16, 1076–1083.
Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. B 58, 267–288.
Tutz, G., Binder, H., 2006. Generalized additive modelling with implicit variable selection by likelihood based boosting. Biometrics 62, 961–971.
Zou, H., Hastie, T., 2005. Regularization and variable selection via the elastic net. J. Roy. Statist. Soc. B 67, 301–320.