CROSS-VALIDATION AND REGRESSION ANALYSIS IN
HIGH-DIMENSIONAL SPARSE LINEAR MODELS
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF STATISTICS
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Feng Zhang
Aug 2011
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/yw320tk7289
© 2011 by Feng Zhang. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Tze Lai, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Balakanapathy Rajaratnam
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Nancy Zhang
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
Abstract
Modern scientific research often involves experiments with at most hundreds of sub-
jects but with tens of thousands of variables for every subject. The challenge of
high dimensionality has reshaped statistical thinking and modeling. Variable selec-
tion plays a pivotal role in high-dimensional data analysis, and the combination
of sparsity and accuracy is crucial for statistical theory and practical applications.
Regularization methods are attractive for tackling these sparsity and accuracy issues.
The first part of this thesis studies two regularization methods. First, we consider
the orthogonal greedy algorithm (OGA) used in conjunction with a high-dimensional
information criterion introduced by Ing & Lai (2011). Although it has been shown to
have excellent performance for weakly sparse regression models, one does not know a
priori in practice that the actual model is weakly sparse, and we address this problem
by developing a new cross-validation approach. OGA can be viewed as L0 regular-
ization for weakly sparse regression models. When such sparsity fails, as revealed by
the cross-validation analysis, we propose to use a new way to combine L1 and L2
penalties, which we show to have important advantages over previous regularization
methods.
The second part of the thesis develops a Monte Carlo Cross-Validation (MCCV)
method to estimate the distribution of out-of-sample prediction errors when a training
sample is used to build a regression model for prediction. Asymptotic theory and
simulation studies show that the proposed MCCV method mimics the actual (but
unknown) prediction error distribution even when the number of regressors exceeds
the sample size. Therefore MCCV provides a useful tool for comparing the predictive
performance of different regularization methods for real (rather than simulated) data
sets.
Acknowledgements
First, I would like to express my deepest gratitude to my advisor, Professor Tze
Leung Lai. He not only suggested the research topic, but also helped me in structuring,
modifying and enriching the contents, which have added enormous value to the quality
of this work. It has been my honor to have had the opportunity to work with him for
four years at Stanford, and I have benefited greatly from his deep understanding
of the field, his integral view on research, his patience to listen to my ideas and answer
my questions, and his encouragement. I would not be able to complete this dissertation
without his immense support and invaluable advice throughout my doctoral study.
I would like to thank Professor Jerry Friedman for giving me valuable suggestions
both on my research and study. I am also deeply grateful to him and to Professor
Guenther Walther and Professor Chiara Sabatti for serving on my oral examination
committee, and to Professor Nancy Zhang and Professor Bala Rajaratnam for serving
on my reading committee in addition to the oral committee, for their thoughtful
suggestions and constructive feedback on my research. I am much indebted to Tong Xia
who helped me to proofread my thesis. I also want to thank Professor Ching-Kang
Ing of Academia Sinica for sharing with me his ongoing research and deep insights in
the subject of model selection.
I am grateful to many friends at Stanford, for their invaluable friendship and enor-
mous support. Among them are Li Ma, Waiwai Liu, Ya Xu, Shaojie Deng, Camilo
Rivera, Justin Dyer, Murat Ahmed, Genevera Allen, Patrick Perry, Victor Hu, Kshi-
tij Khare, Zehao Chen, Yiyuan She, Ling Chen, Kevin Sun, Zongming Ma, Yifang
Chen, Paul Pong, Anwei Chai, Lei Zhao, Xiaoye Jiang, Xin Zhou, Yanxin Shi and
Hongsong Yuan. This is, however, far from a complete list of names. Their stimulating
conversations, constant encouragement and great company have made my last four
years full of joy and happiness, and our friendship will be one of the best treasures
in my life. I would also like to thank the staff in the Department of Statistics, who have
made Sequoia Hall the best place to work. Thank you all.
Last but not least, I would like to thank my wife, Yan Zhai, and my parents. Their
enduring love and support are much more than I could ever be able to acknowledge.
I would like to dedicate this piece of work to them.
Contents
Abstract iii
Acknowledgements v
1 Introduction 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Regularization in High-dimensional Regression 4
2.1 Regularization via Penalized Least Squares . . . . . . . . . . . . . . . 5
2.1.1 L1 regularization: Lasso . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 L2 Regularization: Ridge Regression . . . . . . . . . . . . . . . 8
2.1.3 Lq Regularization: Bridge Regression . . . . . . . . . . . . . . 11
2.1.4 Refinements: Adaptive Lasso & SCAD . . . . . . . . . . . . . 11
2.2 Elastic Net and a New Approach . . . . . . . . . . . . . . . . . . . . 16
2.2.1 Implementation of Max1,2 Regularization . . . . . . . . . . . . . 21
2.3 L0-Regularization: Orthogonal Greedy Algorithm . . . . . . . . . . . 27
2.3.1 OGA and Gradient Boosting . . . . . . . . . . . . . . . . . . . 27
2.3.2 OGA+HDIC . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.3.3 Variable Selection Consistency under Strong Sparsity . . . . . 33
3 Monte Carlo Cross-Validation 38
3.1 Overview of Cross-validation . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 MCCV Estimate of $F_{n_t}^{(M)}$ . . . . . . . . . . . . . . . . . . . . . 43
3.3 Choice of the Training Sample Size $n_t$ . . . . . . . . . . . . . . . . 43
3.4 Asymptotic Theory of MCCV . . . . . . . . . . . . . . . . . . . . . . 45
3.5 Comparing the Prediction Performance of Two Methods . . . . . . . 48
4 Simulation Studies 49
4.1 Strongly Sparse Scenario . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.1.1 Sure Screening and Comparison with Other Methods . . . . . 51
4.1.2 MCCV Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2 Weakly Sparse Scenario . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3 Scenario without Weak Sparsity . . . . . . . . . . . . . . . . . . . . . 67
4.3.1 MSPE performance . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3.2 MCCV Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5 Conclusion 70
Bibliography 72
List of Tables
4.1 Sure screening results for Example 4.1 . . . . . . . . . . . . . . . . . 53
4.2 MSPE results for Example 4.1, and frequency, in 1000 simulations,
of including all 9 relevant variables (Correct), of selecting exactly the
relevant variables (E), of selecting all relevant variables and i irrelevant
variables (E+i). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3 5-number summary and mean of $F_n^{(M)}$ with n = 500 and 450 in Example 4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4 5-number summary together with mean for MCCV estimates of $F_{450}^{(M)}$ in Example 4.1 . . . . . . . . . . . . . . . . . . . . . . . . 57
4.5 MSPEs of different methods in Example 4.2 . . . . . . . . . . . . . . 59
4.6 5-number summary and mean of $F_n^{(M)}$ with n = 500 and 450 in Example 4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.7 5-number summary for MCCV estimates of $F_{450}^{(M)}$ in Example 4.2 . . 61
4.8 5-number summary for the distribution of squared prediction error differences between OGA+HDAIC and Lasso, and for its MCCV estimate . . . 63
4.9 MSPE of different methods in Example 4.3 . . . . . . . . . . . . . . . 65
4.10 5-number summary and mean for $F_n^{(M)}$ in Example 4.4 . . . . . . . 68
4.11 5-number summary and mean for MCCV estimates of $F_{360}^{(M)}$ in Example 4.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
List of Figures
2.1 Contour for different penalties, left to right, L1, L2, and Max1,2 with
ρ = 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Shrinkage effects of different penalties . . . . . . . . . . . . . . . . . . 20
4.1 MCCV performance for Example 4.1; the solid line in the center
of each box shows the corresponding median. . . . . . . . . . . . . . 56
4.2 Boxplots of MCCV performance for Example 4.2 . . . . . . . . . . . 61
4.3 Distribution of SPEOGA+HDAIC − SPELasso for Example 4.2 . . . . 62
4.4 MCCV estimates based on two simulated data sets in Example 4.2 . 63
4.5 Boxplot of MCCV performance for Example 4.3 . . . . . . . . . . . 66
4.6 Boxplot of MCCV performance for Example 4.4 . . . . . . . . . . . 69
Chapter 1
Introduction
1.1 Introduction
High-dimensional statistical learning problems arise from diverse fields of modern
scientific research and various engineering areas. With recent advances in processing
power, storage capacity, and cloud computing technology, massive amounts of data can
be collected at relatively low cost, and plenty of examples of such high-dimensional
data sets can be found at the frontiers of scientific research, such as microarrays in
computational biology, longitudinal data in economics, and high-frequency data in
financial markets. Meanwhile, the methodology for high-dimensional data analysis
has become increasingly important and also become one of the most active areas in
statistical research.
In traditional statistical theory, it is assumed that the number of observations
n is much larger than the number of variables or parameters, so that large-sample
asymptotic theory can be used to derive procedures and analyze their statistical
accuracy and interpretability. For high-dimensional data, this assumption is violated
as the number of variables exceeds the number of observations. Analysis of these data
has reshaped statistical thinking.
Variable selection is an important idea in statistical learning with high-dimensional
data. In many applications, the response variable of interest is related to only a rela-
tively small number of predictors from a large pool of possible candidates. For exam-
ple, in gene expression studies using microarrays, the expressions of tens of thousands
of molecules are potential predictors, but only a few of them are important variables
that are truly related to the disease. How to identify the “sparse” set of relevant variables
in high-dimensional settings has become a fundamental challenge.
Regularization methods are attractive in high-dimensional modeling and predic-
tive learning, and there is a rapidly growing literature devoted to sparse
variable selection by regularization methods under different assumptions. It will be
shown in Chapter 2 and Chapter 4 that different regularization methods have advan-
tages in different situations. Thus it is a crucial challenge to determine whether the
underlying assumptions hold for a particular regularization method for the problem
at hand, for which one typically does not know the actual data generating mechanism.
1.2 Outline
This thesis addresses the challenges mentioned above in the context of regression
on high-dimensional input vectors, and is organized as follows.
In the first part of Chapter 2, we give an overview of different regularization meth-
ods, including Lasso (L1), Ridge Regression (L2), Bridge Regression (Lq), and their
refinements like Adaptive Lasso and SCAD. In the second part of Chapter 2, inspired
by the Elastic Net, we propose a new approach that takes advantage of both L1 and L2
regularization (called Max1,2), and give both an efficient exact solution algorithm and
a fast pathwise approximation algorithm. In the third part of Chapter 2, we consider
a particular method to implement L0 regularization (the Orthogonal Greedy
Algorithm of Ing & Lai (2011)) and characterize its statistical properties in weakly
sparse regression models. Thus, we have added two new methods (OGA and Max1,2)
to our arsenal of regularization methods.
Since we usually do not know the underlying sparsity condition for a particular
application in practice, we have to choose a good candidate from our arsenal of meth-
ods. In Chapter 3, we introduce a Monte Carlo Cross-Validation (MCCV) method
to compare different procedures. We also establish attractive asymptotic theory for
the MCCV method when p ≫ n.
In Chapter 4, we report simulation studies of different regularization methods
and also of MCCV to support the theoretical analysis in Chapter 2 and Chapter 3.
Chapter 5 gives further discussions and concluding remarks.
Chapter 2
Regularization in High-dimensional
Regression
We begin this chapter with a brief review, in Section 2.1, of existing regularization
methods including Lasso, Ridge Regression, Bridge Regression and their refinements.
In Section 2.2, we introduce a new regularization method to combine L1 and L2
penalties, and provide fast algorithms for both exact solution and an approximate
pathwise solution. In Section 2.3, we consider the Orthogonal Greedy Algorithm
(OGA) and a high-dimensional information criterion recently proposed by Ing & Lai
(2011) for L0 regularization; L0 uses the number of nonzero estimated regression
coefficients to penalize ordinary least squares. Throughout this chapter, we consider
the linear regression model:
$$y_t = \alpha + \sum_{j=1}^{p} \beta_j x_{tj} + \varepsilon_t, \qquad t = 1, 2, \cdots, n, \qquad (2.1)$$
with p predictor variables (xt1, xt2, · · · , xtp) that are uncorrelated with the mean-zero
random disturbances εt.
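Model (2.1) with a sparse coefficient vector is easy to simulate, which is how scenarios like those of Chapter 4 are built. The sketch below is purely illustrative (the dimensions, seed, and coefficient values are our own choices, not taken from the text):

```python
import random

def simulate_sparse_model(n=100, p=500, n_relevant=5, seed=0):
    """Draw (X, y) from y_t = alpha + sum_j beta_j * x_tj + eps_t,
    where only n_relevant of the p coefficients are nonzero."""
    rng = random.Random(seed)
    alpha = 1.0
    beta = [2.0] * n_relevant + [0.0] * (p - n_relevant)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [alpha + sum(b * x for b, x in zip(beta, row)) + rng.gauss(0.0, 1.0)
         for row in X]
    return X, y, beta

X, y, beta = simulate_sparse_model()
print(len(X), len(X[0]), sum(b != 0.0 for b in beta))  # -> 100 500 5
```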
2.1 Regularization via Penalized Least Squares
2.1.1 L1 regularization: Lasso
Lasso
Tibshirani (1996) proposed the Lasso method, which uses an L1 penalty in place of the L0
penalty. Under the ordinary regression setting, the objective function for Lasso is
$$-\ell_n(\beta) + \lambda P_1(\beta) = \sum_{t=1}^{n}\Big(y_t - \sum_{j=1}^{p}\beta_j x_{tj}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|. \qquad (2.2)$$
As its name “Least Absolute Shrinkage and Selection Operator” signifies, the objec-
tive of Lasso (the abbreviated name) is to retain good features of both subset selection
which uses L0-penalty and ridge regression which uses L2-penalty by shrinking some
coefficients while setting others to 0, and thereby to produce a model with inter-
pretability similar to that produced by subset selection and with stability similar to
that produced by ridge regression. The Lasso penalty uses the smallest q for which
the Lq-penalty is convex, so the optimization problem can be solved by convex op-
timization techniques. There is also theoretical support for the method by Donoho &
Johnstone (1994), Donoho & Elad (2003), Candes & Tao (2007) and Bickel, Ritov
& Tsybakov (2009).
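Under an orthonormal design ($\mathbf{X}^\top\mathbf{X} = \mathbf{I}$), the minimizer of (2.2) is obtained coordinatewise by soft-thresholding the OLS estimates at $\lambda/2$, a standard fact that makes the "shrink some, zero others" behavior concrete. A minimal sketch (the numerical values are illustrative):

```python
def soft_threshold(b_ols, lam):
    """Coordinatewise Lasso solution under an orthonormal design:
    sgn(b)(|b| - lam/2)_+ minimizes (b - beta)^2 + lam * |beta|."""
    t = abs(b_ols) - lam / 2.0
    if t <= 0.0:
        return 0.0
    return t if b_ols > 0 else -t

beta_ols = [3.0, -0.4, 1.2, 0.1]
print([soft_threshold(b, lam=1.0) for b in beta_ols])
# small coefficients are set exactly to 0, large ones are shrunk by lam/2
```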
However, Lasso shrinkage introduces bias for the estimates of the non-zero coeffi-
cients. Concerning the shrinkage of Lasso as a tool for variable selection, there is an
extensive literature on variable selection consistency of Lasso. Leng, Lin & Wahba
(2006) have shown that Lasso is not variable-selection consistent when prediction ac-
curacy is used as the criterion for choosing the penalty λ in (2.2). Zhao & Yu (2006)
pointed out “obtaining (sparse) models through classical model selection methods
usually involves heavy combinatorial search”, and Lasso “provides a computationally
feasible way for model selection”. Noting that Lasso can fail to distinguish irrelevant
predictors that are highly correlated with relevant predictors, they proposed a “strong
irrepresentable condition” that is sufficient for Lasso to select the true model both in
the classical fixed p setting and in the large p setting as the sample size n gets large,
i.e., pn = O(nκ) for some κ > 0. This work is closely related to that of Donoho,
Elad & Temlyakov (2006) where they proved that under a “coherence condition”, the
Lasso solution identifies the correct predictors in a sparse model with high probabil-
ity. Under a “sparse Riesz condition”, Zhang & Huang (2008) have studied sparsity
and bias properties of Lasso-based model selection methods.
LARS
In statistical applications, one often wants to compute the entire solution path for
tuning purposes, not just one solution for a single λ value. This has led to algorithms
to approximate the whole Lasso path along with the shrinkage parameter λ, including
LARS and the coordinate descent algorithm.
Least Angle Regression (LARS) was introduced by Efron et al. (2004) as a fast
algorithm for variable selection, a simple modification of which can produce the entire
Lasso solution path {β(λ) : λ > 0} that optimizes (2.2). By exploiting the piecewise
linear nature of the Lasso solution path in λ, LARS uses an iterative stepwise strategy,
and at each step enters only as much of a predictor as it deserves. At the first step
it identifies the variable most correlated with the response. Rather than fitting this
variable completely, LARS moves the coefficient of this variable continuously toward
its least squares value (causing its correlation with the evolving residual to decrease
in absolute value). As soon as another variable catches up in terms of correlation with
the residual, this iteration stops and the second variable then enters the active set.
Their coefficients are moved together in a way that keeps their correlations with the
residual equal and decreasing. This process is continued until all the variables are in
the model. See Algorithm 1 for more details, and Algorithm 2 for a modification to
get Lasso path.
Algorithm 1 LARS Algorithm
Step 1 Standardize the predictors to have mean zero and unit norm. Start with the residual r = y − ȳ, and β1 = 0, · · · , βp = 0,
Step 2 Find the predictor xj most correlated with r,
Step 3 Move βj from 0 towards its least-squares coefficient 〈xj, r〉, until some other competitor xk has as much correlation with the current residual as does xj,
Step 4 Move βj and βk in the direction defined by their joint least squares coefficients of the current residual on (xj, xk), until some other competitor xl has as much correlation with the current residual,
Step 5 Continue in this way until all p predictors have been entered. After min(n − 1, p) steps, we arrive at the full least-squares solution.
Remark 2.1. The termination condition in Step 5 requires some explanation. If p > n − 1, the LARS algorithm reaches a zero-residual solution after n − 1 steps (the 1 is because we have centered the data).
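Steps 1 and 2 of Algorithm 1 can be sketched in a few lines; this toy code (illustrative, not a full LARS implementation) standardizes the predictors and picks the one most correlated with the current residual:

```python
import math

def standardize(col):
    """Step 1: center a column to mean zero and scale it to unit norm."""
    m = sum(col) / len(col)
    centered = [v - m for v in col]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]

def most_correlated(X_cols, r):
    """Step 2: index of the predictor maximizing |<x_j, r>|.
    With unit-norm columns this inner product is the correlation."""
    return max(range(len(X_cols)),
               key=lambda j: abs(sum(x * e for x, e in zip(X_cols[j], r))))

# toy data: x2 is strongly aligned with y, x1 is not
x1 = standardize([1.0, 2.0, 3.0, 4.0])
x2 = standardize([1.0, -1.0, 1.0, -1.0])
y = [0.9, -1.1, 1.0, -0.8]
r = [v - sum(y) / len(y) for v in y]   # residual after centering y
print(most_correlated([x1, x2], r))    # -> 1
```

LARS would now move the coefficient of the winning variable toward its least-squares value until a second variable catches up in correlation (Step 3), which requires the equiangular-direction algebra omitted here.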
Algorithm 2 Modified LARS Algorithm for Lasso Path
Step 4a If a nonzero coefficient along the path crosses zero at the ith step, drop the associated variable from the active set of variables and recompute the current joint least squares direction.
2.1.2 L2 Regularization: Ridge Regression
There are two main issues in regression analysis with high-dimensional input vectors:
sparsity and singularity. While the main aim of L1-regularization is to gain a sparse
solution, L2-regularization (known as ridge regression) helps to address the singularity
problem.
The ridge coefficients minimize a penalized residual sum of squares,
$$-\ell_n(\beta) + \lambda P_2(\beta) = \sum_{t=1}^{n}\Big(y_t - \sum_{j=1}^{p}\beta_j x_{tj}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2. \qquad (2.3)$$
Rewriting the criterion (2.3) in matrix form yields
$$\mathrm{RSS}(\lambda) = (\mathbf{Y} - \mathbf{X}\beta)^\top(\mathbf{Y} - \mathbf{X}\beta) + \lambda\beta^\top\beta, \qquad (2.4)$$
whose minimization has the explicit solution
$$\hat{\beta}^{\,ridge} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{Y}. \qquad (2.5)$$
Adding a positive constant to the diagonal elements of $\mathbf{X}^\top\mathbf{X}$ before inversion
makes the problem nonsingular even when $\mathbf{X}^\top\mathbf{X}$ is singular, as happens when p > n. This
helps to solve the singularity problem of the design matrix, which was the main
motivation of Hoerl & Kennard (1970) when they introduced ridge regression into
statistics research.
Ridge regression has two desirable properties:
1. Shrinkage.
To see the shrinkage effect of ridge regression, we use the singular value decomposition (SVD) of the $n \times p$ matrix $\mathbf{X}$:
$$\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^\top, \qquad (2.6)$$
where $\mathbf{D} = \mathrm{diag}(d_1, \cdots, d_{\min(p,n)})$ is a diagonal matrix such that $\{d_i^2 : 1 \le i \le \min(p,n)\}$ contains the positive eigenvalues of $\mathbf{X}^\top\mathbf{X}$.
Using the SVD, we can write the OLS fitted values as
$$\begin{aligned}\mathbf{X}\hat{\beta}^{\,ols} &= \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y} &(2.7)\\ &= \mathbf{U}\mathbf{U}^\top\mathbf{Y}, &(2.8)\end{aligned}$$
and the ridge regression fitted values as
$$\begin{aligned}\mathbf{X}\hat{\beta}^{\,ridge} &= \mathbf{X}(\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{Y} &(2.9)\\ &= \mathbf{U}\mathbf{D}(\mathbf{D}^2 + \lambda\mathbf{I})^{-1}\mathbf{D}\mathbf{U}^\top\mathbf{Y} &(2.10)\\ &= \sum_{j=1}^{p}\mathbf{u}_j\,\frac{d_j^2}{d_j^2 + \lambda}\,\mathbf{u}_j^\top\mathbf{Y}. &(2.11)\end{aligned}$$
This formula shows that a greater amount of shrinkage toward 0 is applied to the basis vectors $\mathbf{u}_j$ with smaller $d_j^2$.
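The shrinkage factors $d_j^2/(d_j^2 + \lambda)$ in (2.11) can be tabulated directly; in this sketch the singular values are illustrative:

```python
def ridge_shrinkage_factors(singular_values, lam):
    """Factor d_j^2 / (d_j^2 + lambda) applied to each basis vector u_j in (2.11)."""
    return [d * d / (d * d + lam) for d in singular_values]

d = [10.0, 3.0, 0.5]            # illustrative singular values of X
for lam in (0.0, 1.0, 10.0):
    print(lam, [round(f, 3) for f in ridge_shrinkage_factors(d, lam)])
# lambda = 0 recovers OLS (no shrinkage); larger lambda shrinks
# the directions with small d_j most strongly toward 0
```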
2. Decorrelation.
Assuming the predictors to be normalized, we can express the sample covariance
matrix in terms of the sample correlations $\rho_{ij}$:
$$\mathbf{X}^\top\mathbf{X} = \begin{pmatrix} 1 & \rho_{12} & \cdots & \rho_{1p} \\ & 1 & \cdots & \vdots \\ & & \ddots & \rho_{p-1,p} \\ & & & 1 \end{pmatrix}_{p \times p}. \qquad (2.12)$$
Ridge estimates with parameter $\lambda$ are given by $\hat{\beta}^{\,ridge} = \mathbf{R}\mathbf{X}^\top\mathbf{Y}$, with
$$\mathbf{R} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}.$$
Notice that $\mathbf{R}$ can be rewritten as
$$\mathbf{R} = \frac{1}{1+\lambda}\,\mathbf{R}^{*} = \frac{1}{1+\lambda}\begin{pmatrix} 1 & \frac{\rho_{12}}{1+\lambda} & \cdots & \frac{\rho_{1p}}{1+\lambda} \\ & 1 & \cdots & \vdots \\ & & \ddots & \frac{\rho_{p-1,p}}{1+\lambda} \\ & & & 1 \end{pmatrix}^{-1},$$
which is the “decorrelated” OLS operator with correlations shrunk by the factor
$1/(1+\lambda)$. We will revisit this effect in Section 2.2. However, ridge regression
does not provide sparse solutions for high-dimensional regression.
2.1.3 Lq Regularization: Bridge Regression
Frank & Friedman (1993) considered the bridge estimator associated with the
regularization problem
$$-\ell_n(\beta) + \lambda P_q(\beta) = \sum_{t=1}^{n}\Big(y_t - \sum_{j=1}^{p}\beta_j x_{tj}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|^{q}, \quad \text{where } q \ge 0. \qquad (2.13)$$
Since this power family of penalties contains subset selection (q = 0), Lasso (q = 1),
and ridge regression (q = 2) as special cases, it gives us opportunities to choose
between subset regression (sparsest solutions) and ridge regression (non-sparse solu-
tions) for 0 ≤ q ≤ 2, and indeed one might even try estimating q from the data; see
Frank & Friedman (1993) and Friedman (2008). While one can use convex optimiza-
tion techniques to solve (2.13) for q ≥ 1, the optimization problem is non-convex when
q < 1, which may yield much sparser solutions than Lasso (q = 1). Huang,
Horowitz & Ma (2008) have studied the asymptotic properties of the bridge estimator
when 0 < q < 1 as p →∞ and n →∞.
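Under an orthonormal design, (2.13) also decouples into scalar problems of the form $(z - \theta)^2 + \lambda|\theta|^q$. A crude grid search (purely illustrative, for q > 0; not how bridge estimators are computed in practice) shows how decreasing q sparsifies the estimate:

```python
def bridge_scalar(z, lam, q, grid_step=0.001):
    """Grid-search minimizer of (z - theta)^2 + lam * |theta|^q, q > 0,
    over a grid covering [-|z|, |z|] (the minimizer lies in this range)."""
    best_theta, best_val = 0.0, z * z   # theta = 0 has zero penalty for q > 0
    steps = int(abs(z) / grid_step) + 1
    for k in range(-steps, steps + 1):
        theta = k * grid_step
        val = (z - theta) ** 2 + lam * abs(theta) ** q
        if val < best_val:
            best_theta, best_val = theta, val
    return best_theta

z, lam = 0.8, 1.0
for q in (2.0, 1.0, 0.5):
    print(q, round(bridge_scalar(z, lam, q), 3))
# as q decreases, the estimate either stays close to z or snaps exactly to 0
```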
2.1.4 Refinements: Adaptive Lasso & SCAD
Alternative penalty functions have been proposed to improve Lasso and ridge regres-
sion in terms of variable selection consistency and prediction accuracy.
Adaptive Lasso
We call a method M an “oracle procedure” if $\hat{\beta}(M)$ has the following asymptotic properties:
(i) it identifies the right subset model, i.e., $\{j : \hat{\beta}_j(M) \neq 0\} = \{j : \beta_j \neq 0\}$,
(ii) $\sqrt{n}\,(\hat{\beta}_A(M) - \beta_A)$ converges in distribution to $N(0, \Sigma^{*})$, where $A = \{j : \beta_j \neq 0\}$, $\Sigma^{*}$ is the covariance matrix when the true subset model is known, and $\beta_A$ is the subvector of $\beta$ corresponding to the subset $A$ of $\{1, 2, \cdots, p\}$.
From Section 2.1.1, we know that Lasso is variable-selection consistent only under
certain conditions, so Lasso is not itself an oracle procedure. Zou (2006) proposed
the Adaptive Lasso, which assigns different weights to different coefficients. Defining the
weight vector $w = 1/|\hat{\beta}|^{\gamma}$, where $\hat{\beta}$ is a $\sqrt{n}$-consistent estimator of $\beta$, the Adaptive Lasso
estimator is given by
$$\hat{\beta}^{*(n)} = \arg\min_{\beta}\Big\{\sum_{t=1}^{n}\Big(y_t - \sum_{j=1}^{p}\beta_j x_{tj}\Big)^2 + \lambda^{(n)}\sum_{j=1}^{p} w_j|\beta_j|\Big\}. \qquad (2.14)$$
The data-driven $w$ is the key to the Adaptive Lasso, since it depends on a $\sqrt{n}$-consistent
initial estimate $\hat{\beta}$ (one can usually use $\hat{\beta}^{\,ols}$). Under the fixed-p setting, Zou (2006) proved
in his Theorem 2 that the Adaptive Lasso enjoys the oracle properties. Huang, Ma & Zhang
(2008) extended these oracle properties to high-dimensional cases where $p \to \infty$ as
$n \to \infty$.
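Under an orthonormal design, (2.14) again decouples: each coordinate is soft-thresholded at $\lambda^{(n)} w_j/2$ with $w_j = 1/|\hat{\beta}_j^{\,ols}|^{\gamma}$, so coordinates with large initial estimates are penalized less. A minimal sketch with illustrative values:

```python
def adaptive_lasso_orthonormal(beta_init, lam, gamma=1.0):
    """Componentwise Adaptive-Lasso solution under an orthonormal design:
    soft-threshold beta_j at lam * w_j / 2, with weight w_j = 1/|beta_j|^gamma."""
    out = []
    for b in beta_init:
        if b == 0.0:
            out.append(0.0)
            continue
        w = 1.0 / abs(b) ** gamma
        t = abs(b) - lam * w / 2.0
        out.append(0.0 if t <= 0.0 else (t if b > 0 else -t))
    return out

beta_ols = [3.0, 0.5, -2.0]
print(adaptive_lasso_orthonormal(beta_ols, lam=1.0))
# the large coefficients 3.0 and -2.0 lose little; the small 0.5 is set to 0
```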
SCAD
Under a generalized penalized likelihood framework, Fan & Li (2001) introduced the
“smoothly clipped absolute deviation” (SCAD) penalty to address the problem that
the L1-penalty used by Lasso may lead to severe bias for large regression coefficients.
They proposed to estimate β by
$$-n^{-1}\ell_n(\beta) + \sum_{j=1}^{p} P_{\lambda}(|\beta_j|) = \frac{1}{2n}\|\mathbf{Y} - \mathbf{X}\beta\|^2 + \sum_{j=1}^{p} P_{\lambda}(|\beta_j|). \qquad (2.15)$$
They argued that good penalty functions should result in estimators with the following
properties:
• Approximate Unbiasedness: The resulting estimator is nearly unbiased, espe-
cially when the true unknown coefficient βj is large, to avoid unnecessary mod-
eling bias,
• Sparsity: The resulting estimator is a thresholding rule, thus can automatically
set small estimated coefficients to zero so as to get a sparse model,
• Continuity: The resulting estimator is continuous in data to reduce instability
in model prediction.
By considering the canonical linear model, in which the design matrix satisfies
$$\mathbf{X}^\top\mathbf{X} = n\mathbf{I}_p,$$
they reduced (2.15) to
$$\frac{1}{2n}\|\mathbf{Y} - \mathbf{X}\hat{\beta}\|^2 + \frac{1}{2}\|\beta - \hat{\beta}\|^2 + \sum_{j=1}^{p} P_{\lambda}(|\beta_j|), \qquad (2.16)$$
where $\hat{\beta} = n^{-1}\mathbf{X}^\top\mathbf{Y}$ is the ordinary least squares estimator. The minimization of (2.16) is therefore equivalent to minimizing componentwise
$$\frac{1}{2}(z - \theta)^2 + P_{\lambda}(|\theta|). \qquad (2.17)$$
To attain these three properties of a good penalty function, Antoniadis & Fan (2001)
gave the following sufficient conditions:
• Approximate unbiasedness if $p'_{\lambda}(t) = 0$ for large $t$,
• Sparsity if $\min_{t \ge 0}\{t + p'_{\lambda}(t)\} > 0$,
• Continuity if $\arg\min_{t \ge 0}\{t + p'_{\lambda}(t)\} = 0$.
Accordingly, Fan and Li introduced the SCAD penalty function, whose derivative
is given by
$$p'_{\lambda}(t) = \lambda\Big\{\mathbb{1}_{\{t \le \lambda\}} + \frac{(a\lambda - t)_{+}}{(a-1)\lambda}\,\mathbb{1}_{\{t > \lambda\}}\Big\} \ \ \text{for some } a > 2, \quad \text{and } p_{\lambda}(0) = 0. \qquad (2.18)$$
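The derivative in (2.18) can be coded directly; the sketch below (a = 3.7 is the default suggested by Fan & Li, the other values are illustrative) exhibits the Lasso-like region, the linearly decaying transition region, and the zero-slope region responsible for approximate unbiasedness:

```python
def scad_derivative(t, lam, a=3.7):
    """SCAD penalty derivative p'_lambda(t) of (2.18), for t >= 0 and a > 2."""
    if t <= lam:
        return lam                              # Lasso-like region: slope lambda
    return max(a * lam - t, 0.0) / (a - 1.0)    # decays linearly, 0 beyond a*lambda

lam = 1.0
print(scad_derivative(0.5, lam))   # -> 1.0 (constant slope lambda, like Lasso)
print(scad_derivative(2.0, lam))   # transition region: (a*lam - t)/(a - 1)
print(scad_derivative(5.0, lam))   # -> 0.0 (approximate unbiasedness for large t)
```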
The associated minimization problem is non-convex, so multi-step procedures in which
each step involves convex optimization have been introduced, as in the local quadratic
approximation of Fan & Li (2001) and the local linear approximation (LLA) of Zou
& Li (2008), who also showed that the one-step LLA estimator has certain oracle
properties if the initial estimator is suitably chosen. Zhou, Van De Geer & Buhlmann
(2009) have pointed out that one such procedure is Zou’s (2006) adaptive Lasso, which
uses the Lasso as an initial estimator to determine the weights for a second-stage
weighted Lasso. They have also substantially weakened the conditions of Huang, Ma
& Zhang (2008) on the variable selection consistency of adaptive Lasso, which Zou
(2006) established earlier for the case of fixed p. However, the computation of both
Adaptive Lasso and SCAD requires a consistent initial estimator for the unknown
regression coefficients. This requirement implicitly assumes p < n or that Lasso or
some other preliminary estimator is consistent for the problem at hand.
2.2 Elastic Net and a New Approach
Zou & Hastie (2005) introduced the Elastic Net, which is a convex combination of L1
and L2 penalties, to address two serious drawbacks of Lasso:
(a) Lasso has a certain amount of sparsity forced onto it: it selects at most n
variables before it saturates. When p ≫ n, there may be more than n nonzero
βj's in the true model, and this is a limitation of Lasso in terms of variable
selection,
(b) Simulation studies have shown that Lasso does not perform well if the predictors
are highly correlated, even when p < n.
The Elastic Net attempts to combine the L2 and L1 penalties, using ridge regression
to deal with the high-correlation problem while taking advantage of Lasso's sparse
variable selection properties, as will be explained below. Assume that the response is
centered and the predictors are normalized, i.e.,
$$\sum_{i=1}^{n} y_i = 0, \qquad \sum_{i=1}^{n} x_{ij} = 0, \qquad \text{and} \qquad \sum_{i=1}^{n} x_{ij}^2 = 1, \qquad j = 1, 2, \cdots, p. \qquad (2.19)$$
Define
$$L(\lambda_1, \lambda_2, \beta) = \|\mathbf{Y} - \mathbf{X}\beta\|_2^2 + \lambda_2\|\beta\|_2^2 + \lambda_1\|\beta\|_1, \qquad (2.20)$$
where $\|\beta\|_2^2 = \sum_{j=1}^{p}\beta_j^2$ and $\|\beta\|_1 = \sum_{j=1}^{p}|\beta_j|$. For any fixed $\lambda_1$ and $\lambda_2$, the Elastic Net
estimator minimizes (2.20) up to the rescaling
$$\hat{\beta}^{\,enet} = (1+\lambda_2)\,\arg\min_{\beta}\{L(\lambda_1, \lambda_2, \beta)\}. \qquad (2.21)$$
As shown by Zou and Hastie, we can first augment the data as $(\mathbf{Y}^{*}, \mathbf{X}^{*})$ with
$$\mathbf{X}^{*}_{(n+p)\times p} = (1+\lambda_2)^{-1/2}\begin{pmatrix}\mathbf{X}\\ \sqrt{\lambda_2}\,\mathbf{I}\end{pmatrix}, \qquad \mathbf{Y}^{*}_{(n+p)\times 1} = \begin{pmatrix}\mathbf{Y}\\ \mathbf{0}\end{pmatrix},$$
and then minimize
$$L(\gamma, \beta^{*}) = \|\mathbf{Y}^{*} - \mathbf{X}^{*}\beta^{*}\|_2^2 + \gamma\|\beta^{*}\|_1, \qquad (2.22)$$
where $\gamma = \lambda_1/\sqrt{1+\lambda_2}$ and $\beta^{*} = \sqrt{1+\lambda_2}\cdot\beta$; the minimization of (2.22) is equivalent
to the minimization of (2.20).
In the case of an orthogonal design, it is straightforward to show that with parameters
$(\lambda_1, \lambda_2)$ the Elastic Net solution is
$$\begin{aligned}\hat{\beta}^{\,enet}_i &= (1+\lambda_2)\,\frac{(|\hat{\beta}^{\,ols}_i| - \lambda_1/2)_{+}}{1+\lambda_2}\,\mathrm{sgn}\{\hat{\beta}^{\,ols}_i\}\\ &= (1+\lambda_2)\Big(|\hat{\beta}^{\,ridge}_i| - \frac{\lambda_1/2}{1+\lambda_2}\Big)_{+}\mathrm{sgn}\{\hat{\beta}^{\,ridge}_i\},\end{aligned} \qquad (2.23)$$
which amounts to Lasso-type soft-thresholding of the ridge regression estimates
associated with the L2 penalty.
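Formula (2.23) is easy to check numerically; this sketch (illustrative values) applies ridge shrinkage by $1/(1+\lambda_2)$, soft-thresholds at $(\lambda_1/2)/(1+\lambda_2)$, and rescales by $(1+\lambda_2)$:

```python
def enet_orthogonal(beta_ols, lam1, lam2):
    """Elastic Net solution (2.23) under an orthogonal design:
    ridge-shrink, soft-threshold, then rescale by (1 + lambda2)."""
    out = []
    for b in beta_ols:
        b_ridge = b / (1.0 + lam2)                      # stage 1: ridge shrinkage
        t = abs(b_ridge) - (lam1 / 2.0) / (1.0 + lam2)  # stage 2: soft threshold
        thresh = 0.0 if t <= 0.0 else (t if b_ridge > 0 else -t)
        out.append((1.0 + lam2) * thresh)               # rescaling in (2.23)
    return out

print(enet_orthogonal([3.0, 0.2, -1.0], lam1=1.0, lam2=1.0))  # -> [2.5, 0.0, -0.5]
```

As (2.23) states, the result coincides with soft-thresholding the OLS estimates at $\lambda_1/2$; the ridge stage matters once the design is correlated.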
More generally, the Elastic Net can be considered a two-stage procedure: ridge-type
direct shrinkage followed by Lasso-type thresholding. The ridge-type shrinkage is the
part that distinguishes the Elastic Net from Lasso. To study the operating
characteristics of the ridge operator, assume that the predictors are normalized so that
$$\mathbf{X}^\top\mathbf{X} = \begin{pmatrix} 1 & \rho_{12} & \cdots & \rho_{1p} \\ & 1 & \cdots & \vdots \\ & & \ddots & \rho_{p-1,p} \\ & & & 1 \end{pmatrix}_{p \times p}, \qquad (2.24)$$
where $\rho_{ij}$ is the sample correlation. Ridge estimates with parameter $\lambda_2$ are given by
$\hat{\beta}^{\,ridge} = \mathbf{R}\mathbf{X}^\top\mathbf{Y}$, with
$$\mathbf{R} = (\mathbf{X}^\top\mathbf{X} + \lambda_2\mathbf{I})^{-1}.$$
Notice that $\mathbf{R}$ can be rewritten as
$$\mathbf{R} = \frac{1}{1+\lambda_2}\,\mathbf{R}^{*} = \frac{1}{1+\lambda_2}\begin{pmatrix} 1 & \frac{\rho_{12}}{1+\lambda_2} & \cdots & \frac{\rho_{1p}}{1+\lambda_2} \\ & 1 & \cdots & \vdots \\ & & \ddots & \frac{\rho_{p-1,p}}{1+\lambda_2} \\ & & & 1 \end{pmatrix}^{-1}.$$
$\mathbf{R}^{*}$ is like the usual OLS operator except that the correlations are shrunk by the factor
$1/(1+\lambda_2)$, which in effect performs decorrelation. However, the Elastic Net has
some limitations, because it uses a mixture of the joint double-exponential prior (cor-
responding to Lasso) and the joint normal prior (corresponding to ridge). Therefore,
when one of the penalties does not work well, the mixture still has to pay the price
of that penalty.
The operating characteristics of the Elastic Net inspired us to combine the L1 and L2
penalties in an alternative way. Rather than taking a linear combination of the L1
and L2 penalties as in (2.20), we can combine the penalties by taking the maximum
of $|\beta_i|$ and $\rho\beta_i^2$ for each component of $\beta$, i.e.,
$$P(\beta, \rho) = \sum_{i=1}^{p}\max(|\beta_i|,\ \rho\beta_i^2). \qquad (2.25)$$
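The Max1,2 penalty (2.25) acts componentwise, switching from the L1 branch to the L2 branch at $|\beta_i| = 1/\rho$; a direct sketch (the inputs are illustrative):

```python
def max12_penalty(beta, rho):
    """P(beta, rho) = sum_i max(|beta_i|, rho * beta_i^2), as in (2.25).
    The L2 branch rho*b^2 dominates once |b| > 1/rho."""
    return sum(max(abs(b), rho * b * b) for b in beta)

rho = 3.0
print(max12_penalty([0.2], rho))   # -> 0.2  (L1 branch: 0.2 > 3 * 0.04)
print(max12_penalty([1.0], rho))   # -> 3.0  (L2 branch: 3 > 1)
```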
Figure 2.1: Contour for different penalties, left to right, L1, L2, and Max1,2 with ρ = 3
We call it the Max(1,2) penalty. Figure 2.1 shows the contours of the L1, L2, and Max(1,2) (with ρ = 3) penalties. The essential idea behind the Max(1,2) penalty is to threshold the small coefficients by the L1 penalty, while keeping the large coefficients after the decorrelation by the L2 penalty. The corresponding prior density for \beta_j is proportional to

\exp\{-\lambda \cdot \max(|\beta_j|, \rho\beta_j^2)\}
for each j, and the joint prior distribution for \beta assumes that the \beta_j are independent. In other words, large coefficients are treated as in ridge regression, which tends to “share” the coefficients among a group of correlated regressors, while small coefficients are soft-thresholded as in the Lasso, yielding sparsity among these small coefficients. Figure 2.2 compares the shrinkage effects of the L1, L2, Elastic Net, and Max(1,2) penalties.
Figure 2.2: Shrinkage effects of different penalties (L1, L2, Elastic Net, and Max(1,2))
We now define our Max(1,2) estimator. Let

L(\lambda, \rho, \beta) = \|Y - X\beta\|_2^2 + \lambda P(\beta, \rho) \qquad (2.26)
\phantom{L(\lambda, \rho, \beta)} = \|Y - X\beta\|_2^2 + \lambda \sum_{i=1}^{p} \max(|\beta_i|, \rho\beta_i^2). \qquad (2.27)

For any fixed \lambda and \rho, denote the Max(1,2) penalty estimator by \hat\beta_M:

\hat\beta_M = \arg\min_\beta L(\lambda, \rho, \beta). \qquad (2.28)
In the literature, Owen (2006) introduced the “Berhu” penalty,

B_M(\beta_j) = \begin{cases}
|\beta_j|, & |\beta_j| \le M, \\
\dfrac{\beta_j^2 + M^2}{2M}, & |\beta_j| > M,
\end{cases} \qquad (2.29)

which is another variant of the Max(1,2) penalties. His motivation was Huber's loss function for robust estimation, and “Berhu” reverses “Huber”. Owen (2006) used cvx to solve the optimization problem, which is not fast. In the following section we present our new, efficient implementation.
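For comparison, the Berhu penalty (2.29) is equally easy to evaluate; a minimal sketch (our own naming), which is L1 for small coefficients and quadratic for large ones, continuous at |\beta_j| = M:

```python
import numpy as np

def berhu_penalty(beta, M):
    """Owen's (2006) Berhu penalty, eq. (2.29): L1 for |b| <= M, quadratic above."""
    b = np.abs(np.asarray(beta, dtype=float))
    # (b^2 + M^2) / (2M) equals b at b = M, so the two pieces join continuously
    return np.where(b <= M, b, (b ** 2 + M ** 2) / (2 * M)).sum()
```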
2.2.1 Implementation of Max(1,2) Regularization
To handle the p ≫ n case, in this section we first present a fast implementation of Max(1,2) regularization for fixed λ and ρ, based on the ADMM procedure for convex optimization problems; see Boyd et al. (2010). Since the statistics community cares much about the solution path for parameter-tuning purposes, we then propose an even faster path-approximation algorithm that yields all path solutions for the Max(1,2) regularization method.
Exact solution implementation
We now describe our modified Alternating Direction Method of Multipliers (ADMM; Boyd et al. (2010)) for the Max(1,2) problem in the Lagrangian form

L(\lambda, \rho, \beta) = \|Y - X\beta\|_2^2 + \lambda \sum_{i=1}^{p} \max(|\beta_i|, \rho\beta_i^2). \qquad (2.30)

The algorithm can also be generalized to other convex hybrid penalties.
ADMM is a variant of the augmented Lagrangian scheme that uses partial updates
for the dual variables, and it is intended to blend the decomposition capability of dual
descent with the superior convergence properties of the method of multipliers. Our
Max(1,2) optimization problem is clearly equivalent to

minimize: L(\lambda, \rho, \beta, z) = \|Y - X\beta\|_2^2 + \lambda \sum_{i=1}^{p} \max(|z_i|, \rho z_i^2), \qquad (2.31)
subject to: \beta - z = 0, \qquad (2.32)

where z = (z_1, z_2, \cdots, z_p). As in the method of multipliers, we can form the augmented Lagrangian

L_r(\lambda, \rho, \beta, z, d) = \|Y - X\beta\|_2^2 + \lambda \sum_{i=1}^{p} \max(|z_i|, \rho z_i^2) + d^\top(\beta - z) + \frac{r}{2}\|\beta - z\|_2^2. \qquad (2.33)
Under the ADMM framework, for fixed λ and ρ, we can solve (2.33) through the
iterations:

\beta^{k+1} := \arg\min_\beta L_r(\beta, z^k, d^k), \qquad (2.34)
z^{k+1} := \arg\min_z L_r(\beta^{k+1}, z, d^k), \qquad (2.35)
d^{k+1} := d^k + r(\beta^{k+1} - z^{k+1}). \qquad (2.36)

Application of ADMM to (2.33) updates \beta and z in an alternating fashion and consists of three steps: a \beta-minimization step (2.34), a z-minimization step (2.35), and a dual-variable update (2.36) that uses a step size r. As pointed out by Boyd et al., the state of ADMM consists only of z^k and d^k, i.e., (z^{k+1}, d^{k+1}) is a function of (z^k, d^k). In particular, both minimization steps (2.34) and (2.35) can be solved explicitly:

\beta^{k+1} = \Big(X^\top X + \frac{r}{2} I\Big)^{-1} \Big(X^\top y + \frac{r}{2} z^k - \frac{d^k}{2}\Big), \qquad (2.37)
z_j^{k+1} = S_{\lambda,\rho,r}\Big(\frac{d_j^k}{r} + \beta_j^{k+1}\Big), \quad 1 \le j \le p, \qquad (2.38)
and S_{\lambda,\rho,r} is the shrinkage operator associated with the Max(1,2) penalty,

S_{\lambda,\rho,r}(v) = \begin{cases}
\dfrac{v}{1+2\lambda\rho/r}, & |v| \ge \frac{1}{\rho} + \frac{2\lambda}{r}, \\[4pt]
\frac{1}{\rho}\,\mathrm{sign}(v), & \frac{1}{\rho} + \frac{\lambda}{r} \le |v| \le \frac{1}{\rho} + \frac{2\lambda}{r}, \\[4pt]
v - \mathrm{sign}(v)\frac{\lambda}{r}, & \frac{\lambda}{r} \le |v| \le \frac{1}{\rho} + \frac{\lambda}{r}, \\[4pt]
0, & |v| \le \frac{\lambda}{r},
\end{cases} \qquad (2.39)

which is exactly the shrinkage that occurs for orthonormal inputs and is a “hybrid” of ridge shrinkage and Lasso soft thresholding, depending on the magnitude of v.
We iterate until \|z^k - \beta^k\|_2 \le \varepsilon_{\mathrm{feas}} and \|z^k - z^{k-1}\|_2 \le \varepsilon_{\mathrm{tol}}, where \varepsilon_{\mathrm{feas}} is the feasibility tolerance for the residual of the equality constraint (2.32) and \varepsilon_{\mathrm{tol}} is the stability tolerance. Many convergence results for ADMM are discussed in the literature; more details can be found in Boyd et al. (2010).
Thus, we can calculate \hat\beta_{\lambda,\rho} on a grid of (\lambda, \rho) values and then choose the best tuning parameters (\lambda^*, \rho^*) by cross-validation.
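The iterations (2.34)–(2.39) can be sketched directly in code. The following is a minimal illustrative implementation (ours, not the author's software; all names are hypothetical), with the shrinkage operator (2.39) applied elementwise:

```python
import numpy as np

def max12_shrink(v, lam, rho, r):
    """Elementwise shrinkage operator S_{lam,rho,r} of eq. (2.39)."""
    a = np.abs(v)
    out = np.where(a <= lam / r, 0.0, v - np.sign(v) * lam / r)        # soft threshold
    mid = (a >= 1/rho + lam/r) & (a <= 1/rho + 2*lam/r)
    out = np.where(mid, np.sign(v) / rho, out)                         # flat segment
    out = np.where(a >= 1/rho + 2*lam/r, v / (1 + 2*lam*rho/r), out)   # ridge-type
    return out

def max12_admm(X, y, lam, rho, r=1.0, n_iter=500, tol=1e-8):
    """ADMM iterations (2.34)-(2.36) for the Max(1,2) problem (2.30)."""
    n, p = X.shape
    A = np.linalg.inv(X.T @ X + (r / 2) * np.eye(p))  # reused in every beta-step
    Xty = X.T @ y
    z = np.zeros(p)
    d = np.zeros(p)
    for _ in range(n_iter):
        beta = A @ (Xty + (r / 2) * z - d / 2)            # eq. (2.37)
        z_new = max12_shrink(d / r + beta, lam, rho, r)   # eq. (2.38)
        d = d + r * (beta - z_new)                        # eq. (2.36)
        if (np.linalg.norm(z_new - z) <= tol and
                np.linalg.norm(beta - z_new) <= tol):
            z = z_new
            break
        z = z_new
    return z
```

With lam = 0 the shrinkage is the identity and the iterations converge to the ordinary least-squares solution, which gives a quick correctness check.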
Fast path approximation implementation
In the previous subsection, we gave an exact solution for given \lambda and \rho. Obtaining solutions to (2.30) over a grid of different (\lambda, \rho) values still carries a large computational burden. One approach to mitigating this burden is direct path seeking, which sequentially constructs a path in the parameter space that closely approximates the one for a given penalty, rather than repeatedly solving the optimization problems. Motivated by the Generalized Path Seeking (GPS) algorithm of Friedman (2008), we introduce a direct path-seeking algorithm for our Max(1,2) estimator.
Denote P(\beta, \rho) = \sum_{i=1}^{p} \max(|\beta_i|, \rho\beta_i^2), \rho > 0. It is easy to show that

\frac{\partial P(\beta, \rho)}{\partial |\beta_j|} > 0, \quad 1 \le j \le p, \qquad (2.40)

for all values of \beta, which satisfies condition (23) in Friedman (2008). We thus obtain a path-seeking algorithm for our Max(1,2) estimator based on Friedman's GPS framework.
Let \upsilon measure the length along the path, let \Delta\upsilon > 0 be a small increment, and let \beta(\upsilon) be the path solution point indexed by \upsilon. Define
g_j(\upsilon) = -\left[\frac{\partial \|Y - X\beta\|_2^2}{\partial \beta_j}\right]_{\beta=\beta(\upsilon)}, \qquad (2.41)
p_j(\upsilon) = \left[\frac{\partial P(\beta, \rho)}{\partial |\beta_j|}\right]_{\beta=\beta(\upsilon)}, \qquad (2.42)
\tau_j(\upsilon) = \frac{g_j(\upsilon)}{p_j(\upsilon)}, \qquad (2.43)

in which g_j(\upsilon) and p_j(\upsilon) are, respectively, the jth componentwise negative gradient of the least-squares empirical risk and the gradient of the regularizer P(\beta, \rho) with respect to |\beta_j|, both evaluated at \beta(\upsilon); thus the \tau_j(\upsilon) are the componentwise ratios of these two gradients at \beta(\upsilon).
Algorithm 3 Path-seeking algorithm for the Max(1,2) estimator
Choose an appropriate \Delta\upsilon; let G_\rho be the number of \rho values in the grid.
Step 1: For each \rho, initialize \upsilon = 0 and \beta_j(0) = 0, 1 \le j \le p.
Step 2.1: Compute \{\tau_j(\upsilon)\}_{1}^{p} and form the candidate set S = \{j : \tau_j(\upsilon) \cdot \beta_j(\upsilon) < 0\}.
Step 2.2: Choose the variable with maximal gradient ratio from the candidate set S: j^* = \arg\max_{j \in S} |\tau_j(\upsilon)| (if S is empty, j^* = \arg\max_j |\tau_j(\upsilon)|).
Step 2.3: Update \beta_{j^*}: \beta_{j^*}(\upsilon + \Delta\upsilon) = \beta_{j^*}(\upsilon) + \Delta\upsilon \cdot \mathrm{sign}(\tau_{j^*}(\upsilon)).
Step 2.4: Set \upsilon = \upsilon + \Delta\upsilon.
Step 2.5: Repeat Steps 2.1–2.4 until all \tau(\upsilon) = 0.
Step 3: Tune along the path to obtain \beta(\upsilon^*(\rho)) with minimum cross-validation error; set \hat\beta_M = \beta(\upsilon^*(\rho^*)), the solution with minimum cross-validation error across all \rho.
For any fixed \rho, the path-approximation algorithm (Algorithm 3) gives the solution path along the path length \upsilon; thus we can tune \upsilon (equivalent to tuning \lambda) along the path by choosing \beta(\upsilon^*(\rho)) with minimum cross-validation error. Over a grid of \rho values, we then select \beta(\upsilon^*(\rho^*)) with minimum cross-validation error, which is our empirical solution for the Max(1,2) estimator. The path-seeking algorithm reduces the two-dimensional grid search to a univariate grid search, and it also speeds up the path seeking along \lambda for fixed \rho by using an approximation, which makes the Max(1,2) estimator more easily computable.
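For squared-error loss, Algorithm 3 is only a few lines of code. The sketch below (our own, with hypothetical names) follows Steps 2.1–2.4, using g_j(\upsilon) = 2 x_j^\top(Y - X\beta) and the subgradient of \max(|\beta_j|, \rho\beta_j^2) with respect to |\beta_j|:

```python
import numpy as np

def max12_path(X, y, rho, dv=0.01, n_steps=200):
    """GPS-style path seeking for the Max(1,2) penalty (Algorithm 3 sketch)."""
    n, p = X.shape
    beta = np.zeros(p)
    path = [beta.copy()]
    for _ in range(n_steps):
        resid = y - X @ beta
        g = 2 * X.T @ resid                                    # neg. gradient, eq. (2.41)
        pen = np.where(np.abs(beta) < 1 / rho,                 # eq. (2.42): d/d|b| of
                       1.0, 2 * rho * np.abs(beta))            #   max(|b|, rho*b^2)
        tau = g / pen                                          # eq. (2.43)
        S = np.where(tau * beta < 0)[0]                        # candidate set, Step 2.1
        cand = S if S.size else np.arange(p)
        j = cand[np.argmax(np.abs(tau[cand]))]                 # Step 2.2
        if np.abs(tau[j]) < 1e-10:
            break                                              # all tau ~ 0: stop
        beta[j] += dv * np.sign(tau[j])                        # Step 2.3
        path.append(beta.copy())
    return np.array(path)
```

Each row of the returned array is one point \beta(\upsilon) on the path; tuning \upsilon then amounts to picking the row with minimum cross-validation error.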
2.3 L0-Regularization: Orthogonal Greedy Algorithm
Ing & Lai (2011) introduced a fast stepwise regression procedure, called the orthogonal greedy algorithm (OGA), which consists of (a) forward selection of input variables in a “greedy” manner so that the variable selected at each step most improves the fit, (b) a high-dimensional information criterion (HDIC) to terminate forward inclusion of variables, and (c) stepwise backward elimination of variables according to HDIC.
2.3.1 OGA and Gradient Boosting
As a greedy algorithm, OGA has a strong connection to the boosting algorithms in
statistical learning. Boosting successively uses a “weak learner” (or “base learner”)
to improve prediction, so that after a large number of iterations the cumulative effect of these weak learners produces much better predictions. The first boosting algorithm came from Valiant's (1984) PAC (Probably Approximately Correct) learning model, and Schapire (1990) developed the first simple boosting procedure under the PAC learning framework. Freund (1995) proposed boosting by majority voting to combine many weak learners simultaneously and improve the performance of the
simple boosting algorithm of Schapire. This led to the more adaptive and realistic AdaBoost (Freund & Schapire 1996a) and its refinements, with theoretical justification provided by Freund & Schapire (1996a) and Schapire & Singer (1999). Subsequently, Freund & Schapire (1996b) and Breiman (1998, 1999) connected AdaBoost to game theory and Vapnik–Chervonenkis theory.
The ground-breaking work on boosting came from Friedman (1999, 2001). Friedman treats the boosting procedure as a general method for functional gradient-descent learning and gives a list of choices of loss functions and base learners in a generic
Algorithm 4 Friedman's Gradient Boosting Algorithm
Given \{(y_i, x_i); i = 1, 2, \ldots, n\}, denote the empirical loss function \phi(F) = \frac{1}{n}\sum_{i=1}^{n} L(y_i, F(x_i)).
• Step 1 (Initialization): Set F_0(x) = 0.
• Step 2 (Iteration): For m = 1 to M:
  – Step 2.1: Compute the steepest-descent direction u_i = -\partial L(y_i, F(x_i))/\partial F(x_i)|_{F=F_{m-1}}, i = 1, \ldots, n.
  – Step 2.2: Calculate the least-squares fit a_m = \arg\min_{a,\beta} \sum_{i=1}^{n} [u_i - \beta h(x_i; a)]^2, where h can be any base learner. Denote f_m(x) = h(x; a_m).
  – Step 2.3: Line search: \rho_m = \arg\min_\rho \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + \rho f_m(x_i)).
  – Step 2.4: Update F_m(x) = F_{m-1}(x) + \rho_m f_m(x).
• Step 3 (Stopping): The boosted estimate is F(x) = \sum_{m=1}^{m^*} \rho_m f_m(x), where m^* is selected to minimize a model selection criterion.
framework for “gradient boosting”. Algorithm 4 is a brief summary of boosting from the gradient-descent point of view.
This work extended the boosting method to regression, implemented as an optimization using the squared-error loss function; this important special case is called L2-Boosting in Buhlmann & Yu (2003) and Buhlmann (2006). As a special case, L2-Boosting uses the loss function L(y, F) = \frac{1}{2}(y - F)^2 and takes the componentwise linear model as base learner; thus we obtain the L2-Boosting algorithm in the gradient-descent framework with a simple structure: the negative gradient in Step 2.1 is the classical residual vector, and the line search in Step 2.3 is trivial. Algorithm 5 summarizes L2-Boosting.
Algorithm 5 L2-Boosting Algorithm
• Step 1 (Initialization): Set F_0(x) = 0.
• Step 2 (Iteration): For m = 1 to M:
  – Step 2.1: Compute the residuals u_i = y_i - F_{m-1}(x_i), i = 1, \ldots, n.
  – Step 2.2: The new base learner is f_m(x) = x_{j_m}, where
    j_m = \arg\min_{1 \le j \le p} \sum_{i=1}^{n} (u_i - \beta_j x_{ij})^2 = \arg\min_{1 \le j \le p} (1 - r_j^2), \quad \beta_j = \frac{\sum_{i=1}^{n} u_i x_{ij}}{\sum_{i=1}^{n} x_{ij}^2}.
  – Step 2.3: Line search: \rho_m = \arg\min_\rho \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + \rho f_m(x_i)) = \beta_{j_m}.
  – Step 2.4: Update F_m(x) = F_{m-1}(x) + \beta_{j_m} x_{j_m}.
• Step 3 (Stopping): The boosted estimate is F(x) = \sum_{m=1}^{m^*} \beta_{j_m} x_{j_m}, where m^* is selected to minimize a model selection criterion.
Remark 2.2. It is often better to use a small step size in Step 2.4, i.e., F_m(x) = F_{m-1}(x) + \nu \beta_{j_m} x_{j_m}, where \nu is a constant during the iterations.
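For squared-error loss with componentwise linear base learners, Algorithm 5 with the step-size shrinkage of Remark 2.2 reads as follows (an illustrative sketch of our own; the function name is hypothetical):

```python
import numpy as np

def l2_boost(X, y, M=100, nu=0.1):
    """Componentwise L2-Boosting (Algorithm 5) with step-size shrinkage nu."""
    n, p = X.shape
    coef = np.zeros(p)
    F = np.zeros(n)
    for _ in range(M):
        u = y - F                                    # Step 2.1: residual vector
        b = (X.T @ u) / (X ** 2).sum(axis=0)         # per-column least-squares fits
        sse = ((u[:, None] - X * b) ** 2).sum(axis=0)
        j = int(np.argmin(sse))                      # Step 2.2: best single predictor
        coef[j] += nu * b[j]                         # Steps 2.3-2.4 with shrinkage nu
        F = X @ coef
    return coef
```

In practice M is chosen by a model selection criterion (e.g., AICc, as Buhlmann (2006) proposes) or by cross-validation.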
Buhlmann (2006) studied the consistency of L2-Boosting in terms of the conditional prediction error (2.44),

\mathrm{CPE} := E\{(F_m(x) - f(x))^2 \mid y_1, x_1, \cdots, y_n, x_n\}, \qquad (2.44)

and showed that for p = \exp(O(n^\xi)) with 0 < \xi < 1, the CPE of the L2-Boosting predictor F_m(x), under certain technical conditions, converges in probability to 0 if m = m_n \to \infty sufficiently slowly. He also proposed to use corrected AIC (AICc) as the model selection criterion along the PGA path.
Algorithm 6 OGA Algorithm
• Step 1 (Initialization): Set F_0(x) = 0 and J^{(0)} = \emptyset.
• Step 2 (Iteration): For m = 1 to K_n:
  – Step 2.1: Compute the residuals U_i = Y_i - F_{m-1}(X_i), i = 1, \ldots, n.
  – Step 2.2: Find j_m by
    j_m = \arg\min_{1 \le j \le p,\, j \notin J^{(m-1)}} (1 - r_j^2). \qquad (2.45)
    Compute the projection \hat{X}_{j_m} of X_{j_m} onto the linear space spanned by (X_{j_1}, X^\perp_{j_2}, X^\perp_{j_3}, \ldots, X^\perp_{j_{m-1}}), and set X^\perp_{j_m} = X_{j_m} - \hat{X}_{j_m}, the “weak” orthogonal gradient direction. Update J^{(m)} = (J^{(m-1)}, j_m).
  – Step 2.3: Line search along the orthogonal gradient direction:
    \rho_m = \arg\min_\rho \sum_{i=1}^{n} L(Y_i, F_{m-1}(X_i) + \rho X^\perp_{i,j_m}) = \hat\beta_{j_m},
    where \hat\beta_{j_m} = \big(\sum_{i=1}^{n} U_i X^\perp_{i,j_m}\big) / \sum_{i=1}^{n} (X^\perp_{i,j_m})^2.
  – Step 2.4: Update F_m(x) = x_{J^{(m)}} \hat\beta_{J^{(m)}} = F_{m-1}(x) + \hat\beta_{j_m} x^\perp_{j_m}.
Remark 2.3. Step 2.3 can be considered a hyperplane search, since we obtain the OLS update from this step.
As an alternative to L2-Boosting, OGA can also be summarized within the boosting framework, as in Algorithm 6. The population versions of OGA and L2-Boosting were also studied in Temlyakov (2000), where they are called OGA and PGA, respectively. In information theory, compressed sensing, and approximation theory, OGA is also known as orthogonal matching pursuit (OMP), which focuses on approximation in noiseless models (i.e., \varepsilon_t = 0 in (2.1)); more details are in Tropp (2004) and Tropp et al. (2007).
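An OGA iteration needs only the residual vector and an orthogonalization of the selected columns. The compact sketch below (our own, with hypothetical names) exploits the fact that refitting OLS on the selected set after each inclusion is equivalent to the orthogonalized updates in Steps 2.2–2.4 of Algorithm 6:

```python
import numpy as np

def oga(X, y, K):
    """Orthogonal greedy algorithm: greedy selection + OLS refit at each step."""
    n, p = X.shape
    J = []                                           # selected index set J^(m)
    resid = y.copy()
    rss_path = []
    norms = np.sqrt((X ** 2).sum(axis=0))
    for _ in range(K):
        corr = np.abs(X.T @ resid) / norms           # maximizing |r_j| = minimizing 1 - r_j^2
        corr[J] = -np.inf                            # exclude already-selected columns
        J.append(int(np.argmax(corr)))               # Step 2.2: greedy selection
        beta_J = np.linalg.lstsq(X[:, J], y, rcond=None)[0]
        resid = y - X[:, J] @ beta_J                 # OLS fit on the selected set
        rss_path.append(float(resid @ resid))        # RSS after each step
    return J, np.array(rss_path)
```

The returned RSS path is exactly what the stopping criterion of the next subsection (HDIC) is computed from.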
2.3.2 OGA+HDIC
Ing & Lai (2011) introduced OGA+HDIC for variable selection. They proposed choosing, along the OGA path, the model that minimizes a suitably chosen criterion, called a “high-dimensional information criterion” (HDIC). Specifically, for a non-empty subset J of \{1, \cdots, p\}, let \hat\sigma^2_J = n^{-1} \sum_{t=1}^{n} (y_t - \hat{y}_{t;J})^2, where \hat{y}_{t;J} is the OGA fit based on the variables in J. Let

\mathrm{HDIC}(J) = n \log \hat\sigma^2_J + \#(J)\, w_n \log p, \qquad (2.46)
\hat{k}_n = \arg\min_{1 \le k \le K_n} \mathrm{HDIC}(J_k), \qquad (2.47)

in which different criteria correspond to different choices of w_n, and J_k = \{j_1, \cdots, j_k\}. Note that \hat\sigma^2_{J_k}, and therefore also \mathrm{HDIC}(J_k), can be readily computed at the kth OGA iteration, so this model selection method along the OGA path involves little additional computational cost. In particular, w_n = \log n corresponds to HDBIC, w_n = c \log\log n with c > 2 corresponds to the high-dimensional Hannan–Quinn criterion (HDHQ), and w_n = c corresponds to HDAIC.
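Along a single OGA path this costs almost nothing: with the residual sums of squares recorded at each step, HDIC (2.46) is one vector operation. A sketch (our naming; w_n = \log n gives HDBIC):

```python
import numpy as np

def hdic_select(rss_path, n, p, w_n):
    """Pick the OGA step minimizing HDIC(J_k) = n*log(sigma2_k) + k*w_n*log(p)."""
    k = np.arange(1, len(rss_path) + 1)              # model sizes along the path
    sigma2 = np.asarray(rss_path, dtype=float) / n   # sigma2_k = RSS_k / n
    hdic = n * np.log(sigma2) + k * w_n * np.log(p)
    return int(np.argmin(hdic)) + 1                  # chosen k_hat (1-based)
```

For instance, if the RSS drops sharply for the first two steps and then stalls, the penalty term k w_n \log p makes the criterion pick k = 2.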
Let K_n denote a prescribed upper bound on the number m of OGA iterations, \sigma^2_j = E(x^2_j), z_j = x_j/\sigma_j, and z_{tj} = x_{tj}/\sigma_j. Let

\Gamma(J) = E\{z(J) z^\top(J)\}, \quad g_i(J) = E(z_i z(J)), \qquad (2.48)

where z(J) is the subvector of (z_1, \cdots, z_p)^\top indexed by the subset J of \{1, \cdots, p\}. We assume that for some \delta > 0, M > 0, and all large n,

\min_{1 \le \#(J) \le K_n} \lambda_{\min}(\Gamma(J)) > \delta, \quad \max_{1 \le \#(J) \le K_n,\, i \notin J} \|\Gamma^{-1}(J) g_i(J)\|_1 < M, \qquad (2.49)

where \#(J) denotes the cardinality of J and

\|\nu\|_1 = \sum_{j=1}^{k} |\nu_j| \text{ for } \nu = (\nu_1, \cdots, \nu_k)^\top. \qquad (2.50)
For p " n, p = pn →∞, under assumptions :
(C1) log pn = o(n),
(C2) E{exp(sε)} < ∞ for |s| ≤ s0,
(C3) There exists s > 0 such that
lim supn→∞
max1≤j≤pn E{exp(sz2j )} < ∞,
(C4) supn≥1
∑pn
j=1 |βjσj| < ∞.
The following theorem of Ing & Lai (2011) gives the rate of convergence, which holds
uniformly over 1 ≤ m ≤ Kn, for the CPE (defined in (2.44)) of OGA provided that
the correlation matrix of the regressors satisfies (2.49).
Theorem 2.4. Assume (C1)-(C4) and (2.49). Suppose Kn → ∞ such that Kn =
O((n/ log pn)1/2). Then for OGA,
max1≤m≤Kn
(E[{y(x)− ym(x)}2|y1,x1, · · · , yn,xn]
m−1 + n−1m log pn
)= Op(1).
Theorem 2.4 says that, uniformly in m = O((n/\log p_n)^{1/2}), OGA attains the heuristically best order m^{-1} + n^{-1} m \log p_n for E_n(\{y(x) - \hat{y}_m(x)\}^2). The two terms in the denominator correspond to the squared bias and the variance of the sample version of OGA. In the population version of OGA, Theorem 3 of Temlyakov (2000) shows that the squared bias in approximating y(x) by y_{J_m}(x) is E(y(x) - y_{J_m}(x))^2 = O(m^{-1}), where y_{J_m}(x) denotes the best linear predictor of y(x) based on \{x_j, j \in J_m\} and J_m is the set of input variables selected by the population version of OGA at the end of stage m. Since the sample version of OGA uses \hat{y}_m(\cdot) rather than y_{J_m}(\cdot), it incurs not only a larger squared bias but also variance in the least-squares estimates \hat\beta_{j_i}, i = 1, \cdots, m. The variance is of order O(n^{-1} m \log p_n): m is the number of estimated regression coefficients, O(n^{-1}) is the variance per coefficient, and O(\log p_n) is the variance inflation factor due to the data-dependent selection of j_i from \{1, \cdots, p_n\}. Combining the squared bias with the variance suggests that O(m^{-1} + n^{-1} m \log p_n) is the smallest order one can expect for E_n(\{y(x) - \hat{y}_m(x)\}^2). Moreover, the standard bias–variance tradeoff suggests that m should not be chosen larger than O((n/\log p_n)^{1/2}), and OGA usually takes K_n = 5(n/\log p_n)^{1/2}.
2.3.3 Variable Selection Consistency under Strong Sparsity
We say a procedure has the sure screening property if it includes all relevant variables with probability approaching 1. Furthermore, we say the procedure is variable-selection consistent if it selects all relevant variables and no irrelevant variables with probability approaching 1. To achieve variable-selection consistency of OGA+HDIC, some lower-bound condition (which may approach 0 as n \to \infty) on the absolute values of the nonzero regression coefficients needs to be imposed. We quantify this lower-bound condition with a “strong sparsity” condition:
(C5) There exists 0 \le \gamma < 1 such that n^\gamma = o((n/\log p_n)^{1/2}) and

\liminf_{n\to\infty} n^\gamma \min_{1 \le j \le p_n:\, \beta_j \ne 0} \beta^2_j \sigma^2_j > 0.

Denote the set of relevant input variables by N_n = \{1 \le j \le p_n : \beta_j \ne 0\}. The following theorem of Ing & Lai (2011) shows that the OGA path contains all the relevant variables under the strong sparsity condition (C5); thus OGA has the sure screening property.
Theorem 2.5. Assume (C1)–(C5) and (2.49). Suppose K_n/n^\gamma \to \infty and K_n = O((n/\log p_n)^{1/2}). Then \lim_{n\to\infty} P(N_n \subset J_{K_n}) = 1, where N_n = \{1 \le j \le p_n : \beta_j \ne 0\} denotes the set of relevant input variables.
Define the minimal number of relevant regressors along an OGA path by

k_n = \min\{k : 1 \le k \le K_n, N_n \subseteq J_k\} \quad (\min \emptyset = K_n), \qquad (2.51)
\hat{k}_n = \arg\min_{1 \le k \le K_n} \mathrm{HDIC}(J_k). \qquad (2.52)

Ing & Lai (2011) showed that by choosing w_n in HDIC to satisfy

w_n \to \infty, \quad w_n \log p_n = o(n^{1-2\gamma}), \qquad (2.53)

the OGA+HDIC scheme is variable-selection consistent in the strongly sparse case.
Theorem 2.6. With the same notation and assumptions as in Theorem 2.5, suppose (2.53) holds, K_n/n^\gamma \to \infty and K_n = O((n/\log p_n)^{1/2}). Then \lim_{n\to\infty} P(\hat{k}_n = k_n) = 1.
Although Theorem 2.6 shows that the minimal number k_n of relevant regressors along the OGA path can be consistently estimated by \hat{k}_n, J_{\hat{k}_n} may still contain irrelevant variables. Ing & Lai (2011) therefore proposed a backward trimming scheme that uses the HDIC criterion to exclude the irrelevant variables, i.e., define a subset \hat{N}_n of J_{\hat{k}_n} by

\hat{N}_n = \{\hat{j}_l : \mathrm{HDIC}(J_{\hat{k}_n} - \{\hat{j}_l\}) > \mathrm{HDIC}(J_{\hat{k}_n}),\ 1 \le l \le \hat{k}_n\} \quad \text{if } \hat{k}_n > 1, \qquad (2.54)

and \hat{N}_n = \{\hat{j}_1\} if \hat{k}_n = 1. Trimming the irrelevant variables requires only the computation of \hat{k}_n - 1 additional least-squares estimates and their associated residual sums of squares \sum_{t=1}^{n} (y_t - \hat{y}_{t; J_{\hat{k}_n} - \{\hat{j}_l\}})^2, 1 \le l < \hat{k}_n, for (2.54), in contrast to the intractable combinatorial optimization problem of choosing the subset with the smallest extended BIC among all non-empty subsets of \{1, \cdots, p_n\}, for which Chen and Chen (2008, Theorem 1) established variable-selection consistency under an “asymptotic identifiability” condition and p_n = O(n^\kappa) for some \kappa > 0. The following theorem of Ing & Lai (2011) establishes the oracle property of the OGA+HDIC+Trim procedure.

Theorem 2.7. Under the same assumptions as in Theorem 2.6, \lim_{n\to\infty} P(\hat{N}_n = N_n) = 1.
A number of recent papers have also studied the OGA approach. Barron, Cohen, Dahmen & DeVore (2008) extended the convergence rates of OGA in noiseless models (i.e., \varepsilon_t = 0 in (1.1)) to regression models by using empirical-process theory, but they require the technical condition |y_t| \le B for some known bound B, since they need this bound to apply empirical-process theory to the sequence of estimates \hat{y}^{(B)}_m(x) = \mathrm{sgn}(\hat{y}_m(x)) \min\{B, |\hat{y}_m(x)|\}. In their work, Barron et al. proposed to terminate OGA after \lfloor na^2 \rfloor iterations for some a \ge 1 and to select the m^* that minimizes \sum_{i=1}^{n} \{y_i - \hat{y}^{(B)}_m(x_i)\}^2 + \kappa m \log n over 1 \le m \le \lfloor na^2 \rfloor, for which they showed that choosing \kappa \ge 2568 B^4(a + 5) yields their convergence result for \hat{y}^{(B)}_{m^*}. In comparison, Theorem 2.4 in Section 2.3.2 does not need this bound condition on |y_t| and has a much sharper convergence rate than that of Barron et al. (2008). Wang (2009) recently proposed using forward stepwise regression to select a manageable subset first, and then choosing variables along an OGA path by using the “extended” BIC (2.55) of Chen & Chen (2008).
\mathrm{BIC}_\gamma(J_m) = n \log \hat\sigma^2_{J_m} + \#(J_m) \log n + 2\gamma \log \tau_{\#(J_m)}, \qquad (2.55)

where \tau_j = \binom{p}{j}, J_m \subset \{1, \cdots, p\} is non-empty, and 0 \le \gamma \le 1.
Wang (2009) only establishes the sure screening property \lim_{n\to\infty} P(N_n \subseteq J_{\hat{m}_n}) = 1, where \hat{m}_n = \arg\min_{1 \le m \le n} \mathrm{EBIC}(J_m); thus Wang (2009) actually uses it to screen variables for a second-stage regression analysis using the Lasso or adaptive Lasso. In addition, Wang (2009) proves the sure screening property under much stronger assumptions than those of Theorem 2.6, such as \varepsilon_t and x_t having normal distributions and a \le \lambda_{\min}(\Sigma_n) \le \lambda_{\max}(\Sigma_n) \le b for some positive constants a and b and all n \ge 1, where \Sigma_n is the covariance matrix of the p_n-dimensional random vector x_t. Forward stepwise regression followed by cross-validation as a screening method in high-dimensional sparse linear models has also been considered by Wasserman & Roeder (2009), who propose using out-of-sample least-squares estimates for the selected model after partitioning the data into a screening group and a remaining group for out-of-sample final estimation. By using OGA+HDIC instead, we can already achieve the oracle property without any further refinement.
For model selection via a nested family of finite-order AR models, Ing & Wei (2005) and Ing (2007) consider the problem of approximating a stationary infinite-order autoregressive process AR(\infty),

y_t = \sum_{j=1}^{\infty} b_j y_{t-j} + \eta_t, \qquad (2.56)

where the b_j are unknown AR coefficients, y_t is a time series, and the \eta_t are random disturbances. They showed that AIC is asymptotically efficient, in the sense that the approximating AR model selected by AIC possesses optimal prediction capability, when the b_j satisfy one of the following sparsity conditions:

L j^{-\gamma} \le |b_j| \le U j^{-\gamma} \quad (\text{the algebraic decay case}), \qquad (2.57)
C_1 \exp(-aj) \le |b_j| \le C_2 \exp(-aj) \quad (\text{the exponential decay case}), \qquad (2.58)

where 0 < L \le U < \infty, 0 < C_1 \le C_2 < \infty, \gamma > 1, and a > 0 are constants. The results of Ing & Lai (2011), together with the asymptotic efficiency of AIC for AR(\infty) processes, inspired us to use HDAIC (2.59),

\mathrm{HDAIC}(J) = n \log \hat\sigma^2_J + c\, \#(J) \log p, \qquad (2.59)

as the high-dimensional version of AIC, to choose models along the OGA path in weakly sparse cases satisfying (2.57) or (2.58), where c is a positive constant and we usually choose c = 2.01.
Chapter 3
Monte Carlo Cross-Validation and Estimation of Prediction Error Distribution
In Chapter 2, we discussed a variety of regularization methods with penalties ranging from L0 to L2, as well as their relative strengths. In particular, we considered OGA+HDIC under strong and weak sparsity assumptions. However, in practice one does not know whether the data-generating mechanism for a particular high-dimensional data set satisfies the sparsity assumptions. Ideally, if we had enough data, we would set aside a test set and use it to assess the prediction performance of a model built from the remaining data, used as a training set. However, we cannot afford such a “luxury” in the p ≫ n case, and we need a procedure that reuses the available data for both training and testing.
CHAPTER 3. MONTE CARLO CROSS-VALIDATION 39
Assume

y_i(x_i) = \beta^\top x_i + \varepsilon_i, \quad \text{where } \beta = (\beta_1, \cdots, \beta_p)^\top,\ x_i = (x_{i,1}, \cdots, x_{i,p})^\top. \qquad (3.1)
Throughout this chapter, we use M to denote a regression method and \hat\beta_{M,n} to denote the estimator of \beta associated with the method M based on a sample of n i.i.d. observations (x_i, y_i), 1 \le i \le n. Let (x_{n+1}, y_{n+1}) be a future pair of covariate x_{n+1} and response y_{n+1}. The prediction error is

e(M; n) = y_{n+1} - \hat\beta^\top_{M,n} x_{n+1} = (\beta - \hat\beta_{M,n})^\top x_{n+1} + \varepsilon_{n+1}. \qquad (3.2)

Let g be a continuous nonnegative function on the real line, e.g., g(e) = |e| or g(e) = e^2. The distribution F^{(M)}_n of g(e(M; n)) can be evaluated by Monte Carlo methods when \beta and the distribution \Psi of (x, \varepsilon) are given.
Given a consistent estimate \hat\beta of \beta, one can use the empirical distribution \hat\Psi of the centered residuals y_i - \hat\beta^\top x_i - (\bar{y} - \hat\beta^\top \bar{x}) to estimate \Psi consistently. Generating i.i.d. random variables \varepsilon^*_i from \hat\Psi and letting y^*_i = \hat\beta^\top x_i + \varepsilon^*_i (i = 1, 2, \ldots, n+1), one can apply the method M to the bootstrap sample \{(x_i, y^*_i), 1 \le i \le n\}, which yields the bootstrap estimate \hat\beta^*_{M,n} and the prediction error e^*(M; n) = y^*_{n+1} - (\hat\beta^*_{M,n})^\top x_{n+1}. The empirical distribution of B bootstrap replicates of g(e^*(M; n)) can then be used to estimate the true distribution F^{(M)}_n.
A key assumption in such asymptotic justifications of the bootstrap, or of other resampling estimates of F^{(M)}_n, is the existence of a consistent estimate \hat\beta of \beta (which may be based on a sample size different from n and a method different from M), so that the unobservable random errors \varepsilon_t, and therefore also the distribution \Psi of (x, \varepsilon), can be estimated consistently. This is not an issue for classical regression theory, which assumes fixed p while n \to \infty. However, it is hard to find a consistent estimate of \beta in the high-dimensional setting p ≫ n considered in this thesis. Since our goal is to evaluate the prediction performance of method M for the given data and a future replicate (x, y), we cannot make a priori assumptions on the sparsity of \beta and the distribution of the unobserved \varepsilon.
In this chapter, we use a Monte Carlo cross-validation (MCCV) approach to address this difficulty by estimating F^{(M)}_{n_t} instead of F^{(M)}_n, where n_t < n is the size of the training samples in cross-validation. We begin with a review of cross-validation and MCCV in the literature.
3.1 Overview of Cross-validation
Cross-validation (CV) is one of the most widely used methods for evaluating a model's prediction performance. As is well known, training an algorithm and evaluating its statistical performance on the same data yields an over-optimistic result. CV was originally developed to fix this issue, starting from the remark that testing the output of the algorithm on new data would yield a better estimate of its performance (Mosteller & Tukey (1968); Stone (1974)). Its basic idea is to split a sample of size n into a training set of size n_t and a test (or validation) set of size n_v = n - n_t.
There are various data-splitting strategies in the literature. Delete-one CV, proposed by Stone (1974), Allen (1974), and Geisser (1975), is the most widely known CV procedure. It corresponds to the choice n_v = 1, with each of the n data points successively left out as the validation set. Thus delete-one CV can be defined as

\mathrm{CV}_q(1) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat\beta^\top_{(-i)} x_i)^2, \qquad (3.3)

where q is the number of covariates in the model and \hat\beta_{(-i)} is the model fitted from the data set that leaves out (y_i, x_i). However, \mathrm{CV}_q(1) is inconsistent for choosing the smallest correct model, as it tends to overfit, and it may also have high variability; see Efron (1986).
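For linear least squares, (3.3) need not be computed with n separate fits: the standard leave-one-out identity \mathrm{CV}(1) = n^{-1} \sum_i \{(y_i - \hat{y}_i)/(1 - h_{ii})\}^2, with h_{ii} the leverages of the hat matrix, recovers it from a single full fit. A minimal numerical check (ours; this shortcut is a classical OLS fact, not specific to the dissertation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 50, 3
X = rng.standard_normal((n, q))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)

# Single full fit: hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.solve(X.T @ X, X.T)
resid = y - H @ y
cv_shortcut = np.mean((resid / (1 - np.diag(H))) ** 2)

# Brute force: refit n times, each time leaving one observation out
errs = []
for i in range(n):
    mask = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    errs.append((y[i] - X[i] @ beta_i) ** 2)
cv_brute = np.mean(errs)

assert np.isclose(cv_shortcut, cv_brute)
```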
Delete-d CV has been proposed to rectify the inconsistency of CV(1). It is also called multifold cross-validation, or MCV. Specifically, MCV is defined for linear regression models as

\mathrm{MCV}_q = \sum_{S} \|y_S - X_S \hat\beta_{(-S)}\|_2^2 \Big/ \Big[d \binom{n}{d}\Big], \qquad (3.4)

where the sum is over all possible subsets S of size d of the n observations, \hat\beta_{(-S)} is the model-based estimate of \beta (under a regression model with q covariates) from the observations not in S, X_S = \{x_i, i \in S\}, and y_S = (y_i, i \in S); see Shao (1993). Assuming fixed p as n \to \infty, Shao (1993) showed that variable-selection consistency of \mathrm{MCV}_q requires n_v/n \to 1 and n - n_v \to \infty as n \to \infty.
Instead of using all subsets of size d as possible test sets as in (3.4), more computationally convenient alternatives to MCV have been proposed. In particular, Breiman, Friedman, Olshen & Stone (1984) considered k-fold cross-validation, which splits the data into k roughly equal-sized groups and uses one subgroup as a test set and the remaining data as the training set. In more detail, let \kappa : \{1, 2, \cdots, n\} \to \{1, 2, \cdots, k\} be an indexing function that indicates the partition to which observation i is randomly allocated, and let \hat{f}^{-k}(x) be the model fitted on the data after removing the kth subgroup. Then the k-fold CV estimate of the mean squared prediction error is

k\text{-CV}_q = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}^{-\kappa(i)}(x_i))^2. \qquad (3.5)
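Equation (3.5) in code, for a linear model fitted by least squares (an illustrative sketch of ours; names are hypothetical):

```python
import numpy as np

def kfold_cv(X, y, k=10, seed=0):
    """k-fold CV estimate (3.5) of the mean squared prediction error."""
    n = len(y)
    kappa = np.random.default_rng(seed).permutation(n) % k   # random fold labels
    sq_err = np.empty(n)
    for fold in range(k):
        test = kappa == fold
        # fit on all observations outside the fold, predict inside it
        beta = np.linalg.lstsq(X[~test], y[~test], rcond=None)[0]
        sq_err[test] = (y[test] - X[test] @ beta) ** 2
    return sq_err.mean()
```

Because every observation is predicted exactly once, the average over the n squared errors is the estimate in (3.5).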
Another alternative to MCV is the Repeated Learning Test (RLT) introduced by Breiman et al. (1984) and further studied by Burman (1989) and Zhang (1993). It is also called Monte Carlo cross-validation (MCCV); see Picard and Cook (1984). Instead of summing over all possible subsets of size d as in MCV, it takes B random subsets S^*_i of size d and estimates the true squared prediction error by

\mathrm{MCCV}_q = \frac{1}{Bd} \sum_{i=1}^{B} \|y_{S^*_i} - \hat{f}^{-S^*_i}(x_{S^*_i})\|^2. \qquad (3.6)

Zhang (1993) proved that, under certain assumptions,

k\text{-CV}_q = \mathrm{MCV}_q + o_p(1) \quad \text{for } q < q_0,

where q_0 is the true number of nonzero coefficients in the linear model. In addition, he showed under certain assumptions that if B/n^2 \to \infty, then

\mathrm{MCCV}_q = \mathrm{MCV}_q + o_p(n^{-1}).

This basically means that we can reduce the computational complexity of the delete-d CV method from exponential to polynomial (just over second order). Furthermore, numerical results in his study show that MCCV (or RLT, as he calls it) performs better than k-CV. Shao (1993, Theorem 2) established variable-selection consistency of MCCV under several assumptions, including

n_v/n \to 1, \quad n_t \to \infty, \quad \text{and} \quad n^2/(B n_t^2) \to 0.
3.2 MCCV Estimate of F^{(M)}_{n_t}
Given the n observations (x_i, y_i), 1 \le i \le n, we sample n_t observations without replacement, where n_t < n. We apply method M to this training sample, denoted by S, to obtain the estimate \hat\beta_{M,n_t} and the prediction errors e_i(M; n_t) = y_i - \hat\beta^\top_{M,n_t} x_i for i \notin S. Repeating this procedure B times yields the set of prediction errors \{g(e_{bi}(M; n_t)) : i \notin S_b, b = 1, 2, \ldots, B\}. The empirical distribution \hat{F}^{(M)}_{n_t} of this set puts weight \{B(n - n_t)\}^{-1} on each prediction error g(e_{bi}(M; n_t)) and is the MCCV estimate of F^{(M)}_{n_t}.
Since S_b is a random sample of size n_t from the set of i.i.d. observations (x_i, y_i), i = 1, 2, \ldots, n, it follows that \{e_{bi}(M; n_t) : i \notin S_b, b = 1, 2, \ldots, B\} is a set of identically distributed random variables having the same distribution as e(M; n_t), which is defined in (3.2) with n replaced by n_t. Hence, the empirical distribution \hat{F}^{(M)}_{n_t} of these B(n - n_t) random variables is an unbiased estimate of F^{(M)}_{n_t}.
3.3 Choice of the Training Sample Size n_t
The choice of the validation sample size n_v = n - n_t in cross-validation has been considered by Shao (1993) and Yang (2008) in the context of consistent variable selection in the regression model (2.1) when n \to \infty but with p fixed. They use the sum of squared prediction errors in multifold or Monte Carlo cross-validation as the model selection criterion. Shao (1993) showed that

n_t \to \infty \quad \text{and} \quad \frac{n_v}{n} \to 1 \qquad (3.7)

is a sufficient condition to ensure consistency, i.e., that the selected model is the minimal correct model (including all nonzero regression coefficients) with probability approaching 1 as n \to \infty. Yang (2008, Proposition 1) later showed that condition (3.7) is also necessary. Since our goal is to estimate F^{(M)}_{n_t} itself, rather than only its second moment, in the case p ≫ n (instead of p fixed as n \to \infty), we consider whether (3.7) ensures that the MCCV estimate \hat{F}^{(M)}_{n_t} is a consistent estimate of F^{(M)}_{n_t}. In the next section, we show that under condition (3.7),

\mathrm{Var}(\hat{F}^{(M)}_{n_t}(A)) \to 0 \quad \text{as } n \to \infty, \qquad (3.8)

uniformly over all Borel subsets A of the real line.
Because p " n, the training sample size nt = o(n) implied by condition (3.7)
causes a large bias in approximating F (M )n by F (M )
nt = E(F (M )nt ). Although one would
like to use nt ∼ n instead to make the bias negligible, the random samples of size
nt drawn without replacement from {(xi, y∗i ), 1 ≤ i ≤ n} have so much overlap
that the effective number of Monte Carlo simulations would be small even though
a large number B of actual Monte Carlo runs are performed. This results in sub-
stantial variance in F (M )nt (A) unless (β − βM ,nt )
"x is small; details are given in next
Section 3.4.
Thus, a suitable choice of $n_t$ amounts to a bias-variance tradeoff in using $F^{(M)}_{n_t}$ to estimate $F^{(M)}_n$. A natural compromise between condition (3.7), which yields large bias and small variance, and the contrary case $n_t \sim n$ is

$n_t \sim (1 - \epsilon)n$, with $0 < \epsilon < 1/2$.   (3.9)

In particular, we recommend using $\epsilon = 0.1$, which corresponds to 10-fold cross-validation for $n \ge 200$. Zhang (1993) has shown that as $n \to \infty$ but with p fixed, the average squared prediction error criterion in multifold or Monte Carlo cross-validation is inconsistent under (3.9), and his Corollary 1 gives the limiting distribution of the selected model. In Section 3.4, we consider the case $p \gg n$ and analyze the performance of $\hat F^{(M)}_{n_t}$ under (3.9). In particular, we prove the following.
Theorem 3.1. Suppose p " n and condition(3.9) holds. If
(β − βM ,nt )"x −→P 0 as nt →∞, (3.10)
i.e., βM ,nt is consistent estimator for β, then (3.8) holds.
Proof. See section 3.4.
3.4 Asymptotic Theory of MCCV
Section 3.3 has used some asymptotic properties of the MCCV estimate $\hat F^{(M)}_{n_t}$ of $F^{(M)}_{n_t}$ to arrive at an appropriate choice, from the bias versus variance viewpoint, of the training sample size $n_t$ for cross-validation. We provide here the technical details for the asymptotic theory. Whereas previous authors, e.g., Shao (1993), considered the case of fixed p and the sum of squared prediction errors for the least squares method M, and thereby could make use of explicit formulas for these squared prediction errors in their analysis, we need more general arguments to handle $p \gg n$ and unspecified methods M.
As shown in Section 3.2, we have $E(\hat F^{(M)}_{n_t}(A)) = F^{(M)}_{n_t}(A)$ for every Borel set A. Therefore, $\hat F^{(M)}_{n_t}(A) - F^{(M)}_{n_t}(A)$ is the average of $B n_v$ zero-mean random variables of the form

$I^b_i = \mathbf{1}\{g(e^b_i(M; n_t)) \in A\} - P\{g(e(M; n_t)) \in A\}$, for $i \notin S_b$ and $1 \le b \le B$.   (3.11)

Although the indicator variables are identically distributed, they are correlated because $S_b$ is a random set of size $n_t$ drawn without replacement from $\{(x_i, y_i), 1 \le i \le n\}$ and $\hat\beta^{(b)}_{M,n_t}$ is estimated from $S_b$, while $e^b_i(M; n_t) = y_i - \hat\beta^{(b)\top}_{M,n_t} x_i$ for $i \notin S_b$.
Suppose (3.7) holds. Since (3.7) is equivalent to $n_t \to \infty$ and $n_t = o(n)$, the sets $S_b$ and $S_{b'}$ are asymptotically independent for $b \ne b'$ as $n \to \infty$. Hence $e^b_i(M; n_t)$ and $e^{b'}_j(M; n_t)$ are asymptotically independent, and therefore $E(I^b_i I^{b'}_j) = o(1)$, uniformly for $b \ne b'$ and $i \ne j$; the uniformity follows from the fact that these are exchangeable random variables. Therefore,

$\sum_{b \ne b'} \sum_{i \ne j} E(I^b_i I^{b'}_j)\big/ (B^2 n_v^2) \to 0$, as $n \to \infty$.   (3.12)

Since $\sum_{b=1}^{B} \sum_{i \ne j} E(I^b_i I^b_j) = O(B n_v^2)$, and a similar bound holds for the sum over $i = j$ and $b \ne b'$, it follows that (3.8) holds under (3.7).
We next consider the case $n_t \sim (1 - \epsilon)n$ with $0 < \epsilon < 1/2$. For $b \ne b'$, there is substantial overlap between $S_b$ and $S_{b'}$; in fact, the size of $S_b \cap S_{b'}$ is at least $(1 - 2\epsilon + o(1))n$. For $i \notin S_b$ and $j \notin S_{b'}$,

$E(I^b_i I^{b'}_j) = P\{g(e^b_i(M; n_t)) \in A,\ g(e^{b'}_j(M; n_t)) \in A\} - P^2\{g(e(M; n_t)) \in A\}.$   (3.13)

By (3.2), we have for $i \notin S_b$ and $j \notin S_{b'}$,

$e^b_i(M; n_t) = (\beta - \hat\beta^{(b)}_{M,n_t})^{\top} x_i + \epsilon_i, \qquad e^{b'}_j(M; n_t) = (\beta - \hat\beta^{(b')}_{M,n_t})^{\top} x_j + \epsilon_j.$

If (3.10) holds, then $e^b_i(M; n_t)$ and $e^{b'}_j(M; n_t)$ are still asymptotically independent for $b \ne b'$ and $i \ne j$. Hence (3.12), and therefore (3.8), still hold, proving Theorem 3.1.

On the other hand, if (3.10) does not hold, then $e^b_i(M; n_t)$ and $e^{b'}_j(M; n_t)$ are correlated because of the overlap between $S_b$ and $S_{b'}$. In this case, because of exchangeability, the left-hand side of (3.12) is bounded away from 0 and (3.8) no longer holds.
For the case of fixed p as n →∞, (3.10) holds if the method M contains all regres-
sors with nonzero regression coefficients. Even though (3.8) holds in this case, the av-
erage squared prediction error criterion in multi-fold or Monte Carlo cross-validation
is still inconsistent, as shown by Zhang (1993), because M may not be associated
with the smallest correct model. The Monte Carlo cross-validation method in this
chapter aims at choosing the regularization method rather than the regularization
parameters as in Zhang (1993), or in the choice of λ in (2.2), (2.3), (2.15) or (2.30),
or in the choice of k in (2.47).
3.5 Comparing the Prediction Performance of Two Methods

Since $\hat F^{(M)}_{n_t}$ uses a training sample of size $n_t$ satisfying condition (3.9), it tends to be stochastically larger than $F^{(M)}_n$ when g(e) is nonnegative and increasing in |e|. To compare the predictive performance of two methods $M_1$ and $M_2$ on a given data set $\{(x_i, y_i), 1 \le i \le n\}$, we can mitigate this upward bias by not estimating $F^{(M_1)}_{n_t}$ and $F^{(M_2)}_{n_t}$ separately. Instead we use MCCV to estimate the distribution of the difference $g(e(M_1; n)) - g(e(M_2; n))$ in prediction errors between $M_1$ and $M_2$. This point will be illustrated in Chapter 4.
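The paired comparison can be sketched as follows: both methods are fit on the same training split and evaluated on the same held-out points, so the common upward bias from $n_t < n$ largely cancels in the difference. The helper name `mccv_spe_difference` is our own; this is a sketch under the assumption that each fit returns a coefficient vector:

```python
import numpy as np

def mccv_spe_difference(X, y, fit1, fit2, n_t, B=100, rng=None):
    """MCCV estimate of the distribution of SPE_{M1} - SPE_{M2}.
    The same random split S_b is reused for both methods, so the
    differences are paired (a hypothetical helper, not the thesis code)."""
    rng = np.random.default_rng(rng)
    n = len(y)
    diffs = []
    for _ in range(B):
        train = rng.choice(n, size=n_t, replace=False)
        test = np.setdiff1d(np.arange(n), train)
        b1 = fit1(X[train], y[train])
        b2 = fit2(X[train], y[train])
        e1 = (y[test] - X[test] @ b1) ** 2      # SPE of method M1
        e2 = (y[test] - X[test] @ b2) ** 2      # SPE of method M2
        diffs.extend(e1 - e2)                   # paired differences
    return np.asarray(diffs)
```

A mostly negative returned sample then indicates that $M_1$ predicts better than $M_2$ on this data set.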
Chapter 4
Simulation Studies
In this chapter, we compare different regularization methods in a variety of scenar-
ios, and illustrate how MCCV can be used to choose an appropriate regularization
when the data-generating mechanism for the scenario is unknown, which is typically
the case in practical applications. The scenario in Section 4.1 corresponds to strong
sparsity and that in Section 4.2 is weakly sparse. The simulation studies show that
the squared prediction errors of OGA+HDBIC and OGA+HDAIC are stochastically
smaller and have tighter distributions than those of other methods. Moreover, the
MCCV estimates of these distributions show the same patterns as the true squared
prediction error distributions. The scenario in Section 4.3 is not weakly sparse and
the simulation results show that OGA+HDIC performs substantially worse than
other regularization methods. This further confirms that the weak sparsity condition
imposed for the theory of OGA+HDIC is critical, and that MCCV can help the user
determine which regularization method should be used for the problem at hand. The
simulation study in Section 4.3 also shows the advantage of the Max1,2 regularization
over other regularization methods in this scenario that is not weakly sparse.
4.1 Strongly Sparse Scenario
Consider the regression model

$y_t = \sum_{j=1}^{q} \beta_j x_{tj} + \sum_{j=q+1}^{p} \beta_j x_{tj} + \epsilon_t, \quad t = 1, \dots, n,$   (4.1)

where $\beta_{q+1} = \cdots = \beta_p = 0$, the $\epsilon_t$ are i.i.d. $N(0, \sigma^2)$ and independent of the i.i.d. vectors $x_t = (x_{t1}, x_{t2}, \dots, x_{tp})$.
Example 4.1. Assume the following in (4.1):

• n = 500, p = 4000, and q = 9, where q is the number of nonzero coefficients.

• $(\beta_1, \dots, \beta_q) = (3.2, 3.2, 3.2, 3.2, 4.4, 4.4, 3.5, 3.5, 3.5)$, and $\beta_j = 0$ for $q < j \le p$.

• $\sigma^2 = 2.25$.

• $x = (x_1, x_2, \dots, x_p) \sim N(\mu, \Sigma)$, with $\mu = (1, 1, \dots, 1)_{1\times p}$ and

$\Sigma = \begin{pmatrix} 1+\eta^2 & \eta & \cdots & \eta \\ \eta & 1+\eta^2 & \cdots & \eta \\ \vdots & \vdots & \ddots & \vdots \\ \eta & \eta & \cdots & 1+\eta^2 \end{pmatrix},$

where η = 1.
In view of Theorem 2.4, which requires the number of iterations $K_n$ to satisfy $K_n = O((n/\log n)^{1/2})$, we choose $K_n = \lfloor 5(n/\log n)^{1/2}\rfloor$ for OGA. Here and in the sequel, we choose the constant C = 2.01 for HDAIC.
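One way to draw data from the Example 4.1 design is sketched below. The factor construction is our own device for sampling from the stated equicorrelated Σ (diagonal $1+\eta^2$, off-diagonal η) without forming the p × p matrix, and `simulate_example_4_1` is a hypothetical helper:

```python
import numpy as np

def simulate_example_4_1(n=500, p=4000, eta=1.0, sigma2=2.25, rng=None):
    """One draw from the Example 4.1 design (requires p >= 9).
    Writing x_tj = 1 + a*z_tj + c*w_t with a^2 = 1 + eta^2 - eta and
    c^2 = eta gives Var(x_j) = 1 + eta^2 and Cov(x_j, x_k) = eta."""
    rng = np.random.default_rng(rng)
    beta = np.zeros(p)
    beta[:9] = [3.2, 3.2, 3.2, 3.2, 4.4, 4.4, 3.5, 3.5, 3.5]
    z = rng.normal(size=(n, p))            # idiosyncratic part
    w = rng.normal(size=(n, 1))            # common factor shared within a row
    x = 1.0 + np.sqrt(1.0 + eta**2 - eta) * z + np.sqrt(eta) * w
    eps = rng.normal(scale=np.sqrt(sigma2), size=n)
    y = x @ beta + eps
    return x, y, beta
```

The factor trick keeps memory at O(np) instead of O(p^2), which matters at p = 4000 and is essentially free for the simulation sizes used in this chapter.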
4.1.1 Sure Screening and Comparison with Other Methods
Under strong sparsity, the theoretical results in Section 2.3.3 show that OGA+HDIC has the “sure screening property”, i.e., it selects all the relevant variables.
The concept of sure screening was introduced by Fan & Lv (2008), who also proposed
a method called “sure independence screening” (SIS) that is based on correlation
learning and has the sure screening property in sparse high-dimensional regression
models satisfying certain conditions. SIS selects d regressors whose sample correlation
coefficients with yt have the largest d absolute values. Although SIS with suitably
chosen d = dn has been shown by Fan and Lv (2008, Section 5) to have the sure
screening property without the irrepresentable (or neighborhood stability) condition
mentioned in section 2.1.1 in connection with Lasso, it requires an assumption on the
maximum eigenvalue of the covariance matrix of the candidate regressors that can
fail to hold when all regressors are equally correlated, as is the case for this example.
Fan & Lv (2010) also proposed an iterative modification of SIS combined with SCAD, called ISIS, for variable selection in sparse linear and generalized linear models. We will
compare the “sure screening property” of OGA+HDIC with that of SIS, SIS+SCAD,
ISIS+SCAD, and Lasso for the present example.
For variable-selection consistency of Lasso, the neighborhood stability condition is required to hold: for some $0 < \delta < 1$ and all $i = q+1, \dots, p$,

$|c_{qi}^{\top} R^{-1}(q)\,\mathrm{sign}(\beta(q))| < \delta$,   (4.2)

where $x_t(q) = (x_{t1}, \dots, x_{tq})^{\top}$, $c_{qi} = E(x_t(q) x_{ti})$, $R(q) = E(x_t(q) x_t^{\top}(q))$, and $\mathrm{sign}(\beta(q)) = (\mathrm{sign}(\beta_1), \dots, \mathrm{sign}(\beta_q))^{\top}$. Straightforward calculations give $c_{qi} = \eta^2 \mathbf{1}_q$, $R^{-1}(q) = I - \{\eta^2/(1 + \eta^2 q)\}\mathbf{1}_q \mathbf{1}_q^{\top}$, and $\mathrm{sign}(\beta(q)) = \mathbf{1}_q$, where $\mathbf{1}_q$ is the q-dimensional vector of 1's. Therefore, for all $i = q+1, \dots, p$, $|c_{qi}^{\top} R^{-1}(q)\,\mathrm{sign}(\beta(q))| = \eta^2 q/(1 + \eta^2 q) < 1$, so (4.2) indeed holds in this example. Under (4.2) and some other conditions, Meinshausen and Buhlmann (2006, Theorems 1 and 2) have shown that if $r = r_n$ in the Lasso estimate (3.21) converges to 0 at a rate slower than $n^{-1/2}$, then $\lim_{n\to\infty} P(L_n = N_n) = 1$, where $L_n$ is the set of regressors whose regression coefficients, as estimated by Lasso($r_n$), are nonzero. On the other hand, Fan & Lv (2008) imposed the condition

$\lambda_{\max}(\Gamma(\{1, \dots, p\})) \le c n^r$, for some $c > 0$ and $0 \le r < 1$,   (4.3)

for SIS to have the sure screening property. Here we have

$\max_{1 \le \#(J) \le \nu} \lambda_{\max}(\Gamma(J)) = (1 + \nu\eta^2)/(1 + \eta^2)$,

which violates (4.3) for $J = \{1, \dots, p\}$ and $p \gg n$.
We run 1000 simulations to evaluate the performance of each method. Following Wang (2009), we define the percentage of correct zeros to characterize a method's ability to produce sparse models:

$\%\ \text{of correct zeros} = \frac{1}{1000}\sum_{b=1}^{1000} \frac{1}{p-q} \sum_{j=1}^{p} \mathbf{1}\{\hat\beta_{j,b} = 0\}\cdot\mathbf{1}\{\beta_j = 0\}$,

where $\hat\beta^{(b)} = (\hat\beta_{1,b}, \dots, \hat\beta_{p,b})$ is the vector of fitted coefficients in the bth replicate. We also define the percentage of incorrect zeros to characterize the degree of under-fitting of the model:

$\%\ \text{of incorrect zeros} = \frac{1}{1000}\sum_{b=1}^{1000} \frac{1}{q} \sum_{j=1}^{p} \mathbf{1}\{\hat\beta_{j,b} = 0\}\cdot\mathbf{1}\{\beta_j \ne 0\}$.
We define the “coverage probability” to gauge the sure screening performance of a method, and the “correctly fitted” probability to measure the chance that the fitted model consists of exactly the relevant variables:

$\text{Coverage probability} = \frac{1}{1000}\sum_{b=1}^{1000}\mathbf{1}\{T \subseteq S(b)\}$,

$\text{Correctly fitted} = \frac{1}{1000}\sum_{b=1}^{1000}\mathbf{1}\{T = S(b)\}$,

where $T = \{1, 2, \dots, q\}$ and $S(b) = \{j : \hat\beta_{j,b} \ne 0\}$.
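The four criteria above can be computed from the selected variable sets as in the following sketch. The helper name `selection_metrics` is our own, and the normalizations (over the p − q true zeros for correct zeros, over the q relevant variables for incorrect zeros) are our reading of the garbled source:

```python
import numpy as np

def selection_metrics(selected_sets, true_support, p):
    """Summarize B replicates of variable selection by the criteria above
    (a hypothetical helper; indices are 0-based sets of selected variables)."""
    T = set(true_support)
    q = len(T)
    B = float(len(selected_sets))
    correct_zero = incorrect_zero = cover = exact = nvars = 0.0
    for S in selected_sets:
        S = set(S)
        zeros = set(range(p)) - S                 # variables estimated as zero
        correct_zero += len(zeros - T) / (p - q)  # true zeros correctly zeroed
        incorrect_zero += len(zeros & T) / q      # relevant variables missed
        cover += float(T <= S)                    # sure screening: T subset of S(b)
        exact += float(T == S)                    # exactly the relevant variables
        nvars += len(S)
    return {"correct_zeros": correct_zero / B,
            "incorrect_zeros": incorrect_zero / B,
            "coverage": cover / B,
            "correctly_fitted": exact / B,
            "avg_vars": nvars / B}
```

For example, with p = 10, true support {0, 1}, and three replicates selecting {0, 1}, {0, 1, 2}, and {0}, the coverage probability is 2/3 and the correctly fitted probability is 1/3.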
Method      η   Cover. prob.   Correct zeros   Incorrect zeros   Correctly fitted   # of vars
SIS         1   0.32           98.5%           12.67%            0.00               66.00
SIS+SCAD    1   0.34           99.51%          12.00%            0.00               27.43
ISIS+SCAD   1   1              98.57%           0.00%            0.00               65.81
OGA+HDBIC   1   1              99.99%           0.00%            0.97                9.30
Lasso       1   1              98.84%           0.00%            0.00               55.10

Table 4.1: Sure screening results for Example 4.1
Table 4.1 shows the sure screening results for SIS, SIS+SCAD, ISIS+SCAD, Lasso, and OGA+HDBIC in Example 4.1. As expected, SIS, which uses a single screening step based on marginal correlations, cannot identify all the relevant variables, since the irrelevant variables are correlated with the relevant ones in this case. On the other hand, OGA+HDBIC, ISIS+SCAD, and Lasso can select all the relevant variables, and OGA+HDBIC has the highest “correctly fitted” probability, which is close to 1.
To evaluate the prediction performance, define

$\text{MSPE} = \frac{1}{1000}\sum_{l=1}^{1000}\Big(\sum_{j=1}^{p} \hat\beta_j x^{(l)}_{n+1,j} - y^{(l)}_{n+1}\Big)^2,$   (4.4)

in which $x^{(l)}_{n+1,1}, \dots, x^{(l)}_{n+1,p}$ are the regressors associated with $y^{(l)}_{n+1}$, the new outcome in the lth simulation run, and $\hat y^{(l)}_{n+1} = \sum_{j=1}^{p}\hat\beta_j x^{(l)}_{n+1,j}$ denotes the predictor of $y^{(l)}_{n+1}$. Table 4.2, which gives the MSPE for OGA+HDIC, ISIS+SCAD, and Lasso with η = 1 and 3, shows that OGA+HDIC performs much better than Lasso and ISIS+SCAD. Moreover, the MSPE of OGA+HDIC is quite close to the oracle value $q\sigma^2/n$, since OGA+HDIC almost always selects exactly the q = 9 relevant variables.
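The oracle benchmark $q\sigma^2/n$ quoted above can be checked by simulation: the sketch below fits ordinary least squares on the q relevant regressors only and averages the excess squared prediction error at a fresh design point over L runs. This is our own sanity check, not the dissertation's code, and `oracle_mspe` is a hypothetical name:

```python
import numpy as np

def oracle_mspe(n=500, q=9, sigma2=2.25, L=500, rng=None):
    """Monte Carlo estimate of the oracle excess prediction error:
    OLS on the q relevant regressors, squared error of the fitted mean
    at a fresh x, averaged over L independent runs (about q*sigma^2/n)."""
    rng = np.random.default_rng(rng)
    beta = rng.uniform(3.0, 4.5, size=q)          # any fixed coefficients work
    errs = []
    for _ in range(L):
        X = rng.normal(size=(n, q))
        y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
        b_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        x_new = rng.normal(size=q)
        errs.append((x_new @ (b_hat - beta)) ** 2)  # excess prediction error
    return float(np.mean(errs))
```

With n = 500, q = 9, and σ² = 2.25 the average comes out near 9 × 2.25/500 ≈ 0.041, consistent with the OGA+HDBIC entries in Table 4.2.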
Method      η   E     E+1   E+2   E+3   Correct   MSPE
OGA+HDBIC   1   982    16     1     1   1000      0.052
ISIS+SCAD   1     0     0     0     0   1000      0.491
Lasso       1     0     0     0     0   1000      0.392
OGA+HDBIC   3   621   313    57     9   1000      0.061
ISIS+SCAD   3    10    25    50    15   1000      0.780
Lasso       3     0     0     0     0   1000      0.530

Table 4.2: MSPE results for Example 4.1, and frequency, in 1000 simulations, of including all 9 relevant variables (Correct), of selecting exactly the relevant variables (E), and of selecting all relevant variables plus i irrelevant variables (E+i).
4.1.2 MCCV Analysis
In this section, we use the MCCV method proposed in Chapter 3 to choose the best among five regularization methods, namely OGA+HDAIC and OGA+HDBIC (L0-regularization), Lasso (L1-regularization), ridge regression (L2-regularization), and Elastic Net. As mentioned in Chapter 3, we aim at estimating the squared prediction error distribution $F^{(M)}_{n_t}$ with $n_t \sim (1-\epsilon)n$ rather than $F^{(M)}_n$. One of our underlying assumptions is that $F^{(M)}_{n_t}$ should be quite close to $F^{(M)}_n$ if $n_t$ is chosen appropriately. This assumption is reasonable for this example with $\epsilon = 0.1$, i.e., $n_t = 450$; Table 4.3 gives the 5-number summaries, together with the means, for the squared prediction error distributions of the different methods with n = 450 and 500.
n = 500      OGA+HDBIC   OGA+HDAIC   Lasso      ENet       Ridge
  Min        2.20e-11    2.20e-11    1.17e-06   7.36e-09   1.09e-03
  1st Qu.    0.00468     0.00565     0.0384     0.0506     10.59
  Median     0.02056     0.02553     0.1769     0.2289     46.94
  3rd Qu.    0.06171     0.07738     0.5109     0.6603     136.65
  Max.       1.049       3.0335      6.8630     8.9433     1124.20
  Mean       0.05165     0.06962     0.3920     0.5039     101.55

n = 450      OGA+HDBIC   OGA+HDAIC   Lasso      ENet       Ridge
  Min        9.43e-11    9.43e-11    5.53e-13   7.36e-09   2.05e-03
  1st Qu.    0.00479     0.00572     0.0399     0.0518     10.79
  Median     0.02106     0.02582     0.1816     0.2346     48.42
  3rd Qu.    0.06441     0.07993     0.5259     0.6792     140.60
  Max.       1.198       3.454       7.711      10.09      993.60
  Mean       0.05214     0.07113     0.4056     0.5204     102.80

Table 4.3: 5-number summary and mean of $F^{(M)}_n$ with n = 500 and 450 in Example 4.1
Figure 4.1, left panel, shows the simulated squared prediction error distributions for the different procedures (except ridge regression, whose much larger values would inflate the vertical scale). These distributions, denoted by $F^{(M)}_{450}$ for each method M, are computed by simulating 1000 replicates of a training sample of size 450 (instead of n = 500), together with an independent observation in each simulation run as the test sample, as in Table 4.3. The boxplots show that OGA+HDBIC works best among the four regularization methods.
Figure 4.1: MCCV performance for Example 4.1 (left panel: simulated squared prediction error distributions; right panel: MCCV estimates; boxplots for OGA+HDBIC, OGA+HDAIC, Lasso, and ENet); the solid line in the center of each box shows its median.
In comparison, we use a particular data set from these 1000 simulated samples of size n = 500 to estimate $F^{(M)}_{450}$ by MCCV with $n_t = 450$ and B = 100, and Figure 4.1, right panel, gives the MCCV estimate of $F^{(M)}_{450}$ for each of the four methods. The boxplots show that the MCCV estimate is close to $F^{(M)}_{450}$ for each method M, and that OGA+HDBIC again works best among the four regularization methods. Table 4.4 gives the 5-number summaries of the MCCV estimates for each of the four regularization methods. Comparison with Table 4.3 shows that the MCCV estimates approximate the true squared prediction error distributions quite well.
Method M   OGA+HDBIC   OGA+HDAIC   Lasso      ENet
Min        1.76e-11    1.76e-11    4.64e-08   2.33e-08
1st Qu.    0.00544     0.00594     0.0398     0.0512
Median     0.02436     0.02648     0.1650     0.2195
3rd Qu.    0.06499     0.07705     0.5145     0.6591
Max.       1.1584      3.316       7.198      9.823
Mean       0.05217     0.06978     0.3979     0.5174

Table 4.4: 5-number summary together with the mean for MCCV estimates of $F^{(M)}_{450}$ in Example 4.1
4.2 Weakly Sparse Scenario
Under weak sparsity, $\sup_{n\ge 1}\sum_{j=1}^{p_n} |\beta_j| < \infty$ and all $\beta_j$ may be nonzero. This includes the algebraic decay and exponential decay cases mentioned at the end of Section 2.3.3.
Consider the regression model

$y_t = \sum_{j=1}^{p} \beta_j x_{tj} + \epsilon_t, \quad t = 1, \dots, n,$   (4.5)

where p > n, the $\epsilon_t$ are i.i.d. $N(0, \sigma^2)$ and independent of $(x_{t1}, x_{t2}, \dots, x_{tp})$, which are i.i.d. $N(0, \Sigma)$ with

$\Sigma = \begin{pmatrix} 1+\eta^2 & \eta & \cdots & \eta \\ \eta & 1+\eta^2 & \cdots & \eta \\ \vdots & \vdots & \ddots & \vdots \\ \eta & \eta & \cdots & 1+\eta^2 \end{pmatrix}.$
Example 4.2. (Exponential decay). Consider (4.5) with the following parameter specifications:

• n = 500, p = 1000 or 2000;

• $\beta_j = \kappa \exp(-a j)$, $j = 1, 2, \dots, p$, where a = 0.25 and κ = 5, 10, or 15;

• σ = 1, η = 0 or 1.

In view of Theorem 2.4, which requires the number of iterations $K_n$ to satisfy $K_n = O((n/\log n)^{1/2})$, we choose $K_n = \lfloor 5(n/\log n)^{1/2}\rfloor$ in OGA, and choose the constant C = 2.01 for HDAIC.
OGA+HDAIC Performance
Table 4.5 gives the mean squared prediction error (MSPE) of OGA+HDAIC and other
regularization methods, based on 1000 simulations. The simulated MSPE is defined
as in (4.4).
As expected from the theory, OGA+HDAIC is significantly better than other
methods. As κ increases, the MSPE of OGA+HDAIC remains nearly constant, while
the MSPEs of other methods increase substantially, especially for Elastic Net and
ridge regression. Note that ridge regression introduces large bias for large coefficients.
η   n     p      κ    OGA+HDAIC   OGA+HDBIC   Lasso     ENet      Ridge
0   500   1000   5    0.0349      0.0756      0.0839    0.1198    2.7851
                 10   0.0432      0.0509      0.0967    0.1422    10.2270
                 15   0.0368      0.0888      0.1032    0.1511    22.4633
          2000   5    0.0420      0.0821      0.1013    0.1477    3.3285
                 10   0.0508      0.0496      0.1164    0.1759    12.6583
                 15   0.0433      0.0923      0.1236    0.1886    28.1109
1   500   1000   5    0.04026     0.07518     0.07351   0.10684   2.77476
                 10   0.04951     0.07093     0.08352   0.12219   10.15975
                 15   0.04326     0.07548     0.08892   0.12987   22.31791
          2000   5    0.05058     0.09533     0.10098   0.14223   3.22291
                 10   0.06034     0.07193     0.10859   0.15902   12.73390
                 15   0.05098     0.09788     0.11503   0.16934   28.09917

Table 4.5: MSPEs of different methods in Example 4.2
MCCV Analysis
Here we focus on the case n = 500, p = 2000, η = 1, and κ = 10, and use MCCV
to choose the best among the five regularization methods mentioned in Section 4.1
and PGA, which we also include since Buhlmann (2006) has shown that it has good
performance in weakly sparse models.
1. MCCV estimates of squared prediction error distributions:
As mentioned in Chapter 3, we aim at estimating the squared prediction error distribution $F^{(M)}_{n_t}$ with $n_t \sim (1-\epsilon)n$ rather than $F^{(M)}_n$. Table 4.6 gives the 5-number summaries, together with the means, of the squared prediction error distributions of the different methods with n = 450 and 500, which shows that $F^{(M)}_{450}$ is quite close to $F^{(M)}_{500}$ in the scenario we consider.
n = 500     OGA+HDAIC   OGA+HDBIC   Lasso      ENet       PGA        Ridge
  Min       1.38e-10    7.19e-10    6.31e-08   2.43e-08   1.75e-09   1.15e-06
  1st Qu.   0.00571     0.00673     0.01093    0.01666    0.01354    1.26794
  Median    0.02620     0.03120     0.04853    0.07117    0.06177    5.70844
  3rd Qu.   0.07740     0.09124     0.14302    0.20821    0.18162    16.86109
  Max.      1.26823     1.59471     1.76220    2.82896    3.13588    198.43241
  Mean      0.06034     0.07193     0.10859    0.15902    0.13939    12.73390

n = 450     OGA+HDAIC   OGA+HDBIC   Lasso      ENet       PGA        Ridge
  Min       2.58e-10    9.16e-10    8.45e-08   4.41e-08   6.77e-09   1.02e-05
  1st Qu.   0.00661     0.00693     0.01181    0.01546    0.01452    1.36701
  Median    0.02712     0.03208     0.05023    0.07403    0.06332    6.10822
  3rd Qu.   0.08134     0.09322     0.15267    0.22149    0.19501    17.06214
  Max.      1.4632      1.7914      2.0322     2.9689     3.2982     200.31251
  Mean      0.06126     0.07393     0.11029    0.16515    0.14291    12.9392

Table 4.6: 5-number summary and mean of $F^{(M)}_n$ with n = 500 and 450 in Example 4.2
Figure 4.2, left panel, shows the simulated squared prediction error distributions for the different regularization methods. These distributions, denoted by $F^{(M)}_{450}$ for each method M, are computed from 1000 replicates of a training sample of size 450 (instead of n = 500), together with an independent observation in each simulation run as the test sample. The boxplots show that OGA+HDAIC works best among the five regularization procedures shown. In comparison, we use one data set from these 1000 simulated samples of size n = 500 to estimate $F^{(M)}_{450}$ by MCCV with $n_t = 450$ and B = 100, and Figure 4.2, right panel, gives the MCCV estimate of $F^{(M)}_{450}$ for each of the five methods. The boxplots show that OGA+HDAIC again works best among the five regularization methods. Table 4.7 gives the 5-number summaries of the MCCV estimates for each of the regularization methods. Comparison with Table 4.6 shows that the MCCV estimate is close to $F^{(M)}_{450}$ for each method M.
Figure 4.2: Boxplots of MCCV performance for Example 4.2 (left panel: squared prediction error distributions; right panel: MCCV estimates; for OGA+HDBIC, OGA+HDAIC, Lasso, ENet, and PGA)
Method     OGA+HDAIC   OGA+HDBIC   Lasso      ENet       PGA        Ridge
Min        1.22e-08    6.24e-09    9.31e-08   4.58e-08   4.74e-09   1.23e-05
1st Qu.    0.00672     0.00721     0.01307    0.01777    0.01794    1.16769
Median     0.02967     0.03307     0.05810    0.08330    0.07344    6.41484
3rd Qu.    0.08506     0.09978     0.16230    0.24772    0.20562    17.38538
Max.       1.72319     1.87355     2.09284    2.87262    2.97496    196.17823
Mean       0.06303     0.07181     0.11512    0.16339    0.14260    12.98136

Table 4.7: 5-number summary for MCCV estimates of $F^{(M)}_{450}$ in Example 4.2
2. Prediction error differences between two procedures:
Figure 4.3: Distribution of $\text{SPE}_{OGA+HDAIC} - \text{SPE}_{Lasso}$ for Example 4.2 (left boxplot: true distribution; right boxplot: MCCV estimate)
We can use the method in Section 3.5 to compare OGA+HDAIC with Lasso in this example. In Figure 4.3, the left boxplot shows the true distribution of the squared prediction error difference $\text{SPE}_{OGA+HDAIC} - \text{SPE}_{Lasso}$, evaluated from 1000 simulations, and the right boxplot shows the MCCV estimate of this distribution based on one simulated data set. Table 4.8 gives the 5-number summaries and the means of the true distribution and its MCCV estimate, showing that MCCV estimates the true distribution well.
           True distribution   MCCV estimate
Min        -2.5967             -2.1382
1st Qu.    -0.1041             -0.1081
Median     -0.0923             -0.0958
3rd Qu.     0.0072              0.0058
Max.        1.2069              0.7398
Mean       -0.0723             -0.0795

Table 4.8: 5-number summary for the distribution of squared prediction error differences between OGA+HDAIC and Lasso, and for its MCCV estimate

3. Comparison of MCCV estimates based on two different data sets generated from the regression model:
A natural question concerning the MCCV estimate is whether different data sets generated from the same regression model yield different conclusions when the squared prediction error distributions are compared via their MCCV estimates. Figure 4.4 considers two such data sets and shows that the comparison of the five regularization methods is stable across the two data sets.
Figure 4.4: MCCV estimates based on two simulated data sets in Example 4.2 (each panel: squared prediction error distributions estimated by MCCV for OGA+HDBIC, OGA+HDAIC, Lasso, ENet, and PGA)
Example 4.3. (Algebraic decay). Consider (4.5) with the following parameter specifications:

• n = 500, p = 1000 or 2000;

• $\beta_j = \kappa j^{-a}$, $j = 1, 2, \dots, p$, where a = 1.5 and κ = 5, 10, or 15;

• σ = 0.1, η = 0 or 1.

Choose $K_n = \lfloor 5(n/\log n)^{1/2}\rfloor$ for OGA and C = 2.01 for HDAIC, as in Example 4.2.
OGA+HDAIC Performance
Table 4.9 gives the mean squared prediction error (MSPE) of OGA+HDAIC and other
regularization methods evaluated from 1000 simulations. The simulated MSPE is
defined as in (4.4).
As expected from the theory, OGA+HDAIC is significantly better than other
methods. In particular, the MSPE of Lasso, which outperforms Elastic Net and ridge
regression in this case, is about twice that of OGA+HDAIC for κ = 5, 10, or 15.
η n p κ OGA + HDAIC OGA + HDBIC Lasso ENet Ridge
0 500 1000 5 0.0522 0.1221 0.0674 0.0754 18.6972
10 0.0987 0.2800 0.1257 0.1351 74.7936
15 0.1425 0.4869 0.2729 0.2872 168.3534
2000 5 0.0554 0.1333 0.0783 0.0909 24.0502
10 0.0956 0.3107 0.1424 0.1619 96.0622
15 0.1513 0.5467 0.2768 0.3085 216.0163
450 2000 10 0.0972 0.3211 0.1503 0.1701 97.1015
1 500 1000 5 0.1713 0.4755 0.2722 0.3307 18.5715
10 0.4879 1.5057 1.2459 1.5048 74.3403
15 1.0294 2.9938 2.9608 3.4607 167.3379
2000 5 0.2069 0.6647 0.3529 0.4737 24.1243
10 0.6027 2.2435 1.5675 1.9517 96.3834
15 1.3987 4.6301 3.6692 4.3859 216.8061
2 500 1000 5 0.1720 0.4844 0.7696 1.0805 18.7922
10 0.4944 1.4459 3.0878 4.9387 75.1914
15 1.0676 3.2402 6.9301 11.2026 169.2467
2000 5 0.2157 0.6985 0.8946 1.3856 24.3712
10 0.6489 2.2645 3.5522 6.0250 97.3697
15 1.4135 4.5187 7.9515 13.5588 219.0319
Table 4.9: MSPE of different methods in Example 4.3
MCCV Analysis
Here we focus on the case n = 500, p = 2000, η = 0, and κ = 10, and use MCCV to choose the best regularization method. First, the MSPE results in Table 4.9 suggest that $F^{(M)}_{450}$ approximates $F^{(M)}_{500}$ well in this case, so the choice $n_t = 450$ for MCCV is reasonable. Next, we consider one particular simulated data set and use it to estimate $F^{(M)}_{450}$ by MCCV with $n_t = 450$ and B = 100. Figure 4.5, right panel, gives the MCCV estimates for the five regularization methods. In comparison, Figure 4.5, left panel, shows the true squared prediction error distributions $F^{(M)}_{450}$ for the different regularization methods, computed from 1000 replicates of a training sample of size 450 (instead of n = 500) together with an independent observation in each simulation run as the test sample. The boxplots show that OGA+HDAIC performs best among the five regularization methods.
Figure 4.5: Boxplot of MCCV performance for Example 4.3
4.3 Scenario without Weak Sparsity
Example 4.4. Consider the same regression model (4.5) as in Section 4.2, in which

$\beta_j = 4$ for $1 \le j \le 25$,
$\beta_j = 4 - (4 - \beta_{200})\,(j - 25)/175$ for $25 < j < 200$,
$\beta_j = j^{-0.6}$ for $200 \le j \le p$,

with

• n = 400, p = 2000,

• $(x_{t1}, x_{t2}, \dots, x_{tp}) \sim_{\text{i.i.d.}} N(0, I_{p\times p})$,

• $\epsilon_t \sim_{\text{i.i.d.}} N(0, 1)$, independent of $(x_{t1}, x_{t2}, \dots, x_{tp})$.

This example does not satisfy the weak sparsity condition, since

$\sum_{j=200}^{p} |\beta_j| = \sum_{j=200}^{p} j^{-0.6} \to \infty$, as $p \to \infty$.

On the other hand, the $\beta_j$ are square-summable, i.e.,

$\sup_p \sum_{j=1}^{p} \beta_j^2 < \infty.$
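These two claims can be checked numerically, using our transcription of the coefficient profile above (the helper name `beta_example_4_4` is hypothetical):

```python
import numpy as np

def beta_example_4_4(p):
    """Coefficient profile of Example 4.4: flat at 4 for j <= 25, a linear
    ramp down to beta_200 = 200**(-0.6) for 25 < j < 200, then j**(-0.6)."""
    beta200 = 200.0 ** -0.6
    j = np.arange(1, p + 1, dtype=float)
    return np.where(j <= 25, 4.0,
           np.where(j < 200, 4.0 - (4.0 - beta200) * (j - 25) / 175.0,
                    j ** -0.6))

# the l1 norm keeps growing with p while the l2 norm stabilizes
for p in (2000, 20000, 200000):
    b = beta_example_4_4(p)
    print(p, np.abs(b).sum(), (b ** 2).sum())
```

The printed sums illustrate the dichotomy: the l1 norm increases without bound in p (like $p^{0.4}$ in the tail), while the l2 norm changes only marginally, consistent with square-summability.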
4.3.1 MSPE Performance

Table 4.10 gives 5-number summaries for the squared prediction error distributions of OGA+HDAIC, OGA+HDBIC, Lasso, Elastic Net, ridge regression, and Max1,2 based on 1000 simulations. The mean in the table corresponds to the simulated MSPE defined in (4.4).
As expected from the theory, OGA+HDAIC performs substantially worse than the other regularization methods, which confirms that the weak sparsity condition is critical for OGA+HDIC to perform well. On the other hand, the Max1,2 regularization performs best.
n = 400      OGA+HDBIC   OGA+HDAIC   Lasso     Ridge    ENet      Max1,2
  Min        0.0006954   0.0006954   0.03269   0.2234   0.08898   0.01591
  1st Qu.    148.7       167.0       113.1     120.1    99.6      93.6
  Median     545.6       615.7       572.6     421.3    515       415.6
  3rd Qu.    2065        1930        1734      1349     1645      1261
  Max.       10590       13390       9816      8162     8999      7125
  Mean       1391        1507        1260      1007     1198      871

n = 360      OGA+HDBIC   OGA+HDAIC   Lasso     Ridge    ENet      Max1,2
  Min        7.60e-04    7.60e-04    0.04511   0.3124   0.09783   0.02061
  1st Qu.    156.2       177.5       119.7     125.5    106.3     99.8
  Median     561.1       634.8       588.3     434.2    524.1     424.3
  3rd Qu.    2098        1973        1745      1383     1671      1288
  Max.       10796       13442       9881      8317     9045      7229
  Mean       1422        1541        1287      1029     1213      894

Table 4.10: 5-number summary and mean for $F^{(M)}_n$ in Example 4.4
4.3.2 MCCV Analysis
Table 4.10 shows that $F^{(M)}_{360}$ is quite close to $F^{(M)}_{400}$, so $n_t = 360$ is a reasonable choice for MCCV in this example. Figure 4.6, left panel, shows the simulated squared prediction error distributions for the six regularization methods. These distributions, denoted by $F^{(M)}_{360}$ for each method M, are computed from 1000 replicates of a training sample of size 360 (instead of n = 400), together with an independent observation in each simulation run as the test sample, as in Table 4.10. The boxplots show that Max1,2 works best among the six regularization methods shown.

Figure 4.6: Boxplot of MCCV performance for Example 4.4

In comparison, we use one simulated data set from these 1000 simulated samples of size n = 400 to estimate $F^{(M)}_{360}$ by MCCV with $n_t = 360$ and B = 100. Figure 4.6, right panel, gives the MCCV estimate of $F^{(M)}_{360}$ for each of the six methods. It shows that the MCCV estimate is close to $F^{(M)}_{360}$ for each method M, and that Max1,2 again works best among these regularization methods. Table 4.11 gives the 5-number summaries of the MCCV estimates for the different regularization methods. Comparison with Table 4.10 shows that the MCCV estimates approximate the true squared prediction error distributions quite well.
Stats      OGA+HDBIC   OGA+HDAIC   Lasso     Ridge    ENet      Max1,2
Min        9.30e-03    9.30e-03    0.03516   0.4102   0.10233   0.02421
1st Qu.    159.5       178.3       121.4     121.3    108.9     97.5
Median     553.3       647.1       593.1     430.1    530.2     421.6
3rd Qu.    2079        1988        1773      1368     1682      1274
Max.       10713       13397       9904      8304     9073      7215
Mean       1411        1562        1293      1019     1225      882

Table 4.11: 5-number summary and mean for MCCV estimates of $F^{(M)}_{360}$ in Example 4.4
Chapter 5
Conclusion
In this thesis, we have developed a novel Monte Carlo cross-validation method to
choose an appropriate regularization method for a given regression data set that has
more regressors than sample cases, i.e., $p \gg n$. This method aims at estimating the squared prediction error distribution of the regularized regression method for a smaller sample size $n_t$ than n. This differs from conventional cross-validation methods
for choosing the tuning parameters of a particular regularization method. There are
also differences in the corresponding asymptotic theorems, as shown in Section 3.1
and Section 3.4.
The simulation studies in Chapter 4 have shown that, with nt suitably chosen,
the Monte Carlo cross-validation estimate provides a good approximation to the ac-
tual squared prediction error distribution, and is therefore a reliable tool to determine
which regularization method should be used for the high-dimensional regression prob-
lem at hand.
Another contribution of this thesis is a new regularization method Max1,2 intro-
duced in Section 2.2. We have derived it as a natural alternative to the Elastic Net
that combines L1-regularization and L2-regularization, and have provided an efficient
exact solution and an approximate pathwise solution. The example in Section 4.3
has also shown that it outperforms the Elastic Net and other regularization methods.
Although we have some heuristic explanation of why Max1,2 works in that scenario,
a comprehensive theory of Max1,2 is lacking and will be a project of future research.
Bibliography
Akaike, H. (1973), Information theory and an extension of the maximum likelihood
principle, in ‘Proc. of the 2nd Int. Symp. on Information Theory’, Akademiai
Kiado, pp. 267–281.
Allen, D. M. (1974), ‘The relationship between variable selection and data augmen-
tation and a method for prediction’, Technometrics 16(1), 125–127.
Antoniadis, A. & Fan, J. (2001), ‘Regularization of wavelet approximations’, Journal
of the American Statistical Association 96(455), 939–967.
Barron, A., Cohen, A., Dahmen, W. & DeVore, R. (2008), 'Approximation and learn-
ing by greedy algorithms', Annals of Statistics 36(1), 64–94.
Bickel, P. J., Ritov, Y. & Tsybakov, A. (2009), 'Simultaneous analysis of Lasso and
Dantzig selector', Annals of Statistics 37, 1705–1732.
Boyd, S., Parikh, N., Chu, E., Peleato, B. & Eckstein, J. (2010), 'Distributed opti-
mization and statistical learning via the alternating direction method of multipliers',
Foundations and Trends in Machine Learning 3(1), 1–122.
Boyd, S. & Vandenberghe, L. (2004), Convex Optimization, Cambridge University
Press.
Breiman, L. (1998), ‘Arcing classifiers’, Annals of Statistics 26(3), 801–849.
Breiman, L. (1999), ‘Prediction games and arcing algorithms’, Neural Comput.
11, 1493–1517.
Breiman, L., Friedman, J. H., Olshen, R. A. & Stone, C. J. (1984), Classification and
Regression Trees, Wadsworth, Belmont, CA.
Buhlmann, P. (2006), 'Boosting for high-dimensional linear models', Annals of Statistics
34(2), 559–583.
Buhlmann, P. & Hothorn, T. (2010), ‘Twin boosting: improved feature selection and
prediction’, Stat. Comput. 20, 119–138.
Buhlmann, P. & Yu, B. (2003), 'Boosting with the L2 loss: Regression and classifica-
tion', Journal of the American Statistical Association 98, 324–339.
Burman, P. (1989), ‘A comparative study of ordinary cross-validation, v -fold cross-
validation and the repeated learning-testing methods’, Biometrika 76, 503–514.
Candes, E. & Tao, T. (2007), 'The Dantzig selector: Statistical estimation when p is
much larger than n', Annals of Statistics 35, 2313–2351.
Chen, J. & Chen, Z. (2008), ‘Extended bayesian information criteria for model selec-
tion with large model spaces’, Biometrika 95(3), 759–771.
Donoho, D. L. & Elad, M. (2003), 'Optimally sparse representation in general
(nonorthogonal) dictionaries via ℓ1 minimization', Proceedings of the National
Academy of Sciences of the USA 100(5), 2197–2202.
Donoho, D. L., Elad, M. & Temlyakov, V. N. (2006), ‘Stable recovery of sparse over-
complete representations in the presence of noise’, IEEE Transactions on Infor-
mation Theory 52(1), 6–18.
Donoho, D. L. & Johnstone, I. M. (1994), ‘Ideal spatial adaptation by wavelet shrink-
age’, Biometrika 81(3), 425–455.
Efron, B. (1986), ‘How biased is the apparent error rate of a prediction rule?’, Journal
of the American Statistical Association 81(394), 461–470.
Efron, B., Hastie, T., Johnstone, I. & Tibshirani, R. (2004), ‘Least angle regression
(with discussion)’, Annals of Statistics 32, 407–451.
Fan, J. & Li, R. (2001), ‘Variable selection via nonconcave penalized likelihood and its
oracle properties’, Journal of the American Statistical Association 96(456), 1348–
1360.
Fan, J. & Lv, J. (2008), ‘Sure independence screening for ultrahigh dimensional feature
space’, Journal Of The Royal Statistical Society Series B 70(5), 849–911.
Fan, J. & Lv, J. (2010), ‘A selective overview of variable selection in high dimensional
feature space’, Statistica Sinica 20(1), 1–44.
Frank, I. E. & Friedman, J. H. (1993), ‘A statistical view of some chemometrics
regression tools’, Technometrics 35(2), 109–135.
Freund, Y. (1995), ‘Boosting a weak learning algorithm by majority’, Inf. Comput.
121, 256–285.
Freund, Y. & Schapire, R. (1996a), Experiments with a new boosting algorithm, in
‘ICML’, pp. 148–156.
Freund, Y. & Schapire, R. E. (1996b), 'Game theory, on-line prediction and boosting',
in 'Proceedings of the Ninth Annual Conference on Computational Learning Theory
(COLT '96)', pp. 325–332.
Friedman, J. H. (1999), ‘Stochastic gradient boosting’, Computational Statistics and
Data Analysis 38, 367–378.
Friedman, J. H. (2001), ‘Greedy function approximation: A gradient boosting ma-
chine’, Annals of Statistics 29, 1189–1232.
Friedman, J. H. (2008), 'Fast sparse regression and classification', Technical report,
Department of Statistics, Stanford University.
Friedman, J., Hastie, T., Hofling, H. & Tibshirani, R. (2007), ‘Pathwise coordinate
optimization’, Annals of Applied Statistics 1(2), 302–332.
Friedman, J., Hastie, T. & Tibshirani, R. (2010), ‘Regularization paths for generalized
linear models via coordinate descent’, Journal of Statistical Software 33(1), 1–22.
Geisser, S. (1975), ‘The predictive sample reuse method with applications’, Journal
of the American Statistical Association 70(350), 320–328.
Hannan, E. J. & Quinn, B. G. (1979), ‘The determination of the order of an au-
toregression’, Journal of the Royal Statistical Society, Series B (Methodological)
41(2), 190–195.
Hastie, T., Tibshirani, R. & Friedman, J. H. (2001), The Elements of Statistical Learn-
ing: Data Mining, Inference, and Prediction, Springer-Verlag, New York.
Hoerl, A. E. & Kennard, R. W. (1970), ‘Ridge regression: Biased estimation for
nonorthogonal problems’, Technometrics 12(1), 55–67.
Huang, J., Horowitz, J. L. & Ma, S. (2008), ‘Asymptotic properties of bridge es-
timators in sparse high-dimensional regression models’, Annals of Statistics
36(2), 587–613.
Huang, J., Ma, S. & Zhang, C. H. (2008), 'Adaptive lasso for sparse high-dimensional
regression models', Statistica Sinica 18(4), 1603–1618.
Ing, C. K. (2007), ‘Accumulated prediction errors, information criteria and optimal
forecasting for autoregressive time series’, Annals of Statistics 35, 1238–1277.
Ing, C.-K. & Lai, T. L. (2011), ‘A stepwise regression method and consistent model se-
lection for high-dimensional sparse linear models’, Statistica Sinica 21(4), 1473–
1513.
Ing, C. K. & Wei, C. Z. (2005), ‘Order selection for the same-realization prediction
in autoregressive processes’, Annals of Statistics 33, 2423–2474.
Knight, K. & Fu, W. (2000), ‘Asymptotics for lasso-type estimators’, Annals of Statis-
tics 28(5), 1356–1378.
Leng, C., Lin, Y. & Wahba, G. (2006), 'A note on the lasso and related procedures
in model selection', Statistica Sinica 16(4), 1273–1284.
Liu, Y. & Wu, Y. (2007), ‘Variable selection via a combination of the l0 and l1 penal-
ties’, Journal Of Computational And Graphical Statistics 16(4), 782–798.
Meinshausen, N. (2007), ‘Relaxed lasso’, Computational Statistics & Data Analysis
52(1), 374–393.
Meinshausen, N. & Buhlmann, P. (2006), 'High-dimensional graphs and variable
selection with the lasso', Annals of Statistics 34(3), 1436–1462.
Meinshausen, N. & Buhlmann, P. (2010), 'Stability selection (with discussion)', Journal
of the Royal Statistical Society Series B 72(4), 417–473.
Mosteller, F. & Tukey, J. W. (1968), Data analysis, including statistics, in G. Lindzey
& E. Aronson, eds, ‘Handbook of Social Psychology, Vol. 2’, Addison-Wesley.
Owen, A. B. (2006), ‘A robust hybrid of ridge and lasso penalized regression’, Dis-
covery pp. 1–24.
Politis, D. N., Romano, J. P. & Wolf, M. (1999), Subsampling (Springer Series in
Statistics), Springer.
Schapire, R. (1990), ‘The strength of weak learnability’, Mach. Learn. 5, 197–227.
Schapire, R. E. & Singer, Y. (1999), Improved boosting algorithms using confidence-
rated predictions, in ‘Machine Learning’, pp. 80–91.
Schwarz, G. (1978), 'Estimating the dimension of a model', Annals of Statistics 6(2), 461–464.
Shao, J. (1993), ‘Linear model selection by cross-validation’, Journal of the American
Statistical Association 88(422), 486–494.
Stone, M. (1974), 'Cross-validatory choice and assessment of statistical predictions',
Journal of the Royal Statistical Society, Series B 36, 111–147.
Temlyakov, V. N. (2000), 'Weak greedy algorithms', Advances in Computational Mathematics 12, 213–227.
Tibshirani, R. J. (1996), ‘Regression shrinkage and selection via the lasso’, Journal
of the Royal Statistical Society, Series B 58(1), 267–288.
Tropp, J. A. (2004), ‘Greed is good: Algorithmic results for sparse approximation’,
IEEE Trans. Inform. Theory 50, 2231–2242.
Tropp, J. A. & Gilbert, A. C. (2007), 'Signal recovery from random measurements
via orthogonal matching pursuit', IEEE Trans. Inform. Theory 53, 4655–4666.
Valiant, L. G. (1984), ‘A theory of the learnable’, Commun. ACM 27, 1134–1142.
Wang, H. (2009), ‘Forward regression for ultra-high dimensional variable screening’,
Journal of the American Statistical Association 104(488), 1512–1524.
Wasserman, L. & Roeder, K. (2009), 'High-dimensional variable selection', Annals of
Statistics 37, 2178–2201.
Yang, Y. (2008), ‘Consistency of cross validation for comparing regression procedures’,
Annals of Statistics 35(6), 2450–2473.
Zhang, C.-H. & Huang, J. (2008), ‘The sparsity and bias of the lasso selection in
high-dimensional linear regression’, Annals of Statistics 36(4), 1567–1594.
Zhang, P. (1993), ‘Model selection via multifold cross validation’, Annals of Statistics
21(1), 299–313.
Zhao, P. & Yu, B. (2006), ‘On model selection consistency of lasso’, Journal of Ma-
chine Learning Research 7(11), 2541–2563.
Zhou, S., Van De Geer, S. & Buhlmann, P. (2009), 'Adaptive lasso for high dimen-
sional regression and Gaussian graphical modeling', preprint p. 30.
Zou, H. (2006), ‘The adaptive lasso and its oracle properties’, Journal of the American
Statistical Association 101(476), 1418–1429.
Zou, H. & Hastie, T. (2005), ‘Regularization and variable selection via the elastic
net’, Journal of the Royal Statistical Society - Series B: Statistical Methodology
67(2), 301–320.
Zou, H. & Li, R. (2008), ‘One-step sparse estimates in nonconcave penalized likelihood
models’, Annals of Statistics 36(4), 1509–1533.