CROSS-VALIDATION AND REGRESSION ANALYSIS IN
HIGH-DIMENSIONAL SPARSE LINEAR MODELS
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF STATISTICS
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Feng Zhang
Aug 2011
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/yw320tk7289
© 2011 by Feng Zhang. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Tze Lai, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Balakanapathy Rajaratnam
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Nancy Zhang
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
Abstract
Modern scientific research often involves experiments with at most hundreds of sub-
jects but with tens of thousands of variables for every subject. The challenge of
high dimensionality has reshaped statistical thinking and modeling. Variable selec-
tion plays a pivotal role in high-dimensional data analysis, and the combination
of sparsity and accuracy is crucial for statistical theory and practical applications.
Regularization methods are attractive for tackling these sparsity and accuracy issues.
The first part of this thesis studies two regularization methods. First, we consider
the orthogonal greedy algorithm (OGA) used in conjunction with a high-dimensional
information criterion introduced by Ing & Lai (2011). Although it has been shown to
have excellent performance for weakly sparse regression models, one does not know a
priori in practice that the actual model is weakly sparse, and we address this problem
by developing a new cross-validation approach. OGA can be viewed as L0 regular-
ization for weakly sparse regression models. When such sparsity fails, as revealed by
the cross-validation analysis, we propose to use a new way to combine L1 and L2
penalties, which we show to have important advantages over previous regularization
methods.
The second part of the thesis develops a Monte Carlo Cross-Validation (MCCV)
method to estimate the distribution of out-of-sample prediction errors when a training
sample is used to build a regression model for prediction. Asymptotic theory and
simulation studies show that the proposed MCCV method mimics the actual (but
unknown) prediction error distribution even when the number of regressors exceeds
the sample size. Therefore MCCV provides a useful tool for comparing the predictive
performance of different regularization methods for real (rather than simulated) data
sets.
Acknowledgements
First, I would like to express my deepest gratitude to my advisor, Professor Tze
Leung Lai. He not only suggested the research topic, but also helped me in structuring,
modifying and enriching the contents, which have added enormous value to the quality
of this work. It has been my honor to have had the opportunity to work with him for
four years at Stanford, and I have benefited greatly from his deep understanding
of the field, his integral view on research, his patience to listen to my ideas and answer
my questions, and his encouragement. I would not be able to complete this dissertation
without his immense support and invaluable advice throughout my doctoral study.
I would like to thank Professor Jerry Friedman for giving me valuable suggestions
both on my research and study. I am also deeply grateful to him and to Professor
Guenther Walther and Professor Chiara Sabatti for serving on my oral examination
committee, and to Professor Nancy Zhang and Professor Bala Rajaratnam for serving
on my reading committee in addition to the oral committee, for their thoughtful
suggestions and constructive feedback on my research. I am much indebted to Tong Xia
who helped me to proofread my thesis. I also want to thank Professor Ching-Kang
Ing of Academia Sinica for sharing with me his ongoing research and deep insights in
the subject of model selection.
I am grateful to many friends at Stanford, for their invaluable friendship and enor-
mous support. Among them are Li Ma, Waiwai Liu, Ya Xu, Shaojie Deng, Camilo
Rivera, Justin Dyer, Murat Ahmed, Genevera Allen, Patrick Perry, Victor Hu, Kshi-
tij Khare, Zehao Chen, Yiyuan She, Ling Chen, Kevin Sun, Zongming Ma, Yifang
Chen, Paul Pong, Anwei Chai, Lei Zhao, Xiaoye Jiang, Xin Zhou, Yanxin Shi and
Hongsong Yuan. This is, however, far from a complete list of names. Their stimulating
conversations, constant encouragement and great company have made my last four
years full of joy and happiness, and our friendship will be one of the best treasures
in my life. I would also like to thank the staff in the Department of Statistics, who have
made Sequoia Hall the best place to work. Thank you all.
Last but not least, I would like to thank my wife, Yan Zhai, and my parents. Their
enduring love and support are much more than I could ever be able to acknowledge.
I would like to dedicate this piece of work to them.
Contents
Abstract iii
Acknowledgements v
1 Introduction 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Regularization in High-dimensional Regression 4
2.1 Regularization via Penalized Least Squares . . . . . . . . . . . . . . . 5
2.1.1 L1 regularization: Lasso . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 L2 Regularization: Ridge Regression . . . . . . . . . . . . . . . 8
2.1.3 Lq Regularization: Bridge Regression . . . . . . . . . . . . . . 11
2.1.4 Refinements: Adaptive Lasso & SCAD . . . . . . . . . . . . . 11
2.2 Elastic Net and a New Approach . . . . . . . . . . . . . . . . . . . . 16
2.2.1 Implementation of Max1,2 Regularization . . . . . . . . . . . . . 21
2.3 L0-Regularization: Orthogonal Greedy Algorithm . . . . . . . . . . . 27
2.3.1 OGA and Gradient Boosting . . . . . . . . . . . . . . . . . . . 27
2.3.2 OGA+HDIC . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.3.3 Variable Selection Consistency under Strong Sparsity . . . . . 33
3 Monte Carlo Cross-Validation 38
3.1 Overview of Cross-validation . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 MCCV Estimate of $F_{n_t}^{(M)}$ . . . . . . . . . . . . . . . . . . . . . 43
3.3 Choice of the Training Sample Size $n_t$ . . . . . . . . . . . . . . . . 43
3.4 Asymptotic Theory of MCCV . . . . . . . . . . . . . . . . . . . . . . 45
3.5 Comparing the Prediction Performance of Two Methods . . . . . . . 48
4 Simulation Studies 49
4.1 Strongly Sparse Scenario . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.1.1 Sure Screening and Comparison with Other Methods . . . . . 51
4.1.2 MCCV Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2 Weakly Sparse Scenario . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3 Scenario without Weak Sparsity . . . . . . . . . . . . . . . . . . . . . 67
4.3.1 MSPE performance . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3.2 MCCV Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5 Conclusion 70
Bibliography 72
List of Tables
4.1 Sure screening results for Example 4.1 . . . . . . . . . . . . . . . . . 53
4.2 MSPE results for Example 4.1, and frequency, in 1000 simulations,
of including all 9 relevant variables (Correct), of selecting exactly the
relevant variables (E), of selecting all relevant variables and i irrelevant
variables (E+i). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3 5-number summary and mean of $F_n^{(M)}$ with n = 500 and 450 in Example 4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4 5-number summary together with mean for MCCV estimates of $F_{450}^{(M)}$ in Example 4.1 . . . . . . . . . . . . . . . . . . . . . . . . 57
4.5 MSPEs of different methods in Example 4.2 . . . . . . . . . . . . . . 59
4.6 5-number summary and mean of $F_n^{(M)}$ with n = 500 and 450 in Example 4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.7 5-number summary for MCCV estimates of $F_{450}^{(M)}$ in Example 4.2 . . 61
4.8 5-number summary for the distribution of squared prediction error differences between OGA+HDAIC and Lasso, and for its MCCV estimate . . . 63
4.9 MSPE of different methods in Example 4.3 . . . . . . . . . . . . . . . 65
4.10 5-number summary and mean for $F_n^{(M)}$ in Example 4.4 . . . . . . . 68
4.11 5-number summary and mean for MCCV estimates of $F_{360}^{(M)}$ in Example 4.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
List of Figures
2.1 Contour for different penalties, left to right, L1, L2, and Max1,2 with
ρ = 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Shrinkage effects of different penalties . . . . . . . . . . . . . . . . . . 20
4.1 MCCV performance for Example 4.1; the solid line in the center
of each box shows the corresponding median. . . . . . . . . . . . . . 56
4.2 Boxplots of MCCV performance for Example 4.2 . . . . . . . . . . . 61
4.3 Distribution of SPEOGA+HDAIC − SPELasso for Example 4.2 . . . . 62
4.4 MCCV estimates based on two simulated data sets in Example 4.2 . 63
4.5 Boxplot of MCCV performance for Example 4.3 . . . . . . . . . . . 66
4.6 Boxplot of MCCV performance for Example 4.4 . . . . . . . . . . . 69
Chapter 1
Introduction
1.1 Introduction
High-dimensional statistical learning problems arise from diverse fields of modern
scientific research and various engineering areas. With recent advances in processing
power, storage capacity, and cloud computing technology, massive amounts of data can
be collected at relatively low cost, and plenty of examples of such high-dimensional
data sets can be found at the frontiers of scientific research, such as microarrays in
computational biology, longitudinal data in economics, and high-frequency data in
financial markets. Meanwhile, the methodology for high-dimensional data analysis
has become increasingly important and also become one of the most active areas in
statistical research.
In traditional statistical theory, it is assumed that the number of observations
n is much larger than the number of variables or parameters, so that large-sample
asymptotic theory can be used to derive procedures and analyze their statistical
accuracy and interpretability. For high-dimensional data, this assumption is violated
as the number of variables exceeds the number of observations. Analysis of these data
has reshaped statistical thinking.
Variable selection is an important idea in statistical learning with high-dimensional
data. In many applications, the response variable of interest is related to only a rela-
tively small number of predictors from a large pool of possible candidates. For exam-
ple, in gene expression studies using microarrays, the expressions of tens of thousands
of molecules are potential predictors, but only a few of them are important variables
that are truly related to the disease. How to identify the “sparse” set of relevant variables
in high-dimensional settings has become a fundamental challenge.
Regularization methods are attractive in high-dimensional modeling and predic-
tive learning, and there is a rapidly growing literature devoted to sparse
variable selection by regularization methods under different assumptions. It will be
shown in Chapter 2 and Chapter 4 that different regularization methods have advan-
tages in different situations. Thus it is a crucial challenge to determine whether the
underlying assumptions hold for a particular regularization method for the problem
at hand, for which one typically does not know the actual data generating mechanism.
1.2 Outline
This thesis addresses the challenges mentioned above in the context of regression
on high-dimensional input vectors, and is organized as follows.
In the first part of Chapter 2, we give an overview of different regularization meth-
ods, including Lasso (L1), Ridge Regression (L2), Bridge Regression (Lq), and their
refinements like Adaptive Lasso and SCAD. In the second part of Chapter 2, inspired
by the Elastic Net, we propose a new approach that takes advantage of both L1 and L2
regularization (called Max1,2), and give both an efficient exact solution algorithm and
a fast pathwise approximation algorithm. In the third part of Chapter 2, we consider
a particular method to implement L0 regularization (the Orthogonal Greedy
Algorithm of Ing & Lai (2011)) and characterize its statistical properties in weakly
sparse regression models. Thus, we have added two new methods (OGA and Max1,2)
to our arsenal of regularization methods.
Since we usually do not know the underlying sparsity condition for a particular
application in practice, we have to choose a good candidate from our arsenal of meth-
ods. In Chapter 3, we introduce a Monte Carlo Cross-Validation (MCCV) method
to compare different procedures. We also establish attractive asymptotic theory for
the MCCV method when p ≫ n.
In Chapter 4, we report simulation studies of different regularization methods
and also of MCCV to support the theoretical analysis in Chapter 2 and Chapter 3.
Chapter 5 gives further discussions and concluding remarks.
Chapter 2
Regularization in High-dimensional
Regression
We begin this chapter with a brief review, in Section 2.1, of existing regularization
methods including Lasso, Ridge Regression, Bridge Regression and their refinements.
In Section 2.2, we introduce a new regularization method to combine L1 and L2
penalties, and provide fast algorithms for both exact solution and an approximate
pathwise solution. In Section 2.3, we consider the Orthogonal Greedy Algorithm
(OGA) and a high-dimensional information criterion recently proposed by Ing & Lai
(2011) for L0 regularization; L0 uses the number of nonzero estimated regression
coefficients to penalize ordinary least squares. Throughout this chapter, we consider
the linear regression model:
$$y_t = \alpha + \sum_{j=1}^{p} \beta_j x_{tj} + \varepsilon_t, \qquad t = 1, 2, \cdots, n, \qquad (2.1)$$
with p predictor variables (xt1, xt2, · · · , xtp) that are uncorrelated with the mean-zero
random disturbances εt.
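Model (2.1) with a sparse coefficient vector is easy to simulate, which is how scenarios like those of Chapter 4 are built. The sketch below is purely illustrative (the dimensions, seed, and coefficient values are our own choices, not taken from the text):

```python
import random

def simulate_sparse_model(n=100, p=500, n_relevant=5, seed=0):
    """Draw (X, y) from y_t = alpha + sum_j beta_j * x_tj + eps_t,
    where only n_relevant of the p coefficients are nonzero."""
    rng = random.Random(seed)
    alpha = 1.0
    beta = [2.0] * n_relevant + [0.0] * (p - n_relevant)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [alpha + sum(b * x for b, x in zip(beta, row)) + rng.gauss(0.0, 1.0)
         for row in X]
    return X, y, beta

X, y, beta = simulate_sparse_model()
print(len(X), len(X[0]), sum(b != 0.0 for b in beta))  # -> 100 500 5
```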
2.1 Regularization via Penalized Least Squares
2.1.1 L1 regularization: Lasso
Lasso
Tibshirani (1996) proposed the Lasso method, which uses an L1 penalty in place of the L0
penalty. Under the ordinary regression setting, the objective function for Lasso is
$$-\ell_n(\beta) + \lambda P_1(\beta) = \sum_{t=1}^{n}\Big(y_t - \sum_{j=1}^{p}\beta_j x_{tj}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|. \qquad (2.2)$$
As its name “Least Absolute Shrinkage and Selection Operator” signifies, the objec-
tive of Lasso (the abbreviated name) is to retain good features of both subset selection
which uses L0-penalty and ridge regression which uses L2-penalty by shrinking some
coefficients while setting others to 0, and thereby to produce a model with inter-
pretability similar to that produced by subset selection and with stability similar to
that produced by ridge regression. The Lasso penalty uses the smallest q for which
the Lq-penalty is convex, so the optimization problem can be solved by convex op-
timization techniques. There is also theoretical support for the method by Donoho &
Johnstone (1994), Donoho & Elad (2003), Candes & Tao (2007) and Bickel, Ritov
& Tsybakov (2009).
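Under an orthonormal design ($\mathbf{X}^\top\mathbf{X} = \mathbf{I}$), the minimizer of (2.2) is obtained coordinatewise by soft-thresholding the OLS estimates at $\lambda/2$, a standard fact that makes the "shrink some, zero others" behavior concrete. A minimal sketch (the numerical values are illustrative):

```python
def soft_threshold(b_ols, lam):
    """Coordinatewise Lasso solution under an orthonormal design:
    sgn(b)(|b| - lam/2)_+ minimizes (b - beta)^2 + lam * |beta|."""
    t = abs(b_ols) - lam / 2.0
    if t <= 0.0:
        return 0.0
    return t if b_ols > 0 else -t

beta_ols = [3.0, -0.4, 1.2, 0.1]
print([soft_threshold(b, lam=1.0) for b in beta_ols])
# small coefficients are set exactly to 0, large ones are shrunk by lam/2
```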
However, Lasso shrinkage introduces bias for the estimates of the non-zero coeffi-
cients. Concerning the shrinkage of Lasso as a tool for variable selection, there is an
extensive literature on variable selection consistency of Lasso. Leng, Lin & Wahba
(2006) have shown that Lasso is not variable-selection consistent when prediction ac-
curacy is used as the criterion for choosing the penalty λ in (2.2). Zhao & Yu (2006)
pointed out “obtaining (sparse) models through classical model selection methods
usually involves heavy combinatorial search”, and Lasso “provides a computationally
feasible way for model selection”. Noting that Lasso can fail to distinguish irrelevant
predictors that are highly correlated with relevant predictors, they proposed a “strong
irrepresentable condition” that is sufficient for Lasso to select the true model both in
the classical fixed p setting and in the large p setting as the sample size n gets large,
i.e., pn = O(nκ) for some κ > 0. This work is closely related to that of Donoho,
Elad & Temlyakov (2006) where they proved that under a “coherence condition”, the
Lasso solution identifies the correct predictors in a sparse model with high probabil-
ity. Under a “sparse Riesz condition”, Zhang & Huang (2008) have studied sparsity
and bias properties of Lasso-based model selection methods.
LARS
In statistical applications, one often wants to compute the entire solution path for
tuning purposes, not just one solution for a single λ value. This has led to algorithms
to approximate the whole Lasso path along with the shrinkage parameter λ, including
LARS and the coordinate descent algorithm.
Least Angle Regression (LARS) was introduced by Efron et al. (2004) as a fast
algorithm for variable selection, a simple modification of which can produce the entire
Lasso solution path {β(λ) : λ > 0} that optimizes (2.2). By exploiting the piecewise
linear nature of the Lasso solution path in λ, LARS uses an iterative stepwise strategy,
and at each step enters only as much of a predictor as it deserves. At the first step
it identifies the variable most correlated with the response. Rather than fitting this
variable completely, LARS moves the coefficient of this variable continuously toward
its least squares value (causing its correlation with the evolving residual to decrease
in absolute value). As soon as another variable catches up in terms of correlation with
the residual, this iteration stops and the second variable then enters the active set.
Their coefficients are moved together in a way that keeps their correlations with the
residual equal and decreasing. This process is continued until all the variables are in
the model. See Algorithm 1 for more details, and Algorithm 2 for a modification to
get Lasso path.
Algorithm 1 LARS Algorithm
Step 1 Standardize the predictors to have mean zero and unit norm. Start with the residual r = y − ȳ, and β1 = 0, · · · , βp = 0,
Step 2 Find the predictor xj most correlated with r,
Step 3 Move βj from 0 towards its least-squares coefficient 〈xj, r〉, until some other competitor xk has as much correlation with the current residual as does xj,
Step 4 Move βj and βk in the direction defined by their joint least squares coefficients of the current residual on (xj, xk), until some other competitor xl has as much correlation with the current residual,
Step 5 Continue in this way until all p predictors have been entered. After min(n − 1, p) steps, we arrive at the full least-squares solution.
Remark 2.1. The termination condition in Step 5 requires some explanation. If p > n − 1, the LARS algorithm reaches a zero-residual solution after n − 1 steps (the 1 is because we have centered the data).
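Steps 1 and 2 of Algorithm 1 can be sketched in a few lines; this toy code (illustrative, not a full LARS implementation) standardizes the predictors and picks the one most correlated with the current residual:

```python
import math

def standardize(col):
    """Step 1: center a column to mean zero and scale it to unit norm."""
    m = sum(col) / len(col)
    centered = [v - m for v in col]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]

def most_correlated(X_cols, r):
    """Step 2: index of the predictor maximizing |<x_j, r>|.
    With unit-norm columns this inner product is the correlation."""
    return max(range(len(X_cols)),
               key=lambda j: abs(sum(x * e for x, e in zip(X_cols[j], r))))

# toy data: x2 is strongly aligned with y, x1 is not
x1 = standardize([1.0, 2.0, 3.0, 4.0])
x2 = standardize([1.0, -1.0, 1.0, -1.0])
y = [0.9, -1.1, 1.0, -0.8]
r = [v - sum(y) / len(y) for v in y]   # residual after centering y
print(most_correlated([x1, x2], r))    # -> 1
```

LARS would now move the coefficient of the winning variable toward its least-squares value until a second variable catches up in correlation (Step 3), which requires the equiangular-direction algebra omitted here.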
Algorithm 2 Modified LARS Algorithm for Lasso Path
Step 4a If a nonzero coefficient along the path crosses zero at the ith step, drop the associated variable from the active set of variables and recompute the current joint least squares direction.
2.1.2 L2 Regularization: Ridge Regression
There are two main issues in regression analysis with high-dimensional input vectors:
sparsity and singularity. While the main aim of L1-regularization is to gain a sparse
solution, L2-regularization (known as ridge regression) helps to address the singularity
problem.
The ridge coefficients minimize a penalized residual sum of squares,
$$-\ell_n(\beta) + \lambda P_2(\beta) = \sum_{t=1}^{n}\Big(y_t - \sum_{j=1}^{p}\beta_j x_{tj}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2. \qquad (2.3)$$
Rewriting the criterion (2.3) in matrix form yields
$$\mathrm{RSS}(\lambda) = (\mathbf{Y} - \mathbf{X}\beta)^\top(\mathbf{Y} - \mathbf{X}\beta) + \lambda\beta^\top\beta, \qquad (2.4)$$
whose minimization has the explicit solution
$$\hat{\beta}^{\,ridge} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{Y}. \qquad (2.5)$$
Adding a positive constant to the diagonal elements of $\mathbf{X}^\top\mathbf{X}$ before inversion
makes the problem nonsingular even when $\mathbf{X}^\top\mathbf{X}$ is singular, as happens when p > n. This
helps to solve the singularity problem of the design matrix, which was the main
motivation of Hoerl & Kennard (1970) when they introduced ridge regression into
statistics research.
Ridge regression has two desirable properties:
1. Shrinkage.
To see the shrinkage effect of ridge regression, we use the singular value decomposition (SVD) of the $n \times p$ matrix $\mathbf{X}$:
$$\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^\top, \qquad (2.6)$$
where $\mathbf{D} = \mathrm{diag}(d_1, \cdots, d_{\min(p,n)})$ is a diagonal matrix such that $\{d_i^2 : 1 \le i \le \min(p,n)\}$ contains the positive eigenvalues of $\mathbf{X}^\top\mathbf{X}$.
Using the SVD, we can write the OLS fitted values as
$$\begin{aligned}\mathbf{X}\hat{\beta}^{\,ols} &= \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y} &(2.7)\\ &= \mathbf{U}\mathbf{U}^\top\mathbf{Y}, &(2.8)\end{aligned}$$
and the ridge regression fitted values as
$$\begin{aligned}\mathbf{X}\hat{\beta}^{\,ridge} &= \mathbf{X}(\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{Y} &(2.9)\\ &= \mathbf{U}\mathbf{D}(\mathbf{D}^2 + \lambda\mathbf{I})^{-1}\mathbf{D}\mathbf{U}^\top\mathbf{Y} &(2.10)\\ &= \sum_{j=1}^{p}\mathbf{u}_j\,\frac{d_j^2}{d_j^2 + \lambda}\,\mathbf{u}_j^\top\mathbf{Y}. &(2.11)\end{aligned}$$
This formula shows that a greater amount of shrinkage toward 0 is applied to the basis vectors $\mathbf{u}_j$ with smaller $d_j^2$.
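The shrinkage factors $d_j^2/(d_j^2 + \lambda)$ in (2.11) can be tabulated directly; in this sketch the singular values are illustrative:

```python
def ridge_shrinkage_factors(singular_values, lam):
    """Factor d_j^2 / (d_j^2 + lambda) applied to each basis vector u_j in (2.11)."""
    return [d * d / (d * d + lam) for d in singular_values]

d = [10.0, 3.0, 0.5]            # illustrative singular values of X
for lam in (0.0, 1.0, 10.0):
    print(lam, [round(f, 3) for f in ridge_shrinkage_factors(d, lam)])
# lambda = 0 recovers OLS (no shrinkage); larger lambda shrinks
# the directions with small d_j most strongly toward 0
```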
2. Decorrelation.
Assuming the predictors to be normalized, we can express the sample covariance
matrix in terms of the sample correlations $\rho_{ij}$:
$$\mathbf{X}^\top\mathbf{X} = \begin{pmatrix} 1 & \rho_{12} & \cdots & \rho_{1p} \\ & 1 & \cdots & \vdots \\ & & \ddots & \rho_{p-1,p} \\ & & & 1 \end{pmatrix}_{p \times p}. \qquad (2.12)$$
Ridge estimates with parameter $\lambda$ are given by $\hat{\beta}^{\,ridge} = \mathbf{R}\mathbf{X}^\top\mathbf{Y}$, with
$$\mathbf{R} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}.$$
Notice that $\mathbf{R}$ can be rewritten as
$$\mathbf{R} = \frac{1}{1+\lambda}\,\mathbf{R}^{*} = \frac{1}{1+\lambda}\begin{pmatrix} 1 & \frac{\rho_{12}}{1+\lambda} & \cdots & \frac{\rho_{1p}}{1+\lambda} \\ & 1 & \cdots & \vdots \\ & & \ddots & \frac{\rho_{p-1,p}}{1+\lambda} \\ & & & 1 \end{pmatrix}^{-1},$$
which is the “decorrelated” OLS operator with correlations shrunk by the factor
$1/(1+\lambda)$. We will revisit this effect in Section 2.2. However, ridge regression
does not provide sparse solutions for high-dimensional regression.
2.1.3 Lq Regularization: Bridge Regression
Frank & Friedman (1993) considered the bridge estimator associated with the
regularization problem
$$-\ell_n(\beta) + \lambda P_q(\beta) = \sum_{t=1}^{n}\Big(y_t - \sum_{j=1}^{p}\beta_j x_{tj}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|^{q}, \quad \text{where } q \ge 0. \qquad (2.13)$$
Since this power family of penalties contains subset selection (q = 0), Lasso (q = 1),
and ridge regression (q = 2) as special cases, it gives us opportunities to choose
between subset regression (sparsest solutions) and ridge regression (non-sparse solu-
tions) for 0 ≤ q ≤ 2, and indeed one might even try estimating q from the data; see
Frank & Friedman (1993) and Friedman (2008). While one can use convex optimiza-
tion techniques to solve (2.13) for q ≥ 1, the optimization problem is non-convex when
q < 1, which may yield much sparser solutions than Lasso (q = 1). Huang,
Horowitz & Ma (2008) have studied the asymptotic properties of the bridge estimator
when 0 < q < 1 as p →∞ and n →∞.
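Under an orthonormal design, (2.13) also decouples into scalar problems of the form $(z - \theta)^2 + \lambda|\theta|^q$. A crude grid search (purely illustrative, for q > 0; not how bridge estimators are computed in practice) shows how decreasing q sparsifies the estimate:

```python
def bridge_scalar(z, lam, q, grid_step=0.001):
    """Grid-search minimizer of (z - theta)^2 + lam * |theta|^q, q > 0,
    over a grid covering [-|z|, |z|] (the minimizer lies in this range)."""
    best_theta, best_val = 0.0, z * z   # theta = 0 has zero penalty for q > 0
    steps = int(abs(z) / grid_step) + 1
    for k in range(-steps, steps + 1):
        theta = k * grid_step
        val = (z - theta) ** 2 + lam * abs(theta) ** q
        if val < best_val:
            best_theta, best_val = theta, val
    return best_theta

z, lam = 0.8, 1.0
for q in (2.0, 1.0, 0.5):
    print(q, round(bridge_scalar(z, lam, q), 3))
# as q decreases, the estimate either stays close to z or snaps exactly to 0
```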
2.1.4 Refinements: Adaptive Lasso & SCAD
Alternative penalty functions have been proposed to improve Lasso and ridge regres-
sion in terms of variable selection consistency and prediction accuracy.
Adaptive Lasso
We call a method M an “oracle procedure” if $\hat{\beta}(M)$ has the following asymptotic properties:
(i) it identifies the right subset model, i.e., $\{j : \hat{\beta}_j(M) \neq 0\} = \{j : \beta_j \neq 0\}$,
(ii) $\sqrt{n}\,(\hat{\beta}_A(M) - \beta_A)$ converges in distribution to $N(0, \Sigma^{*})$, where $A = \{j : \beta_j \neq 0\}$, $\Sigma^{*}$ is the covariance matrix when the true subset model is known, and $\beta_A$ is the subvector of $\beta$ corresponding to the subset $A$ of $\{1, 2, \cdots, p\}$.
From Section 2.1.1, we know that Lasso is variable-selection consistent only under
certain conditions, so Lasso is not itself an oracle procedure. Zou (2006) proposed
the Adaptive Lasso, which assigns different weights to different coefficients. Defining the
weight vector $w = 1/|\hat{\beta}|^{\gamma}$, where $\hat{\beta}$ is a $\sqrt{n}$-consistent estimator of $\beta$, the Adaptive Lasso
estimator is given by
$$\hat{\beta}^{*(n)} = \arg\min_{\beta}\Big\{\sum_{t=1}^{n}\Big(y_t - \sum_{j=1}^{p}\beta_j x_{tj}\Big)^2 + \lambda^{(n)}\sum_{j=1}^{p} w_j|\beta_j|\Big\}. \qquad (2.14)$$
The data-driven $w$ is the key to the Adaptive Lasso, since it depends on a $\sqrt{n}$-consistent
initial estimate $\hat{\beta}$ (one can usually use $\hat{\beta}^{\,ols}$). Under the fixed-p setting, Zou (2006) proved
in his Theorem 2 that the Adaptive Lasso enjoys the oracle properties. Huang, Ma & Zhang
(2008) extended these oracle properties to high-dimensional cases where $p \to \infty$ as
$n \to \infty$.
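Under an orthonormal design, (2.14) again decouples: each coordinate is soft-thresholded at $\lambda^{(n)} w_j/2$ with $w_j = 1/|\hat{\beta}_j^{\,ols}|^{\gamma}$, so coordinates with large initial estimates are penalized less. A minimal sketch with illustrative values:

```python
def adaptive_lasso_orthonormal(beta_init, lam, gamma=1.0):
    """Componentwise Adaptive-Lasso solution under an orthonormal design:
    soft-threshold beta_j at lam * w_j / 2, with weight w_j = 1/|beta_j|^gamma."""
    out = []
    for b in beta_init:
        if b == 0.0:
            out.append(0.0)
            continue
        w = 1.0 / abs(b) ** gamma
        t = abs(b) - lam * w / 2.0
        out.append(0.0 if t <= 0.0 else (t if b > 0 else -t))
    return out

beta_ols = [3.0, 0.5, -2.0]
print(adaptive_lasso_orthonormal(beta_ols, lam=1.0))
# the large coefficients 3.0 and -2.0 lose little; the small 0.5 is set to 0
```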
SCAD
Under a generalized penalized likelihood framework, Fan & Li (2001) introduced the
“smoothly clipped absolute deviation” (SCAD) penalty to address the problem that
the L1-penalty used by Lasso may lead to severe bias for large regression coefficients.
They proposed to estimate β by
$$-n^{-1}\ell_n(\beta) + \sum_{j=1}^{p} P_{\lambda}(|\beta_j|) = \frac{1}{2n}\|\mathbf{Y} - \mathbf{X}\beta\|^2 + \sum_{j=1}^{p} P_{\lambda}(|\beta_j|). \qquad (2.15)$$
They argued that good penalty functions should result in estimators with the following
properties:
• Approximate Unbiasedness: The resulting estimator is nearly unbiased, espe-
cially when the true unknown coefficient βj is large, to avoid unnecessary mod-
eling bias,
• Sparsity: The resulting estimator is a thresholding rule, thus can automatically
set small estimated coefficients to zero so as to get a sparse model,
• Continuity: The resulting estimator is continuous in data to reduce instability
in model prediction.
By considering the canonical linear model, in which the design matrix satisfies
$$\mathbf{X}^\top\mathbf{X} = n\mathbf{I}_p,$$
they reduced (2.15) to
$$\frac{1}{2n}\|\mathbf{Y} - \mathbf{X}\hat{\beta}\|^2 + \frac{1}{2}\|\beta - \hat{\beta}\|^2 + \sum_{j=1}^{p} P_{\lambda}(|\beta_j|), \qquad (2.16)$$
where $\hat{\beta} = n^{-1}\mathbf{X}^\top\mathbf{Y}$ is the ordinary least squares estimator. The minimization of (2.16) is therefore equivalent to minimizing componentwise
$$\frac{1}{2}(z - \theta)^2 + P_{\lambda}(|\theta|). \qquad (2.17)$$
To attain these three properties of a good penalty function, Antoniadis & Fan (2001)
gave the following sufficient conditions:
• Approximate unbiasedness if $p'_{\lambda}(t) = 0$ for large $t$,
• Sparsity if $\min_{t \ge 0}\{t + p'_{\lambda}(t)\} > 0$,
• Continuity if $\arg\min_{t \ge 0}\{t + p'_{\lambda}(t)\} = 0$.
Accordingly, Fan and Li introduced the SCAD penalty function, whose derivative
is given by
$$p'_{\lambda}(t) = \lambda\Big\{\mathbb{1}_{\{t \le \lambda\}} + \frac{(a\lambda - t)_{+}}{(a-1)\lambda}\,\mathbb{1}_{\{t > \lambda\}}\Big\} \ \ \text{for some } a > 2, \quad \text{and } p_{\lambda}(0) = 0. \qquad (2.18)$$
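The derivative in (2.18) can be coded directly; the sketch below (a = 3.7 is the default suggested by Fan & Li, the other values are illustrative) exhibits the Lasso-like region, the linearly decaying transition region, and the zero-slope region responsible for approximate unbiasedness:

```python
def scad_derivative(t, lam, a=3.7):
    """SCAD penalty derivative p'_lambda(t) of (2.18), for t >= 0 and a > 2."""
    if t <= lam:
        return lam                              # Lasso-like region: slope lambda
    return max(a * lam - t, 0.0) / (a - 1.0)    # decays linearly, 0 beyond a*lambda

lam = 1.0
print(scad_derivative(0.5, lam))   # -> 1.0 (constant slope lambda, like Lasso)
print(scad_derivative(2.0, lam))   # transition region: (a*lam - t)/(a - 1)
print(scad_derivative(5.0, lam))   # -> 0.0 (approximate unbiasedness for large t)
```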
The associated minimization problem is non-convex, so multi-step procedures in which
each step involves convex optimization have been introduced, as in the local quadratic
approximation of Fan & Li (2001) and the local linear approximation (LLA) of Zou
& Li (2008), who also showed that the one-step LLA estimator has certain oracle
properties if the initial estimator is suitably chosen. Zhou, Van De Geer & Buhlmann
(2009) have pointed out that one such procedure is Zou’s (2006) adaptive Lasso, which
uses the Lasso as an initial estimator to determine the weights for a second-stage
weighted Lasso. They have also substantially weakened the conditions of Huang, Ma
& Zhang (2008) on the variable selection consistency of adaptive Lasso, which Zou
(2006) established earlier for the case of fixed p. However, the computation of both
Adaptive Lasso and SCAD requires a consistent initial estimator for the unknown
regression coefficients. This requirement implicitly assumes p < n or that Lasso or
some other preliminary estimator is consistent for the problem at hand.
2.2 Elastic Net and a New Approach
Zou & Hastie (2005) introduced the Elastic Net, which is a convex combination of L1
and L2 penalties, to address two serious drawbacks of Lasso:
(a) Lasso has a certain amount of sparsity forced onto it: it selects at most n
variables before it saturates. When p ≫ n, there may be more than n nonzero
βj's in the true model, and this is a limitation of Lasso in terms of variable
selection,
(b) Simulation studies have shown that Lasso does not perform well if the predictors
are highly correlated, even when p < n.
The Elastic Net attempts to combine the L2 and L1 penalties, using ridge regression
to deal with the high-correlation problem while taking advantage of Lasso's sparse
variable selection properties, as will be explained below. Assume that the response is
centered and the predictors are normalized, i.e.,
$$\sum_{i=1}^{n} y_i = 0, \qquad \sum_{i=1}^{n} x_{ij} = 0, \qquad \text{and} \qquad \sum_{i=1}^{n} x_{ij}^2 = 1, \qquad j = 1, 2, \cdots, p. \qquad (2.19)$$
Define
$$L(\lambda_1, \lambda_2, \beta) = \|\mathbf{Y} - \mathbf{X}\beta\|_2^2 + \lambda_2\|\beta\|_2^2 + \lambda_1\|\beta\|_1, \qquad (2.20)$$
where $\|\beta\|_2^2 = \sum_{j=1}^{p}\beta_j^2$ and $\|\beta\|_1 = \sum_{j=1}^{p}|\beta_j|$. For any fixed $\lambda_1$ and $\lambda_2$, the Elastic Net
estimator minimizes (2.20) up to the rescaling
$$\hat{\beta}^{\,enet} = (1+\lambda_2)\,\arg\min_{\beta}\{L(\lambda_1, \lambda_2, \beta)\}. \qquad (2.21)$$
As shown by Zou and Hastie, we can first augment the data as $(\mathbf{Y}^{*}, \mathbf{X}^{*})$ with
$$\mathbf{X}^{*}_{(n+p)\times p} = (1+\lambda_2)^{-1/2}\begin{pmatrix}\mathbf{X}\\ \sqrt{\lambda_2}\,\mathbf{I}\end{pmatrix}, \qquad \mathbf{Y}^{*}_{(n+p)\times 1} = \begin{pmatrix}\mathbf{Y}\\ \mathbf{0}\end{pmatrix},$$
and then minimize
$$L(\gamma, \beta^{*}) = \|\mathbf{Y}^{*} - \mathbf{X}^{*}\beta^{*}\|_2^2 + \gamma\|\beta^{*}\|_1, \qquad (2.22)$$
where $\gamma = \lambda_1/\sqrt{1+\lambda_2}$ and $\beta^{*} = \sqrt{1+\lambda_2}\cdot\beta$; the minimization of (2.22) is equivalent
to the minimization of (2.20).
In the case of an orthogonal design, it is straightforward to show that with parameters
$(\lambda_1, \lambda_2)$ the Elastic Net solution is
$$\begin{aligned}\hat{\beta}^{\,enet}_i &= (1+\lambda_2)\,\frac{(|\hat{\beta}^{\,ols}_i| - \lambda_1/2)_{+}}{1+\lambda_2}\,\mathrm{sgn}\{\hat{\beta}^{\,ols}_i\}\\ &= (1+\lambda_2)\Big(|\hat{\beta}^{\,ridge}_i| - \frac{\lambda_1/2}{1+\lambda_2}\Big)_{+}\mathrm{sgn}\{\hat{\beta}^{\,ridge}_i\},\end{aligned} \qquad (2.23)$$
which amounts to Lasso-type soft-thresholding of the ridge regression estimates
associated with the L2 penalty.
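Formula (2.23) is easy to check numerically; this sketch (illustrative values) applies ridge shrinkage by $1/(1+\lambda_2)$, soft-thresholds at $(\lambda_1/2)/(1+\lambda_2)$, and rescales by $(1+\lambda_2)$:

```python
def enet_orthogonal(beta_ols, lam1, lam2):
    """Elastic Net solution (2.23) under an orthogonal design:
    ridge-shrink, soft-threshold, then rescale by (1 + lambda2)."""
    out = []
    for b in beta_ols:
        b_ridge = b / (1.0 + lam2)                      # stage 1: ridge shrinkage
        t = abs(b_ridge) - (lam1 / 2.0) / (1.0 + lam2)  # stage 2: soft threshold
        thresh = 0.0 if t <= 0.0 else (t if b_ridge > 0 else -t)
        out.append((1.0 + lam2) * thresh)               # rescaling in (2.23)
    return out

print(enet_orthogonal([3.0, 0.2, -1.0], lam1=1.0, lam2=1.0))  # -> [2.5, 0.0, -0.5]
```

As (2.23) states, the result coincides with soft-thresholding the OLS estimates at $\lambda_1/2$; the ridge stage matters once the design is correlated.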
More generally, the Elastic Net can be considered a two-stage procedure: ridge-type
direct shrinkage followed by Lasso-type thresholding. The ridge-type shrinkage is the
part that distinguishes the Elastic Net from Lasso. To study the operating
characteristics of the ridge operator, assume that the predictors are normalized so that
$$\mathbf{X}^\top\mathbf{X} = \begin{pmatrix} 1 & \rho_{12} & \cdots & \rho_{1p} \\ & 1 & \cdots & \vdots \\ & & \ddots & \rho_{p-1,p} \\ & & & 1 \end{pmatrix}_{p \times p}, \qquad (2.24)$$
where $\rho_{ij}$ is the sample correlation. Ridge estimates with parameter $\lambda_2$ are given by
$\hat{\beta}^{\,ridge} = \mathbf{R}\mathbf{X}^\top\mathbf{Y}$, with
$$\mathbf{R} = (\mathbf{X}^\top\mathbf{X} + \lambda_2\mathbf{I})^{-1}.$$
Notice that $\mathbf{R}$ can be rewritten as
$$\mathbf{R} = \frac{1}{1+\lambda_2}\,\mathbf{R}^{*} = \frac{1}{1+\lambda_2}\begin{pmatrix} 1 & \frac{\rho_{12}}{1+\lambda_2} & \cdots & \frac{\rho_{1p}}{1+\lambda_2} \\ & 1 & \cdots & \vdots \\ & & \ddots & \frac{\rho_{p-1,p}}{1+\lambda_2} \\ & & & 1 \end{pmatrix}^{-1}.$$
$\mathbf{R}^{*}$ is like the usual OLS operator except that the correlations are shrunk by the factor
$1/(1+\lambda_2)$, which in effect performs decorrelation. However, the Elastic Net has
some limitations, because it uses a mixture of the joint double-exponential prior (cor-
responding to Lasso) and the joint normal prior (corresponding to ridge). Therefore,
when one of the penalties does not work well, the mixture still has to pay the price
of that penalty.
The operating characteristics of the Elastic Net inspired us to combine the L1 and L2
penalties in an alternative way. Rather than taking a linear combination of the L1
and L2 penalties as in (2.20), we can combine the penalties by taking the maximum
of $|\beta_i|$ and $\rho\beta_i^2$ for each component of $\beta$, i.e.,
$$P(\beta, \rho) = \sum_{i=1}^{p}\max(|\beta_i|,\ \rho\beta_i^2). \qquad (2.25)$$
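The Max1,2 penalty (2.25) acts componentwise, switching from the L1 branch to the L2 branch at $|\beta_i| = 1/\rho$; a direct sketch (the inputs are illustrative):

```python
def max12_penalty(beta, rho):
    """P(beta, rho) = sum_i max(|beta_i|, rho * beta_i^2), as in (2.25).
    The L2 branch rho*b^2 dominates once |b| > 1/rho."""
    return sum(max(abs(b), rho * b * b) for b in beta)

rho = 3.0
print(max12_penalty([0.2], rho))   # -> 0.2  (L1 branch: 0.2 > 3 * 0.04)
print(max12_penalty([1.0], rho))   # -> 3.0  (L2 branch: 3 > 1)
```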
Figure 2.1: Contour for different penalties, left to right, L1, L2, and Max1,2 with ρ = 3
We call it the Max(1,2) penalty. Figure 2.1 shows the contours of the L1, L2, and Max(1,2) (with ρ = 3) penalties. The essential idea behind the Max(1,2) penalty is to threshold the small coefficients by the L1 penalty, while keeping the large coefficients after the decorrelation by the L2 penalty. The corresponding prior density for \beta_j is proportional to

\exp\{-\lambda \cdot \max(|\beta_j|, \rho\beta_j^2)\}
for each j, and the joint prior distribution for \beta assumes that the \beta_j are independent. In other words, large coefficients are treated as in ridge regression, which tends to “share” the coefficients among a group of correlated regressors, while small coefficients are soft-thresholded as in the Lasso, yielding sparsity among these small coefficients. Figure 2.2 compares the shrinkage effects of the L1, L2, Elastic Net, and Max(1,2) penalties.
Figure 2.2: Shrinkage effects of different penalties (L1, L2, Elastic Net, and Max(1,2))
We now define our Max(1,2) estimator. Let

L(\lambda, \rho, \beta) = \|Y - X\beta\|_2^2 + \lambda P(\beta, \rho) \qquad (2.26)
\phantom{L(\lambda, \rho, \beta)} = \|Y - X\beta\|_2^2 + \lambda \sum_{i=1}^{p} \max(|\beta_i|, \rho\beta_i^2). \qquad (2.27)

For any fixed \lambda and \rho, denote the Max(1,2) penalty estimator by \hat\beta_M:

\hat\beta_M = \arg\min_\beta L(\lambda, \rho, \beta). \qquad (2.28)
In the literature, Owen (2006) introduced the “Berhu” penalty,

B_M(\beta_j) = \begin{cases}
|\beta_j|, & |\beta_j| \le M, \\
\dfrac{\beta_j^2 + M^2}{2M}, & |\beta_j| > M,
\end{cases} \qquad (2.29)

which is another variant of the Max(1,2) penalties. His motivation was Huber's loss function for robust estimation, and “Berhu” reverses “Huber”. Owen (2006) used cvx to solve the optimization problem, which is not fast. In the following section we present our new, efficient implementation.
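For comparison, the Berhu penalty (2.29) is equally easy to evaluate; a minimal sketch (our own naming), which is L1 for small coefficients and quadratic for large ones, continuous at |\beta_j| = M:

```python
import numpy as np

def berhu_penalty(beta, M):
    """Owen's (2006) Berhu penalty, eq. (2.29): L1 for |b| <= M, quadratic above."""
    b = np.abs(np.asarray(beta, dtype=float))
    # (b^2 + M^2) / (2M) equals b at b = M, so the two pieces join continuously
    return np.where(b <= M, b, (b ** 2 + M ** 2) / (2 * M)).sum()
```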
2.2.1 Implementation of Max(1,2) Regularization
To handle the p ≫ n case, in this section we first present a fast implementation of Max(1,2) regularization for fixed λ and ρ, based on the ADMM procedure for convex optimization problems; see Boyd et al. (2010). Since the statistics community cares much about the solution path for parameter-tuning purposes, we then propose an even faster path-approximation algorithm that yields all path solutions for the Max(1,2) regularization method.
Exact solution implementation
We now describe our modified Alternating Direction Method of Multipliers (ADMM; Boyd et al. (2010)) for the Max(1,2) problem in the Lagrangian form

L(\lambda, \rho, \beta) = \|Y - X\beta\|_2^2 + \lambda \sum_{i=1}^{p} \max(|\beta_i|, \rho\beta_i^2). \qquad (2.30)

The algorithm can also be generalized to other convex hybrid penalties.
ADMM is a variant of the augmented Lagrangian scheme that uses partial updates
for the dual variables, and it is intended to blend the decomposition capability of dual
descent with the superior convergence properties of the method of multipliers. Our
Max(1,2) optimization problem is clearly equivalent to

minimize: L(\lambda, \rho, \beta, z) = \|Y - X\beta\|_2^2 + \lambda \sum_{i=1}^{p} \max(|z_i|, \rho z_i^2), \qquad (2.31)
subject to: \beta - z = 0, \qquad (2.32)

where z = (z_1, z_2, \cdots, z_p). As in the method of multipliers, we can form the augmented Lagrangian

L_r(\lambda, \rho, \beta, z, d) = \|Y - X\beta\|_2^2 + \lambda \sum_{i=1}^{p} \max(|z_i|, \rho z_i^2) + d^\top(\beta - z) + \frac{r}{2}\|\beta - z\|_2^2. \qquad (2.33)
Under the ADMM framework, for fixed λ and ρ, we can solve (2.33) through the
iterations:

\beta^{k+1} := \arg\min_\beta L_r(\beta, z^k, d^k), \qquad (2.34)
z^{k+1} := \arg\min_z L_r(\beta^{k+1}, z, d^k), \qquad (2.35)
d^{k+1} := d^k + r(\beta^{k+1} - z^{k+1}). \qquad (2.36)

Application of ADMM to (2.33) updates \beta and z in an alternating fashion and consists of three steps: a \beta-minimization step (2.34), a z-minimization step (2.35), and a dual-variable update (2.36) that uses a step size r. As pointed out by Boyd et al., the state of ADMM consists only of z^k and d^k, i.e., (z^{k+1}, d^{k+1}) is a function of (z^k, d^k). In particular, both minimization steps (2.34) and (2.35) can be solved explicitly:

\beta^{k+1} = \Big(X^\top X + \frac{r}{2} I\Big)^{-1} \Big(X^\top y + \frac{r}{2} z^k - \frac{d^k}{2}\Big), \qquad (2.37)
z_j^{k+1} = S_{\lambda,\rho,r}\Big(\frac{d_j^k}{r} + \beta_j^{k+1}\Big), \quad 1 \le j \le p, \qquad (2.38)
and S_{\lambda,\rho,r} is the shrinkage operator associated with the Max(1,2) penalty,

S_{\lambda,\rho,r}(v) = \begin{cases}
\dfrac{v}{1+2\lambda\rho/r}, & |v| \ge \frac{1}{\rho} + \frac{2\lambda}{r}, \\[4pt]
\frac{1}{\rho}\,\mathrm{sign}(v), & \frac{1}{\rho} + \frac{\lambda}{r} \le |v| \le \frac{1}{\rho} + \frac{2\lambda}{r}, \\[4pt]
v - \mathrm{sign}(v)\frac{\lambda}{r}, & \frac{\lambda}{r} \le |v| \le \frac{1}{\rho} + \frac{\lambda}{r}, \\[4pt]
0, & |v| \le \frac{\lambda}{r},
\end{cases} \qquad (2.39)

which is exactly the shrinkage that occurs for orthonormal inputs and is a “hybrid” of ridge shrinkage and Lasso soft thresholding, depending on the magnitude of v.
We iterate until \|z^k - \beta^k\|_2 \le \varepsilon_{\mathrm{feas}} and \|z^k - z^{k-1}\|_2 \le \varepsilon_{\mathrm{tol}}, where \varepsilon_{\mathrm{feas}} is the feasibility tolerance for the residual of the equality constraint (2.32) and \varepsilon_{\mathrm{tol}} is the stability tolerance. Many convergence results for ADMM are discussed in the literature; more details can be found in Boyd et al. (2010).
Thus, we can calculate \hat\beta_{\lambda,\rho} on a grid of (\lambda, \rho) values and then choose the best tuning parameters (\lambda^*, \rho^*) by cross-validation.
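The iterations (2.34)–(2.39) can be sketched directly in code. The following is a minimal illustrative implementation (ours, not the author's software; all names are hypothetical), with the shrinkage operator (2.39) applied elementwise:

```python
import numpy as np

def max12_shrink(v, lam, rho, r):
    """Elementwise shrinkage operator S_{lam,rho,r} of eq. (2.39)."""
    a = np.abs(v)
    out = np.where(a <= lam / r, 0.0, v - np.sign(v) * lam / r)        # soft threshold
    mid = (a >= 1/rho + lam/r) & (a <= 1/rho + 2*lam/r)
    out = np.where(mid, np.sign(v) / rho, out)                         # flat segment
    out = np.where(a >= 1/rho + 2*lam/r, v / (1 + 2*lam*rho/r), out)   # ridge-type
    return out

def max12_admm(X, y, lam, rho, r=1.0, n_iter=500, tol=1e-8):
    """ADMM iterations (2.34)-(2.36) for the Max(1,2) problem (2.30)."""
    n, p = X.shape
    A = np.linalg.inv(X.T @ X + (r / 2) * np.eye(p))  # reused in every beta-step
    Xty = X.T @ y
    z = np.zeros(p)
    d = np.zeros(p)
    for _ in range(n_iter):
        beta = A @ (Xty + (r / 2) * z - d / 2)            # eq. (2.37)
        z_new = max12_shrink(d / r + beta, lam, rho, r)   # eq. (2.38)
        d = d + r * (beta - z_new)                        # eq. (2.36)
        if (np.linalg.norm(z_new - z) <= tol and
                np.linalg.norm(beta - z_new) <= tol):
            z = z_new
            break
        z = z_new
    return z
```

With lam = 0 the shrinkage is the identity and the iterations converge to the ordinary least-squares solution, which gives a quick correctness check.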
Fast path approximation implementation
In the previous subsection, we gave an exact solution for given \lambda and \rho. Obtaining solutions to (2.30) over a grid of different (\lambda, \rho) values still carries a large computational burden. One approach to mitigating this burden is direct path seeking, which sequentially constructs a path in the parameter space that closely approximates the one for a given penalty, rather than repeatedly solving the optimization problems. Motivated by the Generalized Path Seeking (GPS) algorithm of Friedman (2008), we introduce a direct path-seeking algorithm for our Max(1,2) estimator.
Denote P(\beta, \rho) = \sum_{i=1}^{p} \max(|\beta_i|, \rho\beta_i^2), \rho > 0. It is easy to show that

\frac{\partial P(\beta, \rho)}{\partial |\beta_j|} > 0, \quad 1 \le j \le p, \qquad (2.40)

for all values of \beta, which satisfies condition (23) in Friedman (2008). We thus obtain a path-seeking algorithm for our Max(1,2) estimator based on Friedman's GPS framework.
Let \upsilon measure the length along the path, let \Delta\upsilon > 0 be a small increment, and let \beta(\upsilon) be the path solution point indexed by \upsilon. Define
g_j(\upsilon) = -\left[\frac{\partial \|Y - X\beta\|_2^2}{\partial \beta_j}\right]_{\beta=\beta(\upsilon)}, \qquad (2.41)
p_j(\upsilon) = \left[\frac{\partial P(\beta, \rho)}{\partial |\beta_j|}\right]_{\beta=\beta(\upsilon)}, \qquad (2.42)
\tau_j(\upsilon) = \frac{g_j(\upsilon)}{p_j(\upsilon)}, \qquad (2.43)

in which g_j(\upsilon) and p_j(\upsilon) are, respectively, the jth componentwise negative gradient of the least-squares empirical risk and the gradient of the regularizer P(\beta, \rho) with respect to |\beta_j|, both evaluated at \beta(\upsilon); thus the \tau_j(\upsilon) are the componentwise ratios of these two gradients at \beta(\upsilon).
Algorithm 3 Path-seeking algorithm for the Max(1,2) estimator
Choose an appropriate \Delta\upsilon; let G_\rho be the number of \rho values in the grid.
Step 1: For each \rho, initialize \upsilon = 0 and \beta_j(0) = 0, 1 \le j \le p.
Step 2.1: Compute \{\tau_j(\upsilon)\}_{1}^{p} and form the candidate set S = \{j : \tau_j(\upsilon) \cdot \beta_j(\upsilon) < 0\}.
Step 2.2: Choose the variable with maximal gradient ratio from the candidate set S: j^* = \arg\max_{j \in S} |\tau_j(\upsilon)| (if S is empty, j^* = \arg\max_j |\tau_j(\upsilon)|).
Step 2.3: Update \beta_{j^*}: \beta_{j^*}(\upsilon + \Delta\upsilon) = \beta_{j^*}(\upsilon) + \Delta\upsilon \cdot \mathrm{sign}(\tau_{j^*}(\upsilon)).
Step 2.4: Set \upsilon = \upsilon + \Delta\upsilon.
Step 2.5: Repeat Steps 2.1–2.4 until all \tau(\upsilon) = 0.
Step 3: Tune along the path to obtain \beta(\upsilon^*(\rho)) with minimum cross-validation error; set \hat\beta_M = \beta(\upsilon^*(\rho^*)), the solution with minimum cross-validation error across all \rho.
For any fixed \rho, the path-approximation algorithm (Algorithm 3) gives the solution path along the path length \upsilon; thus we can tune \upsilon (equivalent to tuning \lambda) along the path by choosing \beta(\upsilon^*(\rho)) with minimum cross-validation error. Over a grid of \rho values, we then select \beta(\upsilon^*(\rho^*)) with minimum cross-validation error, which is our empirical solution for the Max(1,2) estimator. The path-seeking algorithm reduces the two-dimensional grid search to a univariate grid search, and it also speeds up the path seeking along \lambda for fixed \rho by using an approximation, which makes the Max(1,2) estimator more easily computable.
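For squared-error loss, Algorithm 3 is only a few lines of code. The sketch below (our own, with hypothetical names) follows Steps 2.1–2.4, using g_j(\upsilon) = 2 x_j^\top(Y - X\beta) and the subgradient of \max(|\beta_j|, \rho\beta_j^2) with respect to |\beta_j|:

```python
import numpy as np

def max12_path(X, y, rho, dv=0.01, n_steps=200):
    """GPS-style path seeking for the Max(1,2) penalty (Algorithm 3 sketch)."""
    n, p = X.shape
    beta = np.zeros(p)
    path = [beta.copy()]
    for _ in range(n_steps):
        resid = y - X @ beta
        g = 2 * X.T @ resid                                    # neg. gradient, eq. (2.41)
        pen = np.where(np.abs(beta) < 1 / rho,                 # eq. (2.42): d/d|b| of
                       1.0, 2 * rho * np.abs(beta))            #   max(|b|, rho*b^2)
        tau = g / pen                                          # eq. (2.43)
        S = np.where(tau * beta < 0)[0]                        # candidate set, Step 2.1
        cand = S if S.size else np.arange(p)
        j = cand[np.argmax(np.abs(tau[cand]))]                 # Step 2.2
        if np.abs(tau[j]) < 1e-10:
            break                                              # all tau ~ 0: stop
        beta[j] += dv * np.sign(tau[j])                        # Step 2.3
        path.append(beta.copy())
    return np.array(path)
```

Each row of the returned array is one point \beta(\upsilon) on the path; tuning \upsilon then amounts to picking the row with minimum cross-validation error.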
2.3 L0-Regularization: Orthogonal Greedy Algorithm
Ing & Lai (2011) introduced a fast stepwise regression procedure, called the orthogonal greedy algorithm (OGA), which consists of (a) forward selection of input variables in a “greedy” manner so that the variable selected at each step most improves the fit, (b) a high-dimensional information criterion (HDIC) to terminate forward inclusion of variables, and (c) stepwise backward elimination of variables according to HDIC.
2.3.1 OGA and Gradient Boosting
As a greedy algorithm, OGA has a strong connection to the boosting algorithms in
statistical learning. Boosting successively uses a “weak learner” (or “base learner”)
to improve prediction, so that after a large number of iterations the cumulative effect of these weak learners produces much better predictions. The first boosting algorithm came from Valiant's (1984) PAC (Probably Approximately Correct) learning model, and Schapire (1990) developed the first simple boosting procedure under the PAC learning framework. Freund (1995) proposed boosting by majority voting to combine many weak learners simultaneously and improve the performance of the
simple boosting algorithm of Schapire. This led to the more adaptive and realistic AdaBoost (Freund & Schapire 1996a) and its refinements, with theoretical justification provided by Freund & Schapire (1996a) and Schapire & Singer (1999). Subsequently, Freund & Schapire (1996b) and Breiman (1998, 1999) connected AdaBoost to game theory and Vapnik–Chervonenkis theory.
The ground-breaking work on boosting came from Friedman (1999, 2001). Friedman treats the boosting procedure as a general method for functional gradient-descent learning and gives a list of choices of loss functions and base learners in a generic
Algorithm 4 Friedman's Gradient Boosting Algorithm
Given \{(y_i, x_i); i = 1, 2, \ldots, n\}, denote the empirical loss function \phi(F) = \frac{1}{n}\sum_{i=1}^{n} L(y_i, F(x_i)).
• Step 1 (Initialization): Set F_0(x) = 0.
• Step 2 (Iteration): For m = 1 to M:
  – Step 2.1: Compute the steepest-descent direction u_i = -\partial L(y_i, F(x_i))/\partial F(x_i)|_{F=F_{m-1}}, i = 1, \ldots, n.
  – Step 2.2: Calculate the least-squares fit a_m = \arg\min_{a,\beta} \sum_{i=1}^{n} [u_i - \beta h(x_i; a)]^2, where h can be any base learner. Denote f_m(x) = h(x; a_m).
  – Step 2.3: Line search: \rho_m = \arg\min_\rho \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + \rho f_m(x_i)).
  – Step 2.4: Update F_m(x) = F_{m-1}(x) + \rho_m f_m(x).
• Step 3 (Stopping): The boosted estimate is F(x) = \sum_{m=1}^{m^*} \rho_m f_m(x), where m^* is selected to minimize a model selection criterion.
framework for “gradient boosting”. Algorithm 4 is a brief summary of boosting from the gradient-descent point of view.
This work extended the boosting method to regression, implemented as an optimization using the squared-error loss function; this important special case is called L2-Boosting in Buhlmann & Yu (2003) and Buhlmann (2006). As a special case, L2-Boosting uses the loss function L(y, F) = \frac{1}{2}(y - F)^2 and takes the componentwise linear model as base learner; thus we obtain the L2-Boosting algorithm in the gradient-descent framework with a simple structure: the negative gradient in Step 2.1 is the classical residual vector, and the line search in Step 2.3 is trivial. Algorithm 5 summarizes L2-Boosting.
Algorithm 5 L2-Boosting Algorithm
• Step 1 (Initialization): Set F_0(x) = 0.
• Step 2 (Iteration): For m = 1 to M:
  – Step 2.1: Compute the residuals u_i = y_i - F_{m-1}(x_i), i = 1, \ldots, n.
  – Step 2.2: The new base learner is f_m(x) = x_{j_m}, where
    j_m = \arg\min_{1 \le j \le p} \sum_{i=1}^{n} (u_i - \beta_j x_{ij})^2 = \arg\min_{1 \le j \le p} (1 - r_j^2), \quad \beta_j = \frac{\sum_{i=1}^{n} u_i x_{ij}}{\sum_{i=1}^{n} x_{ij}^2}.
  – Step 2.3: Line search: \rho_m = \arg\min_\rho \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + \rho f_m(x_i)) = \beta_{j_m}.
  – Step 2.4: Update F_m(x) = F_{m-1}(x) + \beta_{j_m} x_{j_m}.
• Step 3 (Stopping): The boosted estimate is F(x) = \sum_{m=1}^{m^*} \beta_{j_m} x_{j_m}, where m^* is selected to minimize a model selection criterion.
Remark 2.2. It is often better to use a small step size in Step 2.4, i.e., F_m(x) = F_{m-1}(x) + \nu \beta_{j_m} x_{j_m}, where \nu is a constant during the iterations.
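For squared-error loss with componentwise linear base learners, Algorithm 5 with the step-size shrinkage of Remark 2.2 reads as follows (an illustrative sketch of our own; the function name is hypothetical):

```python
import numpy as np

def l2_boost(X, y, M=100, nu=0.1):
    """Componentwise L2-Boosting (Algorithm 5) with step-size shrinkage nu."""
    n, p = X.shape
    coef = np.zeros(p)
    F = np.zeros(n)
    for _ in range(M):
        u = y - F                                    # Step 2.1: residual vector
        b = (X.T @ u) / (X ** 2).sum(axis=0)         # per-column least-squares fits
        sse = ((u[:, None] - X * b) ** 2).sum(axis=0)
        j = int(np.argmin(sse))                      # Step 2.2: best single predictor
        coef[j] += nu * b[j]                         # Steps 2.3-2.4 with shrinkage nu
        F = X @ coef
    return coef
```

In practice M is chosen by a model selection criterion (e.g., AICc, as Buhlmann (2006) proposes) or by cross-validation.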
Buhlmann (2006) studied the consistency of L2-Boosting in terms of the conditional prediction error (2.44),

\mathrm{CPE} := E\{(F_m(x) - f(x))^2 \mid y_1, x_1, \cdots, y_n, x_n\}, \qquad (2.44)

and showed that for p = \exp(O(n^\xi)) with 0 < \xi < 1, the CPE of the L2-Boosting predictor F_m(x), under certain technical conditions, converges in probability to 0 if m = m_n \to \infty sufficiently slowly. He also proposed to use corrected AIC (AICc) as the model selection criterion along the PGA path.
Algorithm 6 OGA Algorithm
• Step 1 (Initialization): Set F_0(x) = 0 and J^{(0)} = \emptyset.
• Step 2 (Iteration): For m = 1 to K_n:
  – Step 2.1: Compute the residuals U_i = Y_i - F_{m-1}(X_i), i = 1, \ldots, n.
  – Step 2.2: Find j_m by
    j_m = \arg\min_{1 \le j \le p,\, j \notin J^{(m-1)}} (1 - r_j^2). \qquad (2.45)
    Compute the projection \hat{X}_{j_m} of X_{j_m} onto the linear space spanned by (X_{j_1}, X^\perp_{j_2}, X^\perp_{j_3}, \ldots, X^\perp_{j_{m-1}}), and set X^\perp_{j_m} = X_{j_m} - \hat{X}_{j_m}, the “weak” orthogonal gradient direction. Update J^{(m)} = (J^{(m-1)}, j_m).
  – Step 2.3: Line search along the orthogonal gradient direction:
    \rho_m = \arg\min_\rho \sum_{i=1}^{n} L(Y_i, F_{m-1}(X_i) + \rho X^\perp_{i,j_m}) = \hat\beta_{j_m},
    where \hat\beta_{j_m} = \big(\sum_{i=1}^{n} U_i X^\perp_{i,j_m}\big) / \sum_{i=1}^{n} (X^\perp_{i,j_m})^2.
  – Step 2.4: Update F_m(x) = x_{J^{(m)}} \hat\beta_{J^{(m)}} = F_{m-1}(x) + \hat\beta_{j_m} x^\perp_{j_m}.
Remark 2.3. Step 2.3 can be considered a hyperplane search, since we obtain the OLS update from this step.
As an alternative to L2-Boosting, OGA can also be summarized within the boosting framework, as in Algorithm 6. The population versions of OGA and L2-Boosting were also studied in Temlyakov (2000), where they are called OGA and PGA, respectively. In information theory, compressed sensing, and approximation theory, OGA is also known as orthogonal matching pursuit (OMP), which focuses on approximation in noiseless models (i.e., \varepsilon_t = 0 in (2.1)); more details are in Tropp (2004) and Tropp et al. (2007).
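An OGA iteration needs only the residual vector and an orthogonalization of the selected columns. The compact sketch below (our own, with hypothetical names) exploits the fact that refitting OLS on the selected set after each inclusion is equivalent to the orthogonalized updates in Steps 2.2–2.4 of Algorithm 6:

```python
import numpy as np

def oga(X, y, K):
    """Orthogonal greedy algorithm: greedy selection + OLS refit at each step."""
    n, p = X.shape
    J = []                                           # selected index set J^(m)
    resid = y.copy()
    rss_path = []
    norms = np.sqrt((X ** 2).sum(axis=0))
    for _ in range(K):
        corr = np.abs(X.T @ resid) / norms           # maximizing |r_j| = minimizing 1 - r_j^2
        corr[J] = -np.inf                            # exclude already-selected columns
        J.append(int(np.argmax(corr)))               # Step 2.2: greedy selection
        beta_J = np.linalg.lstsq(X[:, J], y, rcond=None)[0]
        resid = y - X[:, J] @ beta_J                 # OLS fit on the selected set
        rss_path.append(float(resid @ resid))        # RSS after each step
    return J, np.array(rss_path)
```

The returned RSS path is exactly what the stopping criterion of the next subsection (HDIC) is computed from.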
2.3.2 OGA+HDIC
Ing & Lai (2011) introduced OGA+HDIC for variable selection. They proposed choosing, along the OGA path, the model that minimizes a suitably chosen criterion, called a “high-dimensional information criterion” (HDIC). Specifically, for a non-empty subset J of \{1, \cdots, p\}, let \hat\sigma^2_J = n^{-1} \sum_{t=1}^{n} (y_t - \hat{y}_{t;J})^2, where \hat{y}_{t;J} is the OGA fit based on the variables in J. Let

\mathrm{HDIC}(J) = n \log \hat\sigma^2_J + \#(J)\, w_n \log p, \qquad (2.46)
\hat{k}_n = \arg\min_{1 \le k \le K_n} \mathrm{HDIC}(J_k), \qquad (2.47)

in which different criteria correspond to different choices of w_n, and J_k = \{j_1, \cdots, j_k\}. Note that \hat\sigma^2_{J_k}, and therefore also \mathrm{HDIC}(J_k), can be readily computed at the kth OGA iteration, so this model selection method along the OGA path involves little additional computational cost. In particular, w_n = \log n corresponds to HDBIC, w_n = c \log\log n with c > 2 corresponds to the high-dimensional Hannan–Quinn criterion (HDHQ), and w_n = c corresponds to HDAIC.
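Along a single OGA path this costs almost nothing: with the residual sums of squares recorded at each step, HDIC (2.46) is one vector operation. A sketch (our naming; w_n = \log n gives HDBIC):

```python
import numpy as np

def hdic_select(rss_path, n, p, w_n):
    """Pick the OGA step minimizing HDIC(J_k) = n*log(sigma2_k) + k*w_n*log(p)."""
    k = np.arange(1, len(rss_path) + 1)              # model sizes along the path
    sigma2 = np.asarray(rss_path, dtype=float) / n   # sigma2_k = RSS_k / n
    hdic = n * np.log(sigma2) + k * w_n * np.log(p)
    return int(np.argmin(hdic)) + 1                  # chosen k_hat (1-based)
```

For instance, if the RSS drops sharply for the first two steps and then stalls, the penalty term k w_n \log p makes the criterion pick k = 2.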
Let K_n denote a prescribed upper bound on the number m of OGA iterations, \sigma^2_j = E(x^2_j), z_j = x_j/\sigma_j, and z_{tj} = x_{tj}/\sigma_j. Let

\Gamma(J) = E\{z(J) z^\top(J)\}, \quad g_i(J) = E(z_i z(J)), \qquad (2.48)

where z(J) is the subvector of (z_1, \cdots, z_p)^\top indexed by the subset J of \{1, \cdots, p\}. We assume that for some \delta > 0, M > 0, and all large n,

\min_{1 \le \#(J) \le K_n} \lambda_{\min}(\Gamma(J)) > \delta, \quad \max_{1 \le \#(J) \le K_n,\, i \notin J} \|\Gamma^{-1}(J) g_i(J)\|_1 < M, \qquad (2.49)

where \#(J) denotes the cardinality of J and

\|\nu\|_1 = \sum_{j=1}^{k} |\nu_j| \text{ for } \nu = (\nu_1, \cdots, \nu_k)^\top. \qquad (2.50)
For p " n, p = pn →∞, under assumptions :
(C1) log pn = o(n),
(C2) E{exp(sε)} < ∞ for |s| ≤ s0,
(C3) There exists s > 0 such that
lim supn→∞
max1≤j≤pn E{exp(sz2j )} < ∞,
(C4) supn≥1
∑pn
j=1 |βjσj| < ∞.
The following theorem of Ing & Lai (2011) gives the rate of convergence, which holds
uniformly over 1 ≤ m ≤ Kn, for the CPE (defined in (2.44)) of OGA provided that
the correlation matrix of the regressors satisfies (2.49).
Theorem 2.4. Assume (C1)-(C4) and (2.49). Suppose Kn → ∞ such that Kn =
O((n/ log pn)1/2). Then for OGA,
max1≤m≤Kn
(E[{y(x)− ym(x)}2|y1,x1, · · · , yn,xn]
m−1 + n−1m log pn
)= Op(1).
Theorem 2.4 says that, uniformly in m = O((n/\log p_n)^{1/2}), OGA attains the heuristically best order m^{-1} + n^{-1} m \log p_n for E_n(\{y(x) - \hat{y}_m(x)\}^2). The two terms in the denominator correspond to the squared bias and the variance of the sample version of OGA. In the population version of OGA, Theorem 3 of Temlyakov (2000) shows that the squared bias in approximating y(x) by y_{J_m}(x) is E(y(x) - y_{J_m}(x))^2 = O(m^{-1}), where y_{J_m}(x) denotes the best linear predictor of y(x) based on \{x_j, j \in J_m\} and J_m is the set of input variables selected by the population version of OGA at the end of stage m. Since the sample version of OGA uses \hat{y}_m(\cdot) rather than y_{J_m}(\cdot), it incurs not only a larger squared bias but also variance in the least-squares estimates \hat\beta_{j_i}, i = 1, \cdots, m. The variance is of order O(n^{-1} m \log p_n): m is the number of estimated regression coefficients, O(n^{-1}) is the variance per coefficient, and O(\log p_n) is the variance inflation factor due to the data-dependent selection of j_i from \{1, \cdots, p_n\}. Combining the squared bias with the variance suggests that O(m^{-1} + n^{-1} m \log p_n) is the smallest order one can expect for E_n(\{y(x) - \hat{y}_m(x)\}^2). Moreover, the standard bias–variance tradeoff suggests that m should not be chosen larger than O((n/\log p_n)^{1/2}), and OGA usually takes K_n = 5(n/\log p_n)^{1/2}.
2.3.3 Variable Selection Consistency under Strong Sparsity
We say a procedure has the sure screening property if it includes all relevant variables with probability approaching 1. Furthermore, we say the procedure is variable-selection consistent if it selects all relevant variables and no irrelevant variables with probability approaching 1. To achieve variable-selection consistency of OGA+HDIC, some lower-bound condition (which may approach 0 as n \to \infty) on the absolute values of the nonzero regression coefficients needs to be imposed. We quantify this lower-bound condition with a “strong sparsity” condition:
(C5) There exists 0 \le \gamma < 1 such that n^\gamma = o((n/\log p_n)^{1/2}) and

\liminf_{n\to\infty} n^\gamma \min_{1 \le j \le p_n:\, \beta_j \ne 0} \beta^2_j \sigma^2_j > 0.

Denote the set of relevant input variables by N_n = \{1 \le j \le p_n : \beta_j \ne 0\}. The following theorem of Ing & Lai (2011) shows that the OGA path contains all the relevant variables under the strong sparsity condition (C5); thus OGA has the sure screening property.
Theorem 2.5. Assume (C1)–(C5) and (2.49). Suppose K_n/n^\gamma \to \infty and K_n = O((n/\log p_n)^{1/2}). Then \lim_{n\to\infty} P(N_n \subset J_{K_n}) = 1, where N_n = \{1 \le j \le p_n : \beta_j \ne 0\} denotes the set of relevant input variables.
Define the minimal number of relevant regressors along an OGA path by

k_n = \min\{k : 1 \le k \le K_n, N_n \subseteq J_k\} \quad (\min \emptyset = K_n), \qquad (2.51)
\hat{k}_n = \arg\min_{1 \le k \le K_n} \mathrm{HDIC}(J_k). \qquad (2.52)

Ing & Lai (2011) showed that by choosing w_n in HDIC to satisfy

w_n \to \infty, \quad w_n \log p_n = o(n^{1-2\gamma}), \qquad (2.53)

the OGA+HDIC scheme is variable-selection consistent in the strongly sparse case.
Theorem 2.6. With the same notation and assumptions as in Theorem 2.5, suppose (2.53) holds, K_n/n^\gamma \to \infty and K_n = O((n/\log p_n)^{1/2}). Then \lim_{n\to\infty} P(\hat{k}_n = k_n) = 1.
Although Theorem 2.6 shows that the minimal number k_n of relevant regressors along the OGA path can be consistently estimated by \hat{k}_n, J_{\hat{k}_n} may still contain irrelevant variables. Ing & Lai (2011) therefore proposed a backward trimming scheme that uses the HDIC criterion to exclude the irrelevant variables, i.e., define a subset \hat{N}_n of J_{\hat{k}_n} by

\hat{N}_n = \{\hat{j}_l : \mathrm{HDIC}(J_{\hat{k}_n} - \{\hat{j}_l\}) > \mathrm{HDIC}(J_{\hat{k}_n}),\ 1 \le l \le \hat{k}_n\} \quad \text{if } \hat{k}_n > 1, \qquad (2.54)

and \hat{N}_n = \{\hat{j}_1\} if \hat{k}_n = 1. Trimming the irrelevant variables requires only the computation of \hat{k}_n - 1 additional least-squares estimates and their associated residual sums of squares \sum_{t=1}^{n} (y_t - \hat{y}_{t; J_{\hat{k}_n} - \{\hat{j}_l\}})^2, 1 \le l < \hat{k}_n, for (2.54), in contrast to the intractable combinatorial optimization problem of choosing the subset with the smallest extended BIC among all non-empty subsets of \{1, \cdots, p_n\}, for which Chen and Chen (2008, Theorem 1) established variable-selection consistency under an “asymptotic identifiability” condition and p_n = O(n^\kappa) for some \kappa > 0. The following theorem of Ing & Lai (2011) establishes the oracle property of the OGA+HDIC+Trim procedure.

Theorem 2.7. Under the same assumptions as in Theorem 2.6, \lim_{n\to\infty} P(\hat{N}_n = N_n) = 1.
A number of recent papers have also studied the OGA approach. Barron, Cohen, Dahmen & DeVore (2008) extended the convergence rates of OGA in noiseless models (i.e., \varepsilon_t = 0 in (1.1)) to regression models by using empirical-process theory, but they require the technical condition |y_t| \le B for some known bound B, since they need this bound to apply empirical-process theory to the sequence of estimates \hat{y}^{(B)}_m(x) = \mathrm{sgn}(\hat{y}_m(x)) \min\{B, |\hat{y}_m(x)|\}. In their work, Barron et al. proposed to terminate OGA after \lfloor na^2 \rfloor iterations for some a \ge 1 and to select the m^* that minimizes \sum_{i=1}^{n} \{y_i - \hat{y}^{(B)}_m(x_i)\}^2 + \kappa m \log n over 1 \le m \le \lfloor na^2 \rfloor, for which they showed that choosing \kappa \ge 2568 B^4(a + 5) yields their convergence result for \hat{y}^{(B)}_{m^*}. In comparison, Theorem 2.4 in Section 2.3.2 does not need this bound condition on |y_t| and has a much sharper convergence rate than that of Barron et al. (2008). Wang (2009) recently proposed using forward stepwise regression to select a manageable subset first, and then choosing variables along an OGA path by using the “extended” BIC (2.55) of Chen & Chen (2008).
\mathrm{BIC}_\gamma(J_m) = n \log \hat\sigma^2_{J_m} + \#(J_m) \log n + 2\gamma \log \tau_{\#(J_m)}, \qquad (2.55)

where \tau_j = \binom{p}{j}, J_m \subset \{1, \cdots, p\} is non-empty, and 0 \le \gamma \le 1.
Wang (2009) only establishes the sure screening property \lim_{n\to\infty} P(N_n \subseteq J_{\hat{m}_n}) = 1, where \hat{m}_n = \arg\min_{1 \le m \le n} \mathrm{EBIC}(J_m); thus Wang (2009) actually uses it to screen variables for a second-stage regression analysis using the Lasso or adaptive Lasso. In addition, Wang (2009) proves the sure screening property under much stronger assumptions than those of Theorem 2.6, such as \varepsilon_t and x_t having normal distributions and a \le \lambda_{\min}(\Sigma_n) \le \lambda_{\max}(\Sigma_n) \le b for some positive constants a and b and all n \ge 1, where \Sigma_n is the covariance matrix of the p_n-dimensional random vector x_t. Forward stepwise regression followed by cross-validation as a screening method in high-dimensional sparse linear models has also been considered by Wasserman & Roeder (2009), who propose using out-of-sample least-squares estimates for the selected model after partitioning the data into a screening group and a remaining group for out-of-sample final estimation. By using OGA+HDIC instead, we can already achieve the oracle property without any further refinement.
For model selection via a nested family of finite-order AR models, Ing & Wei (2005) and Ing (2007) consider the problem of approximating a stationary infinite-order autoregressive process AR(\infty),

y_t = \sum_{j=1}^{\infty} b_j y_{t-j} + \eta_t, \qquad (2.56)

where the b_j are unknown AR coefficients, y_t is a time series, and the \eta_t are random disturbances. They showed that AIC is asymptotically efficient, in the sense that the approximating AR model selected by AIC possesses optimal prediction capability, when the b_j satisfy one of the following sparsity conditions:

L j^{-\gamma} \le |b_j| \le U j^{-\gamma} \quad (\text{the algebraic decay case}), \qquad (2.57)
C_1 \exp(-aj) \le |b_j| \le C_2 \exp(-aj) \quad (\text{the exponential decay case}), \qquad (2.58)

where 0 < L \le U < \infty, 0 < C_1 \le C_2 < \infty, \gamma > 1, and a > 0 are constants. The results of Ing & Lai (2011), together with the asymptotic efficiency of AIC for AR(\infty) processes, inspired us to use HDAIC (2.59),

\mathrm{HDAIC}(J) = n \log \hat\sigma^2_J + c\, \#(J) \log p, \qquad (2.59)

as the high-dimensional version of AIC, to choose models along the OGA path in weakly sparse cases satisfying (2.57) or (2.58), where c is a positive constant and we usually choose c = 2.01.
Chapter 3
Monte Carlo Cross-Validation and Estimation of Prediction Error Distribution
In Chapter 2, we discussed a variety of regularization methods with penalties ranging from L0 to L2, as well as their relative strengths. In particular, we considered OGA+HDIC under strong and weak sparsity assumptions. However, in practice one does not know whether the data-generating mechanism for a particular high-dimensional data set satisfies the sparsity assumptions. Ideally, if we had enough data, we would set aside a test set and use it to assess the prediction performance of a model built from the remaining data, used as a training set. However, we cannot afford such a “luxury” in the p ≫ n case, and we need a procedure that reuses the available data for both training and testing.
CHAPTER 3. MONTE CARLO CROSS-VALIDATION 39
Assume

y_i(x_i) = \beta^\top x_i + \varepsilon_i, \quad \text{where } \beta = (\beta_1, \cdots, \beta_p)^\top,\ x_i = (x_{i,1}, \cdots, x_{i,p})^\top. \qquad (3.1)
Throughout this chapter, we use M to denote a regression method and \hat\beta_{M,n} to denote the estimator of \beta associated with the method M based on a sample of n i.i.d. observations (x_i, y_i), 1 \le i \le n. Let (x_{n+1}, y_{n+1}) be a future pair of covariate x_{n+1} and response y_{n+1}. The prediction error is

e(M; n) = y_{n+1} - \hat\beta^\top_{M,n} x_{n+1} = (\beta - \hat\beta_{M,n})^\top x_{n+1} + \varepsilon_{n+1}. \qquad (3.2)

Let g be a continuous nonnegative function on the real line, e.g., g(e) = |e| or g(e) = e^2. The distribution F^{(M)}_n of g(e(M; n)) can be evaluated by Monte Carlo methods when \beta and the distribution \Psi of (x, \varepsilon) are given.
Given a consistent estimate \hat\beta of \beta, one can use the empirical distribution \hat\Psi of the centered residuals y_i - \hat\beta^\top x_i - (\bar{y} - \hat\beta^\top \bar{x}) to estimate \Psi consistently. Generating i.i.d. random variables \varepsilon^*_i from \hat\Psi and letting y^*_i = \hat\beta^\top x_i + \varepsilon^*_i (i = 1, 2, \ldots, n+1), one can apply the method M to the bootstrap sample \{(x_i, y^*_i), 1 \le i \le n\}, which yields the bootstrap estimate \hat\beta^*_{M,n} and the prediction error e^*(M; n) = y^*_{n+1} - (\hat\beta^*_{M,n})^\top x_{n+1}. The empirical distribution of B bootstrap replicates of g(e^*(M; n)) can then be used to estimate the true distribution F^{(M)}_n.
A key assumption in such asymptotic justifications of the bootstrap, or of other resampling estimates of F^{(M)}_n, is the existence of a consistent estimate \hat\beta of \beta (which may be based on a sample size different from n and a method different from M), so that the unobservable random errors \varepsilon_t, and therefore also the distribution \Psi of (x, \varepsilon), can be estimated consistently. This is not an issue for classical regression theory, which assumes fixed p while n \to \infty. However, it is hard to find a consistent estimate of \beta in the high-dimensional setting p ≫ n considered in this thesis. Since our goal is to evaluate the prediction performance of method M for the given data and a future replicate (x, y), we cannot make a priori assumptions on the sparsity of \beta and the distribution of the unobserved \varepsilon.
In this chapter, we use a Monte Carlo cross-validation (MCCV) approach to address this difficulty by estimating F^{(M)}_{n_t} instead of F^{(M)}_n, where n_t < n is the size of the training samples in cross-validation. We begin with a review of cross-validation and MCCV in the literature.
3.1 Overview of Cross-validation
Cross-validation (CV) is one of the most widely used methods for evaluating a model's prediction performance. As is well known, training an algorithm and evaluating its statistical performance on the same data yields an over-optimistic result. CV was originally developed to fix this issue, starting from the remark that testing the output of the algorithm on new data would yield a better estimate of its performance (Mosteller & Tukey (1968); Stone (1974)). Its basic idea is to split a sample of size n into a training set of size n_t and a test (or validation) set of size n_v = n - n_t.
There are various data-splitting strategies in the literature. Delete-one CV, proposed by Stone (1974), Allen (1974), and Geisser (1975), is the most widely known CV procedure. It corresponds to the choice n_v = 1, with each of the n data points successively left out as the validation set. Thus delete-one CV can be defined as

\mathrm{CV}_q(1) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat\beta^\top_{(-i)} x_i)^2, \qquad (3.3)

where q is the number of covariates in the model and \hat\beta_{(-i)} is the model fitted from the data set that leaves out (y_i, x_i). However, \mathrm{CV}_q(1) is inconsistent for choosing the smallest correct model, as it tends to overfit, and it may also have high variability; see Efron (1986).
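For linear least squares, (3.3) need not be computed with n separate fits: the standard leave-one-out identity \mathrm{CV}(1) = n^{-1} \sum_i \{(y_i - \hat{y}_i)/(1 - h_{ii})\}^2, with h_{ii} the leverages of the hat matrix, recovers it from a single full fit. A minimal numerical check (ours; this shortcut is a classical OLS fact, not specific to the dissertation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 50, 3
X = rng.standard_normal((n, q))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)

# Single full fit: hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.solve(X.T @ X, X.T)
resid = y - H @ y
cv_shortcut = np.mean((resid / (1 - np.diag(H))) ** 2)

# Brute force: refit n times, each time leaving one observation out
errs = []
for i in range(n):
    mask = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    errs.append((y[i] - X[i] @ beta_i) ** 2)
cv_brute = np.mean(errs)

assert np.isclose(cv_shortcut, cv_brute)
```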
Delete-d CV has been proposed to rectify the inconsistency of CV(1). It is also called multifold cross-validation, or MCV. Specifically, MCV is defined for linear regression models as

\mathrm{MCV}_q = \sum_{S} \|y_S - X_S \hat\beta_{(-S)}\|_2^2 \Big/ \Big[d \binom{n}{d}\Big], \qquad (3.4)

where the sum is over all possible subsets S of size d of the n observations, \hat\beta_{(-S)} is the model-based estimate of \beta (under a regression model with q covariates) from the observations not in S, X_S = \{x_i, i \in S\}, and y_S = (y_i, i \in S); see Shao (1993). Assuming fixed p as n \to \infty, Shao (1993) showed that variable-selection consistency of \mathrm{MCV}_q requires n_v/n \to 1 and n - n_v \to \infty as n \to \infty.
Instead of using all subsets of size d as possible test sets as in (3.4), more computationally convenient alternatives to MCV have been proposed. In particular, Breiman, Friedman, Olshen & Stone (1984) considered k-fold cross-validation, which splits the data into k roughly equal-sized groups and uses one subgroup as a test set and the remaining data as the training set. In more detail, let \kappa : \{1, 2, \cdots, n\} \to \{1, 2, \cdots, k\} be an indexing function that indicates the partition to which observation i is randomly allocated, and let \hat{f}^{-k}(x) be the model fitted on the data after removing the kth subgroup. Then the k-fold CV estimate of the mean squared prediction error is

k\text{-CV}_q = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}^{-\kappa(i)}(x_i))^2. \qquad (3.5)
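Equation (3.5) in code, for a linear model fitted by least squares (an illustrative sketch of ours; names are hypothetical):

```python
import numpy as np

def kfold_cv(X, y, k=10, seed=0):
    """k-fold CV estimate (3.5) of the mean squared prediction error."""
    n = len(y)
    kappa = np.random.default_rng(seed).permutation(n) % k   # random fold labels
    sq_err = np.empty(n)
    for fold in range(k):
        test = kappa == fold
        # fit on all observations outside the fold, predict inside it
        beta = np.linalg.lstsq(X[~test], y[~test], rcond=None)[0]
        sq_err[test] = (y[test] - X[test] @ beta) ** 2
    return sq_err.mean()
```

Because every observation is predicted exactly once, the average over the n squared errors is the estimate in (3.5).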
Another alternative to MCV is the Repeated Learning Test (RLT) introduced by Breiman et al. (1984) and further studied by Burman (1989) and Zhang (1993). It is also called Monte Carlo cross-validation (MCCV); see Picard and Cook (1984). Instead of summing over all possible subsets of size d as in MCV, it takes B random subsets S^*_i of size d and estimates the true squared prediction error by

\mathrm{MCCV}_q = \frac{1}{Bd} \sum_{i=1}^{B} \|y_{S^*_i} - \hat{f}^{-S^*_i}(x_{S^*_i})\|^2. \qquad (3.6)

Zhang (1993) proved that, under certain assumptions,

k\text{-CV}_q = \mathrm{MCV}_q + o_p(1) \quad \text{for } q < q_0,

where q_0 is the true number of nonzero coefficients in the linear model. In addition, he showed under certain assumptions that if B/n^2 \to \infty, then

\mathrm{MCCV}_q = \mathrm{MCV}_q + o_p(n^{-1}).

This basically means that we can reduce the computational complexity of the delete-d CV method from exponential to polynomial (just over second order). Furthermore, numerical results in his study show that MCCV (or RLT, as he calls it) performs better than k-CV. Shao (1993, Theorem 2) established variable-selection consistency of MCCV under several assumptions, including

n_v/n \to 1, \quad n_t \to \infty, \quad \text{and} \quad n^2/(B n_t^2) \to 0.
3.2 MCCV Estimate of F^{(M)}_{n_t}
Given the n observations (x_i, y_i), 1 \le i \le n, we sample n_t observations without replacement, where n_t < n. We apply method M to this training sample, denoted by S, to obtain the estimate \hat\beta_{M,n_t} and the prediction errors e_i(M; n_t) = y_i - \hat\beta^\top_{M,n_t} x_i for i \notin S. Repeating this procedure B times yields the set of prediction errors \{g(e_{bi}(M; n_t)) : i \notin S_b, b = 1, 2, \ldots, B\}. The empirical distribution \hat{F}^{(M)}_{n_t} of this set puts weight \{B(n - n_t)\}^{-1} on each prediction error g(e_{bi}(M; n_t)) and is the MCCV estimate of F^{(M)}_{n_t}.
Since S_b is a random sample of size n_t from the set of i.i.d. observations (x_i, y_i), i = 1, 2, \ldots, n, it follows that \{e_{bi}(M; n_t) : i \notin S_b, b = 1, 2, \ldots, B\} is a set of identically distributed random variables having the same distribution as e(M; n_t), which is defined in (3.2) with n replaced by n_t. Hence, the empirical distribution \hat{F}^{(M)}_{n_t} of these B(n - n_t) random variables is an unbiased estimate of F^{(M)}_{n_t}.
3.3 Choice of the Training Sample Size n_t
The choice of the validation sample size n_v = n - n_t in cross-validation has been considered by Shao (1993) and Yang (2008) in the context of consistent variable selection in the regression model (2.1) when n \to \infty but with p fixed. They use the sum of squared prediction errors in multifold or Monte Carlo cross-validation as the model selection criterion. Shao (1993) showed that

n_t \to \infty \quad \text{and} \quad \frac{n_v}{n} \to 1 \qquad (3.7)

is a sufficient condition to ensure consistency, i.e., that the selected model is the minimal correct model (including all nonzero regression coefficients) with probability approaching 1 as n \to \infty. Yang (2008, Proposition 1) later showed that condition (3.7) is also necessary. Since our goal is to estimate F^{(M)}_{n_t} itself, rather than only its second moment, in the case p ≫ n (instead of p fixed as n \to \infty), we consider whether (3.7) ensures that the MCCV estimate \hat{F}^{(M)}_{n_t} is a consistent estimate of F^{(M)}_{n_t}. In the next section, we show that under condition (3.7),

\mathrm{Var}(\hat{F}^{(M)}_{n_t}(A)) \to 0 \quad \text{as } n \to \infty, \qquad (3.8)

uniformly over all Borel subsets A of the real line.
Because p " n, the training sample size nt = o(n) implied by condition (3.7)
causes a large bias in approximating F (M )n by F (M )
nt = E(F (M )nt ). Although one would
like to use nt ∼ n instead to make the bias negligible, the random samples of size
nt drawn without replacement from {(xi, y∗i ), 1 ≤ i ≤ n} have so much overlap
that the effective number of Monte Carlo simulations would be small even though
a large number B of actual Monte Carlo runs are performed. This results in sub-
stantial variance in F (M )nt (A) unless (β − βM ,nt )
"x is small; details are given in next
Section 3.4.
Thus, a suitable choice of $n_t$ amounts to a bias-variance tradeoff in using $F^{(M)}_{n_t}$ to estimate $F^{(M)}_n$. A natural compromise between condition (3.7), which yields large bias and small variance, and the contrary case $n_t \sim n$ is

$n_t \sim (1 - \epsilon)n$, with $0 < \epsilon < 1/2$.   (3.9)

In particular, we recommend using $\epsilon = 0.1$, which corresponds to 10-fold cross-validation for $n \ge 200$. Zhang (1993) has shown that as $n \to \infty$ but with p fixed, the average squared prediction error criterion in multifold or Monte Carlo cross-validation is inconsistent under (3.9), and his Corollary 1 gives the limiting distribution of the selected model. In Section 3.4, we consider the case $p \gg n$ and analyze the performance of $\hat F^{(M)}_{n_t}$ under (3.9). In particular, we prove the following.
Theorem 3.1. Suppose p " n and condition(3.9) holds. If
(β − βM ,nt )"x −→P 0 as nt →∞, (3.10)
i.e., βM ,nt is consistent estimator for β, then (3.8) holds.
Proof. See section 3.4.
3.4 Asymptotic Theory of MCCV
Section 3.3 has used some asymptotic properties of the MCCV estimate $\hat F^{(M)}_{n_t}$ of $F^{(M)}_{n_t}$ to arrive at an appropriate choice, from the bias versus variance viewpoint, of the training sample size $n_t$ for cross-validation. We provide here the technical details for the asymptotic theory. Whereas previous authors, e.g., Shao (1993), considered the case of fixed p and the sum of squared prediction errors for the least squares method M, and thereby could make use of explicit formulas for these squared prediction errors in their analysis, we need more general arguments to handle $p \gg n$ and unspecified methods M.
As shown in Section 3.2, we have $E(\hat F^{(M)}_{n_t}(A)) = F^{(M)}_{n_t}(A)$ for every Borel set A. Therefore, $\hat F^{(M)}_{n_t}(A) - F^{(M)}_{n_t}(A)$ is the average of $B n_v$ zero-mean random variables of the form

$I^b_i = \mathbf{1}\{g(e^b_i(M; n_t)) \in A\} - P\{g(e(M; n_t)) \in A\}$, for $i \notin S_b$ and $1 \le b \le B$.   (3.11)

Although the indicator variables are identically distributed, they are correlated because $S_b$ is a random set of size $n_t$ drawn without replacement from $\{(x_i, y_i), 1 \le i \le n\}$ and $\hat\beta^{(b)}_{M,n_t}$ is estimated from $S_b$, while $e^b_i(M; n_t) = y_i - \hat\beta^{(b)\top}_{M,n_t} x_i$ for $i \notin S_b$.
Suppose (3.7) holds. Since (3.7) is equivalent to $n_t \to \infty$ and $n_t = o(n)$, the sets $S_b$ and $S_{b'}$ are asymptotically independent for $b \ne b'$ as $n \to \infty$. Hence $e^b_i(M; n_t)$ and $e^{b'}_j(M; n_t)$ are asymptotically independent, and therefore $E(I^b_i I^{b'}_j) = o(1)$, uniformly for $b \ne b'$ and $i \ne j$; the uniformity follows from the fact that these are exchangeable random variables. Therefore,

$\sum_{b \ne b'} \sum_{i \ne j} E(I^b_i I^{b'}_j)\big/ (B^2 n_v^2) \to 0$, as $n \to \infty$.   (3.12)

Since $\sum_{b=1}^{B} \sum_{i \ne j} E(I^b_i I^b_j) = O(B n_v^2)$, and a similar bound holds for the sum over $i = j$ and $b \ne b'$, it follows that (3.8) holds under (3.7).
We next consider the case $n_t \sim (1 - \epsilon)n$ with $0 < \epsilon < 1/2$. For $b \ne b'$, there is substantial overlap between $S_b$ and $S_{b'}$; in fact, the size of $S_b \cap S_{b'}$ is at least $(1 - 2\epsilon + o(1))n$. For $i \notin S_b$ and $j \notin S_{b'}$,

$E(I^b_i I^{b'}_j) = P\{g(e^b_i(M; n_t)) \in A,\ g(e^{b'}_j(M; n_t)) \in A\} - P^2\{g(e(M; n_t)) \in A\}.$   (3.13)

By (3.2), we have for $i \notin S_b$ and $j \notin S_{b'}$,

$e^b_i(M; n_t) = (\beta - \hat\beta^{(b)}_{M,n_t})^{\top} x_i + \epsilon_i, \qquad e^{b'}_j(M; n_t) = (\beta - \hat\beta^{(b')}_{M,n_t})^{\top} x_j + \epsilon_j.$

If (3.10) holds, then $e^b_i(M; n_t)$ and $e^{b'}_j(M; n_t)$ are still asymptotically independent for $b \ne b'$ and $i \ne j$. Hence (3.12), and therefore (3.8), still hold, proving Theorem 3.1.

On the other hand, if (3.10) does not hold, then $e^b_i(M; n_t)$ and $e^{b'}_j(M; n_t)$ are correlated because of the overlap between $S_b$ and $S_{b'}$. In this case, because of exchangeability, the left-hand side of (3.12) is bounded away from 0 and (3.8) no longer holds.
For the case of fixed p as n →∞, (3.10) holds if the method M contains all regres-
sors with nonzero regression coefficients. Even though (3.8) holds in this case, the av-
erage squared prediction error criterion in multi-fold or Monte Carlo cross-validation
is still inconsistent, as shown by Zhang (1993), because M may not be associated
with the smallest correct model. The Monte Carlo cross-validation method in this
chapter aims at choosing the regularization method rather than the regularization
parameters as in Zhang (1993), or in the choice of λ in (2.2), (2.3), (2.15) or (2.30),
or in the choice of k in (2.47).
3.5 Comparing the Prediction Performance of Two Methods

Since $\hat F^{(M)}_{n_t}$ uses a training sample of size $n_t$ satisfying condition (3.9), it tends to be stochastically larger than $F^{(M)}_n$ when g(e) is nonnegative and increasing in |e|. To compare the predictive performance of two methods $M_1$ and $M_2$ on a given data set $\{(x_i, y_i), 1 \le i \le n\}$, we can mitigate this upward bias by not estimating $F^{(M_1)}_{n_t}$ and $F^{(M_2)}_{n_t}$ separately. Instead we use MCCV to estimate the distribution of the difference $g(e(M_1; n)) - g(e(M_2; n))$ in prediction errors between $M_1$ and $M_2$. This point will be illustrated in Chapter 4.
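The paired comparison can be sketched as follows: both methods are fit on the same training split and evaluated on the same held-out points, so the common upward bias from $n_t < n$ largely cancels in the difference. The helper name `mccv_spe_difference` is our own; this is a sketch under the assumption that each fit returns a coefficient vector:

```python
import numpy as np

def mccv_spe_difference(X, y, fit1, fit2, n_t, B=100, rng=None):
    """MCCV estimate of the distribution of SPE_{M1} - SPE_{M2}.
    The same random split S_b is reused for both methods, so the
    differences are paired (a hypothetical helper, not the thesis code)."""
    rng = np.random.default_rng(rng)
    n = len(y)
    diffs = []
    for _ in range(B):
        train = rng.choice(n, size=n_t, replace=False)
        test = np.setdiff1d(np.arange(n), train)
        b1 = fit1(X[train], y[train])
        b2 = fit2(X[train], y[train])
        e1 = (y[test] - X[test] @ b1) ** 2      # SPE of method M1
        e2 = (y[test] - X[test] @ b2) ** 2      # SPE of method M2
        diffs.extend(e1 - e2)                   # paired differences
    return np.asarray(diffs)
```

A mostly negative returned sample then indicates that $M_1$ predicts better than $M_2$ on this data set.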
Chapter 4
Simulation Studies
In this chapter, we compare different regularization methods in a variety of scenar-
ios, and illustrate how MCCV can be used to choose an appropriate regularization
when the data-generating mechanism for the scenario is unknown, which is typically
the case in practical applications. The scenario in Section 4.1 corresponds to strong
sparsity and that in Section 4.2 is weakly sparse. The simulation studies show that
the squared prediction errors of OGA+HDBIC and OGA+HDAIC are stochastically
smaller and have tighter distributions than those of other methods. Moreover, the
MCCV estimates of these distributions show the same patterns as the true squared
prediction error distributions. The scenario in Section 4.3 is not weakly sparse and
the simulation results show that OGA+HDIC performs substantially worse than
other regularization methods. This further confirms that the weak sparsity condition
imposed for the theory of OGA+HDIC is critical, and that MCCV can help the user
determine which regularization method should be used for the problem at hand. The
simulation study in Section 4.3 also shows the advantage of the Max1,2 regularization
over other regularization methods in this scenario that is not weakly sparse.
4.1 Strongly Sparse Scenario
Consider the regression model

$y_t = \sum_{j=1}^{q} \beta_j x_{tj} + \sum_{j=q+1}^{p} \beta_j x_{tj} + \epsilon_t, \quad t = 1, \dots, n,$   (4.1)

where $\beta_{q+1} = \cdots = \beta_p = 0$, the $\epsilon_t$ are i.i.d. $N(0, \sigma^2)$ and independent of the i.i.d. vectors $x_t = (x_{t1}, x_{t2}, \dots, x_{tp})$.
Example 4.1. Assume the following in (4.1):

• n = 500, p = 4000, and q = 9, where q is the number of nonzero coefficients.

• $(\beta_1, \dots, \beta_q) = (3.2, 3.2, 3.2, 3.2, 4.4, 4.4, 3.5, 3.5, 3.5)$, and $\beta_j = 0$ for $q < j \le p$.

• $\sigma^2 = 2.25$.

• $x = (x_1, x_2, \dots, x_p) \sim N(\mu, \Sigma)$, with $\mu = (1, 1, \dots, 1)_{1\times p}$ and

$\Sigma = \begin{pmatrix} 1+\eta^2 & \eta & \cdots & \eta \\ \eta & 1+\eta^2 & \cdots & \eta \\ \vdots & \vdots & \ddots & \vdots \\ \eta & \eta & \cdots & 1+\eta^2 \end{pmatrix},$

where η = 1.
In view of Theorem 2.4, which requires the number of iterations $K_n$ to satisfy $K_n = O((n/\log n)^{1/2})$, we choose $K_n = \lfloor 5(n/\log n)^{1/2}\rfloor$ for OGA. Here and in the sequel, we choose the constant C = 2.01 for HDAIC.
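One way to draw data from the Example 4.1 design is sketched below. The factor construction is our own device for sampling from the stated equicorrelated Σ (diagonal $1+\eta^2$, off-diagonal η) without forming the p × p matrix, and `simulate_example_4_1` is a hypothetical helper:

```python
import numpy as np

def simulate_example_4_1(n=500, p=4000, eta=1.0, sigma2=2.25, rng=None):
    """One draw from the Example 4.1 design (requires p >= 9).
    Writing x_tj = 1 + a*z_tj + c*w_t with a^2 = 1 + eta^2 - eta and
    c^2 = eta gives Var(x_j) = 1 + eta^2 and Cov(x_j, x_k) = eta."""
    rng = np.random.default_rng(rng)
    beta = np.zeros(p)
    beta[:9] = [3.2, 3.2, 3.2, 3.2, 4.4, 4.4, 3.5, 3.5, 3.5]
    z = rng.normal(size=(n, p))            # idiosyncratic part
    w = rng.normal(size=(n, 1))            # common factor shared within a row
    x = 1.0 + np.sqrt(1.0 + eta**2 - eta) * z + np.sqrt(eta) * w
    eps = rng.normal(scale=np.sqrt(sigma2), size=n)
    y = x @ beta + eps
    return x, y, beta
```

The factor trick keeps memory at O(np) instead of O(p^2), which matters at p = 4000 and is essentially free for the simulation sizes used in this chapter.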
4.1.1 Sure Screening and Comparison with Other Methods
Under strong sparsity, the theoretical results in Section 2.3.3 show that OGA+HDIC has the “sure screening property”, i.e., it selects all the relevant variables.
The concept of sure screening was introduced by Fan & Lv (2008), who also proposed
a method called “sure independence screening” (SIS) that is based on correlation
learning and has the sure screening property in sparse high-dimensional regression
models satisfying certain conditions. SIS selects d regressors whose sample correlation
coefficients with yt have the largest d absolute values. Although SIS with suitably
chosen d = dn has been shown by Fan and Lv (2008, Section 5) to have the sure
screening property without the irrepresentable (or neighborhood stability) condition
mentioned in section 2.1.1 in connection with Lasso, it requires an assumption on the
maximum eigenvalue of the covariance matrix of the candidate regressors that can
fail to hold when all regressors are equally correlated, as is the case for this example.
Fan & Lv (2010) also proposed an iterative modification of SIS combined with SCAD, called ISIS, for variable selection in sparse linear and generalized linear models. We will
compare the “sure screening property” of OGA+HDIC with that of SIS, SIS+SCAD,
ISIS+SCAD, and Lasso for the present example.
For variable-selection consistency of Lasso, the neighborhood stability condition is required to hold: for some $0 < \delta < 1$ and all $i = q+1, \dots, p$,

$|c_{qi}^{\top} R^{-1}(q)\,\mathrm{sign}(\beta(q))| < \delta$,   (4.2)

where $x_t(q) = (x_{t1}, \dots, x_{tq})^{\top}$, $c_{qi} = E(x_t(q) x_{ti})$, $R(q) = E(x_t(q) x_t^{\top}(q))$, and $\mathrm{sign}(\beta(q)) = (\mathrm{sign}(\beta_1), \dots, \mathrm{sign}(\beta_q))^{\top}$. Straightforward calculations give $c_{qi} = \eta^2 \mathbf{1}_q$, $R^{-1}(q) = I - \{\eta^2/(1 + \eta^2 q)\}\mathbf{1}_q \mathbf{1}_q^{\top}$, and $\mathrm{sign}(\beta(q)) = \mathbf{1}_q$, where $\mathbf{1}_q$ is the q-dimensional vector of 1's. Therefore, for all $i = q+1, \dots, p$, $|c_{qi}^{\top} R^{-1}(q)\,\mathrm{sign}(\beta(q))| = \eta^2 q/(1 + \eta^2 q) < 1$, so (4.2) indeed holds in this example. Under (4.2) and some other conditions, Meinshausen and Buhlmann (2006, Theorems 1 and 2) have shown that if $r = r_n$ in the Lasso estimate (3.21) converges to 0 at a rate slower than $n^{-1/2}$, then $\lim_{n\to\infty} P(L_n = N_n) = 1$, where $L_n$ is the set of regressors whose regression coefficients, as estimated by Lasso($r_n$), are nonzero. On the other hand, Fan & Lv (2008) imposed the condition

$\lambda_{\max}(\Gamma(\{1, \dots, p\})) \le c n^r$, for some $c > 0$ and $0 \le r < 1$,   (4.3)

for SIS to have the sure screening property. Here we have

$\max_{1 \le \#(J) \le \nu} \lambda_{\max}(\Gamma(J)) = (1 + \nu\eta^2)/(1 + \eta^2)$,

which violates (4.3) for $J = \{1, \dots, p\}$ and $p \gg n$.
We run 1000 simulations to evaluate the performance of each method. Following Wang (2009), we define the percentage of correct zeros to characterize a method's ability to produce sparse models:

$\%\ \text{of correct zeros} = \frac{1}{1000}\sum_{b=1}^{1000} \frac{1}{p-q} \sum_{j=1}^{p} \mathbf{1}\{\hat\beta_{j,b} = 0\}\cdot\mathbf{1}\{\beta_j = 0\}$,

where $\hat\beta^{(b)} = (\hat\beta_{1,b}, \dots, \hat\beta_{p,b})$ is the vector of fitted coefficients in the bth replicate. We also define the percentage of incorrect zeros to characterize the degree of under-fitting of the model:

$\%\ \text{of incorrect zeros} = \frac{1}{1000}\sum_{b=1}^{1000} \frac{1}{q} \sum_{j=1}^{p} \mathbf{1}\{\hat\beta_{j,b} = 0\}\cdot\mathbf{1}\{\beta_j \ne 0\}$.
We define the “coverage probability” to gauge the sure screening performance of a method, and the “correctly fitted” probability to measure the chance that the fitted model consists of exactly the relevant variables:

$\text{Coverage probability} = \frac{1}{1000}\sum_{b=1}^{1000}\mathbf{1}\{T \subseteq S(b)\}$,

$\text{Correctly fitted} = \frac{1}{1000}\sum_{b=1}^{1000}\mathbf{1}\{T = S(b)\}$,

where $T = \{1, 2, \dots, q\}$ and $S(b) = \{j : \hat\beta_{j,b} \ne 0\}$.
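The four criteria above can be computed from the selected variable sets as in the following sketch. The helper name `selection_metrics` is our own, and the normalizations (over the p − q true zeros for correct zeros, over the q relevant variables for incorrect zeros) are our reading of the garbled source:

```python
import numpy as np

def selection_metrics(selected_sets, true_support, p):
    """Summarize B replicates of variable selection by the criteria above
    (a hypothetical helper; indices are 0-based sets of selected variables)."""
    T = set(true_support)
    q = len(T)
    B = float(len(selected_sets))
    correct_zero = incorrect_zero = cover = exact = nvars = 0.0
    for S in selected_sets:
        S = set(S)
        zeros = set(range(p)) - S                 # variables estimated as zero
        correct_zero += len(zeros - T) / (p - q)  # true zeros correctly zeroed
        incorrect_zero += len(zeros & T) / q      # relevant variables missed
        cover += float(T <= S)                    # sure screening: T subset of S(b)
        exact += float(T == S)                    # exactly the relevant variables
        nvars += len(S)
    return {"correct_zeros": correct_zero / B,
            "incorrect_zeros": incorrect_zero / B,
            "coverage": cover / B,
            "correctly_fitted": exact / B,
            "avg_vars": nvars / B}
```

For example, with p = 10, true support {0, 1}, and three replicates selecting {0, 1}, {0, 1, 2}, and {0}, the coverage probability is 2/3 and the correctly fitted probability is 1/3.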
Method      η   Cover. prob.   Correct zeros   Incorrect zeros   Correctly fitted   # of vars
SIS         1   0.32           98.5%           12.67%            0.00               66.00
SIS+SCAD    1   0.34           99.51%          12.00%            0.00               27.43
ISIS+SCAD   1   1              98.57%           0.00%            0.00               65.81
OGA+HDBIC   1   1              99.99%           0.00%            0.97                9.30
Lasso       1   1              98.84%           0.00%            0.00               55.10

Table 4.1: Sure screening results for Example 4.1
Table 4.1 shows the sure screening results for SIS, SIS+SCAD, ISIS+SCAD, Lasso, and OGA+HDBIC in Example 4.1. As expected, SIS, which uses a single screening step based on marginal correlations, cannot identify all the relevant variables, since the irrelevant variables are correlated with the relevant ones in this case. On the other hand, OGA+HDBIC, ISIS+SCAD, and Lasso can select all the relevant variables, and OGA+HDBIC has the highest “correctly fitted” probability, which is close to 1.
To evaluate the prediction performance, define

$\text{MSPE} = \frac{1}{1000}\sum_{l=1}^{1000}\Big(\sum_{j=1}^{p} \hat\beta_j x^{(l)}_{n+1,j} - y^{(l)}_{n+1}\Big)^2,$   (4.4)

in which $x^{(l)}_{n+1,1}, \dots, x^{(l)}_{n+1,p}$ are the regressors associated with $y^{(l)}_{n+1}$, the new outcome in the lth simulation run, and $\hat y^{(l)}_{n+1} = \sum_{j=1}^{p}\hat\beta_j x^{(l)}_{n+1,j}$ denotes the predictor of $y^{(l)}_{n+1}$. Table 4.2, which gives the MSPE for OGA+HDIC, ISIS+SCAD, and Lasso with η = 1 and 3, shows that OGA+HDIC performs much better than Lasso and ISIS+SCAD. Moreover, the MSPE of OGA+HDIC is quite close to the oracle value $q\sigma^2/n$, since OGA+HDIC almost always selects exactly the q = 9 relevant variables.
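The oracle benchmark $q\sigma^2/n$ quoted above can be checked by simulation: the sketch below fits ordinary least squares on the q relevant regressors only and averages the excess squared prediction error at a fresh design point over L runs. This is our own sanity check, not the dissertation's code, and `oracle_mspe` is a hypothetical name:

```python
import numpy as np

def oracle_mspe(n=500, q=9, sigma2=2.25, L=500, rng=None):
    """Monte Carlo estimate of the oracle excess prediction error:
    OLS on the q relevant regressors, squared error of the fitted mean
    at a fresh x, averaged over L independent runs (about q*sigma^2/n)."""
    rng = np.random.default_rng(rng)
    beta = rng.uniform(3.0, 4.5, size=q)          # any fixed coefficients work
    errs = []
    for _ in range(L):
        X = rng.normal(size=(n, q))
        y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
        b_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        x_new = rng.normal(size=q)
        errs.append((x_new @ (b_hat - beta)) ** 2)  # excess prediction error
    return float(np.mean(errs))
```

With n = 500, q = 9, and σ² = 2.25 the average comes out near 9 × 2.25/500 ≈ 0.041, consistent with the OGA+HDBIC entries in Table 4.2.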
Method      η   E     E+1   E+2   E+3   Correct   MSPE
OGA+HDBIC   1   982    16     1     1   1000      0.052
ISIS+SCAD   1     0     0     0     0   1000      0.491
Lasso       1     0     0     0     0   1000      0.392
OGA+HDBIC   3   621   313    57     9   1000      0.061
ISIS+SCAD   3    10    25    50    15   1000      0.780
Lasso       3     0     0     0     0   1000      0.530

Table 4.2: MSPE results for Example 4.1, and frequency, in 1000 simulations, of including all 9 relevant variables (Correct), of selecting exactly the relevant variables (E), and of selecting all relevant variables plus i irrelevant variables (E+i).
4.1.2 MCCV Analysis
In this section, we use the MCCV method proposed in Chapter 3 to choose the best among five regularization methods, namely OGA+HDAIC and OGA+HDBIC (L0-regularization), Lasso (L1-regularization), ridge regression (L2-regularization), and Elastic Net. As mentioned in Chapter 3, we aim at estimating the squared prediction error distribution $F^{(M)}_{n_t}$ with $n_t \sim (1-\epsilon)n$ rather than $F^{(M)}_n$. One of our underlying assumptions is that $F^{(M)}_{n_t}$ should be quite close to $F^{(M)}_n$ if $n_t$ is chosen appropriately. This assumption is reasonable for this example with $\epsilon = 0.1$, i.e., $n_t = 450$; Table 4.3 gives the 5-number summaries, together with the means, for the squared prediction error distributions of the different methods with n = 450 and 500.
n = 500      OGA+HDBIC   OGA+HDAIC   Lasso      ENet       Ridge
  Min        2.20e-11    2.20e-11    1.17e-06   7.36e-09   1.09e-03
  1st Qu.    0.00468     0.00565     0.0384     0.0506     10.59
  Median     0.02056     0.02553     0.1769     0.2289     46.94
  3rd Qu.    0.06171     0.07738     0.5109     0.6603     136.65
  Max.       1.049       3.0335      6.8630     8.9433     1124.20
  Mean       0.05165     0.06962     0.3920     0.5039     101.55

n = 450      OGA+HDBIC   OGA+HDAIC   Lasso      ENet       Ridge
  Min        9.43e-11    9.43e-11    5.53e-13   7.36e-09   2.05e-03
  1st Qu.    0.00479     0.00572     0.0399     0.0518     10.79
  Median     0.02106     0.02582     0.1816     0.2346     48.42
  3rd Qu.    0.06441     0.07993     0.5259     0.6792     140.60
  Max.       1.198       3.454       7.711      10.09      993.60
  Mean       0.05214     0.07113     0.4056     0.5204     102.80

Table 4.3: 5-number summary and mean of $F^{(M)}_n$ with n = 500 and 450 in Example 4.1
Figure 4.1, left panel, shows the simulated squared prediction error distributions for the different procedures (except ridge regression, whose much larger values would inflate the vertical scale). These distributions, denoted by $F^{(M)}_{450}$ for each method M, are computed by simulating 1000 replicates of a training sample of size 450 (instead of n = 500), together with an independent observation in each simulation run as the test sample, as in Table 4.3. The boxplots show that OGA+HDBIC works best among the four regularization methods.
Figure 4.1: MCCV performance for Example 4.1 (left panel: simulated squared prediction error distributions; right panel: MCCV estimates; boxplots for OGA+HDBIC, OGA+HDAIC, Lasso, and ENet); the solid line in the center of each box shows its median.
In comparison, we use a particular data set from these 1000 simulated samples of size n = 500 to estimate $F^{(M)}_{450}$ by MCCV with $n_t = 450$ and B = 100, and Figure 4.1, right panel, gives the MCCV estimate of $F^{(M)}_{450}$ for each of the four methods. The boxplots show that the MCCV estimate is close to $F^{(M)}_{450}$ for each method M, and that OGA+HDBIC again works best among the four regularization methods. Table 4.4 gives the 5-number summaries of the MCCV estimates for each of the four regularization methods. Comparison with Table 4.3 shows that the MCCV estimates approximate the true squared prediction error distributions quite well.
Method M   OGA+HDBIC   OGA+HDAIC   Lasso      ENet
Min        1.76e-11    1.76e-11    4.64e-08   2.33e-08
1st Qu.    0.00544     0.00594     0.0398     0.0512
Median     0.02436     0.02648     0.1650     0.2195
3rd Qu.    0.06499     0.07705     0.5145     0.6591
Max.       1.1584      3.316       7.198      9.823
Mean       0.05217     0.06978     0.3979     0.5174

Table 4.4: 5-number summary together with the mean for MCCV estimates of $F^{(M)}_{450}$ in Example 4.1
4.2 Weakly Sparse Scenario
Under weak sparsity, $\sup_{n\ge 1}\sum_{j=1}^{p_n} |\beta_j| < \infty$ and all $\beta_j$ may be nonzero. This includes the algebraic decay and exponential decay cases mentioned at the end of Section 2.3.3.
Consider the regression model

$y_t = \sum_{j=1}^{p} \beta_j x_{tj} + \epsilon_t, \quad t = 1, \dots, n,$   (4.5)

where p > n, the $\epsilon_t$ are i.i.d. $N(0, \sigma^2)$ and independent of $(x_{t1}, x_{t2}, \dots, x_{tp})$, which are i.i.d. $N(0, \Sigma)$ with

$\Sigma = \begin{pmatrix} 1+\eta^2 & \eta & \cdots & \eta \\ \eta & 1+\eta^2 & \cdots & \eta \\ \vdots & \vdots & \ddots & \vdots \\ \eta & \eta & \cdots & 1+\eta^2 \end{pmatrix}.$
Example 4.2. (Exponential decay). Consider (4.5) with the following parameter specifications:

• n = 500, p = 1000 or 2000;

• $\beta_j = \kappa \exp(-a j)$, $j = 1, 2, \dots, p$, where a = 0.25 and κ = 5, 10, or 15;

• σ = 1, η = 0 or 1.

In view of Theorem 2.4, which requires the number of iterations $K_n$ to satisfy $K_n = O((n/\log n)^{1/2})$, we choose $K_n = \lfloor 5(n/\log n)^{1/2}\rfloor$ in OGA, and choose the constant C = 2.01 for HDAIC.
OGA+HDAIC Performance
Table 4.5 gives the mean squared prediction error (MSPE) of OGA+HDAIC and other
regularization methods, based on 1000 simulations. The simulated MSPE is defined
as in (4.4).
As expected from the theory, OGA+HDAIC is significantly better than other
methods. As κ increases, the MSPE of OGA+HDAIC remains nearly constant, while
the MSPEs of other methods increase substantially, especially for Elastic Net and
ridge regression. Note that ridge regression introduces large bias for large coefficients.
η   n     p      κ    OGA+HDAIC   OGA+HDBIC   Lasso     ENet      Ridge
0   500   1000   5    0.0349      0.0756      0.0839    0.1198    2.7851
                 10   0.0432      0.0509      0.0967    0.1422    10.2270
                 15   0.0368      0.0888      0.1032    0.1511    22.4633
          2000   5    0.0420      0.0821      0.1013    0.1477    3.3285
                 10   0.0508      0.0496      0.1164    0.1759    12.6583
                 15   0.0433      0.0923      0.1236    0.1886    28.1109
1   500   1000   5    0.04026     0.07518     0.07351   0.10684   2.77476
                 10   0.04951     0.07093     0.08352   0.12219   10.15975
                 15   0.04326     0.07548     0.08892   0.12987   22.31791
          2000   5    0.05058     0.09533     0.10098   0.14223   3.22291
                 10   0.06034     0.07193     0.10859   0.15902   12.73390
                 15   0.05098     0.09788     0.11503   0.16934   28.09917

Table 4.5: MSPEs of different methods in Example 4.2
MCCV Analysis
Here we focus on the case n = 500, p = 2000, η = 1, and κ = 10, and use MCCV
to choose the best among the five regularization methods mentioned in Section 4.1
and PGA, which we also include since Buhlmann (2006) has shown that it has good
performance in weakly sparse models.
1. MCCV estimates of squared prediction error distributions:
As mentioned in Chapter 3, we aim at estimating the squared prediction error distribution $F^{(M)}_{n_t}$ with $n_t \sim (1-\epsilon)n$ rather than $F^{(M)}_n$. Table 4.6 gives the 5-number summaries, together with the means, of the squared prediction error distributions of the different methods with n = 450 and 500, which shows that $F^{(M)}_{450}$ is quite close to $F^{(M)}_{500}$ in the scenario we consider.
n = 500     OGA+HDAIC   OGA+HDBIC   Lasso      ENet       PGA        Ridge
  Min       1.38e-10    7.19e-10    6.31e-08   2.43e-08   1.75e-09   1.15e-06
  1st Qu.   0.00571     0.00673     0.01093    0.01666    0.01354    1.26794
  Median    0.02620     0.03120     0.04853    0.07117    0.06177    5.70844
  3rd Qu.   0.07740     0.09124     0.14302    0.20821    0.18162    16.86109
  Max.      1.26823     1.59471     1.76220    2.82896    3.13588    198.43241
  Mean      0.06034     0.07193     0.10859    0.15902    0.13939    12.73390

n = 450     OGA+HDAIC   OGA+HDBIC   Lasso      ENet       PGA        Ridge
  Min       2.58e-10    9.16e-10    8.45e-08   4.41e-08   6.77e-09   1.02e-05
  1st Qu.   0.00661     0.00693     0.01181    0.01546    0.01452    1.36701
  Median    0.02712     0.03208     0.05023    0.07403    0.06332    6.10822
  3rd Qu.   0.08134     0.09322     0.15267    0.22149    0.19501    17.06214
  Max.      1.4632      1.7914      2.0322     2.9689     3.2982     200.31251
  Mean      0.06126     0.07393     0.11029    0.16515    0.14291    12.9392

Table 4.6: 5-number summary and mean of $F^{(M)}_n$ with n = 500 and 450 in Example 4.2
Figure 4.2, left panel, shows the simulated squared prediction error distributions for the different regularization methods. These distributions, denoted by $F^{(M)}_{450}$ for each method M, are computed from 1000 replicates of a training sample of size 450 (instead of n = 500), together with an independent observation in each simulation run as the test sample. The boxplots show that OGA+HDAIC works best among the five regularization procedures shown. In comparison, we use one data set from these 1000 simulated samples of size n = 500 to estimate $F^{(M)}_{450}$ by MCCV with $n_t = 450$ and B = 100, and Figure 4.2, right panel, gives the MCCV estimate of $F^{(M)}_{450}$ for each of the five methods. The boxplots show that OGA+HDAIC again works best among the five regularization methods. Table 4.7 gives the 5-number summaries of the MCCV estimates for each of the regularization methods. Comparison with Table 4.6 shows that the MCCV estimate is close to $F^{(M)}_{450}$ for each method M.
Figure 4.2: Boxplots of MCCV performance for Example 4.2 (left panel: squared prediction error distributions; right panel: MCCV estimates; for OGA+HDBIC, OGA+HDAIC, Lasso, ENet, and PGA)
Method     OGA+HDAIC   OGA+HDBIC   Lasso      ENet       PGA        Ridge
Min        1.22e-08    6.24e-09    9.31e-08   4.58e-08   4.74e-09   1.23e-05
1st Qu.    0.00672     0.00721     0.01307    0.01777    0.01794    1.16769
Median     0.02967     0.03307     0.05810    0.08330    0.07344    6.41484
3rd Qu.    0.08506     0.09978     0.16230    0.24772    0.20562    17.38538
Max.       1.72319     1.87355     2.09284    2.87262    2.97496    196.17823
Mean       0.06303     0.07181     0.11512    0.16339    0.14260    12.98136

Table 4.7: 5-number summary for MCCV estimates of $F^{(M)}_{450}$ in Example 4.2
2. Prediction error differences between two procedures:
Figure 4.3: Distribution of $\text{SPE}_{OGA+HDAIC} - \text{SPE}_{Lasso}$ for Example 4.2 (left boxplot: true distribution; right boxplot: MCCV estimate)
We can use the method in Section 3.5 to compare OGA+HDAIC with Lasso in this example. In Figure 4.3, the left boxplot shows the true distribution of the squared prediction error difference $\text{SPE}_{OGA+HDAIC} - \text{SPE}_{Lasso}$, evaluated from 1000 simulations, and the right boxplot shows the MCCV estimate of this distribution based on one simulated data set. Table 4.8 gives the 5-number summaries and the means of the true distribution and its MCCV estimate, showing that MCCV estimates the true distribution well.
           True distribution   MCCV estimate
Min        -2.5967             -2.1382
1st Qu.    -0.1041             -0.1081
Median     -0.0923             -0.0958
3rd Qu.     0.0072              0.0058
Max.        1.2069              0.7398
Mean       -0.0723             -0.0795

Table 4.8: 5-number summary for the distribution of squared prediction error differences between OGA+HDAIC and Lasso, and for its MCCV estimate

3. Comparison of MCCV estimates based on two different data sets generated from the regression model:
A natural question concerning the MCCV estimate is whether different data sets generated from the same regression model yield different conclusions when the squared prediction error distributions are compared via their MCCV estimates. Figure 4.4 considers two such data sets and shows that the comparison of the five regularization methods is stable across the two data sets.
Figure 4.4: MCCV estimates based on two simulated data sets in Example 4.2 (each panel: squared prediction error distributions estimated by MCCV for OGA+HDBIC, OGA+HDAIC, Lasso, ENet, and PGA)
Example 4.3. (Algebraic decay). Consider (4.5) with the following parameter specifications:

• n = 500, p = 1000 or 2000;

• $\beta_j = \kappa j^{-a}$, $j = 1, 2, \dots, p$, where a = 1.5 and κ = 5, 10, or 15;

• σ = 0.1, η = 0 or 1.

Choose $K_n = \lfloor 5(n/\log n)^{1/2}\rfloor$ for OGA and C = 2.01 for HDAIC, as in Example 4.2.
OGA+HDAIC Performance
Table 4.9 gives the mean squared prediction error (MSPE) of OGA+HDAIC and other
regularization methods evaluated from 1000 simulations. The simulated MSPE is
defined as in (4.4).
As expected from the theory, OGA+HDAIC is significantly better than other
methods. In particular, the MSPE of Lasso, which outperforms Elastic Net and ridge
regression in this case, is about twice that of OGA+HDAIC for κ = 5, 10, or 15.
η n p κ OGA + HDAIC OGA + HDBIC Lasso ENet Ridge
0 500 1000 5 0.0522 0.1221 0.0674 0.0754 18.6972
10 0.0987 0.2800 0.1257 0.1351 74.7936
15 0.1425 0.4869 0.2729 0.2872 168.3534
2000 5 0.0554 0.1333 0.0783 0.0909 24.0502
10 0.0956 0.3107 0.1424 0.1619 96.0622
15 0.1513 0.5467 0.2768 0.3085 216.0163
450 2000 10 0.0972 0.3211 0.1503 0.1701 97.1015
1 500 1000 5 0.1713 0.4755 0.2722 0.3307 18.5715
10 0.4879 1.5057 1.2459 1.5048 74.3403
15 1.0294 2.9938 2.9608 3.4607 167.3379
2000 5 0.2069 0.6647 0.3529 0.4737 24.1243
10 0.6027 2.2435 1.5675 1.9517 96.3834
15 1.3987 4.6301 3.6692 4.3859 216.8061
2 500 1000 5 0.1720 0.4844 0.7696 1.0805 18.7922
10 0.4944 1.4459 3.0878 4.9387 75.1914
15 1.0676 3.2402 6.9301 11.2026 169.2467
2000 5 0.2157 0.6985 0.8946 1.3856 24.3712
10 0.6489 2.2645 3.5522 6.0250 97.3697
15 1.4135 4.5187 7.9515 13.5588 219.0319
Table 4.9: MSPE of different methods in Example 4.3
MCCV Analysis
Here we focus on the case n = 500, p = 2000, η = 0, and κ = 10, and use MCCV to choose the best regularization method. First, the MSPE results in Table 4.9 suggest that $F^{(M)}_{450}$ approximates $F^{(M)}_{500}$ well in this case, so the choice $n_t = 450$ for MCCV is reasonable. Next, we consider one particular simulated data set and use it to estimate $F^{(M)}_{450}$ by MCCV with $n_t = 450$ and B = 100. Figure 4.5, right panel, gives the MCCV estimates for the five regularization methods. In comparison, Figure 4.5, left panel, shows the true squared prediction error distributions $F^{(M)}_{450}$ for the different regularization methods, computed from 1000 replicates of a training sample of size 450 (instead of n = 500) together with an independent observation in each simulation run as the test sample. The boxplots show that OGA+HDAIC performs best among the five regularization methods.
Figure 4.5: Boxplot of MCCV performance for Example 4.3
4.3 Scenario without Weak Sparsity
Example 4.4. Consider the same regression model (4.5) as in Section 4.2, in which

$\beta_j = 4$ for $1 \le j \le 25$,
$\beta_j = 4 - (4 - \beta_{200})\,(j - 25)/175$ for $25 < j < 200$,
$\beta_j = j^{-0.6}$ for $200 \le j \le p$,

with

• n = 400, p = 2000,

• $(x_{t1}, x_{t2}, \dots, x_{tp}) \sim_{\text{i.i.d.}} N(0, I_{p\times p})$,

• $\epsilon_t \sim_{\text{i.i.d.}} N(0, 1)$, independent of $(x_{t1}, x_{t2}, \dots, x_{tp})$.

This example does not satisfy the weak sparsity condition, since

$\sum_{j=200}^{p} |\beta_j| = \sum_{j=200}^{p} j^{-0.6} \to \infty$, as $p \to \infty$.

On the other hand, the $\beta_j$ are square-summable, i.e.,

$\sup_p \sum_{j=1}^{p} \beta_j^2 < \infty.$
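These two claims can be checked numerically, using our transcription of the coefficient profile above (the helper name `beta_example_4_4` is hypothetical):

```python
import numpy as np

def beta_example_4_4(p):
    """Coefficient profile of Example 4.4: flat at 4 for j <= 25, a linear
    ramp down to beta_200 = 200**(-0.6) for 25 < j < 200, then j**(-0.6)."""
    beta200 = 200.0 ** -0.6
    j = np.arange(1, p + 1, dtype=float)
    return np.where(j <= 25, 4.0,
           np.where(j < 200, 4.0 - (4.0 - beta200) * (j - 25) / 175.0,
                    j ** -0.6))

# the l1 norm keeps growing with p while the l2 norm stabilizes
for p in (2000, 20000, 200000):
    b = beta_example_4_4(p)
    print(p, np.abs(b).sum(), (b ** 2).sum())
```

The printed sums illustrate the dichotomy: the l1 norm increases without bound in p (like $p^{0.4}$ in the tail), while the l2 norm changes only marginally, consistent with square-summability.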
4.3.1 MSPE Performance

Table 4.10 gives 5-number summaries for the squared prediction error distributions of OGA+HDAIC, OGA+HDBIC, Lasso, Elastic Net, ridge regression, and Max1,2 based on 1000 simulations. The mean in the table corresponds to the simulated MSPE defined in (4.4).
As expected from the theory, OGA+HDAIC performs substantially worse than the other regularization methods, which confirms that the weak sparsity condition is critical for OGA+HDIC to perform well. On the other hand, the Max1,2 regularization performs best.
n = 400      OGA+HDBIC   OGA+HDAIC   Lasso     Ridge    ENet      Max1,2
  Min        0.0006954   0.0006954   0.03269   0.2234   0.08898   0.01591
  1st Qu.    148.7       167.0       113.1     120.1    99.6      93.6
  Median     545.6       615.7       572.6     421.3    515       415.6
  3rd Qu.    2065        1930        1734      1349     1645      1261
  Max.       10590       13390       9816      8162     8999      7125
  Mean       1391        1507        1260      1007     1198      871

n = 360      OGA+HDBIC   OGA+HDAIC   Lasso     Ridge    ENet      Max1,2
  Min        7.60e-04    7.60e-04    0.04511   0.3124   0.09783   0.02061
  1st Qu.    156.2       177.5       119.7     125.5    106.3     99.8
  Median     561.1       634.8       588.3     434.2    524.1     424.3
  3rd Qu.    2098        1973        1745      1383     1671      1288
  Max.       10796       13442       9881      8317     9045      7229
  Mean       1422        1541        1287      1029     1213      894

Table 4.10: 5-number summary and mean for $F^{(M)}_n$ in Example 4.4
4.3.2 MCCV Analysis
Table 4.10 shows that $F^{(M)}_{360}$ is quite close to $F^{(M)}_{400}$, so $n_t = 360$ is a reasonable choice for MCCV in this example. Figure 4.6, left panel, shows the simulated squared prediction error distributions for the six regularization methods. These distributions, denoted by $F^{(M)}_{360}$ for each method M, are computed from 1000 replicates of a training sample of size 360 (instead of n = 400), together with an independent observation in each simulation run as the test sample, as in Table 4.10. The boxplots show that Max1,2 works best among the six regularization methods shown.

Figure 4.6: Boxplot of MCCV performance for Example 4.4

In comparison, we use one simulated data set from these 1000 simulated samples of size n = 400 to estimate $F^{(M)}_{360}$ by MCCV with $n_t = 360$ and B = 100. Figure 4.6, right panel, gives the MCCV estimate of $F^{(M)}_{360}$ for each of the six methods. It shows that the MCCV estimate is close to $F^{(M)}_{360}$ for each method M, and that Max1,2 again works best among these regularization methods. Table 4.11 gives the 5-number summaries of the MCCV estimates for the different regularization methods. Comparison with Table 4.10 shows that the MCCV estimates approximate the true squared prediction error distributions quite well.
Stats      OGA+HDBIC   OGA+HDAIC   Lasso     Ridge    ENet      Max1,2
Min        9.30e-03    9.30e-03    0.03516   0.4102   0.10233   0.02421
1st Qu.    159.5       178.3       121.4     121.3    108.9     97.5
Median     553.3       647.1       593.1     430.1    530.2     421.6
3rd Qu.    2079        1988        1773      1368     1682      1274
Max.       10713       13397       9904      8304     9073      7215
Mean       1411        1562        1293      1019     1225      882

Table 4.11: 5-number summary and mean for MCCV estimates of $F^{(M)}_{360}$ in Example 4.4
Chapter 5
Conclusion
In this thesis, we have developed a novel Monte Carlo cross-validation method to
choose an appropriate regularization method for a given regression data set that has
more regressors than sample cases, i.e., $p \gg n$. This method aims at estimating the squared prediction error distribution of the regularized regression method for a smaller sample size $n_t$ than n. This differs from conventional cross-validation methods
for choosing the tuning parameters of a particular regularization method. There are
also differences in the corresponding asymptotic theorems, as shown in Section 3.1
and Section 3.4.
The simulation studies in Chapter 4 have shown that, with nt suitably chosen,
the Monte Carlo cross-validation estimate provides a good approximation to the ac-
tual squared prediction error distribution, and is therefore a reliable tool to determine
which regularization method should be used for the high-dimensional regression prob-
lem at hand.
Another contribution of this thesis is a new regularization method Max1,2 intro-
duced in Section 2.2. We have derived it as a natural alternative to the Elastic Net
that combines L1-regularization and L2-regularization, and have provided an efficient
exact solution and an approximate pathwise solution. The example in Section 4.3
has also shown that it outperforms the Elastic Net and other regularization methods.
Although we have some heuristic explanation of why Max1,2 works in that scenario,
a comprehensive theory of Max1,2 is lacking and will be a project of future research.
Bibliography
Akaike, H. (1973), Information theory and an extension of the maximum likelihood
principle, in ‘Proc. of the 2nd Int. Symp. on Information Theory’, Akademiai
Kiado, pp. 267–281.
Allen, D. M. (1974), ‘The relationship between variable selection and data augmen-
tation and a method for prediction’, Technometrics 16(1), 125–127.
Antoniadis, A. & Fan, J. (2001), ‘Regularization of wavelet approximations’, Journal
of the American Statistical Association 96(455), 939–967.
Barron, A., Cohen, A., Dahmen, W. & DeVore, R. (2008), 'Approximation and learn-
ing by greedy algorithms', Annals of Statistics 36(1), 64–94.
Bickel, P. J., Ritov, Y. & Tsybakov, A. (2009), 'Simultaneous analysis of Lasso and
Dantzig selector', Annals of Statistics 37, 1705–1732.
Boyd, S., Parikh, N., Chu, E., Peleato, B. & Eckstein, J. (2010), 'Distributed opti-
mization and statistical learning via the alternating direction method of multipliers',
Foundations and Trends in Machine Learning 3(1), 1–122.
Boyd, S. & Vandenberghe, L. (2004), Convex Optimization, Cambridge University
Press.
Breiman, L. (1998), ‘Arcing classifiers’, Annals of Statistics 26(3), 801–849.
Breiman, L. (1999), ‘Prediction games and arcing algorithms’, Neural Comput.
11, 1493–1517.
Breiman, L., Friedman, J. H., Olshen, R. A. & Stone, C. J. (1984), Classification and
Regression Trees, Wadsworth, Belmont, CA.
Buhlmann, P. (2006), 'Boosting for high-dimensional linear models', Annals of Statistics
34(2), 559–583.
Buhlmann, P. & Hothorn, T. (2010), ‘Twin boosting: improved feature selection and
prediction’, Stat. Comput. 20, 119–138.
Buhlmann, P. & Yu, B. (2003), 'Boosting with the L2 loss: Regression and classifica-
tion', Journal of the American Statistical Association 98, 324–339.
Burman, P. (1989), ‘A comparative study of ordinary cross-validation, v -fold cross-
validation and the repeated learning-testing methods’, Biometrika 76, 503–514.
Candes, E. & Tao, T. (2007), 'The Dantzig selector: Statistical estimation when p is
much larger than n', Annals of Statistics 35, 2313–2351.
Chen, J. & Chen, Z. (2008), ‘Extended bayesian information criteria for model selec-
tion with large model spaces’, Biometrika 95(3), 759–771.
Donoho, D. L. & Elad, M. (2003), 'Optimally sparse representation in general
(nonorthogonal) dictionaries via ℓ1 minimization', Proceedings of the National
Academy of Sciences of the USA 100(5), 2197–2202.
Donoho, D. L., Elad, M. & Temlyakov, V. N. (2006), ‘Stable recovery of sparse over-
complete representations in the presence of noise’, IEEE Transactions on Infor-
mation Theory 52(1), 6–18.
Donoho, D. L. & Johnstone, I. M. (1994), ‘Ideal spatial adaptation by wavelet shrink-
age’, Biometrika 81(3), 425–455.
Efron, B. (1986), ‘How biased is the apparent error rate of a prediction rule?’, Journal
of the American Statistical Association 81(394), 461–470.
Efron, B., Hastie, T., Johnstone, I. & Tibshirani, R. (2004), ‘Least angle regression
(with discussion)’, Annals of Statistics 32, 407–451.
Fan, J. & Li, R. (2001), ‘Variable selection via nonconcave penalized likelihood and its
oracle properties’, Journal of the American Statistical Association 96(456), 1348–
1360.
Fan, J. & Lv, J. (2008), ‘Sure independence screening for ultrahigh dimensional feature
space’, Journal Of The Royal Statistical Society Series B 70(5), 849–911.
Fan, J. & Lv, J. (2010), ‘A selective overview of variable selection in high dimensional
feature space’, Statistica Sinica 20(1), 1–44.
Frank, I. E. & Friedman, J. H. (1993), ‘A statistical view of some chemometrics
regression tools’, Technometrics 35(2), 109–135.
Freund, Y. (1995), ‘Boosting a weak learning algorithm by majority’, Inf. Comput.
121, 256–285.
Freund, Y. & Schapire, R. (1996a), Experiments with a new boosting algorithm, in
‘ICML’, pp. 148–156.
Freund, Y. & Schapire, R. E. (1996b), 'Game theory, on-line prediction and boosting',
in 'Proceedings of the Ninth Annual Conference on Computational Learning Theory
(COLT '96)', pp. 325–332.
Friedman, J. H. (1999), ‘Stochastic gradient boosting’, Computational Statistics and
Data Analysis 38, 367–378.
Friedman, J. H. (2001), ‘Greedy function approximation: A gradient boosting ma-
chine’, Annals of Statistics 29, 1189–1232.
Friedman, J. H. (2008), 'Fast sparse regression and classification', Technical report,
Department of Statistics, Stanford University.
Friedman, J., Hastie, T., Hofling, H. & Tibshirani, R. (2007), ‘Pathwise coordinate
optimization’, Annals of Applied Statistics 1(2), 302–332.
Friedman, J., Hastie, T. & Tibshirani, R. (2010), ‘Regularization paths for generalized
linear models via coordinate descent’, Journal of Statistical Software 33(1), 1–22.
Geisser, S. (1975), ‘The predictive sample reuse method with applications’, Journal
of the American Statistical Association 70(350), 320–328.
Hannan, E. J. & Quinn, B. G. (1979), ‘The determination of the order of an au-
toregression’, Journal of the Royal Statistical Society, Series B (Methodological)
41(2), 190–195.
Hastie, T., Tibshirani, R. & Friedman, J. H. (2001), The Elements of Statistical Learn-
ing: Data Mining, Inference, and Prediction, Springer-Verlag, New York.
Hoerl, A. E. & Kennard, R. W. (1970), ‘Ridge regression: Biased estimation for
nonorthogonal problems’, Technometrics 12(1), 55–67.
Huang, J., Horowitz, J. L. & Ma, S. (2008), ‘Asymptotic properties of bridge es-
timators in sparse high-dimensional regression models’, Annals of Statistics
36(2), 587–613.
Huang, J., Ma, S. & Zhang, C. H. (2008), 'Adaptive lasso for sparse high-dimensional
regression models', Statistica Sinica 18(4), 1603–1618.
Ing, C. K. (2007), ‘Accumulated prediction errors, information criteria and optimal
forecasting for autoregressive time series’, Annals of Statistics 35, 1238–1277.
Ing, C.-K. & Lai, T. L. (2011), ‘A stepwise regression method and consistent model se-
lection for high-dimensional sparse linear models’, Statistica Sinica 21(4), 1473–
1513.
Ing, C. K. & Wei, C. Z. (2005), ‘Order selection for the same-realization prediction
in autoregressive processes’, Annals of Statistics 33, 2423–2474.
Knight, K. & Fu, W. (2000), ‘Asymptotics for lasso-type estimators’, Annals of Statis-
tics 28(5), 1356–1378.
Leng, C., Lin, Y. & Wahba, G. (2006), 'A note on the lasso and related procedures
in model selection', Statistica Sinica 16(4), 1273–1284.
Liu, Y. & Wu, Y. (2007), ‘Variable selection via a combination of the l0 and l1 penal-
ties’, Journal Of Computational And Graphical Statistics 16(4), 782–798.
Meinshausen, N. (2007), ‘Relaxed lasso’, Computational Statistics & Data Analysis
52(1), 374–393.
Meinshausen, N. & Buhlmann, P. (2006), 'High-dimensional graphs and variable
selection with the lasso', Annals of Statistics 34(3), 1436–1462.
Meinshausen, N. & Buhlmann, P. (2010), 'Stability selection (with discussion)', Journal
of the Royal Statistical Society Series B 72(4), 417–473.
Mosteller, F. & Tukey, J. W. (1968), Data analysis, including statistics, in G. Lindzey
& E. Aronson, eds, ‘Handbook of Social Psychology, Vol. 2’, Addison-Wesley.
Owen, A. B. (2006), ‘A robust hybrid of ridge and lasso penalized regression’, Dis-
covery pp. 1–24.
Politis, D. N., Romano, J. P. & Wolf, M. (1999), Subsampling (Springer Series in
Statistics), Springer.
Schapire, R. (1990), ‘The strength of weak learnability’, Mach. Learn. 5, 197–227.
Schapire, R. E. & Singer, Y. (1999), Improved boosting algorithms using confidence-
rated predictions, in ‘Machine Learning’, pp. 80–91.
Schwarz, G. (1978), 'Estimating the dimension of a model', Annals of Statistics 6(2), 461–464.
Shao, J. (1993), ‘Linear model selection by cross-validation’, Journal of the American
Statistical Association 88(422), 486–494.
Stone, M. (1974), 'Cross-validatory choice and assessment of statistical predictions',
Journal of the Royal Statistical Society, Series B 36, 111–147.
Temlyakov, V. N. (2000), 'Weak greedy algorithms', Advances in Computational Mathematics 12, 213–227.
Tibshirani, R. J. (1996), ‘Regression shrinkage and selection via the lasso’, Journal
of the Royal Statistical Society, Series B 58(1), 267–288.
Tropp, J. A. (2004), ‘Greed is good: Algorithmic results for sparse approximation’,
IEEE Trans. Inform. Theory 50, 2231–2242.
Tropp, J. A. & Gilbert, A. C. (2007), 'Signal recovery from random measurements
via orthogonal matching pursuit', IEEE Trans. Inform. Theory 53, 4655–4666.
Valiant, L. G. (1984), ‘A theory of the learnable’, Commun. ACM 27, 1134–1142.
Wang, H. (2009), ‘Forward regression for ultra-high dimensional variable screening’,
Journal of the American Statistical Association 104(488), 1512–1524.
Wasserman, L. & Roeder, K. (2009), 'High-dimensional variable selection', Annals of
Statistics 37, 2178–2201.
Yang, Y. (2008), ‘Consistency of cross validation for comparing regression procedures’,
Annals of Statistics 35(6), 2450–2473.
Zhang, C.-H. & Huang, J. (2008), ‘The sparsity and bias of the lasso selection in
high-dimensional linear regression’, Annals of Statistics 36(4), 1567–1594.
Zhang, P. (1993), ‘Model selection via multifold cross validation’, Annals of Statistics
21(1), 299–313.
Zhao, P. & Yu, B. (2006), ‘On model selection consistency of lasso’, Journal of Ma-
chine Learning Research 7(11), 2541–2563.
Zhou, S., Van De Geer, S. & Buhlmann, P. (2009), 'Adaptive lasso for high dimen-
sional regression and Gaussian graphical modeling', preprint p. 30.
Zou, H. (2006), ‘The adaptive lasso and its oracle properties’, Journal of the American
Statistical Association 101(476), 1418–1429.
Zou, H. & Hastie, T. (2005), ‘Regularization and variable selection via the elastic
net’, Journal of the Royal Statistical Society - Series B: Statistical Methodology
67(2), 301–320.
Zou, H. & Li, R. (2008), ‘One-step sparse estimates in nonconcave penalized likelihood
models’, Annals of Statistics 36(4), 1509–1533.