
Does Cross-validation Work when p ≫ n?

Laurence de Torrenté

Mathematics Institute, EPFL, 1015 Lausanne, Switzerland

[email protected]

Trevor Hastie∗

Department of Statistics, Stanford University, Stanford, CA 94305

[email protected]

September 12, 2012

Abstract

Cross-validation is a popular tool for evaluating the performance of a predictive model, and hence also for model selection when we have a series of models to choose from. It has been suggested that cross-validation can fail when the number of predictors p is very large. We demonstrate through a suggestive simulation example that while K-fold cross-validation can have high variance in some situations, it is unbiased. We also study two permutation methods to assess the quality of a cross-validation curve for model selection, which we demonstrate using the lasso. The first approach is visual, while the second computes a p-value for our observed error. We demonstrate these approaches using two real datasets, "Colon" and "Leukemia", as well as a null dataset. Finally we use the bootstrap to estimate the sampling distribution of our cross-validation curves, and their functionals. This adds to the graphical evidence of whether our findings are real or could have arisen by chance.

∗Trevor Hastie was partially supported by grant DMS-1007719 from the National Science Foundation, and grant RO1-EB001988-15 from the National Institutes of Health.


Keywords: variable selection, high-dimensional, permutation, bootstrap.


1 Introduction

Cross-validation is an old method, which was investigated and reintroduced by Stone (1974). It splits the dataset into two parts, using one part to fit the model (the training set) and one to test it (the test set). K-fold cross-validation (K-fold CV) and leave-one-out cross-validation (LOOCV) are the best-known variants. There is also generalized cross-validation (GCV), an approximation due to Wahba & Wold (1975).

We consider a regression problem where we want to predict a response variable Y using a vector of p predictors X^T = (X_1, . . . , X_p) via a linear model,

    Y = \beta_0 + \sum_{j=1}^{p} X_j \beta_j .

In general we would like to eliminate the superfluous variables among the X_j, and obtain good estimates of the coefficients for those retained. Any variable with a small |β_j| is a candidate for removal: it may be advantageous to set β_j = 0, allow for a small bias, and yet reduce the prediction error. Ordinary least-squares estimators tend to have small bias but large variance in the prediction of Y. The trade-off between variance and bias can be made explicit by variable-selection procedures or by shrinkage methods. Having fewer predictors can also improve data visualization and understanding. With genomic data, the number of variables is often so large that selection becomes essential, both for statistical inference and for interpretation.

A variety of selection and shrinkage methods have been proposed and investigated, among them:

• Forward stepwise selection, which begins with a model containing only the intercept and adds one covariate at each step until no better model can be found.

• Ridge regression, which leaves all the covariates in, but regularizes their coefficients. The criterion to be minimized is a weighted sum of the residual sum of squares and the squared Euclidean norm of the coefficients.

• The lasso (Tibshirani 1996), which is like ridge, except that it penalizes the L_1-norm of the coefficients, thereby achieving subset selection and shrinkage at the same time. The lasso solves the convex optimization problem

    \mathrm{minimize}_{\beta \in \mathbb{R}^p} \; \sum_{i=1}^{n} (Y_i - X_i^T \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j| .     (1)

All of these methods correspond to families of estimators β̂_λ rather than a single estimator. This is a situation where cross-validation is useful. The cross-validated estimate of the prediction error can help in choosing within these families. It is often used to select the tuning parameter λ for the lasso or ridge regression, or the subset size for stepwise regression. For example, we choose the tuning parameter that results in the model with the minimal cross-validation error.

Variable selection with high-dimensional data can be difficult. We have to work with a large number of covariates and a small sample size, and we want to correctly estimate both the prediction error and the "sparsity" pattern. Wasserman & Roeder (2009) propose a multistage procedure which "screens" and "cleans" the covariates in order to reduce the dimension and retain reasonable power.

It is often suggested that cross-validation could fail in the p ≫ n situation, as the following simple scenario appears to demonstrate (a simplification of a similar example in Section 7.10.3 of Hastie, Tibshirani & Friedman (2008)). Suppose we have p (very large) binary predictors and a binary response (e.g. a genomics dataset). We consider a very simple procedure: we search through the covariates, find the one most aligned with the response, and use it to predict the response. With n sufficiently small, we might find one or more features that are perfectly aligned with the response. If this is the case, it would also be true for every cross-validation fold! This could happen even in the null case, where the response is independent of the features. It seems that cross-validation would fail here, and report a cross-validation error of zero. In other words, an association between a feature and the response is found even if there is none. It turns out that this is not a counterexample, but it does demonstrate the high variance of cross-validation in such situations. Braga-Neto & Dougherty (2004) also showed that cross-validated estimators exhibit high variance and large outliers when p ≫ n. Fan, Guo & Hao (2012) also documented the variance problem and proposed alternative estimators.

In Section 2 we pursue this example. We generate a null dataset and use this simple prediction rule to demonstrate that cross-validation error is unbiased for the expected test error but can have high variance.

Cross-validation being unbiased is not enough. By luck we might find a model with very low cross-validated error, and not be sure whether it is real or not. We have seen that cross-validation error can have high variance, so we look at two different methods to assess this variance. In Section 3 we look at cross-validation error used to select the tuning parameter λ in the lasso. First, we use a visual method which allows us to see whether our error is credible, or could have arisen from a null model. Then, we compute the null distribution of the minimum error and hence a "p-value" for our observed error. The methods we propose are visual and descriptive, and use permutations to generate null CV curves. We demonstrate these approaches using two real datasets, "Colon" and "Leukemia", as well as a simulated dataset.

In Section 4, we discuss the use of the bootstrap in this context. The bootstrap estimates the sampling distribution of the cross-validation curves and their functionals, such as the optimal λ and the minimal error. This helps us establish whether the good performance we see in our observed data (relative to the null data) could be by chance.

In every section we use fivefold cross-validation, which seems to be a good compromise (see Hastie et al. (2008), Breiman & Spector (1992) and Kohavi (1995)).


2 Cross-validation is unbiased

In this section we show that cross-validation is unbiased but can have high variance, in the context of the simple example outlined in the introduction. We generate a null dataset with binary covariates and an independent binary response, and use a simple rule for choosing the predictor. We want to see if cross-validation estimates the true error of this selection procedure.

We generate p binary covariates of length n i.i.d. from a Bernoulli distribution,

    X_{ij} ∼ B(0.2),    i = 1, . . . , n;  j = 1, . . . , p,

and Y_i, i = 1, . . . , n, from a Bernoulli B(0.5). The true error rate is the null rate of 50%: since Y is independent of all the X_j, no matter what class we predict, half will be wrong on average.

The best predictor is chosen based on the following criterion: select the covariate most aligned with the response (smallest Hamming distance), and use it to predict the response. If there are ties for the best, choose one at random.

5-fold cross-validation is applied as follows. The best predictor is selected using the "training" set T^k_{tr} (4/5ths of the data) for the kth fold, and the error is computed on the "test" set T^k_{te} (1/5th of the data). With the training set, we choose the X_ℓ with

    \ell = \mathrm{argmin}_{j=1,\dots,p} \sum_{i \in T^k_{tr}} 1_{\{Y_i \neq X_{ij}\}} .     (2)

Then, with the test set, the error for the kth fold is

    e_k = \frac{1}{n_{te}} \sum_{i \in T^k_{te}} 1_{\{Y_i \neq X_{i\ell}\}} .     (3)

We repeat this for every fold (possibly selecting a different ℓ each time), and set the cross-validation error to be e = \frac{1}{5} \sum_{k=1}^{5} e_k. For these experiments we generated the X_j once, and obtained repeated independent realizations of Y by permutation.
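To make the procedure concrete, here is a small R sketch of this experiment. The settings (n = 10, p = 1000, 1000 permutations of Y) match the left panel of Figure 1, but the fold assignment and the tie-breaking details are our own illustrative choices rather than the authors' code.

set.seed(1)
n <- 10; p <- 1000; K <- 5; nrep <- 1000

X <- matrix(rbinom(n * p, 1, 0.2), n, p)   # binary covariates, generated once
Y <- rbinom(n, 1, 0.5)                     # binary response, independent of X

## K-fold CV error of the "pick the most aligned covariate" rule
cv_error <- function(X, y, K = 5) {
  fold <- sample(rep(1:K, length.out = length(y)))     # random fold labels
  errs <- numeric(K)
  for (k in 1:K) {
    tr <- fold != k
    te <- fold == k
    ham  <- colSums(y[tr] != X[tr, , drop = FALSE])    # Hamming distances, eq. (2)
    ties <- which(ham == min(ham))
    best <- ties[sample.int(length(ties), 1)]          # break ties at random
    errs[k] <- mean(y[te] != X[te, best])              # test error of fold k, eq. (3)
  }
  mean(errs)
}

## repeated realizations of Y obtained by permutation, as in the text
cv_reps <- replicate(nrep, cv_error(X, sample(Y), K))
summary(cv_reps)   # the median should be close to the null rate 0.5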


Figure 1 shows the results of cross-validation using 1000 different realizations of Y, separately for different configurations of n and p. We see in all cases that cross-validation is indeed unbiased: the true error of 50% is the median of the distribution, but the spread can be high. One might wonder how this can be, especially in the configurations where there is a reasonable probability of pure features: features that align perfectly with the response over all n observations. The reason is that there will also likely be pure 4/5 training sets, including ones for which the 1/5 test points are not pure. In this case there would be tied "best" predictors, and the one picked at random might be the one with the "impure" test set.


Figure 1: Boxplots of 5-fold cross-validation error for different values of n and p and 1000 realizations of Y. On the left, we generate Y once and permute it 1000 times; on the right we draw 1000 new realizations of the data. We wanted to compare both methods, permutation vs. generating each time, and they appear to give similar results. Both plots show that cross-validation seems to be unbiased but can have high variance. For instance, on the left with n = 10 and p = 1000, the median is at 0.5 (unbiased) but the values vary from 0 to 0.9 (very high variance).

We conclude from this simple but compelling experiment that cross-validation does estimate the expected test error in an apparently unbiased way, but it can have high variance. This variance is somewhat troubling. One would normally take the result of a single cross-validation, and hence could be subject to a high-variance stroke of bad luck. In the case n = 10 and p = 1000, for example, one might find a cross-validation error of 0 (the lowest point on the boxplot), even though the real error is 50%! This motivates augmenting cross-validation with some variance information, a topic we address next.

3 The null distribution of cross-validation

In this section we examine the use of cross-validation to select the tuning parameter λ in the lasso. Figure 2 shows a cross-validation curve for the leukemia data (see Section 3.1), as a function of the regularization parameter. Standard error bands are included, based on the variability across the five folds. However, since these five error estimates are correlated, we are reluctant to place too much faith in these bands.

We present two methods to assess the CV misclassification error estimates. Both are based on the distribution of the CV curves computed on null data: data obtained by randomly permuting the response Y while holding all the X variables fixed. The idea is that if we have an impressive CV result on our real data, it should not be easily achievable with null data; if it is, we would not be able to trust it. The reason we permute only Y is that we wish to create a null relationship between predictors and response without breaking the correlation structure of the predictors.

• The first method is graphical, and simply plots the null distribution of the CV curves and compares it to the original curve (Figures 3, 5 and 7).

• The second method uses the same distribution of curves, and computes the null distribution of the minimized cross-validation error, and hence a "p-value" for the observed minimal error (Figures 4, 6 and 8).

We demonstrate these two approaches using the lasso applied to two real datasets, "Leukemia" and "Colon", as well as a simulated null dataset, using 5-fold CV in all cases.
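As a rough illustration of how such a null cloud can be generated, here is an R sketch built on the glmnet package used in the paper. It assumes the predictor matrix x and binary response y (e.g. the leukemia data below) are already loaded; the numbers of permutations and fold repetitions are illustrative, and fixing a common λ sequence is simply one convenient way of making the permuted curves comparable, not necessarily what the authors did.

library(glmnet)

## common lambda sequence, so all null curves are evaluated at the same points
lam <- glmnet(x, y, family = "binomial")$lambda

## grey cloud: 5-fold CV curves after permuting the response only
null_min_err <- replicate(1000, {
  yb  <- sample(y)                                    # permuted (null) response
  fit <- cv.glmnet(x, yb, family = "binomial", type.measure = "class",
                   nfolds = 5, lambda = lam)
  ## fit$cvm (plotted against log(fit$lambda)) is one null CV curve;
  ## we keep its minimum for the p-value computations below
  min(fit$cvm)
})

## red curves: CV on the observed response, repeated over random fold choices
obs_min_err <- replicate(1000, {
  min(cv.glmnet(x, y, family = "binomial", type.measure = "class",
                nfolds = 5)$cvm)
})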


3.1 Leukemia dataset

The leukemia dataset (Golub, Slonim, Tamayo, Huard, Gaasenbeek, Mesirov, Coller, Loh, Downing, Caligiuri, Bloomfield & Lander 1999) contains gene-expression data and a class label indicating the type of leukemia ("AML" vs "ALL"). We compute the cross-validation error with the package glmnet (Friedman, Hastie & Tibshirani 2010). The dataset contains n = 38 samples (11 "AML" and 27 "ALL"), and the number of covariates p is 7129.


Figure 2: Cross-validation curve for the leukemia dataset, as produced by the package glmnet. The bands indicate pointwise standard errors based on the error estimates from the 5 folds. The number of non-zero coefficients at each log(λ) is shown above the plot. The dotted line on the left corresponds to the minimum CV error. The other one is the largest value of log(λ) such that the error is within one standard error of the minimum. In this paper, we focus on the minimum CV error or the whole CV curve.

A lasso fit is defined up to the tuning parameter λ, which must be specified. We use CV to estimate the prediction error for a sequence of lasso models, and pick a value of λ that has good estimated prediction performance. For example, we can pick the value that minimizes the CV error. This corresponds, in Figure 2, to a CV error of 0.053 at log(λ) = −1.505. (In cases where the CV curve is flat near its minimum, like this one, we always pick the largest such value of λ, which corresponds to the sparsest model.)
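For reference, this choice can be read off directly from a cv.glmnet fit; the sketch below assumes the leukemia predictors and class labels are available as x and y (variable names of our choosing).

library(glmnet)

cvfit <- cv.glmnet(x, y, family = "binomial", type.measure = "class",
                   nfolds = 5)
plot(cvfit)          # the curve of Figure 2, with its standard-error bands

min(cvfit$cvm)       # minimal CV misclassification error (0.053 in Figure 2)
cvfit$lambda.min     # largest lambda achieving that minimum (the sparsest such model)
cvfit$lambda.1se     # largest lambda within one standard error of the minimum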

Our first method analyses the whole CV error curve. In Figure 3, the grey curves correspond to the "null" cross-validation curves from 1000 permutations of Y. The red curves correspond to CV curves on the real data. Since there is variation in these curves because of the randomness in fold selection, we repeated the CV procedure 1000 times. For clarity, in the left plot we show just five such red curves. Superimposed in purple is the average CV curve (over all 1000 fold selections).

We would be concerned if the actual cross-validation curves lay within the grey null cloud. In this example there is fairly good separation, even when taking into account the randomness due to fold selection (right panel).


Figure 3: Cross-validation curves for the leukemia dataset. In both plots the grey cloud corresponds to CV curves computed from null data. In the right plot, we show the variation in the true CV curves resulting from 1000 different random fold selections. When selecting a tuning parameter, we would not compute 1000 different CV curves; typically one or at most a few of these curves are computed. In the left plot, we have selected 5 of these at random. Overlaid in purple in both plots is the average CV curve (averaged over all the folds). Of interest is whether the real CV curves (red) lie inside or outside the set of null CV curves (grey). Here they appear to lie outside, and so we might conclude there is information in these data.

The second method computes the null distribution of the minimal cross-validation error.

Figure 4: Real and null distribution for the leukemia dataset. The purple histogram corresponds to the minimal error rates found on the actual data, and the blue one shows the frequencies of the minimal values under the null distribution. There is not much overlap between these distributions, which suggests strong evidence of signal in these data.

The purple bars in Figure 4 correspond to the minimal cross-validation errors computed with the observed Y, using different random folds. The actual values are listed in Table 1.

Table 1: Frequency of CV error values for the observed Y for the leukemia dataset (purple bars in Figure 4).

CV error    0/38  1/38  2/38  3/38  4/38  5/38  6/38
Frequency     43   314   482   138    20     1     2

The blue part of Figure 4 is the null distribution of the minimal cross-validation error obtained by randomly permuting Y a thousand times. We find the values listed in Table 2.

We can estimate the null probability for each observed CV value using this distribution. For example,

    P(CV err = 0) = P(CV err = 1/38) = P(CV err = 2/38) = 0,
    P(CV err = 3/38) = 2/1000 = 0.002,
    P(CV err = 4/38) = 1/1000 = 0.001.

The null hypothesis we are interested in is H0: the covariates X_j and Y are independent, vs the alternative H1: there is dependence between Y and the X_j. We will use the minimum CV error as the test statistic. The mode for the null CV values in Figure 4 is 0.289, which is the null error rate 11/(11 + 27).

Table 2: Frequency of CV error values when permuting Y for the leukemia dataset (blue bars in Figure 4).

CV error    3/38  4/38  6/38  7/38  8/38  9/38  10/38  11/38  12/38  13/38  14/38
Frequency      2     1     5    23    38    57    133    704     25      9      3


For instance, if we observe a CV error of 4/38, we can compute the p-value of this observation under H0,

    P(CV error ≤ 4/38) = 3/1000 = 0.003.

Therefore, we can reject the null hypothesis at the 1% significance level. In other words, a CV error as low as 4/38 can hardly be explained by chance if Y were independent of the X_j.
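In terms of the permutation sketch at the end of the introduction to this section, this is just the fraction of the null minima that are at least as small as the observed value (null_min_err being the hypothetical vector of permuted-Y minima computed there):

## permutation "p-value" for an observed minimal CV error of 4/38
p_value <- mean(null_min_err <= 4/38)
p_value    # about 0.003 for the leukemia data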

3.2 Colon dataset

The colon dataset contains snapshots of the expression patterns of different cell types, with a class label indicating whether the colon is healthy or not (Alon, Barkai, Notterman, Gish, Ybarra, Mack & Levine 1999). In this dataset, the response Y contains 22 normal colon tissue samples and 40 colon tumor samples. The number of covariates is 2000.

In Figure 5, the real cross-validation curves overlap the grey cloud slightly in both plots, but the minimal value lies below the cloud, so the minimal CV error still appears informative.


Figure 5: Cross-validation curves for the colon dataset. Details are the same as in Figure 3.

The blue part of Figure 6 is the null distribution of the minimal CV error when permuting Y a thousand times. The values found are listed in Table 3. The minimal CV errors computed with the observed Y, using different random folds (purple part of Figure 6), are listed in Table 4; the p-value computed with the null distribution is also included.

Table 3: Frequency of CV error values when permuting Y for the colon dataset (blue bars in Figure 6).

CV error    10/62  11/62  12/62  13/62  14/62  15/62  16/62  17/62
Frequency       1      1      2      4      6      5     16     19

CV error    18/62  19/62  20/62  21/62  22/62  23/62  24/62  25/62
Frequency      38     45     64    117    662     14      5      1

Table 4: Frequency and p-values of the minimal CV errors for the observed Y (different random folds) for the colon dataset (purple bars in Figure 6). The p-value is the fraction of times a minimal CV value as small or smaller was seen amongst the permuted data. Except for 14/62, none of the values exceed 0.01.

CV error     5/62   6/62   7/62   8/62   9/62  10/62  11/62  12/62  13/62  14/62
Frequency       4     32    132    259    311    181     59     16      4      2
p-value         0      0      0      0      0  0.001  0.002  0.004  0.008  0.014



Figure 6: Real and null distribution for the colon dataset. Details are as in Figure 4.

We test whether the observed CV error comes from the null distribution or reflects real signal. We can reject the null hypothesis at the 1% significance level for CV errors of 13/62 or less. There might be a problem with 14/62, but it appears only twice in a thousand simulations.

3.3 Simulated dataset

We apply both methods to the simulated dataset with n = 50 individuals and p = 5000 covariates from Section 2. Here we are in the null case, where the covariates are totally independent of the response. The realization designated as the "real" dataset shows a somewhat optimistic CV curve.



Figure 7: Cross-validation curves for the simulated dataset. Details are the same as in Figure 3. In this case the "real" CV curves lie inside the null cloud.

In Figure 7, the actual cross-validation curves lie inside the grey cloud in both plots; the curves from the "real" response behave like curves from the null distribution. From this plot alone we could already conclude that there might be a problem with the resulting CV error.


Table 5: Frequency of CV error values when permuting Y for the simulated dataset (blue bars in Figure 8).

CV error     6/50   7/50   8/50   9/50  10/50  11/50  12/50  13/50  14/50  15/50  16/50  17/50  18/50  19/50  20/50
Frequency       1      1      6      1      6     13     14     15     29     34     42     56     33     58     63

CV error    21/50  22/50  23/50  24/50  25/50  26/50  27/50  28/50  29/50  30/50  31/50  32/50  33/50  34/50
Frequency      81     77     86     79     79     72     54     41     29     13     10      4      1      2


Figure 8: Real and null distribution for the simulated dataset. Details are as in Figure 4. In this case the histograms overlap considerably.

The blue part of Figure 8 is the null distribution of the minimal cross-validation error when permuting Y a thousand times; the values found are listed in Table 5. The minimal cross-validation errors computed with the observed Y, using different random folds (purple part of Figure 8), are listed in Table 6, together with the p-value of each observed value under the null distribution.


Table 6: Frequency and p-value of the minimal CV error values for the observed Y for the simulated dataset (purple bars in Figure 8). The most frequent observed value, 13/50, has a p-value of 0.057, and every value of 10/50 or more has a p-value above 0.01.

CV error     5/50   7/50   8/50   9/50  10/50  11/50  12/50  13/50  14/50
Frequency       1      4     12     38     58     83    122    150    133
p-value         0  0.002  0.008  0.009  0.015  0.028  0.042  0.057  0.086

CV error    15/50  16/50  17/50  18/50  19/50  20/50  21/50  22/50
Frequency     140    109     65     42     23     13      6      1
p-value     0.120  0.162  0.218  0.251  0.309  0.372  0.453  0.530

While there is strong evidence here that the data are null, we see some potential dangers in the variation observed. Despite the data being null, we will sometimes see a realized dataset such as the one shown here. In this particular case there is a 5.5% chance (55/1000) that the minimal CV error obtained from a random fold selection would be significant at the 1% level. This suggests that it might be wiser to average the CV curves to reduce the variation due to fold selection.

In the next section we augment the null-data comparison with bootstrapped versions of the CV curves. This throws sampling variation into the mix, and helps distinguish the real situations from the null.

4 Bootstrap analysis

In this section we use the bootstrap to estimate the sampling distribution of the cross-validation curves, and hence of the functionals we derive from them. The bootstrap is useful for studying certain properties of statistics, such as their variance. It mimics the process of drawing new samples from nature, and hence the sampling distribution of the statistics in question. We take the dataset and resample from it, and so we expect each bootstrap dataset to have properties similar to those of the original data. We use the non-parametric bootstrap, which puts a mass of 1/n (the empirical distribution F̂) on every pair (X_i, Y_i), i = 1, . . . , n. We resample n times from F̂ to get one bootstrapped dataset.

In the context of cross-validation, we have to be careful. Cross-validation divides the data into two parts, the training set and the test set. For the original data, every individual observation appears once in the dataset (we assume all the (X, Y) (p+1)-tuples are distinct), so there is no possibility that an individual is used in both the training and the test set, a potential source of positive bias. This is not the case for a bootstrapped dataset: there will almost certainly be ties. If we let some of these tied values go to the training set and some to the validation set, we will have artificially created (additional) correlation. To avoid this, we operate at the level of observation weights. In the original sample, each observation has weight 1/n. In a bootstrapped sample, these weights are of the form k/n, for k = 0, 1, 2, . . .. When we cross-validate a bootstrap sample, we randomly divide the original observations into the two groups, and their bootstrap weights go with them (including the 0s).
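A rough R sketch of this weighted scheme, again with cv.glmnet; it assumes the colon data are loaded as x and y, uses 1000 replicates as in the paper, and omits any guard against degenerate folds (for example, folds whose bootstrap weights are all zero), so it is an illustration rather than the authors' implementation.

library(glmnet)

## Bootstrap the CV functionals via observation weights: folds are formed over
## the original observations, and the bootstrap multiplicities ride along as
## weights, so tied copies never straddle the training/test split.
boot_min_err <- replicate(1000, {
  n   <- length(y)
  idx <- sample(n, n, replace = TRUE)          # non-parametric bootstrap draw
  w   <- tabulate(idx, nbins = n) / n          # weights k/n, as in the text (some are 0)
  fit <- cv.glmnet(x, y, weights = w, family = "binomial",
                   type.measure = "class", nfolds = 5)
  min(fit$cvm)                                 # minimal CV error for this replicate
})

The same loop can also record which coefficients are nonzero at lambda.min, for example via coef(fit, s = "lambda.min"), giving covariate-selection counts like those reported in Table 7.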

Then, for every bootstrap dataset, we can compute statistics like the number of non-zero coefficients, the value of λ minimizing the CV error, the minimal value of the cross-validation error, and so on. We can look at the variation of these statistics to see whether our model is stable. For instance, we may study the number of times each covariate has a nonzero coefficient in the selected model. With that information, we can judge whether some of the covariates appear to be more important than others. For example, we can declare that a covariate is important if it is chosen in more than one-fifth of the 1000 bootstrap datasets. For the colon dataset we find the covariates listed in Table 7.

Table 7: Covariates retained in more than one-fifth of the 1000 bootstrapped datasets, after the models are selected by cross-validation.

Covariates   249   377   493   625   1582   1772   1843

We may also study statistics like the minimal cross-validation error. To compute a confidence interval, we sort the 1000 minimal cross-validation errors found when bootstrapping and take the 25th value as the 2.5 percentile and the 975th as the 97.5 percentile. We find a 95% confidence interval for the minimal cross-validation error of [0, 0.2368]. Figure 9 shows the frequencies of the minimal cross-validation values under the bootstrap distribution and under the null distribution computed by permutation in Section 3.2.
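With the hypothetical boot_min_err vector from the bootstrap sketch above, this interval is simply:

sort(boot_min_err)[c(25, 975)]            # the order-statistic version used in the text
quantile(boot_min_err, c(0.025, 0.975))   # essentially the same percentile interval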


Figure 9: The purple histogram corresponds to the minimal cross-validation error under the bootstrap distribution and the blue one shows the null distribution. Both are computed with the colon dataset. The bootstrap distribution represents the underlying distribution of the real CV error of the colon dataset. The null distribution still corresponds to the one computed from 1000 permutations of Y. Here we would conclude that the bootstrap distribution looks very different from the null distribution, which provides more evidence that the effects found are real.

In Figure 9, the real CV error does not seem to have the same distribution as the null one, with the vast majority of values shifted to the left. This provides more evidence that the signal found in the colon dataset is real.

In Figure 10 (null data), both distributions are very similar, and we would (correctly) not trust the results of cross-validation in this case. For the "real" dataset here (one of the null datasets, with a somewhat optimistic CV curve), the modal value of the minimal CV error is 0.26. Figure 8 (purple) shows the variation in this minimal error under random fold selection, and we saw that under the null distribution a value of 0.26 or smaller occurs only about 5.7% of the time. The bootstrap includes sampling variation as well, and with it we see that this sample is virtually indistinguishable from null data.


Figure 10: Bootstrap and null distribution of the minimal CV error, computed for the null dataset from Section 2. The details are the same as in Figure 9. Here there is substantial overlap, and we would not trust the original cross-validation.

The bootstrap is thus helpful in providing evidence as to whether our original sample might be an opportunistic draw from the null distribution.

5 Conclusion

In this paper, we demonstrated through a simple example that cross-validation is unbiased but can have high variance; care has to be taken when p ≫ n.

We presented two permutation-based methods which can help us assess the validity of the CV error. The first plots the null cross-validation curves and offers visual information about the observed cross-validation curve(s). The second computes a p-value of the observed minimal cross-validation error, calculated from the null distribution found by permutation of Y. For both real datasets, the original CV curves lie outside the distributions of null curves. Likewise, the minimal CV errors are shifted to the left for the real datasets, and using permutation-based hypothesis tests we can conclude that they would not be found by chance. For the "null" dataset, even an opportunistic draw from the family of null datasets, the overlap is much more substantial.

The bootstrap adds to the picture by adding sampling variation into the variation of the cross-validation curves and their functionals. The real datasets, even under bootstrap resampling, have minimal CV errors that look distinct from the null distribution. The null dataset, on the other hand, becomes indistinguishable from the family of null data.

In this article, we analyzed the independent (null) case: we wanted our methods to be able to recognize a null effect. If we observe a small CV error, what is the chance that it could have occurred at random, i.e. could have arisen from a null distribution in which there is no dependence between the response and the predictors?

References

Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D. & Levine, A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Proceedings of the National Academy of Sciences of the United States of America 96(12): 6745–6750. URL: http://dx.doi.org/10.1073/pnas.96.12.6745

Braga-Neto, U. M. & Dougherty, E. R. (2004). Is cross-validation valid for small-sample microarray classification?, Bioinformatics 20(3): 374–380. URL: http://dx.doi.org/10.1093/bioinformatics/btg419

Breiman, L. & Spector, P. (1992). Submodel selection and evaluation in regression. The X-random case, International Statistical Review 60(3): 291–319.

Fan, J., Guo, S. & Hao, N. (2012). Variance estimation using refitted cross-validation in ultrahigh dimensional regression, Journal of the Royal Statistical Society Series B 74(1): 37–65. URL: http://ideas.repec.org/a/bla/jorssb/v74y2012i1p37-65.html

Friedman, J. H., Hastie, T. & Tibshirani, R. (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent, Journal of Statistical Software 33(1): 1–22. URL: http://www.jstatsoft.org/v33/i01

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. & Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science 286(5439): 531–537. URL: http://dx.doi.org/10.1126/science.286.5439.531

Good, P. I. (2006). Resampling Methods, 3rd edn, Birkhauser.

Hastie, T. J., Tibshirani, R. J. & Friedman, J. H. (2008). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics, 2nd edn, Springer, New York.

Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection, International Joint Conference on Artificial Intelligence, Vol. 14, pp. 1137–1145.

Stone, M. (1974). Cross-Validatory Choice and Assessment of Statistical Predictions, Journal of the Royal Statistical Society. Series B (Methodological) 36(2): 111–147.

Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society. Series B (Methodological) 58(1): 267–288. URL: http://dx.doi.org/10.2307/2346178

Wahba, G. & Wold, S. (1975). A completely automatic French curve: Fitting spline functions by cross validation, Communications in Statistics-Theory and Methods 4(1): 1–18.

Wasserman, L. & Roeder, K. (2009). High dimensional variable selection, Annals of Statistics 37(5A): 2178–2201. URL: http://view.ncbi.nlm.nih.gov/pubmed/19784398