Transcript of cv_boot

Cross-validation and the Bootstrap

In this section we discuss two resampling methods: cross-validation and the bootstrap.

These methods refit a model of interest to samples formed from the training set, in order to obtain additional information about the fitted model.

For example, they provide estimates of test-set prediction error, and of the standard deviation and bias of our parameter estimates.

Training Error versus Test Error

Recall the distinction between the test error and the training error:

The test error is the average error that results from using a statistical learning method to predict the response on a new observation, one that was not used in training the method.

In contrast, the training error can be easily calculated by applying the statistical learning method to the observations used in its training.

But the training error rate often is quite different from the test error rate, and in particular the former can dramatically underestimate the latter.

Training- versus Test-Set Performance

[Figure: Prediction error versus model complexity (low to high), for the training sample and the test sample. The low-complexity end corresponds to high bias / low variance; the high-complexity end to low bias / high variance.]

More on prediction-error estimates

Best solution: a large designated test set. Often not available.

Some methods make a mathematical adjustment to the training error rate in order to estimate the test error rate. These include the Cp statistic, AIC and BIC. They are discussed elsewhere in this course.

Here we instead consider a class of methods that estimate the test error by holding out a subset of the training observations from the fitting process, and then applying the statistical learning method to those held-out observations.

Validation-set approach

Here we randomly divide the available set of samples into two parts: a training set and a validation or hold-out set.

The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set.

The resulting validation-set error provides an estimate of the test error. This is typically assessed using MSE in the case of a quantitative response and misclassification rate in the case of a qualitative (discrete) response.

The Validation process

A random splitting into two halves: the left part is the training set, the right part is the validation set.

Example: automobile data

Want to compare linear vs higher-order polynomial terms in a linear regression.

We randomly split the 392 observations into two sets, a training set containing 196 of the data points, and a validation set containing the remaining 196 observations.

[Figure: Validation-set mean squared error versus degree of polynomial (1 to 10). The left panel shows a single split; the right panel shows multiple splits.]
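As an illustration, here is a minimal sketch of the validation-set approach for choosing the polynomial degree. It assumes a clean Auto.csv file with horsepower and mpg columns; the file name, column names, and random seed are illustrative assumptions, not from the slides.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical file: 392 rows with numeric 'horsepower' and 'mpg' columns.
auto = pd.read_csv("Auto.csv")
X = auto[["horsepower"]].to_numpy(dtype=float)
y = auto["mpg"].to_numpy(dtype=float)
X = (X - X.mean()) / X.std()          # standardize so high-degree polynomials stay well-conditioned

# One random half for training, the other half for validation.
rng = np.random.default_rng(1)
train = rng.choice(len(y), size=len(y) // 2, replace=False)
val = np.setdiff1d(np.arange(len(y)), train)

for degree in range(1, 11):
    poly = PolynomialFeatures(degree)
    fit = LinearRegression().fit(poly.fit_transform(X[train]), y[train])
    mse = mean_squared_error(y[val], fit.predict(poly.transform(X[val])))
    print(f"degree {degree}: validation MSE = {mse:.2f}")
```

Rerunning with a different seed changes which observations land in each half, which is exactly the variability the right panel of the figure illustrates.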

Drawbacks of validation set approach

The validation estimate of the test error can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set.

In the validation approach, only a subset of the observations (those that are included in the training set rather than in the validation set) are used to fit the model.

This suggests that the validation set error may tend to overestimate the test error for the model fit on the entire data set. Why?

K-fold Cross-validation

Widely used approach for estimating test error.

Estimates can be used to select the best model, and to give an idea of the test error of the final chosen model.

The idea is to randomly divide the data into K equal-sized parts. We leave out part k, fit the model to the other K - 1 parts (combined), and then obtain predictions for the left-out kth part.

This is done in turn for each part k = 1, 2, ..., K, and then the results are combined.

K-fold Cross-validation in detail

Divide the data into K roughly equal-sized parts (K = 5 here).

[Figure: the observations divided into 5 parts, labelled 1 to 5; one part is held out as the validation set and the other four are used for training.]

The details

Let the K parts be C_1, C_2, ..., C_K, where C_k denotes the indices of the observations in part k. There are n_k observations in part k: if n is a multiple of K, then n_k = n/K.

Compute

CV_{(K)} = \sum_{k=1}^{K} \frac{n_k}{n} \, \mathrm{MSE}_k

where \mathrm{MSE}_k = \sum_{i \in C_k} (y_i - \hat{y}_i)^2 / n_k, and \hat{y}_i is the fit for observation i, obtained from the data with part k removed.

Setting K = n yields n-fold or leave-one-out cross-validation (LOOCV).
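A minimal sketch of the CV_(K) computation defined above, using ordinary least squares as a stand-in for the model being assessed; the model choice and the synthetic data are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def k_fold_cv_mse(X, y, K=5, seed=0):
    """Weighted K-fold CV estimate: CV_(K) = sum_k (n_k / n) * MSE_k."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), K)   # the parts C_1, ..., C_K
    cv = 0.0
    for C_k in folds:
        train = np.setdiff1d(np.arange(n), C_k)     # fit with part k removed
        fit = LinearRegression().fit(X[train], y[train])
        mse_k = np.mean((y[C_k] - fit.predict(X[C_k])) ** 2)
        cv += len(C_k) / n * mse_k
    return cv

# Tiny usage example on synthetic data.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)
print(k_fold_cv_mse(X, y, K=5))
```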

A nice special case!

With least-squares linear or polynomial regression, an amazing shortcut makes the cost of LOOCV the same as that of a single model fit! The following formula holds:

CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2,

where \hat{y}_i is the ith fitted value from the original least squares fit, and h_i is the leverage (diagonal of the hat matrix; see book for details). This is like the ordinary MSE, except the ith residual is divided by 1 - h_i.

LOOCV is sometimes useful, but typically doesn't shake up the data enough. The estimates from each fold are highly correlated and hence their average can have high variance.

A better choice is K = 5 or 10.
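A small numerical check of the shortcut, as a sketch: fit a straight line by least squares, compute the leverages h_i from the hat matrix, and compare the formula to a brute-force leave-one-out loop. The simulated data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=60)
y = 1.0 + 2.0 * x + rng.normal(size=60)

# Design matrix with intercept, least-squares fit, residuals, and leverages h_i.
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

cv_shortcut = np.mean((resid / (1.0 - h)) ** 2)

# Brute-force LOOCV for comparison: refit n times, each time leaving one point out.
errs = []
for i in range(len(y)):
    keep = np.arange(len(y)) != i
    b = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    errs.append((y[i] - X[i] @ b) ** 2)
cv_brute = np.mean(errs)

print(cv_shortcut, cv_brute)   # the two values agree
```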

Auto data revisited

[Figure: Mean squared error versus degree of polynomial (1 to 10). Left panel: LOOCV. Right panel: 10-fold CV.]

True and estimated test MSE for the simulated data

[Figure: Three panels of mean squared error versus flexibility (2 to 20), one per simulated data set, showing the true test MSE together with the cross-validation estimates.]

Other issues with Cross-validation

Since each training set is only (K - 1)/K as big as the original training set, the estimates of prediction error will typically be biased upward. Why?

This bias is minimized when K = n (LOOCV), but this estimate has high variance, as noted earlier.

K = 5 or 10 provides a good compromise for this bias-variance tradeoff.

Cross-Validation for Classification Problems

We divide the data into K roughly equal-sized parts C_1, C_2, ..., C_K. C_k denotes the indices of the observations in part k. There are n_k observations in part k: if n is a multiple of K, then n_k = n/K.

Compute

CV_K = \sum_{k=1}^{K} \frac{n_k}{n} \, \mathrm{Err}_k

where \mathrm{Err}_k = \sum_{i \in C_k} I(y_i \ne \hat{y}_i) / n_k.

The estimated standard deviation of CV_K is

\widehat{\mathrm{SE}}(CV_K) = \sqrt{ \sum_{k=1}^{K} (\mathrm{Err}_k - \overline{\mathrm{Err}}_k)^2 / (K - 1) }

This is a useful estimate, but strictly speaking, not quite valid. Why not?
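A minimal sketch of CV_K and the standard-error estimate above for a classification problem; logistic regression stands in for the classifier, and the simulated data are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def cv_error_with_se(X, y, K=10, seed=0):
    """K-fold CV misclassification rate CV_K and the SE estimate built from the fold errors."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), K)
    errs = []
    for C_k in folds:
        train = np.setdiff1d(np.arange(n), C_k)
        clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
        errs.append(np.mean(clf.predict(X[C_k]) != y[C_k]))   # Err_k for this fold
    errs = np.array(errs)
    weights = np.array([len(C_k) for C_k in folds]) / n
    cv_k = np.sum(weights * errs)                              # CV_K
    se = np.sqrt(np.sum((errs - errs.mean()) ** 2) / (K - 1))  # SE estimate from the folds
    return cv_k, se

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
print(cv_error_with_se(X, y))
```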

Cross-validation: right and wrong

Consider a simple classifier applied to some two-class data:

1. Starting with 5000 predictors and 50 samples, find the 100 predictors having the largest correlation with the class labels.

2. We then apply a classifier such as logistic regression, using only these 100 predictors.

How do we estimate the test set performance of this classifier?

Can we apply cross-validation in step 2, forgetting about step 1?

NO!

This would ignore the fact that in Step 1, the procedure has already seen the labels of the training data, and made use of them. This is a form of training and must be included in the validation process.

It is easy to simulate realistic data with the class labels independent of the predictors, so that the true test error is 50%, but the CV error estimate that ignores Step 1 is zero! Try to do this yourself (a simulation sketch follows below).
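A simulation sketch of this point (our construction, not code from the slides): the labels are generated independently of the predictors, so the true error rate is 50%. Screening the 100 most correlated of 5000 predictors on the full data before cross-validating gives a wildly optimistic error estimate; redoing the screening inside each fold does not. The choice of logistic regression as the classifier is ours.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p, keep = 50, 5000, 100
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)          # labels independent of X: true error is 50%

def top_corr_features(X, y, keep):
    """Indices of the `keep` predictors with largest |correlation| with y."""
    yc = y - y.mean()
    corr = (X - X.mean(0)).T @ yc / (X.std(0) * yc.std() * len(y))
    return np.argsort(-np.abs(corr))[:keep]

def cv_error(X, y, K=5, screen_inside_fold=True):
    folds = np.array_split(rng.permutation(n), K)
    if not screen_inside_fold:           # WRONG: screen on all data, then cross-validate
        X = X[:, top_corr_features(X, y, keep)]
    errs = []
    for C_k in folds:
        train = np.setdiff1d(np.arange(n), C_k)
        cols = (top_corr_features(X[train], y[train], keep)
                if screen_inside_fold else np.arange(X.shape[1]))   # RIGHT: screen per fold
        clf = LogisticRegression(max_iter=1000).fit(X[np.ix_(train, cols)], y[train])
        errs.append(np.mean(clf.predict(X[np.ix_(C_k, cols)]) != y[C_k]))
    return np.mean(errs)

print("wrong way:", cv_error(X, y, screen_inside_fold=False))  # far below 0.5
print("right way:", cv_error(X, y, screen_inside_fold=True))   # close to 0.5
```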

The Wrong and Right Way

Wrong: Apply cross-validation in step 2.

Right: Apply cross-validation to steps 1 and 2.

Wrong Way

[Figure: schematic of the wrong way: the selected set of predictors is chosen using the outcome for all samples, and the CV folds are formed only afterwards.]

The Bootstrap

The bootstrap is a flexible and powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method.

For example, it can provide an estimate of the standard error of a coefficient, or a confidence interval for that coefficient.

Where does the name come from?

The use of the term bootstrap derives from the phrase "to pull oneself up by one's bootstraps", widely thought to be based on an episode in the eighteenth-century book The Surprising Adventures of Baron Munchausen by Rudolph Erich Raspe:

    The Baron had fallen to the bottom of a deep lake. Just when it looked like all was lost, he thought to pick himself up by his own bootstraps.

It is not the same as the term "bootstrap" used in computer science, meaning to boot a computer from a set of core instructions, though the derivation is similar.

A simple example

Suppose that we wish to invest a fixed sum of money in two financial assets that yield returns of X and Y, respectively, where X and Y are random quantities.

We will invest a fraction α of our money in X, and will invest the remaining 1 - α in Y.

We wish to choose α to minimize the total risk, or variance, of our investment. In other words, we want to minimize Var(αX + (1 - α)Y).

One can show that the value that minimizes the risk is given by

\alpha = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}},

where \sigma_X^2 = Var(X), \sigma_Y^2 = Var(Y), and \sigma_{XY} = Cov(X, Y).

Example continued

But the values of σ_X^2, σ_Y^2, and σ_XY are unknown.

We can compute estimates for these quantities, σ̂_X^2, σ̂_Y^2, and σ̂_XY, using a data set that contains measurements for X and Y.

We can then estimate the value of α that minimizes the variance of our investment using

\hat{\alpha} = \frac{\hat{\sigma}_Y^2 - \hat{\sigma}_{XY}}{\hat{\sigma}_X^2 + \hat{\sigma}_Y^2 - 2\hat{\sigma}_{XY}}.
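A small sketch of the plug-in estimate α̂ on simulated returns; the covariance parameters match those quoted on the simulation slides (σ_X^2 = 1, σ_Y^2 = 1.25, σ_XY = 0.5), and the helper name alpha_hat is ours.

```python
import numpy as np

def alpha_hat(x, y):
    """Plug-in estimate of alpha from paired samples of X and Y."""
    cov = np.cov(x, y)                  # 2x2 sample covariance matrix
    sx2, sy2, sxy = cov[0, 0], cov[1, 1], cov[0, 1]
    return (sy2 - sxy) / (sx2 + sy2 - 2 * sxy)

rng = np.random.default_rng(0)
cov = [[1.0, 0.5], [0.5, 1.25]]         # sigma_X^2 = 1, sigma_Y^2 = 1.25, sigma_XY = 0.5
x, y = rng.multivariate_normal([0, 0], cov, size=100).T
print(alpha_hat(x, y))                  # close to the true alpha = 0.6
```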

Example continued

[Figure: Each panel displays 100 simulated returns for investments X and Y. From left to right and top to bottom, the resulting estimates for α are 0.576, 0.532, 0.657, and 0.651.]

Example continued

To estimate the standard deviation of α̂, we repeated the process of simulating 100 paired observations of X and Y, and estimating α, 1,000 times.

We thereby obtained 1,000 estimates for α, which we can call α̂_1, α̂_2, ..., α̂_1000.

The left-hand panel of the figure on the Results slide below displays a histogram of the resulting estimates.

For these simulations the parameters were set to σ_X^2 = 1, σ_Y^2 = 1.25, and σ_XY = 0.5, and so we know that the true value of α is 0.6 (indicated by the red line).

Example continued

The mean over all 1,000 estimates for α is

\bar{\alpha} = \frac{1}{1000} \sum_{r=1}^{1000} \hat{\alpha}_r = 0.5996,

very close to α = 0.6, and the standard deviation of the estimates is

\sqrt{ \frac{1}{1000 - 1} \sum_{r=1}^{1000} (\hat{\alpha}_r - \bar{\alpha})^2 } = 0.083.

This gives us a very good idea of the accuracy of α̂: SE(α̂) ≈ 0.083.

So roughly speaking, for a random sample from the population, we would expect α̂ to differ from α by approximately 0.08, on average.

Results

[Figure: Left: a histogram of the estimates of α obtained by generating 1,000 simulated data sets from the true population. Center: a histogram of the estimates of α obtained from 1,000 bootstrap samples from a single data set. Right: the estimates of α displayed in the left and center panels are shown as boxplots. In each panel, the pink line indicates the true value of α.]

Now back to the real world

The procedure outlined above cannot be applied, because for real data we cannot generate new samples from the original population.

However, the bootstrap approach allows us to use a computer to mimic the process of obtaining new data sets, so that we can estimate the variability of our estimate without generating additional samples.

Rather than repeatedly obtaining independent data sets from the population, we instead obtain distinct data sets by repeatedly sampling observations from the original data set with replacement.

Each of these bootstrap data sets is created by sampling with replacement, and is the same size as our original dataset. As a result some observations may appear more than once in a given bootstrap data set and some not at all.

Example with just 3 observations

Original data Z:

    Obs    X      Y
    1      4.3    2.4
    2      2.1    1.1
    3      5.3    2.8

[Figure: A graphical illustration of the bootstrap approach on a small sample containing n = 3 observations. Each bootstrap data set Z*1, Z*2, ..., Z*B contains n observations, sampled with replacement from the original data set, and is used to obtain an estimate of α.]

Denoting the first bootstrap data set by Z*1, we use Z*1 to produce a new bootstrap estimate for α, which we call α̂*1.

This procedure is repeated B times for some large value of B (say 100 or 1000), in order to produce B different bootstrap data sets, Z*1, Z*2, ..., Z*B, and B corresponding α estimates, α̂*1, α̂*2, ..., α̂*B.

We estimate the standard error of these bootstrap estimates using the formula

\widehat{\mathrm{SE}}_B(\hat{\alpha}) = \sqrt{ \frac{1}{B - 1} \sum_{r=1}^{B} \left( \hat{\alpha}^{*r} - \bar{\hat{\alpha}}^{*} \right)^2 }.

This serves as an estimate of the standard error of α̂ estimated from the original data set. See the center and right panels of the figure on the Results slide above; bootstrap results are shown in blue. For this example, SE_B(α̂) = 0.087.
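A minimal sketch of the bootstrap standard-error computation for α̂; the alpha_hat helper and the simulated data are illustrative.

```python
import numpy as np

def alpha_hat(x, y):
    cov = np.cov(x, y)
    return (cov[1, 1] - cov[0, 1]) / (cov[0, 0] + cov[1, 1] - 2 * cov[0, 1])

def bootstrap_se(x, y, B=1000, seed=0):
    """Standard error of alpha_hat from B bootstrap samples of the data."""
    rng = np.random.default_rng(seed)
    n = len(x)
    estimates = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)        # sample n observations with replacement
        estimates.append(alpha_hat(x[idx], y[idx]))
    estimates = np.array(estimates)
    return np.sqrt(np.sum((estimates - estimates.mean()) ** 2) / (B - 1))

rng = np.random.default_rng(1)
x, y = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.25]], size=100).T
print(bootstrap_se(x, y))                       # roughly 0.08-0.09 for data like this
```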

A general picture for the bootstrap

[Figure: schematic contrasting the real world with the bootstrap world. Real world: random sampling from the population P gives a data set Z = (z_1, z_2, ..., z_n), from which we compute the estimate f(Z). Bootstrap world: random sampling from the estimated population P̂ gives a bootstrap data set Z* = (z*_1, z*_2, ..., z*_n), from which we compute the bootstrap estimate f(Z*).]

The bootstrap in general

In more complex data situations, figuring out the appropriate way to generate bootstrap samples can require some thought.

For example, if the data is a time series, we can't simply sample the observations with replacement (why not?).

We can instead create blocks of consecutive observations, and sample those with replacement. Then we paste together the sampled blocks to obtain a bootstrap dataset.
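A small sketch of the block idea: sample blocks of consecutive observations with replacement and paste them together. The block length of 20 is an arbitrary illustrative choice.

```python
import numpy as np

def block_bootstrap(series, block_len=20, seed=0):
    """One bootstrap replicate of a time series, built from resampled blocks."""
    rng = np.random.default_rng(seed)
    n = len(series)
    blocks = [series[i:i + block_len] for i in range(0, n - block_len + 1)]
    pieces = []
    while sum(len(p) for p in pieces) < n:
        pieces.append(blocks[rng.integers(len(blocks))])   # sample a block with replacement
    return np.concatenate(pieces)[:n]                      # paste blocks, trim to length n

rng = np.random.default_rng(1)
ts = np.cumsum(rng.normal(size=200))    # a toy autocorrelated series
print(block_bootstrap(ts)[:5])
```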

Other uses of the bootstrap

Primarily used to obtain standard errors of an estimate.

Also provides approximate confidence intervals for a population parameter. For example, looking at the histogram in the middle panel of the figure on the Results slide, the 5% and 95% quantiles of the 1,000 values are (0.43, 0.72).

This represents an approximate 90% confidence interval for the true α. How do we interpret this confidence interval?

The above interval is called a Bootstrap Percentile confidence interval. It is the simplest method (among many approaches) for obtaining a confidence interval from the bootstrap.
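A short sketch of the bootstrap percentile interval: take the 5% and 95% quantiles of the bootstrap estimates. The alpha_star values below are stand-ins for the B bootstrap estimates computed earlier.

```python
import numpy as np

def percentile_ci(bootstrap_estimates, level=0.90):
    """Bootstrap percentile confidence interval from a vector of bootstrap estimates."""
    lo = (1 - level) / 2
    return np.quantile(bootstrap_estimates, [lo, 1 - lo])

# Usage: alpha_star would normally hold the B bootstrap estimates of alpha;
# here it is filled with stand-in values purely to make the sketch runnable.
alpha_star = np.random.default_rng(0).normal(0.58, 0.09, size=1000)
print(percentile_ci(alpha_star))        # an interval in the spirit of (0.43, 0.72)
```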

Can the bootstrap estimate prediction error?

In cross-validation, each of the K validation folds is distinct from the other K - 1 folds used for training: there is no overlap. This is crucial for its success. Why?

To estimate prediction error using the bootstrap, we could think about using each bootstrap dataset as our training sample, and the original sample as our validation sample.

But each bootstrap sample has significant overlap with the original data. About two-thirds of the original data points appear in each bootstrap sample. Can you prove this? (A derivation is sketched below.)

This will cause the bootstrap to seriously underestimate the true prediction error. Why?

The other way around, with the original sample as the training sample and the bootstrap dataset as the validation sample, is worse!
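A sketch of why roughly two-thirds of the observations appear in each bootstrap sample (our derivation, written as a short LaTeX fragment):

```latex
% Each of the n draws misses observation i with probability (n-1)/n, and the
% draws are independent, so the bootstrap sample misses i entirely with
% probability (1 - 1/n)^n.
\[
  \Pr(i \in Z^{*}) \;=\; 1 - \Bigl(1 - \tfrac{1}{n}\Bigr)^{n}
  \;\approx\; 1 - e^{-1} \;\approx\; 0.632
  \quad \text{for large } n.
\]
```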

Removing the overlap

Can partly fix this problem by only using predictions for those observations that did not (by chance) occur in the current bootstrap sample.

But the method gets complicated, and in the end, cross-validation provides a simpler, more attractive approach for estimating prediction error.

Pre-validation

In microarray and other genomic studies, an important problem is to compare a predictor of disease outcome derived from a large number of biomarkers to standard clinical predictors.

Comparing them on the same dataset that was used to derive the biomarker predictor can lead to results strongly biased in favor of the biomarker predictor.

Pre-validation can be used to make a fairer comparison between the two sets of predictors.

Motivating example

An example of this problem arose in the paper of van't Veer et al., Nature (2002). Their microarray data has 4918 genes measured over 78 cases, taken from a study of breast cancer. There are 44 cases in the good prognosis group and 34 in the poor prognosis group. A microarray predictor was constructed as follows:

1. 70 genes were selected, having largest absolute correlation with the 78 class labels.

2. Using these 70 genes, a nearest-centroid classifier C(x) was constructed.

3. Applying the classifier to the 78 microarrays gave a dichotomous predictor z_i = C(x_i) for each case i.

Results

Comparison of the microarray predictor with some clinical predictors, using logistic regression with outcome prognosis:

Model            Coef     Stand. Err.   Z score   p-value
Re-use
  microarray      4.096    1.092         3.753     0.000
  angio           1.208    0.816         1.482     0.069
  er             -0.554    1.044        -0.530     0.298
  grade          -0.697    1.003        -0.695     0.243
  pr              1.214    1.057         1.149     0.125
  age            -1.593    0.911        -1.748     0.040
  size            1.483    0.732         2.026     0.021
Pre-validated
  microarray      1.549    0.675         2.296     0.011
  angio           1.589    0.682         2.329     0.010
  er             -0.617    0.894        -0.690     0.245
  grade           0.719    0.720         0.999     0.159
  pr              0.537    0.863         0.622     0.267
  age            -1.471    0.701        -2.099     0.018
  size            0.998    0.594         1.681     0.046

Idea behind Pre-validation

Designed for comparison of adaptively derived predictors to fixed, pre-defined predictors.

The idea is to form a "pre-validated" version of the adaptive predictor: specifically, a "fairer" version that hasn't seen the response y.

Pre-validation process

[Figure: schematic of the pre-validation process, showing the matrix of observations and predictors, the omitted data for each fold, the resulting pre-validated predictor, and the logistic regression of the response on this predictor together with the fixed (clinical) predictors.]

Pre-validation in detail for this example

1. Divide the cases up into K = 13 equal-sized parts of 6 cases each.

2. Set aside one of the parts. Using only the data from the other 12 parts, select the features having absolute correlation at least 0.3 with the class labels, and form a nearest-centroid classification rule.

3. Use the rule to predict the class labels for the 13th part.

4. Do steps 2 and 3 for each of the 13 parts, yielding a pre-validated microarray predictor z_i for each of the 78 cases.

5. Fit a logistic regression model to the pre-validated microarray predictor and the 6 clinical predictors.
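A sketch of the pre-validation loop just described, with a correlation screen at threshold 0.3 and a nearest-centroid rule as in steps 1-5; the simulated stand-in data and variable names are illustrative, not the van't Veer data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(0)
n, p, K = 78, 4918, 13
X = rng.normal(size=(n, p))                 # stand-in for the 78 x 4918 expression matrix
y = (np.arange(n) < 44).astype(int)         # stand-in prognosis labels (44 good, 34 poor)
clinical = rng.normal(size=(n, 6))          # stand-in for the 6 clinical predictors

def screened_features(X, y, threshold=0.3):
    """Indices of features with |correlation| >= threshold with the class labels."""
    yc = y - y.mean()
    corr = (X - X.mean(0)).T @ yc / (X.std(0) * yc.std() * len(y))
    return np.where(np.abs(corr) >= threshold)[0]

# Pre-validation: each case's predictor value is built without ever seeing that case's label.
z = np.empty(n)
folds = np.array_split(rng.permutation(n), K)      # 13 parts of 6 cases each
for C_k in folds:
    train = np.setdiff1d(np.arange(n), C_k)
    cols = screened_features(X[train], y[train])
    rule = NearestCentroid().fit(X[np.ix_(train, cols)], y[train])
    z[C_k] = rule.predict(X[np.ix_(C_k, cols)])

# Compare the pre-validated predictor with the clinical predictors in one logistic regression.
model = LogisticRegression(max_iter=1000).fit(np.column_stack([z, clinical]), y)
print(model.coef_)
```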

The Bootstrap versus Permutation tests

The bootstrap samples from the estimated population, and uses the results to estimate standard errors and confidence intervals.

Permutation methods sample from an estimated null distribution for the data, and use this to estimate p-values and False Discovery Rates for hypothesis tests.

The bootstrap can be used to test a null hypothesis in simple situations. For example, if θ = 0 is the null hypothesis, we check whether the confidence interval for θ contains zero.

Can also adapt the bootstrap to sample from a null distribution (see Efron and Tibshirani's book "An Introduction to the Bootstrap" (1993), chapter 16), but there's no real advantage over permutations.