
Binary Response: Logistic Regression

STAT 526
Professor Olga Vitek
March 29, 2011

4

Model Specification and Interpretation

4-1

Probability Distribution of a Binary Outcome Y

• In many situations, the response variable has only two possible outcomes

  – Diseased (Y = 1) vs not diseased (Y = 0)
  – Employed (Y = 1) vs unemployed (Y = 0)

• Such a response is called binary or dichotomous

• Can model the response using a Bernoulli distribution:

  Yi   Probability
  1    Pr{Yi = 1} = πi
  0    Pr{Yi = 0} = 1 − πi

• E{Yi} = πi

• Var{Yi} = πi(1 − πi)

4-2

Goal: express E{Y} as a function of a covariate X

• The simple linear regression model

  E{Yi} = β0 + β1Xi

  is not appropriate. It violates several assumptions:

(1) Does not enforce the constraint 0 ≤ E{Yi} ≤ 1

(2) Non-normal (binary) distribution of ε | X:

  When Yi = 0: εi = 0 − β0 − β1Xi
  When Yi = 1: εi = 1 − β0 − β1Xi

(3) Non-constant variance:

  Var{Yi} = πi(1 − πi) = (β0 + β1Xi)(1 − β0 − β1Xi)

4-3

Solution: a Generalized Linear Model

• A generalized linear model is

  E{Yi} = g(β0 + β1Xi), or
  g⁻¹(E{Yi}) = β0 + β1Xi

  where g is a sigmoid function in (0,1).

  – g is called the mean response function
  – g⁻¹ is called the link function

• The choice of g produces different models:

  – g(t) = t (identity) → linear regression
  – g(t) = Φ(t), the standard Normal CDF → probit regression
  – g(t) = exp(t)/(1 + exp(t)), the CDF of the logistic distribution → logistic regression

4-4
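(Aside, not from the slides: the lecture contains no code, but a minimal Python sketch makes the bounded-vs-unbounded contrast among the three choices of g concrete. The coefficients and grid are invented illustration values.)

import numpy as np
from scipy.special import expit   # expit(t) = exp(t) / (1 + exp(t))
from scipy.stats import norm

beta0, beta1 = -1.0, 0.5          # made-up coefficients, for illustration only
x = np.linspace(-10, 10, 5)
eta = beta0 + beta1 * x           # linear predictor

print("identity:", eta)           # linear regression: not confined to (0,1)
print("probit:  ", norm.cdf(eta)) # Phi(eta), always in (0,1)
print("logistic:", expit(eta))    # exp(eta)/(1+exp(eta)), always in (0,1)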

Motivation for Probit Regression: Latent Variable

• Assume that the binary response is guided by a non-observed (latent) continuous variable

• Example: linear model for blood pressure, with ε symmetric about 0 (e.g., ε ~ N(0, σ²)):

  bp = β0 + β1·age + ε

  We only observe

  Y = 1 (disease), if blood pressure > c
  Y = 0 (healthy), if blood pressure ≤ c

• Then

  Pr{Y = 1} = Pr{bp > c} = Pr{β0 + β1·age + ε > c}
            = Pr{ε < β0 + β1·age − c}            (by symmetry of ε)
            = Pr{ ε/σ < (β0 − c)/σ + (β1/σ)·age }
            = Pr{z < β0′ + β1′·age}
            = Φ(β0′ + β1′·age)

4-5
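(A hedged simulation of the latent-variable argument above, not from the slides; all numbers are invented, and statsmodels' Probit is used as a convenience.)

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, beta0, beta1, c, sigma = 50_000, 90.0, 0.8, 140.0, 10.0
age = rng.uniform(20, 80, n)
bp = beta0 + beta1 * age + rng.normal(0, sigma, n)   # latent blood pressure
y = (bp > c).astype(int)                             # observed binary disease status

fit = sm.Probit(y, sm.add_constant(age)).fit(disp=0)
print(fit.params)                                    # approximates the rescaled coefficients:
print((beta0 - c) / sigma, beta1 / sigma)            # beta0' and beta1'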

Logistic Response Function and Logistic Regression

• A sigmoidal response function

  E{Yi} = exp(β0 + β1Xi) / (1 + exp(β0 + β1Xi)) = 1 / (1 + exp(−β0 − β1Xi))

  – A monotone increasing/decreasing function
  – Explicit functional form
  – Restricts 0 ≤ E{Yi} ≤ 1
  – An example of a nonlinear model

• Logit link function

  log( E{Yi} / (1 − E{Yi}) ) = log( πi / (1 − πi) ) = β0 + β1Xi

4-6

Probability Distribution of Y in Logistic Regression

• Yi are independent but not identically distributed Bernoulli random variables:

  Yi ~ind Bernoulli(πi), where πi = exp(β0 + β1Xi) / (1 + exp(β0 + β1Xi))

  – Note: there is no error term anymore!

• Probability density of Yi:

  f(Yi) = πi^Yi (1 − πi)^(1−Yi)

• Least squares estimates are inappropriate

  – Use maximum likelihood for parameter estimation

4-7

Simple Logistic Regression: Interpretation of b1

• Fitted value for individual i:

  π̂i = e^(b0 + b1Xi) / (1 + e^(b0 + b1Xi))

• Fitted logistic response (i.e., fitted log odds):

  loge( odds(Xi) ) = loge( π̂(Xi) / (1 − π̂(Xi)) ) = b0 + b1Xi

  – b1 is the slope of the fitted logistic response

• b1 is interpreted as a log(odds ratio):

  b1 = loge( π̂(Xi + 1) / (1 − π̂(Xi + 1)) ) − loge( π̂(Xi) / (1 − π̂(Xi)) )
     = loge( odds(Xi + 1) / odds(Xi) )

  – odds ratio(Xi) = odds(Xi + 1) / odds(Xi) = exp(b1)

4-8
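(A tiny Python illustration of the odds-ratio interpretation above, not from the slides; the fitted values b0 and b1 are made up.)

import numpy as np

b0, b1 = -2.0, 0.7                 # assumed fitted coefficients

def odds(x):                       # odds(X) = exp(b0 + b1*X)
    return np.exp(b0 + b1 * x)

x = 3.0
print(odds(x + 1) / odds(x))       # odds ratio for a one-unit increase in X
print(np.exp(b1))                  # equals exp(b1) exactly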

Estimation by Maximum Likelihood

• Yi ~ind Bernoulli(πi), where πi = exp(β0 + β1Xi) / (1 + exp(β0 + β1Xi))

• Log-likelihood:

  loge(L) = log{ ∏_{i=1}^{n} πi^Yi (1 − πi)^(1−Yi) }
          = Σ_{i=1}^{n} Yi log(πi) + Σ_{i=1}^{n} (1 − Yi) log(1 − πi)
          = Σ_{i=1}^{n} Yi log( πi / (1 − πi) ) + Σ_{i=1}^{n} log(1 − πi)
          = Σ_{i=1}^{n} Yi (β0 + β1Xi) − Σ_{i=1}^{n} log(1 + exp(β0 + β1Xi))

• The MLEs do not have closed forms

4-9
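(A minimal numerical-MLE sketch of the log-likelihood displayed above, not from the slides; the data are simulated and the true coefficients are arbitrary. Since no closed form exists, a general-purpose optimizer is used.)

import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(1)
n, beta_true = 2000, np.array([-1.0, 2.0])
X = rng.normal(size=n)
Y = rng.binomial(1, expit(beta_true[0] + beta_true[1] * X))

def neg_loglik(beta):
    eta = beta[0] + beta[1] * X
    # -[ sum Yi*eta_i - sum log(1 + exp(eta_i)) ], as on the slide
    return -(Y @ eta - np.logaddexp(0.0, eta).sum())

b = minimize(neg_loglik, x0=np.zeros(2), method="BFGS").x
print(b)                            # MLEs b0, b1; close to beta_true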

Equivalent Specification: Binomial Distribution

• Change in notation

  – Data: (Yij, ni, Xi), i = 1, 2, · · · , c
  – Xi: predictor for observation i
  – ni: # of Bernoulli trials in observation i
  – Yi′ := Σ_{j=1}^{ni} Yij

  – Model: Yi′ ~ind Binomial(ni, πi), where πi = exp(β0 + β1Xi) / (1 + exp(β0 + β1Xi))

• Log-likelihood:

  loge(L) = log ∏_{i=1}^{c} { C(ni, Yi′) πi^(Yi′) (1 − πi)^(ni − Yi′) }
          = Σ_{i=1}^{c} { Yi′ log(πi) + (ni − Yi′) log(1 − πi) + log C(ni, Yi′) }
          = Σ_{i=1}^{c} { Yi′ log( πi / (1 − πi) ) + ni log(1 − πi) + log C(ni, Yi′) }

  where C(ni, Yi′) denotes the binomial coefficient "ni choose Yi′".

4-10

Equivalence of Bernoulli and Binomial Models

• The Binomial log-likelihood equals the Bernoulli log-likelihood, up to a constant:

  loge(L)_Binomial
    = Σ_{i=1}^{c} { Yi′ log(πi) + (ni − Yi′) log(1 − πi) } + constant
    = Σ_{i=1}^{c} [ Σ_{j=1}^{ni} Yij log(πi) + (ni − Σ_{j=1}^{ni} Yij) log(1 − πi) ] + constant
    = Σ_{i=1}^{c} Σ_{j=1}^{ni} { Yij log( πi / (1 − πi) ) + log(1 − πi) } + constant
    = loge(L)_Bernoulli + constant

• Both models lead to the same parameter estimates and inferences, but have different deviances.

4-11
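(A sketch of the equivalence claim above, not from the slides: fit the same model to ungrouped Bernoulli data and to grouped Binomial data; the data here are simulated.)

import numpy as np
import statsmodels.api as sm
from scipy.special import expit

rng = np.random.default_rng(2)
levels = np.repeat(np.arange(5, dtype=float), 200)   # 5 covariate configurations
y = rng.binomial(1, expit(-2.0 + 0.8 * levels))

# Bernoulli form: one row per subject
bern = sm.GLM(y, sm.add_constant(levels), family=sm.families.Binomial()).fit()

# Binomial form: one row per configuration, endog = (successes, failures)
x_u = np.unique(levels)
succ = np.array([y[levels == v].sum() for v in x_u])
fail = np.array([(levels == v).sum() for v in x_u]) - succ
binom = sm.GLM(np.column_stack([succ, fail]), sm.add_constant(x_u),
               family=sm.families.Binomial()).fit()

print(bern.params, binom.params)        # identical estimates
print(bern.deviance, binom.deviance)    # deviances differ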

Prospective and Retrospective Studies

• Prospective study: fix the predictors, observe the outcome

  – Recruit patients with 2 genotypes, compare occurrence of disease.
  – Expensive; large variance for rare diseases

• Retrospective (= case-control) study: fix the outcome, observe the predictors

  – Recruit patients with and without the disease, compare the genotypes.
  – Cheaper

• The log odds ratio based on logistic regression is the same for both study designs.

  – Not true for other link functions

4-12

Prospective Model With Retrospective Data: Formulation

• Assume a prospective model:

  π(Xi) = P(Yi = 1|Xi) = e^(β0 + β1Xi) / (1 + e^(β0 + β1Xi))

• However, the subjects are selected retrospectively.

• The random variable is Zi = 1{subject i is included in the sample}, conditional on Y.

  – Denote θ0 = P(Zi = 1|Yi = 0) and θ1 = P(Zi = 1|Yi = 1)
  – Note that θ0 and θ1 must be independent of Xi; otherwise we introduce selection bias.

• Of interest in a retrospective study is P(Yi = 1|Xi, Zi = 1)

4-13

Prospective Model With Retrospective Data: Estimation of log(OR)

• With retrospective sampling, we model:

  P(Yi = 1|Xi, Zi = 1)
    = P(Yi = 1, Zi = 1|Xi) / P(Zi = 1|Xi)
    = P(Yi = 1, Zi = 1|Xi) / [ P(Yi = 0, Zi = 1|Xi) + P(Yi = 1, Zi = 1|Xi) ]
    = θ1 π(Xi) / [ θ0 {1 − π(Xi)} + θ1 π(Xi) ]
    = θ1 e^(β0 + β1Xi) / ( θ0 + θ1 e^(β0 + β1Xi) )
    = e^(log(θ1/θ0) + β0 + β1Xi) / ( 1 + e^(log(θ1/θ0) + β0 + β1Xi) )

• We uncover the same β1 as in the prospective study:

  ⟹ log{ P(Yi = 1|Xi, Zi = 1) / P(Yi = 0|Xi, Zi = 1) } = [log(θ1/θ0) + β0] + β1Xi

4-14
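(A hedged simulation of the result above, not from the slides: case-control sampling shifts only the intercept, by log(θ1/θ0). All rates are invented.)

import numpy as np
import statsmodels.api as sm
from scipy.special import expit

rng = np.random.default_rng(3)
n, b0, b1 = 400_000, -4.0, 1.0
x = rng.normal(size=n)
y = rng.binomial(1, expit(b0 + b1 * x))

theta1, theta0 = 0.9, 0.02            # inclusion probabilities for cases / controls
keep = rng.uniform(size=n) < np.where(y == 1, theta1, theta0)

fit = sm.Logit(y[keep], sm.add_constant(x[keep])).fit(disp=0)
print(fit.params)                      # slope ~ b1 = 1.0
print(b0 + np.log(theta1 / theta0))    # intercept shifted to ~ b0 + log(theta1/theta0)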

Multiple Logistic Regression

• Extension of the response function:

  E{Yi} = exp(Xi′β) / (1 + exp(Xi′β))

• Extension of the logit link function:

  loge( E{Yi} / (1 − E{Yi}) ) = loge( πi / (1 − πi) ) = Xi′β

• The multivariate likelihood function:

  loge L(β) = Σ_{i=1}^{n} Yi (Xi′β) − Σ_{i=1}^{n} loge[1 + exp(Xi′β)]

• Interpretation of bj:

  bj = loge( odds(Xj + 1) / odds(Xj) ), while the other predictors are held fixed

4-15


Testing

4-16

Asymptotic Properties of the MLE β̂

• Asymptotic existence and uniqueness:

  – P{β̂ exists and is unique} → 1 as n → ∞

• Consistency:

  – β̂ → β as n → ∞

• Asymptotic Normality:

  – β̂ ~approx N(β, I(β)⁻¹) as n → ∞

• Asymptotic efficiency:

  – The MLE has asymptotically smaller variance than many other estimators

4-17

Inference About an Individual βj: Wald Test

• Test H0: βj = 0 versus Ha: βj ≠ 0.

• Test statistic: z* = (bj − 0) / s{bj}

• Approximate variance s²{b}:

  s²{b} = [ − ∂² loge L(β) / ∂βj ∂βj′ ]⁻¹, evaluated at β = b

  (the inverse of the observed information matrix at the estimate)

• Approximate distribution of z*:

  – z* ≈ N(0, 1); alternatively, (z*)² ≈ χ²_1
  – Reject H0 if |z*| > z_{1−α/2}
  – CI for βj: bj ± z_{1−α/2} s{bj}

4-18
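(A short Python sketch of the Wald test above, not from the slides; the data are simulated and α = 0.05 is chosen for the example.)

import numpy as np
import statsmodels.api as sm
from scipy.special import expit
from scipy.stats import norm

rng = np.random.default_rng(4)
x = rng.normal(size=500)
y = rng.binomial(1, expit(0.3 * x))
fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)

b1, se1 = fit.params[1], fit.bse[1]    # bse: from the inverse observed information
z = (b1 - 0) / se1
print(z, 2 * norm.sf(abs(z)))          # Wald statistic and two-sided p-value
print(b1 + np.array([-1, 1]) * norm.ppf(0.975) * se1)   # 95% CI for beta1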

Simultaneous Inference About Several βj = 0: Likelihood Ratio Test

• Multivariate logistic regression:

  log( E{Yi} / (1 − E{Yi}) ) = β0 + β1Xi1 + · · · + βp−1 Xi,p−1

• Test H0: β1 = β2 = · · · = βq = 0 versus Ha: not all of β1, β2, · · · , βq equal 0

• Test statistic:

  G² = −2 loge[ L(reduced model) / L(full model) ]
     = −2 [ loge L(reduced model) − loge L(full model) ]

• Approximate distribution of G² for large n: χ²_q under H0

  – Reject H0 if G² > χ²(1 − α, q)

4-19
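(A sketch of the likelihood ratio test above, not from the slides, comparing full vs reduced model on simulated data; the second predictor is truly null here.)

import numpy as np
import statsmodels.api as sm
from scipy.special import expit
from scipy.stats import chi2

rng = np.random.default_rng(5)
n = 1000
X = rng.normal(size=(n, 2))
y = rng.binomial(1, expit(0.8 * X[:, 0]))            # X2 has no effect

full = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
reduced = sm.Logit(y, sm.add_constant(X[:, 0])).fit(disp=0)

G2 = -2 * (reduced.llf - full.llf)                   # q = 1 restriction tested
print(G2, chi2.sf(G2, df=1))                         # compare to chi^2_1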

Comments: Wald Test Versus Likelihood Ratio Test

• Both tests are approximate, for large n

• Likelihood Ratio test: βj = 0, or several βj = 0 simultaneously

  – no other values of βj
  – no one-sided tests

• Wald test: βj equal to any value of interest, or linear combinations of βj

• When testing a single H0: βj = 0, the two tests may lead to different conclusions

  – due to the approximate nature of the tests
  – unlike in linear regression

4-20

Quality of Fit: Replicated Data

4-21

Lack of Fit for Replicated Data: Pearson χ²

H0: log( E{Yij} / (1 − E{Yij}) ) = β0 + β1Xi1 + · · · + βp−1 Xi,p−1   vs
Ha: log( E{Yij} / (1 − E{Yij}) ) ≠ β0 + β1Xi1 + · · · + βp−1 Xi,p−1

• c covariate configurations, each with ni cases

• Observed counts

  – Oi0: observed # of 0's in configuration i
  – Oi1: observed # of 1's in configuration i

• Expected counts

  – Ei0 = ni(1 − π̂i): expected # of 0's in configuration i
  – Ei1 = ni π̂i: expected # of 1's in configuration i

• Test statistic: reject H0 if

  X² = Σ_{i=1}^{c} Σ_{k=0}^{1} (Oik − Eik)² / Eik > χ²(1 − α, c − p)

4-22
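(A sketch of the Pearson lack-of-fit statistic above for replicated data, not from the slides; the grouped data are simulated and the model is fit with statsmodels.)

import numpy as np
import statsmodels.api as sm
from scipy.special import expit
from scipy.stats import chi2

rng = np.random.default_rng(6)
x = np.array([0., 1., 2., 3., 4., 5.])               # c = 6 configurations
n_i = np.full(6, 100)
succ = rng.binomial(n_i, expit(-2 + 0.9 * x))
fit = sm.GLM(np.column_stack([succ, n_i - succ]), sm.add_constant(x),
             family=sm.families.Binomial()).fit()

pi_hat = fit.fittedvalues                            # estimated pi_i per configuration
E1, E0 = n_i * pi_hat, n_i * (1 - pi_hat)            # expected counts of 1's and 0's
O1, O0 = succ, n_i - succ
X2 = ((O1 - E1) ** 2 / E1 + (O0 - E0) ** 2 / E0).sum()
print(X2, chi2.sf(X2, df=6 - 2))                     # df = c - p, with p = 2 here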

Lack of Fit for Replicated Data: Deviance

H0: log( E{Yij} / (1 − E{Yij}) ) = β0 + β1Xi1 + · · · + βp−1 Xi,p−1   vs
Ha: log( E{Yij} / (1 − E{Yij}) ) ≠ β0 + β1Xi1 + · · · + βp−1 Xi,p−1

• c covariate configurations, each with ni cases

  – H0: the model of interest; Ha: the saturated model

• Model of interest

  – E{Yij} = πi, estimated by the fitted π̂i

• Saturated model

  – E{Yij} = pi, estimated by p̂i = Σ_{j=1}^{ni} Yij / ni

• Test statistic: the LR statistic, also called the deviance

• The deviance of a saturated model is always 0

4-23

Lack of Fit for Replicated Data: Deviance

G² = DEV(X0, X1, . . . , Xp−1)
   = −2 [ loge L(current model) − loge L(saturated model) ]
   = −2 Σ_{i=1}^{c} [ Σ_{j=1}^{ni} Yij loge π̂i + (ni − Σ_{j=1}^{ni} Yij) loge(1 − π̂i) ]
     + 2 Σ_{i=1}^{c} [ Σ_{j=1}^{ni} Yij loge p̂i + (ni − Σ_{j=1}^{ni} Yij) loge(1 − p̂i) ]
   = −2 Σ_{i=1}^{c} [ Σ_{j=1}^{ni} Yij loge( π̂i / p̂i ) + (ni − Σ_{j=1}^{ni} Yij) loge( (1 − π̂i) / (1 − p̂i) ) ]

• Reject H0 if G² > χ²(1 − α, c − p)

• The χ²(1 − α, c − p) approximation can be poor

  – The closer the distribution to Gaussian, and the closer the link to identity, the better the approximation
  – Unlike with the LR test, the quality of the approximation does not improve with the sample size

4-24

Note: Can Use Deviance for LR Test of Nested Models

log( E{Yij} / (1 − E{Yij}) ) = β0 + β1Xi1 + · · · + βp−1 Xi,p−1

• Test H0: β1 = β2 = · · · = βq = 0 versus Ha: not all of β1, β2, · · · , βq equal 0

• Likelihood Ratio test statistic:

  G² = −2 loge[ L(reduced model) / L(full model) ]
     = −2 [ loge L(reduced model) − loge L(full model) ]
     = −2 [ loge L(reduced model) − loge L(saturated model) ]
       + 2 [ loge L(full model) − loge L(saturated model) ]
     = Deviance(reduced model) − Deviance(full model)

• Approximate distribution of G² for large n: χ²_q under H0

• Reject H0 if G² > χ²(1 − α, q)

4-25
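(A sketch of the identity above, not from the slides: the nested-model LR statistic equals the difference of deviances. Grouped simulated data, statsmodels GLM fits.)

import numpy as np
import statsmodels.api as sm
from scipy.special import expit
from scipy.stats import chi2

rng = np.random.default_rng(7)
x1 = np.repeat(np.arange(4.0), 4)
x2 = np.tile(np.arange(4.0), 4)
n_i = np.full(16, 50)
succ = rng.binomial(n_i, expit(-1 + 0.6 * x1))       # x2 truly irrelevant
endog = np.column_stack([succ, n_i - succ])
fam = sm.families.Binomial()

full = sm.GLM(endog, sm.add_constant(np.column_stack([x1, x2])), family=fam).fit()
red  = sm.GLM(endog, sm.add_constant(x1), family=fam).fit()

G2 = red.deviance - full.deviance                    # equals -2(llf_red - llf_full)
print(G2, -2 * (red.llf - full.llf), chi2.sf(G2, df=1))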

Quality of Fit: Individual Observations

4-26

Non-Replicated Data: Hosmer-Lemeshow Goodness of Fit

H0: log( E{Yi} / (1 − E{Yi}) ) = β0 + β1Xi1 + · · · + βp−1 Xi,p−1   vs
Ha: log( E{Yi} / (1 − E{Yi}) ) ≠ β0 + β1Xi1 + · · · + βp−1 Xi,p−1

• Group the cases into c groups based on the values of the estimated probabilities π̂i

  – E.g., find c = 9 groups based on percentiles

• Apply the Pearson χ² test to the groups

• Reject H0 if X² > χ²(1 − α, c − 2)

  – Hosmer and Lemeshow showed by simulation that this reference distribution is appropriate

4-27
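(A minimal sketch of the Hosmer-Lemeshow idea above, not from the slides: group by fitted probability, then apply the Pearson statistic with c − 2 df. The data are simulated and c = 10 groups is an arbitrary choice.)

import numpy as np
import statsmodels.api as sm
from scipy.special import expit
from scipy.stats import chi2

rng = np.random.default_rng(8)
x = rng.normal(size=2000)
y = rng.binomial(1, expit(-0.5 + x))
fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
p = fit.predict()

c = 10
edges = np.quantile(p, np.linspace(0, 1, c + 1)[1:-1])   # interior percentile cuts
groups = np.digitize(p, edges)
X2 = 0.0
for g in range(c):
    m = groups == g
    O1, E1 = y[m].sum(), p[m].sum()
    O0, E0 = m.sum() - O1, m.sum() - E1
    X2 += (O1 - E1) ** 2 / E1 + (O0 - E0) ** 2 / E0
print(X2, chi2.sf(X2, df=c - 2))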

Can Write Pearson χ² (But Not Use for Tests)

• Test statistic:

  X² = Σ_{i=1}^{c} Σ_{k=0}^{1} (Oik − Eik)² / Eik
     = Σ_{i=1}^{c} (Oi0 − Ei0)² / Ei0 + Σ_{i=1}^{c} (Oi1 − Ei1)² / Ei1

• The corresponding quantities for non-replicated data (KNNL p. 591):

  – c = n, ni = 1
  – Oi0 = 1 − Yi, Oi1 = Yi
  – Ei0 = 1 − π̂i, Ei1 = π̂i

• Test statistic:

  X² = Σ_{i=1}^{n} [(1 − Yi) − (1 − π̂i)]² / (1 − π̂i) + Σ_{i=1}^{n} (Yi − π̂i)² / π̂i
     = Σ_{i=1}^{n} (Yi − π̂i)² / (1 − π̂i) + Σ_{i=1}^{n} (Yi − π̂i)² / π̂i
     = Σ_{i=1}^{n} (Yi − π̂i)² / ( π̂i(1 − π̂i) )

4-28

Can Write Deviance (But Not Use for Tests)

• Test statistic G² = DEV(X0, X1, . . . , Xp−1):

  G² = −2 Σ_{i=1}^{c} [ Σ_{j=1}^{ni} Yij log( π̂i / p̂i ) + (ni − Σ_{j=1}^{ni} Yij) log( (1 − π̂i) / (1 − p̂i) ) ]

• The corresponding quantities for non-replicated data (KNNL p. 592):

  – c = n, ni = 1, Σ_{j=1}^{ni} Yij = Yi, p̂i = Σ_{j=1}^{ni} Yij / ni = Yi

• Test statistic:

  G² = −2 Σ_{i=1}^{n} [ Yi log( π̂i / Yi ) + (1 − Yi) log( (1 − π̂i) / (1 − Yi) ) ]
     = −2 Σ_{i=1}^{n} [ Yi log π̂i + (1 − Yi) log(1 − π̂i) − Yi log Yi − (1 − Yi) log(1 − Yi) ]
     = −2 Σ_{i=1}^{n} [ Yi log π̂i + (1 − Yi) log(1 − π̂i) ]

  (the last step uses Yi ∈ {0, 1}, so Yi log Yi = (1 − Yi) log(1 − Yi) = 0 with the convention 0 log 0 = 0)

4-29

Diagnostics: Residuals

• Logistic regression residuals

  ei = 1 − π̂i, if Yi = 1
  ei = −π̂i,    if Yi = 0

• Pearson residual

  rPi = (Yi − π̂i) / √( π̂i(1 − π̂i) )

  – ei divided by the estimated standard error of Yi
  – Σ_{i=1}^{n} rPi² equals the non-replicated Pearson X²

• Studentized Pearson residual

  rSPi = (Yi − π̂i) / √( π̂i(1 − π̂i)(1 − hii) )

  – ei divided by the SE of ei → unit variance
  – hii is the i-th diagonal element of the hat matrix

    H = W^(1/2) X (X′WX)⁻¹ X′ W^(1/2), where W = diag( π̂i(1 − π̂i) )

4-30
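(A sketch of the residual formulas above, not from the slides, including the diagonal of the hat matrix H = W^(1/2) X (X′WX)⁻¹ X′ W^(1/2); the data are simulated.)

import numpy as np
import statsmodels.api as sm
from scipy.special import expit

rng = np.random.default_rng(9)
x = rng.normal(size=300)
y = rng.binomial(1, expit(0.5 + x))
X = sm.add_constant(x)
pi = sm.Logit(y, X).fit(disp=0).predict()

w = pi * (1 - pi)                                    # diagonal of W
r_pearson = (y - pi) / np.sqrt(w)

A = X * np.sqrt(w)[:, None]                          # W^(1/2) X
M = np.linalg.inv(X.T @ (X * w[:, None]))            # (X'WX)^{-1}
h = np.einsum("ij,jk,ik->i", A, M, A)                # diag of H, row by row
r_stud = (y - pi) / np.sqrt(w * (1 - h))             # studentized Pearson residuals
print(r_pearson[:3], r_stud[:3])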

Diagnostics: Residuals

• Deviance residuals

  di = sign(Yi − π̂i) √( −2 [ Yi log( π̂i / Yi ) + (1 − Yi) log( (1 − π̂i) / (1 − Yi) ) ] )
     = sign(Yi − π̂i) √( −2 [ Yi log π̂i + (1 − Yi) log(1 − π̂i) ] )

  – the signed square root of the contribution of Yi to the model deviance

• Analysis of residuals

  – the distribution of the residuals under the true model is unknown
  – plot the residuals against the predicted values; a flat lowess smooth in this plot suggests a good model

• Other summaries as in linear regression

  – DFFITS, DFBETAS
  – ∆χ², ∆dev, Cook's distance (see KNNL p. 598)

4-31
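(A tiny numerical check of the deviance residual formula above, not from the slides; for binary Yi the second displayed form applies. The π̂ values are invented.)

import numpy as np

pi = np.array([0.9, 0.2, 0.6])    # pretend fitted probabilities
y  = np.array([1,   1,   0  ])
d = np.sign(y - pi) * np.sqrt(-2 * (y * np.log(pi) + (1 - y) * np.log(1 - pi)))
print(d)
print((d ** 2).sum())             # squared residuals sum to the binary-data deviance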

Graphical Check of the Fit

• Partition the observations into groups of covariate patterns xi.

• Haldane (1956) recommended plotting the empirical logits

  ηi = log( (yi + 0.5) / (ni − yi + 0.5) )

  against the covariate patterns xi.

• The plot should be roughly linear if the model is appropriate for the data.

• When all ni = 1, or all ni are small, one can group the data over nearby x values to make the plot.

4-32
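(A sketch of Haldane's empirical-logit check above, not from the slides; the grouped counts are invented.)

import numpy as np

x   = np.array([1.0, 2.0, 3.0, 4.0])      # covariate patterns
n_i = np.array([30,  30,  30,  30 ])
y_i = np.array([3,   8,   15,  24 ])      # successes per pattern

eta = np.log((y_i + 0.5) / (n_i - y_i + 0.5))
print(np.column_stack([x, eta]))          # should look roughly linear in x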


Overdispersion

4-33

Overdispersion

Y ~ind Binomial(n, π), π = exp(X′β) / (1 + exp(X′β))

• Implies E{Y} = nπ and Var{Y} = nπ(1 − π)

  – Overdispersion: Var{Y} > nπ(1 − π)
  – Underdispersion: Var{Y} < nπ(1 − π)

• Mechanisms of overdispersion:

  – Suppose Y1, Y2, . . . , Yn are Bernoulli r.v. with E{Yi} = π
  – Define Y = Σ_{i=1}^{n} Yi
  – One can think of at least two situations in which Y does not have a Binomial distribution (and therefore has a different variance)

4-34

Overdispersion from Correlation

• Suppose Y1, Y2, . . . , Yn are not independent

• Suppose all pairs (Yi, Yj) have the same correlation ρ

  Var(Y) = Cov( Σ_{i=1}^{n} Yi, Σ_{i=1}^{n} Yi )
         = Σ_{i=1}^{n} Var(Yi) + Σ_{i≠j} Corr(Yi, Yj) √Var(Yi) √Var(Yj)
         = nπ(1 − π) + n(n − 1) ρ π(1 − π)
         > nπ(1 − π) when ρ > 0

• The variance exceeds the variance of the Binomial distribution

4-35

Overdispersion from Clustered Data

• Suppose Y = Σ_{i=1}^{n} Yi | π ~ Binomial(n, π)

• Suppose π is itself a random variable with E{π} = p and Var{π} > 0

  – Special case: π ~ Beta(α, β), for which E{π} = p = α/(α + β) and Var{π} = p(1 − p)/(α + β + 1)

• Then

  E{Y} = E{ E{Y | π} } = np

  Var{Y} = Var{ E{Y | π} } + E{ Var{Y | π} }
         = Var{nπ} + E{nπ(1 − π)}
         = n² Var{π} + np − n[Var{π} + p²]
         = np(1 − p) + n(n − 1) Var{π}
         > np(1 − p)

• The variance exceeds the variance of the Binomial distribution

4-36
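(A hedged simulation of the clustered-data mechanism above, not from the slides: π drawn from a Beta distribution per cluster inflates Var{Y} beyond the Binomial variance. All values are invented.)

import numpy as np

rng = np.random.default_rng(10)
n, p = 20, 0.3
a = 2.0
b = a * (1 - p) / p                       # Beta(a, b) has mean p
pi = rng.beta(a, b, size=200_000)
Y = rng.binomial(n, pi)

print(Y.var())                            # empirical Var{Y}
print(n * p * (1 - p))                    # Binomial variance, noticeably smaller
print(n * p * (1 - p) + n * (n - 1) * pi.var())   # formula from the slide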

Modeling Overdispersion

• Introduce an additional dispersion parameter φ:

  E{Yi} = ni πi, Var{Yi} = φ ni πi(1 − πi)

  – When φ ≠ 1, Yi follows a quasi-binomial distribution. The distribution is characterized by its expectation and variance; the probability distribution function is left unspecified.

• Estimate φ̂ = χ²/(n − p)

  – χ² is the Pearson lack-of-fit statistic (= the sum of squared Pearson residuals with non-replicated data)
  – p is the number of parameters in the model
  – φ̂ ≫ 1 indicates evidence of overdispersion.

• Since φ does not affect E{Yi}, modeling overdispersion does not change β̂.

• SE{β̂} is multiplied by √φ̂.

4-37
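(A sketch of the dispersion estimate φ̂ = χ²/(n − p) above, not from the slides, using the sum of squared Pearson residuals from a fitted model; the data are simulated, so φ̂ should land near 1.)

import numpy as np
import statsmodels.api as sm
from scipy.special import expit

rng = np.random.default_rng(11)
x = rng.normal(size=500)
y = rng.binomial(1, expit(x))
fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
pi = fit.predict()

X2 = (((y - pi) ** 2) / (pi * (1 - pi))).sum()   # Pearson lack-of-fit statistic
phi = X2 / (len(y) - 2)                           # p = 2 parameters here
print(phi)                                        # near 1: no overdispersion built in
print(fit.bse * np.sqrt(phi))                     # SEs inflated by sqrt(phi)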

Comparing Nested Models in the Presence of Overdispersion

• Regular likelihood-based approaches (e.g., LRT, AIC) are not applicable.

• An F test approximates the deviance-based LR test:

  F = [ (D_reduced − D_full) / (df_reduced − df_full) ] / φ̂
      ~approx, under H0:  F( df_reduced − df_full, df_full )

• Assumes roughly equal-sized covariate classes

• Modeling strategy:

  – fit the full model (with all predictors)
  – estimate φ
  – compare nested models with the F test to reduce the number of predictors

4-38

Prediction and Classification

4-39

Prediction of the Mean

• Point estimate of the link π′h:

  π̂′h = X′h b

• Point estimate of the response πh:

  π̂h = 1 / (1 + exp(−π̂′h)) = 1 / (1 + exp(−X′h b))

• Interval estimate for the link π′h:

  s²{π̂′h} = X′h s²{b} Xh

  100(1 − α)% CI for π′h: π̂′h ± z_{1−α/2} s{π̂′h} = (L, U)

• Approximate interval estimate for the response πh:

  100(1 − α)% CI for πh: ( 1/(1 + exp(−L)), 1/(1 + exp(−U)) )

• Use a Bonferroni correction if multiple Xh are of interest

4-40
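(A sketch of the interval above, not from the slides: build a CI on the link scale, then map it through the inverse logit. The data and the new point Xh are invented.)

import numpy as np
import statsmodels.api as sm
from scipy.special import expit
from scipy.stats import norm

rng = np.random.default_rng(12)
x = rng.normal(size=400)
y = rng.binomial(1, expit(-0.5 + x))
fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)

Xh = np.array([1.0, 0.7])                         # new point, X_h = 0.7
eta_h = Xh @ fit.params                           # point estimate of the link
se_h = np.sqrt(Xh @ fit.cov_params() @ Xh)        # s{pi'_h} = sqrt(X_h' s2{b} X_h)
L, U = eta_h + np.array([-1, 1]) * norm.ppf(0.975) * se_h
print(expit(eta_h), (expit(L), expit(U)))         # pi_h and its 95% CI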

Measures of Agreement

• Have N observations

• Consider all t pairs of observations with distinct responses (one Y = 1 paired with one Y = 0)

  – E.g., with 16 ones and 14 zeros, t = 16 × 14 = 224 pairs

• Compare the predicted probabilities within each pair

  – Concordant if π̂(Y=1) > π̂(Y=0)
  – Discordant if π̂(Y=1) < π̂(Y=0)
  – Tie if π̂(Y=1) = π̂(Y=0)

• Measures of agreement

  – Somers' D: (#C − #D)/t
  – Goodman-Kruskal Gamma: (#C − #D)/(#C + #D)
  – Kendall's Tau-a: (#C − #D)/(0.5 N(N − 1))
  – c: (#C + 0.5(t − #C − #D))/t

4-41
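(A brute-force sketch of the agreement measures above, not from the slides; a small invented example.)

import numpy as np

y = np.array([1, 1, 0, 0, 0])
p = np.array([0.8, 0.4, 0.5, 0.2, 0.1])      # pretend fitted probabilities

p1, p0 = p[y == 1], p[y == 0]
diff = p1[:, None] - p0[None, :]             # all (Y=1, Y=0) pairs
C, D = (diff > 0).sum(), (diff < 0).sum()
t = diff.size                                # here t = 2 * 3 = 6
print("Somers' D:", (C - D) / t)
print("c:", (C + 0.5 * (t - C - D)) / t)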

Prediction of a New Observation (i.e., Classification)

• Choose a cutoff c ∈ (0, 1):

  Ŷh = 1, if π̂h > c
  Ŷh = 0, if π̂h ≤ c

• Sensitivity:

  (# predicted '1' & true '1') / (# true '1') = Σ_{i=1}^{n} Ŷi Yi / Σ_{i=1}^{n} Yi

• Specificity:

  (# predicted '0' & true '0') / (# true '0') = Σ_{i=1}^{n} (1 − Ŷi)(1 − Yi) / Σ_{i=1}^{n} (1 − Yi)

• Vary the cutoff c ∈ (0, 1), and choose c to optimize sensitivity and specificity

4-42
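(A sketch of the cutoff rule above, not from the slides: sweep c over (0,1) and trace sensitivity and specificity, i.e., the ROC curve of the next slide. The data are simulated.)

import numpy as np
from scipy.special import expit

rng = np.random.default_rng(13)
p_hat = expit(rng.normal(size=1000))         # pretend these are fitted probabilities
y = rng.binomial(1, p_hat)

for c in [0.2, 0.5, 0.8]:
    yhat = (p_hat > c).astype(int)
    sens = (yhat * y).sum() / y.sum()
    spec = ((1 - yhat) * (1 - y)).sum() / (1 - y).sum()
    print(c, sens, spec)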

ROC Curve for Classification

Vary c, and plot sensitivity vs 1 − specificity. Evaluate models by the area under the curve.

[Figure: ROC curve. x-axis: 1 − Specificity (# predicted '1' / # true '0'); y-axis: Sensitivity (# predicted '1' / # true '1'); both axes run from 0.0 to 1.0.]

4-43

Evaluation of the Predictive Ability of the Model

• The area under the ROC curve can be used to compare models

  – Area = 1 → perfect classification
  – Area = 0.5 → random classification

• Classification on the training set is overly optimistic

• Use cross-validation to construct a more accurate ROC curve

4-44


Variable Selection

4-45

Automatic Variable Selection

• Exhaustive search: minimize

  −2 loge L(b)
  AICp = −2 loge L(b) + 2p
  BICp = −2 loge L(b) + p loge(n)

• Heuristic search

  – forward selection; backward elimination; stepwise selection
  – based on the Wald statistic and the Normal distribution

4-46
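(A sketch of the criteria above using a fitted model's log-likelihood, not from the slides; statsmodels also exposes .aic and .bic, which should agree here. The data are simulated.)

import numpy as np
import statsmodels.api as sm
from scipy.special import expit

rng = np.random.default_rng(14)
x = rng.normal(size=300)
y = rng.binomial(1, expit(0.5 * x))
fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)

p, n = 2, len(y)
print(-2 * fit.llf + 2 * p)              # AIC_p
print(-2 * fit.llf + p * np.log(n))      # BIC_p
print(fit.aic, fit.bic)                  # library equivalents, for comparison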

Variable Selection Should Be Done as Part of Cross-Validation

• Example from Simon et al., JNCI, 2003.

• Simulated data with no structure

  – 20 observations with random labels
  – 6,000 possible but unrelated predictors
  – Repeated 200 times

• Estimated predictive accuracy using

  – no cross-validation
  – selecting features on the full dataset, then using cross-validation
  – selecting features at each step of cross-validation

4-47

Variable Selection Should Be Done as Part of Cross-Validation

Example from Simon et al., JNCI, 2003: feature selection in logistic regression.

• Simulated 20 samples with random labels and 6,000 genes; repeated 200 times

• Estimated predictive accuracy using no cross-validation, a validation set set aside after gene selection, and cross-validation at all steps

• Conclusion: cross-validation at all steps of feature selection is key


[Figure (Fig 1, Simon et al.): The effect of various levels of cross-validation on the estimated error rate of a predictor. For 2,000 simulated datasets with arbitrarily assigned class labels, vertical bars show the proportion of datasets with a given number of misclassifications (0 to 20) under three strategies: no cross-validation (re-substitution), cross-validation after gene selection, and cross-validation before gene selection. Re-substitution gives zero misclassifications in 98.2% of datasets, and cross-validating only after gene selection still gives zero in 90.2%, even though the expected error is 10 of 20. Reprinted from Simon R, Radmacher MD, Dobbin K, et al: Pitfalls in the analysis of DNA microarray data: Class prediction methods. J Natl Cancer Inst 95:14-18, 2003.]

• Conclusion

– Incorporating the selection of predictors within the cross-validation procedure is key

4-48
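(A hedged mini-reproduction of the effect described above, not from the slides and much smaller than Simon et al.'s setup: select the single most class-separating feature either before or inside leave-one-out cross-validation, using a simple nearest-class-mean rule in place of their classifier, and compare error rates.)

import numpy as np

rng = np.random.default_rng(15)
n, m = 20, 500                          # 20 samples, 500 pure-noise features
X = rng.normal(size=(n, m))
y = np.array([0] * 10 + [1] * 10)       # arbitrary labels, no real structure

def best_feature(X, y):
    # index of the feature whose mean differs most between the classes
    return np.argmax(np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)))

def classify(xj_train, y_train, xj_test):
    # nearest-class-mean rule on a single feature
    m0, m1 = xj_train[y_train == 0].mean(), xj_train[y_train == 1].mean()
    return int(abs(xj_test - m1) < abs(xj_test - m0))

# (a) biased: select the feature on the full data, then cross-validate
j = best_feature(X, y)
err_a = sum(classify(np.delete(X[:, j], i), np.delete(y, i), X[i, j]) != y[i]
            for i in range(n))

# (b) honest: re-select the feature inside every leave-one-out step
err_b = 0
for i in range(n):
    Xtr, ytr = np.delete(X, i, axis=0), np.delete(y, i)
    j = best_feature(Xtr, ytr)
    err_b += classify(Xtr[:, j], ytr, X[i, j]) != y[i]

print(err_a, err_b)   # (a) looks optimistic; (b) hovers near chance (10 of 20)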