
Binary Response: Logistic Regression

STAT 526
Professor Olga Vitek
March 29, 2011

4

Model Specification and Interpretation

4-1

Probability Distribution of a Binary Outcome Y

• In many situations, the response variable has only two possible outcomes

  – Diseased (Y = 1) vs not diseased (Y = 0)
  – Employed (Y = 1) vs unemployed (Y = 0)

• Such a response is called binary or dichotomous

• Can model the response using a Bernoulli distribution:

  Yi   Probability
  1    Pr{Yi = 1} = πi
  0    Pr{Yi = 0} = 1 − πi

• E{Yi} = πi

• Var{Yi} = πi(1 − πi)

4-2

Goal: express E{Y} as a function of a covariate X

• The simple linear regression model

  E{Yi} = β0 + β1Xi

  is not appropriate. It violates several assumptions:

(1) Does not enforce the constraint 0 ≤ E{Yi} ≤ 1

(2) Non-normal (binary) distribution of ε | X:

  When Yi = 0: εi = 0 − β0 − β1Xi
  When Yi = 1: εi = 1 − β0 − β1Xi

(3) Non-constant variance:

  Var{Yi} = πi(1 − πi) = (β0 + β1Xi)(1 − β0 − β1Xi)

4-3

Solution: a Generalized Linear Model

• A generalized linear model is

  E{Yi} = g(β0 + β1Xi), or
  g⁻¹(E{Yi}) = β0 + β1Xi

  where g is a sigmoid function in (0,1).

  – g is called the mean response function
  – g⁻¹ is called the link function

• The choice of g produces different models:

  – g(t) = t (identity) → linear regression
  – g(t) = Φ(t), the standard Normal CDF → probit regression
  – g(t) = exp(t)/(1 + exp(t)), the CDF of the logistic distribution → logistic regression

4-4
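(Aside, not from the slides: the lecture contains no code, but a minimal Python sketch makes the bounded-vs-unbounded contrast among the three choices of g concrete. The coefficients and grid are invented illustration values.)

import numpy as np
from scipy.special import expit   # expit(t) = exp(t) / (1 + exp(t))
from scipy.stats import norm

beta0, beta1 = -1.0, 0.5          # made-up coefficients, for illustration only
x = np.linspace(-10, 10, 5)
eta = beta0 + beta1 * x           # linear predictor

print("identity:", eta)           # linear regression: not confined to (0,1)
print("probit:  ", norm.cdf(eta)) # Phi(eta), always in (0,1)
print("logistic:", expit(eta))    # exp(eta)/(1+exp(eta)), always in (0,1)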

Motivation for Probit Regression: Latent Variable

• Assume that the binary response is guided by a non-observed (latent) continuous variable

• Example: linear model for blood pressure, with ε symmetric about 0 (e.g., ε ~ N(0, σ²)):

  bp = β0 + β1·age + ε

  We only observe

  Y = 1 (disease), if blood pressure > c
  Y = 0 (healthy), if blood pressure ≤ c

• Then

  Pr{Y = 1} = Pr{bp > c} = Pr{β0 + β1·age + ε > c}
            = Pr{ε < β0 + β1·age − c}            (by symmetry of ε)
            = Pr{ ε/σ < (β0 − c)/σ + (β1/σ)·age }
            = Pr{z < β0′ + β1′·age}
            = Φ(β0′ + β1′·age)

4-5
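(A hedged simulation of the latent-variable argument above, not from the slides; all numbers are invented, and statsmodels' Probit is used as a convenience.)

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, beta0, beta1, c, sigma = 50_000, 90.0, 0.8, 140.0, 10.0
age = rng.uniform(20, 80, n)
bp = beta0 + beta1 * age + rng.normal(0, sigma, n)   # latent blood pressure
y = (bp > c).astype(int)                             # observed binary disease status

fit = sm.Probit(y, sm.add_constant(age)).fit(disp=0)
print(fit.params)                                    # approximates the rescaled coefficients:
print((beta0 - c) / sigma, beta1 / sigma)            # beta0' and beta1'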

Logistic Response Function and Logistic Regression

• A sigmoidal response function

  E{Yi} = exp(β0 + β1Xi) / (1 + exp(β0 + β1Xi)) = 1 / (1 + exp(−β0 − β1Xi))

  – A monotone increasing/decreasing function
  – Explicit functional form
  – Restricts 0 ≤ E{Yi} ≤ 1
  – An example of a nonlinear model

• Logit link function

  log( E{Yi} / (1 − E{Yi}) ) = log( πi / (1 − πi) ) = β0 + β1Xi

4-6

Probability Distribution of Y in Logistic Regression

• Yi are independent but not identically distributed Bernoulli random variables:

  Yi ~ind Bernoulli(πi), where πi = exp(β0 + β1Xi) / (1 + exp(β0 + β1Xi))

  – Note: there is no error term anymore!

• Probability density of Yi:

  f(Yi) = πi^Yi (1 − πi)^(1−Yi)

• Least squares estimates are inappropriate

  – Use maximum likelihood for parameter estimation

4-7

Simple Logistic Regression: Interpretation of b1

• Fitted value for individual i:

  π̂i = e^(b0 + b1Xi) / (1 + e^(b0 + b1Xi))

• Fitted logistic response (i.e., fitted log odds):

  loge( odds(Xi) ) = loge( π̂(Xi) / (1 − π̂(Xi)) ) = b0 + b1Xi

  – b1 is the slope of the fitted logistic response

• b1 is interpreted as a log(odds ratio):

  b1 = loge( π̂(Xi + 1) / (1 − π̂(Xi + 1)) ) − loge( π̂(Xi) / (1 − π̂(Xi)) )
     = loge( odds(Xi + 1) / odds(Xi) )

  – odds ratio(Xi) = odds(Xi + 1) / odds(Xi) = exp(b1)

4-8
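(A tiny Python illustration of the odds-ratio interpretation above, not from the slides; the fitted values b0 and b1 are made up.)

import numpy as np

b0, b1 = -2.0, 0.7                 # assumed fitted coefficients

def odds(x):                       # odds(X) = exp(b0 + b1*X)
    return np.exp(b0 + b1 * x)

x = 3.0
print(odds(x + 1) / odds(x))       # odds ratio for a one-unit increase in X
print(np.exp(b1))                  # equals exp(b1) exactly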

Estimation by Maximum Likelihood

• Yi ~ind Bernoulli(πi), where πi = exp(β0 + β1Xi) / (1 + exp(β0 + β1Xi))

• Log-likelihood:

  loge(L) = log{ ∏_{i=1}^{n} πi^Yi (1 − πi)^(1−Yi) }
          = Σ_{i=1}^{n} Yi log(πi) + Σ_{i=1}^{n} (1 − Yi) log(1 − πi)
          = Σ_{i=1}^{n} Yi log( πi / (1 − πi) ) + Σ_{i=1}^{n} log(1 − πi)
          = Σ_{i=1}^{n} Yi (β0 + β1Xi) − Σ_{i=1}^{n} log(1 + exp(β0 + β1Xi))

• The MLEs do not have closed forms

4-9
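(A minimal numerical-MLE sketch of the log-likelihood displayed above, not from the slides; the data are simulated and the true coefficients are arbitrary. Since no closed form exists, a general-purpose optimizer is used.)

import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(1)
n, beta_true = 2000, np.array([-1.0, 2.0])
X = rng.normal(size=n)
Y = rng.binomial(1, expit(beta_true[0] + beta_true[1] * X))

def neg_loglik(beta):
    eta = beta[0] + beta[1] * X
    # -[ sum Yi*eta_i - sum log(1 + exp(eta_i)) ], as on the slide
    return -(Y @ eta - np.logaddexp(0.0, eta).sum())

b = minimize(neg_loglik, x0=np.zeros(2), method="BFGS").x
print(b)                            # MLEs b0, b1; close to beta_true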

Equivalent Specification: Binomial Distribution

• Change in notation

  – Data: (Yij, ni, Xi), i = 1, 2, · · · , c
  – Xi: predictor for observation i
  – ni: # of Bernoulli trials in observation i
  – Yi′ := Σ_{j=1}^{ni} Yij

  – Model: Yi′ ~ind Binomial(ni, πi), where πi = exp(β0 + β1Xi) / (1 + exp(β0 + β1Xi))

• Log-likelihood:

  loge(L) = log ∏_{i=1}^{c} { C(ni, Yi′) πi^(Yi′) (1 − πi)^(ni − Yi′) }
          = Σ_{i=1}^{c} { Yi′ log(πi) + (ni − Yi′) log(1 − πi) + log C(ni, Yi′) }
          = Σ_{i=1}^{c} { Yi′ log( πi / (1 − πi) ) + ni log(1 − πi) + log C(ni, Yi′) }

  where C(ni, Yi′) denotes the binomial coefficient "ni choose Yi′".

4-10

Equivalence of Bernoulli and Binomial Models

• The Binomial log-likelihood equals the Bernoulli log-likelihood, up to a constant:

  loge(L)_Binomial
    = Σ_{i=1}^{c} { Yi′ log(πi) + (ni − Yi′) log(1 − πi) } + constant
    = Σ_{i=1}^{c} [ Σ_{j=1}^{ni} Yij log(πi) + (ni − Σ_{j=1}^{ni} Yij) log(1 − πi) ] + constant
    = Σ_{i=1}^{c} Σ_{j=1}^{ni} { Yij log( πi / (1 − πi) ) + log(1 − πi) } + constant
    = loge(L)_Bernoulli + constant

• Both models lead to the same parameter estimates and inferences, but have different deviances.

4-11
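(A sketch of the equivalence claim above, not from the slides: fit the same model to ungrouped Bernoulli data and to grouped Binomial data; the data here are simulated.)

import numpy as np
import statsmodels.api as sm
from scipy.special import expit

rng = np.random.default_rng(2)
levels = np.repeat(np.arange(5, dtype=float), 200)   # 5 covariate configurations
y = rng.binomial(1, expit(-2.0 + 0.8 * levels))

# Bernoulli form: one row per subject
bern = sm.GLM(y, sm.add_constant(levels), family=sm.families.Binomial()).fit()

# Binomial form: one row per configuration, endog = (successes, failures)
x_u = np.unique(levels)
succ = np.array([y[levels == v].sum() for v in x_u])
fail = np.array([(levels == v).sum() for v in x_u]) - succ
binom = sm.GLM(np.column_stack([succ, fail]), sm.add_constant(x_u),
               family=sm.families.Binomial()).fit()

print(bern.params, binom.params)        # identical estimates
print(bern.deviance, binom.deviance)    # deviances differ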

Prospective and Retrospective Studies

• Prospective study: fix the predictors, observe the outcome

  – Recruit patients with 2 genotypes, compare occurrence of disease.
  – Expensive; large variance for rare diseases

• Retrospective (= case-control) study: fix the outcome, observe the predictors

  – Recruit patients with and without the disease, compare the genotypes.
  – Cheaper

• The log odds ratio based on logistic regression is the same for both study designs.

  – Not true for other link functions

4-12

Prospective Model With Retrospective Data: Formulation

• Assume a prospective model:

  π(Xi) = P(Yi = 1|Xi) = e^(β0 + β1Xi) / (1 + e^(β0 + β1Xi))

• However, the subjects are selected retrospectively.

• The random variable is Zi = 1{subject i is included in the sample}, conditional on Y.

  – Denote θ0 = P(Zi = 1|Yi = 0) and θ1 = P(Zi = 1|Yi = 1)
  – Note that θ0 and θ1 must be independent of Xi; otherwise we introduce selection bias.

• Of interest in a retrospective study is P(Yi = 1|Xi, Zi = 1)

4-13

Prospective Model With Retrospective Data: Estimation of log(OR)

• With retrospective sampling, we model:

  P(Yi = 1|Xi, Zi = 1)
    = P(Yi = 1, Zi = 1|Xi) / P(Zi = 1|Xi)
    = P(Yi = 1, Zi = 1|Xi) / [ P(Yi = 0, Zi = 1|Xi) + P(Yi = 1, Zi = 1|Xi) ]
    = θ1 π(Xi) / [ θ0 {1 − π(Xi)} + θ1 π(Xi) ]
    = θ1 e^(β0 + β1Xi) / ( θ0 + θ1 e^(β0 + β1Xi) )
    = e^(log(θ1/θ0) + β0 + β1Xi) / ( 1 + e^(log(θ1/θ0) + β0 + β1Xi) )

• We uncover the same β1 as in the prospective study:

  ⟹ log{ P(Yi = 1|Xi, Zi = 1) / P(Yi = 0|Xi, Zi = 1) } = [log(θ1/θ0) + β0] + β1Xi

4-14
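(A hedged simulation of the result above, not from the slides: case-control sampling shifts only the intercept, by log(θ1/θ0). All rates are invented.)

import numpy as np
import statsmodels.api as sm
from scipy.special import expit

rng = np.random.default_rng(3)
n, b0, b1 = 400_000, -4.0, 1.0
x = rng.normal(size=n)
y = rng.binomial(1, expit(b0 + b1 * x))

theta1, theta0 = 0.9, 0.02            # inclusion probabilities for cases / controls
keep = rng.uniform(size=n) < np.where(y == 1, theta1, theta0)

fit = sm.Logit(y[keep], sm.add_constant(x[keep])).fit(disp=0)
print(fit.params)                      # slope ~ b1 = 1.0
print(b0 + np.log(theta1 / theta0))    # intercept shifted to ~ b0 + log(theta1/theta0)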

Multiple Logistic Regression

• Extension of the response function:

  E{Yi} = exp(Xi′β) / (1 + exp(Xi′β))

• Extension of the logit link function:

  loge( E{Yi} / (1 − E{Yi}) ) = loge( πi / (1 − πi) ) = Xi′β

• The multivariate likelihood function:

  loge L(β) = Σ_{i=1}^{n} Yi (Xi′β) − Σ_{i=1}^{n} loge[1 + exp(Xi′β)]

• Interpretation of bj:

  bj = loge( odds(Xj + 1) / odds(Xj) ), while the other predictors are held fixed

4-15


Testing

4-16

Asymptotic Properties of the MLE β̂

• Asymptotic existence and uniqueness:

  – P{β̂ exists and is unique} → 1 as n → ∞

• Consistency:

  – β̂ → β as n → ∞

• Asymptotic Normality:

  – β̂ ~approx N(β, I(β)⁻¹) as n → ∞

• Asymptotic efficiency:

  – The MLE has asymptotically smaller variance than many other estimators

4-17

Inference About an Individual βj: Wald Test

• Test H0: βj = 0 versus Ha: βj ≠ 0.

• Test statistic: z* = (bj − 0) / s{bj}

• Approximate variance s²{b}:

  s²{b} = [ − ∂² loge L(β) / ∂βj ∂βj′ ]⁻¹, evaluated at β = b

  (the inverse of the observed information matrix at the estimate)

• Approximate distribution of z*:

  – z* ≈ N(0, 1); alternatively, (z*)² ≈ χ²_1
  – Reject H0 if |z*| > z_{1−α/2}
  – CI for βj: bj ± z_{1−α/2} s{bj}

4-18
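(A short Python sketch of the Wald test above, not from the slides; the data are simulated and α = 0.05 is chosen for the example.)

import numpy as np
import statsmodels.api as sm
from scipy.special import expit
from scipy.stats import norm

rng = np.random.default_rng(4)
x = rng.normal(size=500)
y = rng.binomial(1, expit(0.3 * x))
fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)

b1, se1 = fit.params[1], fit.bse[1]    # bse: from the inverse observed information
z = (b1 - 0) / se1
print(z, 2 * norm.sf(abs(z)))          # Wald statistic and two-sided p-value
print(b1 + np.array([-1, 1]) * norm.ppf(0.975) * se1)   # 95% CI for beta1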

Simultaneous Inference About Several βj = 0: Likelihood Ratio Test

• Multivariate logistic regression:

  log( E{Yi} / (1 − E{Yi}) ) = β0 + β1Xi1 + · · · + βp−1 Xi,p−1

• Test H0: β1 = β2 = · · · = βq = 0 versus Ha: not all of β1, β2, · · · , βq equal 0

• Test statistic:

  G² = −2 loge[ L(reduced model) / L(full model) ]
     = −2 [ loge L(reduced model) − loge L(full model) ]

• Approximate distribution of G² for large n: χ²_q under H0

  – Reject H0 if G² > χ²(1 − α, q)

4-19
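(A sketch of the likelihood ratio test above, not from the slides, comparing full vs reduced model on simulated data; the second predictor is truly null here.)

import numpy as np
import statsmodels.api as sm
from scipy.special import expit
from scipy.stats import chi2

rng = np.random.default_rng(5)
n = 1000
X = rng.normal(size=(n, 2))
y = rng.binomial(1, expit(0.8 * X[:, 0]))            # X2 has no effect

full = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
reduced = sm.Logit(y, sm.add_constant(X[:, 0])).fit(disp=0)

G2 = -2 * (reduced.llf - full.llf)                   # q = 1 restriction tested
print(G2, chi2.sf(G2, df=1))                         # compare to chi^2_1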

Comments: Wald Test Versus Likelihood Ratio Test

• Both tests are approximate, for large n

• Likelihood Ratio test: βj = 0, or several βj = 0 simultaneously

  – no other values of βj
  – no one-sided tests

• Wald test: βj equal to any value of interest, or linear combinations of βj

• When testing a single H0: βj = 0, the two tests may lead to different conclusions

  – due to the approximate nature of the tests
  – unlike in linear regression

4-20

Quality of Fit: Replicated Data

4-21

Lack of Fit for Replicated Data: Pearson χ²

H0: log( E{Yij} / (1 − E{Yij}) ) = β0 + β1Xi1 + · · · + βp−1 Xi,p−1   vs
Ha: log( E{Yij} / (1 − E{Yij}) ) ≠ β0 + β1Xi1 + · · · + βp−1 Xi,p−1

• c covariate configurations, each with ni cases

• Observed counts

  – Oi0: observed # of 0's in configuration i
  – Oi1: observed # of 1's in configuration i

• Expected counts

  – Ei0 = ni(1 − π̂i): expected # of 0's in configuration i
  – Ei1 = ni π̂i: expected # of 1's in configuration i

• Test statistic: reject H0 if

  X² = Σ_{i=1}^{c} Σ_{k=0}^{1} (Oik − Eik)² / Eik > χ²(1 − α, c − p)

4-22
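(A sketch of the Pearson lack-of-fit statistic above for replicated data, not from the slides; the grouped data are simulated and the model is fit with statsmodels.)

import numpy as np
import statsmodels.api as sm
from scipy.special import expit
from scipy.stats import chi2

rng = np.random.default_rng(6)
x = np.array([0., 1., 2., 3., 4., 5.])               # c = 6 configurations
n_i = np.full(6, 100)
succ = rng.binomial(n_i, expit(-2 + 0.9 * x))
fit = sm.GLM(np.column_stack([succ, n_i - succ]), sm.add_constant(x),
             family=sm.families.Binomial()).fit()

pi_hat = fit.fittedvalues                            # estimated pi_i per configuration
E1, E0 = n_i * pi_hat, n_i * (1 - pi_hat)            # expected counts of 1's and 0's
O1, O0 = succ, n_i - succ
X2 = ((O1 - E1) ** 2 / E1 + (O0 - E0) ** 2 / E0).sum()
print(X2, chi2.sf(X2, df=6 - 2))                     # df = c - p, with p = 2 here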

Lack of Fit for Replicated Data: Deviance

H0: log( E{Yij} / (1 − E{Yij}) ) = β0 + β1Xi1 + · · · + βp−1 Xi,p−1   vs
Ha: log( E{Yij} / (1 − E{Yij}) ) ≠ β0 + β1Xi1 + · · · + βp−1 Xi,p−1

• c covariate configurations, each with ni cases

  – H0: the model of interest; Ha: the saturated model

• Model of interest

  – E{Yij} = πi, estimated by the fitted π̂i

• Saturated model

  – E{Yij} = pi, estimated by p̂i = Σ_{j=1}^{ni} Yij / ni

• Test statistic: the LR statistic, also called the deviance

• The deviance of a saturated model is always 0

4-23

Lack of Fit for Replicated Data: Deviance

G² = DEV(X0, X1, . . . , Xp−1)
   = −2 [ loge L(current model) − loge L(saturated model) ]
   = −2 Σ_{i=1}^{c} [ Σ_{j=1}^{ni} Yij loge π̂i + (ni − Σ_{j=1}^{ni} Yij) loge(1 − π̂i) ]
     + 2 Σ_{i=1}^{c} [ Σ_{j=1}^{ni} Yij loge p̂i + (ni − Σ_{j=1}^{ni} Yij) loge(1 − p̂i) ]
   = −2 Σ_{i=1}^{c} [ Σ_{j=1}^{ni} Yij loge( π̂i / p̂i ) + (ni − Σ_{j=1}^{ni} Yij) loge( (1 − π̂i) / (1 − p̂i) ) ]

• Reject H0 if G² > χ²(1 − α, c − p)

• The χ²(1 − α, c − p) approximation can be poor

  – The closer the distribution to Gaussian, and the closer the link to identity, the better the approximation
  – Unlike with the LR test, the quality of the approximation does not improve with the sample size

4-24

Note: Can Use Deviance for LR Test of Nested Models

log( E{Yij} / (1 − E{Yij}) ) = β0 + β1Xi1 + · · · + βp−1 Xi,p−1

• Test H0: β1 = β2 = · · · = βq = 0 versus Ha: not all of β1, β2, · · · , βq equal 0

• Likelihood Ratio test statistic:

  G² = −2 loge[ L(reduced model) / L(full model) ]
     = −2 [ loge L(reduced model) − loge L(full model) ]
     = −2 [ loge L(reduced model) − loge L(saturated model) ]
       + 2 [ loge L(full model) − loge L(saturated model) ]
     = Deviance(reduced model) − Deviance(full model)

• Approximate distribution of G² for large n: χ²_q under H0

• Reject H0 if G² > χ²(1 − α, q)

4-25
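(A sketch of the identity above, not from the slides: the nested-model LR statistic equals the difference of deviances. Grouped simulated data, statsmodels GLM fits.)

import numpy as np
import statsmodels.api as sm
from scipy.special import expit
from scipy.stats import chi2

rng = np.random.default_rng(7)
x1 = np.repeat(np.arange(4.0), 4)
x2 = np.tile(np.arange(4.0), 4)
n_i = np.full(16, 50)
succ = rng.binomial(n_i, expit(-1 + 0.6 * x1))       # x2 truly irrelevant
endog = np.column_stack([succ, n_i - succ])
fam = sm.families.Binomial()

full = sm.GLM(endog, sm.add_constant(np.column_stack([x1, x2])), family=fam).fit()
red  = sm.GLM(endog, sm.add_constant(x1), family=fam).fit()

G2 = red.deviance - full.deviance                    # equals -2(llf_red - llf_full)
print(G2, -2 * (red.llf - full.llf), chi2.sf(G2, df=1))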

Quality of Fit: Individual Observations

4-26

Non-Replicated Data: Hosmer-Lemeshow Goodness of Fit

H0: log( E{Yi} / (1 − E{Yi}) ) = β0 + β1Xi1 + · · · + βp−1 Xi,p−1   vs
Ha: log( E{Yi} / (1 − E{Yi}) ) ≠ β0 + β1Xi1 + · · · + βp−1 Xi,p−1

• Group the cases into c groups based on the values of the estimated probabilities π̂i

  – E.g., find c = 9 groups based on percentiles

• Apply the Pearson χ² test to the groups

• Reject H0 if X² > χ²(1 − α, c − 2)

  – Hosmer and Lemeshow showed by simulation that this reference distribution is appropriate

4-27
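(A minimal sketch of the Hosmer-Lemeshow idea above, not from the slides: group by fitted probability, then apply the Pearson statistic with c − 2 df. The data are simulated and c = 10 groups is an arbitrary choice.)

import numpy as np
import statsmodels.api as sm
from scipy.special import expit
from scipy.stats import chi2

rng = np.random.default_rng(8)
x = rng.normal(size=2000)
y = rng.binomial(1, expit(-0.5 + x))
fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
p = fit.predict()

c = 10
edges = np.quantile(p, np.linspace(0, 1, c + 1)[1:-1])   # interior percentile cuts
groups = np.digitize(p, edges)
X2 = 0.0
for g in range(c):
    m = groups == g
    O1, E1 = y[m].sum(), p[m].sum()
    O0, E0 = m.sum() - O1, m.sum() - E1
    X2 += (O1 - E1) ** 2 / E1 + (O0 - E0) ** 2 / E0
print(X2, chi2.sf(X2, df=c - 2))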

Can Write Pearson χ² (But Not Use for Tests)

• Test statistic:

  X² = Σ_{i=1}^{c} Σ_{k=0}^{1} (Oik − Eik)² / Eik
     = Σ_{i=1}^{c} (Oi0 − Ei0)² / Ei0 + Σ_{i=1}^{c} (Oi1 − Ei1)² / Ei1

• The corresponding quantities for non-replicated data (KNNL p. 591):

  – c = n, ni = 1
  – Oi0 = 1 − Yi, Oi1 = Yi
  – Ei0 = 1 − π̂i, Ei1 = π̂i

• Test statistic:

  X² = Σ_{i=1}^{n} [(1 − Yi) − (1 − π̂i)]² / (1 − π̂i) + Σ_{i=1}^{n} (Yi − π̂i)² / π̂i
     = Σ_{i=1}^{n} (Yi − π̂i)² / (1 − π̂i) + Σ_{i=1}^{n} (Yi − π̂i)² / π̂i
     = Σ_{i=1}^{n} (Yi − π̂i)² / ( π̂i(1 − π̂i) )

4-28

Can Write Deviance (But Not Use for Tests)

• Test statistic G² = DEV(X0, X1, . . . , Xp−1):

  G² = −2 Σ_{i=1}^{c} [ Σ_{j=1}^{ni} Yij log( π̂i / p̂i ) + (ni − Σ_{j=1}^{ni} Yij) log( (1 − π̂i) / (1 − p̂i) ) ]

• The corresponding quantities for non-replicated data (KNNL p. 592):

  – c = n, ni = 1, Σ_{j=1}^{ni} Yij = Yi, p̂i = Σ_{j=1}^{ni} Yij / ni = Yi

• Test statistic:

  G² = −2 Σ_{i=1}^{n} [ Yi log( π̂i / Yi ) + (1 − Yi) log( (1 − π̂i) / (1 − Yi) ) ]
     = −2 Σ_{i=1}^{n} [ Yi log π̂i + (1 − Yi) log(1 − π̂i) − Yi log Yi − (1 − Yi) log(1 − Yi) ]
     = −2 Σ_{i=1}^{n} [ Yi log π̂i + (1 − Yi) log(1 − π̂i) ]

  (the last step uses Yi ∈ {0, 1}, so Yi log Yi = (1 − Yi) log(1 − Yi) = 0 with the convention 0 log 0 = 0)

4-29

Diagnostics: Residuals

• Logistic regression residuals

  ei = 1 − π̂i, if Yi = 1
  ei = −π̂i,    if Yi = 0

• Pearson residual

  rPi = (Yi − π̂i) / √( π̂i(1 − π̂i) )

  – ei divided by the estimated standard error of Yi
  – Σ_{i=1}^{n} rPi² equals the non-replicated Pearson X²

• Studentized Pearson residual

  rSPi = (Yi − π̂i) / √( π̂i(1 − π̂i)(1 − hii) )

  – ei divided by the SE of ei → unit variance
  – hii is the i-th diagonal element of the hat matrix

    H = W^(1/2) X (X′WX)⁻¹ X′ W^(1/2), where W = diag( π̂i(1 − π̂i) )

4-30
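(A sketch of the residual formulas above, not from the slides, including the diagonal of the hat matrix H = W^(1/2) X (X′WX)⁻¹ X′ W^(1/2); the data are simulated.)

import numpy as np
import statsmodels.api as sm
from scipy.special import expit

rng = np.random.default_rng(9)
x = rng.normal(size=300)
y = rng.binomial(1, expit(0.5 + x))
X = sm.add_constant(x)
pi = sm.Logit(y, X).fit(disp=0).predict()

w = pi * (1 - pi)                                    # diagonal of W
r_pearson = (y - pi) / np.sqrt(w)

A = X * np.sqrt(w)[:, None]                          # W^(1/2) X
M = np.linalg.inv(X.T @ (X * w[:, None]))            # (X'WX)^{-1}
h = np.einsum("ij,jk,ik->i", A, M, A)                # diag of H, row by row
r_stud = (y - pi) / np.sqrt(w * (1 - h))             # studentized Pearson residuals
print(r_pearson[:3], r_stud[:3])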

Diagnostics: Residuals

• Deviance residuals

  di = sign(Yi − π̂i) √( −2 [ Yi log( π̂i / Yi ) + (1 − Yi) log( (1 − π̂i) / (1 − Yi) ) ] )
     = sign(Yi − π̂i) √( −2 [ Yi log π̂i + (1 − Yi) log(1 − π̂i) ] )

  – the signed square root of the contribution of Yi to the model deviance

• Analysis of residuals

  – the distribution of the residuals under the true model is unknown
  – plot the residuals against the predicted values; a flat lowess smooth in this plot suggests a good model

• Other summaries as in linear regression

  – DFFITS, DFBETAS
  – ∆χ², ∆dev, Cook's distance (see KNNL p. 598)

4-31
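(A tiny numerical check of the deviance residual formula above, not from the slides; for binary Yi the second displayed form applies. The π̂ values are invented.)

import numpy as np

pi = np.array([0.9, 0.2, 0.6])    # pretend fitted probabilities
y  = np.array([1,   1,   0  ])
d = np.sign(y - pi) * np.sqrt(-2 * (y * np.log(pi) + (1 - y) * np.log(1 - pi)))
print(d)
print((d ** 2).sum())             # squared residuals sum to the binary-data deviance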

Graphical Check of the Fit

• Partition the observations into groups of covariate patterns xi.

• Haldane (1956) recommended plotting the empirical logits

  ηi = log( (yi + 0.5) / (ni − yi + 0.5) )

  against the covariate patterns xi.

• The plot should be roughly linear if the model is appropriate for the data.

• When all ni = 1, or all ni are small, one can group the data over nearby x values to make the plot.

4-32
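(A sketch of Haldane's empirical-logit check above, not from the slides; the grouped counts are invented.)

import numpy as np

x   = np.array([1.0, 2.0, 3.0, 4.0])      # covariate patterns
n_i = np.array([30,  30,  30,  30 ])
y_i = np.array([3,   8,   15,  24 ])      # successes per pattern

eta = np.log((y_i + 0.5) / (n_i - y_i + 0.5))
print(np.column_stack([x, eta]))          # should look roughly linear in x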


Overdispersion

4-33

Overdispersion

Y ~ind Binomial(n, π), π = exp(X′β) / (1 + exp(X′β))

• Implies E{Y} = nπ and Var{Y} = nπ(1 − π)

  – Overdispersion: Var{Y} > nπ(1 − π)
  – Underdispersion: Var{Y} < nπ(1 − π)

• Mechanisms of overdispersion:

  – Suppose Y1, Y2, . . . , Yn are Bernoulli r.v. with E{Yi} = π
  – Define Y = Σ_{i=1}^{n} Yi
  – One can think of at least two situations in which Y does not have a Binomial distribution (and therefore has a different variance)

4-34

Overdispersion from Correlation

• Suppose Y1, Y2, . . . , Yn are not independent

• Suppose all pairs (Yi, Yj) have the same correlation ρ

  Var(Y) = Cov( Σ_{i=1}^{n} Yi, Σ_{i=1}^{n} Yi )
         = Σ_{i=1}^{n} Var(Yi) + Σ_{i≠j} Corr(Yi, Yj) √Var(Yi) √Var(Yj)
         = nπ(1 − π) + n(n − 1) ρ π(1 − π)
         > nπ(1 − π) when ρ > 0

• The variance exceeds the variance of the Binomial distribution

4-35

Overdispersion from Clustered Data

• Suppose Y = Σ_{i=1}^{n} Yi | π ~ Binomial(n, π)

• Suppose π is itself a random variable with E{π} = p and Var{π} > 0

  – Special case: π ~ Beta(α, β), for which E{π} = p = α/(α + β) and Var{π} = p(1 − p)/(α + β + 1)

• Then

  E{Y} = E{ E{Y | π} } = np

  Var{Y} = Var{ E{Y | π} } + E{ Var{Y | π} }
         = Var{nπ} + E{nπ(1 − π)}
         = n² Var{π} + np − n[Var{π} + p²]
         = np(1 − p) + n(n − 1) Var{π}
         > np(1 − p)

• The variance exceeds the variance of the Binomial distribution

4-36
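(A hedged simulation of the clustered-data mechanism above, not from the slides: π drawn from a Beta distribution per cluster inflates Var{Y} beyond the Binomial variance. All values are invented.)

import numpy as np

rng = np.random.default_rng(10)
n, p = 20, 0.3
a = 2.0
b = a * (1 - p) / p                       # Beta(a, b) has mean p
pi = rng.beta(a, b, size=200_000)
Y = rng.binomial(n, pi)

print(Y.var())                            # empirical Var{Y}
print(n * p * (1 - p))                    # Binomial variance, noticeably smaller
print(n * p * (1 - p) + n * (n - 1) * pi.var())   # formula from the slide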

Modeling Overdispersion

• Introduce an additional dispersion parameter φ:

  E{Yi} = ni πi, Var{Yi} = φ ni πi(1 − πi)

  – When φ ≠ 1, Yi follows a quasi-binomial distribution. The distribution is characterized by its expectation and variance; the probability distribution function is left unspecified.

• Estimate φ̂ = χ²/(n − p)

  – χ² is the Pearson lack-of-fit statistic (= the sum of squared Pearson residuals with non-replicated data)
  – p is the number of parameters in the model
  – φ̂ ≫ 1 indicates evidence of overdispersion.

• Since φ does not affect E{Yi}, modeling overdispersion does not change β̂.

• SE{β̂} is multiplied by √φ̂.

4-37
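(A sketch of the dispersion estimate φ̂ = χ²/(n − p) above, not from the slides, using the sum of squared Pearson residuals from a fitted model; the data are simulated, so φ̂ should land near 1.)

import numpy as np
import statsmodels.api as sm
from scipy.special import expit

rng = np.random.default_rng(11)
x = rng.normal(size=500)
y = rng.binomial(1, expit(x))
fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
pi = fit.predict()

X2 = (((y - pi) ** 2) / (pi * (1 - pi))).sum()   # Pearson lack-of-fit statistic
phi = X2 / (len(y) - 2)                           # p = 2 parameters here
print(phi)                                        # near 1: no overdispersion built in
print(fit.bse * np.sqrt(phi))                     # SEs inflated by sqrt(phi)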

Comparing Nested Models in the Presence of Overdispersion

• Regular likelihood-based approaches (e.g., LRT, AIC) are not applicable.

• An F test approximates the deviance-based LR test:

  F = [ (D_reduced − D_full) / (df_reduced − df_full) ] / φ̂
      ~approx, under H0:  F( df_reduced − df_full, df_full )

• Assumes roughly equal-sized covariate classes

• Modeling strategy:

  – fit the full model (with all predictors)
  – estimate φ
  – compare nested models with the F test to reduce the number of predictors

4-38

Prediction and Classification

4-39

Prediction of the Mean

• Point estimate of the link π′h:

  π̂′h = X′h b

• Point estimate of the response πh:

  π̂h = 1 / (1 + exp(−π̂′h)) = 1 / (1 + exp(−X′h b))

• Interval estimate for the link π′h:

  s²{π̂′h} = X′h s²{b} Xh

  100(1 − α)% CI for π′h: π̂′h ± z_{1−α/2} s{π̂′h} = (L, U)

• Approximate interval estimate for the response πh:

  100(1 − α)% CI for πh: ( 1/(1 + exp(−L)), 1/(1 + exp(−U)) )

• Use a Bonferroni correction if multiple Xh are of interest

4-40
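(A sketch of the interval above, not from the slides: build a CI on the link scale, then map it through the inverse logit. The data and the new point Xh are invented.)

import numpy as np
import statsmodels.api as sm
from scipy.special import expit
from scipy.stats import norm

rng = np.random.default_rng(12)
x = rng.normal(size=400)
y = rng.binomial(1, expit(-0.5 + x))
fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)

Xh = np.array([1.0, 0.7])                         # new point, X_h = 0.7
eta_h = Xh @ fit.params                           # point estimate of the link
se_h = np.sqrt(Xh @ fit.cov_params() @ Xh)        # s{pi'_h} = sqrt(X_h' s2{b} X_h)
L, U = eta_h + np.array([-1, 1]) * norm.ppf(0.975) * se_h
print(expit(eta_h), (expit(L), expit(U)))         # pi_h and its 95% CI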

Measures of Agreement

• Have N observations

• Consider all t pairs of observations with distinct responses (one Y = 1 paired with one Y = 0)

  – E.g., with 16 ones and 14 zeros, t = 16 × 14 = 224 pairs

• Compare the predicted probabilities within each pair

  – Concordant if π̂(Y=1) > π̂(Y=0)
  – Discordant if π̂(Y=1) < π̂(Y=0)
  – Tie if π̂(Y=1) = π̂(Y=0)

• Measures of agreement

  – Somers' D: (#C − #D)/t
  – Goodman-Kruskal Gamma: (#C − #D)/(#C + #D)
  – Kendall's Tau-a: (#C − #D)/(0.5 N(N − 1))
  – c: (#C + 0.5(t − #C − #D))/t

4-41
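(A brute-force sketch of the agreement measures above, not from the slides; a small invented example.)

import numpy as np

y = np.array([1, 1, 0, 0, 0])
p = np.array([0.8, 0.4, 0.5, 0.2, 0.1])      # pretend fitted probabilities

p1, p0 = p[y == 1], p[y == 0]
diff = p1[:, None] - p0[None, :]             # all (Y=1, Y=0) pairs
C, D = (diff > 0).sum(), (diff < 0).sum()
t = diff.size                                # here t = 2 * 3 = 6
print("Somers' D:", (C - D) / t)
print("c:", (C + 0.5 * (t - C - D)) / t)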

Prediction of a New Observation (i.e., Classification)

• Choose a cutoff c ∈ (0, 1):

  Ŷh = 1, if π̂h > c
  Ŷh = 0, if π̂h ≤ c

• Sensitivity:

  (# predicted '1' & true '1') / (# true '1') = Σ_{i=1}^{n} Ŷi Yi / Σ_{i=1}^{n} Yi

• Specificity:

  (# predicted '0' & true '0') / (# true '0') = Σ_{i=1}^{n} (1 − Ŷi)(1 − Yi) / Σ_{i=1}^{n} (1 − Yi)

• Vary the cutoff c ∈ (0, 1), and choose c to optimize sensitivity and specificity

4-42
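(A sketch of the cutoff rule above, not from the slides: sweep c over (0,1) and trace sensitivity and specificity, i.e., the ROC curve of the next slide. The data are simulated.)

import numpy as np
from scipy.special import expit

rng = np.random.default_rng(13)
p_hat = expit(rng.normal(size=1000))         # pretend these are fitted probabilities
y = rng.binomial(1, p_hat)

for c in [0.2, 0.5, 0.8]:
    yhat = (p_hat > c).astype(int)
    sens = (yhat * y).sum() / y.sum()
    spec = ((1 - yhat) * (1 - y)).sum() / (1 - y).sum()
    print(c, sens, spec)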

ROC Curve for Classification

Vary c, and plot sensitivity vs 1 − specificity. Evaluate models by the area under the curve.

[Figure: ROC curve. x-axis: 1 − Specificity (# predicted '1' / # true '0'); y-axis: Sensitivity (# predicted '1' / # true '1'); both axes run from 0.0 to 1.0.]

4-43

Evaluation of the Predictive Ability of the Model

• The area under the ROC curve can be used to compare models

  – Area = 1 → perfect classification
  – Area = 0.5 → random classification

• Classification on the training set is overly optimistic

• Use cross-validation to construct a more accurate ROC curve

4-44


Variable Selection

4-45

Automatic Variable Selection

• Exhaustive search: minimize

  −2 loge L(b)
  AICp = −2 loge L(b) + 2p
  BICp = −2 loge L(b) + p loge(n)

• Heuristic search

  – forward selection; backward elimination; stepwise selection
  – based on the Wald statistic and the Normal distribution

4-46
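(A sketch of the criteria above using a fitted model's log-likelihood, not from the slides; statsmodels also exposes .aic and .bic, which should agree here. The data are simulated.)

import numpy as np
import statsmodels.api as sm
from scipy.special import expit

rng = np.random.default_rng(14)
x = rng.normal(size=300)
y = rng.binomial(1, expit(0.5 * x))
fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)

p, n = 2, len(y)
print(-2 * fit.llf + 2 * p)              # AIC_p
print(-2 * fit.llf + p * np.log(n))      # BIC_p
print(fit.aic, fit.bic)                  # library equivalents, for comparison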

Variable Selection Should Be Done as Part of Cross-Validation

• Example from Simon et al., JNCI, 2003.

• Simulated data with no structure

  – 20 observations with random labels
  – 6,000 possible but unrelated predictors
  – Repeated 200 times

• Estimated predictive accuracy using

  – no cross-validation
  – selecting features on the full dataset, then using cross-validation
  – selecting features at each step of cross-validation

4-47

Variable Selection Should Be Done as Part of Cross-Validation

Example from Simon et al., JNCI, 2003: feature selection in logistic regression.

• Simulated 20 samples with random labels and 6,000 genes; repeated 200 times

• Estimated predictive accuracy using no cross-validation, a validation set set aside after gene selection, and cross-validation at all steps

• Conclusion: cross-validation at all steps of feature selection is key


[Figure (Fig 1, Simon et al.): The effect of various levels of cross-validation on the estimated error rate of a predictor. For 2,000 simulated datasets with arbitrarily assigned class labels, vertical bars show the proportion of datasets with a given number of misclassifications (0 to 20) under three strategies: no cross-validation (re-substitution), cross-validation after gene selection, and cross-validation before gene selection. Re-substitution gives zero misclassifications in 98.2% of datasets, and cross-validating only after gene selection still gives zero in 90.2%, even though the expected error is 10 of 20. Reprinted from Simon R, Radmacher MD, Dobbin K, et al: Pitfalls in the analysis of DNA microarray data: Class prediction methods. J Natl Cancer Inst 95:14-18, 2003.]

• Conclusion

– Incorporating the selection of predictors within the cross-validation procedure is key

4-48
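(A hedged mini-reproduction of the effect described above, not from the slides and much smaller than Simon et al.'s setup: select the single most class-separating feature either before or inside leave-one-out cross-validation, using a simple nearest-class-mean rule in place of their classifier, and compare error rates.)

import numpy as np

rng = np.random.default_rng(15)
n, m = 20, 500                          # 20 samples, 500 pure-noise features
X = rng.normal(size=(n, m))
y = np.array([0] * 10 + [1] * 10)       # arbitrary labels, no real structure

def best_feature(X, y):
    # index of the feature whose mean differs most between the classes
    return np.argmax(np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)))

def classify(xj_train, y_train, xj_test):
    # nearest-class-mean rule on a single feature
    m0, m1 = xj_train[y_train == 0].mean(), xj_train[y_train == 1].mean()
    return int(abs(xj_test - m1) < abs(xj_test - m0))

# (a) biased: select the feature on the full data, then cross-validate
j = best_feature(X, y)
err_a = sum(classify(np.delete(X[:, j], i), np.delete(y, i), X[i, j]) != y[i]
            for i in range(n))

# (b) honest: re-select the feature inside every leave-one-out step
err_b = 0
for i in range(n):
    Xtr, ytr = np.delete(X, i, axis=0), np.delete(y, i)
    j = best_feature(Xtr, ytr)
    err_b += classify(Xtr[:, j], ytr, X[i, j]) != y[i]

print(err_a, err_b)   # (a) looks optimistic; (b) hovers near chance (10 of 20)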