Logistic regression & prediction score 16-17-12-09
15/12/52
Logistic regression analysis & developing a clinical prediction score
Ammarin Thakkinstian, Ph.D.
Section for Clinical Epidemiology and Biostatistics (SCEB)
• Part I– Logistic regression analysis
• Part II– Developing a clinical prediction score
Objective
• Construct the logit equation
• Estimate the probability of the event, the adjusted odds ratio and its 95% confidence interval
• Interpret the results of logistic regression analysis
• Assess goodness of fit of the logit model & diagnostic measures
• Develop a prediction score model using the logit equation & ROC curve analysis
• Calibrate the cut-off or threshold
• Validate a prediction score model
Reference
• Pagano M, Gauvreau K. Principles of Biostatistics. California: Duxbury Press 1993; 379-424.
• Kleinbaum DG, Kupper LL, Muller KE, Nizam A. Applied regression analysis and other multivariable methods, 3rd edition. Washington: Duxbury Press 1998; 39-212.
• Hosmer DW, Lemeshow S. Applied logistic regression, 2nd edition. New York: John Wiley & Sons, Inc 2000.
Outline of talk
• Construct logistic equation
  – Simple logistic model
  – Multiple logistic model
• Model selection
  – Assessing goodness of fit of the model
  – Diagnostic measures
• Creating a clinical prediction score
  – Derivation phase
  – Validation phase
When will we apply the logistic equation
• Assessing association between factors and outcome in which
  – Outcome
    • Dichotomous only
    • DM/Non-DM, HT/Non-HT, CKD/Non-CKD, Retinopathy/Non-retinopathy
  – Factors
    • Can be either continuous or categorical variables
Example I. Factors associated with acute stroke
• Design: Case-control study
• Outcome variable: Case vs Control
  – Case is a patient who is diagnosed with haemorrhagic or ischemic stroke
  – Control is a subject who has never had a history of stroke
• Variables of interest
  – Age, gender, BMI, Waist-hip ratio
  – Smoking, alcohol consumption
  – Physical activity
  – History of disease
    • DM
    • HT
    • High Cholesterol, LDL, HDL, Trig
• Variables (cont)
  – Genetic factors
    • tissue-type plasminogen activator (t-PA)
    • R353Q polymorphism of the Factor VII gene
    • Platelet glycoprotein (GP 1bα) gene
      – Thr/Met & Kozak polymorphisms
Example II. Factors associated with retinopathy in type 2 diabetic patients
• Design
  – Cross-sectional study
• Outcome
  – Retinopathy vs Non-retinopathy
• Variables
  – Demographic data
    • Age, gender, BMI/Waist-hip ratio, smoking, alcohol
  – History of disease
    • HT
    • Abnormal lipid profile
  – Clinical data
    • SBP/DBP
    • Kidney function (GFR or Cr)
    • HbA1c
    • Medication
      – ACE-I, ARB
Example III. Risk factors of chronic kidney disease (CKD)
• Design
  – Cross-sectional study
• Outcome
  – CKD versus non-CKD
• Variables
  – Age, gender, BMI/Waist-hip ratio
  – Alcohol consumption
  – Smoking
  – Exercise & Physical activity
  – History of illness
    • DM, HT, Abnormal lipid profile, kidney stone
  – Medication used
    • NSAID, Cyclo-oxygenase type 2 inhibitor (Cox-2), Traditional medicine
Example IV. A clinical decision rule to prioritize polysomnography (PSG) in patients with suspected sleep apnea
• Design
  – Prospective data collection on consecutive patients referred to a sleep centre.
  – All consecutive new patients from February 2001 to April 2003 were included in the study. Data from February 2001 to December 2002 were used to derive the decision rule, whereas data collected from January 2003 to April 2003 were used for validation of the rule.
• Setting
  – The Newcastle Sleep Disorders Centre, University of Newcastle, NSW, Australia.
• Patients
  – Consecutive adult patients who had been scheduled for initial diagnostic PSG.
• Study Objectives
  – To derive and validate a clinical decision rule that can help to prioritize patients who are on waiting lists for PSG.
• Variables
Association between age & sleep apnea
[Figure: Scatter plot of age and SA. x-axis: age (20 to 80); y-axis: SA. Annotations as extracted: mean=531, mean=430]
Group   Age     SA    Non-SA   N     Mean (P)
1       < 30    22    53       75    0.29
2       30-44   146   99       245   0.60
3       45-60   225   79       304   0.74
4       60+     176   37       213   0.83
[Figure: Probability of having SA according to age group. x-axis: age group (<30, 37.5, 47.5, >60); y-axis: probability (0 to 1)]
• Mean value of SA given age group
• E(Y|X)
• Expected value (mean) of SA given X
  0 ≤ E(Y|X) ≤ 1
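As a small illustration of E(Y|X) for a 0/1 outcome, the sketch below recomputes the proportion with SA in each age group from the counts in the table above. It is written in Python rather than the Stata used in the slides, and the names `age_groups` and `proportion_with_sa` are illustrative only.

```python
# SA / non-SA counts per age group, copied from the age-group table above.
age_groups = {
    "< 30": (22, 53),
    "30-44": (146, 99),
    "45-60": (225, 79),
    "60+": (176, 37),
}


def proportion_with_sa(sa, non_sa):
    """Mean of a 0/1 outcome, i.e. the proportion of subjects with SA."""
    return sa / (sa + non_sa)


proportions = {group: proportion_with_sa(*counts)
               for group, counts in age_groups.items()}

for group, p in proportions.items():
    # Each value is E(Y|X) for that age group and lies between 0 and 1.
    print(f"{group:>6}: E(Y|X) = {p:.2f}")
```

Because the outcome is coded 0/1, each group mean is a proportion, which is why E(Y|X) is bounded by 0 and 1.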
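Logit equation:

$$p(Y=1) = \frac{1}{1+\exp\left[-\left(\beta_0+\sum_{j=1}^{k}\beta_j X_j\right)\right]} = \frac{e^{\beta_0+\sum_{j=1}^{k}\beta_j X_j}}{1+e^{\beta_0+\sum_{j=1}^{k}\beta_j X_j}}$$

$$1-p = 1-\frac{e^{\beta_0+\sum_{j=1}^{k}\beta_j X_j}}{1+e^{\beta_0+\sum_{j=1}^{k}\beta_j X_j}} = \frac{1+e^{\beta_0+\sum_{j=1}^{k}\beta_j X_j}-e^{\beta_0+\sum_{j=1}^{k}\beta_j X_j}}{1+e^{\beta_0+\sum_{j=1}^{k}\beta_j X_j}} = \frac{1}{1+e^{\beta_0+\sum_{j=1}^{k}\beta_j X_j}}$$

$$\therefore\ \frac{p}{1-p} = \frac{e^{\beta_0+\sum_{j=1}^{k}\beta_j X_j}\,/\left(1+e^{\beta_0+\sum_{j=1}^{k}\beta_j X_j}\right)}{1\,/\left(1+e^{\beta_0+\sum_{j=1}^{k}\beta_j X_j}\right)} = e^{\beta_0+\sum_{j=1}^{k}\beta_j X_j}$$

$$\ln\frac{p}{1-p} = \ln e^{\beta_0+\sum_{j=1}^{k}\beta_j X_j} = \beta_0+\sum_{j=1}^{k}\beta_j X_j$$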
Simple logistic regression
• Fit equation
$$\ln\left[\frac{P}{1-P}\right] = b_0 + b_1\,\text{snore}$$
Performing analysis in STATA

xi: logit SA i.snore, nolog
i.snore           _Isnore_1-2         (naturally coded; _Isnore_2 omitted)

Logistic regression                               Number of obs   =        837
                                                  LR chi2(1)      =      86.63
                                                  Prob > chi2     =     0.0000
Log likelihood = -481.49775                       Pseudo R2       =     0.0825

------------------------------------------------------------------------------
          SA |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Isnore_1 |   1.571043   .1717837     9.15   0.000     1.234354    1.907733
       _cons |  -.3846743   .1440453    -2.67   0.008    -.6669979   -.1023508
------------------------------------------------------------------------------
Interpretation
• The logit (log odds) of sleep apnea is 1.57 higher in patients with a history of snoring than in patients without a history of snoring.
Interpretation
• The logit of sleep apnea for patients with & without a history of snoring is therefore equated
$$\ln[\text{odds}(SA)_{\text{snore}+}] = -0.38 + 1.57$$
$$\ln[\text{odds}(SA)_{\text{snore}-}] = -0.38$$
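Interpretation
$$\ln[\text{odds}(SA)_{\text{snore}+}] - \ln[\text{odds}(SA)_{\text{snore}-}] = -0.38 + 1.57 + 0.38 = 1.57$$
$$\ln\left[\frac{\text{odds}(SA)_{\text{snore}+}}{\text{odds}(SA)_{\text{snore}-}}\right] = 1.57$$
$$\frac{\text{odds}(SA)_{\text{snore}+}}{\text{odds}(SA)_{\text{snore}-}} = \exp(1.57)$$
$$OR = 4.81$$
where
$$OR = \frac{\text{odds}(SA)_{\text{snore}+}}{\text{odds}(SA)_{\text{snore}-}}$$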
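The step from coefficient to odds ratio can be checked numerically. The sketch below is Python rather than the Stata used in the slides; it exponentiates the `_Isnore_1` coefficient and its 95% confidence limits as printed in the Stata output, and the variable names are illustrative only.

```python
import math

# Values copied from the Stata output for _Isnore_1.
coef = 1.571043
ci_low, ci_high = 1.234354, 1.907733

# Exponentiating a logit coefficient gives the odds ratio; doing the same
# to the confidence limits of the coefficient gives the 95% CI of the OR.
odds_ratio = math.exp(coef)
or_ci = (math.exp(ci_low), math.exp(ci_high))

print(f"OR = {odds_ratio:.2f}, 95% CI {or_ci[0]:.2f} to {or_ci[1]:.2f}")
```

The interval is not symmetric around the OR because it is symmetric on the log-odds scale, not on the odds scale.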
Testing association
$$H_0: \beta_1 = 0 \ \text{ or } \ OR = 1$$
• Wald test
$$Z = \frac{\hat\beta_1}{se(\hat\beta_1)} = \frac{1.571043}{0.1717837} = 9.15$$
Testing association
• Likelihood ratio test
$$G = -2\,[LL_0 - LL_1] = -2(-524.8 + 481.5) = 86.6$$
$$G \sim \chi^2 \ \text{ with } 2-1 \text{ df}$$
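Both tests can be reproduced with the standard library alone. The sketch below is Python rather than Stata; it takes the coefficient, standard error and model log likelihood from the Stata output and the null log likelihood (-524.8) from the slide. The p-value shortcuts via `math.erfc` are an implementation choice that holds for a standard normal Z and for a chi-square with 1 df only.

```python
import math

# From the Stata output (snore model) and the slide (null log likelihood).
coef, se = 1.571043, 0.1717837
ll_null, ll_model = -524.8, -481.49775

# Wald test: Z = coefficient / standard error, two-sided normal p-value.
z = coef / se
p_wald = math.erfc(abs(z) / math.sqrt(2))

# Likelihood ratio test: G = -2[LL0 - LL1], chi-square with 1 df.
g = -2 * (ll_null - ll_model)
p_lr = math.erfc(math.sqrt(g / 2))  # upper tail of chi-square(1)

print(f"Wald Z = {z:.2f}, p = {p_wald:.3g}")
print(f"LR   G = {g:.1f}, p = {p_lr:.3g}")
```

Both tests lead to the same conclusion here: the hypothesis of no association between snoring and SA is rejected.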
Estimate probability of having event
$$\ln\frac{\hat p}{1-\hat p} = -0.38 + 1.57 \times 1 = 1.19$$
Taking anti-logarithm for both sides,
$$\frac{\hat p}{1-\hat p} = e^{1.19} = 3.29$$
Solve for $\hat p$:
$$\hat p = 3.29 - 3.29\,\hat p$$
$$4.29\,\hat p = 3.29$$
$$\hat p = \frac{3.29}{4.29} = 0.77$$
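The same calculation can be written as an inverse-logit function. The sketch below is Python rather than Stata and uses the rounded coefficients from the slide (-0.38 and 1.57); `inverse_logit` is an illustrative helper name.

```python
import math


def inverse_logit(logit):
    """Convert a log odds value into a probability."""
    return 1.0 / (1.0 + math.exp(-logit))


b0, b1 = -0.38, 1.57  # rounded coefficients used on the slide

p_snore = inverse_logit(b0 + b1 * 1)     # history of snoring
p_no_snore = inverse_logit(b0 + b1 * 0)  # no history of snoring

print(f"P(SA | snore)    = {p_snore:.2f}")
print(f"P(SA | no snore) = {p_no_snore:.2f}")
```

Setting snore to 0 leaves only the intercept, which gives the estimated probability of SA among patients without a history of snoring.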
Multiple logistic regression
• Multiple factors are associated with the outcome of interest
• Osteoporotic hip fracture
  – Age, BMI, use of corticosteroid, alcohol consumption, calcium intake, etc
Multiple logistic regression
• CKD
  – Age, gender, BMI, use of NSAID, diabetes, HT, Chol
• SA
  – Age, gender, BMI, snore, stop breathing, etc
Multiple logistic regression
• Consider > 1 factor simultaneously
• Cumulative factors can better predict the event than one factor
• Control confounding effects, i.e., assess the effect of each factor controlling for other factors
$$\operatorname{logit}\left[\frac{D^{+}}{D^{-}}\right] = \beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3+\beta_4x_4+\dots+\beta_px_p$$
Steps of analysis
• Model selection – Only variables can well explain the interested
event • Clinical significance• Statistical significance
– Not too many (but not too small) variables y ( )
Model selection
• i) Univariate analysis
  • age_gr, sex, BMI_gr, snore, stop_bre, choking, awake_re, kick_leg, accident, smoker, alcohol, ht, dm, allergie
TABLE 1. Patients' characteristics between SA and non-SA groups

Factors                        SA n = (%)    Non-SA n = (%)    P value
Age, mean (SD)
  < 30
  30 - 44
  45 - 59
  > 60
Gender
  Male
  Female
BMI, mean (SD)
  < 25
  25 - 29.9
  30 - 39.9
  > 40
Snoring
  Yes
  No
Stopping breathing
  Yes
  No
Choking
  Yes
  No
Waking up refreshed
  Yes
  Sometime
  No
Leg kicking
  Yes
  Sometime
  No
Accident due to sleepiness
  Yes
  No
ESS score, median (range)
Smoking
  Yes
  Ex-smoker
  No
Alcohol consumption
  Yes
  No
Hypertension
  Yes
  No
Diabetes mellitus
  Yes
  No
Allergy
  Yes
  No
Model selection
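• ii) Multivariate analysis by simultaneously entering variables with p < 0.15 into the model
$$\operatorname{logit}\left[\frac{SA^{+}}{SA^{-}}\right] = \beta_0 + \beta_1\,\text{Stop\_bre} + \beta_2\,\text{Agegr}_2 + \beta_3\,\text{Agegr}_3 + \beta_4\,\text{Agegr}_4 + \beta_5\,\text{Sex} + \beta_6\,\text{BMIgr}_2 + \beta_7\,\text{BMIgr}_3 + \beta_8\,\text{BMIgr}_4 + \beta_9\,\text{Snore} + \dots + \beta_p x_p$$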
Confounder versus Interaction
• Confounders
  • Crude OR versus Adjusted OR
Effect modifier
Model selection
– Backward
– Forward
Performance of the model
• Goodness of fit (Calibration)
  • How similar are the predicted and observed outcomes?
Model classification
• How well does the model discriminate SA from non-SA subjects?
• Assign the cut-off/threshold
• Construct 2x2 or kx2 tables
• Estimate predictive values
  – Sens
  – Spec
  – PPV, NPV
  – Accuracy
  – Area under ROC
Model classification
• Area under the ROC
  – Summary statistic that can tell us whether the logit model can discriminate disease from non-disease subjects.
  – Plots sensitivity versus 1-specificity (false positive) for the whole range of estimated probabilities
[Figure: ROC curve. x-axis: 1 - Specificity (0.00 to 1.00); y-axis: Sensitivity (0.00 to 1.00). Area under ROC curve = 0.8101]
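To make the area under the ROC curve concrete, the sketch below computes it as the proportion of (SA, non-SA) pairs in which the SA subject has the higher predicted probability, counting ties as half. This pairwise form equals the area under the curve of sensitivity against 1-specificity. It is Python rather than Stata, and the eight predicted probabilities and outcomes are made-up illustration data, not values from the study.

```python
def auc(scores, labels):
    """Area under the ROC curve from scores and 0/1 outcome labels."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count pairs where the diseased subject scores higher (ties count 0.5).
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities and outcomes (1 = SA, 0 = non-SA).
toy_prob = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
toy_sa = [1, 1, 0, 1, 1, 0, 1, 0]

print(f"AUC = {auc(toy_prob, toy_sa):.3f}")
```

An area of 0.5 means the model ranks subjects no better than chance, and an area of 1.0 means every SA subject is ranked above every non-SA subject.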
Interpretation of ROC
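Diagnostic measures
• Outliers
  - Pearson's chi-square residual, where y_j is the number of events among the m_j subjects with covariate pattern j
$$r(y_j,\hat\pi_j) = \frac{y_j - m_j\hat\pi_j}{\sqrt{m_j\hat\pi_j(1-\hat\pi_j)}}$$
  Residual sum of squares
$$\chi^2 = \sum_j r(y_j,\hat\pi_j)^2, \qquad r(y_j,\hat\pi_j)^2 = \frac{(y_j - m_j\hat\pi_j)^2}{m_j\hat\pi_j(1-\hat\pi_j)}$$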
Outliers
  - Deviance residual
$$d(y_j,\hat\pi_j) = \pm\left\{2\left[y_j\ln\left(\frac{y_j}{m_j\hat\pi_j}\right) + (m_j-y_j)\ln\left(\frac{m_j-y_j}{m_j(1-\hat\pi_j)}\right)\right]\right\}^{1/2}$$
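The Pearson and deviance residuals for one covariate pattern can be sketched as follows, in Python rather than Stata. The pattern used (22 events among 75 subjects with a fitted probability of 0.35) is hypothetical, and the deviance function assumes 0 < y < m so that both logarithms are defined; the sign follows the sign of y minus the expected count.

```python
import math


def pearson_residual(y, m, pi_hat):
    """(observed - expected) / binomial standard deviation."""
    return (y - m * pi_hat) / math.sqrt(m * pi_hat * (1 - pi_hat))


def deviance_residual(y, m, pi_hat):
    """Signed square root of the pattern's deviance contribution.

    Assumes 0 < y < m so that both logarithms are defined.
    """
    expected = m * pi_hat
    inner = (y * math.log(y / expected)
             + (m - y) * math.log((m - y) / (m - expected)))
    sign = 1.0 if y >= expected else -1.0
    return sign * math.sqrt(2 * max(inner, 0.0))


# Hypothetical covariate pattern: 22 events among 75 subjects, fitted 0.35.
r = pearson_residual(22, 75, 0.35)
d = deviance_residual(22, 75, 0.35)
print(f"Pearson r = {r:.3f}, deviance d = {d:.3f}")
```

For a pattern that the model fits reasonably well the two residuals are close, as here; large absolute values of either flag a poorly fitted pattern.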
• Leverage h_jj values
  • Reflects distance of X_j from the centre (mean)
  • The higher the h_jj, the longer the distance of X_j from the mean
$$H = V^{1/2}X(X'VX)^{-1}X'V^{1/2}$$
where V is a J×J diagonal matrix and
$$v_j = m_j\,\hat\pi(x_j)\,[1-\hat\pi(x_j)]$$
Influence of outliers
• Influence on prediction value of Y
  • Including/excluding the pattern(s) that are outliers would change Y values
  • Pearson residual change
$$\Delta\chi_j^2 = \frac{r_j^2}{1-h_{jj}}$$
  • Deviance residual change
$$\Delta D_j = \frac{d_j^2}{1-h_{jj}}$$
Influence on estimated coefficients
$$\Delta\hat\beta_j = (\hat\beta-\hat\beta_{(-j)})'\,(X'VX)\,(\hat\beta-\hat\beta_{(-j)}) = \frac{r_j^2\,h_{jj}}{(1-h_{jj})^2}$$
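Given a pattern's Pearson residual r, deviance residual d and leverage h, the three influence statistics are simple arithmetic. The sketch below is Python rather than Stata; the function name and the input values (r = -1.03, d = -1.04, h = 0.10) are hypothetical.

```python
def influence_statistics(r, d, h):
    """Change in Pearson chi-square, deviance and coefficients
    when one covariate pattern is deleted."""
    delta_chi2 = r ** 2 / (1 - h)
    delta_dev = d ** 2 / (1 - h)
    delta_beta = r ** 2 * h / (1 - h) ** 2
    return delta_chi2, delta_dev, delta_beta


# Hypothetical residuals and leverage for one covariate pattern.
delta_chi2, delta_dev, delta_beta = influence_statistics(-1.03, -1.04, 0.10)
print(f"dX^2 = {delta_chi2:.3f}, dD = {delta_dev:.3f}, dB = {delta_beta:.3f}")
```

All three grow as the leverage approaches 1, so a pattern that is both poorly fitted and far from the mean of X has the largest influence.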
[Figure: Delta Pearson chi-square versus predicted probability. x-axis: Pr(SA) (0 to 1); y-axis: H-L dX^2]
[Figure: Delta D versus probability. x-axis: Pr(SA) (0 to 1); y-axis: H-L dD]
[Figure: Delta B versus probability. x-axis: Pr(SA) (0 to 1); y-axis: Pregibon's dbeta]
[Figure: H-L dX^2 versus Pr(SA), with points 66 and 155 labelled]
Create scoring scheme using coefficients of each variable

Factors               Coefficients    Score for individual
Stopping breathing
  Yes                 0.9             ……………………..
  No                  0
Age
  > 60                2.2             ……………………..
  45 - 59             1.5
  30 - 44             1.0
  < 30                0
BMI
  > 40                2.3             ……………………..
  30 - 39.9           1.5
  25 - 29.9           1.1
  < 25                0
Snoring
  Yes                 0.9             ……………………..
  No                  0
Gender
  Male                1.1             ……………………..
  Female              0
Total score                           ……………………..
Calculate score
gen score_full = _b[_cons] + ///
    _b[_Istop_bre_1]*_Istop_bre_1 + ///
    _b[_Iage_gr_2]*_Iage_gr_2 + ///
    _b[_Iage_gr_3]*_Iage_gr_3 + ///
    _b[_Iage_gr_4]*_Iage_gr_4 + ///
    _b[_IBMI_gr_29]*_IBMI_gr_29 + ///
    _b[_IBMI_gr_39]*_IBMI_gr_39 + ///
    _b[_IBMI_gr_40]*_IBMI_gr_40 + ///
    _b[_Isex_2]*_Isex_2 + ///
    _b[_Isnore_1]*_Isnore_1
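For a single patient, the scoring scheme amounts to looking up one rounded coefficient per factor and adding them. The sketch below is Python rather than Stata and uses the rounded coefficients from the scoring table; unlike the Stata `gen` command it leaves out the intercept `_b[_cons]` and the unrounded coefficients, so its totals will not match `score_full` exactly. The dictionary keys and the example patient are illustrative only.

```python
# Points per category, copied from the scoring table (rounded coefficients).
POINTS = {
    "stop_breathing": {"yes": 0.9, "no": 0.0},
    "age": {"> 60": 2.2, "45-59": 1.5, "30-44": 1.0, "< 30": 0.0},
    "bmi": {"> 40": 2.3, "30-39.9": 1.5, "25-29.9": 1.1, "< 25": 0.0},
    "snoring": {"yes": 0.9, "no": 0.0},
    "gender": {"male": 1.1, "female": 0.0},
}


def total_score(patient):
    """Sum the points of the category the patient falls into, per factor."""
    return sum(POINTS[factor][patient[factor]] for factor in POINTS)


# Hypothetical patient: 50-year-old man, BMI 32, snores, stops breathing.
example = {"stop_breathing": "yes", "age": "45-59", "bmi": "30-39.9",
           "snoring": "yes", "gender": "male"}
print(f"Total score = {total_score(example):.1f}")
```

Reference categories contribute 0, so a patient in the lowest-risk category of every factor scores 0 on this simplified scale.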
Discrimination performance

roctab SA score, detail
------------------------------------------------------------------------------
                                              Correctly
Cutpoint          Sensitivity   Specificity   Classified        LR+        LR-
------------------------------------------------------------------------------
( >= 3.890326 )      91.92%        50.00%        78.49%      1.8383     0.1617
( >= 3.895265 )      91.74%        51.12%        78.73%      1.8768     0.1616
( >= 3.896797 )      89.28%        54.48%        78.14%      1.9612     0.1968
( >= 3.940307 )      89.28%        55.22%        78.38%      1.9939     0.1941
( >= 3.990148 )      88.93%        55.22%        78.14%      1.9861     0.2005
( >= 4.049621 )      88.58%        55.22%        77.90%      1.9782     0.2069
( >= 4.051153 )      87.70%        57.09%        77.90%      2.0437     0.2155
( >= 4.090991 )      87.35%        57.46%        77.78%      2.0534     0.2202
( >= 5.355929 )      55.54%        85.07%        64.99%      3.7209     0.5226
( >= 5.440022 )      48.51%        88.43%        61.29%      4.1934     0.5823
( >= 5.441554 )      48.33%        89.93%        61.65%      4.7972     0.5746
( >= 5.455751 )      48.15%        89.93%        61.53%      4.7798     0.5765
( >= 5.474413 )      47.28%        90.30%        61.05%      4.8731     0.5839
( >= 5.635747 )      40.95%        91.42%        57.11%      4.7715     0.6459
( >= 5.649945 )      40.77%        91.42%        56.99%      4.7510     0.6479
( >= 5.651477 )      38.66%        92.91%        56.03%      5.4537     0.6602
( >= 5.67371 )       37.79%        92.91%        55.44%      5.3298     0.6696
( >= 5.867904 )      36.73%        93.66%        54.96%      5.7905     0.6755
( >= 5.883634 )      36.03%        93.66%        54.48%      5.6797     0.6830
( >= 6.137812 )      22.85%        95.90%        46.24%      5.5664     0.8046
( >= 6.237287 )      18.45%        96.64%        43.49%      5.4950     0.8438
------------------------------------------------------------------------------

                      ROC                    -Asymptotic Normal--
           Obs       Area     Std. Err.      [95% Conf. Interval]
         --------------------------------------------------------
           837     0.8101       0.0165        0.77763     0.84249
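Each row of the roctab output corresponds to one 2x2 table. The sketch below, in Python rather than Stata, rebuilds the measures for the first cutpoint (>= 3.890326). The cell counts are an assumption: they are back-calculated from the reported percentages and from the group sizes in the age-group table (569 SA, 268 non-SA) and are not printed in the slides. PPV and NPV are not part of the roctab output and are shown only to illustrate the formulas.

```python
def classification_measures(tp, fn, fp, tn):
    """Standard measures from a 2x2 table of test result by disease status."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "lr_pos": sens / (1 - spec),
        "lr_neg": (1 - sens) / spec,
    }


# Assumed counts for score >= 3.890326 (back-calculated, see lead-in).
m = classification_measures(tp=523, fn=46, fp=134, tn=134)
for name, value in m.items():
    print(f"{name:>12}: {value:.4f}")
```

With these counts the sensitivity, specificity, percentage correctly classified, LR+ and LR- agree with the first row of the output to the printed precision.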
Model selection based on model classification
• ROC curve analysis
• Comparing area under ROC curves
Calibrate cutoff
• Score's distribution
  – Tertile, quantile
• Youden index
  – Sen + spec - 1
• LR+
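The Youden index can be applied directly to the roctab rows. The sketch below is Python rather than Stata and uses a subset of the cutpoints printed in the output; because the slide shows only part of the full roctab listing, the cutpoint selected here is the best among the listed rows and not necessarily the overall optimum.

```python
# (cutpoint, sensitivity %, specificity %) copied from the roctab output.
rows = [
    (3.890326, 91.92, 50.00),
    (3.895265, 91.74, 51.12),
    (3.896797, 89.28, 54.48),
    (3.940307, 89.28, 55.22),
    (3.990148, 88.93, 55.22),
    (4.049621, 88.58, 55.22),
    (4.051153, 87.70, 57.09),
    (4.090991, 87.35, 57.46),
    (5.355929, 55.54, 85.07),
    (5.440022, 48.51, 88.43),
    (5.883634, 36.03, 93.66),
    (6.237287, 18.45, 96.64),
]


def youden(sens_pct, spec_pct):
    """Youden index J = sensitivity + specificity - 1 (as proportions)."""
    return (sens_pct + spec_pct) / 100 - 1


best_cut, best_j = max(((cut, youden(se, sp)) for cut, se, sp in rows),
                       key=lambda pair: pair[1])
print(f"Best listed cutpoint: >= {best_cut} (J = {best_j:.4f})")
```

The Youden index weights sensitivity and specificity equally; a rule meant to prioritize patients on a waiting list may instead favour a cutoff chosen on LR+ or on the score's distribution.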
Validation
• Internal validation
  – Data are from the same setting
    • Split data
    • Bootstrap
    • Period
• External validation
  – Generalization
  – Data are from a different setting