Post on 22-Dec-2015
BIOST 536 Lecture 4 1
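Lecture 4 – Logistic regression: estimation and confounding

Linear model
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon, \quad \text{where } \varepsilon \sim N(0, \sigma^2)$$
$\beta_0, \beta_1, \dots, \beta_p, \sigma^2$ are fixed parameters to be estimated
Then
$$E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$$
$$\hat{Y} = \hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 X_2 + \dots + \hat\beta_p X_p$$
Find $\hat\beta_0, \hat\beta_1, \dots, \hat\beta_p$ such that
$\sum_i (Y_i - \hat{Y}_i)^2$ is minimized, "least squares criterion"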
There is a closed form solution for the estimates of beta, i.e. only matrix operations are
required and there is no iteration
Also need an estimate of $\sigma^2$ but this is easily obtained
$$\hat\sigma^2 = \frac{\sum_i (Y_i - \hat{Y}_i)^2}{n - p - 1}$$
The least squares estimates of beta are the same as the maximum likelihood estimate of
beta
Logistic regression estimation

Modeling a binary outcome
Modeling P(Y=1|X=x), not a continuous Y
A linear model has problems since the left-hand side satisfies 0 ≤ Pr ≤ 1, but the linear combination of the $\beta$'s and $X$'s on the right-hand side can lie anywhere in (-∞,∞), so it does not do well
$$P(Y \mid X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$$
Instead assume that P(Y=1|X=x) depends on X only through the linear combination
$$Z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p \quad \text{and} \quad Z \in (-\infty, \infty)$$
Need to link Z to P(Y=1|Z=z) so we use a logistic function
$$P(Y=1 \mid Z=z) = \frac{e^{z}}{1 + e^{z}}$$
[Figure: logistic curve, Probability (0 to 1) versus z (-5 to 5)]
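The logistic link and its inverse can be sketched in a few lines of Python. This is only an illustration of the formula above; the function names `logistic` and `logit` are chosen here and do not come from the lecture.

```python
import math

def logistic(z: float) -> float:
    """Logistic function e^z / (1 + e^z), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def logit(p: float) -> float:
    """Inverse of the logistic function: log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# Any real z is mapped into the interval (0, 1)
for z in (-5, -2, 0, 2, 5):
    print(f"z = {z:+d}  P(Y=1|Z=z) = {logistic(z):.4f}")
```

The symmetry of the curve shows up as logistic(z) + logistic(-z) = 1, so z = 0 corresponds to a probability of 0.5.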
Logistic function
Symmetric – could model either Pr(Y=1) or Pr(Y=0) and the effect of covariates would be the same (only the signs of the coefficients change)
Epidemiologists prefer the logistic model since the OR is easily derived
Mathematically convenient form
$$\hat{p}(x) = \frac{e^{\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \dots + \hat\beta_p x_p}}{1 + e^{\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \dots + \hat\beta_p x_p}}$$
Maximum likelihood equations have a simple form
$$\sum_{i=1}^{n} x_{ij}\,\bigl(y_i - \hat{p}(x_i)\bigr) = 0 \quad \text{for each covariate } j \text{ (with } x_{i0} = 1 \text{ for the intercept)}$$
Need to solve these equations by iteration
Iteration rarely goes awry if the data are not too sparse
Link functions
Logistic function very close to the probit model (symmetric)
$$P(Y=1 \mid Z=z) = \Phi(z), \quad \text{where } \Phi \text{ is the } N(0,1) \text{ cumulative distribution function}$$
$$Z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$$
Probit model used more in classical diagnostic testing and ROC analysis
Call “logistic” and “probit” link functions, since they link P(Y|X) to the linear combination
Other link functions include the “complementary log-log” (asymmetrical)
Some general regression programs in Stata expect you to specify the link function
We will assume a logistic link function here but can test in Stata using the linktest command
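How close the logistic and probit links are can be checked numerically. The sketch below compares the normal CDF with a rescaled logistic curve; the rescaling constant 1.702 is a conventional choice assumed here and is not taken from the lecture. The complementary log-log link is included to show its asymmetry.

```python
import math

def logistic(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def probit(z: float) -> float:
    """Standard normal CDF, Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def cloglog(z: float) -> float:
    """Inverse complementary log-log link: 1 - exp(-exp(z))."""
    return 1.0 - math.exp(-math.exp(z))

SCALE = 1.702  # assumed rescaling constant so the two curves line up

grid = [i / 100 for i in range(-500, 501)]
max_gap = max(abs(probit(z) - logistic(SCALE * z)) for z in grid)
print(f"largest gap between probit and rescaled logistic: {max_gap:.4f}")

# Symmetric links give 0.5 at z = 0; the complementary log-log does not
print(f"logistic(0) = {logistic(0):.3f}, probit(0) = {probit(0):.3f}, cloglog(0) = {cloglog(0):.3f}")
```

The two symmetric links differ by less than one percentage point anywhere on the grid, which is why the choice between them rarely changes conclusions.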
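Simple estimation example

                 Exposed (X=1)   Unexposed (X=0)   Total
Case (Y=1)            34               18            52
Control (Y=0)         66               82           148
Total                100              100           200

Adopt a logistic model so
$$E(Y \mid X) = \Pr(Y=1 \mid X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$
For the unexposed, the probability of a case is
$$\Pr(Y=1 \mid X=0) = \frac{e^{\beta_0}}{1 + e^{\beta_0}} = \pi_0$$
For the unexposed, the probability of a control is
$$\Pr(Y=0 \mid X=0) = \frac{1}{1 + e^{\beta_0}} = 1 - \pi_0$$
For the exposed, the probability of a case is
$$\Pr(Y=1 \mid X=1) = \frac{e^{\beta_0 + \beta_1}}{1 + e^{\beta_0 + \beta_1}} = \pi_1$$
For the exposed, the probability of being a control is
$$\Pr(Y=0 \mid X=1) = \frac{1}{1 + e^{\beta_0 + \beta_1}} = 1 - \pi_1$$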
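The likelihood for our data is
$$L = \binom{100}{18}\,\pi_0^{18}\,(1-\pi_0)^{82} \times \binom{100}{34}\,\pi_1^{34}\,(1-\pi_1)^{66}$$
The binomial coefficient $\binom{100}{18}$ is the number of ways of choosing 18 cases out of the 100 unexposed, roughly $3.066 \times 10^{19}$
Fortunately, we can drop the binomial coefficients since they do not affect the beta estimates
Maximize $\pi_0^{18}\,(1-\pi_0)^{82}\,\pi_1^{34}\,(1-\pi_1)^{66}$ or equivalently
$$L = \left(\frac{e^{\beta_0}}{1+e^{\beta_0}}\right)^{18} \left(\frac{1}{1+e^{\beta_0}}\right)^{82} \left(\frac{e^{\beta_0+\beta_1}}{1+e^{\beta_0+\beta_1}}\right)^{34} \left(\frac{1}{1+e^{\beta_0+\beta_1}}\right)^{66}$$
It turns out to be more convenient to maximize the log likelihood, log L
In many cases we need to iterate to find the beta estimates that make this a maximum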
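The iteration can be illustrated with a small Newton-Raphson sketch for the 2x2 table above (34 of 100 exposed and 18 of 100 unexposed are cases). This is a minimal illustration using only the Python standard library, not the algorithm any particular package uses; the function names and the starting values of zero are choices made here.

```python
import math

# Grouped data from the 2x2 table: (x, number of cases, group size)
GROUPS = [(1, 34, 100), (0, 18, 100)]

def expit(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0: float, b1: float) -> float:
    """Binomial log likelihood without the constant binomial coefficients."""
    total = 0.0
    for x, cases, size in GROUPS:
        p = expit(b0 + b1 * x)
        total += cases * math.log(p) + (size - cases) * math.log(1.0 - p)
    return total

def newton_raphson(max_iter: int = 25, tol: float = 1e-10):
    b0, b1 = 0.0, 0.0  # starting values (an arbitrary choice)
    for _ in range(max_iter):
        u0 = u1 = 0.0            # score vector
        i00 = i01 = i11 = 0.0    # information matrix
        for x, cases, size in GROUPS:
            p = expit(b0 + b1 * x)
            resid = cases - size * p
            weight = size * p * (1.0 - p)
            u0 += resid
            u1 += x * resid
            i00 += weight
            i01 += x * weight
            i11 += x * x * weight
        # Newton step: solve the 2x2 system (information) * step = score
        det = i00 * i11 - i01 * i01
        step0 = (i11 * u0 - i01 * u1) / det
        step1 = (i00 * u1 - i01 * u0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1

b0_hat, b1_hat = newton_raphson()
max_loglik = log_likelihood(b0_hat, b1_hat)
print(f"beta0 = {b0_hat:.6f}, beta1 = {b1_hat:.6f}, log L = {max_loglik:.4f}")
```

For this saturated model the iteration lands on the closed-form values log(18/82) and log((34/66)/(18/82)), which is why the fitted probabilities equal the observed proportions.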
The log likelihood is a well-behaved surface that is a function of the betas
The maximum log likelihood is -111.2429 achieved at $\hat\beta_0 = -1.516$, $\hat\beta_1 = 0.853$
For the unexposed, the estimate is
$$\widehat{\Pr}(Y=1 \mid X=0) = \frac{e^{\hat\beta_0}}{1+e^{\hat\beta_0}} = \frac{e^{-1.516}}{1+e^{-1.516}} = 0.18$$
which is just the proportion of cases in the unexposed group
For the exposed, the estimate is
$$\widehat{\Pr}(Y=1 \mid X=1) = \frac{e^{\hat\beta_0+\hat\beta_1}}{1+e^{\hat\beta_0+\hat\beta_1}} = \frac{e^{-1.516+0.853}}{1+e^{-1.516+0.853}} = 0.34$$
which is just the proportion of cases in the exposed group
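Now do this under the null hypothesis that exposure does not make a difference, i.e. $H_0: \beta_1 = 0$
Then the likelihood depends on only $\beta_0$
$$L = \left(\frac{e^{\beta_0}}{1+e^{\beta_0}}\right)^{18} \left(\frac{1}{1+e^{\beta_0}}\right)^{82} \left(\frac{e^{\beta_0}}{1+e^{\beta_0}}\right)^{34} \left(\frac{1}{1+e^{\beta_0}}\right)^{66}$$
Then the log likelihood is a function of $\beta_0$
[Figure: log likelihood (about -140 to -115) plotted against $\beta_0$ (-2.2 to -0.2)]
The maximum log likelihood is -114.6114 achieved at $\hat\beta_0 = -1.046$
Ignoring exposure, the estimated probability of being a case is
$$\widehat{\Pr}(Y=1) = \frac{e^{\hat\beta_0}}{1+e^{\hat\beta_0}} = \frac{e^{-1.046}}{1+e^{-1.046}} = 0.26$$
which is the overall proportion of cases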
Tests comparing nested models
Want to decide if the more complex model is significantly better than the simpler model
Possible tests comparing nested models (complex model includes all covariates of the simpler model)
1. Likelihood ratio test – direct comparison of the difference in log-likelihoods
   Preferred test
   Does not change with reparametrization
2. Score test – test computed at the null hypothesis values
   Very similar to LR test
   Sometimes can be computed when the LR test cannot
   Many of our common tests are score tests
3. Wald test – depends on the normality of the distribution of the estimates
   Can change with reparametrization
   P-value given in Stata output for individual variables
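For the exposure example (34/100 exposed cases, 18/100 unexposed cases) the three tests can be computed by hand. The sketch below is only an illustration under the assumption of a single binary covariate, where closed forms are available; the helper names are chosen here. Each statistic is referred to a chi-square distribution with 1 degree of freedom.

```python
import math

def chi2_1df_pvalue(stat: float) -> float:
    """Upper tail of a chi-square(1): P(Z^2 > stat) = erfc(sqrt(stat / 2))."""
    return math.erfc(math.sqrt(stat / 2.0))

def binom_loglik(y: int, m: int, p: float) -> float:
    return y * math.log(p) + (m - y) * math.log(1.0 - p)

a, b = 34, 18          # cases: exposed, unexposed
c, d = 66, 82          # controls: exposed, unexposed
n1, n0 = a + c, b + d  # group sizes
n = n1 + n0
p_bar = (a + b) / n    # overall proportion of cases (null model)

# 1. Likelihood ratio test: twice the difference in maximized log likelihoods
ll_null = binom_loglik(a + b, n, p_bar)
ll_alt = binom_loglik(a, n1, a / n1) + binom_loglik(b, n0, b / n0)
lr_stat = 2.0 * (ll_alt - ll_null)

# 2. Score test: score and information evaluated at the null estimates
u = a - n1 * p_bar                         # sum of x_i * (y_i - p_bar)
v = p_bar * (1.0 - p_bar) * n1 * n0 / n    # p_bar(1 - p_bar) * sum (x_i - x_bar)^2
score_stat = u * u / v

# 3. Wald test: estimate divided by its standard error
beta1 = math.log((a * d) / (b * c))
se_beta1 = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
wald_z = beta1 / se_beta1

print(f"LR    chi2 = {lr_stat:.2f}, p = {chi2_1df_pvalue(lr_stat):.4f}")
print(f"Score chi2 = {score_stat:.2f}, p = {chi2_1df_pvalue(score_stat):.4f}")
print(f"Wald  z    = {wald_z:.2f}, p = {chi2_1df_pvalue(wald_z ** 2):.3f}")
```

The likelihood ratio statistic corresponds to the `lrtest` result, the score statistic to the chi2(1) reported by `cci`, and the Wald z to the z column of the `logistic` output.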
. cci 34 18 66 82
                                                         Proportion
                 |   Exposed   Unexposed  |      Total     Exposed
-----------------+------------------------+------------------------
           Cases |        34          18  |         52       0.6538
        Controls |        66          82  |        148       0.4459
-----------------+------------------------+------------------------
           Total |       100         100  |        200       0.5000
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
      Odds ratio |         2.346801       |    1.162749    4.819547 (exact)
 Attr. frac. ex. |         .5738881       |    .1399689    .7925116 (exact)
 Attr. frac. pop |         .3752345       |
                 +-------------------------------------------------
                               chi2(1) =     6.65  Pr>chi2 = 0.0099

. list
     | case   exp   cnt |
     |------------------|
  1. |    1     1    34 |
  2. |    1     0    18 |
  3. |    0     1    66 |
  4. |    0     0    82 |

. logistic case [fw=cnt], coef

Logistic regression                               Number of obs   =        200
                                                  LR chi2(0)      =       0.00
                                                  Prob > chi2     =          .
Log likelihood = -114.61138                       Pseudo R2       =     0.0000

------------------------------------------------------------------------------
        case |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |  -1.045969   .1612065    -6.49   0.000    -1.361927   -.7300097
------------------------------------------------------------------------------

. predict probnull, pr
. est store A
. logistic case exp [fw=cnt], coef

Logistic regression                               Number of obs   =        200
                                                  LR chi2(1)      =       6.74
                                                  Prob > chi2     =     0.0094
Log likelihood = -111.2429                        Pseudo R2       =     0.0294

------------------------------------------------------------------------------
        case |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         exp |   .8530533   .3351327     2.55   0.011     .1962052    1.509901
       _cons |  -1.516347   .2602896    -5.83   0.000    -2.026506   -1.006189
------------------------------------------------------------------------------

. predict probalt, pr
. est store B
. lrtest A B

Likelihood-ratio test                             LR chi2(1)  =      6.74
(Assumption: A nested in B)                       Prob > chi2 =    0.0094

. logistic

Logistic regression                               Number of obs   =        200
                                                  LR chi2(1)      =       6.74
                                                  Prob > chi2     =     0.0094
Log likelihood = -111.2429                        Pseudo R2       =     0.0294

------------------------------------------------------------------------------
        case | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         exp |   2.346801   .7864899     2.55   0.011     1.216777    4.526284
------------------------------------------------------------------------------

. list case exp cnt probnull probalt
     | case   exp   cnt   probnull   probalt |
     |---------------------------------------|
  1. |    1     1    34        .26       .34 |
  2. |    1     0    18        .26       .18 |
  3. |    0     1    66        .26       .34 |
  4. |    0     0    82        .26       .18 |
Summary about estimation
Maximum likelihood estimation is preferred for binary outcome data
The log likelihood depends on the betas, which in turn depend on what covariates are in the model
In some cases the beta estimates are related to familiar values, but usually we have to iterate to get the estimates
The difference in log likelihoods can be used to test one model against another when the models are nested
Odds ratios turn out to be related to the logistic regression coefficients: for a binary exposure, OR = exp(beta), e.g. exp(0.853) = 2.35 in the example
Example
Study of identification of domestic violence (DV) in a medical setting (PI: Robert Thompson, MD; Co-PI: Fred Rivara, MD)
Clinics were randomized to be either intervention (2 clinics) or control clinics (3 clinics)
Intervention clinics received training in DV detection and some support services; clinic enrollees received materials
Questions:
1. Did the intervention improve detection?
2. Did it improve the rate of physicians asking about DV?
Sample of patients chosen based on sentinel conditions for DV
Intervention Control Totals
Asked about DV 278 125 403
Not asked about DV 1094 1895 2989
Totals 1372 2020 3392
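The intervention group has a 20.3% rate versus 6.2% in the control clinics, which yields a rate ratio of 3.27 and an observed odds ratio
$$\widehat{OR} = \frac{278 \times 1895}{125 \times 1094} = 3.85$$
Compare the two binomial proportions
Suppose the dataset looks like the following for variables Y, N, TRT
    Y      N    TRT
  278   1372      1
  125   2020      0
Consider a logistic model for TRT
$$\pi = \Pr(Y=1 \mid X_1) = \frac{e^{\beta_0 + \beta_1 X_1}}{1 + e^{\beta_0 + \beta_1 X_1}}$$
$$\text{logit}(\pi) = \log\frac{\pi}{1-\pi} = \log\left(\frac{e^{\beta_0 + \beta_1 X_1}}{1 + e^{\beta_0 + \beta_1 X_1}} \Big/ \frac{1}{1 + e^{\beta_0 + \beta_1 X_1}}\right) = \log e^{\beta_0 + \beta_1 X_1}$$
$$\text{logit}(\pi) = \beta_0 + \beta_1 X_1$$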
Consider fitting in Stata

. infile y n trt using d:\biost536\data\dv1.dat
(2 observations read)

. list y n trt
  1.      278   1372      1
  2.      125   2020      0

Blocked logit model
blogit { # cases } { denom } { covariates } , [or]

. blogit y n trt

Logit Estimates                                   Number of obs =   3392
                                                  chi2(1)       = 152.31
                                                  Prob > chi2   = 0.0000
Log Likelihood = -1160.3808                       Pseudo R2     = 0.0616

------------------------------------------------------------------------------
_outcome |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     trt |   1.348686   .1141858     11.811   0.000       1.124886    1.572485
   _cons |   -2.71866    .092343    -29.441   0.000      -2.899649   -2.537671
------------------------------------------------------------------------------

. blogit y n trt, or

Logit Estimates                                   Number of obs =   3392
                                                  chi2(1)       = 152.31
                                                  Prob > chi2   = 0.0000
Log Likelihood = -1160.3808                       Pseudo R2     = 0.0616

------------------------------------------------------------------------------
_outcome | Odds Ratio   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     trt |   3.852358   .4398845     11.811   0.000       3.079864     4.81861
------------------------------------------------------------------------------

CI is a Wald confidence interval based on $\hat\beta_1 \pm 1.96 \cdot se(\hat\beta_1)$
Top right-hand $\chi^2$ is a likelihood ratio test of $H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0$
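With a single binary covariate the coefficients and the Wald interval can be reproduced directly from the 2x2 counts. The sketch below is an illustration in Python rather than Stata; it uses the standard large-sample standard error of a log odds ratio from a 2x2 table, sqrt(1/a + 1/b + 1/c + 1/d), and the variable names are chosen here. Small differences from the printed output in the last digits are to be expected.

```python
import math

# Asked about DV: (events, denominator) for each arm
y1, n1 = 278, 1372   # intervention (trt = 1)
y0, n0 = 125, 2020   # control (trt = 0)

odds1 = y1 / (n1 - y1)
odds0 = y0 / (n0 - y0)

beta0 = math.log(odds0)            # intercept: log odds in the control arm
beta1 = math.log(odds1 / odds0)    # log odds ratio for trt
se_beta1 = math.sqrt(1 / y1 + 1 / (n1 - y1) + 1 / y0 + 1 / (n0 - y0))

z_crit = 1.959964                  # 97.5th percentile of N(0, 1)
ci_low = beta1 - z_crit * se_beta1
ci_high = beta1 + z_crit * se_beta1

odds_ratio = math.exp(beta1)
or_low, or_high = math.exp(ci_low), math.exp(ci_high)
risk_ratio = (y1 / n1) / (y0 / n0)

print(f"_cons = {beta0:.5f}")
print(f"trt   = {beta1:.6f} (se {se_beta1:.6f}), Wald CI {ci_low:.4f} to {ci_high:.4f}")
print(f"OR    = {odds_ratio:.6f}, Wald CI {or_low:.4f} to {or_high:.4f}")
print(f"risk ratio = {risk_ratio:.4f}")
```

The confidence interval for the odds ratio is obtained by exponentiating the endpoints of the interval for the coefficient, which is why it is not symmetric around 3.85.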
Stata includes standard epidemiologic comparisons (EPITAB)

Incidence rate ratio (using n as the person-time variable)

. ir y trt n
             trt |   Exposed   Unexposed  |      Total
-----------------+------------------------+----------
               y |       278         125  |        403
               n |      1372        2020  |       3392
-----------------+------------------------+----------
  Incidence Rate |  .2026239    .0618812  |    .118809
                 |      Pt. Est.          |  [95% Conf. Interval]
                 |------------------------+----------------------
 Inc. rate diff. |         .1407427       |   .1145701   .1669153
 Inc. rate ratio |         3.274402       |    2.64199    4.07705 (exact)
 Attr. frac. ex. |         .6946008       |   .6214974   .7547246 (exact)
 Attr. frac. pop |         .4791539       |
                 +-----------------------------------------------
                    (midp)   Pr(k>=278) =         0.0000 (exact)
                    (midp) 2*Pr(k>=278) =         0.0000 (exact)
Cumulative incidence risk ratio (using the 2x2 cell entries)

. csi 278 125 1094 1895 , exact or
                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+----------
           Cases |       278         125  |        403
        Noncases |      1094        1895  |       2989
-----------------+------------------------+----------
           Total |      1372        2020  |       3392
                 |                        |
            Risk |  .2026239    .0618812  |    .118809
                 |                        |
                 |      Pt. Est.          |  [95% Conf. Interval]
                 |------------------------+----------------------
 Risk difference |         .1407427       |   .1170199   .1644655
      Risk ratio |         3.274402       |   2.681872   3.997846
 Attr. frac. ex. |         .6946008       |   .6271261   .7498653
 Attr. frac. pop |         .4791539       |
      Odds ratio |         3.852358       |   3.080882   4.816955 (Cornfield)
                 +-----------------------------------------------
                          1-sided Fisher's exact P = 0.0000
                          2-sided Fisher's exact P = 0.0000

Note that we get the same difference (0.141) and the same ratio (3.27) as with the incidence rate analysis, but different confidence intervals
The odds ratio we obtain is the same as that derived from the logistic regression
For the second question concerning case finding we have
Intervention Control Totals
DV identified 36 35 71
DV not identified 1336 1985 3321
Totals 1372 2020 3392
The intervention group has a 2.6% rate versus 1.7% in the control clinics, which yields a rate ratio of 1.51 and an observed OR = 1.53

. list y n trt
  1.       36   1372      1
  2.       35   2020      0

. blogit y n trt, or

Logit Estimates                                   Number of obs =   3392
                                                  chi2(1)       =   3.11
                                                  Prob > chi2   = 0.0780
Log Likelihood = -343.2194                        Pseudo R2     = 0.0045

------------------------------------------------------------------------------
_outcome | Odds Ratio   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     trt |   1.528229   .3667798      1.767   0.077       .9547672     2.44613
------------------------------------------------------------------------------

We do not see a significant difference, but this could be due to the small number of identified cases, so try the cumulative incidence comparison with an exact test
. csi 36 35 1336 1985 , exact or
                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+----------
           Cases |        36          35  |         71
        Noncases |      1336        1985  |       3321
-----------------+------------------------+----------
           Total |      1372        2020  |       3392
                 |                        |
            Risk |  .0262391    .0173267  |   .0209316
                 |                        |
                 |      Pt. Est.          |  [95% Conf. Interval]
                 |------------------------+----------------------
 Risk difference |         .0089123       |  -.0012817   .0191064
      Risk ratio |         1.514369       |   .9558286   2.399294
 Attr. frac. ex. |          .339659       |  -.0462127   .5832107
 Attr. frac. pop |         .1722214       |
      Odds ratio |         1.528229       |   .9585712   2.436415 (Cornfield)
                 +-----------------------------------------------
                          1-sided Fisher's exact P = 0.0497
                          2-sided Fisher's exact P = 0.0868
No evidence of a significant improvement in case-finding
Problems in this analysis:
1. Three control clinics may not be similar
2. Two intervention clinics may not have responded in similar ways
3. Clinics may not have been comparable at baseline
4. Have not adjusted for potential confounding variables
5. Did not account for the oversampling of those with sentinel conditions
Confounding
A confounding variable is related to both disease occurrence and exposure occurrence, and so distorts the observed relationship between exposure and disease
We can adjust for the confounder explicitly by modeling or implicitly through stratification
Two necessary relationships:
1. Confounder must be related to exposure in the data
2. Confounder must be independently related to disease in the population
Example 1: Suppose (1) Age → exposure to menopausal estrogens and (2) Age → endometrial cancer
Should we control for age?
Example 2: (1) menopausal estrogens → endometrial hyperplasia and (2) endometrial hyperplasia → endometrial cancer
Do we want to control for endometrial hyperplasia in studying the association between exposure to menopausal estrogens and endometrial cancer?
Failure to account for confounding can increase or decrease the odds ratio
Hypothetical data from Breslow & Day, Volume 1
Observational cohort study

Total sample size ($N_1 + N_2$) fixed by design
Individual confounder totals $N_1$, $N_2$ may be fixed by design if the confounder is known
Individual exposure totals ($m_{11} + m_{12}$) and ($m_{01} + m_{02}$) may be fixed by design if exposure is known at that time
$m_{11}$, $m_{12}$, $m_{01}$, $m_{02}$ may all be fixed by design
Random variable is disease incidence given exposure and confounder status, i.e. $X_D \sim \text{Binomial}(p)$ where p is the probability of disease given exposure and confounder values

Disease (2 levels), Exposure (2 levels), Confounder (2 levels)

              Confounder -                          Confounder +
            E       not E    Total                E       not E    Total
D           a1      b1       n11        D         a2      b2       n12
not D       c1      d1       n01        not D     c2      d2       n02
Total       m11     m01      N1         Total     m12     m02      N2

$X_D = 1$ if disease, $0$ if no disease
$X_E = 1$ if exposed, $0$ if unexposed
$X_C = 1$ if confounder +, $0$ if confounder -
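Model in terms of probabilities
1. $\Pr(X_D=1 \mid X_E=1, X_C=0)$ estimated by $\hat{p}_{10} = a_1 / m_{11}$
2. $\Pr(X_D=1 \mid X_E=0, X_C=0)$ estimated by $\hat{p}_{00} = b_1 / m_{01}$
3. $\Pr(X_D=1 \mid X_E=1, X_C=1)$ estimated by $\hat{p}_{11} = a_2 / m_{12}$
4. $\Pr(X_D=1 \mid X_E=0, X_C=1)$ estimated by $\hat{p}_{01} = b_2 / m_{02}$

$$\psi_1 = \frac{\Pr(X_D=1 \mid X_E=1, X_C=0) \,/\, \Pr(X_D=0 \mid X_E=1, X_C=0)}{\Pr(X_D=1 \mid X_E=0, X_C=0) \,/\, \Pr(X_D=0 \mid X_E=0, X_C=0)}$$
$$\hat\psi_1 = \frac{(a_1/m_{11}) \,/\, (c_1/m_{11})}{(b_1/m_{01}) \,/\, (d_1/m_{01})} = \frac{a_1 d_1}{b_1 c_1}$$
Similarly for confounder $X_C = 1$, $\hat\psi_2 = \dfrac{a_2 d_2}{b_2 c_2}$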
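Knowing the two estimates $\hat\psi_1$ and $\hat\psi_2$, we need only two other probabilities to describe the full model
Reparametrization $(\hat\psi_1, \hat\psi_2, \hat{p}_{00}, \hat{p}_{01}) \Leftrightarrow (\hat{p}_{10}, \hat{p}_{11}, \hat{p}_{00}, \hat{p}_{01})$
Can be written in terms of odds ratios specific to each level of the confounder and also the probability of disease given non-exposure at each level of the confounder
Now rewrite in terms of the logistic model
$$\pi = \Pr(X_D=1 \mid X_E, X_C) = \frac{e^{\alpha_1^* + \alpha_2^* X_C + \beta X_E + \gamma X_C * X_E}}{1 + e^{\alpha_1^* + \alpha_2^* X_C + \beta X_E + \gamma X_C * X_E}}$$
$$\text{logit}(\pi) = \log\frac{\pi}{1-\pi} = \alpha_1^* + \alpha_2^* X_C + \beta X_E + \gamma X_C * X_E$$
Intercept ($\alpha_1^*$)
Main effects for exposure ($\beta$) and the confounder ($\alpha_2^*$)
Interaction of exposure and confounder ($\gamma$)
Log odds ratio is the difference between two logits (exposed versus unexposed)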
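Confounder negative
$$\log\psi_1 = \text{logit}\,\Pr(X_D=1 \mid X_E=1, X_C=0) - \text{logit}\,\Pr(X_D=1 \mid X_E=0, X_C=0)$$
$$\log\psi_1 = (\alpha_1^* + \beta) - \alpha_1^* = \beta, \quad \psi_1 = e^{\beta}$$
Confounder positive
$$\log\psi_2 = \text{logit}\,\Pr(X_D=1 \mid X_E=1, X_C=1) - \text{logit}\,\Pr(X_D=1 \mid X_E=0, X_C=1)$$
$$\log\psi_2 = (\alpha_1^* + \alpha_2^* + \beta + \gamma) - (\alpha_1^* + \alpha_2^*) = \beta + \gamma, \quad \psi_2 = e^{\beta + \gamma}$$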
If this model is correct, would the Mantel-Haenszel approach give an alternative estimate of the odds ratio?
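Estimates:
$\hat\alpha_1^*$ estimates $\text{logit}\,\Pr(X_D=1 \mid X_E=0, X_C=0)$
$\hat\alpha_2^*$ estimates $\text{logit}\,\Pr(X_D=1 \mid X_E=0, X_C=1) - \text{logit}\,\Pr(X_D=1 \mid X_E=0, X_C=0)$
$\hat\beta$ estimates $\log\hat\psi_1$
$\hat\gamma$ estimates $\log\hat\psi_2 - \log\hat\psi_1$
Reparametrization $(\hat\alpha_1^*, \hat\alpha_2^*, \hat\beta, \hat\gamma) \Leftrightarrow (\hat{p}_{10}, \hat{p}_{11}, \hat{p}_{00}, \hat{p}_{01})$
So this is another way to obtain estimates of the probabilities, e.g.
$$\hat{p}_{11} = \frac{e^{\hat\alpha_1^* + \hat\alpha_2^* + \hat\beta + \hat\gamma}}{1 + e^{\hat\alpha_1^* + \hat\alpha_2^* + \hat\beta + \hat\gamma}}$$
Prefer the initial parametrization because we may want to test whether $\gamma = 0$ (no interaction) or $\beta = \gamma = 0$ (no association of exposure with outcome for either confounder level)
The $\alpha$'s give us the baseline levels for the two confounder levels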
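The one-to-one correspondence between the four probabilities and the four logistic parameters can be sketched numerically. The probabilities below are hypothetical values invented purely for illustration (they are not the Breslow and Day data and not from any study); the dictionary key is (exposure, confounder).

```python
import math

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

def expit(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical probabilities p[(E, C)] = Pr(disease | exposure E, confounder C)
p = {(1, 0): 0.30, (0, 0): 0.10, (1, 1): 0.50, (0, 1): 0.20}

# From probabilities to logistic parameters
alpha1 = logit(p[(0, 0)])                          # baseline, confounder negative
alpha2 = logit(p[(0, 1)]) - logit(p[(0, 0)])       # confounder effect in the unexposed
beta = logit(p[(1, 0)]) - logit(p[(0, 0)])         # log odds ratio, confounder negative
gamma = (logit(p[(1, 1)]) - logit(p[(0, 1)])) - beta   # difference of the two log odds ratios

def model_prob(x_e: int, x_c: int) -> float:
    """Back from parameters to a probability via the logistic model."""
    return expit(alpha1 + alpha2 * x_c + beta * x_e + gamma * x_c * x_e)

psi1 = math.exp(beta)           # odds ratio, confounder negative
psi2 = math.exp(beta + gamma)   # odds ratio, confounder positive

for key in sorted(p):
    print(key, round(model_prob(*key), 4))
print(f"psi1 = {psi1:.3f}, psi2 = {psi2:.3f}, gamma = {gamma:.4f}")
```

Because four parameters describe four probabilities, the model reproduces each probability exactly; setting gamma to zero would instead force the two stratum-specific odds ratios to be equal.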
Case-control study

Disease (2 levels), Exposure (2 levels), Confounder (2 levels)

              Confounder -                          Confounder +
            E       not E    Total                E       not E    Total
D           a1      b1       n11        D         a2      b2       n12
not D       c1      d1       n01        not D     c2      d2       n02
Total       m11     m01      N1         Total     m12     m02      N2

$X_D = 1$ if disease, $0$ if no disease
$X_E = 1$ if exposed, $0$ if unexposed
$X_C = 1$ if confounder +, $0$ if confounder -

Also need to include sampling fractions
$Z = 1$ if sampled, $0$ if not sampled
$\Pr(X_D=1 \mid X_E, X_C, Z)$ depends also on the sampling scheme
$$\tau_{1X_C} = \Pr(Z=1 \mid X_D=1, X_C)$$
sampling fraction for cases
probability of being sampled given case status and confounding variables
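$$\tau_{0X_C} = \Pr(Z=1 \mid X_D=0, X_C)$$
sampling fraction for controls
probability of being sampled given control status and confounding variables
Typically, $\tau_{1X_C} > \tau_{0X_C}$ since cases are scarce, often $\tau_{1X_C} = 1$
With case-control data we only know exposure status if they get sampled
$$\text{logit}(\pi) = \log\frac{\pi}{1-\pi} = \log\frac{\Pr(X_D=1 \mid X_E, X_C, Z=1)}{\Pr(X_D=0 \mid X_E, X_C, Z=1)}$$
$$= \log\frac{\Pr(X_D=1, X_E, X_C, Z=1)}{\Pr(X_D=0, X_E, X_C, Z=1)} = \log\left[\frac{\Pr(Z=1 \mid X_D=1, X_E, X_C)}{\Pr(Z=1 \mid X_D=0, X_E, X_C)} \cdot \frac{\Pr(X_D=1, X_E, X_C)}{\Pr(X_D=0, X_E, X_C)}\right]$$
$$= \log\frac{\tau_{1X_C}}{\tau_{0X_C}} + \text{logit}\,\Pr(X_D=1 \mid X_E, X_C)$$
$$= \log\frac{\tau_{1X_C}}{\tau_{0X_C}} + \alpha_1^* + \alpha_2^* X_C + \beta X_E + \gamma X_C * X_E$$
1st part depends on the sampling and 2nd on exposure and confounders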
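Case-control log odds ratio is the difference between two logits (exposed versus unexposed), where $\tau_{dc} = \Pr(Z=1 \mid X_D=d, X_C=c)$ is the sampling fraction
Confounder negative
$$\log\psi_1 = \text{logit}\,\Pr(X_D=1 \mid X_E=1, X_C=0, Z=1) - \text{logit}\,\Pr(X_D=1 \mid X_E=0, X_C=0, Z=1)$$
$$\log\psi_1 = \left[\log\frac{\tau_{10}}{\tau_{00}} + \alpha_1^* + \beta\right] - \left[\log\frac{\tau_{10}}{\tau_{00}} + \alpha_1^*\right] = \beta$$
Confounder positive
$$\log\psi_2 = \text{logit}\,\Pr(X_D=1 \mid X_E=1, X_C=1, Z=1) - \text{logit}\,\Pr(X_D=1 \mid X_E=0, X_C=1, Z=1)$$
$$\log\psi_2 = \left[\log\frac{\tau_{11}}{\tau_{01}} + \alpha_1^* + \alpha_2^* + \beta + \gamma\right] - \left[\log\frac{\tau_{11}}{\tau_{01}} + \alpha_1^* + \alpha_2^*\right] = \beta + \gamma$$
Therefore, the same odds ratios are obtained from a case-control study as from a cohort study
The probability model can be estimated from a cohort study where
$$\text{logit}(\pi) = \log\frac{\pi}{1-\pi} = \alpha_1^* + \alpha_2^* X_C + \beta X_E + \gamma X_C * X_E$$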
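The case-control probability model is
$$\text{logit}(\pi) = \log\frac{\pi}{1-\pi} = \left[\alpha_1^* + \log\frac{\tau_{10}}{\tau_{00}}\right] + \left[\alpha_2^* + \log\frac{\tau_{11}}{\tau_{01}} - \log\frac{\tau_{10}}{\tau_{00}}\right] X_C + \beta X_E + \gamma X_C * X_E$$
$$= \alpha_1 + \alpha_2 X_C + \beta X_E + \gamma X_C * X_E$$
$$\alpha_1 = \alpha_1^* + \log\frac{\tau_{10}}{\tau_{00}}, \qquad \alpha_2 = \alpha_2^* + \log\frac{\tau_{11}}{\tau_{01}} - \log\frac{\tau_{10}}{\tau_{00}}$$
where $\tau_{dc} = \Pr(Z=1 \mid X_D=d, X_C=c)$ is the sampling fraction
In the case-control model we estimate $\alpha_1$ and $\alpha_2$, but the sampling fractions are usually not known so we cannot estimate $\alpha_1^*$ and $\alpha_2^*$
Cannot estimate the absolute probability of disease given exposure from a case-control study
1. If $\dfrac{\tau_{11}}{\tau_{01}} = \dfrac{\tau_{10}}{\tau_{00}}$ then $\alpha_2 = \alpha_2^*$ and the actual effect of the confounder can be estimated
2. If $\tau_{11} = \tau_{01}$ and $\tau_{10} = \tau_{00}$ then $\alpha_1 = \alpha_1^*$ and $\alpha_2 = \alpha_2^*$ and the actual absolute estimate of disease probability can be made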
Example of sampling proportions
Age is the confounder, equally divided into young and old
Population looks like the following:

              Young    Old    Total
Cases          100     200     300
Not cases      900     800    1700

If we take all cases, i.e. $\tau_{10} = \tau_{11} = 1$, then we have twice as many old cases as young cases
Want to choose 300 controls from those without disease
Choice 1: Frequency matching - 200 old and 100 young controls
$$\tau_{01} = \frac{200}{800} = 0.25, \qquad \tau_{00} = \frac{100}{900} = 0.11$$
Lose the ability to evaluate age effect in a case-control study
Choice 2: Sampling controls without regard to age
$$\tau_{00} = \tau_{01} = \frac{300}{1700} = 0.176$$
Expected number of elderly controls 800*.176 = 141
Expected number of young controls 900*.176 = 159
Can still evaluate the age effect in a case-control study, but may lose efficiency for assessing exposure since we still need to adjust for age
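The effect of the two control-sampling choices on the intercept and the age coefficient can be worked out from the sampling fractions. The sketch below assumes young corresponds to confounder level 0 and old to level 1, and that all cases are sampled; the function and variable names are chosen here for illustration.

```python
import math

cases = {"young": 100, "old": 200}
noncases = {"young": 900, "old": 800}

def coefficient_shifts(controls: dict) -> tuple:
    """Return (alpha1 - alpha1*, alpha2 - alpha2*) implied by the sampling fractions."""
    tau_case = {age: 1.0 for age in cases}                     # all cases are sampled
    tau_ctrl = {age: controls[age] / noncases[age] for age in noncases}
    shift_alpha1 = math.log(tau_case["young"] / tau_ctrl["young"])
    shift_alpha2 = math.log(tau_case["old"] / tau_ctrl["old"]) - shift_alpha1
    return shift_alpha1, shift_alpha2

# Choice 1: frequency matching on age (100 young and 200 old controls)
choice1 = {"young": 100, "old": 200}

# Choice 2: 300 controls sampled without regard to age (expected counts)
fraction = 300 / 1700
choice2 = {age: noncases[age] * fraction for age in noncases}

shift1 = coefficient_shifts(choice1)
shift2 = coefficient_shifts(choice2)

print(f"choice 1: intercept shift {shift1[0]:.3f}, age coefficient shift {shift1[1]:.3f}")
print(f"choice 2: intercept shift {shift2[0]:.3f}, age coefficient shift {shift2[1]:.3f}")
print(f"choice 2 expected controls: young {choice2['young']:.0f}, old {choice2['old']:.0f}")
```

Under frequency matching the age coefficient is shifted by log(4/9), so the age effect is not estimable without knowing the sampling fractions; under age-blind sampling the shift is zero and only the intercept is affected.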
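Summary about confounding
We can control for confounding by stratification or modeling
For a cohort sample with a binary exposure, binary confounder, and a binary outcome the probability model is
$$\text{logit}(\pi) = \log\frac{\pi}{1-\pi} = \alpha_1^* + \alpha_2^* X_C + \beta X_E + \gamma X_C * X_E$$
so $\log\psi_1 = \beta$ and $\log\psi_2 = \beta + \gamma$
For a case-control sample with a binary exposure, binary confounder, and a binary outcome the probability model is
$$\text{logit}(\pi) = \log\frac{\pi}{1-\pi} = \alpha_1 + \alpha_2 X_C + \beta X_E + \gamma X_C * X_E$$
so $\log\psi_1 = \beta$ and $\log\psi_2 = \beta + \gamma$
$$\alpha_1 = \alpha_1^* + \log\frac{\tau_{10}}{\tau_{00}}, \qquad \alpha_2 = \alpha_2^* + \log\frac{\tau_{11}}{\tau_{01}} - \log\frac{\tau_{10}}{\tau_{00}}$$
but the sampling fractions $\tau_{dc} = \Pr(Z=1 \mid X_D=d, X_C=c)$ may be unknown
The odds ratios can be estimated from either cohort or case-control studies, but absolute risk probabilities can be estimated only from a cohort study unless the sampling probabilities are known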