Post on 02-Oct-2021
Logistic regression analysis
Thomas Alexander Gerds
Department of Biostatistics, University of Copenhagen
Carpenter et al.
Regression
The type of the outcome variable determines which kind of model is relevant:
Quantitative (continuous) outcome
- Linear regression
- Association parameters: differences between mean values
0-1 (binary) outcome
- Logistic regression
- Association parameters: odds ratios, differences between log(odds)

Categorical explanatory variable
Group 1, ..., K (especially binary: K=2)
Linear regression, continuous outcome Y
$$ \text{mean}(Y \mid \text{group } k) - \text{mean}(Y \mid \text{reference group}) $$
E.g., the average systolic blood pressure was higher in males compared to females.
Logistic regression, binary outcome
$$ \text{Odds ratio} = \frac{\text{odds}(\text{group } k)}{\text{odds}(\text{reference group})} $$
E.g., the risk (odds) of coronary heart disease was higher in males compared to females.

Quantitative (continuous) explanatory variables
Linear regression, continuous outcome Y
Differences in mean values per unit of X:
$$ \text{mean}(Y \mid x+1) - \text{mean}(Y \mid x) $$
E.g., the average systolic blood pressure increased with age.
Logistic regression, binary outcome
Differences in log(odds) per unit of X:
$$ \text{Odds ratio} = \frac{\text{odds}(x+1)}{\text{odds}(x)} $$
E.g., the risk (odds) of coronary heart disease increased with age.

Linearity in regression models
For a continuous explanatory variable X, linearity means that the effect of a unit change of X on the outcome does not depend on the value of X.
Linear regression, continuous outcome Y
$$ \text{mean}(Y \mid 45+1) - \text{mean}(Y \mid 45) = \text{mean}(Y \mid 46+1) - \text{mean}(Y \mid 46) = \dots = \text{mean}(Y \mid 61+1) - \text{mean}(Y \mid 61) $$
Logistic regression, binary outcome
$$ \frac{\text{odds}(45+1)}{\text{odds}(45)} = \frac{\text{odds}(46+1)}{\text{odds}(46)} = \dots = \frac{\text{odds}(61+1)}{\text{odds}(61)} $$
Linearity is a model assumption which should be investigated.

Binary outcome regression
If the outcome variable is binary:
$$ Y_i = \begin{cases} 1 & \text{if } i \text{ is diseased} \\ 0 & \text{if } i \text{ is not diseased} \end{cases} $$
then linear regression
$$ Y_i = \beta_0 + \beta_1 X_i $$
is not appropriate: the regression line will go below 0 and above 1.
[Figure: binary outcome plotted against a continuous explanatory variable (x-axis: 20 to 80; y-axis: -25 % to 125 %), illustrating the linear regression line.]

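The problem can be reproduced with a few lines of R. The sketch below uses made-up data (not the data behind the figure): every subject with x above 50 is diseased and everyone else is not. The names toy and fit.lm are chosen for this illustration only. A straight line fitted with lm() then predicts values below 0 at the lower end and above 1 at the upper end of the x range.

```r
# Made-up illustration data: x is an age-like variable from 20 to 80,
# y is a 0/1 outcome that switches from 0 to 1 after x = 50.
toy <- data.frame(x = 20:80)
toy$y <- as.numeric(toy$x > 50)

# Ordinary linear regression applied to the binary outcome
fit.lm <- lm(y ~ x, data = toy)

# Fitted "probabilities" at the two ends of the x range
pred <- predict(fit.lm, newdata = data.frame(x = c(20, 80)))
print(round(pred, 2))
```

Neither fitted value is a valid probability, which is the motivation for modelling log(odds) instead.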
(Multiple) logistic regression
We denote the probability of the event Y_i = 1 for a subject with explanatory variables X_i, Z_i, ... as
$$ P(Y_i = 1 \mid X_i, Z_i, \dots) = p_i. $$
The idea is to use the logit function. Instead of p_i, which is bounded between 0 and 1, we apply linear regression to log(odds):
$$ \text{logit}(p_i) = \log\left(\frac{p_i}{1 - p_i}\right) = a + b_1 Z_i + b_2 X_i + \dots $$
$\log\left(\frac{p_i}{1 - p_i}\right)$ can take both negative and positive values.

Warm-up exercises
Complete the following table:

    p_i     | odds_i | logit(p_i)
    0.001%  |        |
            |        | -7.0
            |        | -4.5
    2.1%    |        |
    8%      |        |
    50%     |        |
            |        | 3.8
    99%     |        |
            |        | 11.5

Hint: the following functions take a vector as argument

    logit <- function(p){log(p/(1 - p))}
    expit <- function(x){exp(x)/(1 + exp(x))}

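As a small usage sketch of the two helper functions from the hint, the following applies them to a vector of probabilities. The probabilities are chosen for illustration only and are not the values of the exercise table.

```r
logit <- function(p){log(p/(1 - p))}
expit <- function(x){exp(x)/(1 + exp(x))}

# A vector of probabilities (illustration values only)
p <- c(0.10, 0.25, 0.75, 0.90)

odds <- p/(1 - p)   # odds corresponding to each probability
x <- logit(p)       # log(odds)

# expit() is the inverse of logit(): it maps log(odds) back to probabilities
p.back <- expit(x)
print(data.frame(p = p, odds = odds, logit = x, expit.of.logit = p.back))
```

Base R provides the same two functions as qlogis() (logit) and plogis() (expit).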
Example: Framingham study
- SEX: 1 for males, 2 for females
- AGE: age (years) at baseline (45-62)
- FRW: "Framingham relative weight" (pct.) at baseline (52-222; 11 persons have missing values)
- SBP: systolic blood pressure at baseline (mmHg) (90-300)
- DBP: diastolic blood pressure at baseline (mmHg) (50-160)
- CHOL: cholesterol at baseline (mg/100ml) (96-430)
- CIG: cigarettes per day at baseline (0-60; 1 person has a missing value)
- CHD: 0 if no "coronary heart disease" during follow-up, 1 if "coronary heart disease" at baseline (prevalent cases), x = 2-10 if "coronary heart disease" was diagnosed at follow-up no. x

Framingham study: data preparation

    library(data.table)
    framingham <- fread("data/Framingham.csv")
    ## remove prevalent cases
    framingham <- framingham[CHD!=1,]
    ## define factor levels/labels
    framingham[,Smoke:=factor(CIG>0,levels=c(FALSE,TRUE),labels=c("No","Yes"))]
    framingham[,Sex:=factor(SEX,levels=c(1,2),labels=c("Male","Female"))]
    ## define binary outcome variable
    framingham[,Y:=factor(CHD>1,levels=c(FALSE,TRUE),labels=c("no CHD","CHD"))]
    framingham

            ID SEX AGE FRW SBP DBP CHOL CIG CHD Smoke    Sex      Y
       1: 1070   2  45  93 100  62  220   0   0    No Female no CHD
       2: 1081   1  48  93 108  70  340   0   0    No   Male no CHD
       3: 1123   2  45  91 160 100  171   0   0    No Female no CHD
       4: 1215   1  50 110 110  70  224   0   0    No   Male no CHD
       5: 1267   1  48  85 110  70  229  25   0   Yes   Male no CHD
      ---
    1359: 6432   1  47 113 155 105  175   5   5   Yes   Male    CHD
    1360: 6434   1  59  98 124  84  227  20   2   Yes   Male    CHD
    1361: 6437   2  55 111 108  74  231   0   0    No Female no CHD
    1362: 6440   1  49 114 110  80  218  20   0   Yes   Male no CHD
    1363: 6442   2  51  95 152  90  199   1   0   Yes Female no CHD

Framingham outcome
i = subject number: 1, ..., 1406
X_i = age of subject i
Z_i = gender of subject i
V_i = smoking status of subject i
$$ Y_i = \begin{cases} 1 & \text{subject } i \text{ develops coronary heart disease (CHD)} \\ 0 & \text{subject } i \text{ does not develop CHD} \end{cases} $$
$$ p_i = P(Y_i = 1 \mid X_i, V_i, Z_i, \dots) = \text{probability of CHD of subject } i $$
$$ \frac{p_i}{1 - p_i} = \text{odds of CHD of subject } i $$

A binary explanatory variable
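$$ Z_i = \begin{cases} 1 & \text{if } i \text{ is a man} \\ 0 & \text{if } i \text{ is a woman} \end{cases} $$
Simple logistic regression:
$$ \log\left(\frac{p_i}{1 - p_i}\right) = a + b Z_i = \begin{cases} a & \text{females} \\ a + b & \text{males} \end{cases} $$
That means,
$$ b = (a + b) - a = \log(\text{odds for males}) - \log(\text{odds for females}) = \log\left(\frac{\text{odds for males}}{\text{odds for females}}\right) $$
and
$$ -b = a - (a + b) = \log\left(\frac{\text{odds for females}}{\text{odds for males}}\right) $$
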
Exercise: 2 by 2 contingency table
    framingham[,table(Sex,Y)]

             no CHD CHD
      Male      479 164
      Female    616 104

- Use the tools for 2x2 tables
- Compute the odds ratio with 95% confidence limits and the corresponding p-value
- Report and interpret the result in a sentence

Logistic regression in R
    fit1 <- glm(Y~Sex, data=framingham, family=binomial)

- Y ~ Sex tells R that Y is the outcome and Sex the explanatory variable
- data=framingham tells R where to find Y and Sex
- glm means generalized linear model
- family=binomial tells R that the outcome is binary and the logit link should be used

Logistic regression in R

    fit1 <- glm(Y~Sex,data=framingham,family=binomial)
    summary(fit1)

    Call:
    glm(formula = Y ~ Sex, family = binomial, data = framingham)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -0.7674  -0.7674  -0.5586  -0.5586   1.9672

    Coefficients:
                Estimate Std. Error z value    Pr(>|z|)
    (Intercept) -1.07183    0.09047 -11.847     < 2e-16 ***
    SexFemale   -0.70702    0.13937  -5.073 0.000000392 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 1351.2  on 1362  degrees of freedom
    Residual deviance: 1324.9  on 1361  degrees of freedom
    AIC: 1328.9

    Number of Fisher Scoring iterations: 4

Confidence intervals for the odds ratio
    library(Publish)
    fit1 <- glm(Y~Sex,data=framingham,family=binomial)
    publish(fit1)

     Variable  Units OddsRatio       CI.95  p-value
          Sex   Male      1.00 [1.00;1.00]        1
              Female      0.49 [0.38;0.65]  <0.0001

Note: 0.49 = exp(-0.71)
Women have a significantly lower risk of developing coronary heart disease than men (odds ratio: 0.49, 95%-CI: [0.38; 0.65], p-value <0.0001).

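The odds ratio and its confidence interval can be traced back to the four counts of the 2 by 2 table of sex and CHD status (479/164 for males, 616/104 for females). The sketch below does not need the data file: it computes the odds ratio by hand and, as a cross-check, fits the same logistic regression to the aggregated counts. The confidence interval uses the normal approximation on the log(odds ratio) scale; that this is how the interval in the table was obtained is an assumption. The object names are made up for this illustration.

```r
# Counts from the table of Sex by Y (prevalent cases removed)
tab <- data.frame(
  Sex   = factor(c("Male", "Female"), levels = c("Male", "Female")),
  chd   = c(164, 104),
  nochd = c(479, 616)
)

# Odds ratio by hand: odds of CHD in females divided by odds in males
odds <- tab$chd / tab$nochd
or.hand <- odds[2] / odds[1]

# Standard error of log(odds ratio) and Wald 95% confidence limits
se <- sqrt(sum(1 / c(tab$chd, tab$nochd)))
ci.hand <- exp(log(or.hand) + c(-1, 1) * qnorm(0.975) * se)

# Same model as fit1, fitted to the aggregated counts
fit.tab <- glm(cbind(chd, nochd) ~ Sex, family = binomial, data = tab)
or.glm <- exp(coef(fit.tab)[["SexFemale"]])

print(round(c(OR = or.hand, lower = ci.hand[1], upper = ci.hand[2]), 2))
```

The coefficient for SexFemale is simply the log of the odds ratio that can be read off the 2 by 2 table.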
Changing the reference level
    framingham[,sex:=relevel(Sex,"Female")]
    fit1a <- glm(Y~sex,data=framingham,family=binomial)
    publish(fit1a)

     Variable  Units OddsRatio       CI.95  p-value
          sex Female      1.00 [1.00;1.00]        1
                Male      2.03 [1.54;2.66]  <0.0001

Note: 2.03 = exp(0.71)
Men have a significantly higher risk of developing coronary heart disease than women (odds ratio: 2.03, 95%-CI: [1.54; 2.66], p-value <0.0001).

Two explanatory variables:
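$$ Z_i = \begin{cases} 1 & \text{if } i \text{ is male} \\ 0 & \text{if } i \text{ is female} \end{cases} \quad \text{and} \quad V_i = \begin{cases} 1 & \text{if } i \text{ smokes} \\ 0 & \text{otherwise} \end{cases} $$
Data can be summarized as two 2 by 2 tables in two ways:

            | Males (Z=1)     | Females (Z=0)
            | V = 0  | V = 1  | V = 0  | V = 1
    Y = 0   | 191    | 288    | 423    | 192
    Y = 1   | 57     | 107    | 77     | 27

            | Smokers (V = 1)  | Non-smokers (V = 0)
            | Males  | Females | Males  | Females
    Y = 0   | 288    | 192     | 191    | 423
    Y = 1   | 107    | 27      | 57     | 77
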
Cochran-Mantel-Haenszel test
In this way, we can study the effect of smoking adjusted for sex:
$$ \text{OR}_{\text{Mantel-Haenszel}} = 0.97; \quad p > 0.05 $$
and also study the effect of sex adjusted for smoking:
$$ \text{OR}_{\text{Mantel-Haenszel}} = 2.03; \quad p < 0.05 $$
Conclusions
- There is no significant effect of smoking adjusted for sex
- There is a significant effect of sex adjusted for smoking

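A minimal sketch of the Mantel-Haenszel analysis of smoking adjusted for sex, using only the counts from the sex-stratified 2 by 2 tables. The layout of the array and the object names are choices made for this illustration. With the orientation used here (smokers vs. non-smokers) the estimate is slightly above 1; its reciprocal (non-smokers vs. smokers) appears to correspond to the value 0.97 reported above.

```r
# Counts from the two 2 by 2 tables stratified by sex.
# Dimensions: smoking status x CHD status x sex (stratum).
tab <- array(
  c(107, 57, 288, 191,   # males:   smokers, non-smokers with CHD, then without CHD
    27, 77, 192, 423),   # females: same order
  dim = c(2, 2, 2),
  dimnames = list(Smoke = c("Yes", "No"),
                  CHD   = c("Yes", "No"),
                  Sex   = c("Male", "Female"))
)

# Mantel-Haenszel estimate of the common odds ratio (smokers vs. non-smokers)
n <- apply(tab, 3, sum)   # stratum sizes
or.mh <- sum(tab[1, 1, ] * tab[2, 2, ] / n) /
         sum(tab[1, 2, ] * tab[2, 1, ] / n)

# The same analysis with the built-in test from the stats package
mh <- mantelhaen.test(tab)

print(round(c(OR.smokers.vs.nonsmokers = or.mh, reciprocal = 1 / or.mh), 2))
print(mh$p.value)
```

Swapping the roles of the Smoke and Sex dimensions gives the corresponding analysis of sex adjusted for smoking.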
Logistic regression model: two binary variables
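$$ \log\left(\frac{p_i}{1 - p_i}\right) = a + b_1 Z_i + b_2 V_i = \begin{cases} a & \text{female non-smoker} \\ a + b_1 & \text{male non-smoker} \\ a + b_2 & \text{female smoker} \\ a + b_1 + b_2 & \text{male smoker} \end{cases} $$
Note:
$$ b_1 = (a + b_1) - a = (a + b_1 + b_2) - (a + b_2) = \log \text{OR (males vs. females for given smoking status)} $$
and
$$ b_2 = (a + b_2) - a = (a + b_1 + b_2) - (a + b_1) = \log \text{OR (smokers vs. non-smokers for given gender)} $$
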
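The additive structure can be checked numerically without the individual-level data. The sketch below fits the model to the aggregated counts of the 2 by 2 tables shown earlier, coding females and non-smokers as reference levels so that the coefficients play the roles of b1 and b2. The data frame counts and the other object names are made up for this illustration.

```r
# Aggregated counts for the four combinations of sex and smoking status
counts <- data.frame(
  Sex   = factor(c("Female", "Male", "Female", "Male"), levels = c("Female", "Male")),
  Smoke = factor(c("No", "No", "Yes", "Yes"), levels = c("No", "Yes")),
  chd   = c(77, 57, 27, 107),
  nochd = c(423, 191, 192, 288)
)

# Additive logistic regression: log(odds) = a + b1 * male + b2 * smoker
fit <- glm(cbind(chd, nochd) ~ Sex + Smoke, family = binomial, data = counts)
b <- coef(fit)

# Fitted log(odds) for the four groups
counts$logodds <- as.numeric(predict(fit, type = "link"))
print(counts)

# log OR males vs. females, separately within non-smokers and within smokers
lor.nonsmokers <- counts$logodds[2] - counts$logodds[1]
lor.smokers    <- counts$logodds[4] - counts$logodds[3]
print(c(b1 = b[["SexMale"]], nonsmokers = lor.nonsmokers, smokers = lor.smokers))
```

Both differences equal b1, which is what "for given smoking status" means in the note above.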
Logistic regression results

    fit2=glm(Y~Sex+Smoke,data=framingham,family=binomial)
    summary(fit2)

    Call:
    glm(formula = Y ~ Sex + Smoke, family = binomial, data = framingham)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -0.7716  -0.7607  -0.5564  -0.5564   1.9708

    Coefficients:
                Estimate Std. Error z value   Pr(>|z|)
    (Intercept) -1.09215    0.12717  -8.588    < 2e-16 ***
    SexFemale   -0.69521    0.14635  -4.750 0.00000203 ***
    SmokeYes     0.03296    0.14457   0.228       0.82
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 1350.8  on 1361  degrees of freedom
    Residual deviance: 1324.5  on 1359  degrees of freedom
      (1 observation deleted due to missingness)
    AIC: 1330.5

    Number of Fisher Scoring iterations: 4

Extracting odds ratios with confidence intervals
    publish(fit2,intercept=TRUE)

        Variable  Units Missing OddsRatio       CI.95  p-value
     (Intercept)                     0.34 [0.26;0.43]  <0.0001
             Sex   Male       0       Ref
                 Female              0.50 [0.37;0.66]  <0.0001
           Smoke     No       1       Ref
                    Yes              1.03 [0.78;1.37]   0.8196

Logistic regression adjusted for smoking status showed 50% lower odds of CHD in women compared to men (odds ratio: 0.50, 95%-CI: [0.37;0.66], p<0.0001).
Exercise: Based on this model, compute the risk of CHD for a non-smoking woman, a non-smoking man, a smoking woman and a smoking man.

Simple logistic regression: categorical explanatory variable:
Categorize age into 4 intervals: 45-48, 49-52, 53-56, 57-62
Summarize in a 2 by 4 table:

            | X = 0 | X = 1 | X = 2 | X = 3 |
            | 45-48 | 49-52 | 53-56 | 57-62 | Total
    Y = 0   | 308   | 298   | 254   | 235   | 1095
    Y = 1   | 51    | 61    | 64    | 92    | 268
    Total   | 359   | 359   | 318   | 327   | 1363

(Note: both males and females)

ANOVA: χ2 test
We may test whether the risk of CHD differs between the 4 age groups using a chi-square test statistic, in this case with 3 degrees of freedom.
Null hypothesis:
$$ \text{Odds}(\text{age } 45\text{-}48) = \text{Odds}(\text{age } 49\text{-}52) = \text{Odds}(\text{age } 53\text{-}56) = \text{Odds}(\text{age } 57\text{-}62) $$
$$ \sum \frac{(\text{OBS} - \text{EXP})^2}{\text{EXP}} = 23.29 \sim \chi^2_3, \quad P < 0.001 $$
Conclusion: CHD risk differs significantly between the age groups.

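The test statistic can be reproduced from the counts in the 2 by 4 table. The sketch below computes the expected counts and the chi-square statistic by hand and compares the result with chisq.test(); the object names are made up for this illustration.

```r
# Counts from the 2 by 4 table of CHD status by age group
tab <- matrix(c(308, 298, 254, 235,
                 51,  61,  64,  92),
              nrow = 2, byrow = TRUE,
              dimnames = list(Y = c("0", "1"),
                              Age = c("45-48", "49-52", "53-56", "57-62")))

# Expected counts under the null hypothesis of equal risk in all age groups
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Chi-square statistic by hand and its p-value with 3 degrees of freedom
x2 <- sum((tab - expected)^2 / expected)
p <- pchisq(x2, df = 3, lower.tail = FALSE)

# The built-in test for comparison
test <- chisq.test(tab)
print(round(x2, 2))
print(p)
```

The degrees of freedom are (2 - 1) x (4 - 1) = 3, one for each age group beyond the first.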
Logistic regression: categorical variable with 4 levels:
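$$ \log\left(\frac{p_i}{1 - p_i}\right) = \begin{cases} a & \text{age } 45\text{-}48 \\ a + b_1 & \text{age } 49\text{-}52 \\ a + b_2 & \text{age } 53\text{-}56 \\ a + b_3 & \text{age } 57\text{-}62 \end{cases} $$
Reference category: 45-48
$$ a = \log\left(\text{Odds}(45\text{-}48)\right) $$
$$ b_1 = \log\left(\frac{\text{Odds}(49\text{-}52)}{\text{Odds}(45\text{-}48)}\right), \quad b_2 = \log\left(\frac{\text{Odds}(53\text{-}56)}{\text{Odds}(45\text{-}48)}\right), \quad b_3 = \log\left(\frac{\text{Odds}(57\text{-}62)}{\text{Odds}(45\text{-}48)}\right) $$
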
Results

    framingham[,AgeCut:=cut(AGE,c(40,48,52,56,99),labels=c("45-48","49-52","53-56","57-62"))]
    fit3=glm(Y~AgeCut,data=framingham,family=binomial)
    publish(fit3,intercept=1L)

        Variable Units OddsRatio       CI.95   p-value
     (Intercept)            0.17 [0.12;0.22]  < 0.0001
          AgeCut 45-48       Ref
                 49-52      1.24 [0.82;1.85]   0.30425
                 53-56      1.52 [1.02;2.28]   0.04151
                 57-62      2.36 [1.61;3.46]  < 0.0001

Notes:
1. The interpretation depends on the cut-off values.
2. Not all comparisons are in the table; for example, the odds ratio for group 49-52 vs 53-56 is not. But it can be computed, and you know how.

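The odds ratios in the table can be recomputed from the counts of the 2 by 4 table, and a comparison that is not shown follows by dividing two of them. A minimal sketch with made-up object names:

```r
# Counts per age group from the 2 by 4 table
chd   <- c("45-48" = 51,  "49-52" = 61,  "53-56" = 64,  "57-62" = 92)
nochd <- c("45-48" = 308, "49-52" = 298, "53-56" = 254, "57-62" = 235)

# Odds of CHD in each age group
odds <- chd / nochd

# Odds ratios relative to the reference group 45-48
or.ref <- odds / odds["45-48"]
print(round(or.ref, 2))

# A comparison that is not in the table: 49-52 vs. 53-56.
# The reference group cancels when two odds ratios are divided.
or.49.vs.53 <- or.ref[["49-52"]] / or.ref[["53-56"]]
print(round(or.49.vs.53, 2))
```

A confidence interval for such a comparison is not available from the table alone; refitting the model with a different reference level (as with relevel() earlier) would provide it.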
Quantitative explanatory factor
It is often more natural to include the variable AGE (in years) as a quantitative explanatory factor in the model (i.e., NO grouping):
$$ \log\left(\frac{p_i}{1 - p_i}\right) = a + b \cdot \text{age}_i $$
$$ a = \log(\text{odds}(\text{age}=0)), \quad b = \log(\text{odds}(\text{age}=x+1)) - \log(\text{odds}(\text{age}=x)) $$
Interpretation: exp(b) = odds ratio is the factor by which the odds of CHD increase with each one-unit increase of age (here 1 year).

Results

    fit5=glm(Y~AGE,data=framingham,family=binomial)
    summary(fit5)

    Call:
    glm(formula = Y ~ AGE, family = binomial, data = framingham)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -0.8600  -0.7052  -0.6082  -0.5224   2.0294

    Coefficients:
                Estimate Std. Error z value       Pr(>|z|)
    (Intercept) -4.88431    0.77372  -6.313 0.000000000274 ***
    AGE          0.06581    0.01446   4.550 0.000005374208 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 1351.2  on 1362  degrees of freedom
    Residual deviance: 1330.2  on 1361  degrees of freedom
    AIC: 1334.2

    Number of Fisher Scoring iterations: 4

Results
One year change in age

    fit5=glm(Y~AGE,data=framingham,family=binomial)
    publish(fit5,intercept=1L)

        Variable Units OddsRatio       CI.95   p-value
     (Intercept)            0.01 [0.00;0.03]  < 0.0001
             AGE            1.07 [1.04;1.10]  < 0.0001

10-year change in age

    framingham[,age10:=AGE/10]
    fit5=glm(Y~age10,data=framingham,family=binomial)
    publish(fit5,intercept=1)

        Variable Units OddsRatio       CI.95   p-value
     (Intercept)            0.01 [0.00;0.03]  < 0.0001
           age10            1.93 [1.45;2.56]  < 0.0001

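The 10-year odds ratio does not require refitting the model: on the log(odds) scale the effect of 10 years is 10 times the effect of one year. The sketch below uses the estimate and standard error for AGE copied from the summary(fit5) output above; the helper or.per() is a made-up name, and the normal-approximation confidence limits are an assumption about how the reported intervals were obtained.

```r
# Estimate and standard error for AGE, copied from summary(fit5)
b  <- 0.06581
se <- 0.01446

# Odds ratio and Wald 95% confidence limits for a change of `years` years
or.per <- function(years) {
  exp(years * (b + c(OR = 0, lower = -1, upper = 1) * qnorm(0.975) * se))
}

print(round(or.per(1), 2))
print(round(or.per(10), 2))
```

Equivalently, the 10-year odds ratio is the one-year odds ratio raised to the power 10.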
Exercises
If we subtract the value 50 from each person's age:

    framingham[,Age50:=AGE-50]
    fit5a=glm(Y~Age50,data=framingham,family=binomial)
    publish(fit5a,intercept=1)

        Variable Units OddsRatio       CI.95   p-value
     (Intercept)            0.20 [0.17;0.24]  < 0.0001
           Age50            1.07 [1.04;1.10]  < 0.0001

1. Report the coronary heart disease risk of a person aged 50.
2. Report the association between age and risk of coronary heart disease in a sentence with confidence interval and p-value.

Multiple logistic regression
Additive effects of several explanatory variables:
$$ \log\left(\frac{p_i}{1 - p_i}\right) = a + b_1 Z_i + b_2 X_i + \dots $$
Multiple logistic regression is a way to control confounding:
The effect on the outcome (odds ratio) of each explanatory variable is mutually adjusted for the other explanatory variables.
- The model assumes that the effect (odds ratio) of Z on Y is the same for all values of X.
- In other words: the effect of Z on Y is not modified by the values of X (no statistical interaction).

Illustration of what "mutually adjusted" means
Additive model (no statistical interactions)
$$ \log\left(\frac{p_i}{1 - p_i}\right) = a + b_1 Z_i + b_2 X_i $$
Effect of sex Z_i (0 = female, 1 = male) adjusted for age (X_i):
$$ \frac{\text{odds}(\text{age}=50, \text{male})}{\text{odds}(\text{age}=50, \text{female})} = \frac{\exp(a + b_1 + b_2 \cdot 50)}{\exp(a + b_2 \cdot 50)} = \exp(a + b_1 + b_2 \cdot 50 - a - b_2 \cdot 50) = \exp(b_1). $$
The result is the same for age 46 and age 61 and all other ages.

Illustration of what "mutually adjusted" means (continued)
Effect of age (X_i) for males:
$$ \frac{\text{odds}(\text{age}=51, \text{male})}{\text{odds}(\text{age}=50, \text{male})} = \frac{\exp(a + b_1 + b_2 \cdot 51)}{\exp(a + b_1 + b_2 \cdot 50)} = \exp(a + b_1 + b_2 \cdot 51 - a - b_1 - b_2 \cdot 50) = \exp(b_2). $$
The result is the same for females:
$$ \frac{\text{odds}(\text{age}=51, \text{female})}{\text{odds}(\text{age}=50, \text{female})} = \frac{\exp(a + b_2 \cdot 51)}{\exp(a + b_2 \cdot 50)} = \exp(a + b_2 \cdot 51 - a - b_2 \cdot 50) = \exp(b_2). $$
Linearity means that the result is the same for a comparison of age 63 and age 62 and all other one-year differences.

Results

    fit.add=glm(Y ~ AGE + Sex, family = binomial, data = framingham)
    summary(fit.add)

    Call:
    glm(formula = Y ~ AGE + Sex, family = binomial, data = framingham)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -0.9910  -0.6927  -0.5958  -0.4500   2.1913

    Coefficients:
                Estimate Std. Error z value      Pr(>|z|)
    (Intercept) -4.59208    0.78019  -5.886 0.00000000396 ***
    AGE          0.06672    0.01458   4.575 0.00000475151 ***
    SexFemale   -0.71613    0.14052  -5.096 0.00000034612 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 1351.2  on 1362  degrees of freedom
    Residual deviance: 1303.5  on 1360  degrees of freedom
    AIC: 1309.5

    Number of Fisher Scoring iterations: 4

Results
    fit.add=glm(Y ~ AGE + Sex, family = binomial, data = framingham)
    publish(fit.add)

     Variable  Units OddsRatio       CI.95  p-value
          AGE             1.07 [1.04;1.10]  <0.0001
          Sex   Male      1.00 [1.00;1.00]        1
              Female      0.49 [0.37;0.64]  <0.0001

Logistic regression was used to investigate gender differences in odds (risks) of CHD adjusted for age.
The age-adjusted odds ratio was 0.49 (95%-CI: [0.37;0.64]), showing that the risks of CHD were significantly lower for women compared to men (p<0.0001).

Predicted risks based on logistic regression model
A logistic regression model can be used to predict personalized risks:
$$ \log\left(\frac{p_i}{1 - p_i}\right) = a + b_1 Z_i + b_2 X_i + \dots $$
is equivalent to
$$ p_i = \frac{\exp(a + b_1 Z_i + b_2 X_i + \dots)}{1 + \exp(a + b_1 Z_i + b_2 X_i + \dots)} $$
The risks (and risk ratios) depend on all explanatory variables simultaneously.

Predicted risks based on logistic regression model
Prediction makes most sense for new data

    mydata=expand.grid(AGE=c(50,55),Sex=factor(c("Female","Male")))
    setDT(mydata)
    mydata

       AGE    Sex
    1:  50 Female
    2:  55 Female
    3:  50   Male
    4:  55   Male

    mydata[,risk:=predict(fit.add,newdata=mydata,type="response")]
    mydata

       AGE    Sex      risk
    1:  50 Female 0.1221381
    2:  55 Female 0.1626353
    3:  50   Male 0.2216284
    4:  55   Male 0.2844255

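The predicted risks can also be computed directly from the formula for p_i. The sketch below copies the rounded coefficients from the summary(fit.add) output, so the results agree with predict() only up to rounding; the object names are made up for this illustration. It also shows that the risk ratio between males and females changes with age although the odds ratio does not.

```r
# Coefficients copied from summary(fit.add): intercept, AGE, SexFemale
a        <- -4.59208
b.age    <-  0.06672
b.female <- -0.71613

newdata <- expand.grid(AGE = c(50, 55), Sex = c("Female", "Male"),
                       stringsAsFactors = FALSE)

# Linear predictor = log(odds); plogis() is the inverse logit exp(x)/(1 + exp(x))
lp <- a + b.age * newdata$AGE + b.female * (newdata$Sex == "Female")
newdata$risk <- plogis(lp)
print(newdata)

# Risk ratio and odds ratio, males vs. females, at age 50 and at age 55
risk <- newdata$risk
rr <- risk[3:4] / risk[1:2]
or <- (risk[3:4] / (1 - risk[3:4])) / (risk[1:2] / (1 - risk[1:2]))
print(round(rbind(risk.ratio = rr, odds.ratio = or), 2))
```

The odds ratio is the same at both ages, namely exp(0.71613), the reciprocal of the 0.49 reported for females vs. males.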
Visualization of predicted risks
    mydata2 <- setDT(expand.grid(AGE=seq(45,62,1),Sex=factor(c("Female","Male"))))
    mydata2[,risk:=predict(fit.add,newdata=mydata2,type="response")]
    library(ggplot2)
    ggplot(mydata2,aes(x=AGE,y=risk,group=Sex,colour=Sex))+geom_line()+ylim(c(0,1))+xlab("Age (years)")+ylab("Risk of CHD")

[Figure: risk of CHD (y-axis, 0 to 1) against age in years (x-axis, 45 to 60), one line per sex (Female, Male).]

Example with more variables
    framingham[,Chol10:=CHOL/10]
    fit.multi <- glm(Y ~ AGE + Sex + Chol10 + SBP + Smoke, family = binomial,data = framingham)
    publish(fit.multi)

     Variable  Units Missing OddsRatio       CI.95    p-value
          AGE              0      1.06 [1.02;1.09]  0.0004181
          Sex   Male       0       Ref
              Female              0.38 [0.28;0.52]   < 0.0001
       Chol10              0      1.05 [1.02;1.08]  0.0026086
          SBP              0      1.02 [1.01;1.02]   < 0.0001
        Smoke     No       1       Ref
                 Yes              1.19 [0.88;1.60]  0.2510447

Exercise
1. Report the effect of cholesterol on coronary heart disease from the multiple logistic regression model.
2. Predict the coronary heart disease risks of four smoking females, all aged 50 and with an SBP of 150, but with different cholesterol values:
   - person 1: 235, person 2: 245, person 3: 351, person 4: 361

    mydata=data.frame(AGE=50,
                      Sex=factor("Female",levels(framingham$Sex)),
                      Smoke=factor("Yes",levels(framingham$Smoke)),
                      SBP=150,
                      Chol10=c(23.5,24.5,35.1,36.1))

3. Compute the risk ratios for 10 unit cholesterol changes from 245 to 235 and from 361 to 351.
4. Repeat 2. and 3. for a male person.

Statistical interaction = Effect modification
The effect of X on Y depends on Z

Effect modification
A statistical interaction (effect modification) requires 3 variables:
- two explanatory variables X, Z
- one outcome Y
In logistic regression, the odds ratio which describes the effect of X on the odds of Y=1 depends on the value of Z.
Symmetry
If the effect of variable X on Y is modified by Z, then the effect of Z on Y is also modified by X.
Example
The age (Z) effect on the CHD risk (Y) may depend on sex (X).

Statistical interaction in R
    summary(glm(Y ~ AGE * SEX, family = binomial, data = framingham))

Alternative notation:

    summary(glm(Y ~ AGE + SEX + AGE:SEX, family = binomial, data = framingham))

Results

    summary(glm(Y ~ AGE * SEX, family = binomial, data = framingham))

    Call:
    glm(formula = Y ~ AGE * SEX, family = binomial, data = framingham)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -0.9171  -0.7284  -0.6074  -0.4010   2.3029

    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
    (Intercept)  0.091694   2.361000   0.039   0.9690
    AGE         -0.007736   0.044223  -0.175   0.8611
    SEX         -3.544593   1.604311  -2.209   0.0271 *
    AGE:SEX      0.052967   0.029871   1.773   0.0762 .
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 1351.2  on 1362  degrees of freedom
    Residual deviance: 1300.4  on 1359  degrees of freedom
    AIC: 1308.4

    Number of Fisher Scoring iterations: 4

Statistical interaction
    publish(glm(Y ~ AGE * Sex, family = binomial, data = framingham))

             Variable Units OddsRatio       CI.95   p-value
       AGE: Sex(Male)            1.05 [1.01;1.09]   0.01629
     AGE: Sex(Female)            1.10 [1.05;1.15]  < 0.0001

Notes:
- The main effects for AGE and Sex have no interpretation (and are therefore not shown).
- One year more in age increases the odds by 5% in males and by 10% in females.

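The sex-specific odds ratios for age can be derived from the coefficients of the interaction model shown in the summary output above. Note that the summary output uses the numeric variable SEX (1 = male, 2 = female), so the age effect on the log(odds) scale is the AGE coefficient plus SEX times the AGE:SEX coefficient. A minimal sketch with made-up object names:

```r
# Coefficients copied from the summary output of glm(Y ~ AGE * SEX),
# where SEX is the numeric variable coded 1 = male, 2 = female
b.age     <- -0.007736
b.age.sex <-  0.052967

# Log odds ratio per year of age for a given value of SEX
age.effect <- function(SEX) b.age + b.age.sex * SEX

or.age <- exp(c(Male = age.effect(1), Female = age.effect(2)))
print(round(or.age, 2))
```

This also illustrates why the AGE main effect alone has no useful interpretation here: it would be the age effect for the non-existent value SEX = 0.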
Predicted risk of the model with an additive effect of age and sex (no effect modification)
[Figure: predicted CHD risk (y-axis, 0 % to 100 %) against age in years (x-axis, 45 to 60), one curve each for males and females.]

Predicted risk of the model with an interaction between age and sex (with effect modification)
[Figure: predicted CHD risk (y-axis, 0 % to 100 %) against age in years (x-axis, 45 to 60), one curve each for males and females.]

Testing for statistical interaction

    fit.add=glm(Y ~ AGE + SEX, family = binomial, data = framingham)
    fit.int=glm(Y ~ AGE * SEX, family = binomial, data = framingham)
    anova(fit.add,fit.int,test="Chisq")

    Analysis of Deviance Table

    Model 1: Y ~ AGE + SEX
    Model 2: Y ~ AGE * SEX
      Resid. Df Resid. Dev Df Deviance Pr(>Chi)
    1      1360     1303.5
    2      1359     1300.4  1   3.1676  0.07511 .
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

There is no statistically significant modification of the age effect by gender (p>0.05).

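The p-value in the analysis of deviance table is a likelihood ratio test: the difference in residual deviance between the two models is compared with a chi-square distribution whose degrees of freedom equal the number of additional parameters. A minimal sketch using the numbers from the table above (object names made up for this illustration):

```r
# Deviance difference and degrees of freedom from the analysis of deviance table
dev.diff <- 3.1676        # additive model deviance minus interaction model deviance
df.diff  <- 1360 - 1359   # one extra parameter for the AGE:SEX interaction

# Likelihood ratio test: upper tail probability of the chi-square distribution
p.lrt <- pchisq(dev.diff, df = df.diff, lower.tail = FALSE)
print(round(p.lrt, 5))
```

The result is close to, but not identical with, the Wald p-value for AGE:SEX in the summary output (0.0762), because the two tests use different approximations.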
Take home messages
- (Multiple) logistic regression describes associations between one or several explanatory variables and the risk of an event (binary outcome).
- Analysis of an exposure of interest can be adjusted for potential confounders.
- In an additive model (no interactions), odds ratios do not depend on the other explanatory variables.
- Risks and risk ratios predicted by the model depend on the other explanatory variables.
- Linearity and absence of interaction are assumptions which should be investigated.