Introduction to Logistic Regression
Rachid Salmi, Jean-Claude Desenclos, Thomas Grein, Alain Moren, Viviane Bremer
Objectives
• When do we need to use logistic regression
• Principles of logistic regression
• Uses of logistic regression
• What to keep in mind
Chlamorea
• Sexually transmitted infection
  –Virus recently identified
  –Leads to general rash, blush, pimples and feeling of shame
  –Increasing prevalence with age
  –Risk factors unknown so far
Case control study
• Population of Berlin
• 150 cases, 150 controls
• Hypothesis: Consistent use of condoms protects against chlamorea
• Questionnaire with questions on demographic characteristics, sexual behaviour
• Analysis: odds ratios (OR), t-test
Results bivariate analysis
                              Cases (n=150)   Controls (n=150)   Odds ratio
Used condoms at last sex      40              90                 0.17
Did not use condoms           110             60                 Ref
Results bivariate analysis
                              Cases (n=150)   Controls (n=150)   Odds ratio
Single                        125             50                 4.7
Currently in a relationship   25              100                Ref
Results bivariate analysis
                                     Cases (n=150)   Controls (n=150)   T-test
Number of partners during last year  4               2                  p=0.001
Mean age in years                    39              26                 p=0.001
Confounding?
[Diagram: the crude 2×2 table (cells a, b, c, d; OR raw) for chlamorea and condom use is split by stratification on single status, age group and number of partners into stratum-specific 2×2 tables (cells ai, bi, ci, di), each with its own odds ratio ORi.]
Let’s go one step back
Simple linear regression
Table 1 Age and systolic blood pressure (SBP) among 33 adult women
Age  SBP    Age  SBP    Age  SBP
22   131    41   139    52   128
23   128    41   171    54   105
24   116    46   137    56   145
27   106    47   111    57   141
28   114    48   115    58   153
29   123    49   133    59   157
30   117    49   128    63   155
32   122    50   183    67   176
33    99    51   130    71   172
35   121    51   133    77   178
40   147    51   144    81   217
[Figure: SBP (mm Hg) plotted against age (years)]
adapted from Colton T. Statistics in Medicine. Boston: Little Brown, 1974
Simple linear regression
• Relation between 2 continuous variables (SBP and age)
• Regression coefficient $\beta_1$
  –Measures association between y and x
  –Amount by which y changes on average when x changes by one unit
  –Least squares method
$y = \alpha + \beta_1 x_1$ (intercept $\alpha$, slope $\beta_1$)
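A minimal sketch of the least squares method applied to the age and SBP pairs of Table 1. The function name `least_squares` and the variable names are illustrative choices, not part of the original slides.

```python
# Age and SBP pairs transcribed from Table 1 (33 adult women).
AGE = [22, 23, 24, 27, 28, 29, 30, 32, 33, 35, 40,
       41, 41, 46, 47, 48, 49, 49, 50, 51, 51, 51,
       52, 54, 56, 57, 58, 59, 63, 67, 71, 77, 81]
SBP = [131, 128, 116, 106, 114, 123, 117, 122, 99, 121, 147,
       139, 171, 137, 111, 115, 133, 128, 183, 130, 133, 144,
       128, 105, 145, 141, 153, 157, 155, 176, 172, 178, 217]


def least_squares(x, y):
    """Return (intercept, slope) of the least squares line y = alpha + beta1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


alpha, beta1 = least_squares(AGE, SBP)
print(f"SBP = {alpha:.1f} + {beta1:.2f} * age")
```

The slope is the average change in SBP (mm Hg) per additional year of age.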
What if we have more than one independent variable?
Multiple risk factors
• Objective: to attribute to each risk factor its respective effect (RR) on the occurrence of disease.
Types of multivariable analysis
• Multiple models
  –Linear regression
  –Logistic regression
  –Cox model
  –Poisson regression
  –Loglinear model
  –Discriminant analysis
  –…
• Choice of the tool according to objectives, study design and variables
Multiple linear regression
• Relation between a continuous variable and a set of i variables
• Partial regression coefficients $\beta_i$
  –Amount by which y changes when xi changes by one unit and all the other xi remain constant
  –Measures association between xi and y adjusted for all other xi
• Example
  –Number of partners in relation to age & income
$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$
Multiple linear regression
y                     x
Predicted             Predictor variables
Response variable     Explanatory variables
Outcome variable      Covariables
Dependent             Independent variables
$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$
y (number of partners) = α + β1 age + β2 income + β3 gender
What if our outcome variable is dichotomous?
Logistic regression (1)
Table 2 Age and chlamorea
How can we analyse these data?
• Compare mean age of diseased and non-diseased
–Non-diseased: 26 years
–Diseased: 39 years (p=0.0001)
• Linear regression?
Dot-plot: Data from Table 2 (y-axis: presence of chlamorea)
Logistic regression (2)
Table 3 Prevalence (%) of chlamorea according to age group
Dot-plot: Data from Table 3 (diseased % by age group)
Logistic function (1)
[Figure: probability of disease (0.0 to 1.0) as a function of x]
Logistic function
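• Logistic regression models the logit of the outcome = natural logarithm of the odds of the outcome
$\text{logit}(P) = \ln\left(\frac{P}{1-P}\right)$, where P is the probability of the outcome and 1-P the probability of not having the outcome
$\ln\left(\frac{P}{1-P}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$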
Logistic function
$\alpha$ = log odds of disease in unexposed
$\beta$ = log odds ratio associated with being exposed
$e^{\beta}$ = odds ratio
$\ln\left(\frac{P}{1-P}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$
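The link between the logit, the probability of disease and the odds ratio can be checked numerically. The sketch below uses hypothetical values for α and β (they are not estimates from the chlamorea study) and the function names are illustrative.

```python
import math


def logit(p):
    """Natural logarithm of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


def logistic(z):
    """Inverse of the logit: maps a log odds z back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, chosen only for illustration.
alpha = -1.0   # log odds of disease in unexposed
beta = 0.7     # log odds ratio associated with being exposed

p_unexposed = logistic(alpha)
p_exposed = logistic(alpha + beta)

odds_unexposed = p_unexposed / (1.0 - p_unexposed)
odds_exposed = p_exposed / (1.0 - p_exposed)
odds_ratio = odds_exposed / odds_unexposed

print(f"P(disease | unexposed) = {p_unexposed:.3f}")
print(f"P(disease | exposed)   = {p_exposed:.3f}")
print(f"odds ratio = {odds_ratio:.3f}, exp(beta) = {math.exp(beta):.3f}")
```

Whatever values are chosen, the ratio of the two odds equals exp(β), which is why the exponentiated coefficient is read as an odds ratio.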
Multiple logistic regression
• More than one independent variable
  –Dichotomous, ordinal, nominal, continuous …
• Interpretation of $\beta_i$
  –Increase in log-odds for a one unit increase in xi with all the other xi constant
  –Measures association between xi and log-odds adjusted for all other xi
$\ln\left(\frac{P}{1-P}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$
Uses of multivariable analysis
• Etiologic models
  –Identify risk factors adjusted for confounders
  –Adjust for differences in baseline characteristics
• Predictive models
  –Determine diagnosis
  –Determine prognosis
Fitting equation to the data
• Linear regression:
  –Least squares
• Logistic regression:
  –Maximum likelihood
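As a rough illustration of maximum likelihood fitting, the sketch below fits a logistic model with a single dichotomous exposure by Newton-Raphson iteration. The counts are hypothetical (not the chlamorea data), the function name is illustrative, and statistical packages use more general and more robust routines.

```python
import math


def fit_logistic_binary(groups, iterations=25):
    """Maximum likelihood fit of ln(P / (1 - P)) = alpha + beta * x.

    groups: list of (x, n_subjects, n_cases) with x coded 0 or 1.
    Returns (alpha, beta) after a fixed number of Newton-Raphson steps.
    """
    alpha, beta = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, n, cases in groups:
            p = 1.0 / (1.0 + math.exp(-(alpha + beta * x)))
            resid = cases - n * p          # observed minus expected cases
            w = n * p * (1.0 - p)          # weight in the information matrix
            g0 += resid
            g1 += resid * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Solve the 2x2 system (information matrix) * step = gradient.
        det = h00 * h11 - h01 * h01
        alpha += (h11 * g0 - h01 * g1) / det
        beta += (h00 * g1 - h01 * g0) / det
    return alpha, beta


# Hypothetical grouped data: (exposure, subjects, cases).
GROUPS = [(0, 100, 60), (1, 100, 30)]
alpha_hat, beta_hat = fit_logistic_binary(GROUPS)

# Cross-product odds ratio from the same 2x2 table, for comparison.
cross_product_or = (30 * 40) / (70 * 60)
print(f"exp(beta) = {math.exp(beta_hat):.4f}, cross-product OR = {cross_product_or:.4f}")
```

With one dichotomous exposure the maximum likelihood estimate of exp(β) coincides with the cross-product odds ratio of the 2×2 table, which gives a convenient check on the fit.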
Elaborating $e^{\beta}$
• $e^{\beta}$ = OR
  –What if the independent variable is continuous?
  –What's the effect of a change in x by more than one unit?
The Q fever example
• Distance to farm as independent continuous variable measured in meters
  –β in logistic regression was -0.00050013 and statistically significant
• OR for each 1 meter distance is 0.9995
  –Too small to use
• What's the OR for every 1000 meters?
  –$e^{1000\beta} = e^{-1000 \times 0.00050013} = 0.6064$
Continuous variables
• Increase in OR for a one unit change in exposure variable
• Logistic model is multiplicative: OR increases exponentially with x
  –If OR = 2 for a one unit change in exposure and x increases from 2 to 5: OR = 2 × 2 × 2 = $2^3$ = 8
• Verify if OR increases exponentially with x
  –When in doubt, treat as qualitative variable
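The rescaling of a coefficient to a larger change in x can be sketched in a few lines. The helper name `odds_ratio_for_change` is an illustrative choice; the coefficient is the one quoted in the Q fever example.

```python
import math


def odds_ratio_for_change(beta, delta):
    """Odds ratio associated with a change of `delta` units in x: exp(beta * delta)."""
    return math.exp(beta * delta)


# Coefficient for distance to farm (per meter) from the Q fever example.
BETA_DISTANCE = -0.00050013

or_1m = odds_ratio_for_change(BETA_DISTANCE, 1)
or_1000m = odds_ratio_for_change(BETA_DISTANCE, 1000)
print(f"OR per 1 m:    {or_1m:.4f}")
print(f"OR per 1000 m: {or_1000m:.4f}")

# Multiplicative model: OR = 2 per unit, x increases from 2 to 5.
or_2_to_5 = odds_ratio_for_change(math.log(2), 5 - 2)
print(f"OR for x from 2 to 5: {or_2_to_5:.1f}")
```

Multiplying β by the size of the change before exponentiating is the same as raising the one-unit OR to that power.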
Coding of variables (2)
• Nominal variables or ordinal with unequal classes:
  –Preferred hair colour of partners:
    » No hair=0, grey=1, brown=2, blond=3
  –Model assumes that OR for blond partners = $(\text{OR for grey-haired partners})^3$
  –Use indicator variables (dummy variables)
Indicator variables: Hair colour
• Neutralises artificial hierarchy between classes in variable “hair colour of partners"
• No assumptions made
• 3 variables in model using same reference
• OR for each type of hair adjusted for the others in reference to “no hair”
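One possible way to build the three indicator variables for hair colour, with "no hair" as the reference class. The function and key names are hypothetical; statistical packages usually provide their own facilities for this.

```python
# Original numeric coding of the nominal variable.
HAIR_CODES = {"no hair": 0, "grey": 1, "brown": 2, "blond": 3}


def indicator_variables(colour, reference="no hair"):
    """Return one 0/1 indicator per non-reference class of hair colour."""
    if colour not in HAIR_CODES:
        raise ValueError(f"unknown hair colour: {colour!r}")
    levels = [level for level in HAIR_CODES if level != reference]
    return {f"hair_{level}": int(colour == level) for level in levels}


for colour in HAIR_CODES:
    print(colour, indicator_variables(colour))
```

The reference class has all three indicators set to 0, so each coefficient compares one hair colour with "no hair" without imposing any ordering between the classes.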
Classes
• Relationship between number of partners during last year and chlamorea
  –Code number of partners: 0-1 = 1, 2-3 = 2, 4-5 = 3
• Compatible with assumption of multiplicative model
  –If not compatible, use indicator variables

Code nr partners   Cases   Controls   OR
1                  20      40         1.0
2                  22      30         1.5
3                  12      11         2.2

$1.5^2 \approx 2.2$
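The check of the multiplicative assumption can be reproduced from the counts in the table. This is a minimal sketch; the dictionary layout and function name are illustrative.

```python
# Cases and controls per coded class of number of partners (from the table).
COUNTS = {1: (20, 40), 2: (22, 30), 3: (12, 11)}


def odds_ratio_vs_reference(counts, level, reference=1):
    """Cross-product odds ratio of one class against the reference class."""
    cases, controls = counts[level]
    ref_cases, ref_controls = counts[reference]
    return (cases * ref_controls) / (controls * ref_cases)


or_2 = odds_ratio_vs_reference(COUNTS, 2)
or_3 = odds_ratio_vs_reference(COUNTS, 3)

# Under a multiplicative model, two steps up should give roughly OR_2 squared.
expected_or_3 = or_2 ** 2

print(f"OR class 2 vs 1: {or_2:.2f}")
print(f"OR class 3 vs 1: {or_3:.2f} (multiplicative model expects about {expected_or_3:.2f})")
```

When the observed OR for class 3 is close to the square of the OR for class 2, a single coded variable is a reasonable summary; a clear departure would argue for indicator variables.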
Risk factors for Chlamorea
[Diagram: no condom use and chlamorea, together with the other variables considered: sex, hair colour, age group, single, visiting bars, number of partners]
Unconditional Logistic Regression
Term                Odds Ratio   95% C.I.             Coef.     S.E.     Z-Statistic   P-Value
# partners          1.2664       0.2634 to 10.7082    0.2362    0.9452   0.5486        0.5833
Single (Yes/No)     1.0345       0.3277 to 3.2660     0.0339    0.5866   0.0578        0.9539
Hair colour (1/0)   1.6126       0.2675 to 9.7220     0.4778    0.9166   0.5213        0.6022
Hair colour (2/0)   0.7291       0.0991 to 5.3668     -0.3159   1.0185   -0.3102       0.7564
Hair colour (3/0)   1.1137       0.1573 to 7.8870     0.1076    0.9988   0.1078        0.9142
Visiting bars       1.5942       0.4953 to 5.1317     0.4664    0.5965   0.7819        0.4343
Used no Condoms     9.0918       3.0219 to 27.3533    2.2074    0.5620   3.9278        0.0001
Sex (f/m)           1.3024       0.2278 to 7.4468     0.2642    0.8896   0.2970        0.7665
CONSTANT            *            *                    -3.0080   2.0559   -1.4631       0.1434
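The columns of the output are linked: the odds ratio is the exponentiated coefficient and the Z-statistic is the coefficient divided by its standard error. The sketch below recomputes these for the "Used no Condoms" row, assuming a Wald-type 95% confidence interval with the multiplier 1.96 (an assumption about how the software derived the interval; the function name is illustrative).

```python
import math


def summarise_coefficient(coef, se, z_crit=1.96):
    """Return (odds ratio, lower limit, upper limit, z) for one logistic coefficient."""
    odds_ratio = math.exp(coef)
    lower = math.exp(coef - z_crit * se)   # Wald interval on the log odds scale,
    upper = math.exp(coef + z_crit * se)   # then exponentiated
    z = coef / se
    return odds_ratio, lower, upper, z


# Coef. and S.E. for "Used no Condoms" as printed in the output.
or_condom, ci_low, ci_high, z_condom = summarise_coefficient(2.2074, 0.5620)
print(f"OR = {or_condom:.2f}, 95% CI {ci_low:.2f} to {ci_high:.2f}, Z = {z_condom:.2f}")
```

For this row the recomputed values agree with the printed odds ratio, interval and Z-statistic to within rounding.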
Last but not least
Why do we need multivariable analysis?
• Our real world is multivariable
• Multivariable analysis is a tool to determine the relative contribution of all factors
Sequence of analysis
• Descriptive analysis
  –Know your dataset
• Bivariate analysis
  –Identify associations
• Stratified analysis
  –Confounding and effect modifiers
• Multivariable analysis
  –Control for confounding
What can go wrong
• Small sample size and too few cases
• Wrong coding
• Skewed distribution of independent variables
  –Empty “subgroups”
• Collinearity
  –Independent variables express the same
Do not forget
• Rubbish in - rubbish out
• Check for confounders first
• Number of subjects >> variables in the model
• Keep the model simple
  –Statisticians can help with the model but you need to understand the interpretation
• You will need several attempts to find the “best” model
• If in doubt…
Really call a statistician !!!!