Post on 11-Jan-2017
Regression models 1
Choosing regression models
An elementary introduction
Stephen Senn
Explanation
• I am not presenting these things because I think you don’t know them
• I am presenting them because the people you work with don’t know them
• And you need to explain these things to them
Outline
• Basic considerations in modelling
• Choosing predictors
• Transformation of the predictor(s)
• Transformation of the outcome
• Advice
Basic considerations
Thinking before you model
Some Modelling Tasks
• Choose a generally suitable probability model
• Choose a set of suitable predictors
• Consider whether these need to be transformed
• Consider whether the outcome needs to be transformed
• Choose a technique for fitting the model
• Fit the model
• Assess goodness of fit of model
• Make causal inferences
• Issue predictions
Factors Affecting Choice of Model
• Purpose of model
  – Causal, predictive, classification
• Design of study
  – Designed experiment, observational study, survey
• Temporal sequence
• Prior knowledge
• Type of data
  – Continuous measurements, binary, ordinal, counts, censored life-times
• Case ascertainment
• Results of model fitting
Preliminaries
• Choosing good regression models is not a question of throwing some data at a stepwise selection algorithm
• Two things are important
  – Being clear about the purpose
  – Insight (which in turn is based on)
    • Experience
    • Understanding
    • Logic
Two Extremes
Causal analysis
• The putative causal factor(s) must be in the model
• Other factors are in the model because they help us understand the causal factor(s)
• They are of no interest in themselves
• We pay particular attention to the significance of the putative causal factor(s)
Predictive modelling
• We are trying to find predictors of some outcome
• It is their joint value as predictors that is important
• We simply want the most predictive model
• We compare entire models to judge which is best
Example
• Modelling the effect of treatment in a clinical trial
• Treatment must be in any model whether or not it is significant
• Other factors will be in the model to help me improve my estimate of the effect of treatment
  – They are of little interest in themselves
  – They are nearly always predetermined
Does Smoking Cause Lung Cancer?
A Tale of Two Statisticians
Works in public health
• I wish to establish whether it is causal
• If so I can warn smokers to quit and this will benefit their health
• It is important for me to rule out possible confounding factors
Works in life insurance
• I don’t care if it is causal or not
• The data show that smokers are much more likely to get lung cancer
• That’s enough for me to take account of it in setting the premiums
Warning
• Regression models are there to help you use your insight, experience and prior knowledge to understand your datasets
• They are not a substitute for scientific understanding
Choosing predictors
It’s not just a matter of significance
An Example
• Multicentre trial of asthma comparing formoterol, salbutamol and placebo for their effects on forced expiratory volume in one second (FEV1)
• Randomisation stratified by steroid use (yes/no) and centre
• Sex, age, height of patient and baseline FEV1 also measured
• Definitely in the model
  – Blocking factors: centre & steroid use
  – Treatment factor (3 levels: formoterol, salbutamol, placebo)
• Possibly in the model
  – Covariates: sex, age, height of patient and baseline FEV1
  – NB sex, age and height are also very predictive of baseline FEV1; therefore, if you put them all in the model, none may be significant
  – This does not matter
Temporal Sequence I
• If we are interested in causal inferences it is usually inappropriate to include in a model variables that were measured later than the putative causal variables.
• The later variables cannot have caused the earlier variables and so should not be included.
Example
• It is desired to study whether the type of school attended (private or state school) affects students’ chances of success in final degree examinations at university
• Data are obtained for a large group of students
• In addition to information on degree results and type of school attended, information is obtained on
  – sex of student
  – high school results
  – parents’ income
• Which of these factors is it inappropriate to include in the model and why?
Temporal Sequence II
• The same does not apply if the purpose of the model is simply classification
• It may then be helpful to have factors in the model even if they are measured after the “outcome variable of interest”
• Indeed they can be included even if they have been “caused” by the variable of interest
Example
• We wish to develop a model for classifying patients who present with abdominal pain as either suffering from appendicitis or non-specific abdominal pain
• We use location of pain, degree of pain, absence/presence of nausea, body temperature as “predictor” variables
  – Even though these are consequences of rather than causes of appendicitis
Prior Knowledge
• Frequently when fitting models we already have strong opinions about the effect of some factors even if we are ignorant about others.
  – For example we may be examining the effect of a previously unstudied environmental exposure on health
  – We know, however, that age is an important determinant of health
• We will tend to put factors we believe are important in the model irrespective of their significance according to the current data set.
• Similarly, implicitly, there will always be a host of factors we believe are irrelevant.
• We will not put these in the model on prior grounds
Type of Data and Choice of Basic Model
Type of Data: Possible Basic Model
• Continuous measurement: General linear model (Normal outcomes)
• Count data: Poisson regression
• Binary data: Logistic regression
• Ordered categorical: Proportional odds
• Censored lifetimes: Proportional hazards
• Multinomial: Log-linear
Case Ascertainment
• The way in which data are obtained (ascertained) can affect the way that we build a model
• For example in a case-control study we sample by outcome (cases and controls) and then measure how these two differ by exposure
  – Example
    • Case: lung cancer, Control: other cancer
    • Exposure: smoker versus non-smoker
• We cannot model relative risk using such data
• We can only model (log) odds ratios
• For a cohort study where we sample by exposure we could model either
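A minimal sketch of why the odds ratio survives case-control sampling while the relative risk does not. All counts are hypothetical and chosen only for illustration, and controls are taken here to be non-cases in general rather than the "other cancer" controls of the example above.

```python
# Hypothetical population counts (illustrative only, not from any study).
population = {
    "smoker": {"cases": 300, "non_cases": 9_700},
    "non_smoker": {"cases": 100, "non_cases": 19_900},
}


def relative_risk(table):
    """Risk of disease among smokers divided by risk among non-smokers."""
    def risk(group):
        counts = table[group]
        return counts["cases"] / (counts["cases"] + counts["non_cases"])
    return risk("smoker") / risk("non_smoker")


def odds_ratio(table):
    """Cross-product ratio of the 2 x 2 table."""
    s, n = table["smoker"], table["non_smoker"]
    return (s["cases"] * n["non_cases"]) / (n["cases"] * s["non_cases"])


def case_control_sample(table, control_fraction):
    """Keep every case but only a fraction of the non-cases (expected counts)."""
    return {
        group: {
            "cases": counts["cases"],
            "non_cases": counts["non_cases"] * control_fraction,
        }
        for group, counts in table.items()
    }


sample = case_control_sample(population, control_fraction=0.01)

print("Population   RR:", relative_risk(population), " OR:", odds_ratio(population))
print("Case-control RR:", relative_risk(sample), " OR:", odds_ratio(sample))
```

Sampling by outcome rescales the non-case counts in both exposure groups by the same factor, which cancels in the cross-product ratio but not in the "risks" computed from the sampled table.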
Social Status: Longer life expectancy for Oscar winners
A study of actors and actresses found that Oscar winners lived, on average, almost four years longer than nominees who went home empty-handed, reports the March issue of the Harvard Health Letter. Actors aren’t the only people who reap benefits. Dr. Donald Redelmeier of Toronto’s Sunnybrook and Women’s College Health Sciences Centre found that Oscar-winning directors live longer than non-winners, and male directors live 4.5 years longer on average than actors. These findings add to a large body of evidence delineating connections between social status and health and longevity, reports the Harvard Health Letter. Redelmeier theorizes that an Oscar on the mantel moves the winner up the Hollywood pecking order. Winners find it easier to get work, and when they do, they’re better appreciated and better paid.
Not Harvard Health Publications
A study has shown that getting a telegram from The Queen can add 20 years to your life
An extensive study of individuals who have received telegrams from The Queen has shown that an astonishing proportion of them have lived to be 100. Age at death of a control group of non-recipients was typically 20 years less.
Researchers have postulated that esteem is an important determinant of health
Joked lead researcher, Prof Morton Gullible, ‘our advice to her Majesty is send yourself a telegram, Ma'am’
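The flaw in the telegram "study" can be sketched with a small simulation. The lifetime distribution below (Normal with mean 80 and standard deviation 12 years) is purely an assumption for illustration; the telegram is given no effect on survival at all, yet recipients still appear to live far longer, because only people who have already reached 100 can receive one.

```python
import random
import statistics

random.seed(1)

# Hypothetical ages at death; the telegram plays no role in generating them.
ages_at_death = [random.gauss(80, 12) for _ in range(100_000)]

# A telegram is sent on the 100th birthday, so receiving one
# requires having survived to 100 in the first place.
recipients = [age for age in ages_at_death if age >= 100]
non_recipients = [age for age in ages_at_death if age < 100]

gap = statistics.mean(recipients) - statistics.mean(non_recipients)
print(f"Recipients:     {len(recipients)} people, mean age at death "
      f"{statistics.mean(recipients):.1f}")
print(f"Non-recipients: {len(non_recipients)} people, mean age at death "
      f"{statistics.mean(non_recipients):.1f}")
print(f"Apparent 'benefit' of a telegram: {gap:.1f} years")
```

The same question, whether the "exposure" could only be acquired by surviving long enough, is worth asking of the Oscar comparison.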
Results of Model Fitting
• Statisticians have developed a number of techniques for assessing the adequacy of various models using the data in hand
  – Standard errors, significance tests on coefficients
  – Analysis of variance/deviance on factors
  – Goodness of fit generally
  – Residual plots
  – AIC, BIC
• These are important tools but are by no means the only tools for assessing the adequacy of a model
Transforming predictors
The X Files
Luxembourg Temperature Example
Data on temperatures in Luxembourg
Month       Normal temperature (deg C)
January      0.6
February     1.4
March        4.7
April        7.7
May         12.4
June        15.1
July        17.5
August      17.3
September   13.5
October      8.9
November     4.0
December     1.8
Modelling the temperature
Note that in the yearly rhythm, January follows December even though January is point 1 and December point 12.
The data are periodic and we need a model that reflects this.
The simplest periodic pattern is a sine wave.
$Y = \alpha + b \sin(X + \theta)$
$\alpha$ = level (the average temperature)
$b$ = amplitude (the difference max to average)
$\theta$ = phase (governs point at which maximum is reached)
Fitting a sine wave
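A sine wave model can be fitted by using the fact that
$\sin(X + \theta) = \sin X \cos\theta + \cos X \sin\theta$, so that $Y = \alpha + (b\cos\theta)\sin X + (b\sin\theta)\cos X$.
This is linear in $\sin X$ and $\cos X$. Hence by regressing Y on these two variables we can obtain a periodic fit.
Note that X must be transformed from linear to angular measure. So, with $m$ the month number (1 to 12), we can write
$X = 360m/12 = 30m$ if we measure in degrees or
$X = 2\pi m/12$ radians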
3 parameters fit 12 points rather well
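The fit can be sketched in a few lines, writing the model as temperature = level + amplitude × sin(X + phase) and regressing on sin X and cos X. This is an illustration under stated assumptions rather than the calculation behind the slide: months are numbered 1 to 12, and because the twelve angles are equally spaced the sine and cosine columns are orthogonal to each other and to the constant, so the least-squares coefficients reduce to simple sums. The variable names are hypothetical.

```python
import math

# Normal monthly temperatures in Luxembourg (deg C), January to December.
temps = [0.6, 1.4, 4.7, 7.7, 12.4, 15.1, 17.5, 17.3, 13.5, 8.9, 4.0, 1.8]
n = len(temps)

# Transform month number (1..12) to angular measure in radians.
angles = [2 * math.pi * month / 12 for month in range(1, n + 1)]

# Least-squares estimates for Y = level + beta_sin*sin(X) + beta_cos*cos(X).
# With 12 equally spaced angles the regressors are orthogonal, so each
# coefficient is a simple (scaled) sum rather than a matrix solve.
level = sum(temps) / n
beta_sin = 2 / n * sum(y * math.sin(x) for y, x in zip(temps, angles))
beta_cos = 2 / n * sum(y * math.cos(x) for y, x in zip(temps, angles))

# Convert back to amplitude and phase: beta_sin = b*cos(phase), beta_cos = b*sin(phase).
amplitude = math.hypot(beta_sin, beta_cos)
phase = math.atan2(beta_cos, beta_sin)

fitted = [level + amplitude * math.sin(x + phase) for x in angles]
residuals = [y - f for y, f in zip(temps, fitted)]

print(f"level = {level:.2f}, amplitude = {amplitude:.2f}, "
      f"phase = {math.degrees(phase):.1f} degrees")
print("largest absolute residual:", round(max(abs(r) for r in residuals), 2))
```

For unequally spaced or missing months the orthogonality shortcut no longer holds, and an ordinary multiple regression on the two transformed variables would be needed.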
Transforming the outcome
Being wise about Ys
An Example of a One-way Layout
• Four experimental p38 kinase inhibitors
• Vehicle and marketed product as controls
• Thromboxane B2 (TXB2) is used as a marker of COX-1 activity
• Six rats per group were treated for a total of 36 rats
• At the end of the study rats are sacrificed and TXB2 is measured.
GenStat® ANOVA (Original data)
Analysis of variance
Variate: TXB2
Source of variation   d.f.      s.s.     m.s.   v.r.   F pr.
Treatment                5   184596.   36919.   6.31   <.001
Residual                30   175439.    5848.
Total                   35   360035.
A2WAY [TREATMENTS=Treatment] TXB2
GenStat plot of residuals
GenStat ANOVA (log transformed)
A2WAY [TREATMENTS=Treatment] logTXB2
Analysis of variance
Variate: logTXB2
Source of variation   d.f.      s.s.      m.s.    v.r.
Treatment                5   62.6760   12.5352   40.09
Residual                30    9.3800    0.3127
Total                   35   72.0559
Signal to noise ratio is now much higher
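As a quick check, the variance ratios in the two GenStat tables can be recomputed from the sums of squares and degrees of freedom shown there. The helper name below is hypothetical; the numbers are those in the tables.

```python
def variance_ratio(ss_treatment, df_treatment, ss_residual, df_residual):
    """Return (treatment mean square, residual mean square, variance ratio)."""
    ms_treatment = ss_treatment / df_treatment
    ms_residual = ss_residual / df_residual
    return ms_treatment, ms_residual, ms_treatment / ms_residual


# Sums of squares and degrees of freedom as printed in the two ANOVA tables.
original = variance_ratio(184596.0, 5, 175439.0, 30)
logged = variance_ratio(62.6760, 5, 9.3800, 30)

print(f"Original data:   m.s. = {original[0]:.0f}, {original[1]:.0f}; v.r. = {original[2]:.2f}")
print(f"Log transformed: m.s. = {logged[0]:.4f}, {logged[1]:.4f}; v.r. = {logged[2]:.2f}")
```

The variance ratio rises from about 6.3 to about 40 although the data and the design are unchanged; only the scale of analysis differs.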
GenStat plot of residuals
Homogeneity of Variances (Bartlett's Test: GenStat)
Untransformed
*** Bartlett's Test for homogeneity of variances ***
Chi-square 50.87 on 5 degrees of freedom: probability < 0.001
Log-transformed
*** Bartlett's Test for homogeneity of variances ***
Chi-square 8.95 on 5 degrees of freedom: probability 0.111
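A sketch of why a log transformation can restore homogeneity of variance when errors act multiplicatively. The data below are constructed, not the rat data: six hypothetical groups share exactly the same spread on the log scale but have different log-scale means, so on the original scale the spread grows with the mean. The function implements the usual Bartlett chi-square statistic; it is not GenStat's code.

```python
import math
import statistics


def bartlett_statistic(groups):
    """Bartlett's chi-square statistic for equality of group variances."""
    k = len(groups)
    dfs = [len(g) - 1 for g in groups]
    variances = [statistics.variance(g) for g in groups]
    df_total = sum(dfs)
    pooled = sum(d * v for d, v in zip(dfs, variances)) / df_total
    numerator = df_total * math.log(pooled) - sum(
        d * math.log(v) for d, v in zip(dfs, variances)
    )
    correction = 1 + (sum(1 / d for d in dfs) - 1 / df_total) / (3 * (k - 1))
    return numerator / correction


# Constructed example: same deviations around each group's log-scale mean.
log_means = [3.0, 3.5, 4.0, 4.5, 5.0, 5.5]        # hypothetical group means
deviations = [-0.6, -0.3, -0.1, 0.1, 0.3, 0.6]    # hypothetical within-group spread
log_groups = [[mu + e for e in deviations] for mu in log_means]
raw_groups = [[math.exp(value) for value in group] for group in log_groups]

raw_stat = bartlett_statistic(raw_groups)
log_stat = bartlett_statistic(log_groups)
print(f"Bartlett chi-square, original scale: {raw_stat:.2f} on {len(log_means) - 1} d.f.")
print(f"Bartlett chi-square, log scale:      {log_stat:.2f} on {len(log_means) - 1} d.f.")
```

Because the log-scale variances are identical by construction, the statistic there is essentially zero, while on the original scale it exceeds 11.07, the 5% critical value of chi-square on 5 degrees of freedom. Real data, as in the GenStat output above, will not be this clean.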
Data-filtering examples
or find the flaw
• A 20 year follow-up study of women in an English village found higher survival amongst smokers than non-smokers
• Transplant receivers on highest doses of cyclosporine had higher probability of graft rejection than on lower doses
• Left-handers observed to die younger on average than right-handers
• Obese infarct survivors have better prognosis than non-obese
Advice
Statistics is a way of improving your thinking, not a substitute for it
Advice
• Think before you model
• Purpose is key
  – Causal
  – Predictive
  – Classification
• Think about time
• Think about case ascertainment
• Testing is a small part of discerning
• Don’t use stepwise regression as a substitute for understanding