PKA & LTS, Sect. 3.1, 3.1.1 Regression models I
Transcript of PKA & LTS, Sect. 3.1, 3.1.1 Regression models I
University of Copenhagen, Department of Biostatistics
Faculty of Health Sciences
Regression models
Binary covariate, Quantitative outcome, 19-4-2012
Lene Theil Skovgaard
Department of Biostatistics (Biostatistisk Afdeling)
Binary covariate, Quantitative outcome
PKA & LTS, Sect. 3.1, 3.1.1
T-tests
- Example: Body mass index and Vitamin D
- Summary statistics
- Estimation of group differences
- Confidence intervals and tests
- Check of model assumptions

Home pages: http://biostat.ku.dk/~pka/regrmodels12
E-mail: [email protected]
One categorical covariate
Nature of outcome   Dichotomous/Binary   General categorical
                    (two levels)         (three or more levels)
Quantitative        “T-tests”            Anova
Binary              2*2 tables           (k+1)*2 tables
Survival data       log-rank test        (k+1)-sample log-rank test
Examples of binary covariates
- Treatment: typically randomized studies, controllable
- Gender: not controllable, differences may be due to ....
- Body stature: categorized quantitative variable, e.g. weight or body mass index:

$$\mathrm{BMI} = \frac{\text{weight in kg}}{(\text{height in m})^2}$$
Body stature
BMI may be categorized
- in 2 levels:
  - normal weight (18.5 < BMI < 25)
  - overweight (BMI ≥ 25)
- in 3 levels:
  - normal weight (18.5 < BMI < 25)
  - slight overweight (25 ≤ BMI < 30)
  - obese (BMI ≥ 30)
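The BMI formula and the two groupings can be written out as a small sketch. The function names are made up for illustration, and the label returned for BMI ≤ 18.5 is an assumption, since the slides give no category below 18.5.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kg divided by squared height in m."""
    return weight_kg / height_m ** 2


def stature_two_levels(bmi_value: float) -> str:
    """Two-level grouping: normal weight vs. overweight."""
    if bmi_value >= 25:
        return "overweight"
    if bmi_value > 18.5:
        return "normal weight"
    return "not classified"  # assumption: no label is given for BMI <= 18.5


def stature_three_levels(bmi_value: float) -> str:
    """Three-level grouping: normal weight, slight overweight, obese."""
    if bmi_value >= 30:
        return "obese"
    if bmi_value >= 25:
        return "slight overweight"
    if bmi_value > 18.5:
        return "normal weight"
    return "not classified"  # assumption: no label is given for BMI <= 18.5


# Hypothetical person: 70 kg, 1.75 m
example_bmi = bmi(70.0, 1.75)
print(round(example_bmi, 1), stature_two_levels(example_bmi))
```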
The Vitamin D example
Idea: What is the association (if any) between body mass index and vitamin D status?

Covariate: Body stature in two groups, normal weight vs. overweight
Outcome: vitamin D status, as measured by 25-hydroxy vitamin D (25OHD) in serum (in nmol/L)

$x_i$: body mass index (BMI) for the $i$th woman
$y_i$: vitamin D status (S25OHD) for the $i$th woman
Scatterplot of vitamin D vs. stature group
Reasonably symmetric distributions, maybe slightly skewed towards large values
7 / 35
Model for vitamin D
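The $y_i$s are independent, with mean values

$$E(y_i) = \begin{cases} m_0 & \text{if subject } i \text{ is normal weight } (18.5 < \mathrm{BMI} < 25), \\ m_1 & \text{if subject } i \text{ is overweight } (\mathrm{BMI} \geq 25) \end{cases}$$

$$E(y_i) = m_0 + (m_1 - m_0)\, I(x_i \geq 25) = a + b\, I(x_i \geq 25)$$

where we have defined the new parameters

$$a = m_0, \quad b = m_1 - m_0.$$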
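A minimal sketch of why the parametrization works: regressing the outcome on the indicator $I(x_i \geq 25)$ by least squares gives an intercept equal to the normal weight average and a slope equal to the difference in averages. The five (BMI, vitamin D) pairs below are invented for illustration and are not the study data.

```python
from statistics import mean

# Hypothetical (BMI, vitamin D in nmol/L) pairs, invented for illustration only
data = [(21.0, 60.0), (23.5, 50.0), (24.0, 58.0), (27.0, 40.0), (31.0, 46.0)]

x = [1.0 if bmi_value >= 25 else 0.0 for bmi_value, _ in data]  # I(x_i >= 25)
y = [vitamin_d for _, vitamin_d in data]

# Ordinary least squares for y = a + b * indicator
x_bar, y_bar = mean(x), mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b_hat = sxy / sxx
a_hat = y_bar - b_hat * x_bar

# Group averages computed directly
mean0 = mean([yi for xi, yi in zip(x, y) if xi == 0.0])
mean1 = mean([yi for xi, yi in zip(x, y) if xi == 1.0])

print(a_hat, mean0)          # intercept vs. normal weight average
print(b_hat, mean1 - mean0)  # slope vs. difference in averages
```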
Statistical model
Specification of
- Independence assumptions
- One or more interesting parameters, e.g. mean value
- Specification of such parameters, as functions of certain covariates
- Class of distribution, e.g. Binomial, Normal etc.
Statistical analysis
- Model checking: Are the assumptions fulfilled?
- Estimation: Which values of the parameters are best suited for explaining the observations? How precisely can we conclude?
- Hypothesis test (Model reduction): Can we do almost as well with a simpler model? (e.g. some parameters set to zero)
Summary statistics for Vitamin D status
Group              Number   Average   Median   Standard deviation
j                  n_j      m_j       M_j      s_j
0: Normal weight   16       56.138    52.350   21.941
1: Overweight      25       42.804    41.100   17.562
Measures of location
Average: $\bar{y} = \frac{1}{n}\sum y_i = \frac{1}{n}(y_1 + \cdots + y_n)$
  sensitive to deviations from symmetry

Median: The middle observation, 50% quantile
  robust with respect to the shape of the distribution
Measures of variation
- variance, $s^2 = \frac{1}{n-1}\sum (y_i - \hat{m})^2 = \frac{1}{n-1}\sum (y_i - \bar{y})^2$
- standard deviation, $s = \sqrt{\text{variance}}$
- special quantiles/percentiles:
  - quartiles: 25% and 75% quantiles
  - 1%, 2½%, 5% quantiles etc.
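The measures of location and variation can be computed with Python's standard `statistics` module, as sketched below. The five values are hypothetical, not the vitamin D data. `statistics.variance` and `statistics.stdev` use the $n-1$ denominator, matching the formula for $s^2$.

```python
from statistics import mean, median, variance, stdev, quantiles

# Hypothetical measurements, invented for illustration only
values = [38.0, 45.5, 52.0, 61.5, 83.0]

avg = mean(values)        # average, sensitive to skewness
med = median(values)      # middle observation, 50% quantile
var = variance(values)    # sum of squared deviations divided by n - 1
sd = stdev(values)        # square root of the variance
q1, q2, q3 = quantiles(values, n=4)  # quartile cut points (25%, 50%, 75%)

print(avg, med, var, sd)
print(q1, q2, q3)
```

In this made-up sample the average exceeds the median, which is what a distribution skewed towards large values produces.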
Interpretation of the standard deviation, s
“Most” of the observations can be found in the interval

$$\bar{y} \pm \text{approx. } 2 \times s$$

i.e. the probability that a randomly chosen subject from a population has a value in this interval is “large”...

If the variable is Normally distributed, this interval contains approx. 95% of future observations. If not....

In order to use the above interval, we should at least have reasonable symmetry....
Reference regions
Regions containing 95% of the 'typical' (middle) observations, i.e. from the 2½% quantile to the 97½% quantile.

If a distribution fits well to a Normal distribution $N(m, s^2)$, then these quantiles can be directly calculated as follows:

2½%-quantile: $m - 1.96 s \approx \bar{y} - 1.96 s$
97½%-quantile: $m + 1.96 s \approx \bar{y} + 1.96 s$

and the reference region is therefore calculated as

$$\bar{y} \pm \text{approx. } 2 \times s = (\bar{y} - \text{approx. } 2 \times s,\ \bar{y} + \text{approx. } 2 \times s)$$
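As a worked sketch, the Normal-based reference region can be computed from the group averages and standard deviations in the summary statistics table. The helper name `reference_region` is made up here, and the result is only meaningful if the Normal distribution fits reasonably well.

```python
def reference_region(average: float, sd: float, z: float = 1.96) -> tuple:
    """Normal-based 95% reference region: average -/+ 1.96 * sd."""
    return (average - z * sd, average + z * sd)


# Averages and standard deviations from the summary statistics table (nmol/L)
normal_weight = reference_region(56.138, 21.941)
overweight = reference_region(42.804, 17.562)

print(tuple(round(v, 1) for v in normal_weight))
print(tuple(round(v, 1) for v in overweight))
```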
The Normal distribution, simulations
mean = median
symmetry around mean
Uses of the standard deviation, s
- Reference regions: In order for these to be sensible, the distribution has to be close to a Normal distribution
- Precision/uncertainty in parameter estimates, such as $m$, or rather $m_1 - m_2$: not so dependent upon a Normal distribution if we have a large sample size. We refer to the standard deviation of a parameter estimate (often called a standard error)
Estimation of group difference
Back to the example:
$$E(y_i) = a + b\, I(x_i \geq 25)$$

The regression coefficient $b$ denotes in this case the difference between group means, $b = m_1 - m_0$, and therefore

$$\hat{b} = \bar{y}_1 - \bar{y}_0$$

This is an unbiased estimate, meaning that it has the correct mean value

$$E(\hat{b}) = E(\bar{y}_1 - \bar{y}_0) = m_1 - m_0 = b$$
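Unbiasedness can be illustrated by simulation: draw many pairs of samples with known group means and check that the estimates $\bar{y}_1 - \bar{y}_0$ average out to $m_1 - m_0$. The true means, the common standard deviation, the Normal distribution and the seed below are all assumptions chosen for illustration; only the group sizes 16 and 25 are taken from the example.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the run is reproducible

# Assumed "true" values, loosely inspired by the example but hypothetical
m0, m1, sd = 56.0, 43.0, 19.0
n0, n1 = 16, 25  # group sizes as in the example


def one_estimate() -> float:
    """Simulate one study and return the difference in group averages."""
    y0 = [random.gauss(m0, sd) for _ in range(n0)]
    y1 = [random.gauss(m1, sd) for _ in range(n1)]
    return mean(y1) - mean(y0)


estimates = [one_estimate() for _ in range(5000)]
avg_estimate = mean(estimates)

print(round(avg_estimate, 2), "should be close to", m1 - m0)
```

Each single estimate varies from study to study, but the average over many simulated studies settles near the true difference.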
Confidence intervals
For reasonably large samples, the estimate $\hat{b}$ has approximately a Normal distribution, and a 95% confidence interval for the difference $b = m_1 - m_0$ is therefore

$$\hat{b} \pm 1.96 \cdot SD(\hat{b}) = \bar{y}_1 - \bar{y}_0 \pm 1.96 \cdot SD(\bar{y}_1 - \bar{y}_0)$$

where $SD(\hat{b})$ is the standard deviation of $\hat{b}$

Parameter       Estimate   SD of Estimate   95% Confidence Interval
m_0             56.138     5.485            (45.387, 66.889)
m_1             42.804     3.512            (35.920, 49.688)
b = m_1 − m_0   −13.334    6.199            (−25.484, −1.184)
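The numbers in the table can be retraced from the summary statistics. The sketch below assumes that the SD of the difference is based on a pooled standard deviation (equal standard deviations in the two groups). The slides do not state the formula, but this assumption reproduces the value 6.199.

```python
from math import sqrt

# Summary statistics from the vitamin D example
n0, mean0, s0 = 16, 56.138, 21.941  # normal weight
n1, mean1, s1 = 25, 42.804, 17.562  # overweight

# Standard errors of the two group averages
se0 = s0 / sqrt(n0)
se1 = s1 / sqrt(n1)

# Assumption: pooled variance, i.e. a common standard deviation in both groups
pooled_var = ((n0 - 1) * s0 ** 2 + (n1 - 1) * s1 ** 2) / (n0 + n1 - 2)
se_diff = sqrt(pooled_var * (1 / n0 + 1 / n1))

b_hat = mean1 - mean0
ci = (b_hat - 1.96 * se_diff, b_hat + 1.96 * se_diff)

print(round(se0, 3), round(se1, 3), round(se_diff, 3))
print(round(b_hat, 3), tuple(round(v, 3) for v in ci))
```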
Technicalities
- For small samples, the normality assumption for the estimate may be poor
- Instead, we use a Student t distribution, here with $n_0 + n_1 - 2 = 16 + 25 - 2 = 39$ degrees of freedom
- Most software uses this t distribution as the standard

b = m_1 − m_0   Estimate   SD      95% CI              |t|    P-value
Normal          −13.334    6.199   (−25.484, −1.184)   2.15   0.032
t               −13.334    6.199   (−25.875, −0.793)   2.15   0.038
The confidence interval becomes a little wider
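To see where the wider interval comes from, the t distribution with 39 degrees of freedom can be evaluated using only the standard library. This is a rough numerical sketch (Simpson's rule for the distribution function, bisection for the quantile), not how statistical software does it. The helper names are made up for illustration.

```python
from math import gamma, pi, sqrt


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x: float, df: int, steps: int = 2000) -> float:
    """P(T <= x), integrating the density on [0, |x|] with Simpson's rule."""
    a = abs(x)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def t_quantile(p: float, df: int) -> float:
    """Quantile for p > 0.5, found by bisection on the distribution function."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


b_hat, sd_b, df = -13.334, 6.199, 39
q = t_quantile(0.975, df)                       # replaces 1.96
ci_t = (b_hat - q * sd_b, b_hat + q * sd_b)     # t-based 95% interval
p_value = 2 * (1 - t_cdf(abs(b_hat / sd_b), df))

print(round(q, 3), tuple(round(v, 2) for v in ci_t), round(p_value, 3))
```

The multiplier comes out slightly above 1.96, which is why the t-based interval is a little wider and the P-value a little larger than under the Normal approximation.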
Test of hypothesis of equal means
$H_0: b = 0$ or, equivalently, $H_0: m_0 = m_1$.

Test statistic (Wald's test):

$$t = \frac{\hat{b}}{SD(\hat{b})} = \frac{\bar{y}_1 - \bar{y}_0}{SD(\bar{y}_1 - \bar{y}_0)}$$

Here, the test statistic becomes

$$t = \frac{-13.334}{6.199} = -2.15$$
What can we say from this?
Distribution of test statistic
If we repeatedly compared two identical groups, which values of $t$ would we get?

For large samples, this distribution is $N(0, 1)$, and $t$ is therefore large/uncommon if it falls outside $\pm 1.96$

If $|t| \geq 1.96$, we reject the hypothesis at a 5% level (the significance level)
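Applied to the example, the decision rule amounts to the following small sketch, using the estimate and its SD from the confidence interval table and the large-sample cut-off 1.96.

```python
b_hat = -13.334   # estimated difference, overweight minus normal weight
sd_b = 6.199      # standard deviation (standard error) of the estimate

t_stat = b_hat / sd_b               # Wald test statistic
reject_h0 = abs(t_stat) >= 1.96     # 5% level, Normal approximation

print(round(t_stat, 2), reject_h0)
```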
The P-value
The probability of this or worse, provided that the hypothesis is true,
worse meaning something speaking more against the hypothesis than what we observed
Philosophy of P-value
If the probability of observing something worse is very small, it must be pretty bad
The P-value is the tail probability
If P is below the significance level, the test is significant, and the hypothesis is rejected
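For the vitamin D example, the two-sided tail probability under the large-sample $N(0, 1)$ approximation can be sketched with the standard library. The result is close to the 0.032 reported for the Normal approximation. A small discrepancy in the third decimal is to be expected, because the inputs here are rounded.

```python
from statistics import NormalDist

t_stat = -13.334 / 6.199  # Wald test statistic from the example

# Two-sided P-value: probability of a value at least as extreme as |t|
# in either tail of the standard Normal distribution
p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))

significant = p_value < 0.05  # 5% significance level

print(round(p_value, 4), significant)
```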
Tests vs. confidence intervals
A significant test statistic: Confidence interval not including 0

A non-significant test statistic: Confidence interval including 0

Tests and confidence intervals are equivalent

Here, it is exact, sometimes it is only approximate
Traditional model assumptions
- Independence: checked by knowledge
- Equal standard deviations (variance homogeneity): checked from graphics (or test), here just the scatterplot of vitamin D vs. stature group (p. 7)
- Normality: checked from graphics (or test)
The assumption of Normality
Check of the Normality assumption should be based on visual inspection of residuals

Residual: observation ($y_i$) minus expected value ($m_i$), estimated by the average of the subject's group:

$$r_i = y_i - \hat{m}_{j(i)},$$

where $j(i)$ denotes the group containing subject $i$.
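A minimal sketch of the residual calculation, with invented data (group 0 = normal weight, group 1 = overweight). Each observation has its own group's average subtracted, so the residuals sum to zero within each group.

```python
from statistics import mean

# Hypothetical group labels j(i) and outcomes y_i, invented for illustration
groups = [0, 0, 0, 1, 1]
y = [60.0, 50.0, 58.0, 40.0, 46.0]

# Estimated group means
group_means = {
    j: mean([yi for g, yi in zip(groups, y) if g == j]) for j in set(groups)
}

# Residual: observation minus the average of the group it belongs to
residuals = [yi - group_means[g] for g, yi in zip(groups, y)]

print(group_means)
print(residuals)
```

These residuals would then be inspected graphically, e.g. in a histogram or a quantile plot.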
Residual plots
Fewer assumptions lead to more vague conclusions
- Unequal standard deviations: Welch test is conservative
- Normality unreasonable: Mann-Whitney nonparametric test is conservative
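As an illustration of the first point, the Welch version of the test can be sketched from the summary statistics of the vitamin D example. It uses the standard Welch formulas (separate variances and Satterthwaite's approximate degrees of freedom). These numbers are computed here and do not appear in the slides.

```python
from math import sqrt

# Summary statistics from the vitamin D example
n0, mean0, s0 = 16, 56.138, 21.941  # normal weight
n1, mean1, s1 = 25, 42.804, 17.562  # overweight

v0 = s0 ** 2 / n0  # squared standard error of group 0 average
v1 = s1 ** 2 / n1  # squared standard error of group 1 average

se_welch = sqrt(v0 + v1)                 # no pooling of variances
t_welch = (mean1 - mean0) / se_welch

# Satterthwaite approximation to the degrees of freedom
df_welch = (v0 + v1) ** 2 / (v0 ** 2 / (n0 - 1) + v1 ** 2 / (n1 - 1))

print(round(se_welch, 3), round(t_welch, 2), round(df_welch, 1))
```

Compared with the equal-variance analysis, the standard error is larger and the degrees of freedom are fewer than 39, so the conclusion is somewhat weaker.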
Transformation of the outcome
may resolve both inhomogeneity of standard deviations and deviation from Normality.
By far the most common transformation: Logarithms

$$y_i^* = \log_{10}(y_i)$$

Group           Number   Average   Median   SD
Normal weight   16       1.720     1.719    0.164
Overweight      25       1.593     1.614    0.193
Difference               0.127              0.058
Scatterplot of logarithms
Back transformation
of mean difference: $10^{-0.127} = 0.75$

What is the interpretation of this?

Since mean ≈ median (for the logarithms), and medians transform nicely,

$$\log_{10}(M_1) - \log_{10}(M_0) = \log_{10}\left(\frac{M_1}{M_0}\right) = -0.127$$

and hence the backtransformed quantity estimates the ratio of the medians, $\frac{M_1}{M_0} = 0.75$
Conclusion from logarithmic analysis
- Overweight women have a 25% lower median S25OHD compared to the normal weight women
- The confidence interval for this ratio is $(10^{-0.245}, 10^{-0.009}) = (0.57, 0.98)$, so that we cannot rule out that
  - overweight women may have a substantially (43%) lower S25OHD than normal weight women
  - there may be hardly any difference at all (only 2% lower)
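The back transformation behind these percentages is plain arithmetic and can be checked as follows, using the log10-scale difference and confidence limits quoted above.

```python
# Difference (overweight minus normal weight) and 95% CI on the log10 scale
diff_log10 = -0.127
ci_log10 = (-0.245, -0.009)

# Back transformation: 10 to the power of each log10-scale quantity
ratio = 10 ** diff_log10
ci_ratio = tuple(10 ** v for v in ci_log10)

# Express each ratio as "percent lower" relative to normal weight women
pct_lower = [round(100 * (1 - r)) for r in (ratio, *ci_ratio)]

print(round(ratio, 2), tuple(round(v, 2) for v in ci_ratio), pct_lower)
```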
Check of normality on log-scale
The paired t-test
If the two groups are not really groups, but rather two situations for the same individuals (e.g., before and after an intervention), data are paired and the relevant method of comparison is the paired t-test.

This will be dealt with later on in the course (May 7)