Department of Social Policy and Social Work, University of York [email protected]
Lecture 2: Linear Models I. Olivier Missa, [email protected]. Advanced Research Skills.
Outline
What are linear models?
Basic assumptions
"Refresher" on different types of model:
Single linear regression
Two-sample t-test
ANOVA (one-way & two-way)
ANCOVA
Multiple linear regression
What are linear models?
Put simply, they are models attempting to "explain" one response variable by combining several predictor variables linearly.
Y = β₀ + β₁X₁ + β₂X₂ + ε
In theory, the response variable and the predictor variables
can either be continuous, discrete or categorical.
This makes them particularly versatile.
Indeed, many well known statistical procedures are linear models.
e.g. Linear regression, Two-sample t-test,
One-way & Two-way ANOVA, ANCOVA, ...
What are linear models?
For the time being, we are going to assume that
(1) the response variable is continuous.
(2) the residuals (ε) are normally distributed and ...
(3) ... independently and identically distributed.
These "three" assumptions define classical linear models.
We will see in later lectures ways to bypass these assumptions
to be able to cope with an even wider range of situations
(generalized linear models, mixed-effects models).
But let's take things one step at a time.
Starting with a simple example: single linear regression
1 response variable (continuous) vs.
1 explanatory variable, either continuous or discrete (but ordinal!).
Attempts to fit a straight line, y = a + b·x
> library(faraway)
> data(seatpos)
> attach(seatpos)
> model <- lm(Weight ~ Ht)
> model
Coefficients:
(Intercept) Ht
-292.992 2.653
Y = β₀ + β₁X + ε
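The straight-line fit above comes from the closed-form least-squares formulas b = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and a = ȳ − b·x̄. A minimal Python sketch on made-up numbers (not the seatpos data used in the R code):

```python
def ols_fit(x, y):
    """Closed-form least squares for y = a + b*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # slope: sum of cross-deviations divided by sum of squared x-deviations
    b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
    a = ybar - b * xbar  # the fitted line passes through (xbar, ybar)
    return a, b

# points lying exactly on y = 1 + 2x, so the fit recovers a = 1, b = 2
a, b = ols_fit([1, 2, 3, 4], [3, 5, 7, 9])
```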
But how "significant" is this trend? This can be approached in a number of ways.

1) R², the proportion of variance explained:
R² = 1 − RSS / TotalSS
where RSS, the Residual Sum of Squares (the "deviance"), is Σᵢ (yᵢ − ŷᵢ)², with ŷᵢ the fitted values,
and TotalSS, the Total Sum of Squares, is Σᵢ (yᵢ − ȳ)², with ȳ the average value.
Here R² = 0.686 (RSS = 14853, TotalSS = 47371).
1) R², the proportion of variance explained (continued):
R² = SSreg / TotalSS = (TotalSS − RSS) / TotalSS = 1 − RSS / TotalSS
where SSreg, the Sum of Squares due to regression, is Σᵢ (ŷᵢ − ȳ)².
SSreg = 32518, RSS = 14853, TotalSS = 47371.
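These identities can be checked against each other numerically; a quick Python sketch using the sums of squares reported on the slides:

```python
RSS = 14853      # residual sum of squares (from the slide)
TotalSS = 47371  # total sum of squares (from the slide)
SSreg = TotalSS - RSS  # sum of squares due to regression

R2 = 1 - RSS / TotalSS  # equivalently SSreg / TotalSS
```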
1) R², the proportion of variance explained.

> summary(model)

Call:
lm(formula = Weight ~ Ht)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.388 -12.213  -4.876   4.945  59.586 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -292.9915    50.6402  -5.786 1.34e-06 ***
Ht             2.6533     0.2989   8.878 1.35e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 20.31 on 36 degrees of freedom
Multiple R-squared: 0.6865, Adjusted R-squared: 0.6777 
F-statistic: 78.82 on 1 and 36 DF, p-value: 1.351e-10
2) model parameters and their standard errors
> summary(model)
... some outputs left out
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -292.9915 50.6402 -5.786 1.34e-06 ***
Ht 2.6533 0.2989 8.878 1.35e-10 ***
...
How many standard errors away from 0 are the estimates?
e.g. for Ht: t = 2.6533 / 0.2989 = 8.878
> anova(model)
Analysis of Variance Table
Response: Weight
Df Sum Sq Mean Sq F value Pr(>F)
Ht 1 32518 32518 78.816 1.351e-10 ***
Residuals 36 14853 413
...
3) F-test
MSreg = SSreg / dfreg
MSE = RSS / rdf
F = MSreg / MSE
The F-value so obtained must be compared to the theoretical F distribution with 1 (numerator) and 36 (denominator) degrees of freedom.
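The F statistic can likewise be recomputed from the sums of squares and degrees of freedom in the anova table (a Python sketch, using the values from the slide):

```python
SSreg, RSS = 32518, 14853  # sums of squares from the anova table
df_reg, rdf = 1, 36        # regression and residual degrees of freedom

MSreg = SSreg / df_reg  # mean square due to regression
MSE = RSS / rdf         # mean square error (~413, as in the table)
F = MSreg / MSE         # ~78.8, matching the summary output
```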
How strong is this trend?
[Figure: fitted line with the 95% confidence interval around the slope and the 95% prediction band for future observations; observations 22, 31 and 13 stand out.]
Are the assumptions met?
1: Is the response variable continuous? YES!
2: Are the residuals normally distributed?

> library(nortest)  ## Package of normality tests
> ad.test(model$residuals)  ## Anderson-Darling

Anderson-Darling normality test
data: model$residuals
A = 1.9511, p-value = 4.502e-05

> plot(model, which=2)  ## qqplot of std. residuals

Answer: NO! Due to a few outliers?
3a : Are the residuals independent ?
> plot(model, which=1) ## residuals vs fitted values.
A bad example : non-linear trend
Answer: looks OK !
3b : Are the residuals identically distributed ?
> plot(model, which=3)  ## sqrt(abs(standardized residuals)) vs fitted values
Continuing our bad example
Answer: Not perfect, but OK !
[Example residual plots labelled "Bad" and "OK".]

Another bad example of residuals: heteroscedasticity.
[Both example plots labelled "Bad".]
Is our model OK, despite having outliers?
Possible approaches:
1) Repeat the analysis, removing the outliers, and check the model parameters.
> summary(model)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -292.9915 50.6402 -5.786 1.34e-06 ***
Ht 2.6533 0.2989 8.878 1.35e-10 ***
> model2 <- lm(Weight ~ Ht, subset=seq(38)[-22])
> summary(model2)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -213.8000 47.4169 -4.509 7.00e-05 ***
Ht 2.1731 0.2813 7.727 4.52e-09 ***
R² = 0.686 (all observations) vs R² = 0.630 (minus obs. #22)
Possible approaches:
2) Examine directly the impact each observation has on the model output.
This depends on (a) its leverage (hᵢᵢ): "how extreme each x-value is compared to the others". A high-leverage point has the potential to influence the fit strongly.
When there is only one predictor variable:
hᵢᵢ = 1/n + (xᵢ − x̄)² / Σⱼ (xⱼ − x̄)²
and Σ hᵢᵢ = 2.
In general (for later): Σ hᵢᵢ = p (the number of parameters in the model).
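A quick Python sketch of the single-predictor leverage formula, on made-up x-values, verifying that the leverages sum to 2 and that the most extreme x-value gets the largest hᵢᵢ:

```python
def leverages(x):
    """Hat values h_ii for a single-predictor linear model."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    # h_ii = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2
    return [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

h = leverages([1, 2, 3, 10])  # 10 is far from the other x-values
```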
However, the leverage is only part of the story.
The direct impact of an observation depends also on (b) its residual
A point with strong leverage but small residual is not a
problem: the model adequately accounts for it.
A point with weak leverage but large residual is not a problem
either: the model is weakly affected by it.
A point with strong leverage and large residual, however,
strongly influences the model.
Removing the latter point will usually modify
the model output to some extent.
Influential Observations
> plot(model, which=5) ## standardized residuals vs Leverage.
[Observations 22, 31 and 13 stand out on the plot.]
A direct measure of influence, combining both leverages and residuals: Cook's distance.
Dᵢ = Σⱼ ( ŷⱼ − ŷⱼ(i) )² / ( p · MSE )
where ŷⱼ(i) are the fitted values when point i is omitted, ŷⱼ the original fitted values, p the number of parameters in the model, and MSE the Mean Square Error = RSS / (n − p).
Combining both leverages and residuals, Cook's distance can be rewritten as:
Dᵢ = εᵢ² / (p · MSE) · hᵢᵢ / (1 − hᵢᵢ)²
(in terms of the raw residuals εᵢ and the leverage values hᵢᵢ)
or, using the standardised residuals rᵢ:
Dᵢ = rᵢ² / p · hᵢᵢ / (1 − hᵢᵢ)
Any Dᵢ value above 0.5 deserves a closer look, and any Dᵢ value above 1 merits special attention.
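The shortcut formula in terms of residuals and leverages is algebraically identical to the leave-one-out definition of Cook's distance. A Python sketch on made-up data (not the seatpos data) computes Dᵢ both ways and checks that they agree:

```python
def fit(x, y):
    """Least-squares intercept and slope."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b = (sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
         / sum((xi - xb) ** 2 for xi in x))
    return yb - b * xb, b

x = [1, 2, 3, 4, 5, 10]              # last x-value has high leverage
y = [2.1, 3.9, 6.2, 8.0, 9.9, 25.0]  # and its y-value is off-trend
n, p = len(x), 2

a, b = fit(x, y)
fitted = [a + b * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]
MSE = sum(e ** 2 for e in resid) / (n - p)

xb = sum(x) / n
sxx = sum((xi - xb) ** 2 for xi in x)
h = [1 / n + (xi - xb) ** 2 / sxx for xi in x]  # leverages

# shortcut: D_i = e_i^2/(p*MSE) * h_ii/(1-h_ii)^2
D_short = [e ** 2 / (p * MSE) * hi / (1 - hi) ** 2
           for e, hi in zip(resid, h)]

# definition: refit without point i, compare all fitted values
D_def = []
for i in range(n):
    ai, bi = fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    D_def.append(sum((fj - (ai + bi * xj)) ** 2
                     for fj, xj in zip(fitted, x)) / (p * MSE))
```

The high-leverage, large-residual point ends up with by far the largest Dᵢ, as the slide argues.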
Remove point #22?
Although point #22 is influential, it does not invalidate the model. We should probably keep it, but note that the regression slope may be shallower than the model suggests.
A "reason" for the presence of these
outliers can be suggested:
some obese people were included
in the survey.
Depending on the purpose of the model,
you may want to keep or remove these outliers.
Next: " 2-sample t-test "
First, let's have a look at the classical t-test.

> data(energy)  ## in ISwR package
> attach(energy)
> plot(expend ~ stature)
> stripchart(expend ~ stature, method="jitter")

t = (x̄₂ − x̄₁) / SEDM
where SEDM, the Standard Error of the Difference of Means, is √(SEM₁² + SEM₂²),
and SEM is the Standard Error of the Mean.
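A Python sketch of these formulas on made-up group summaries (standard deviations, sizes and means are illustrative numbers, not the energy data):

```python
import math

# made-up per-group summaries: standard deviation, size, mean
sd1, n1, xbar1 = 1.2, 13, 8.1
sd2, n2, xbar2 = 1.4, 9, 10.3

SEM1 = sd1 / math.sqrt(n1)  # standard error of each group mean
SEM2 = sd2 / math.sqrt(n2)
SEDM = math.sqrt(SEM1 ** 2 + SEM2 ** 2)  # SE of the difference of means
t = (xbar2 - xbar1) / SEDM
```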
Classical 2-sample t-test
> t.test(expend ~ stature)
Welch Two Sample t-test
data: expend by stature
t = -3.8555, df = 15.919, p-value = 0.001411
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -3.459167 -1.004081  (for the difference)
sample estimates:
 mean in group lean mean in group obese 
           8.066154           10.297778 

Assuming unequal variance (note the fractional df).
Classical 2-sample t-test
> t.test(expend~stature, var.equal=T)
Two Sample t-test
data: expend by stature
t = -3.9456, df = 20, p-value = 0.000799
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -3.411451 -1.051796
sample estimates:
 mean in group lean mean in group obese 
           8.066154           10.297778 
Assuming equal variance: the result is slightly more significant than with the Welch test.
" 2-sample t-test " as a linear model
The t-test could "logically" translate into:
Y = β₀ + βΔlean·X_lean + βΔobese·X_obese + ε
(where X_lean = 1 when X = "lean", and X_obese = 1 when X = "obese")
However, one of these three β parameters is superfluous, and actually makes the model fitting through matrix algebra impossible.
So, instead it is usually translated into:
Y = βdefault + βΔobese·X_obese + ε
(βdefault applies to "all" X; βΔobese is added when X = "obese")
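Fitting this dummy-coded model by ordinary least squares reproduces the group means exactly: the intercept is the mean of the default group and the coefficient is the difference of means. A Python sketch on made-up measurements (not the energy data):

```python
# made-up measurements for the two groups
lean = [7.5, 8.1, 8.4, 8.3]
obese = [9.9, 10.4, 10.6]

# dummy-coded predictor: 0 for lean (the default), 1 for obese
x = [0] * len(lean) + [1] * len(obese)
y = lean + obese

# ordinary least squares on the 0/1 dummy
n = len(x)
xb, yb = sum(x) / n, sum(y) / n
beta_obese = (sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
              / sum((xi - xb) ** 2 for xi in x))
beta_default = yb - beta_obese * xb

mean_lean = sum(lean) / len(lean)
mean_obese = sum(obese) / len(obese)
```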
" 2-sample t-test " as a linear model
Remember: linear models assume equal variance.

> mod <- lm(expend ~ stature)  ## stature: factor w/ only two levels, lean & obese
> summary(mod)
...
Residuals:
    Min      1Q  Median      3Q     Max 
-1.9362 -0.6153 -0.4070  0.2613  2.8138   ## doesn't look symmetric

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    8.0662     0.3618  22.297 1.34e-15 ***  ## average for the lean category
statureobese   2.2316     0.5656   3.946 0.000799 ***  ## difference between lean and obese averages
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
" 2-sample t-test " as a linear model. Remember: linear models assume equal variance.

> summary(mod)
...
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    8.0662     0.3618  22.297 1.34e-15 ***  ## Std. Error = SEM for lean
statureobese   2.2316     0.5656   3.946 0.000799 ***  ## Std. Error = SEDM
" 2-sample t-test " as a linear model. Remember: linear models assume equal variance.

> summary(mod)
...
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    8.0662     0.3618  22.297 1.34e-15 ***
statureobese   2.2316     0.5656   3.946 0.000799 ***
---
Residual standard error: 1.304 on 20 degrees of freedom
Multiple R-squared: 0.4377, Adjusted R-squared: 0.4096 
F-statistic: 15.57 on 1 and 20 DF, p-value: 0.000799

Same t value and p-value as the classical t-test assuming equal variance.
" 2-sample t-test " as a linear model

> anova(mod)
Analysis of Variance Table

Response: expend
          Df Sum Sq Mean Sq F value   Pr(>F)    
stature    1 26.485  26.485  15.568 0.000799 ***
Residuals 20 34.026   1.701                     

MSmodel = SSmodel / dfmodel
MSE = RSS / rdf
F = MSmodel / MSE
SSmodel = 26.485, RSS = 34.026, TSS = 60.511
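Again the table can be checked numerically (a Python sketch using the values from the anova table):

```python
SSmodel, RSS = 26.485, 34.026  # from the anova table
TSS = SSmodel + RSS            # total sum of squares

MSmodel = SSmodel / 1   # 1 model df
MSE = RSS / 20          # 20 residual df
F = MSmodel / MSE       # ~15.57, matching the F-statistic above
R2 = SSmodel / TSS      # matches the Multiple R-squared
```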
Are the assumptions met?
1: Is the response variable continuous? YES!
2: Are the residuals normally distributed?

> library(nortest)  ## Package of normality tests
> ad.test(mod$residuals)  ## Anderson-Darling

Anderson-Darling normality test
data: mod$residuals
A = 0.9638, p-value = 0.01224

> plot(mod, which=2)  ## qqplot of std. residuals

Answer: NO!
3 : Are the residuals independent
and identically distributed ?
> plot(mod, which=1) ## residuals vs fitted values.
Answer: perhaps not identically distributed
> plot(mod, which=3)  ## sqrt(abs(standardized residuals)) vs fitted values
[The two fitted-value columns correspond to the lean and obese groups.]
Transforming the response variable to produce normal residuals.
The transformation can be optimised using the Box-Cox method.

> library(MASS)
> boxcox(mod, plotit=T)

The method transforms the response y into gλ(y), where:
gλ(y) = (y^λ − 1) / λ   when λ ≠ 0
gλ(y) = log(y)          when λ = 0
Here the optimal solution is λ ≈ −1/2, which gives:
g(y) = (y^(−1/2) − 1) / (−1/2) = 2 − 2/√y
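A Python sketch of the Box-Cox transform, checking that λ = −1/2 indeed reduces to 2 − 2/√y:

```python
import math

def boxcox_transform(y, lam):
    """Box-Cox transform g_lambda(y)."""
    if lam == 0:
        return math.log(y)           # the limiting case
    return (y ** lam - 1) / lam      # the general case

# with lambda = -1/2: (y^(-1/2) - 1)/(-1/2) = 2 - 2/sqrt(y)
val = boxcox_transform(4.0, -0.5)
```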
Is this transformation good enough?

> new.y <- 2 - 2/sqrt(expend)
> mod2 <- lm(new.y ~ stature)
> ad.test(residuals(mod2))

Anderson-Darling normality test
data: residuals(mod2)
A = 0.5629, p-value = 0.1280

> plot(mod2, which=2)  ## qqplot of std. residuals
> plot(mod, which=2)   ## for comparison

Perhaps the assumption of equal variance is not warranted?
Is this transformation good enough?

> var.test(new.y ~ stature)  ## only works if stature is a factor with only 2 levels

F test to compare two variances
data: new.y by stature
F = 1.5897, num df = 12, denom df = 8, p-value = 0.5201
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval: 0.3785301 5.5826741
sample estimates:
ratio of variances 
          1.589701 
Is this transformation good enough?

> summary(mod)  ## untransformed y
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    8.0662     0.3618  22.297 1.34e-15 ***
statureobese   2.2316     0.5656   3.946 0.000799 ***

> summary(mod2)  ## transformed y
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.29049    0.01301  99.176  < 2e-16 ***
statureobese  0.08264    0.02034   4.062 0.000608 ***
Next: One-Way ANOVA, a parametric extension of the t-test to more than 2 groups, whose sample sizes can be unequal.

> data(coagulation)  ## in faraway package: blood coagulation times among 24 animals fed one of 4 diets
> attach(coagulation)
> plot(coag ~ diet)
> stripchart(coag ~ diet, method="jitter")
One-Way ANOVA as an F-test

> res.aov <- aov(coag ~ diet)  ## classical ANOVA
> summary(res.aov)
            Df Sum Sq Mean Sq F value    Pr(>F)    
diet         3  228.0    76.0  13.571 4.658e-05 ***
Residuals   20  112.0     5.6                      
---

SSmodel = 228, RSS = 112, TSS = 340
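The ANOVA table can be reproduced from first principles: the total sum of squares partitions into a between-group (model) part and a within-group (residual) part. A Python sketch on made-up groups of unequal sizes (not the coagulation data):

```python
# made-up data: 3 groups of unequal sizes
groups = [[4.0, 5.0, 6.0], [7.0, 8.0, 9.0, 8.0], [1.0, 2.0, 3.0]]

allv = [v for g in groups for v in g]
grand = sum(allv) / len(allv)

# between-group SS (the "model" SS) and within-group SS (the RSS)
SS_model = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
RSS = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)

df_model = len(groups) - 1       # k - 1
rdf = len(allv) - len(groups)    # n - k
F = (SS_model / df_model) / (RSS / rdf)
```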
One-Way ANOVA as a linear model

> mod <- lm(coag ~ diet)  ## as a linear model
> summary(mod)  ## some output left out
Coefficients:
              Estimate Std. Error   t value Pr(>|t|)    
(Intercept)  6.100e+01  1.183e+00    51.554  < 2e-16 ***  ## x̄A
dietB        5.000e+00  1.528e+00     3.273 0.003803 **   ## x̄B − x̄A
dietC        7.000e+00  1.528e+00     4.583 0.000181 ***  ## x̄C − x̄A
dietD       -1.071e-14  1.449e+00 -7.39e-15 1.000000      ## x̄D − x̄A

Residual standard error: 2.366 on 20 degrees of freedom
Multiple R-squared: 0.6706, Adjusted R-squared: 0.6212 
F-statistic: 13.57 on 3 and 20 DF, p-value: 4.658e-05

> anova(mod)  ## or summary(aov(mod)); not very useful here, as it repeats the earlier table
Analysis of Variance Table

Response: coag
          Df Sum Sq Mean Sq F value    Pr(>F)    
diet       3  228.0    76.0  13.571 4.658e-05 ***
Residuals 20  112.0     5.6
Are the assumptions met?
1: Is the response variable continuous? No, discrete!
2: Are the residuals normally distributed?

> library(nortest)  ## Package of normality tests
> ad.test(mod$residuals)  ## Anderson-Darling

Anderson-Darling normality test
data: mod$residuals
A = 0.301, p-value = 0.5517

> plot(mod, which=2)  ## qqplot of std. residuals

Answer: Yes!
3 : Are the residuals independent
and identically distributed ?
> plot(mod, which=1) ## residuals vs fitted values.
Answer: OK
> plot(mod, which=3)  ## sqrt(abs(standardized residuals)) vs fitted values
[Fitted-value columns correspond to diets A, D, B, C; one point covers 3 obs.]

> library(car)
> levene.test(mod)
Levene's Test for Homogeneity of Variance
      Df F value Pr(>F)
group  3  0.6492 0.5926
      20
More classical graphs: histogram + theoretical curve, boxplot, stripchart, barplot, pie chart, 3D models.