Department of Social Policy and Social Work, University of York [email protected]
Lecture 2: Linear Models I. Olivier Missa, [email protected]. Advanced Research Skills.
Outline
What are linear models?
Basic assumptions
"Refresher" on different types of model:
Single linear regression
Two-sample t-test
ANOVA (one-way & two-way)
ANCOVA
Multiple linear regression
What are linear models?
Put simply, they are models attempting to "explain" one response variable by combining several predictor variables linearly.
Y = β₀ + β₁X₁ + β₂X₂ + ε
In theory, the response variable and the predictor variables
can either be continuous, discrete or categorical.
This makes them particularly versatile.
Indeed, many well known statistical procedures are linear models.
e.g. Linear regression, Two-sample t-test,
One-way & Two-way ANOVA, ANCOVA, ...
What are linear models?
For the time being, we are going to assume that
(1) the response variable is continuous.
(2) the residuals (ε) are normally distributed and ...
(3) ... independently and identically distributed.
These "three" assumptions define classical linear models.
We will see in later lectures ways to bypass these assumptions
to be able to cope with an even wider range of situations
(generalized linear models, mixed-effects models).
But let's take things one step at a time.
Starting with a simple example: single linear regression
1 response variable (continuous) vs.
1 explanatory variable, either continuous or discrete (but ordinal!).
Attempts to fit a straight line, y = a + b·x
> library(faraway)
> data(seatpos)
> attach(seatpos)
> model <- lm(Weight ~ Ht)
> model
Coefficients:
(Intercept) Ht
-292.992 2.653
Y = β₀ + β₁X + ε
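The straight-line fit above comes from the closed-form least-squares formulas b = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and a = ȳ − b·x̄. A minimal Python sketch on made-up numbers (not the seatpos data used in the R code):

```python
def ols_fit(x, y):
    """Closed-form least squares for y = a + b*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # slope: sum of cross-deviations divided by sum of squared x-deviations
    b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
    a = ybar - b * xbar  # the fitted line passes through (xbar, ybar)
    return a, b

# points lying exactly on y = 1 + 2x, so the fit recovers a = 1, b = 2
a, b = ols_fit([1, 2, 3, 4], [3, 5, 7, 9])
```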
But how "significant" is this trend? This can be approached in a number of ways.

1) R², the proportion of variance explained:
R² = 1 − RSS / TotalSS
where RSS, the Residual Sum of Squares (the "deviance"), is Σᵢ (yᵢ − ŷᵢ)², with ŷᵢ the fitted values,
and TotalSS, the Total Sum of Squares, is Σᵢ (yᵢ − ȳ)², with ȳ the average value.
Here R² = 0.686 (RSS = 14853, TotalSS = 47371).
1) R², the proportion of variance explained (continued):
R² = SSreg / TotalSS = (TotalSS − RSS) / TotalSS = 1 − RSS / TotalSS
where SSreg, the Sum of Squares due to regression, is Σᵢ (ŷᵢ − ȳ)².
SSreg = 32518, RSS = 14853, TotalSS = 47371.
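These identities can be checked against each other numerically; a quick Python sketch using the sums of squares reported on the slides:

```python
RSS = 14853      # residual sum of squares (from the slide)
TotalSS = 47371  # total sum of squares (from the slide)
SSreg = TotalSS - RSS  # sum of squares due to regression

R2 = 1 - RSS / TotalSS  # equivalently SSreg / TotalSS
```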
1) R², the proportion of variance explained.

> summary(model)

Call:
lm(formula = Weight ~ Ht)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.388 -12.213  -4.876   4.945  59.586 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -292.9915    50.6402  -5.786 1.34e-06 ***
Ht             2.6533     0.2989   8.878 1.35e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 20.31 on 36 degrees of freedom
Multiple R-squared: 0.6865, Adjusted R-squared: 0.6777 
F-statistic: 78.82 on 1 and 36 DF, p-value: 1.351e-10
2) model parameters and their standard errors
> summary(model)
... some outputs left out
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -292.9915 50.6402 -5.786 1.34e-06 ***
Ht 2.6533 0.2989 8.878 1.35e-10 ***
...
How many standard errors away from 0 are the estimates?
e.g. for Ht: t = 2.6533 / 0.2989 = 8.878
> anova(model)
Analysis of Variance Table
Response: Weight
Df Sum Sq Mean Sq F value Pr(>F)
Ht 1 32518 32518 78.816 1.351e-10 ***
Residuals 36 14853 413
...
3) F-test
MSreg = SSreg / dfreg
MSE = RSS / rdf
F = MSreg / MSE
The F-value so obtained must be compared to the theoretical F distribution with 1 (numerator) and 36 (denominator) degrees of freedom.
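The F statistic can likewise be recomputed from the sums of squares and degrees of freedom in the anova table (a Python sketch, using the values from the slide):

```python
SSreg, RSS = 32518, 14853  # sums of squares from the anova table
df_reg, rdf = 1, 36        # regression and residual degrees of freedom

MSreg = SSreg / df_reg  # mean square due to regression
MSE = RSS / rdf         # mean square error (~413, as in the table)
F = MSreg / MSE         # ~78.8, matching the summary output
```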
How strong is this trend?
[Figure: fitted line with the 95% confidence interval around the slope and the 95% prediction band for future observations; observations 22, 31 and 13 stand out.]
Are the assumptions met?
1: Is the response variable continuous? YES!
2: Are the residuals normally distributed?

> library(nortest)  ## Package of normality tests
> ad.test(model$residuals)  ## Anderson-Darling

Anderson-Darling normality test
data: model$residuals
A = 1.9511, p-value = 4.502e-05

> plot(model, which=2)  ## qqplot of std. residuals

Answer: NO! Due to a few outliers?
3a : Are the residuals independent ?
> plot(model, which=1) ## residuals vs fitted values.
A bad example : non-linear trend
Answer: looks OK !
3b : Are the residuals identically distributed ?
> plot(model, which=3)  ## sqrt(abs(standardized residuals)) vs fitted values
Continuing our bad example
Answer: Not perfect, but OK !
[Example residual plots labelled "Bad" and "OK".]

Another bad example of residuals: heteroscedasticity.
[Both example plots labelled "Bad".]
Is our model OK, despite having outliers?
Possible approaches:
1) Repeat the analysis, removing the outliers, and check the model parameters.
> summary(model)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -292.9915 50.6402 -5.786 1.34e-06 ***
Ht 2.6533 0.2989 8.878 1.35e-10 ***
> model2 <- lm(Weight ~ Ht, subset=seq(38)[-22])
> summary(model2)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -213.8000 47.4169 -4.509 7.00e-05 ***
Ht 2.1731 0.2813 7.727 4.52e-09 ***
R² = 0.686 (all observations) vs R² = 0.630 (minus obs. #22)
Possible approaches:
2) Examine directly the impact each observation has on the model output.
This depends on (a) its leverage (hᵢᵢ): "how extreme each x-value is compared to the others". A high-leverage point has the potential to influence the fit strongly.
When there is only one predictor variable:
hᵢᵢ = 1/n + (xᵢ − x̄)² / Σⱼ (xⱼ − x̄)²
and Σ hᵢᵢ = 2.
In general (for later): Σ hᵢᵢ = p (the number of parameters in the model).
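A quick Python sketch of the single-predictor leverage formula, on made-up x-values, verifying that the leverages sum to 2 and that the most extreme x-value gets the largest hᵢᵢ:

```python
def leverages(x):
    """Hat values h_ii for a single-predictor linear model."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    # h_ii = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2
    return [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

h = leverages([1, 2, 3, 10])  # 10 is far from the other x-values
```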
However, the leverage is only part of the story.
The direct impact of an observation depends also on (b) its residual
A point with strong leverage but small residual is not a
problem: the model adequately accounts for it.
A point with weak leverage but large residual is not a problem
either: the model is weakly affected by it.
A point with strong leverage and large residual, however,
strongly influences the model.
Removing the latter point will usually modify
the model output to some extent.
Influential Observations
> plot(model, which=5) ## standardized residuals vs Leverage.
[Observations 22, 31 and 13 stand out on the plot.]
A direct measure of influence, combining both leverages and residuals: Cook's distance.
Dᵢ = Σⱼ ( ŷⱼ − ŷⱼ(i) )² / ( p · MSE )
where ŷⱼ(i) are the fitted values when point i is omitted, ŷⱼ the original fitted values, p the number of parameters in the model, and MSE the Mean Square Error = RSS / (n − p).
Combining both leverages and residuals, Cook's distance can be rewritten as:
Dᵢ = εᵢ² / (p · MSE) · hᵢᵢ / (1 − hᵢᵢ)²
(in terms of the raw residuals εᵢ and the leverage values hᵢᵢ)
or, using the standardised residuals rᵢ:
Dᵢ = rᵢ² / p · hᵢᵢ / (1 − hᵢᵢ)
Any Dᵢ value above 0.5 deserves a closer look, and any Dᵢ value above 1 merits special attention.
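The shortcut formula in terms of residuals and leverages is algebraically identical to the leave-one-out definition of Cook's distance. A Python sketch on made-up data (not the seatpos data) computes Dᵢ both ways and checks that they agree:

```python
def fit(x, y):
    """Least-squares intercept and slope."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b = (sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
         / sum((xi - xb) ** 2 for xi in x))
    return yb - b * xb, b

x = [1, 2, 3, 4, 5, 10]              # last x-value has high leverage
y = [2.1, 3.9, 6.2, 8.0, 9.9, 25.0]  # and its y-value is off-trend
n, p = len(x), 2

a, b = fit(x, y)
fitted = [a + b * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]
MSE = sum(e ** 2 for e in resid) / (n - p)

xb = sum(x) / n
sxx = sum((xi - xb) ** 2 for xi in x)
h = [1 / n + (xi - xb) ** 2 / sxx for xi in x]  # leverages

# shortcut: D_i = e_i^2/(p*MSE) * h_ii/(1-h_ii)^2
D_short = [e ** 2 / (p * MSE) * hi / (1 - hi) ** 2
           for e, hi in zip(resid, h)]

# definition: refit without point i, compare all fitted values
D_def = []
for i in range(n):
    ai, bi = fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    D_def.append(sum((fj - (ai + bi * xj)) ** 2
                     for fj, xj in zip(fitted, x)) / (p * MSE))
```

The high-leverage, large-residual point ends up with by far the largest Dᵢ, as the slide argues.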
Remove point #22?
Although point #22 is influential, it does not invalidate the model. We should probably keep it, but note that the regression slope may be shallower than the model suggests.
A "reason" for the presence of these
outliers can be suggested:
some obese people were included
in the survey.
Depending on the purpose of the model,
you may want to keep or remove these outliers.
Next: " 2-sample t-test "
First, let's have a look at the classical t-test.

> data(energy)  ## in ISwR package
> attach(energy)
> plot(expend ~ stature)
> stripchart(expend ~ stature, method="jitter")

t = (x̄₂ − x̄₁) / SEDM
where SEDM, the Standard Error of the Difference of Means, is √(SEM₁² + SEM₂²),
and SEM is the Standard Error of the Mean.
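A Python sketch of these formulas on made-up group summaries (standard deviations, sizes and means are illustrative numbers, not the energy data):

```python
import math

# made-up per-group summaries: standard deviation, size, mean
sd1, n1, xbar1 = 1.2, 13, 8.1
sd2, n2, xbar2 = 1.4, 9, 10.3

SEM1 = sd1 / math.sqrt(n1)  # standard error of each group mean
SEM2 = sd2 / math.sqrt(n2)
SEDM = math.sqrt(SEM1 ** 2 + SEM2 ** 2)  # SE of the difference of means
t = (xbar2 - xbar1) / SEDM
```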
Classical 2-sample t-test
> t.test(expend ~ stature)
Welch Two Sample t-test
data: expend by stature
t = -3.8555, df = 15.919, p-value = 0.001411
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -3.459167 -1.004081  (for the difference)
sample estimates:
 mean in group lean mean in group obese 
           8.066154           10.297778 

Assuming unequal variance (note the fractional df).
Classical 2-sample t-test
> t.test(expend~stature, var.equal=T)
Two Sample t-test
data: expend by stature
t = -3.9456, df = 20, p-value = 0.000799
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -3.411451 -1.051796
sample estimates:
 mean in group lean mean in group obese 
           8.066154           10.297778 
Assuming equal variance: the result is slightly more significant than with the Welch test.
" 2-sample t-test " as a linear model
The t-test could "logically" translate into:
Y = β₀ + βΔlean·X_lean + βΔobese·X_obese + ε
(where X_lean = 1 when X = "lean", and X_obese = 1 when X = "obese")
However, one of these three β parameters is superfluous, and actually makes the model fitting through matrix algebra impossible.
So, instead it is usually translated into:
Y = βdefault + βΔobese·X_obese + ε
(βdefault applies to "all" X; βΔobese is added when X = "obese")
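Fitting this dummy-coded model by ordinary least squares reproduces the group means exactly: the intercept is the mean of the default group and the coefficient is the difference of means. A Python sketch on made-up measurements (not the energy data):

```python
# made-up measurements for the two groups
lean = [7.5, 8.1, 8.4, 8.3]
obese = [9.9, 10.4, 10.6]

# dummy-coded predictor: 0 for lean (the default), 1 for obese
x = [0] * len(lean) + [1] * len(obese)
y = lean + obese

# ordinary least squares on the 0/1 dummy
n = len(x)
xb, yb = sum(x) / n, sum(y) / n
beta_obese = (sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
              / sum((xi - xb) ** 2 for xi in x))
beta_default = yb - beta_obese * xb

mean_lean = sum(lean) / len(lean)
mean_obese = sum(obese) / len(obese)
```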
" 2-sample t-test " as a linear model
Remember: linear models assume equal variance.

> mod <- lm(expend ~ stature)  ## stature: factor w/ only two levels, lean & obese
> summary(mod)
...
Residuals:
    Min      1Q  Median      3Q     Max 
-1.9362 -0.6153 -0.4070  0.2613  2.8138   ## doesn't look symmetric

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    8.0662     0.3618  22.297 1.34e-15 ***  ## average for the lean category
statureobese   2.2316     0.5656   3.946 0.000799 ***  ## difference between lean and obese averages
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
" 2-sample t-test " as a linear model. Remember: linear models assume equal variance.

> summary(mod)
...
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    8.0662     0.3618  22.297 1.34e-15 ***  ## Std. Error = SEM for lean
statureobese   2.2316     0.5656   3.946 0.000799 ***  ## Std. Error = SEDM
" 2-sample t-test " as a linear model. Remember: linear models assume equal variance.

> summary(mod)
...
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    8.0662     0.3618  22.297 1.34e-15 ***
statureobese   2.2316     0.5656   3.946 0.000799 ***
---
Residual standard error: 1.304 on 20 degrees of freedom
Multiple R-squared: 0.4377, Adjusted R-squared: 0.4096 
F-statistic: 15.57 on 1 and 20 DF, p-value: 0.000799

Same t value and p-value as the classical t-test assuming equal variance.
" 2-sample t-test " as a linear model

> anova(mod)
Analysis of Variance Table

Response: expend
          Df Sum Sq Mean Sq F value   Pr(>F)    
stature    1 26.485  26.485  15.568 0.000799 ***
Residuals 20 34.026   1.701                     

MSmodel = SSmodel / dfmodel
MSE = RSS / rdf
F = MSmodel / MSE
SSmodel = 26.485, RSS = 34.026, TSS = 60.511
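Again the table can be checked numerically (a Python sketch using the values from the anova table):

```python
SSmodel, RSS = 26.485, 34.026  # from the anova table
TSS = SSmodel + RSS            # total sum of squares

MSmodel = SSmodel / 1   # 1 model df
MSE = RSS / 20          # 20 residual df
F = MSmodel / MSE       # ~15.57, matching the F-statistic above
R2 = SSmodel / TSS      # matches the Multiple R-squared
```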
Are the assumptions met?
1: Is the response variable continuous? YES!
2: Are the residuals normally distributed?

> library(nortest)  ## Package of normality tests
> ad.test(mod$residuals)  ## Anderson-Darling

Anderson-Darling normality test
data: mod$residuals
A = 0.9638, p-value = 0.01224

> plot(mod, which=2)  ## qqplot of std. residuals

Answer: NO!
3 : Are the residuals independent
and identically distributed ?
> plot(mod, which=1) ## residuals vs fitted values.
Answer: perhaps not identically distributed
> plot(mod, which=3)  ## sqrt(abs(standardized residuals)) vs fitted values
[The two fitted-value columns correspond to the lean and obese groups.]
Transforming the response variable to produce normal residuals.
The transformation can be optimised using the Box-Cox method.

> library(MASS)
> boxcox(mod, plotit=T)

The method transforms the response y into gλ(y), where:
gλ(y) = (y^λ − 1) / λ   when λ ≠ 0
gλ(y) = log(y)          when λ = 0
Here the optimal solution is λ ≈ −1/2, which gives:
g(y) = (y^(−1/2) − 1) / (−1/2) = 2 − 2/√y
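A Python sketch of the Box-Cox transform, checking that λ = −1/2 indeed reduces to 2 − 2/√y:

```python
import math

def boxcox_transform(y, lam):
    """Box-Cox transform g_lambda(y)."""
    if lam == 0:
        return math.log(y)           # the limiting case
    return (y ** lam - 1) / lam      # the general case

# with lambda = -1/2: (y^(-1/2) - 1)/(-1/2) = 2 - 2/sqrt(y)
val = boxcox_transform(4.0, -0.5)
```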
Is this transformation good enough?

> new.y <- 2 - 2/sqrt(expend)
> mod2 <- lm(new.y ~ stature)
> ad.test(residuals(mod2))

Anderson-Darling normality test
data: residuals(mod2)
A = 0.5629, p-value = 0.1280

> plot(mod2, which=2)  ## qqplot of std. residuals
> plot(mod, which=2)   ## for comparison

Perhaps the assumption of equal variance is not warranted?
Is this transformation good enough?

> var.test(new.y ~ stature)  ## only works if stature is a factor with only 2 levels

F test to compare two variances
data: new.y by stature
F = 1.5897, num df = 12, denom df = 8, p-value = 0.5201
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval: 0.3785301 5.5826741
sample estimates:
ratio of variances 
          1.589701 
Is this transformation good enough?

> summary(mod)  ## untransformed y
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    8.0662     0.3618  22.297 1.34e-15 ***
statureobese   2.2316     0.5656   3.946 0.000799 ***

> summary(mod2)  ## transformed y
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.29049    0.01301  99.176  < 2e-16 ***
statureobese  0.08264    0.02034   4.062 0.000608 ***
Next: One-Way ANOVA, a parametric extension of the t-test to more than 2 groups, whose sample sizes can be unequal.

> data(coagulation)  ## in faraway package: blood coagulation times among 24 animals fed one of 4 diets
> attach(coagulation)
> plot(coag ~ diet)
> stripchart(coag ~ diet, method="jitter")
One-Way ANOVA as an F-test

> res.aov <- aov(coag ~ diet)  ## classical ANOVA
> summary(res.aov)
            Df Sum Sq Mean Sq F value    Pr(>F)    
diet         3  228.0    76.0  13.571 4.658e-05 ***
Residuals   20  112.0     5.6                      
---

SSmodel = 228, RSS = 112, TSS = 340
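The ANOVA table can be reproduced from first principles: the total sum of squares partitions into a between-group (model) part and a within-group (residual) part. A Python sketch on made-up groups of unequal sizes (not the coagulation data):

```python
# made-up data: 3 groups of unequal sizes
groups = [[4.0, 5.0, 6.0], [7.0, 8.0, 9.0, 8.0], [1.0, 2.0, 3.0]]

allv = [v for g in groups for v in g]
grand = sum(allv) / len(allv)

# between-group SS (the "model" SS) and within-group SS (the RSS)
SS_model = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
RSS = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)

df_model = len(groups) - 1       # k - 1
rdf = len(allv) - len(groups)    # n - k
F = (SS_model / df_model) / (RSS / rdf)
```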
One-Way ANOVA as a linear model

> mod <- lm(coag ~ diet)  ## as a linear model
> summary(mod)  ## some output left out
Coefficients:
              Estimate Std. Error   t value Pr(>|t|)    
(Intercept)  6.100e+01  1.183e+00    51.554  < 2e-16 ***  ## x̄A
dietB        5.000e+00  1.528e+00     3.273 0.003803 **   ## x̄B − x̄A
dietC        7.000e+00  1.528e+00     4.583 0.000181 ***  ## x̄C − x̄A
dietD       -1.071e-14  1.449e+00 -7.39e-15 1.000000      ## x̄D − x̄A

Residual standard error: 2.366 on 20 degrees of freedom
Multiple R-squared: 0.6706, Adjusted R-squared: 0.6212 
F-statistic: 13.57 on 3 and 20 DF, p-value: 4.658e-05

> anova(mod)  ## or summary(aov(mod)); not very useful here, as it repeats the earlier table
Analysis of Variance Table

Response: coag
          Df Sum Sq Mean Sq F value    Pr(>F)    
diet       3  228.0    76.0  13.571 4.658e-05 ***
Residuals 20  112.0     5.6
Are the assumptions met?
1: Is the response variable continuous? No, discrete!
2: Are the residuals normally distributed?

> library(nortest)  ## Package of normality tests
> ad.test(mod$residuals)  ## Anderson-Darling

Anderson-Darling normality test
data: mod$residuals
A = 0.301, p-value = 0.5517

> plot(mod, which=2)  ## qqplot of std. residuals

Answer: Yes!
3 : Are the residuals independent
and identically distributed ?
> plot(mod, which=1) ## residuals vs fitted values.
Answer: OK
> plot(mod, which=3)  ## sqrt(abs(standardized residuals)) vs fitted values
[Fitted-value columns correspond to diets A, D, B, C; one point covers 3 obs.]

> library(car)
> levene.test(mod)
Levene's Test for Homogeneity of Variance
      Df F value Pr(>F)
group  3  0.6492 0.5926
      20
More classical graphs: histogram + theoretical curve, boxplot, stripchart, barplot, pie chart, 3D models.