Transcript of Lecture 2: Linear Models I. Olivier MISSA, om502@york.ac.uk. Advanced Research Skills.

Page 1

Lecture 2

Linear Models I

Olivier MISSA, om502@york.ac.uk

Advanced Research Skills

Page 2

Outline

What are Linear models

Basic assumptions

"Refresher" on different types of model:

Single Linear Regression

2 sample t-test

Anova (one-way & two-way)

Ancova

Multiple Linear Regression

Page 3

What are linear models?

Put simply, they are models attempting to "explain" one response variable by combining linearly several predictor variables.

Y = β₀ + β₁X₁ + β₂X₂ + ε

In theory, the response variable and the predictor variables

can either be continuous, discrete or categorical.

This makes them particularly versatile.

Indeed, many well known statistical procedures are linear models.

e.g. Linear regression, Two-sample t-test,

One-way & Two-way ANOVA, ANCOVA, ...

Page 4

What are linear models?

For the time being, we are going to assume that

(1) the response variable is continuous.

(2) the residuals (ε) are normally distributed and ...

(3) ... independently and identically distributed.

These "three" assumptions define classical linear models.

We will see in later lectures ways to bypass these assumptions

to be able to cope with an even wider range of situations

(generalized linear models, mixed-effects models).

But let's take things one step at a time.

Page 5

Starting with a simple example

Single Linear regression

1 response variable (continuous) vs.

1 explanatory variable either continuous or discrete (but ordinal !).

Attempts to fit a straight line, y = a + b.x

> library(faraway)

> data(seatpos)

> attach(seatpos)

> model <- lm(Weight ~ Ht)

> model

Coefficients:

(Intercept) Ht

-292.992 2.653

Y = β₀ + β₁X + ε

Page 6

But how "significant" is this trend?

Can be approached in a number of ways

1) R², the proportion of variance explained:

R² = 1 − RSS / TotalSS

where RSS = Σᵢ(yᵢ − ŷᵢ)² is the Residual Sum of Squares (the ŷᵢ are the fitted values; RSS is also called the Deviance), and TotalSS = Σᵢ(yᵢ − ȳ)² is the Total Sum of Squares (ȳ is the average value).

Here RSS = 14853 and TotalSS = 47371, giving R² = 0.686.

Page 7

1) R², the proportion of variance explained (continued):

R² = 1 − RSS/TotalSS = (TotalSS − RSS)/TotalSS = SSreg/TotalSS

where SSreg = Σᵢ(ŷᵢ − ȳ)² is the Sum of Squares due to regression.

Here SSreg = 32518, RSS = 14853, TotalSS = 47371.
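The two R² identities above can be checked directly from the sums of squares printed on these slides; a minimal base-R sketch using the slide's numbers:

```r
## Sums of squares reported on these slides
rss      <- 14853   # Residual Sum of Squares
ss_reg   <- 32518   # Sum of Squares due to regression
total_ss <- 47371   # Total Sum of Squares

## The two equivalent definitions of R-squared
r2_a <- 1 - rss / total_ss   # 1 - RSS/TotalSS
r2_b <- ss_reg / total_ss    # SSreg/TotalSS

round(c(r2_a, r2_b), 3)      # both about 0.686
```

The decomposition works because SSreg + RSS = TotalSS (32518 + 14853 = 47371).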

Page 8

1) R2 the proportion of variance explained.

> summary(model)

Call:
lm(formula = Weight ~ Ht)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.388 -12.213  -4.876   4.945  59.586 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -292.9915    50.6402  -5.786 1.34e-06 ***
Ht             2.6533     0.2989   8.878 1.35e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 20.31 on 36 degrees of freedom
Multiple R-squared: 0.6865, Adjusted R-squared: 0.6777 
F-statistic: 78.82 on 1 and 36 DF, p-value: 1.351e-10

Page 9

2) model parameters and their standard errors

> summary(model)

... some outputs left out

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -292.9915 50.6402 -5.786 1.34e-06 ***

Ht 2.6533 0.2989 8.878 1.35e-10 ***

...

How many standard errors away from 0 are the estimates? For Ht: 2.6533 / 0.2989 = 8.878
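The t value in the coefficient table is just that ratio; recomputing it from the printed estimate and standard error:

```r
estimate  <- 2.6533   # slope for Ht (from the summary above)
std_error <- 0.2989   # its standard error
t_value   <- estimate / std_error   # how many SEs away from 0
round(t_value, 2)                   # about 8.88, matching the printed 8.878
```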

Page 10

> anova(model)

Analysis of Variance Table

Response: Weight

Df Sum Sq Mean Sq F value Pr(>F)

Ht 1 32518 32518 78.816 1.351e-10 ***

Residuals 36 14853 413

...

3) F-test

The model and residual sums of squares (SSreg and RSS) are turned into mean squares:

MSreg = SSreg / dfreg
MSE = RSS / rdf
F = MSreg / MSE

The F-value so obtained must be compared to the theoretical F probabilities with 1 (numerator) and 36 (denominator) degrees of freedom.
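Those mean squares can be assembled by hand from the ANOVA table above, and the p-value recovered with pf(); a base-R sketch using the slide's numbers:

```r
ms_reg  <- 32518 / 1    # MSreg = SSreg / dfreg
mse     <- 14853 / 36   # MSE   = RSS / rdf
f_value <- ms_reg / mse
round(f_value, 3)       # about 78.816, as in the ANOVA table

## Compare against the theoretical F distribution with 1 and 36 df
pf(f_value, df1 = 1, df2 = 36, lower.tail = FALSE)  # about 1.35e-10
```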

Page 11

How strong is this trend ?

95% Confidence Interval around Slope

95% band for future observations

[Plot: fitted line with 95% confidence and prediction bands; observations #22, #31 and #13 stand out.]
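The 95% confidence interval around the slope can be reproduced from the printed coefficient table alone; a base-R sketch (for the full bands you would call confint() and predict(..., interval="prediction") on the fitted model):

```r
slope <- 2.6533   # estimate for Ht, from the summary
se    <- 0.2989   # its standard error
rdf   <- 36       # residual degrees of freedom

## 95% confidence interval around the slope: estimate +/- t * SE
ci <- slope + c(-1, 1) * qt(0.975, df = rdf) * se
round(ci, 2)      # roughly 2.05 to 3.26
```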

Page 12

Are the assumptions met?

1: Is the response variable continuous? YES!

2: Are the residuals normally distributed ?

> library(nortest)         ## Package of Normality tests
> ad.test(model$residuals) ## Anderson-Darling

        Anderson-Darling normality test

data:  model$residuals
A = 1.9511, p-value = 4.502e-05

> plot(model, which=2)  ## qqplot of std. residuals

Answer: NO! Due to a few outliers?

Page 13

3a : Are the residuals independent ?

> plot(model, which=1) ## residuals vs fitted values.

A bad example : non-linear trend

Answer: looks OK !

Page 14

3b : Are the residuals identically distributed ?

> plot(model, which=3)  ## sqrt(abs(standardized residuals)) vs fitted values

Continuing our bad example

Answer: Not perfect, but OK !

[Plot panels labelled "Bad" and "OK".]

Page 15

Another bad example of residuals: heteroscedasticity.

[Plot panels labelled "Bad".]

Page 16

Is our model OK? (despite having outliers)

Possible approaches

1) Repeat the analysis, removing the outliers, and check the model parameters.

> summary(model)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -292.9915 50.6402 -5.786 1.34e-06 ***

Ht 2.6533 0.2989 8.878 1.35e-10 ***

> model2 <- lm(Weight ~ Ht, subset=seq(38)[-22])

> summary(model2)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -213.8000 47.4169 -4.509 7.00e-05 ***

Ht 2.1731 0.2813 7.727 4.52e-09 ***

R² = 0.686 (full data) vs R² = 0.630 (minus obs. #22)

Page 17

Possible approaches

2) Examine directly the impact each observation has on the model output.

Depends on (a) leverage (hii)

"How extreme each x-value is compared to the others"

Has the potential to influence the fit strongly.

When there is only one predictor variable:

hᵢᵢ = 1/n + (xᵢ − x̄)² / Σⱼ(xⱼ − x̄)²

and Σᵢ hᵢᵢ = 2.

In general (for later): Σᵢ hᵢᵢ = p (the number of parameters in the model).
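The leverage formula can be checked against base R's hatvalues() on a small made-up x (the y values are arbitrary, since leverage depends only on x):

```r
x <- c(1, 2, 3, 4, 10)                 # made-up predictor; 10 is extreme
h <- 1/length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)

sum(h)   # 2: one predictor plus intercept, i.e. p = 2 parameters

y <- c(2.1, 3.9, 6.2, 8.1, 19.8)       # arbitrary response
fit <- lm(y ~ x)
all.equal(h, unname(hatvalues(fit)))   # TRUE
```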

Page 18

However: the leverage is only part of the story

The direct impact of an observation depends also on (b) its residual

A point with strong leverage but small residual is not a

problem: the model adequately accounts for it.

A point with weak leverage but large residual is not a problem

either: the model is weakly affected by it.

A point with strong leverage and large residual, however,

strongly influences the model.

Removing the latter point will usually modify

the model output to some extent.

Page 19

Influential Observations

> plot(model, which=5) ## standardized residuals vs Leverage.

[Plot: standardized residuals vs leverage; observations #22, #31 and #13 flagged.]

Page 20

A direct measure of influence, combining both Leverages and Residuals: Cook's distance

Dᵢ = Σⱼ (ŷⱼ − ŷⱼ₍ᵢ₎)² / (p · MSE)

where the ŷⱼ₍ᵢ₎ are the fitted values when point i is omitted, the ŷⱼ are the original fitted values, p is the No. of Parameters, and MSE = RSS / (n − p) is the Mean Square Error.

Page 21

Combining both Leverages and Residuals: Cook's distance

In terms of the raw residuals eᵢ and the leverage values hᵢᵢ:

Dᵢ = eᵢ² / (p · MSE) · hᵢᵢ / (1 − hᵢᵢ)²

or, in terms of the standardised residuals rᵢ:

Dᵢ = rᵢ² / p · hᵢᵢ / (1 − hᵢᵢ)

Any Dᵢ value above 0.5 deserves a closer look and any Dᵢ value above 1 merits special attention.

Page 22

Combining both Leverages and Residuals

Cook's distance

Any Dᵢ value above 0.5 deserves a closer look and any Dᵢ value above 1 merits special attention.
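In R these distances come straight from cooks.distance(); the standardised-residual formula can be verified against it on a small made-up regression:

```r
x <- c(1, 2, 3, 4, 10)
y <- c(2.0, 4.1, 5.9, 8.3, 25.0)   # made-up data; the last point is influential
fit <- lm(y ~ x)

p <- 2                             # parameters: intercept + slope
r <- rstandard(fit)                # standardised residuals
h <- hatvalues(fit)                # leverage values

d <- (r^2 / p) * h / (1 - h)       # Cook's distance, by hand
all.equal(unname(d), unname(cooks.distance(fit)))   # TRUE
```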

Page 23

Remove point #22?

Although point #22 is influential, it does not invalidate the model.

We should probably keep it,

but note that the regression slope

may be shallower than the model

suggests.

A "reason" for the presence of these

outliers can be suggested:

some obese people were included

in the survey.

Depending on the purpose of the model,

you may want to keep or remove these outliers.

[Plot with observations #22, #31 and #13 highlighted.]

Page 24

Next: " 2-sample t-test "

> data(energy)  ## in ISwR package
> attach(energy)
> plot(expend ~ stature)
> stripchart(expend ~ stature, method="jitter")

First, let's have a look at the classical t-test:

t = (x̄₂ − x̄₁) / SEDM

where t is the t-value and SEDM is the Standard Error of the Difference of Means:

SEDM = √(SEM₁² + SEM₂²)

with SEM the Standard Error of the Mean of each group.
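Those two formulas are all you need to rebuild the Welch t statistic; a base-R sketch on made-up two-group data, checked against t.test():

```r
g1 <- c(7.5, 8.1, 8.4, 7.9, 8.6)      # made-up group 1
g2 <- c(9.9, 10.4, 11.0, 10.1)        # made-up group 2

sem1 <- sd(g1) / sqrt(length(g1))     # Standard Error of each Mean
sem2 <- sd(g2) / sqrt(length(g2))
sedm <- sqrt(sem1^2 + sem2^2)         # SE of the Difference of Means

t_by_hand <- (mean(g2) - mean(g1)) / sedm
all.equal(t_by_hand, unname(t.test(g2, g1)$statistic))   # TRUE
```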

Page 25

Classical 2-sample t-test

> t.test(expend ~ stature)

Welch Two Sample t-test

data:  expend by stature
t = -3.8555, df = 15.919, p-value = 0.001411   <- unequal variance
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:                <- for the difference
 -3.459167 -1.004081
sample estimates:
 mean in group lean mean in group obese 
           8.066154           10.297778 

Assuming Unequal Variance

Page 26

Classical 2-sample t-test

> t.test(expend~stature, var.equal=T)

Two Sample t-test

data:  expend by stature
t = -3.9456, df = 20, p-value = 0.000799
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.411451 -1.051796
sample estimates:
 mean in group lean mean in group obese 
           8.066154           10.297778 

Assuming Equal Variance: slightly more significant

Page 27

" 2-sample t-test " as a linear model

The t-test could "logically" translate into:

Y = β₀ + βΔlean·[when X = "lean"] + βΔobese·[when X = "obese"] + ε

However: one of these β parameters is superfluous, and actually makes the model fitting through matrix algebra impossible.

So, instead it is usually translated into:

Y = βdefault + βΔobese·[when X = "obese"] + ε

where βdefault applies for "all" X and βΔobese is added only when X = "obese".
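This "default + difference" coding is exactly what R's model matrix uses (treatment contrasts); a sketch on a tiny hypothetical stature factor:

```r
stature <- factor(c("lean", "lean", "obese", "obese"))   # hypothetical factor
model.matrix(~ stature)
## Columns: (Intercept), which is 1 for every row, and statureobese,
## a 0/1 dummy that switches on only when stature == "obese"
```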

Page 28

" 2-sample t-test " as a linear model

> mod <- lm(expend ~ stature)  ## stature: factor w/ only two levels, lean & obese
> summary(mod)
...
Residuals:
    Min      1Q  Median      3Q     Max 
-1.9362 -0.6153 -0.4070  0.2613  2.8138     <- doesn't look symmetric

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    8.0662     0.3618  22.297 1.34e-15 ***  <- average for lean category
statureobese   2.2316     0.5656   3.946 0.000799 ***  <- difference between lean and obese averages
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Remember: linear models assume Equal Variance.

Page 29

" 2-sample t-test " as a linear model (continued). Remember: linear models assume Equal Variance.

In the same summary(mod) output, the Std. Error of the (Intercept), 0.3618, is the Standard Error of the Mean (SEM) for lean, and the Std. Error of statureobese, 0.5656, is the Standard Error of the Difference of the Means (SEDM).

Page 30

" 2-sample t-test " as a linear model (continued). The rest of the summary(mod) output:

Residual standard error: 1.304 on 20 degrees of freedom
Multiple R-squared: 0.4377, Adjusted R-squared: 0.4096 
F-statistic: 15.57 on 1 and 20 DF, p-value: 0.000799

Same values as the classical t-test (equal variance).

Page 31

> anova(mod)
Analysis of Variance Table

Response: expend
          Df Sum Sq Mean Sq F value   Pr(>F)    
stature    1 26.485  26.485  15.568 0.000799 ***
Residuals 20 34.026   1.701

" 2-sample t-test " as a linear model

MSmodel = SSmodel / dfmodel
MSE = RSS / rdf
F = MSmodel / MSE

RSS = 34.026, TSS = 60.511, SSmodel = 26.485
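For a single two-level factor, the ANOVA F is just the square of the equal-variance t value, which ties the two outputs together (using the slide's numbers):

```r
t_value <- 3.9456      # equal-variance t-test value from the earlier slide
round(t_value^2, 3)    # 15.568, the F value in the ANOVA table
```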

Page 32

Are the assumptions met?

1: Is the response variable continuous? YES!

2 : Are the residuals normally distributed ?

> library(nortest)       ## Package of Normality tests
> ad.test(mod$residuals) ## Anderson-Darling

        Anderson-Darling normality test

data:  mod$residuals
A = 0.9638, p-value = 0.01224

> plot(mod, which=2) ## qqplot of std.residuals

Answer: NO !

Page 33

3 : Are the residuals independent

and identically distributed ?

> plot(mod, which=1) ## residuals vs fitted values.

Answer: perhaps not identically distributed

> plot(mod, which=3)  ## sqrt(abs(standardized residuals)) vs fitted values

[Plot: the two groups, lean and obese, appear at the two fitted values.]

Page 34

Transforming the response variable to produce normal residuals

Can be optimised using the Box-Cox method.

> library(MASS)
> boxcox(mod, plotit=T)

The method transforms the response y into gλ(y), where:

gλ(y) = (y^λ − 1) / λ   when λ ≠ 0
gλ(y) = log(y)          when λ = 0

[Plot: profile log-likelihood over λ; the optimal solution is λ ≈ −1/2.]

For λ = −1/2:  g(y) = (y^(−1/2) − 1) / (−1/2) = 2 − 2/√y
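The transform and its λ = −1/2 special case can be written down directly; a base-R sketch:

```r
## Box-Cox transform g_lambda(y), as defined above
g <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

y <- 4
g(y, -1/2)        # (1/sqrt(4) - 1) / (-1/2) = 1
2 - 2 / sqrt(y)   # the simplified form: also 1
```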

Page 35

Is this transformation good enough ?

Transforming the response variable to produce normal residuals

> new.y <- 2 - 2/sqrt(expend)
> mod2 <- lm(new.y ~ stature)
> ad.test(residuals(mod2))

        Anderson-Darling normality test

data:  residuals(mod2)
A = 0.5629, p-value = 0.1280

> plot(mod2, which=2)  ## qqplot of std. residuals
> plot(mod, which=2)

Perhaps the assumption of equal variance is not warranted ??

Page 36

Is this transformation good enough ?

Transforming the response variable to produce normal residuals

> var.test(new.y ~ stature)  ## only works if stature is
                             ## a factor with only 2 levels

        F test to compare two variances

data:  new.y by stature
F = 1.5897, num df = 12, denom df = 8, p-value = 0.5201
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.3785301 5.5826741
sample estimates:
ratio of variances 
          1.589701

Page 37

Is this transformation good enough ?

Transforming the response variable to produce normal residuals

> summary(mod)   ## untransformed y
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    8.0662     0.3618  22.297 1.34e-15 ***
statureobese   2.2316     0.5656   3.946 0.000799 ***

> summary(mod2)  ## transformed y
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.29049    0.01301  99.176  < 2e-16 ***
statureobese  0.08264    0.02034   4.062 0.000608 ***

Page 38

Next: One-Way Anova

Parametric extension of the t-test to more than 2 groups, whose sample sizes can be unequal.

> data(coagulation)  ## in faraway package
                     ## blood coagulation times among
                     ## 24 animals fed one of 4 diets
> attach(coagulation)
> plot(coag ~ diet)
> stripchart(coag ~ diet, method="jitter")

Page 39

One-Way Anova as an F-test

> res.aov <- aov(coag ~ diet)  ## classical ANOVA
> summary(res.aov)
            Df Sum Sq Mean Sq F value    Pr(>F)    
diet         3  228.0    76.0  13.571 4.658e-05 ***
Residuals   20  112.0     5.6
---

RSS = 112, TSS = 340, SSmodel = 228

Page 40

One-Way Anova as a linear model

> mod <- lm(coag ~ diet)  ## as a linear model
> summary(mod)  ## some output left out
Coefficients:
              Estimate Std. Error   t value Pr(>|t|)    
(Intercept)  6.100e+01  1.183e+00    51.554  < 2e-16 ***  <- x̄A
dietB        5.000e+00  1.528e+00     3.273 0.003803 **   <- x̄B − x̄A
dietC        7.000e+00  1.528e+00     4.583 0.000181 ***  <- x̄C − x̄A
dietD       -1.071e-14  1.449e+00 -7.39e-15 1.000000      <- x̄D − x̄A (not very useful)

Residual standard error: 2.366 on 20 degrees of freedom
Multiple R-squared: 0.6706, Adjusted R-squared: 0.6212 
F-statistic: 13.57 on 3 and 20 DF, p-value: 4.658e-05

> anova(mod)  ## or summary(aov(mod))
Analysis of Variance Table

Response: coag
          Df Sum Sq Mean Sq F value    Pr(>F)    
diet       3  228.0    76.0  13.571 4.658e-05 ***
Residuals 20  112.0     5.6
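With treatment contrasts the intercept is the mean of the first diet and the other coefficients are differences from it; this can be checked with tapply() on made-up data of the same shape:

```r
coag <- c(60, 62, 61, 66, 65, 67, 68, 69, 61, 60)          # made-up times
diet <- factor(c("A","A","A","B","B","B","C","C","D","D"))
fit  <- lm(coag ~ diet)

means <- tapply(coag, diet, mean)   # group averages
unname(coef(fit)["(Intercept)"])    # mean of diet A
unname(means["B"] - means["A"])     # same as coef(fit)["dietB"]
```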

Page 41

Are the assumptions met?

1: Is the response variable continuous? No, discrete!

2 : Are the residuals normally distributed ?

> library(nortest)       ## Package of Normality tests
> ad.test(mod$residuals) ## Anderson-Darling

Anderson-Darling normality test

data: mod$residuals A = 0.301, p-value = 0.5517

> plot(mod, which=2) ## qqplot of std.residuals

Answer: Yes !

Page 42

3 : Are the residuals independent

and identically distributed ?

> plot(mod, which=1) ## residuals vs fitted values.

Answer: OK

> plot(mod, which=3)  ## sqrt(abs(standardized residuals)) vs fitted values

[Plot: fitted values for the four diets A, D, B, C; "3 obs." marks overlapping points.]

> library(car)
> levene.test(mod)
Levene's Test for Homogeneity of Variance
      Df F value Pr(>F)
group  3  0.6492 0.5926
      20

Page 43

Page 44

More Classical Graphs

Histogram + Theoretical curve, Boxplot, Stripchart, Barplot, Pie chart, 3D models