Introduction to Linear Regression 2014 – Part 2 Handouts

Introduction to Linear regression analysis
Part 2: Model comparisons


ANOVA for regression

Total variation in Y (SS_Total) = variation explained by the regression on X (SS_Regression) + residual variation (SS_Residual)

Full model:

y_i = β_0 + β_1 x_i + ε_i

• Unexplained variation in Y from the full model = SS_Residual = Σ(y_i − ŷ_i)²


Reduced model (H0 true)

• Reduced model (H0: β_1 = 0 true):

y_i = β_0 + ε_i

• Unexplained variation in Y from the reduced model = SS_Total = Σ(y_i − ȳ)²
  (mean and error only)

Model comparison

• Difference in unexplained variation between the full and reduced models:

SS_Total − SS_Residual = SS_Regression = Σ(ŷ_i − ȳ)²

• Variation explained by including β_1 in the model (see the numerical sketch below)
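Not part of the original handout: a minimal Python sketch of this partition, assuming numpy is available. The x and y values are simulated purely for illustration, not the handout's data.

```python
# Minimal sketch of the ANOVA partition for simple linear regression.
# Illustrative simulated data; any paired x, y vectors would do.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 15000, size=26)
y = 10 + 0.03 * x + rng.normal(0, 80, size=26)

b1, b0 = np.polyfit(x, y, deg=1)          # full model: y_i = b0 + b1*x_i
y_hat = b0 + b1 * x                       # fitted values

ss_total = np.sum((y - y.mean()) ** 2)    # unexplained variation under the reduced model
ss_residual = np.sum((y - y_hat) ** 2)    # unexplained variation under the full model
ss_regression = np.sum((y_hat - y.mean()) ** 2)   # variation explained by including b1

print(round(ss_total, 2), round(ss_regression + ss_residual, 2))  # the two agree
```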


Explained variation

• Proportion of variation in Y explained by the linear relationship with X

• Termed r², the coefficient of determination:

r² = SS_Regression / SS_Total

• r² is simply the square of the correlation coefficient (r) between X and Y (see the sketch below).
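Also not from the handout: a short sketch, under the same assumptions as above, confirming that SS_Regression / SS_Total equals the squared Pearson correlation between X and Y.

```python
# Sketch: r-squared computed two ways - as SS_Regression / SS_Total and
# as the square of the Pearson correlation between X and Y.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 15000, size=26)
y = 10 + 0.03 * x + rng.normal(0, 80, size=26)

b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

r2_from_ss = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
r = np.corrcoef(x, y)[0, 1]

print(round(r2_from_ss, 6), round(r ** 2, 6))   # identical up to rounding
```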

[Figure: scatterplots of Y1 vs X and Y2 vs X (X from 0 to 15000, Y from 0 to 500)]

Which is the better model??


Which is the better model??

[Figure: scatterplot of Y1 vs X with fitted line]

Dep Var: Y1   N: 26   Multiple R: 0.754377   Squared multiple R: 0.569085
Adjusted squared multiple R: 0.551131   Standard error of estimate: 86.934708

Effect     Coefficient   Std Error   Std Coef   Tolerance   t         P(2 Tail)
CONSTANT   11.207815     30.277197   0.000000   .           0.37017   0.71450
X          0.026573      0.004720    0.754377   1.000000    5.62987   0.00001

Analysis of Variance
Source       Sum-of-Squares   df   Mean-Square   F-ratio     P
Regression   2.39543E+05      1    2.39543E+05   31.695479   0.000009
Residual     1.81383E+05      24   7557.643448

Which is the better model??

[Figure: scatterplot of Y2 vs X with fitted line]

Dep Var: Y2   N: 5   Multiple R: 0.978152   Squared multiple R: 0.956781
Adjusted squared multiple R: 0.942374   Standard error of estimate: 33.608617

Effect     Coefficient   Std Error   Std Coef   Tolerance   t          P(2 Tail)
CONSTANT   -31.455158    32.524324   0.000000   .           -0.96713   0.40482
X          0.033584      0.004121    0.978152   1.000000    8.14944    0.00386

Analysis of Variance
Source       Sum-of-Squares   df   Mean-Square   F-ratio     P
Regression   7.50166E+04      1    7.50166E+04   66.413444   0.003864
Residual     3388.617362      3    1129.539121


Which is the better model??

[Figure: scatterplots of Y1 vs X and Y2 vs X with fitted lines]

Y1: n = 26, P = 0.000009, r² = 0.551
Y2: n = 5, P = 0.00386, r² = 0.942

Which is the better model??

[Figure: scatterplots of Y1 vs X and Y2 vs X with fitted lines and 95% confidence bands (for the slope)]

Y1: n = 26, P = 0.000009, r² = 0.551
Y2: n = 5, P = 0.00386, r² = 0.942


Assumptions

Normality

Y normally distributed at each value of X:

– Boxplot of Y should be symmetrical – watch out for outliers and skewness

– Transformations often help


Homogeneity of variance

Variance (spread) of Y should be constant for each value of x_i (homogeneity of variance):

– Usually very difficult to assess (for models with only one value of Y per X).

[Figure: distributions of Y at x_1 and x_2 with equal spread around the line y_i = β_0 + β_1 x_i]


Homogeneity of variance

Variance (spread) of Y should be constant for each value of x_i (homogeneity of variance):

– Usually very difficult to assess (for models with only one value of Y per X).

– Spread of the residuals should be even when plotted against x_i or the predicted y_i's (see the sketch below)

– Transformations often help

– Transformations that improve the normality of Y will usually also make the variance of Y more constant

[Figure: graphical tests of homogeneity of variance – scatterplot of Y1 vs X, standardized residuals vs X, and residuals vs predicted values (ESTIMATE)]
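A sketch (not in the handout) of the residual-vs-predicted plot described above, using statsmodels and matplotlib on simulated data; in practice you would substitute your own y and x.

```python
# Sketch: residuals vs predicted values as a visual check of
# homogeneity of variance (simulated data; statsmodels OLS).
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.uniform(0, 15000, size=26)
y = 10 + 0.03 * x + rng.normal(0, 80, size=26)

fit = sm.OLS(y, sm.add_constant(x)).fit()   # add_constant supplies the intercept column

plt.scatter(fit.fittedvalues, fit.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted y")
plt.ylabel("Residual")
plt.title("Spread should be roughly even across predicted values")
plt.show()
```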


Independence

Values of y_i are independent of each other:

– watch out for data that are a time series on the same experimental or sampling units

– should be considered at the design stage

Linearity

True population relationship between Y and X is linear:

– scatterplot of Y against X

– watch out for asymptotic or exponential patterns

– transformations of Y, or of both Y and X, often help

– always look at the residuals


EDA and regression diagnostics

• Check assumptions

• Check fit of model

• Warn about influential observations and outliers

EDA

• Boxplots of Y (and X):
  – check for normality, outliers etc.

• Scatterplot of Y and X:
  – check for linearity, homogeneity of variance, outliers etc. (see the sketch below)
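A minimal EDA sketch (not in the handout; matplotlib assumed, simulated data) covering the two checks just listed: boxplots of Y and X, and a scatterplot of Y against X.

```python
# Sketch: quick graphical EDA before fitting a regression -
# boxplots of Y and X plus a scatterplot of Y against X. Simulated data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
x = rng.uniform(0, 15000, size=26)
y = 10 + 0.03 * x + rng.normal(0, 80, size=26)

fig, axes = plt.subplots(1, 3, figsize=(10, 3))
axes[0].boxplot(y)
axes[0].set_title("Boxplot of Y")   # symmetry, outliers, skewness
axes[1].boxplot(x)
axes[1].set_title("Boxplot of X")
axes[2].scatter(x, y)
axes[2].set_title("Y vs X")         # linearity, spread, outliers
plt.tight_layout()
plt.show()
```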


Anscombe (1973) data set

[Figure: scatterplots of the four Anscombe (1973) data sets, each with the same fitted regression line]

For every one of the four data sets: R² = 0.667, y = 3.0 + 0.5x, t = 4.24, P = 0.002 (see the sketch below)
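The sketch below (not in the handout) reproduces that point with the four Anscombe data sets; the numeric values are transcribed from the commonly published version of Anscombe (1973), so treat them as illustrative rather than authoritative.

```python
# Sketch: fit the same simple regression to each of the four Anscombe
# (1973) data sets; slope, intercept and R^2 come out essentially
# identical even though the scatterplots look completely different.
import numpy as np

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in quartet.items():
    x, y = np.asarray(x, float), np.asarray(y, float)
    b1, b0 = np.polyfit(x, y, deg=1)
    r2 = np.corrcoef(x, y)[0, 1] ** 2
    print(f"set {name}: y = {b0:.1f} + {b1:.2f}x, R^2 = {r2:.3f}")
# every set prints approximately: y = 3.0 + 0.50x, R^2 = 0.667
```

Plotting each set is what actually reveals the differences, which is the handout's point about always looking at the data and the residuals.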


Limited or weighted data smoothers (for data exploration – especially useful for model fitting)

• Nonparametric description of the relationship between Y and X
  – unconstrained by a specific model structure

• Useful exploratory technique:
  – is a linear model appropriate?
  – are particular observations influential?

Limited or weighted data smoothers

• Each observation is replaced by the mean or median of the surrounding observations
  – or by the predicted value of a regression model fitted through the surrounding observations

• The surrounding observations lie in a window (or band)
  – covers a range along the X-axis
  – size of the window (number of observations) is determined by the smoothing parameter


Limited or weighted data smoothers

• Adjacent windows overlap
  – the resulting line is smooth
  – smoothness is controlled by the smoothing parameter (width of the window)

• Any section of the line is robust to extreme values in other windows

Types of limited or weighted data smoothers (examples)

• Running (moving) means or medians:
  – means or medians within each window

• Lo(w)ess:
  – locally weighted regression scatterplot smoothing
  – observations within a window are weighted differently
  – observations are replaced by predicted values from a local regression line (see the sketch below)
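A lowess sketch (not in the handout): statsmodels ships one readily available implementation, where the frac argument plays the role of the smoothing parameter (the fraction of observations in each window). Simulated, deliberately curved data.

```python
# Sketch: lowess smoothing overlaid on a scatterplot as an exploratory
# check of whether a straight line is appropriate. frac is the
# smoothing parameter (fraction of observations per window).
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 15000, size=60))
y = 400 * (1 - np.exp(-x / 4000)) + rng.normal(0, 30, size=60)   # asymptotic pattern

smoothed = lowess(y, x, frac=0.5)   # returns sorted x in column 0, smoothed y in column 1

plt.scatter(x, y, alpha=0.6)
plt.plot(smoothed[:, 0], smoothed[:, 1], color="red")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
```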


Residuals – very useful for examining regression assumptions

• Difference between the observed value and the value predicted (fitted) by the model

• Residual for each observation:
  – the difference between the observed y_i and the value of y predicted by the linear regression equation: y_i − ŷ_i

Studentised residuals

• residual / SE of the residuals
• follow a t-distribution
• studentised residuals can be compared between different regressions (see the sketch below)

Observations with a large residual (or studentised residual) are outliers from the fitted model.
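A sketch (not in the handout) of extracting ordinary and studentised residuals with statsmodels on simulated data; the cut-off mentioned in the comment is a common rule of thumb rather than anything from the handout.

```python
# Sketch: ordinary and (externally) studentised residuals from an OLS
# fit. Large studentised residuals flag observations that are outliers
# from the fitted model. Simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(0, 15000, size=26)
y = 10 + 0.03 * x + rng.normal(0, 80, size=26)

fit = sm.OLS(y, sm.add_constant(x)).fit()
infl = fit.get_influence()

raw_resid = fit.resid                             # y_i - y_hat_i
student_resid = infl.resid_studentized_external   # residual scaled by its SE (leave-one-out)

print(np.round(student_resid, 2))
# values beyond roughly +/- 2 to 3 usually deserve a closer look
```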


Plot residuals against the predicted y_i

[Figure: scatterplot of Y vs X with even spread around the line, and the corresponding plot of residuals vs predicted y_i lying evenly within ±1 SE]

• No pattern in the residuals
• Even spread of Y around the line
• Indicates assumptions OK

[Figure: scatterplot of Y vs X with uneven spread around the line, and the corresponding wedge-shaped plot of residuals vs predicted y_i]

• Increasing spread of residuals, i.e. wedge shape
• Uneven spread of Y around the line
• Unequal variance in Y
• Skewed distribution of Y
• Transformation of Y helps


Violations of assumptions may not always lead to a non-significant result

• Relationship between birth rate and per capita income

[Figure: scatterplot of birth rate per female (per year) vs per capita income (thousands)]

Dep Var: BIRTHRATE   N: 57   Multiple R: 0.759086   Squared multiple R: 0.576212
Adjusted squared multiple R: 0.568506   Standard error of estimate: 0.088099

Analysis of Variance
Source       Sum-of-Squares   df   Mean-Square   F-ratio     P
Regression   0.580417         1    0.580417      74.781752   0.000000
Residual     0.426881         55   0.007761


Violations of assumptions may not always lead to a non-significant result

• Perhaps consider a transformation

[Figure: scatterplot of birth rate per female (per year) vs per capita income (thousands), and plot of residuals vs predicted values (ESTIMATE)]

• Log transform of per capita income

[Figure: plot of residuals vs predicted values (ESTIMATE) after the transformation, and scatterplot of birth rate per female (per year) vs per capita income (log(thousands))]


Comparison of models

[Figure: scatterplot of birth rate per female (per year) vs per capita income (thousands) with fitted line]

Untransformed per capita income:

Dep Var: BIRTHRATE   N: 57   Multiple R: 0.759086   Squared multiple R: 0.576212
Adjusted squared multiple R: 0.568506   Standard error of estimate: 0.088099

Analysis of Variance
Source       Sum-of-Squares   df   Mean-Square   F-ratio     P
Regression   0.580417         1    0.580417      74.781752   0.000000
Residual     0.426881         55   0.007761

Log-transformed per capita income:

Dep Var: BIRTHRATE   N: 57   Multiple R: 0.894713   Squared multiple R: 0.800512
Adjusted squared multiple R: 0.796885   Standard error of estimate: 0.060444

Analysis of Variance
Source       Sum-of-Squares   df   Mean-Square   F-ratio      P
Regression   0.806354         1    0.806354      220.705235   0.000000
Residual     0.200944         55   0.003654

[Figure: scatterplot of birth rate per female (per year) vs per capita income (log(thousands)) with fitted line]
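A sketch (not in the handout) of the same kind of comparison: fit Y against X and against log(X) and compare R². The data here are simulated to mimic the curved birth-rate pattern; they are not the handout's data set.

```python
# Sketch: compare a regression on untransformed X with one on log(X).
# Simulated data that mimic a curved, birth-rate-like relationship.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
income = rng.uniform(200, 20000, size=57)
birthrate = 0.6 - 0.05 * np.log(income) + rng.normal(0, 0.03, size=57)

fit_raw = sm.OLS(birthrate, sm.add_constant(income)).fit()
fit_log = sm.OLS(birthrate, sm.add_constant(np.log(income))).fit()

print("untransformed X:   R^2 =", round(fit_raw.rsquared, 3))
print("log-transformed X: R^2 =", round(fit_log.rsquared, 3))
# the log model fits the curved relationship noticeably better,
# and its residual plot loses the pattern seen with untransformed X
```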

Other indicators

• Outliers

• Leverage

• Influence


Outliers

• Observations further from the fitted model than the remaining observations
  – might be different from sample outliers in boxplots

• Large residual

[Figure: scatterplot with fitted line; one point well off the line is labelled "outlier"]

Use a robust estimator.

Leverage

• How extreme an observation is for the X-variable

• Measures how much each x_i influences the predicted y_i (see the sketch below)

[Figure: scatterplot with fitted line; one point at extreme X is labelled "large leverage"]
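A leverage sketch (not in the handout): for simple linear regression the hat value has a closed form, computed directly below and checked against statsmodels. Simulated data with one deliberately extreme X value.

```python
# Sketch: leverage (hat values) for simple linear regression.
# h_i = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2, large when x_i is
# extreme on the X axis. Simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
x = np.append(rng.uniform(0, 5000, size=25), 15000.0)   # one extreme X value
y = 10 + 0.03 * x + rng.normal(0, 80, size=26)

n = len(x)
h = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

fit = sm.OLS(y, sm.add_constant(x)).fit()
h_check = fit.get_influence().hat_matrix_diag            # same quantity from statsmodels

print(np.round(h, 3))        # the point at x = 15000 has by far the largest leverage
print(np.allclose(h, h_check))
```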


Influence

• Cook's D statistic:
  – incorporates leverage and residual
  – flags observations with a large influence on the estimated slope
  – observations with D near or greater than 1 should be checked (see the sketch below)

[Figure: scatterplot of Y vs X with fitted line and three labelled observations]

• Observation 1 is an X and Y outlier but not influential

• Observation 2 has a large residual – an outlier

• Observation 3 is very influential (large Cook's D) – also an outlier
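A Cook's D sketch (not in the handout), using statsmodels' influence measures on simulated data in which one high-leverage point has been pulled well off the line.

```python
# Sketch: Cook's D combines leverage and residual size; values near or
# above 1 should be checked. Simulated data with one influential point.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
x = np.append(rng.uniform(0, 5000, size=25), 15000.0)
y = 10 + 0.03 * x + rng.normal(0, 80, size=26)
y[-1] = 50.0                     # drag the high-leverage point off the line

fit = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d, _ = fit.get_influence().cooks_distance   # one distance per observation

for i, d in enumerate(cooks_d):
    if d > 0.5:                  # flag anything approaching 1
        print(f"observation {i}: Cook's D = {d:.2f}")
```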


Transformations

• If Y (and the error terms) is skewed:
  – log or power transformation of Y
  – improves homogeneity of variance
  – can reduce the influence of outliers

• If the relationship is nonlinear:
  – linearize by transformation of Y and/or X

• Transformed variables must make biological sense

Robust regression

• Help with outliers in Y (see the sketch below)
  – Least Absolute Deviations (LAD) regression
    • minimizes the sum of absolute residuals (instead of squared residuals)
  – M (maximum likelihood) regression
    • many types – often based on iteratively reweighted least squares; not useful for problems with leverage; converges on OLS when the assumptions are met

• Help with outliers in Y and X
  – Least Median of Squares (LMS) regression
    • minimizes the median of the squared residuals
  – Least Trimmed Squares (LTS) regression
    • uses a trimmed set of observations and operates on sums of squared residuals
  – Rank regression
    • uses ranks of the residuals rather than their values
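A sketch (not in the handout) of two of the robust alternatives listed above, as implemented in statsmodels: LAD regression via quantile regression at q = 0.5, and Huber M-regression fitted by iteratively reweighted least squares. LMS, LTS and rank regression are not shown here.

```python
# Sketch: OLS vs LAD (median regression) vs Huber M-regression on
# simulated data that contain two gross outliers in Y.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
x = rng.uniform(0, 15000, size=30)
y = 10 + 0.03 * x + rng.normal(0, 30, size=30)
y[:2] += 400                                     # two gross Y outliers

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()
lad = sm.QuantReg(y, X).fit(q=0.5)               # minimises the sum of absolute residuals
huber = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()   # M-estimation via IRLS

print("OLS slope:  ", round(ols.params[1], 4))
print("LAD slope:  ", round(lad.params[1], 4))
print("Huber slope:", round(huber.params[1], 4))
# the robust slopes are pulled around far less by the two outliers
```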


Regression through origin

• Fit the model: y_i = β_1 x_i + ε_i

• Problems:
  – extrapolation below the smallest x_i
  – ANOVA partition of SS no longer additive
  – r² difficult to interpret

• Always test H0: β_0 = 0 first (see the sketch below)

• Recommendation – only do this if there is a compelling reason to do so (usually there is not)
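A closing sketch (not in the handout): fit the usual model, look at the test of H0: β_0 = 0, and only then consider the no-intercept fit. Simulated data whose true intercept is zero.

```python
# Sketch: regression through the origin vs the usual model, preceded by
# the t-test of H0: beta_0 = 0. Simulated data with a true zero intercept.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(12)
x = rng.uniform(0, 15000, size=26)
y = 0.03 * x + rng.normal(0, 80, size=26)

with_int = sm.OLS(y, sm.add_constant(x)).fit()
print("intercept estimate:", round(with_int.params[0], 2),
      " p-value:", round(with_int.pvalues[0], 3))    # test of H0: beta_0 = 0

no_int = sm.OLS(y, x[:, None]).fit()                 # y_i = beta_1 * x_i + e_i
print("slope (through origin):", round(no_int.params[0], 4))
# note: R^2 from a no-intercept fit is not comparable with the usual R^2
```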