Linear Regression

Didier Concordet [email protected]

ECVPT Workshop April 2011

Ecole Nationale Vétérinaire de Toulouse

Can be downloaded at http://www.biostat.envt.fr/

2

An example

[Figure: scatterplot of HPLC (Y) against known concentrations (x)]

x     Y
10    38.8
10    60.0
10    49.0
20    85.5
20    82.2
20    72.9
30    96.7
30    102.9
30    114.0
50    156.9
50    176.7
50    171.9
70    212.0
70    223.4
70    228.0
90    283.4
90    274.4
90    294.0

3

About the straight line

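Y = a + b x

a = intercept, b = slope

[Figure: three sketches of Y against x: a line with intercept a and slope b > 0 or b < 0; a horizontal line with intercept a and b = 0; a line through the origin with a = 0 and b > 0]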

4

Questions

• How to obtain the best straight line ?

• Is this straight line the best curve to use ?

• How to use this straight line ?

5

How to obtain the best straight line ?

Proceed in three main steps

• write a (statistical) model

• estimate the parameters

• graphical inspection of data

6

Write a model

A statistical model

Mean model: functional relationship

Variance model: assumptions on the residuals

7

Write a model

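$$Y_i = a + b x_i + \varepsilon_i$$

Mean model: $a + b x_i$

$\varepsilon_i$ = residual (error term)

[Figure: observation $Y_i$ at $x_i$, the mean model $a + b x_i$ and the residual $\varepsilon_i$]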

8

Assumptions on the residuals

• the $x_i$'s are not random variables: they are known with high precision

• the $\varepsilon_i$'s have a constant variance: homoscedasticity

• the $\varepsilon_i$'s are independent

• the $\varepsilon_i$'s are normally distributed: normality

9

Homoscedasticity

[Figure: two scatterplots of Y against x; left panel: homoscedasticity; right panel: heteroscedasticity]

10

Normality

[Figure: plot of Y against x illustrating the normality assumption]

11

Estimate the parameters

A criterion is needed to estimate parameters

A statistical model → a criterion

12

How to estimate the "best" a and b ?

Intuitive criterion: $\sum_i \varepsilon_i$ minimum

Problem: compensation (positive and negative residuals cancel out)

Reasonable criterion: $\sum_i \varepsilon_i^2$ minimum

Linear model, homoscedasticity, normality → least squares criterion (L.S.)

13

The least squares criterion

$$\sum_i \varepsilon_i^2 \ \text{minimum} \iff \sum_i (Y_i - a - b x_i)^2 \ \text{minimum}$$

since $\varepsilon_i = Y_i - a - b x_i$

14

Result of optimisation

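$$\hat b = \frac{\sum_i (x_i - \bar x)(Y_i - \bar Y)}{\sum_i (x_i - \bar x)^2} \qquad \hat a = \bar Y - \hat b\,\bar x$$

$\hat a$ and $\hat b$ change with samples

$\hat a$ and $\hat b$ are random variables

$$se(\hat a) = \hat\sigma \sqrt{\frac{1}{n} + \frac{\bar x^2}{\sum_i (x_i - \bar x)^2}} \qquad se(\hat b) = \frac{\hat\sigma}{\sqrt{\sum_i (x_i - \bar x)^2}}$$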
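The closed-form estimates and standard errors can be sketched in a few lines of plain Python. This is only an illustration: the helper name `least_squares` is hypothetical, and the data are assumed to be the eighteen HPLC points of the example.

```python
import math

# HPLC example data (assumed: the eighteen points of the worked example).
x = [10, 10, 10, 20, 20, 20, 30, 30, 30, 50, 50, 50, 70, 70, 70, 90, 90, 90]
y = [38.8, 60.0, 49.0, 85.5, 82.2, 72.9, 96.7, 102.9, 114.0,
     156.9, 176.7, 171.9, 212.0, 223.4, 228.0, 283.4, 274.4, 294.0]


def least_squares(x, y):
    """Return intercept, slope and their standard errors (hypothetical helper)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b_hat = sxy / sxx
    a_hat = y_bar - b_hat * x_bar
    # Residual standard deviation with n - 2 degrees of freedom.
    ss_res = sum((yi - a_hat - b_hat * xi) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(ss_res / (n - 2))
    se_a = sigma_hat * math.sqrt(1 / n + x_bar ** 2 / sxx)
    se_b = sigma_hat / math.sqrt(sxx)
    return a_hat, b_hat, se_a, se_b


a_hat, b_hat, se_a, se_b = least_squares(x, y)
print(f"a = {a_hat:.3f} (se {se_a:.3f}), b = {b_hat:.3f} (se {se_b:.3f})")
```

The printed values can be compared with the CONSTANT and CONCENT rows of the software output in the example.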

15

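Summary

True mean straight line: $Y = a + b x$

Estimated straight line: $\hat Y = \hat a + \hat b x$ or $\hat Y = \bar Y + \hat b\,(x - \bar x)$

Mean predicted value for the ith observation: $\hat Y_i = \hat a + \hat b x_i$

ith residual: $\hat\varepsilon_i = Y_i - \hat Y_i = Y_i - \hat a - \hat b x_i$

Model: $Y_i = a + b x_i + \varepsilon_i$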

16

Example

Dep Var: HPLC   N: 18

Effect      Coefficient   Std Error   t        P(2 Tail)
CONSTANT    20.046        3.682       5.444    0.000
CONCENT     2.916         0.069       42.030   0.000

CONSTANT = intercept $\hat a$, CONCENT = slope $\hat b$

Estimated straight line: $\hat Y = 20.046 + 2.916\,x$

$$CV(\hat a) = \frac{se(\hat a)}{\hat a} = \frac{3.682}{20.046} = 0.1837 = 18.37\%$$

17

Example

[Figure: scatterplot of HPLC (Y) against known concentrations (x)]

18

Example

x_i   Y_i     Ŷ_i     ε̂_i
10    38.8    49.2    -10.4
10    60.0    49.2    10.8
10    49.0    49.2    -0.2
20    85.5    78.4    7.2
20    82.2    78.4    3.8
20    72.9    78.4    -5.5
30    96.7    107.5   -10.9
30    102.9   107.5   -4.6
30    114.0   107.5   6.5
50    156.9   165.8   -9.0
50    176.7   165.8   10.8
50    171.9   165.8   6.1
70    212.0   224.2   -12.2
70    223.4   224.2   -0.8
70    228.0   224.2   3.8
90    283.4   282.5   0.9
90    274.4   282.5   -8.1
90    294.0   282.5   11.5

19

Residual variance

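By construction

$$\sum_i \hat\varepsilon_i = \sum_i (Y_i - \hat Y_i) = 0$$

but

$$\sum_i \hat\varepsilon_i^2 = \sum_i (Y_i - \hat Y_i)^2 \neq 0$$

The residual variance is defined by

$$\hat\sigma^2 = \frac{\sum_i \hat\varepsilon_i^2}{n-2} = \frac{\sum_i (Y_i - \hat Y_i)^2}{n-2}$$

$\hat\sigma = \sqrt{\hat\sigma^2}$ = standard error of estimate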
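As a rough check of these definitions, the residuals and the standard error of estimate can be computed directly. The sketch below assumes the eighteen HPLC points of the example; the variable names are illustrative only.

```python
import math

# HPLC example data (assumed: the eighteen points of the worked example).
x = [10, 10, 10, 20, 20, 20, 30, 30, 30, 50, 50, 50, 70, 70, 70, 90, 90, 90]
y = [38.8, 60.0, 49.0, 85.5, 82.2, 72.9, 96.7, 102.9, 114.0,
     156.9, 176.7, 171.9, 212.0, 223.4, 228.0, 283.4, 274.4, 294.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
a_hat = y_bar - b_hat * x_bar

fitted = [a_hat + b_hat * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

sum_res = sum(residuals)                  # zero by construction, up to rounding
ss_res = sum(e ** 2 for e in residuals)   # not zero
sigma2_hat = ss_res / (n - 2)             # residual variance
sigma_hat = math.sqrt(sigma2_hat)         # standard error of estimate

print(f"sum of residuals = {sum_res:.6f}")
print(f"standard error of estimate = {sigma_hat:.3f}")
```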

20

Example

Dep Var: HPLC   N: 18

Multiple R: 0.996   Squared multiple R: 0.991   Adjusted squared multiple R: 0.991

Standard error of estimate: 8.282

Effect      Coefficient   Std Error   t        P(2 Tail)
CONSTANT    20.046        3.682       5.444    0.000
CONCENT     2.916         0.069       42.030   0.000

21

Questions




22

Is this model the best one to use ?

Tools to check the mean model:
• scatterplot of residuals vs fitted values
• test(s)

Tools to check the variance model:
• scatterplot of residuals vs fitted values
• probability plot (Pplot)

23

Checking the mean model

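Scatterplot of residuals vs fitted values

[Figure: two plots of the residuals $\hat\varepsilon_i$ against the fitted values $\hat Y_i$, around the zero line]

• No structure in the residuals → OK

• Structure in the residuals → change the mean model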

24

Checking the mean model : tests

Two cases

• Replications → test of lack of fit

• No replication → try a polynomial model (quadratic first)

25

Without replication

Try another mean model and test the improvement.

Example:

$$Y_i = a + b x_i + c x_i^2 + \varepsilon_i$$

If the test on c is significant ($c \neq 0$) then keep this model.

Dep Var: HPLC   N: 18
Multiple R: 0.996   Squared multiple R: 0.991   Adjusted squared multiple R: 0.991
Standard error of estimate: 8.539

Effect            Coefficient   Std Error   t       P(2 Tail)
CONSTANT          21.284        6.649       3.201   0.006
CONCENT           2.842         0.335       8.486   0.000
CONCENT*CONCENT   0.001         0.003       0.227   0.824

Here the test on c is not significant (P = 0.824): the quadratic term brings no improvement.
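The test on c can be sketched without a matrix solver. The sketch relies on a standard result: the coefficient of x² in the quadratic fit equals the slope obtained when the straight-line residuals of Y are regressed on the straight-line residuals of x². The helper `residuals_on_line` is hypothetical, the data are assumed to be the eighteen HPLC points of the example, and 2.131 is the usual table value of the t quantile for 15 degrees of freedom at a two-sided 5 % level.

```python
import math

# HPLC example data (assumed: the eighteen points of the worked example).
x = [10, 10, 10, 20, 20, 20, 30, 30, 30, 50, 50, 50, 70, 70, 70, 90, 90, 90]
y = [38.8, 60.0, 49.0, 85.5, 82.2, 72.9, 96.7, 102.9, 114.0,
     156.9, 176.7, 171.9, 212.0, 223.4, 228.0, 283.4, 274.4, 294.0]


def residuals_on_line(x, v):
    """Residuals of v after a least squares straight-line fit on x."""
    n = len(x)
    x_bar = sum(x) / n
    v_bar = sum(v) / n
    slope = (sum((xi - x_bar) * (vi - v_bar) for xi, vi in zip(x, v))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = v_bar - slope * x_bar
    return [vi - intercept - slope * xi for xi, vi in zip(x, v)]


n = len(x)
e_y = residuals_on_line(x, y)                     # Y adjusted for (1, x)
z = residuals_on_line(x, [xi ** 2 for xi in x])   # x^2 adjusted for (1, x)

szz = sum(zi ** 2 for zi in z)
c_hat = sum(zi * ei for zi, ei in zip(z, e_y)) / szz

# Residual sum of squares of the quadratic model, n - 3 degrees of freedom.
ss_quad = sum(ei ** 2 for ei in e_y) - c_hat ** 2 * szz
sigma_quad = math.sqrt(ss_quad / (n - 3))
t_c = c_hat / (sigma_quad / math.sqrt(szz))

T_CRIT_15 = 2.131  # table value, 15 df, two-sided 5 % level
significant = abs(t_c) > T_CRIT_15
print(f"c = {c_hat:.4f}, t = {t_c:.3f}, significant: {significant}")
```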

26

With replications

Perform a test of lack of fit

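Principle: compare the departure from linearity to the pure error.

If departure from linearity > pure error then change the model.

[Figure: at a given x, the departure from linearity is the gap between the mean $\bar Y$ of the replicates and the fitted value $\hat a + \hat b x$; the pure error is the dispersion of the replicates around their mean]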

27

Test of lack of fit : how to do it ?

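Three steps

1) Linear regression: $SS_{RES} = (n-2)\,\hat\sigma^2$, $df_{RES} = n-2$

2) One way ANOVA: $\hat\sigma^2_{anova}$ and $SS_{error}$, with $df_{error}$ degrees of freedom

3) $SS_{LOF} = SS_{RES} - SS_{error}$, $df_{LOF} = df_{RES} - df_{error}$

$$F = \frac{SS_{LOF}/df_{LOF}}{SS_{error}/df_{error}}$$

If $F > f^{\alpha}_{df_{LOF},\,df_{error}}$ (critical value of the F distribution at level $\alpha$) then change the model.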

28

Test of lack of fit : example

Three steps

1) Linear regression: $\hat\sigma^2 = 8.282^2$, $df_{RES} = 18 - 2 = 16$, $SS_{RES} = (18-2) \times 8.282^2 = 1097.5$

2) One way ANOVA

Dep Var: HPLC   N: 18
Analysis of Variance
Source    Sum-of-Squares   df   Mean-Square   F-ratio   P
CONCENT   121251.776       5    24250.355     289.434   0.000
Error     1005.427         12   83.786

3) $df_{LOF} = 16 - 12 = 4$, $SS_{LOF} = 1097.5 - 1005.427 = 92.073$

$$F = \frac{92.073/4}{1005.427/12} = 0.2747 < f^{0.05}_{4,12} = 3.26$$

We keep the straight line.
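The three steps can be sketched in plain Python as follows. The data are assumed to be the eighteen HPLC points of the example, the variable names are illustrative, and the critical value 3.26 is simply the one quoted in the example rather than computed.

```python
from collections import defaultdict

# HPLC example data (assumed: the eighteen points of the worked example).
x = [10, 10, 10, 20, 20, 20, 30, 30, 30, 50, 50, 50, 70, 70, 70, 90, 90, 90]
y = [38.8, 60.0, 49.0, 85.5, 82.2, 72.9, 96.7, 102.9, 114.0,
     156.9, 176.7, 171.9, 212.0, 223.4, 228.0, 283.4, 274.4, 294.0]
n = len(x)

# Step 1: linear regression, residual sum of squares with n - 2 df.
x_bar = sum(x) / n
y_bar = sum(y) / n
b_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
a_hat = y_bar - b_hat * x_bar
ss_res = sum((yi - a_hat - b_hat * xi) ** 2 for xi, yi in zip(x, y))
df_res = n - 2

# Step 2: one way ANOVA, pure error around the mean of each concentration.
groups = defaultdict(list)
for xi, yi in zip(x, y):
    groups[xi].append(yi)
ss_error = 0.0
for values in groups.values():
    mean = sum(values) / len(values)
    ss_error += sum((v - mean) ** 2 for v in values)
df_error = n - len(groups)

# Step 3: lack of fit by difference, then the F ratio.
ss_lof = ss_res - ss_error
df_lof = df_res - df_error
f_ratio = (ss_lof / df_lof) / (ss_error / df_error)

F_CRIT = 3.26  # critical value quoted in the example for (4, 12) df
change_model = f_ratio > F_CRIT
print(f"F = {f_ratio:.4f}, change the model: {change_model}")
```

Small differences from the figures of the example are expected, because the example starts from the rounded value 8.282.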

29

Checking the variance model : homoscedasticity

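Scatterplot of residuals vs fitted values

[Figure: two plots of the residuals $\hat\varepsilon_i$ against the fitted values $\hat Y_i$, around the zero line]

• Homoscedasticity → OK

• No structure in the residuals but heteroscedasticity → change the model (criterion)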

30

What to do with heteroscedasticity ?

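Scatterplot of residuals vs fitted values: model the dispersion.

[Figure: plot of the residuals $\hat\varepsilon_i$ against the fitted values $\hat Y_i$, around the zero line]

The standard deviation of the residuals increases with $\hat Y$: it increases with x.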

31

What to do with heteroscedasticity ?

Estimate the slope and the intercept again, but with weights inversely proportional to the variance:

$$\sum_i w_i\,\varepsilon_i^2 = \sum_i w_i\,(Y_i - a - b x_i)^2 \ \text{minimum, with } w_i = \frac{1}{x_i^2}$$

and check that the weighted residuals $\sqrt{w_i}\,\hat\varepsilon_i$ are homoscedastic.
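A minimal sketch of the weighted criterion, assuming weights $w_i = 1/x_i^2$ as above. The helper `weighted_least_squares` is hypothetical, and the HPLC points are used only to show the mechanics: nothing in the example says that they actually need weighting.

```python
import math

# HPLC example data (assumed: the eighteen points of the worked example).
x = [10, 10, 10, 20, 20, 20, 30, 30, 30, 50, 50, 50, 70, 70, 70, 90, 90, 90]
y = [38.8, 60.0, 49.0, 85.5, 82.2, 72.9, 96.7, 102.9, 114.0,
     156.9, 176.7, 171.9, 212.0, 223.4, 228.0, 283.4, 274.4, 294.0]


def weighted_least_squares(x, y, w):
    """Minimise sum of w_i * (Y_i - a - b * x_i)**2 (hypothetical helper)."""
    sw = sum(w)
    x_w = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    y_w = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of Y
    sxx = sum(wi * (xi - x_w) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_w) * (yi - y_w) for wi, xi, yi in zip(w, x, y))
    b_hat = sxy / sxx
    a_hat = y_w - b_hat * x_w
    return a_hat, b_hat


weights = [1 / xi ** 2 for xi in x]
a_w, b_w = weighted_least_squares(x, y, weights)

# Weighted residuals, to be plotted against the fitted values as a check.
weighted_residuals = [math.sqrt(wi) * (yi - a_w - b_w * xi)
                      for wi, xi, yi in zip(weights, x, y)]
print(f"weighted fit: a = {a_w:.3f}, b = {b_w:.3f}")
```

With equal weights the same function returns the ordinary least squares estimates, which is a convenient sanity check.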

32

Checking the variance model : normality

[Figure: two probability plots of the residuals $\hat\varepsilon_i$; vertical axis: expected value for normal distribution]

• No curvature → normality

• Curvature → non normality: is it so important ?

33

What to do with non normality ?

Try to model the distribution of the residuals.

In general, it is difficult with few observations.

If enough observations are available, non normality does not affect the result too much.

34

An interesting index: R²

R² = squared correlation coefficient

= % of dispersion of the $Y_i$'s explained by the straight line (the model)

0 ≤ R² ≤ 1

If R² = 1, all the $\hat\varepsilon_i$ = 0: the straight line explains all the variation of the $Y_i$'s

If R² = 0, the slope is 0: the straight line does not explain any variation of the $Y_i$'s
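The two readings of R² (squared correlation and share of explained dispersion) can be checked numerically. The sketch below assumes the eighteen HPLC points of the example; variable names are illustrative.

```python
import math

# HPLC example data (assumed: the eighteen points of the worked example).
x = [10, 10, 10, 20, 20, 20, 30, 30, 30, 50, 50, 50, 70, 70, 70, 90, 90, 90]
y = [38.8, 60.0, 49.0, 85.5, 82.2, 72.9, 96.7, 102.9, 114.0,
     156.9, 176.7, 171.9, 212.0, 223.4, 228.0, 283.4, 274.4, 294.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)          # total dispersion of Y
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b_hat = sxy / sxx
a_hat = y_bar - b_hat * x_bar
ss_res = sum((yi - a_hat - b_hat * xi) ** 2 for xi, yi in zip(x, y))

r2 = 1 - ss_res / syy              # share of dispersion explained by the line
r = sxy / math.sqrt(sxx * syy)     # correlation coefficient

print(f"R = {r:.3f}, R2 = {r2:.3f}, R squared directly = {r ** 2:.3f}")
```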

35

An interesting index: R²

[Figure: scatterplot of Y against x]

R² and R (correlation coefficient) are not designed to measure linearity !

Example:
Multiple R: 0.990   Squared multiple R: 0.980   Adjusted squared multiple R: 0.980
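A small made-up illustration of this warning: the data below are not those of the figure, they are an assumed exact parabola Y = x² on x = 1, ..., 10. A straight line still gives a high R², while the residuals show an obvious structure.

```python
# Made-up curved data (assumption for illustration): Y = x^2, x = 1..10.
x = list(range(1, 11))
y = [xi ** 2 for xi in x]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b_hat = sxy / sxx
a_hat = y_bar - b_hat * x_bar
residuals = [yi - a_hat - b_hat * xi for xi, yi in zip(x, y)]
r2 = 1 - sum(e ** 2 for e in residuals) / syy

# The sign pattern of the residuals reveals the curvature that R2 hides.
signs = "".join("+" if e > 0 else "-" for e in residuals)
print(f"R2 = {r2:.3f}, residual signs: {signs}")
```

The residual plot, not R², is the tool that detects the wrong mean model here.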

36

Questions




37

How to use this straight line ?

• Direct use: for a given x
– predict the mean Y
– construct a confidence interval of the mean Y
– construct a prediction interval of Y

• Reverse use, calibration (approximate results): for a given Y
– predict the mean X
– construct a confidence interval of the mean X
– construct a prediction interval of X

38

For a given x predict the mean Y

For a given x, the mean Y is predicted by $\hat a + \hat b x$.

Example: $\hat a = 20.046$, $\hat b = 2.916$, $x = 30$

$$\hat a + \hat b x = 107.53$$

39

Confidence interval of the mean Y

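$$\hat a + \hat b x - t_{n-2}^{1-\alpha/2}\,\hat\sigma \sqrt{\frac{1}{n} + \frac{(x - \bar x)^2}{\sum_i (x_i - \bar x)^2}} \;\le\; a + b x \;\le\; \hat a + \hat b x + t_{n-2}^{1-\alpha/2}\,\hat\sigma \sqrt{\frac{1}{n} + \frac{(x - \bar x)^2}{\sum_i (x_i - \bar x)^2}}$$

There is a probability $1-\alpha$ that $a + bx$ belongs to this interval.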

40

Confidence interval of the mean Y

[Figure: fitted straight line with the confidence band of the mean Y; the limits L and U are read at x = 30]

41

Example

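$\hat a = 20.046$, $\hat b = 2.916$, $x = 30$, $\hat a + \hat b x = 107.53$

$n = 18$, $\alpha = 0.05$, $t_{n-2}^{1-\alpha/2} = 2.12$, $\hat\sigma = 8.282$

$\bar x = 45$, $\sum_i (x_i - \bar x)^2 = 14250$

$L = 102.8$, $U = 112.2$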
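The interval of this example can be recomputed from the raw data. In the sketch below the helper names `fit` and `mean_confidence_interval` are hypothetical, the data are assumed to be the eighteen HPLC points of the example, and the t quantile 2.12 is the value quoted above rather than computed.

```python
import math

# HPLC example data (assumed: the eighteen points of the worked example).
x = [10, 10, 10, 20, 20, 20, 30, 30, 30, 50, 50, 50, 70, 70, 70, 90, 90, 90]
y = [38.8, 60.0, 49.0, 85.5, 82.2, 72.9, 96.7, 102.9, 114.0,
     156.9, 176.7, 171.9, 212.0, 223.4, 228.0, 283.4, 274.4, 294.0]

T_CRIT = 2.12  # t quantile for 16 df at 1 - 0.05/2, as quoted in the example


def fit(x, y):
    """Least squares fit; returns a, b, sigma, mean of x and Sxx."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a_hat = y_bar - b_hat * x_bar
    ss_res = sum((yi - a_hat - b_hat * xi) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(ss_res / (n - 2))
    return a_hat, b_hat, sigma_hat, x_bar, sxx


def mean_confidence_interval(x0, x, y, t_crit):
    """Confidence interval of the mean Y at x0."""
    a_hat, b_hat, sigma_hat, x_bar, sxx = fit(x, y)
    centre = a_hat + b_hat * x0
    half = t_crit * sigma_hat * math.sqrt(1 / len(x) + (x0 - x_bar) ** 2 / sxx)
    return centre - half, centre + half


lower, upper = mean_confidence_interval(30, x, y, T_CRIT)
print(f"L = {lower:.2f}, U = {upper:.2f}")
```

The interval is narrowest at $x = \bar x$ and widens as x moves away from the mean of the known concentrations.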

42

Prediction interval of Y

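$$\hat a + \hat b x - t_{n-2}^{1-\alpha/2}\,\hat\sigma \sqrt{1 + \frac{1}{n} + \frac{(x - \bar x)^2}{\sum_i (x_i - \bar x)^2}} \;\le\; Y \;\le\; \hat a + \hat b x + t_{n-2}^{1-\alpha/2}\,\hat\sigma \sqrt{1 + \frac{1}{n} + \frac{(x - \bar x)^2}{\sum_i (x_i - \bar x)^2}}$$

$100(1-\alpha)\%$ of the measurements carried out for this x belong to this interval.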

43

Prediction interval of Y

[Figure: fitted straight line with the prediction band of Y; the limits L and U are read at x = 30]

44

Example

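$\hat a = 20.046$, $\hat b = 2.916$, $x = 30$, $\hat a + \hat b x = 107.53$

$n = 18$, $\alpha = 0.05$, $t_{n-2}^{1-\alpha/2} = 2.12$, $\hat\sigma = 8.282$

$\bar x = 45$, $\sum_i (x_i - \bar x)^2 = 14250$

$L = 89.4$, $U = 125.7$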
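The prediction interval differs from the confidence interval of the mean only by the extra 1 under the square root. A sketch under the same assumptions as before (eighteen HPLC points of the example, quoted t quantile 2.12, hypothetical helper name `prediction_interval`):

```python
import math

# HPLC example data (assumed: the eighteen points of the worked example).
x = [10, 10, 10, 20, 20, 20, 30, 30, 30, 50, 50, 50, 70, 70, 70, 90, 90, 90]
y = [38.8, 60.0, 49.0, 85.5, 82.2, 72.9, 96.7, 102.9, 114.0,
     156.9, 176.7, 171.9, 212.0, 223.4, 228.0, 283.4, 274.4, 294.0]

T_CRIT = 2.12  # t quantile for 16 df at 1 - 0.05/2, as quoted in the example


def prediction_interval(x0, x, y, t_crit):
    """Prediction interval of a single new measurement Y at x0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a_hat = y_bar - b_hat * x_bar
    ss_res = sum((yi - a_hat - b_hat * xi) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(ss_res / (n - 2))
    centre = a_hat + b_hat * x0
    # The leading 1 accounts for the variability of the new measurement itself.
    half = t_crit * sigma_hat * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
    return centre - half, centre + half


lower, upper = prediction_interval(30, x, y, T_CRIT)
print(f"L = {lower:.2f}, U = {upper:.2f}")
```

The limits agree with those of the example up to the rounding of the intermediate values used there.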

45

Reverse use : for a given Y=y0 predict the mean X

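For a given $Y = y_0$, the mean X is read on the fitted straight line: $\hat X = (y_0 - \hat a)/\hat b$.

Example: $\hat a = 20.046$, $\hat b = 2.916$, $y_0 = 107.53$

$$\hat X = 30$$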
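The reverse reading of the line is a one-line computation once the fit is available. The sketch assumes the eighteen HPLC points of the example; `fit_line` and `inverse_predict` are hypothetical helper names.

```python
# HPLC example data (assumed: the eighteen points of the worked example).
x = [10, 10, 10, 20, 20, 20, 30, 30, 30, 50, 50, 50, 70, 70, 70, 90, 90, 90]
y = [38.8, 60.0, 49.0, 85.5, 82.2, 72.9, 96.7, 102.9, 114.0,
     156.9, 176.7, 171.9, 212.0, 223.4, 228.0, 283.4, 274.4, 294.0]


def fit_line(x, y):
    """Least squares intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b_hat * x_bar, b_hat


def inverse_predict(y0, a_hat, b_hat):
    """Concentration whose fitted mean response equals y0."""
    if b_hat == 0:
        raise ValueError("a zero slope cannot be inverted")
    return (y0 - a_hat) / b_hat


a_hat, b_hat = fit_line(x, y)
x_hat = inverse_predict(107.53, a_hat, b_hat)
print(f"predicted X for y0 = 107.53: {x_hat:.2f}")
```

Note that the line is still fitted as Y on x and only then inverted; x and Y are not swapped in the regression.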

46

For a given Y=y0 a confidence interval of the mean X

[Figure: fitted straight line with its confidence band; the horizontal line at Y0 crosses the band at L and U on the X axis]

47

Confidence interval of the mean X

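L and U are such that

$$y_0 = \hat a + \hat b L + t_{n-2}^{1-\alpha/2}\,\hat\sigma \sqrt{\frac{1}{n} + \frac{(L - \bar x)^2}{\sum_i (x_i - \bar x)^2}}$$

$$y_0 = \hat a + \hat b U - t_{n-2}^{1-\alpha/2}\,\hat\sigma \sqrt{\frac{1}{n} + \frac{(U - \bar x)^2}{\sum_i (x_i - \bar x)^2}}$$

There is a probability $1-\alpha$ that the mean X belongs to [ L , U ].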

48

Example

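$\hat a = 20.046$, $\hat b = 2.916$, $y_0 = 107.53$

$n = 18$, $\alpha = 0.05$, $t_{n-2}^{1-\alpha/2} = 2.12$, $\hat\sigma = 8.282$

$\bar x = 45$, $\sum_i (x_i - \bar x)^2 = 14250$

$L = 28.59$, $U = 31.33$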

49

What you should no longer believe

One can fit the straight line by inverting x and Y

If the correlation coefficient is high, the straight line is the best model

Normality of the $\varepsilon_i$'s is essential to perform a good regression

Normality of the xi's is required to perform a regression