Linear Regression
Didier [email protected]
ECVPT Workshop April 2011
Ecole Nationale Vétérinaire de Toulouse
Can be downloaded at http://www.biostat.envt.fr/
An example
[Figure: scatterplot of the HPLC response (Y) versus the known concentrations (x)]

x     Y
10    38.8
10    60.0
10    49.0
20    85.5
20    82.2
20    72.9
30    96.7
30    102.9
30    114.0
50    156.9
50    176.7
50    171.9
70    212.0
70    223.4
70    228.0
90    283.4
90    274.4
90    294.0
About the straight line
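Y = a + b x
[Figure: three panels of Y versus x. First panel: intercept a, with b > 0 and b < 0. Second panel: intercept a, with b = 0. Third panel: a = 0, with b > 0.]
a = intercept, b = slope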
Questions
• How to obtain the best straight line?
• Is this straight line the best curve to use?
• How to use this straight line?
How to obtain the best straight line?
Proceed in three main steps:
• write a (statistical) model
• estimate the parameters
• graphical inspection of data
Write a model
A statistical model
Mean model: functional relationship
Variance model: assumptions on the residuals
$$Y_i = a + b x_i + \varepsilon_i$$
Mean model: $a + b x_i$
$\varepsilon_i$ = residual (error term)
Assumptions on the residuals
• the $x_i$'s are not random variables: they are known with a high precision
• the $\varepsilon_i$'s have a constant variance: homoscedasticity
• the $\varepsilon_i$'s are independent
• the $\varepsilon_i$'s are normally distributed: normality
Homoscedasticity
[Figure: two scatterplots of Y versus x. Left panel: homoscedasticity. Right panel: heteroscedasticity.]
Normality
[Figure: normality of the residuals, shown on a plot of Y versus x]
Estimate the parameters
A criterion is needed to estimate the parameters.
A statistical model + a criterion
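How to estimate the "best" a and b?
Intuitive criterion: $\sum_i \varepsilon_i$ minimum. Problem: compensation (positive and negative residuals cancel out).
Reasonable criterion: $\sum_i \varepsilon_i^2$ minimum. This is the least squares criterion (L.S.).
Associated assumptions: linear model, homoscedasticity, normality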
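The least squares criterion
With $Y_i = a + b x_i + \varepsilon_i$:
$$\sum_i \varepsilon_i^2 \ \text{minimum} \quad\Longleftrightarrow\quad \sum_i (Y_i - a - b x_i)^2 \ \text{minimum}$$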
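As a sketch of the calculus step behind the least squares criterion (standard, but not written out on the slide), write $S(a, b) = \sum_i (Y_i - a - b x_i)^2$. The name $S$ is introduced here only for illustration. The minimum is reached where both partial derivatives vanish:

```latex
\begin{aligned}
\frac{\partial S}{\partial a} &= -2 \sum_i (Y_i - a - b x_i) = 0
  &&\Rightarrow\quad \hat a = \bar Y - \hat b\,\bar x \\
\frac{\partial S}{\partial b} &= -2 \sum_i x_i\,(Y_i - a - b x_i) = 0
  &&\Rightarrow\quad \hat b = \frac{\sum_i (x_i - \bar x)(Y_i - \bar Y)}{\sum_i (x_i - \bar x)^2}
\end{aligned}
```

The first equation also shows that the fitted line always passes through the point $(\bar x, \bar Y)$.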
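Result of optimisation
$$\hat b = \frac{\sum_i (x_i - \bar x)(Y_i - \bar Y)}{\sum_i (x_i - \bar x)^2} \qquad \hat a = \bar Y - \hat b\,\bar x$$
$\hat a$ and $\hat b$ change with samples
$\hat a$ and $\hat b$ are random variables
$$se(\hat a) = \sqrt{\hat\sigma^2 \left( \frac{1}{n} + \frac{\bar x^2}{\sum_i (x_i - \bar x)^2} \right)} \qquad se(\hat b) = \sqrt{\frac{\hat\sigma^2}{\sum_i (x_i - \bar x)^2}}$$
where $\hat\sigma^2$ is the estimated residual variance.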
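The closed-form estimates and their standard errors can be checked numerically. The following is a minimal sketch in plain Python, assuming the 18 (x, Y) pairs of the HPLC example. The function name `fit_line` and the variable names are illustrative choices, not part of the original material.

```python
from math import sqrt

# Data of the HPLC example (x = known concentration, Y = HPLC response)
x = [10, 10, 10, 20, 20, 20, 30, 30, 30, 50, 50, 50, 70, 70, 70, 90, 90, 90]
Y = [38.8, 60.0, 49.0, 85.5, 82.2, 72.9, 96.7, 102.9, 114.0,
     156.9, 176.7, 171.9, 212.0, 223.4, 228.0, 283.4, 274.4, 294.0]


def fit_line(x, Y):
    """Least squares intercept and slope, with their standard errors."""
    n = len(x)
    x_bar = sum(x) / n
    Y_bar = sum(Y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - Y_bar) for xi, yi in zip(x, Y))
    b_hat = sxy / sxx
    a_hat = Y_bar - b_hat * x_bar
    # Residual variance, with n - 2 degrees of freedom
    ss_res = sum((yi - a_hat - b_hat * xi) ** 2 for xi, yi in zip(x, Y))
    sigma2 = ss_res / (n - 2)
    se_a = sqrt(sigma2 * (1 / n + x_bar ** 2 / sxx))
    se_b = sqrt(sigma2 / sxx)
    return a_hat, b_hat, se_a, se_b


a_hat, b_hat, se_a, se_b = fit_line(x, Y)
print(f"a = {a_hat:.3f} (se {se_a:.3f}), b = {b_hat:.3f} (se {se_b:.3f})")
```

Up to rounding, the result should agree with the software output quoted in the example (intercept 20.046, slope 2.916).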
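Summary
True mean straight line: $Y = a + b x$
Estimated straight line: $\hat Y = \hat a + \hat b x$ or $\hat Y = \bar Y + \hat b\,(x - \bar x)$
Observation: $Y_i = a + b x_i + \varepsilon_i$
Mean predicted value for the ith observation: $\hat Y_i = \hat a + \hat b x_i$
ith residual: $\hat\varepsilon_i = Y_i - \hat Y_i = Y_i - \hat a - \hat b x_i$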
Example
Dep Var: HPLC   N: 18

Effect      Coefficient   Std Error   t        P(2 Tail)
CONSTANT    20.046        3.682       5.444    0.000
CONCENT     2.916         0.069       42.030   0.000

CONSTANT is the intercept $\hat a$; CONCENT is the slope $\hat b$.
Estimated straight line: $\hat Y = 20.046 + 2.916\,x$
$$CV(\hat a) = \frac{se(\hat a)}{\hat a} = \frac{3.682}{20.046} = 0.1837 = 18.37\%$$
[Figure: HPLC (Y) versus known concentrations (x)]
$x_i$   $Y_i$    $\hat Y_i$   $\hat\varepsilon_i$
10      38.8     49.2         -10.4
10      60.0     49.2         10.8
10      49.0     49.2         -0.2
20      85.5     78.4         7.2
20      82.2     78.4         3.8
20      72.9     78.4         -5.5
30      96.7     107.5        -10.9
30      102.9    107.5        -4.6
30      114.0    107.5        6.5
50      156.9    165.8        -9.0
50      176.7    165.8        10.8
50      171.9    165.8        6.1
70      212.0    224.2        -12.2
70      223.4    224.2        -0.8
70      228.0    224.2        3.8
90      283.4    282.5        0.9
90      274.4    282.5        -8.1
90      294.0    282.5        11.5
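Residual variance
By construction
$$\sum_i \hat\varepsilon_i = \sum_i (Y_i - \hat Y_i) = 0$$
but
$$\sum_i \hat\varepsilon_i^2 = \sum_i (Y_i - \hat Y_i)^2 \neq 0$$
The residual variance is defined by
$$\hat\sigma^2 = \frac{\sum_i \hat\varepsilon_i^2}{n - 2} = \frac{\sum_i (Y_i - \hat Y_i)^2}{n - 2}$$
$\sqrt{\hat\sigma^2} = \hat\sigma$ = standard error of estimate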
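Both properties of the residuals (sum equal to zero, sum of squares not zero) can be checked on the HPLC example. This is a minimal sketch in plain Python. The variable names are illustrative, and the fit is recomputed inline so that the block runs on its own.

```python
from math import sqrt

# Data of the HPLC example
x = [10, 10, 10, 20, 20, 20, 30, 30, 30, 50, 50, 50, 70, 70, 70, 90, 90, 90]
Y = [38.8, 60.0, 49.0, 85.5, 82.2, 72.9, 96.7, 102.9, 114.0,
     156.9, 176.7, 171.9, 212.0, 223.4, 228.0, 283.4, 274.4, 294.0]

n = len(x)
x_bar, Y_bar = sum(x) / n, sum(Y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - Y_bar) for xi, yi in zip(x, Y))
b_hat = sxy / sxx
a_hat = Y_bar - b_hat * x_bar

# Residuals = observed value minus mean predicted value
residuals = [yi - (a_hat + b_hat * xi) for xi, yi in zip(x, Y)]

sum_res = sum(residuals)                  # zero, up to floating-point error
ss_res = sum(e ** 2 for e in residuals)   # not zero
sigma2_hat = ss_res / (n - 2)             # residual variance
sigma_hat = sqrt(sigma2_hat)              # standard error of estimate

print(f"sum of residuals = {sum_res:.2e}, sigma_hat = {sigma_hat:.3f}")
```

The divisor is n - 2 rather than n because two parameters (a and b) have been estimated from the same data.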
Example
Dep Var: HPLC   N: 18
Multiple R: 0.996   Squared multiple R: 0.991
Adjusted squared multiple R: 0.991
Standard error of estimate: 8.282

Effect      Coefficient   Std Error   t        P(2 Tail)
CONSTANT    20.046        3.682       5.444    0.000
CONCENT     2.916         0.069       42.030   0.000
Questions
• How to obtain the best straight line?
• Is this straight line the best curve to use?
• How to use this straight line?
Is this model the best one to use?
Tools to check the mean model:
• scatterplot residuals vs fitted values
• test(s)
Tools to check the variance model:
• scatterplot residuals vs fitted values
• probability plot (Pplot)
Checking the mean model
Scatterplot residuals vs fitted values
[Figure: two plots of the residuals $\hat\varepsilon_i$ versus the fitted values $\hat Y_i$, centred on 0]
• No structure in the residuals: OK
• Structure in the residuals: change the mean model
Checking the mean model: tests
Two cases:
• Replications: test of lack of fit
• No replication: try a polynomial model (quadratic first)
Without replication
Try another mean model and test the improvement.
Example:
$$Y_i = a + b x_i + c x_i^2 + \varepsilon_i$$
If the test on c is significant ($c \neq 0$) then keep this model.

Dep Var: HPLC   N: 18
Multiple R: 0.996   Squared multiple R: 0.991
Adjusted squared multiple R: 0.991
Standard error of estimate: 8.539

Effect            Coefficient   Std Error   t       P(2 Tail)
CONSTANT          21.284        6.649       3.201   0.006
CONCENT           2.842         0.335       8.486   0.000
CONCENT*CONCENT   0.001         0.003       0.227   0.824

Here the test on c is not significant (P = 0.824): the quadratic term brings no improvement.
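With replications
Perform a test of lack of fit.
Principle: compare the departure from linearity to the pure error. If the departure from linearity is larger than the pure error, then change the model.
[Figure: Y versus x with the fitted line $\hat a + \hat b x$, showing the departure from linearity and the pure error]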
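Test of lack of fit: how to do it?
Three steps
1) Linear regression
$$SS_{RES} = (n - 2)\,\hat\sigma^2 \qquad df_{RES} = n - 2$$
2) One way ANOVA
$$SS_{error}, \quad df_{error}, \quad \hat\sigma^2_{anova} = \frac{SS_{error}}{df_{error}}$$
3)
$$SS_{LOF} = SS_{RES} - SS_{error} \qquad df_{LOF} = df_{RES} - df_{error}$$
If
$$F = \frac{SS_{LOF} / df_{LOF}}{SS_{error} / df_{error}} > f^{\alpha}_{df_{LOF},\,df_{error}}$$
then change the model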
Test of lack of fit: example
Three steps
1) Linear regression
$$\hat\sigma^2 = 8.282^2 \qquad SS_{RES} = (18 - 2) \times 8.282^2 = 1097.5 \qquad df_{RES} = 18 - 2 = 16$$
2) One way ANOVA

Dep Var: HPLC   N: 18
Analysis of Variance
Source    Sum-of-Squares   df   Mean-Square   F-ratio   P
CONCENT   121251.776       5    24250.355     289.434   0.000
Error     1005.427         12   83.786

3)
$$SS_{LOF} = 1097.5 - 1005.427 = 92.073 \qquad df_{LOF} = 16 - 12 = 4$$
$$F = 0.2747 < 3.26 = f^{0.05}_{4,12}$$
We keep the straight line
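The three steps of the lack-of-fit test can be sketched in plain Python on the HPLC data. The function name `lack_of_fit` is hypothetical. The pure-error sum of squares is computed directly as the within-concentration sum of squares, which is what the one-way ANOVA error line reports. The critical value 3.26 is taken from the example rather than computed.

```python
from collections import defaultdict

# Data of the HPLC example (three replicates per concentration)
x = [10, 10, 10, 20, 20, 20, 30, 30, 30, 50, 50, 50, 70, 70, 70, 90, 90, 90]
Y = [38.8, 60.0, 49.0, 85.5, 82.2, 72.9, 96.7, 102.9, 114.0,
     156.9, 176.7, 171.9, 212.0, 223.4, 228.0, 283.4, 274.4, 294.0]


def lack_of_fit(x, Y):
    """Return (F, df_lof, df_error) for the lack-of-fit test of a straight line."""
    n = len(x)
    x_bar, Y_bar = sum(x) / n, sum(Y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - Y_bar) for xi, yi in zip(x, Y))
    b_hat = sxy / sxx
    a_hat = Y_bar - b_hat * x_bar

    # Step 1: residual sum of squares of the linear regression
    ss_res = sum((yi - a_hat - b_hat * xi) ** 2 for xi, yi in zip(x, Y))
    df_res = n - 2

    # Step 2: pure error = dispersion of the replicates around their group mean
    groups = defaultdict(list)
    for xi, yi in zip(x, Y):
        groups[xi].append(yi)
    ss_error = 0.0
    for ys in groups.values():
        mean = sum(ys) / len(ys)
        ss_error += sum((yi - mean) ** 2 for yi in ys)
    df_error = n - len(groups)

    # Step 3: lack of fit = residual minus pure error
    ss_lof = ss_res - ss_error
    df_lof = df_res - df_error
    F = (ss_lof / df_lof) / (ss_error / df_error)
    return F, df_lof, df_error


F, df_lof, df_error = lack_of_fit(x, Y)
F_CRIT = 3.26  # critical value f(0.05; 4, 12) quoted in the example
print(f"F = {F:.4f} with ({df_lof}, {df_error}) df; change the model: {F > F_CRIT}")
```

The F ratio comes out slightly different from 0.2747 in the last digits because the example rounds the residual sum of squares to 1097.5; the conclusion is unchanged.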
Checking the variance model: homoscedasticity
Scatterplot residuals vs fitted values
[Figure: two plots of the residuals $\hat\varepsilon_i$ versus the fitted values $\hat Y_i$, centred on 0]
• Homoscedasticity: OK
• No structure in the residuals but heteroscedasticity: change the model (criterion)
What to do with heteroscedasticity?
Scatterplot residuals vs fitted values: model the dispersion.
[Figure: residuals $\hat\varepsilon_i$ versus fitted values $\hat Y_i$, centred on 0]
The standard deviation of the residuals increases with $\hat Y$: it increases with x.
Estimate again the slope and the intercept, but with weights inversely proportional to the variance:
$$\sum_i w_i\,\varepsilon_i^2 = \sum_i w_i\,(Y_i - a - b x_i)^2 \ \text{minimum} \qquad \text{with} \quad w_i = \frac{1}{x_i^2}$$
and check that the weighted residuals $\sqrt{w_i}\,\hat\varepsilon_i$ are homoscedastic.
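A weighted fit of this kind can be sketched in plain Python. The function name `weighted_fit` is hypothetical, and the weighted residuals are taken to be sqrt(w_i) times the ordinary residual, which is the usual convention and an assumption here. The HPLC data are reused purely to have numbers to run on; nothing in the example says they are heteroscedastic.

```python
from math import sqrt

# Data of the HPLC example, reused only for illustration
x = [10, 10, 10, 20, 20, 20, 30, 30, 30, 50, 50, 50, 70, 70, 70, 90, 90, 90]
Y = [38.8, 60.0, 49.0, 85.5, 82.2, 72.9, 96.7, 102.9, 114.0,
     156.9, 176.7, 171.9, 212.0, 223.4, 228.0, 283.4, 274.4, 294.0]


def weighted_fit(x, Y, w):
    """Intercept and slope minimising sum_i w_i * (Y_i - a - b * x_i)**2."""
    sw = sum(w)
    # Weighted means replace the ordinary means of the unweighted solution
    x_w = sum(wi * xi for wi, xi in zip(w, x)) / sw
    Y_w = sum(wi * yi for wi, yi in zip(w, Y)) / sw
    sxx = sum(wi * (xi - x_w) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_w) * (yi - Y_w) for wi, xi, yi in zip(w, x, Y))
    b_hat = sxy / sxx
    a_hat = Y_w - b_hat * x_w
    return a_hat, b_hat


# Weights 1 / x_i^2: the standard deviation is assumed proportional to x
w = [1 / xi ** 2 for xi in x]
a_w, b_w = weighted_fit(x, Y, w)

# Weighted residuals, to be plotted against the fitted values
weighted_residuals = [sqrt(wi) * (yi - a_w - b_w * xi)
                      for wi, xi, yi in zip(w, x, Y)]
print(f"weighted fit: a = {a_w:.3f}, b = {b_w:.3f}")
```

With all weights equal, `weighted_fit` reduces to the ordinary least squares solution, which gives a quick sanity check.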
Checking the variance model: normality
[Figure: two probability plots, residuals $\hat\varepsilon_i$ (centred on 0) against the expected value for normal distribution]
• No curvature: normality
• Curvature: non-normality. Is it so important?
What to do with non-normality?
Try to model the distribution of the residuals.
In general, it is difficult with few observations.
If enough observations are available, the non-normality does not affect the result too much.
An interesting index: R²
R² = squared correlation coefficient
= % of dispersion of the $Y_i$'s explained by the straight line (the model)
0 ≤ R² ≤ 1
If R² = 1, all the $\hat\varepsilon_i$ = 0: the straight line explains all the variation of the $Y_i$'s
If R² = 0, the slope is 0: the straight line does not explain any variation of the $Y_i$'s
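R² can be computed as one minus the ratio of the residual sum of squares to the total sum of squares of the $Y_i$'s. The sketch below, in plain Python, applies it to the HPLC data and to a deliberately curved, made-up data set (Y = x², x = 1 to 30). The function name `r_squared` and the curved data are illustrative assumptions.

```python
# Data of the HPLC example
x = [10, 10, 10, 20, 20, 20, 30, 30, 30, 50, 50, 50, 70, 70, 70, 90, 90, 90]
Y = [38.8, 60.0, 49.0, 85.5, 82.2, 72.9, 96.7, 102.9, 114.0,
     156.9, 176.7, 171.9, 212.0, 223.4, 228.0, 283.4, 274.4, 294.0]


def r_squared(x, Y):
    """R² = 1 - SS_residual / SS_total for the least squares straight line."""
    n = len(x)
    x_bar, Y_bar = sum(x) / n, sum(Y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - Y_bar) for xi, yi in zip(x, Y))
    b_hat = sxy / sxx
    a_hat = Y_bar - b_hat * x_bar
    ss_res = sum((yi - a_hat - b_hat * xi) ** 2 for xi, yi in zip(x, Y))
    ss_tot = sum((yi - Y_bar) ** 2 for yi in Y)
    return 1 - ss_res / ss_tot


r2_example = r_squared(x, Y)

# Made-up, clearly non-linear data: a parabola
x_curve = list(range(1, 31))
Y_curve = [xi ** 2 for xi in x_curve]
r2_curve = r_squared(x_curve, Y_curve)

print(f"HPLC example: R² = {r2_example:.3f}; parabola: R² = {r2_curve:.3f}")
```

The parabola still yields a high R² although a straight line is the wrong mean model, so a large R² should not be read as evidence of linearity.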
An interesting index: R²
R² and R (correlation coefficient) are not designed to measure linearity!
[Figure: scatterplot of Y (0 to 200) versus x (0 to 30)]
Example:
Multiple R: 0.990
Squared multiple R: 0.980
Adjusted squared multiple R: 0.980
Questions
• How to obtain the best straight line?
• Is this straight line the best curve to use?
• How to use this straight line?
How to use this straight line?
• Direct use: for a given x
  – predict the mean Y
  – construct a confidence interval of the mean Y
  – construct a prediction interval of Y
• Reverse use, calibration (approximate results): for a given Y
  – predict the mean x
  – construct a confidence interval of the mean x
  – construct a prediction interval of x
For a given x predict the mean Y
Predicted mean: $\hat a + \hat b x$
Example:
$$\hat a = 20.046 \qquad \hat b = 2.916 \qquad x = 30 \qquad \hat a + \hat b x = 107.53$$
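Confidence interval of the mean Y
$$\hat a + \hat b x - t^{1-\alpha/2}_{n-2} \sqrt{\hat\sigma^2 \left( \frac{1}{n} + \frac{(x - \bar x)^2}{\sum_i (x_i - \bar x)^2} \right)} \ \le\ a + b x \ \le\ \hat a + \hat b x + t^{1-\alpha/2}_{n-2} \sqrt{\hat\sigma^2 \left( \frac{1}{n} + \frac{(x - \bar x)^2}{\sum_i (x_i - \bar x)^2} \right)}$$
There is a probability 1 - α that a + bx belongs to this interval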
[Figure: confidence interval of the mean Y, Y (0 to 350) versus x (0 to 100); the limits L and U are marked at x = 30]
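Example
$$\hat a = 20.046 \qquad \hat b = 2.916 \qquad x = 30 \qquad \hat a + \hat b x = 107.53$$
$$n = 18 \qquad \alpha = 0.05 \qquad t^{1-\alpha/2}_{n-2} = 2.12 \qquad \hat\sigma = 8.282 \qquad \bar x = 45 \qquad \sum_i (x_i - \bar x)^2 = 14250$$
$$L = 102.8 \qquad U = 112.2$$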
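The confidence interval of the mean Y can be reproduced from the summary quantities of the example. This is a minimal sketch in plain Python. The function name `mean_confidence_interval` is hypothetical, and the Student quantile 2.12 is taken from the example rather than computed, since the standard library has no t distribution.

```python
from math import sqrt


def mean_confidence_interval(x0, a_hat, b_hat, sigma_hat, n, x_bar, sxx, t):
    """Confidence limits (L, U) of the mean Y at x0.

    sxx is the sum of squared deviations of the x_i around x_bar and
    t is the Student quantile t(1 - alpha/2; n - 2), supplied by the caller.
    """
    y_hat = a_hat + b_hat * x0
    half_width = t * sigma_hat * sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
    return y_hat - half_width, y_hat + half_width


# Values quoted in the example (x = 30, alpha = 0.05)
L, U = mean_confidence_interval(
    x0=30, a_hat=20.046, b_hat=2.916, sigma_hat=8.282,
    n=18, x_bar=45, sxx=14250, t=2.12,
)
print(f"L = {L:.1f}, U = {U:.1f}")
```

The interval is narrowest at x equal to the mean of the x_i and widens as x moves away from it, because of the (x - x_bar)² term.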
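Prediction interval of Y
$$\hat a + \hat b x - t^{1-\alpha/2}_{n-2} \sqrt{\hat\sigma^2 \left( 1 + \frac{1}{n} + \frac{(x - \bar x)^2}{\sum_i (x_i - \bar x)^2} \right)} \ \le\ Y \ \le\ \hat a + \hat b x + t^{1-\alpha/2}_{n-2} \sqrt{\hat\sigma^2 \left( 1 + \frac{1}{n} + \frac{(x - \bar x)^2}{\sum_i (x_i - \bar x)^2} \right)}$$
100(1 - α)% of the measurements carried out for this x belong to this interval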
[Figure: prediction interval of Y, Y (0 to 350) versus x (0 to 100); the limits L and U are marked at x = 30]
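Example
$$\hat a = 20.046 \qquad \hat b = 2.916 \qquad x = 30 \qquad \hat a + \hat b x = 107.53$$
$$n = 18 \qquad \alpha = 0.05 \qquad t^{1-\alpha/2}_{n-2} = 2.12 \qquad \hat\sigma = 8.282 \qquad \bar x = 45 \qquad \sum_i (x_i - \bar x)^2 = 14250$$
$$L = 89.4 \qquad U = 125.7$$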
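As a worked check of the prediction limits at x = 30, the arithmetic can be laid out step by step, using only the quantities quoted in the example (predicted mean 107.53, t = 2.12, standard error of estimate 8.282, n = 18, mean of the x's 45, sum of squared deviations 14250):

```latex
\begin{aligned}
\text{half-width} &= 2.12 \times 8.282 \times \sqrt{1 + \frac{1}{18} + \frac{(30 - 45)^2}{14250}} \\
                  &= 2.12 \times 8.282 \times \sqrt{1.0713} \approx 18.17 \\
L &= 107.53 - 18.17 \approx 89.4 \qquad U = 107.53 + 18.17 \approx 125.7
\end{aligned}
```

The extra 1 under the square root accounts for the variability of a single new measurement, which is why this interval is much wider than the confidence interval of the mean at the same x.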
Reverse use: for a given Y = y0 predict the mean X
$$\hat X = \frac{y_0 - \hat a}{\hat b}$$
Example:
$$\hat a = 20.046 \qquad \hat b = 2.916 \qquad y_0 = 107.53 \qquad \hat X = 30$$
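Reverse prediction simply reads the fitted line backwards. The following is a minimal sketch in plain Python, with a hypothetical function name `inverse_predict` and the estimates quoted in the example.

```python
def inverse_predict(y0, a_hat, b_hat):
    """Mean x predicted for an observed response y0, from the fitted line."""
    if b_hat == 0:
        # A flat line carries no information about x
        raise ValueError("slope is zero: x cannot be predicted from Y")
    return (y0 - a_hat) / b_hat


x_hat = inverse_predict(y0=107.53, a_hat=20.046, b_hat=2.916)
print(f"predicted x = {x_hat:.2f}")
```

The line used here is still the regression of Y on x; the roles of x and Y are not swapped when fitting.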
For a given Y = y0, a confidence interval of the mean X
[Figure: Y (0 to 350) versus x (0 to 100); the level Y0 is read back on the X axis, giving the limits L and U]
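Confidence interval of the mean X
L and U are such that
$$y_0 = \hat a + \hat b L + t^{1-\alpha/2}_{n-2} \sqrt{\hat\sigma^2 \left( \frac{1}{n} + \frac{(L - \bar x)^2}{\sum_i (x_i - \bar x)^2} \right)}$$
$$y_0 = \hat a + \hat b U - t^{1-\alpha/2}_{n-2} \sqrt{\hat\sigma^2 \left( \frac{1}{n} + \frac{(U - \bar x)^2}{\sum_i (x_i - \bar x)^2} \right)}$$
There is a probability 1 - α that the mean X belongs to [L, U]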
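Example
$$\hat a = 20.046 \qquad \hat b = 2.916 \qquad y_0 = 107.53$$
$$n = 18 \qquad \alpha = 0.05 \qquad t^{1-\alpha/2}_{n-2} = 2.12 \qquad \hat\sigma = 8.282 \qquad \bar x = 45 \qquad \sum_i (x_i - \bar x)^2 = 14250$$
$$L = 28.59 \qquad U = 31.33$$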
What you should no longer believe
• One can fit the straight line by inverting x and Y
• If the correlation coefficient is high, the straight line is the best model
• Normality of the $\varepsilon_i$'s is essential to perform a good regression
• Normality of the $x_i$'s is required to perform a regression
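The first misconception can be illustrated numerically: regressing x on Y and then inverting that line does not give back the regression of Y on x. This is a minimal sketch in plain Python on the HPLC data; the function name `slope` is illustrative.

```python
# Data of the HPLC example
x = [10, 10, 10, 20, 20, 20, 30, 30, 30, 50, 50, 50, 70, 70, 70, 90, 90, 90]
Y = [38.8, 60.0, 49.0, 85.5, 82.2, 72.9, 96.7, 102.9, 114.0,
     156.9, 176.7, 171.9, 212.0, 223.4, 228.0, 283.4, 274.4, 294.0]


def slope(u, v):
    """Least squares slope of v regressed on u."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    suv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    suu = sum((ui - u_bar) ** 2 for ui in u)
    return suv / suu


b_Y_on_x = slope(x, Y)   # the regression used throughout: Y explained by x
b_x_on_Y = slope(Y, x)   # x and Y inverted

# Slope of Y against x implied by the inverted fit
implied_slope = 1 / b_x_on_Y

print(f"Y on x: {b_Y_on_x:.4f}; implied by x on Y: {implied_slope:.4f}")
print(f"product of the two slopes: {b_Y_on_x * b_x_on_Y:.4f}")
```

The product of the two slopes equals R², so the two lines coincide only when R² = 1. The gap is small here because R² is close to 1, but it grows as the scatter increases.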