Simple Linear Regression
Model:
$$Y = \alpha + \beta X + e$$
where
$Y$: response or dependent variable.
$X$: predictor, regressor, explanatory or independent variable.
$\alpha$: intercept and $\beta$: slope (parameters or regression coefficients).
$e$: error or noise.
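Data available: $(X_i, Y_i)$, $i = 1, \dots, n$.
Goal: To predict the response Y (i.e. to obtain the fitted response function f(X)).
Least Squares Fitting Method
How to determine this regression function? (We need to estimate the parameters.)
Criterion: Find $\alpha, \beta$ to minimize the residual sum of squares
$$\min_{\alpha,\beta} Q(\alpha,\beta) = \sum_{i=1}^{n} (Y_i - \alpha - \beta X_i)^2$$
by solving the normal equations
$$\frac{\partial Q}{\partial \alpha} = 0, \qquad \frac{\partial Q}{\partial \beta} = 0.$$
Least Squares Regression Function:
$$\hat{Y} = \hat\alpha + \hat\beta X, \qquad \hat\beta = \frac{\sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^{n} (X_i - \bar X)^2}, \qquad \hat\alpha = \bar Y - \hat\beta \bar X.$$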
Least Squares Estimates
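Set
$$\frac{\partial Q}{\partial \alpha} = 0, \qquad \frac{\partial Q}{\partial \beta} = 0, \qquad \text{where } Q(\alpha,\beta) = \sum_{i=1}^{n} (Y_i - \alpha - \beta X_i)^2.$$
Solve 1:
$$\frac{\partial Q}{\partial \alpha} = -2 \sum_{i=1}^{n} (Y_i - \alpha - \beta X_i) = 0$$
$$\Rightarrow \sum_{i=1}^{n} Y_i - n\alpha - \beta \sum_{i=1}^{n} X_i = 0$$
$$\Rightarrow \alpha = \frac{1}{n}\sum_{i=1}^{n} Y_i - \beta\,\frac{1}{n}\sum_{i=1}^{n} X_i = \bar Y - \beta \bar X.$$
Solve 2:
$$\frac{\partial Q}{\partial \beta} = -2 \sum_{i=1}^{n} (Y_i - \alpha - \beta X_i) X_i = 0$$
$$\Rightarrow \sum_{i=1}^{n} X_i Y_i - \alpha \sum_{i=1}^{n} X_i - \beta \sum_{i=1}^{n} X_i^2 = 0.$$
Substituting $\alpha = \bar Y - \beta \bar X$ and $\sum_{i=1}^{n} X_i = n\bar X$:
$$\sum_{i=1}^{n} X_i Y_i - n \bar X \bar Y + n \beta \bar X^2 - \beta \sum_{i=1}^{n} X_i^2 = 0$$
$$\Rightarrow \hat\beta = \frac{\sum_{i=1}^{n} X_i Y_i - n \bar X \bar Y}{\sum_{i=1}^{n} X_i^2 - n \bar X^2} = \frac{\sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^{n} (X_i - \bar X)^2}.$$
Therefore
$$\hat\beta = \frac{\sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^{n} (X_i - \bar X)^2}, \qquad \hat\alpha = \bar Y - \hat\beta \bar X.$$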
How do we know the two estimators can minimize Q?
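One way to answer this question is a second-derivative check. The sketch below is not taken from the slides; it assumes the usual notation $S_{xx} = \sum_{i=1}^{n} (X_i - \bar X)^2$ and writes $H$ (a symbol introduced here) for the matrix of second derivatives of $Q$.

```latex
\frac{\partial^2 Q}{\partial \alpha^2} = 2n, \qquad
\frac{\partial^2 Q}{\partial \alpha\,\partial \beta} = 2\sum_{i=1}^{n} X_i, \qquad
\frac{\partial^2 Q}{\partial \beta^2} = 2\sum_{i=1}^{n} X_i^2,
\\[6pt]
H = 2\begin{pmatrix} n & \sum_{i} X_i \\ \sum_{i} X_i & \sum_{i} X_i^2 \end{pmatrix},
\qquad
\det H = 4\Bigl(n\sum_{i} X_i^2 - \bigl(\sum_{i} X_i\bigr)^2\Bigr) = 4\,n\,S_{xx}.
```

Since $2n > 0$ and $\det H > 0$ whenever the $X_i$ are not all equal, $H$ is positive definite. $Q$ is then strictly convex, so the solution of the normal equations is its unique minimizer.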
Terminology
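Fitted model
$$Y = \hat\alpha + \hat\beta X + \hat e$$
True model
$$Y = \alpha + \beta X + e$$
Fitted regression function
$$\hat Y = \hat\alpha + \hat\beta X$$
Model: $Y = f(x) + e$ with $f(x) = \alpha + \beta x$. Let
$$\hat f(x) = \hat\alpha + \hat\beta x, \qquad \hat\beta = \frac{S_{xy}}{S_{xx}}, \qquad \hat\alpha = \bar Y - \hat\beta \bar X,$$
where $S_{xx} = \sum_{i=1}^{n} (X_i - \bar X)^2$ and $S_{xy} = \sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y)$.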
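It can be shown that
$$\hat\alpha \sim N\!\left(\alpha,\ \sigma^2\left(\frac{1}{n} + \frac{\bar X^2}{S_{xx}}\right)\right), \qquad \hat\beta \sim N\!\left(\beta,\ \frac{\sigma^2}{S_{xx}}\right),$$
$$\hat f(x) \sim N\!\left(f(x),\ \sigma^2\left(\frac{1}{n} + \frac{(x - \bar X)^2}{S_{xx}}\right)\right),$$
where $\sigma^2$ is the variance of the error $e$.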
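Claim:
$$\hat\beta \sim N\!\left(\beta,\ \frac{\sigma^2}{S_{xx}}\right)$$
Solve: Since the model is $Y_i = \alpha + \beta X_i + e_i$, where $e_i \overset{iid}{\sim} N(0, \sigma^2)$, we have $Y_i \sim N(\alpha + \beta X_i,\ \sigma^2)$ independently. Because $\sum_{i=1}^{n} (X_i - \bar X) = 0$, $S_{xy} = \sum_{i=1}^{n} (X_i - \bar X) Y_i$, so $\hat\beta$ is a linear combination of independent normal variables and is itself normal. It remains to find its mean and variance.
$$E(\hat\beta) = E\!\left(\frac{S_{xy}}{S_{xx}}\right) = \frac{1}{S_{xx}} E(S_{xy}) = \frac{1}{S_{xx}} \sum_{i=1}^{n} (X_i - \bar X)\,E(Y_i)$$
$$= \frac{1}{S_{xx}} \sum_{i=1}^{n} (X_i - \bar X)(\alpha + \beta X_i) = \frac{\beta}{S_{xx}} \sum_{i=1}^{n} (X_i - \bar X) X_i = \frac{\beta}{S_{xx}} \sum_{i=1}^{n} (X_i - \bar X)(X_i - \bar X) = \beta.$$
$$\mathrm{Var}(\hat\beta) = \mathrm{Var}\!\left(\frac{S_{xy}}{S_{xx}}\right) = \frac{1}{S_{xx}^2} \mathrm{Var}(S_{xy}) = \frac{1}{S_{xx}^2} \mathrm{Var}\!\left(\sum_{i=1}^{n} (X_i - \bar X) Y_i\right)$$
$$= \frac{1}{S_{xx}^2} \left[\sum_{i=1}^{n} (X_i - \bar X)^2\, \mathrm{Var}(Y_i) + \sum_{i \neq j} (X_i - \bar X)(X_j - \bar X)\, \mathrm{Cov}(Y_i, Y_j)\right]$$
$$= \frac{1}{S_{xx}^2} \left[\sigma^2 \sum_{i=1}^{n} (X_i - \bar X)^2 + 0\right] = \frac{\sigma^2}{S_{xx}}.$$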
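Claim:
$$\hat f(x) \sim N\!\left(f(x),\ \sigma^2\left(\frac{1}{n} + \frac{(x - \bar X)^2}{S_{xx}}\right)\right)$$
Solve:
$$\hat f(x) = \hat\alpha + \hat\beta x = \bar Y - \hat\beta \bar X + \hat\beta x = \bar Y + \hat\beta (x - \bar X)$$
$$E\,\hat f(x) = E(\bar Y) + (x - \bar X)\,E(\hat\beta) = (\alpha + \beta \bar X) + \beta (x - \bar X) = \alpha + \beta x = f(x).$$
$$\mathrm{Cov}(\bar Y, \hat\beta) = \mathrm{Cov}\!\left(\bar Y,\ \frac{1}{S_{xx}} \sum_{i=1}^{n} (X_i - \bar X) Y_i\right) = \frac{1}{S_{xx}} \sum_{i=1}^{n} (X_i - \bar X)\, \mathrm{Cov}(\bar Y, Y_i)$$
$$= \frac{1}{S_{xx}} \sum_{i=1}^{n} (X_i - \bar X)\, \frac{1}{n} \sum_{j=1}^{n} \mathrm{Cov}(Y_j, Y_i) = \frac{1}{S_{xx}} \sum_{i=1}^{n} (X_i - \bar X)\, \frac{1}{n}\, \mathrm{Cov}(Y_i, Y_i)$$
$$= \frac{\sigma^2}{n\,S_{xx}} \sum_{i=1}^{n} (X_i - \bar X) = 0.$$
Since $\hat f(x) = \bar Y + \hat\beta (x - \bar X)$ and $\mathrm{Cov}(\bar Y, \hat\beta) = 0$,
$$\mathrm{Var}\,\hat f(x) = \mathrm{Var}\!\left(\bar Y + \hat\beta (x - \bar X)\right) = \mathrm{Var}(\bar Y) + (x - \bar X)^2\, \mathrm{Var}(\hat\beta) = \frac{\sigma^2}{n} + \frac{(x - \bar X)^2 \sigma^2}{S_{xx}}.$$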
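The sampling distributions of the least squares estimators can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the original notes: the true parameters `ALPHA`, `BETA`, `SIGMA`, the design points `x`, and the number of replicates are all hypothetical choices made for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so that repeated runs give the same draws

# Hypothetical true model Y = ALPHA + BETA * x + e with e ~ N(0, SIGMA^2).
ALPHA, BETA, SIGMA = 1.0, 2.0, 1.0
x = [float(i) for i in range(1, 11)]  # fixed design points 1, ..., 10
n = len(x)
x_bar = sum(x) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)


def simulate_estimates():
    """Draw one sample from the true model and return (alpha_hat, beta_hat)."""
    y = [ALPHA + BETA * xi + random.gauss(0.0, SIGMA) for xi in x]
    y_bar = sum(y) / n
    # S_xy = sum (x_i - x_bar) * y_i, because sum (x_i - x_bar) = 0.
    s_xy = sum((xi - x_bar) * yi for xi, yi in zip(x, y))
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


draws = [simulate_estimates() for _ in range(5000)]
alphas = [a for a, _ in draws]
betas = [b for _, b in draws]

# Theoretical variances from the distribution results.
var_beta_theory = SIGMA ** 2 / s_xx
var_alpha_theory = SIGMA ** 2 * (1 / n + x_bar ** 2 / s_xx)

print("mean of beta_hat :", statistics.mean(betas), "theory:", BETA)
print("var of beta_hat  :", statistics.variance(betas), "theory:", var_beta_theory)
print("mean of alpha_hat:", statistics.mean(alphas), "theory:", ALPHA)
print("var of alpha_hat :", statistics.variance(alphas), "theory:", var_alpha_theory)
```

The simulated means and variances should be close to the theoretical values, but they will not match exactly because they are estimated from a finite number of replicates.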
REGRESSION ON MIDTERM GRADE

Obs    MIDTERM    FINAL
  1       68        75
  2       49        63
  3       60        57
  4       68        88
  5       97        88
  6       82        79
  7       59        82
  8       50        73
  9       73        90
 10       39        62
 11       71        70
 12       95        96
 13       61        76
 14       72        75
 15       87        85
 16       40        40
 17       66        74
 18       58        70
 19       58        75
 20       77        72
Figure 1.4 SAS PROC PRINT output for the grade data problem.
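TITLE 'REGRESSION ON MIDTERM GRADE';
* Read the midterm and final grades;
DATA;
INPUT MIDTERM FINAL;
CARDS;
68 75
49 63
60 57
. .
77 72
;
* List the data;
PROC PRINT;
* Fit FINAL on MIDTERM and save the predicted values and residuals;
PROC REG;
MODEL FINAL=MIDTERM / P;
OUTPUT PREDICTED=PRED
RESIDUAL=RESID;
* Overlay observed and predicted values, PRED is created by PROC REG;
PROC PLOT;
PLOT FINAL*MIDTERM='O' PRED*MIDTERM='P' / OVERLAY;
LABEL FINAL='FINAL';
* Normal scores of the residuals for the normal probability plot;
PROC RANK NORMAL=VW;
VAR RESID;
RANKS NSCORE;
PROC PLOT;
PLOT RESID*NSCORE='R';
LABEL NSCORE='NORMAL SCORE';
RUN;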
REGRESSION ON MIDTERM GRADE
Model: MODEL1
Dependent Variable: FINAL

Analysis of Variance
                            Sum of          Mean
Source             DF      Squares        Square    F Value    Pr > F
Model               1   1774.44117    1774.44117      24.26    0.0001
Error              18   1316.55883      73.14216
Corrected Total    19   3091.00000

Root MSE           8.55232    R-Square    0.5741
Dependent Mean    74.50000    Adj R-Sq    0.5504
Coeff Var         11.47962

Parameter Estimates
                    Parameter     Standard
Variable     DF      Estimate        Error    t Value    Pr > |t|
Intercept     1      34.56757      8.32984       4.15      0.0006
MIDTERM       1       0.60049      0.12192       4.93      0.0001

        Dep Var    Predicted
Obs       FINAL        Value     Residual
  1     75.0000      75.4007      -0.4007
  2     63.0000      63.9915      -0.9915
  3     57.0000      70.5968     -13.5968
  4     88.0000      75.4007      12.5993
  5     88.0000      92.8149      -4.8149
  6     79.0000      83.8076      -4.8076
  7     82.0000      69.9963      12.0037
  8     73.0000      64.5920       8.4080
  9     90.0000      78.4032      11.5968
 10     62.0000      57.9866       4.0134
 11     70.0000      77.2022      -7.2022
 12     96.0000      91.6139       4.3861
 13     76.0000      71.1973       4.8027
 14     75.0000      77.8027      -2.8027
 15     85.0000      86.8100      -1.8100
 16     40.0000      58.5871     -18.5871
 17     74.0000      74.1998      -0.1998
 18     70.0000      69.3959       0.6041
 19     75.0000      69.3959       5.6041
 20     72.0000      80.8051      -8.8051

Sum of Residuals                          0
Sum of Squared Residuals         1316.55883
Predicted Residual SS (PRESS)    1668.47241
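The parameter estimates in the SAS output can be reproduced directly from the least squares formulas. The sketch below is an illustration in Python rather than SAS; the function name `least_squares` is introduced here, and the data are the 20 (MIDTERM, FINAL) pairs from the PROC PRINT listing.

```python
# Grade data as printed in the PROC PRINT listing.
midterm = [68, 49, 60, 68, 97, 82, 59, 50, 73, 39,
           71, 95, 61, 72, 87, 40, 66, 58, 58, 77]
final = [75, 63, 57, 88, 88, 79, 82, 73, 90, 62,
         70, 96, 76, 75, 85, 40, 74, 70, 75, 72]


def least_squares(x, y):
    """Return (alpha_hat, beta_hat) for the model Y = alpha + beta * X + e."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


alpha_hat, beta_hat = least_squares(midterm, final)

# Predicted values and residuals, as in the "Predicted Value" and "Residual" columns.
fitted = [alpha_hat + beta_hat * xi for xi in midterm]
residuals = [yi - fi for yi, fi in zip(final, fitted)]
sse = sum(r ** 2 for r in residuals)

print(f"intercept = {alpha_hat:.5f}, slope = {beta_hat:.5f}")
print(f"sum of squared residuals = {sse:.5f}")
```

Up to rounding, the intercept, slope and sum of squared residuals agree with the values 34.56757, 0.60049 and 1316.55883 reported by PROC REG.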
[Line-printer plot of FINAL (vertical axis, 40 to 100) against MIDTERM (horizontal axis, 30 to 100): observed values are plotted as 'o' and predicted values as 'p'. NOTE: 6 obs hidden.]
Figure 1.6 Output for the first PROC PLOT step for the grade data problem.
[Line-printer plot of the residuals 'R' (vertical axis, -20 to 20) against Predicted Value of FINAL (horizontal axis, 55 to 95), with a reference line at 0.]
Figure 1.7 The remainder of the output from the first PROC PLOT step.
[Line-printer plot of the residuals 'R' (vertical axis, -20 to 20) against NORMAL SCORE (horizontal axis, -2.0 to 2.0).]
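* Confidence Interval
Theorem: If $\hat\theta$ is a normal variable and $\hat S^2(\hat\theta)$ is the estimate of $\mathrm{Var}(\hat\theta)$ with d.f. degrees of freedom, then
$$T = \frac{\hat\theta - E(\hat\theta)}{\hat S(\hat\theta)} \sim t(\mathrm{d.f.}).$$
Example: test $H_0: \theta = 0$ against $H_a: \theta \neq 0$.
Construct the 95% C.I.:
$$\hat\theta \pm t_{0.025}(\mathrm{d.f.})\, \hat S(\hat\theta).$$
For the intercept,
$$t = \frac{\hat\alpha - \alpha}{\sqrt{\hat\sigma^2 \left(\dfrac{1}{n} + \dfrac{\bar X^2}{S_{XX}}\right)}} \sim t(n - 2),$$
where $\hat\sigma^2$ is the estimate of $\sigma^2$ with $n - 2$ degrees of freedom.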
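As a worked illustration of the interval $\hat\theta \pm t_{0.025}(\mathrm{d.f.})\,\hat S(\hat\theta)$, the sketch below computes standard errors, the t value and a 95% interval for the slope in the grade data. It is written in Python rather than SAS. The critical value `T_CRIT = 2.101` is an assumption taken from a standard t table for 18 degrees of freedom, and the variable names are introduced here.

```python
import math

# Grade data as printed in the PROC PRINT listing.
midterm = [68, 49, 60, 68, 97, 82, 59, 50, 73, 39,
           71, 95, 61, 72, 87, 40, 66, 58, 58, 77]
final = [75, 63, 57, 88, 88, 79, 82, 73, 90, 62,
         70, 96, 76, 75, 85, 40, 74, 70, 75, 72]

n = len(midterm)
x_bar = sum(midterm) / n
y_bar = sum(final) / n
s_xx = sum((x - x_bar) ** 2 for x in midterm)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(midterm, final))

beta_hat = s_xy / s_xx
alpha_hat = y_bar - beta_hat * x_bar

# Estimate of sigma from the residual sum of squares with n - 2 degrees of freedom.
sse = sum((y - alpha_hat - beta_hat * x) ** 2 for x, y in zip(midterm, final))
sigma_hat = math.sqrt(sse / (n - 2))

# Estimated standard errors, from the variance formulas for alpha_hat and beta_hat.
se_beta = sigma_hat / math.sqrt(s_xx)
se_alpha = sigma_hat * math.sqrt(1 / n + x_bar ** 2 / s_xx)

# Assumption: tabulated upper 2.5% point of the t distribution with 18 d.f.
T_CRIT = 2.101

t_beta = beta_hat / se_beta  # t statistic for H0: beta = 0
ci_beta = (beta_hat - T_CRIT * se_beta, beta_hat + T_CRIT * se_beta)

print(f"root MSE = {sigma_hat:.5f}")
print(f"se(alpha_hat) = {se_alpha:.5f}, se(beta_hat) = {se_beta:.5f}")
print(f"t value for the slope = {t_beta:.2f}")
print(f"95% C.I. for the slope = ({ci_beta[0]:.3f}, {ci_beta[1]:.3f})")
```

The standard errors and t value can be compared with the "Standard Error" and "t Value" columns of the PROC REG output. The interval for the slope excludes 0, which is consistent with the small p-value reported there.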
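* Pearson’s Correlation Coefficient
* Goal: To measure the degree of linear correlation between two variables.
Definition:
$$\mathrm{Cov}(X, Y) = E\left[(X - EX)(Y - EY)\right], \qquad \rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{E(X - EX)^2\; E(Y - EY)^2}}.$$
The sample correlation coefficient is
$$r_{xy} = \frac{\dfrac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\dfrac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar X)^2 \cdot \dfrac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar Y)^2}} = \hat\beta\, \frac{S_x}{S_y},$$
where $S_x$ and $S_y$ are the sample standard deviations of $X$ and $Y$.
The range lies between -1 and 1.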
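* Coefficient of Determination: the fraction of the variance in y that is explained by regression on x.
Definition:
$$R^2 = \frac{S_{\hat y}^2}{S_y^2} \qquad (0 \le R^2 \le 1),$$
$$\text{where } S_y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar Y)^2, \qquad S_{\hat y}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (\hat Y_i - \bar Y)^2.$$
Goal: $R^2$ may be used as an index of linearity for the relation of y to x.
$$R^2 = r_{xy}^2, \qquad S_y^2 = S_{\hat y}^2 + S_{\hat e}^2, \qquad \text{where } S_{\hat e}^2 = \frac{1}{n-1} \sum_{i=1}^{n} \hat e_i^2.$$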
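The relations between the correlation coefficient, the slope estimate and the coefficient of determination can be verified numerically on the grade data. This is a minimal Python sketch added for illustration; the variable names are introduced here, and the common factor 1/(n-1) is omitted because it cancels in every ratio.

```python
import math

# Grade data as printed in the PROC PRINT listing.
midterm = [68, 49, 60, 68, 97, 82, 59, 50, 73, 39,
           71, 95, 61, 72, 87, 40, 66, 58, 58, 77]
final = [75, 63, 57, 88, 88, 79, 82, 73, 90, 62,
         70, 96, 76, 75, 85, 40, 74, 70, 75, 72]

n = len(midterm)
x_bar = sum(midterm) / n
y_bar = sum(final) / n
s_xx = sum((x - x_bar) ** 2 for x in midterm)
s_yy = sum((y - y_bar) ** 2 for y in final)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(midterm, final))

# Pearson correlation coefficient.
r_xy = s_xy / math.sqrt(s_xx * s_yy)

# Least squares fit and the decomposition of the total sum of squares.
beta_hat = s_xy / s_xx
alpha_hat = y_bar - beta_hat * x_bar
fitted = [alpha_hat + beta_hat * x for x in midterm]
ss_model = sum((f - y_bar) ** 2 for f in fitted)
ss_error = sum((y - f) ** 2 for y, f in zip(final, fitted))

# Coefficient of determination: explained variation over total variation.
r_squared = ss_model / s_yy

print(f"r_xy = {r_xy:.4f}, r_xy^2 = {r_xy ** 2:.4f}, R^2 = {r_squared:.4f}")
print(f"total SS = {s_yy:.5f}, model SS + error SS = {ss_model + ss_error:.5f}")
```

The value of `r_squared` can be compared with the R-Square entry of the PROC REG output, and `ss_model`, `ss_error` and `s_yy` with the Model, Error and Corrected Total sums of squares.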
[Line-printer plot of PRESSURE (vertical axis, 20 to 120) against VOLUME (horizontal axis, 10 to 50), points plotted as 'o'.]
Figure 3.3: A plot of the air pressure data (an example of residual analysis).
[Line-printer plot of the residuals '*' (vertical axis, -10 to 30) against Predicted Value of P (horizontal axis, 16.357 to 94.210), with a reference line at 0.]
Figure 3.4 The residual on fit plot after fitting the model P= a + b V + e to the air pressure data.
[Line-printer plot of the residuals '*' (vertical axis, -0.75 to 0.50) against Predicted Value of P (horizontal axis, 20 to 120), with a reference line at 0.]
Figure 3.5 The residual on the fit plot using the model P = a + b/V +e for the air pressure data.
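Ordinary Regression
Model:
$$Y = \alpha + \beta X + e$$
Claim: minimize
$$Q(\alpha, \beta) = \sum_{i=1}^{n} (Y_i - \alpha - \beta X_i)^2.$$
Weighted Regression
Problem: unequal variance.
Model:
$$Y = \alpha + \beta X + e, \qquad \mathrm{Var}(Y_i) = \sigma_i^2$$
Claim: minimize
$$Q_W(\alpha, \beta) = \sum_{i=1}^{n} W_i (Y_i - \alpha - \beta X_i)^2, \qquad \text{given } W_i \text{ (i.e. known)}.$$
Solve: minimize $Q_W(\alpha, \beta) = \sum_{i=1}^{n} W_i (Y_i - \alpha - \beta X_i)^2$ over $\alpha$ and $\beta$.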
How to determine the weights?
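Let $\tilde Y_i = \sqrt{W_i}\, Y_i$, chosen so that $\mathrm{Var}(\tilde Y_i) = \sigma^2$ is constant. Since
$$\mathrm{Var}(\tilde Y_i) = \mathrm{Var}(\sqrt{W_i}\, Y_i) = W_i\, \mathrm{Var}(Y_i) = \sigma^2,$$
Then
$$\hat W_i = \frac{\hat\sigma^2}{\widehat{\mathrm{Var}}(Y_i)}.$$
So the optimal weights are inversely proportional to the variances of the y.
The weighted criterion is the ordinary least squares criterion for the transformed data:
$$\sum_{i=1}^{n} W_i (Y_i - \alpha - \beta X_i)^2 = \sum_{i=1}^{n} \left(\sqrt{W_i}\, Y_i - \sqrt{W_i}\, \alpha - \sqrt{W_i}\, \beta X_i\right)^2.$$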
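The weighted criterion can be minimized in closed form, in the same way as the ordinary one, by replacing means with weighted means. The following Python sketch is an illustration only: the function names and the small data set are hypothetical, and the weights $W_i = 1/X_i^2$ stand in for the assumption that $\mathrm{Var}(Y_i)$ is proportional to $X_i^2$.

```python
import math


def weighted_least_squares(x, y, w):
    """Minimize sum_i w_i * (y_i - alpha - beta * x_i)**2 and return (alpha, beta)."""
    sw = sum(w)
    x_w = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
    y_w = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
    s_xx_w = sum(wi * (xi - x_w) ** 2 for wi, xi in zip(w, x))
    s_xy_w = sum(wi * (xi - x_w) * (yi - y_w) for wi, xi, yi in zip(w, x, y))
    beta = s_xy_w / s_xx_w
    alpha = y_w - beta * x_w
    return alpha, beta


def weighted_q(x, y, w, alpha, beta):
    """Weighted residual sum of squares Q_W(alpha, beta)."""
    return sum(wi * (yi - alpha - beta * xi) ** 2 for wi, xi, yi in zip(w, x, y))


def transformed_q(x, y, w, alpha, beta):
    """Ordinary residual sum of squares after scaling every term by sqrt(w_i)."""
    total = 0.0
    for wi, xi, yi in zip(w, x, y):
        root = math.sqrt(wi)
        total += (root * yi - root * alpha - root * beta * xi) ** 2
    return total


# Hypothetical data whose spread grows with x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.3, 7.6, 10.9, 11.2]
w = [1.0 / xi ** 2 for xi in x]  # assumed weights, inversely proportional to Var(Y_i)

alpha_w, beta_w = weighted_least_squares(x, y, w)
print("weighted fit:", alpha_w, beta_w)
print("Q_W at the fit:", weighted_q(x, y, w, alpha_w, beta_w))
print("same value from transformed data:", transformed_q(x, y, w, alpha_w, beta_w))
```

With all weights equal, `weighted_least_squares` reduces to the ordinary least squares estimates, which gives a quick sanity check on the formulas.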
DATA; INPUT V P; VI=1/V; CARDS;
48 29.1
...
12 117.6
;
* Ordinary least squares fit, used to build the weights W;
PROC REG; MODEL P=VI; OUTPUT P=LSFIT;
DATA; SET; W=1/LSFIT;
* Weighted fit and weighted residuals;
PROC REG; MODEL P=VI; WEIGHT W; OUTPUT P=FIT R=RES;
DATA; SET; WRES=SQRT(W)*RES;
PROC RANK NORMAL=VW; VAR WRES; RANKS NSCORE;
PROC PLOT; PLOT WRES*FIT='*' / VREF=0 VPOS=30; PLOT WRES*NSCORE='*' / VPOS=30;
LABEL WRES='WEIGHTED RESIDUAL' NSCORE='NORMAL SCORE';
RUN;
[Line-printer plot of the WEIGHTED RESIDUAL '*' (vertical axis, -0.075 to 0.050) against Predicted Value of P (horizontal axis, 20 to 120), with a reference line at 0.]
Figure 3.13 Weighted residual plot for a weighted fit of the model P = a + b/V + e to the air pressure data.
[Line-printer plot of the residuals '*' (vertical axis, -0.0003 to 0.0002) against Predicted Value of PT (horizontal axis, -0.034 to -0.009), with a reference line at 0.]
Figure 3.17 Residual on fit plot for the model -1/P = α + βV + e in the air pressure data.
[Line-printer plot of the residuals '*' (vertical axis, -0.0003 to 0.0002) against NORMAL SCORE (horizontal axis, -2 to 2).]
Figure 3.18 Residual normal probability plot for the model -1/P = α + βV + e in the air pressure data.
[Line-printer plot of the residuals '*' (vertical axis, -0.00015 to 0.0001) against Predicted Value of PT (horizontal axis, -0.033 to -0.007), with a reference line at 0.]
Figure 3.19 Residual on fit plot for the model -1/P = α + βV + e in Example 3.4 after deleting the first data point.
[Line-printer plot of the residuals '*' (vertical axis, -0.00015 to 0.0001) against NORMAL SCORE (horizontal axis, -2 to 2).]
Figure 3.20 Residual normal probability plot for the model -1/P = α + βV + e in Example 3.4 after deleting the first data point.
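How to determine the transformation $T$ such that $\mathrm{Var}(T(Y_i)) = \sigma^2$ (constant)?
(assuming T is monotonic increasing)
Solve: By Taylor's expansion about $f = EY$,
$$T(Y) = T(f) + \frac{T'(f)}{1!}(Y - f) + \text{remainder}.$$
Taking the variance on both sides, we have
$$\mathrm{Var}(T(Y)) \approx \mathrm{Var}\!\left(T(f) + T'(f)(Y - f)\right) = [T'(f)]^2\, \mathrm{Var}(Y).$$
Let
$$[T'(f)]^2\, \mathrm{Var}(Y) = \sigma^2.$$
Since $T$ is increasing, $T'(f) = \sigma / \sqrt{\mathrm{Var}(Y)}$. Then
$$T(f) = \int \frac{\sigma}{\sqrt{\mathrm{Var}(Y)}}\, df.$$
e.g.: $Y \sim P(f)$ (Poisson with mean $f$). Find $T(Y)$.
Here $\mathrm{Var}(Y) = f$, so
$$T(f) = \int \frac{\sigma}{\sqrt{f}}\, df = \sigma \int f^{-1/2}\, df = 2\sigma f^{1/2} + c.$$
We take the power transform $T(Y) = \sqrt{Y}$.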
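The effect of the square-root transform on Poisson data can be checked by computing the variance before and after the transform. The sketch below is an added Python illustration; the helper names and the chosen means are hypothetical, and the variances are obtained by summing the Poisson probabilities over a wide range of counts rather than by simulation.

```python
import math


def poisson_pmf(k, mean):
    """P(Y = k) for Y ~ Poisson(mean), computed on the log scale for stability."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))


def variance_of_transform(mean, transform):
    """Var(T(Y)) for Y ~ Poisson(mean), summing over counts far into the upper tail."""
    upper = int(mean + 20 * math.sqrt(mean) + 50)
    first = sum(transform(k) * poisson_pmf(k, mean) for k in range(upper + 1))
    second = sum(transform(k) ** 2 * poisson_pmf(k, mean) for k in range(upper + 1))
    return second - first ** 2


for mean in (20, 50, 100):
    raw = variance_of_transform(mean, lambda k: k)  # variance of Y itself
    root = variance_of_transform(mean, math.sqrt)   # variance of sqrt(Y)
    print(f"mean = {mean:3d}: Var(Y) = {raw:8.3f}, Var(sqrt(Y)) = {root:.4f}")
```

The variance of $Y$ grows with the mean, while the variance of $\sqrt{Y}$ stays close to 1/4, the value $[T'(f)]^2\,\mathrm{Var}(Y) = (1/(2\sqrt{f}))^2 f$ given by the Taylor argument.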