
Simple Linear Regression

Model: $Y = \alpha + \beta X + e$

where

$Y$: response or dependent variable.

$X$: predictor, regressor, explanatory, or independent variable.

$\alpha$: intercept and $\beta$: slope (parameters, or regression coefficients).

$e$: error or noise.

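Data available: (X, Y)

Goal: To predict the response Y (i.e. to obtain the fitted response function f(X)).

How to determine this regression function? (We need to estimate the parameters.)

Least Squares Fitting Method

Criterion: Find $\alpha, \beta$ to minimize the residual sum of squares:

$\min_{\alpha,\beta} Q(\alpha,\beta) = \sum_{i=1}^{n} (Y_i - \alpha - \beta X_i)^2$

By solving the normal equations

$\frac{\partial Q}{\partial \alpha} = 0, \quad \frac{\partial Q}{\partial \beta} = 0$

Least Squares Regression Function:

$\hat{Y} = \hat{\alpha} + \hat{\beta} X$

where

$\hat{\beta} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \quad \hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}$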

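Least Squares Estimates

Set $\frac{\partial Q}{\partial \alpha} = 0, \quad \frac{\partial Q}{\partial \beta} = 0$, where $Q(\alpha,\beta) = \sum_{i=1}^{n} (Y_i - \alpha - \beta X_i)^2$.

Solve (1): $\frac{\partial Q}{\partial \alpha} = -2 \sum_{i=1}^{n} (Y_i - \alpha - \beta X_i) = 0$

$\Rightarrow \sum_{i=1}^{n} Y_i = n\alpha + \beta \sum_{i=1}^{n} X_i$

$\Rightarrow \alpha = \frac{1}{n}\sum_{i=1}^{n} Y_i - \beta \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{Y} - \beta \bar{X}$

$\Rightarrow \hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}$

Solve (2): $\frac{\partial Q}{\partial \beta} = -2 \sum_{i=1}^{n} (Y_i - \alpha - \beta X_i) X_i = 0$

$\Rightarrow \sum_{i=1}^{n} X_i Y_i = \alpha \sum_{i=1}^{n} X_i + \beta \sum_{i=1}^{n} X_i^2$

$\Rightarrow \sum_{i=1}^{n} X_i Y_i = (\bar{Y} - \beta\bar{X}) \sum_{i=1}^{n} X_i + \beta \sum_{i=1}^{n} X_i^2$

$\Rightarrow \sum_{i=1}^{n} X_i Y_i = n\bar{X}\bar{Y} - n\beta\bar{X}^2 + \beta \sum_{i=1}^{n} X_i^2$

$\Rightarrow \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y} = \beta \left( \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \right)$

$\Rightarrow \hat{\beta} = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$

Result:

$\hat{\beta} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \quad \hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}$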
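The closed-form estimates can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the function name `least_squares` and the toy data are hypothetical. It computes the two estimates and confirms that both normal equations hold at the solution.

```python
def least_squares(xs, ys):
    """Return (alpha_hat, beta_hat) from the closed-form least squares solution."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


# Hypothetical toy data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

alpha_hat, beta_hat = least_squares(xs, ys)
residuals = [y - alpha_hat - beta_hat * x for x, y in zip(xs, ys)]

# Normal equations: sum of residuals and sum of residuals times X are both zero.
normal_eq_alpha = sum(residuals)
normal_eq_beta = sum(r * x for r, x in zip(residuals, xs))

print("alpha_hat =", alpha_hat, "beta_hat =", beta_hat)
print("normal equations:", normal_eq_alpha, normal_eq_beta)
```

Both printed normal-equation values should be zero up to floating-point rounding.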

How do we know the two estimators can minimize Q?
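One way to answer this question, offered here as a sketch rather than as the argument used in the original slides, is the second-order condition. Write $H$ for the matrix of second partial derivatives of $Q$:

```latex
\begin{aligned}
\frac{\partial^2 Q}{\partial \alpha^2} &= 2n, \qquad
\frac{\partial^2 Q}{\partial \alpha\,\partial \beta} = 2\sum_{i=1}^{n} X_i, \qquad
\frac{\partial^2 Q}{\partial \beta^2} = 2\sum_{i=1}^{n} X_i^2, \\
\det H &= 4\Big( n\sum_{i=1}^{n} X_i^2 - \Big(\sum_{i=1}^{n} X_i\Big)^2 \Big)
        = 4n\sum_{i=1}^{n} (X_i - \bar{X})^2 .
\end{aligned}
```

Since $2n > 0$ and $\det H > 0$ whenever the $X_i$ are not all equal, $H$ is positive definite. $Q$ is a quadratic function of $(\alpha, \beta)$, so its only stationary point is the global minimum.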

Terminology

True model: $Y = \alpha + \beta X + e$

Fitted model: $Y = \hat{\alpha} + \hat{\beta} X + \hat{e}$

Fitted regression function: $\hat{Y} = \hat{\alpha} + \hat{\beta} X$

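Model: $Y = \alpha + \beta x + e$

Let $f(x) = \alpha + \beta x$, with

$\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}, \quad \hat{\beta} = \frac{S_{xy}}{S_{xx}}, \quad \hat{f}(x) = \hat{\alpha} + \hat{\beta} x$

where $S_{xx} = \sum_{i=1}^{n} (X_i - \bar{X})^2$ and $S_{xy} = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$.

It can be shown that

$\hat{\alpha} \sim N\left(\alpha, \ \sigma^2 \left( \frac{1}{n} + \frac{\bar{X}^2}{S_{xx}} \right)\right)$

$\hat{\beta} \sim N\left(\beta, \ \frac{\sigma^2}{S_{xx}}\right)$

$\hat{f}(x) \sim N\left(f(x), \ \sigma^2 \left( \frac{1}{n} + \frac{(x - \bar{X})^2}{S_{xx}} \right)\right)$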

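Claim: $\hat{\beta} \sim N\left(\beta, \ \frac{\sigma^2}{S_{xx}}\right)$

Solve: Since the model is $Y_i = \alpha + \beta X_i + e_i$, where $e_i \overset{i.i.d.}{\sim} N(0, \sigma^2)$, we have that the $Y_i$ are independent with $Y_i \sim N(\alpha + \beta X_i, \sigma^2)$.

Mean. Because $\sum_{i=1}^{n} (X_i - \bar{X}) = 0$, we can write $S_{xy} = \sum_{i=1}^{n} (X_i - \bar{X}) Y_i$. Then

$E(\hat{\beta}) = E\left(\frac{S_{xy}}{S_{xx}}\right) = \frac{1}{S_{xx}} E(S_{xy})$

$= \frac{1}{S_{xx}} \sum_{i=1}^{n} (X_i - \bar{X}) E(Y_i)$

$= \frac{1}{S_{xx}} \sum_{i=1}^{n} (X_i - \bar{X})(\alpha + \beta X_i)$

$= \frac{\beta}{S_{xx}} \sum_{i=1}^{n} (X_i - \bar{X}) X_i$

$= \frac{\beta}{S_{xx}} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X}) = \beta$

Variance.

$Var(\hat{\beta}) = Var\left(\frac{S_{xy}}{S_{xx}}\right) = \frac{1}{S_{xx}^2} Var(S_{xy})$

$= \frac{1}{S_{xx}^2} Var\left( \sum_{i=1}^{n} (X_i - \bar{X}) Y_i \right)$

$= \frac{1}{S_{xx}^2} \left[ \sum_{i=1}^{n} (X_i - \bar{X})^2 Var(Y_i) + \sum_{i \neq j} (X_i - \bar{X})(X_j - \bar{X}) Cov(Y_i, Y_j) \right]$

$= \frac{1}{S_{xx}^2} \left[ \sigma^2 \sum_{i=1}^{n} (X_i - \bar{X})^2 + 0 \right]$

$= \frac{\sigma^2}{S_{xx}}$

Since $\hat{\beta}$ is a linear combination of independent normal variables, it is itself normal.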

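Claim: $\hat{f}(x) \sim N\left(f(x), \ \sigma^2 \left( \frac{1}{n} + \frac{(x - \bar{X})^2}{S_{xx}} \right)\right)$

Solve: $\hat{f}(x) = \hat{\alpha} + \hat{\beta} x = \bar{Y} - \hat{\beta}\bar{X} + \hat{\beta} x = \bar{Y} + \hat{\beta}(x - \bar{X})$

Mean.

$E\hat{f}(x) = E(\bar{Y}) + (x - \bar{X}) E(\hat{\beta}) = \alpha + \beta\bar{X} + \beta(x - \bar{X}) = \alpha + \beta x = f(x)$

Covariance of $\bar{Y}$ and $\hat{\beta}$.

$Cov(\bar{Y}, \hat{\beta}) = Cov\left( \bar{Y}, \ \frac{1}{S_{xx}} \sum_{i=1}^{n} (X_i - \bar{X}) Y_i \right)$

$= \frac{1}{S_{xx}} \sum_{i=1}^{n} (X_i - \bar{X}) \, Cov(\bar{Y}, Y_i)$

$= \frac{1}{S_{xx}} \sum_{i=1}^{n} (X_i - \bar{X}) \, \frac{1}{n} \sum_{j=1}^{n} Cov(Y_j, Y_i)$

$= \frac{1}{S_{xx}} \sum_{i=1}^{n} (X_i - \bar{X}) \, \frac{1}{n} Cov(Y_i, Y_i)$

$= \frac{\sigma^2}{n S_{xx}} \sum_{i=1}^{n} (X_i - \bar{X}) = 0$

Variance. Since $\hat{f}(x) = \bar{Y} + \hat{\beta}(x - \bar{X})$ and $Cov(\bar{Y}, \hat{\beta}) = 0$,

$Var\,\hat{f}(x) = Var\left( \bar{Y} + \hat{\beta}(x - \bar{X}) \right)$

$= Var(\bar{Y}) + (x - \bar{X})^2 Var(\hat{\beta})$

$= \frac{\sigma^2}{n} + \frac{(x - \bar{X})^2 \sigma^2}{S_{xx}} = \sigma^2 \left( \frac{1}{n} + \frac{(x - \bar{X})^2}{S_{xx}} \right)$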
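These moment results can be verified without simulation, because each estimator is a linear combination of the $Y_i$. The following Python sketch is an illustration under assumed inputs: the design points `xs` and the evaluation point `x0` are hypothetical. It writes $\hat{\beta} = \sum c_i Y_i$ and $\hat{f}(x_0) = \sum w_i Y_i$, then checks the stated means and variance factors (the variances divided by $\sigma^2$).

```python
# Hypothetical design points and evaluation point, for illustration only.
xs = [1.0, 2.0, 4.0, 7.0, 11.0]
x0 = 5.0

n = len(xs)
x_bar = sum(xs) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)

# beta_hat = sum(c_i * Y_i) with c_i = (X_i - X_bar) / S_xx.
c = [(x - x_bar) / s_xx for x in xs]

# Unbiasedness of beta_hat: sum(c_i) = 0 and sum(c_i * X_i) = 1,
# so E(beta_hat) = alpha * 0 + beta * 1 = beta.
sum_c = sum(c)
sum_cx = sum(ci * x for ci, x in zip(c, xs))

# Var(beta_hat) / sigma^2 = sum(c_i^2), which should equal 1 / S_xx.
var_factor_beta = sum(ci ** 2 for ci in c)

# Cov(Y_bar, beta_hat) / sigma^2 = sum((1/n) * c_i), which should be 0.
cov_factor = sum(ci / n for ci in c)

# f_hat(x0) = Y_bar + beta_hat * (x0 - X_bar) = sum(w_i * Y_i).
w = [1.0 / n + (x0 - x_bar) * ci for ci in c]
var_factor_fit = sum(wi ** 2 for wi in w)
expected_fit = 1.0 / n + (x0 - x_bar) ** 2 / s_xx

print(var_factor_beta, 1.0 / s_xx)
print(var_factor_fit, expected_fit)
```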

REGRESSION ON MIDTERM GRADE

Obs   MIDTERM   FINAL
  1        68      75
  2        49      63
  3        60      57
  4        68      88
  5        97      88
  6        82      79
  7        59      82
  8        50      73
  9        73      90
 10        39      62
 11        71      70
 12        95      96
 13        61      76
 14        72      75
 15        87      85
 16        40      40
 17        66      74
 18        58      70
 19        58      75
 20        77      72

Figure 1.4 SAS PROC PRINT output for the grade data problem.

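TITLE 'REGRESSION ON MIDTERM GRADE';
/* Read the midterm and final grades */
DATA;
  INPUT MIDTERM FINAL;
  CARDS;
68 75
49 63
60 57
. .
77 72
;
/* List the data */
PROC PRINT;
/* Fit FINAL = alpha + beta*MIDTERM; keep predicted values and residuals */
PROC REG;
  MODEL FINAL=MIDTERM / P;
  OUTPUT PREDICTED=PRED RESIDUAL=RESID;
/* Overlay observed (O) and predicted (P) grades against MIDTERM */
PROC PLOT;
  PLOT FINAL*MIDTERM='O' PRED*MIDTERM='P' / OVERLAY;
  LABEL FINAL='FINAL';
/* Normal scores of the residuals, for a normal probability plot */
PROC RANK NORMAL=VW;
  VAR RESID;
  RANKS NSCORE;
PROC PLOT;
  PLOT RESID*NSCORE='R';
  LABEL NSCORE='NORMAL SCORE';
RUN;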

REGRESSION ON MIDTERM GRADE
Model: MODEL1
Dependent Variable: FINAL

Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               1       1774.44117    1774.44117     24.26   0.0001
Error              18       1316.55883      73.14216
Corrected Total    19       3091.00000

Root MSE           8.55232    R-Square   0.5741
Dependent Mean    74.50000    Adj R-Sq   0.5504
Coeff Var         11.47962

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             34.56757          8.32984      4.15     0.0006
MIDTERM      1              0.60049          0.12192      4.93     0.0001

Obs   Dep Var FINAL   Predicted Value   Residual
  1         75.0000           75.4007    -0.4007
  2         63.0000           63.9915    -0.9915
  3         57.0000           70.5968   -13.5968
  4         88.0000           75.4007    12.5993
  5         88.0000           92.8149    -4.8149
  6         79.0000           83.8076    -4.8076
  7         82.0000           69.9963    12.0037
  8         73.0000           64.5920     8.4080
  9         90.0000           78.4032    11.5968
 10         62.0000           57.9866     4.0134
 11         70.0000           77.2022    -7.2022
 12         96.0000           91.6139     4.3861
 13         76.0000           71.1973     4.8027
 14         75.0000           77.8027    -2.8027
 15         85.0000           86.8100    -1.8100
 16         40.0000           58.5871   -18.5871
 17         74.0000           74.1998    -0.1998
 18         70.0000           69.3959     0.6041
 19         75.0000           69.3959     5.6041
 20         72.0000           80.8051    -8.8051

Sum of Residuals                        0
Sum of Squared Residuals       1316.55883
Predicted Residual SS (PRESS)  1668.47241
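As a cross-check of the regression output for the grade data, the main quantities can be recomputed from the least squares formulas. This is a minimal Python sketch, not part of the original SAS session; the variable names are chosen here for illustration, and the data are the 20 (MIDTERM, FINAL) pairs listed earlier.

```python
import math

midterm = [68, 49, 60, 68, 97, 82, 59, 50, 73, 39,
           71, 95, 61, 72, 87, 40, 66, 58, 58, 77]
final = [75, 63, 57, 88, 88, 79, 82, 73, 90, 62,
         70, 96, 76, 75, 85, 40, 74, 70, 75, 72]

n = len(midterm)
x_bar = sum(midterm) / n
y_bar = sum(final) / n
s_xx = sum((x - x_bar) ** 2 for x in midterm)
s_yy = sum((y - y_bar) ** 2 for y in final)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(midterm, final))

# Least squares estimates.
beta_hat = s_xy / s_xx
alpha_hat = y_bar - beta_hat * x_bar

# Residual sum of squares, mean square error, and R-square.
sse = sum((y - alpha_hat - beta_hat * x) ** 2 for x, y in zip(midterm, final))
mse = sse / (n - 2)
root_mse = math.sqrt(mse)
r_square = 1.0 - sse / s_yy

# Standard errors, using the variance formulas with sigma^2 replaced by MSE.
se_beta = math.sqrt(mse / s_xx)
se_alpha = math.sqrt(mse * (1.0 / n + x_bar ** 2 / s_xx))

print("Intercept", round(alpha_hat, 5), "SE", round(se_alpha, 5))
print("MIDTERM  ", round(beta_hat, 5), "SE", round(se_beta, 5))
print("SSE", round(sse, 5), "Root MSE", round(root_mse, 5), "R-Square", round(r_square, 4))
```

The printed values should agree with the Parameter Estimates and Analysis of Variance entries to the displayed precision.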

[Plot: FINAL (symbol o) and predicted FINAL (symbol p) versus MIDTERM, overlaid. Vertical axis: FINAL, 40 to 100. Horizontal axis: MIDTERM, 30 to 100. NOTE: 6 obs hidden.]

Figure 1.6 Output for the first PROC PLOT step for the grade data problem.

[Plot: residual (symbol R) versus predicted value of FINAL, with a reference line at residual 0. Vertical axis: Residual, -20 to 20. Horizontal axis: Predicted Value of FINAL, 55 to 95.]

Figure 1.7 The remainder of the output from the first PROC PLOT step.

[Plot: residual (symbol R) versus normal score. Vertical axis: Residual, -20 to 20. Horizontal axis: NORMAL SCORE, -2.0 to 2.0.]

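Theorem: If $\hat{\theta}$ is a normal variable and $\widehat{S.d.}(\hat{\theta})$ is the estimate of its standard deviation, then

$T = \frac{\hat{\theta} - E(\hat{\theta})}{\widehat{S.d.}(\hat{\theta})} \sim t(d.f.)$

＊ Confidence Interval

Example: $H_0: \theta = \theta_0$ vs. $H_a: \theta \neq \theta_0$

Construct a 95% C.I.: $\hat{\theta} \pm t_{0.025} \, \widehat{S.d.}(\hat{\theta})$

For the intercept:

$t = \frac{\hat{\alpha} - \alpha}{\sqrt{\hat{\sigma}^2 \left( \frac{1}{n} + \frac{\bar{X}^2}{S_{XX}} \right)}} \sim t_{n-2}$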
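To make the interval concrete, the following Python sketch applies it to the slope of the grade data regression, taking the estimate and standard error for MIDTERM from the PROC REG output and testing the hypothetical null value 0. The critical value `t_crit = 2.101` is an assumption supplied from a standard t table for 18 degrees of freedom; it is not computed here and does not appear in the original slides.

```python
# Values read from the PROC REG output for the grade data.
beta_hat = 0.60049   # slope estimate for MIDTERM
se_beta = 0.12192    # estimated standard deviation of the slope estimate
df = 18              # error degrees of freedom, n - 2 with n = 20

# Assumed table value of t_{0.025} with 18 degrees of freedom.
t_crit = 2.101

# Test statistic for H0: beta = 0 against Ha: beta != 0.
beta_null = 0.0
t_stat = (beta_hat - beta_null) / se_beta

# 95% confidence interval: estimate +/- t_{0.025} * standard deviation estimate.
half_width = t_crit * se_beta
ci = (beta_hat - half_width, beta_hat + half_width)

# Reject H0 at the 5% level when |t| exceeds the critical value.
reject_h0 = abs(t_stat) > t_crit

print("t =", round(t_stat, 2))
print("95% C.I. for beta:", tuple(round(v, 4) for v in ci))
print("reject H0:", reject_h0)
```

The interval excludes 0, which agrees with the small p-value reported for MIDTERM.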

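＊ Pearson's Correlation Coefficient

＊ Goal: To measure the degree of linear correlation between two variables.

Definition:

$\rho_{xy} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - EX)(Y - EY)]}{\sqrt{E(X - EX)^2 \, E(Y - EY)^2}}$

Sample version:

$r_{xy} = \frac{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 \cdot \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$

The range lies between -1 and 1.

Relation to the slope estimate: $r_{xy} = \hat{\beta} \, \frac{S_x}{S_y}$

＊ Coefficient of Determination: the fraction of the variance in y that is explained by regression on x.

Definition:

$R^2 = \frac{S_{\hat{y}}^2}{S_y^2} \quad (0 \le R^2 \le 1)$

where $S_y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2$ and $S_{\hat{y}}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$.

Goal: $R^2$ may be used as an index of linearity for the relation of y to x.

$S_y^2 = S_{\hat{y}}^2 + S_{\hat{e}}^2, \quad R^2 = r_{xy}^2$

where $S_{\hat{e}}^2 = \frac{1}{n-1} \sum_{i=1}^{n} \hat{e}_i^2$.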
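The identities relating $r_{xy}$, $\hat{\beta}$ and $R^2$ can be checked on the grade data. This is an illustrative Python sketch, not part of the original slides; the variable names are chosen here.

```python
import math

midterm = [68, 49, 60, 68, 97, 82, 59, 50, 73, 39,
           71, 95, 61, 72, 87, 40, 66, 58, 58, 77]
final = [75, 63, 57, 88, 88, 79, 82, 73, 90, 62,
         70, 96, 76, 75, 85, 40, 74, 70, 75, 72]

n = len(midterm)
x_bar = sum(midterm) / n
y_bar = sum(final) / n
s_xx = sum((x - x_bar) ** 2 for x in midterm)
s_yy = sum((y - y_bar) ** 2 for y in final)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(midterm, final))

# Pearson's sample correlation coefficient (the 1/(n-1) factors cancel).
r_xy = s_xy / math.sqrt(s_xx * s_yy)

# Same quantity from the slope: r_xy = beta_hat * S_x / S_y.
beta_hat = s_xy / s_xx
s_x = math.sqrt(s_xx / (n - 1))
s_y = math.sqrt(s_yy / (n - 1))
r_from_slope = beta_hat * s_x / s_y

# Coefficient of determination as the ratio of fitted to total sample variance.
alpha_hat = y_bar - beta_hat * x_bar
fitted = [alpha_hat + beta_hat * x for x in midterm]
s2_fitted = sum((f - y_bar) ** 2 for f in fitted) / (n - 1)
r_square = s2_fitted / (s_yy / (n - 1))

print("r_xy =", round(r_xy, 4))
print("R^2  =", round(r_square, 4), " r_xy^2 =", round(r_xy ** 2, 4))
```

The value of `r_square` should match the R-Square entry of the regression output for these data.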

[Plot: PRESSURE versus VOLUME (symbol o). Vertical axis: PRESSURE, 20 to 120. Horizontal axis: VOLUME, 10 to 50.]

Figure 3.3: A plot of the air pressure data (an example of residual analysis).

[Plot: residual (symbol *) versus predicted value of P, with a reference line at residual 0. Vertical axis: Residual, -10 to 30. Horizontal axis: Predicted Value of P, 16.357 to 94.210.]

Figure 3.4 The residual on fit plot after fitting the model P = a + bV + e to the air pressure data.

[Plot: residual (symbol *) versus fitted value, with a reference line at residual 0. Vertical axis: Residual, -0.75 to 0.50. Horizontal axis ticks: 20 to 120.]

Figure 3.5 The residual on fit plot using the model P = a + b/V + e for the air pressure data.

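Ordinary Regression

Model: $Y = \alpha + \beta X + e$

Claim: minimize $Q(\alpha,\beta) = \sum_{i=1}^{n} (Y_i - \alpha - \beta X_i)^2$

Weighted Regression

Problem: $Var(Y_i) = \sigma_i^2$ (unequal variance)

Model: $Y = \alpha + \beta X + e$

Claim: minimize $Q_W(\alpha,\beta) = \sum_{i=1}^{n} W_i (Y_i - \alpha - \beta X_i)^2$, given $W_i$ (i.e. known), $i = 1, \dots, n$.

How to determine the weights?

Solve: $Q_W(\alpha,\beta) = \sum_{i=1}^{n} W_i (Y_i - \alpha - \beta X_i)^2 = \sum_{i=1}^{n} \left( \sqrt{W_i}\,Y_i - \sqrt{W_i}\,\alpha - \sqrt{W_i}\,\beta X_i \right)^2$

Let $\tilde{Y}_i = \sqrt{W_i}\,Y_i$ and require $Var(\tilde{Y}_i) = \sigma^2$, so that the transformed responses have equal variance.

$\sigma^2 = Var(\tilde{Y}_i) = Var(\sqrt{W_i}\,Y_i) = W_i \, Var(Y_i)$

Then

$\hat{W}_i = \frac{\hat{\sigma}^2}{\widehat{Var}(Y_i)}$

So the optimal weights are inversely proportional to the variances of the y.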
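The weighted criterion has a closed-form minimizer analogous to the ordinary one, with weighted means in place of plain means. The Python sketch below is an illustration, not the method used in the slides' SAS code; the function name `weighted_least_squares`, the toy data, and the choice of weights $1/X_i^2$ (as if $Var(Y_i)$ were proportional to $X_i^2$) are all hypothetical.

```python
def weighted_least_squares(xs, ys, ws):
    """Minimize sum(W_i * (Y_i - alpha - beta * X_i)^2); return (alpha_hat, beta_hat)."""
    sw = sum(ws)
    x_bar_w = sum(w * x for w, x in zip(ws, xs)) / sw  # weighted mean of X
    y_bar_w = sum(w * y for w, y in zip(ws, ys)) / sw  # weighted mean of Y
    s_xx_w = sum(w * (x - x_bar_w) ** 2 for w, x in zip(ws, xs))
    s_xy_w = sum(w * (x - x_bar_w) * (y - y_bar_w) for w, x, y in zip(ws, xs, ys))
    beta_hat = s_xy_w / s_xx_w
    alpha_hat = y_bar_w - beta_hat * x_bar_w
    return alpha_hat, beta_hat


# Hypothetical toy data; weights assume Var(Y_i) is proportional to X_i^2.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
ws = [1.0 / x ** 2 for x in xs]

alpha_w, beta_w = weighted_least_squares(xs, ys, ws)
res = [y - alpha_w - beta_w * x for x, y in zip(xs, ys)]

# Weighted normal equations: both sums vanish at the minimizer.
weq_alpha = sum(w * r for w, r in zip(ws, res))
weq_beta = sum(w * r * x for w, r, x in zip(ws, res, xs))

# With equal weights the ordinary least squares estimates are recovered.
alpha_eq, beta_eq = weighted_least_squares(xs, ys, [1.0] * len(xs))

print("weighted:", alpha_w, beta_w)
print("equal weights:", alpha_eq, beta_eq)
```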

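/* Read volume V and pressure P; regress on the reciprocal VI = 1/V */
DATA; INPUT V P; VI=1/V; CARDS;
48 29.1
...
12 117.6
;
/* Step 1: ordinary least squares fit, keeping the fitted values */
PROC REG; MODEL P=VI; OUTPUT P=LSFIT;
/* Step 2: weights from the ordinary fit */
DATA; SET; W=1/LSFIT;
/* Step 3: weighted fit and weighted residuals */
PROC REG; MODEL P=VI; WEIGHT W; OUTPUT P=FIT R=RES;
DATA; SET; WRES=SQRT(W)*RES;
/* Step 4: residual-on-fit and normal probability plots */
PROC RANK NORMAL=VW; VAR WRES; RANKS NSCORE;
PROC PLOT;
  PLOT WRES*FIT='*' / VREF=0 VPOS=30;
  PLOT WRES*NSCORE='*' / VPOS=30;
  LABEL WRES='WEIGHTED RESIDUAL' NSCORE='NORMAL SCORE';
RUN;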

[Plot: weighted residual (symbol *) versus fitted value, with a reference line at 0. Vertical axis: WEIGHTED RESIDUAL, -0.075 to 0.050. Horizontal axis ticks: 20 to 120.]

Figure 3.13 Weighted residual plot for a weighted fit of the model P = a + b/V + e to the air pressure data.

[Plot: residual (symbol *) versus predicted value of PT, with a reference line at residual 0. Vertical axis: Residual, -0.0003 to 0.0002. Horizontal axis: Predicted Value of PT, -0.034 to -0.009.]

Figure 3.17 Residual on fit plot for the model $-1/P = \alpha + \beta V + e$ in the air pressure data.

[Plot: residual (symbol *) versus normal score. Vertical axis: Residual, -0.0003 to 0.0002. Horizontal axis: NORMAL SCORE, -2 to 2.]

Figure 3.18 Residual normal probability plot for the model $-1/P = \alpha + \beta V + e$ in the air pressure data.

[Plot: residual (symbol *) versus predicted value of PT, with a reference line at residual 0. Vertical axis: Residual, -0.00015 to 0.0001. Horizontal axis: Predicted Value of PT, -0.033 to -0.007.]

Figure 3.19 Residual on fit plot for the model $-1/P = \alpha + \beta V + e$ in Example 3.4 after deleting the first data point.

[Plot: residual (symbol *) versus normal score. Vertical axis: Residual, -0.00015 to 0.0001. Horizontal axis: NORMAL SCORE, -2 to 2.]

Figure 3.20 Residual normal probability plot for the model $-1/P = \alpha + \beta V + e$ in Example 3.4 after deleting the first data point.

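How to determine the transformation T such that

$Var(T(Y_i)) = \sigma^2$ (constant)?

(assuming T is monotonic increasing)

Solve: By Taylor's expansion about $f = EY$,

$T(Y) = T(f) + \frac{T'(f)}{1!}(Y - f) + \text{remainder}$

Taking the variance on both sides, we have

$Var(T(Y)) \approx Var\left( T(f) + \frac{T'(f)}{1!}(Y - f) \right) = [T'(f)]^2 \, Var(Y)$

Let $[T'(f)]^2 \, Var(Y) = \sigma^2$.

Since $[T'(f)]^2 \, Var(Y) = \sigma^2$,

$T'(f) = \frac{\sigma}{\sqrt{Var(Y)}}$. Then $T(f) = \int \frac{\sigma}{\sqrt{Var(Y)}} \, df$

e.g.: $Y \sim P(f)$ (Poisson with mean f). Find the T(Y)?

$Var(Y) = f$, so $T(f) = \int \frac{\sigma}{\sqrt{f}} \, df = \sigma \int f^{-1/2} \, df = 2\sigma f^{1/2} + c$

We take the power transform $T(Y) = \sqrt{Y}$.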
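The square-root result can be checked numerically for the Poisson case. The Python sketch below is an illustration, not part of the original slides; the function name `poisson_variances` and the chosen means are hypothetical. It sums the Poisson probabilities directly to compare $Var(Y)$, which grows with the mean, against $Var(\sqrt{Y})$, which the Taylor argument predicts to be close to $[T'(f)]^2 \, Var(Y) = \frac{1}{4f} \cdot f = \frac{1}{4}$.

```python
import math


def poisson_variances(mean):
    """Return (Var(Y), Var(sqrt(Y))) for Y ~ Poisson(mean), by summing the pmf."""
    # Truncate far in the upper tail; the omitted probability is negligible.
    upper = int(mean + 12.0 * math.sqrt(mean) + 20.0)
    p = math.exp(-mean)             # P(Y = 0); this term adds 0 to every moment below
    e_y = e_y2 = e_sqrt = 0.0
    for k in range(1, upper + 1):
        p *= mean / k               # recursion: P(Y = k) = P(Y = k - 1) * mean / k
        e_y += k * p
        e_y2 += k * k * p
        e_sqrt += math.sqrt(k) * p
    var_y = e_y2 - e_y ** 2
    var_sqrt = e_y - e_sqrt ** 2    # E[(sqrt Y)^2] = E[Y]
    return var_y, var_sqrt


results = {m: poisson_variances(m) for m in (10.0, 50.0, 100.0)}
for m, (var_y, var_sqrt) in results.items():
    print("mean", m, "Var(Y)", round(var_y, 3), "Var(sqrt Y)", round(var_sqrt, 4))
```

The first-order Taylor approximation improves as the mean grows, so the stabilized variance should approach 1/4 for the larger means.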