01_03
Transcript of 01_03

Least squares estimation of regression lines
Regression via least squares
Brian Caffo, Jeff Leek and Roger Peng
Johns Hopkins Bloomberg School of Public Health
General least squares for linear equations

Consider again the parent and child height data from Galton.
Fitting the best line

∙ Let $Y_i$ be the $i^{th}$ child's height and $X_i$ be the $i^{th}$ (average over the pair of) parents' heights.
∙ Consider finding the best line
  Child's Height $= \beta_0 + \beta_1 \cdot$ Parent's Height
∙ Use least squares:
  $$\sum_{i=1}^n \{Y_i - (\beta_0 + \beta_1 X_i)\}^2$$
∙ How do we do it?
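The criterion above is just a sum of squared vertical distances, so it is easy to evaluate directly. A minimal sketch in Python (the course uses R; this is only an illustration with made-up toy numbers, not Galton's data):

```python
# Evaluate the least squares criterion sum_i {Y_i - (b0 + b1 * X_i)}^2.
def least_squares_loss(x, y, b0, b1):
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Toy data (hypothetical numbers, not Galton's heights).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 8.1]

# A line close to the data has a much smaller loss than a poor one.
print(least_squares_loss(x, y, 0.0, 2.0))   # small: 0.07
print(least_squares_loss(x, y, 0.0, 0.0))   # large: 123.67
```

Minimizing this function over $(\beta_0, \beta_1)$ is exactly the problem the following slides solve in closed form.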
Let's solve this problem generally

∙ Let $\mu_i = \beta_0 + \beta_1 X_i$ and our estimates be $\hat \mu_i = \hat \beta_0 + \hat \beta_1 X_i$.
∙ We want to minimize
  $$\dagger \quad \sum_{i=1}^n (Y_i - \mu_i)^2 = \sum_{i=1}^n (Y_i - \hat \mu_i)^2 + 2 \sum_{i=1}^n (Y_i - \hat \mu_i)(\hat \mu_i - \mu_i) + \sum_{i=1}^n (\hat \mu_i - \mu_i)^2$$
∙ Suppose that
  $$\sum_{i=1}^n (Y_i - \hat \mu_i)(\hat \mu_i - \mu_i) = 0$$
  then
  $$\dagger = \sum_{i=1}^n (Y_i - \hat \mu_i)^2 + \sum_{i=1}^n (\hat \mu_i - \mu_i)^2 \geq \sum_{i=1}^n (Y_i - \hat \mu_i)^2$$
Mean only regression

So we know that if
$$\sum_{i=1}^n (Y_i - \hat \mu_i)(\hat \mu_i - \mu_i) = 0$$
where $\mu_i = \beta_0 + \beta_1 X_i$ and $\hat \mu_i = \hat \beta_0 + \hat \beta_1 X_i$, then the line
$$Y = \hat \beta_0 + \hat \beta_1 X$$
is the least squares line.

∙ Consider forcing $\beta_1 = 0$ and thus $\hat \beta_1 = 0$; that is, only considering horizontal lines.
∙ The solution works out to be
  $$\hat \beta_0 = \bar Y.$$
Let's show it

$$\sum_{i=1}^n (Y_i - \hat \mu_i)(\hat \mu_i - \mu_i) = \sum_{i=1}^n (Y_i - \hat \beta_0)(\hat \beta_0 - \beta_0) = (\hat \beta_0 - \beta_0) \sum_{i=1}^n (Y_i - \hat \beta_0)$$

Thus, this will equal 0 if
$$\sum_{i=1}^n (Y_i - \hat \beta_0) = n \bar Y - n \hat \beta_0 = 0$$

Thus $\hat \beta_0 = \bar Y.$
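The mean-only result is easy to confirm numerically. A Python sketch (not the slides' R; toy numbers, made up for illustration), checking that the sample mean beats other horizontal lines:

```python
# Horizontal-line least squares: b0_hat = ybar minimizes sum_i (Y_i - b0)^2.
# Toy data (hypothetical numbers).
y = [2.0, 4.0, 9.0]
ybar = sum(y) / len(y)  # 5.0

def loss(b0):
    return sum((yi - b0) ** 2 for yi in y)

# The loss at the mean is no larger than at any other candidate tried.
print(all(loss(ybar) <= loss(b) for b in [0.0, 4.9, 5.1, 10.0]))  # True
```

Since the loss is a quadratic in $\beta_0$, the grid check mirrors what the algebra proves for every candidate at once.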
Regression through the origin

Recall that if
$$\sum_{i=1}^n (Y_i - \hat \mu_i)(\hat \mu_i - \mu_i) = 0$$
where $\mu_i = \beta_0 + \beta_1 X_i$ and $\hat \mu_i = \hat \beta_0 + \hat \beta_1 X_i$, then the line
$$Y = \hat \beta_0 + \hat \beta_1 X$$
is the least squares line.

∙ Consider forcing $\beta_0 = 0$ and thus $\hat \beta_0 = 0$; that is, only considering lines through the origin.
∙ The solution works out to be
  $$\hat \beta_1 = \frac{\sum_{i=1}^n Y_i X_i}{\sum_{i=1}^n X_i^2}.$$
Let's show it

$$\sum_{i=1}^n (Y_i - \hat \mu_i)(\hat \mu_i - \mu_i) = \sum_{i=1}^n (Y_i - \hat \beta_1 X_i)(\hat \beta_1 X_i - \beta_1 X_i) = (\hat \beta_1 - \beta_1) \sum_{i=1}^n (Y_i X_i - \hat \beta_1 X_i^2)$$

Thus, this will equal 0 if
$$\sum_{i=1}^n (Y_i X_i - \hat \beta_1 X_i^2) = \sum_{i=1}^n Y_i X_i - \hat \beta_1 \sum_{i=1}^n X_i^2 = 0$$

Thus $\hat \beta_1 = \frac{\sum_{i=1}^n Y_i X_i}{\sum_{i=1}^n X_i^2}.$
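As with the mean-only case, the through-the-origin slope can be verified numerically. A Python sketch (illustration only, with made-up toy data):

```python
# Regression through the origin: b1_hat = sum(Y_i X_i) / sum(X_i^2).
# Toy data (hypothetical numbers).
x = [1.0, 2.0, 3.0]
y = [1.2, 1.9, 3.3]

b1_hat = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)

def loss(b1):
    return sum((yi - b1 * xi) ** 2 for xi, yi in zip(x, y))

# The formula's slope beats nearby alternative slopes.
print(all(loss(b1_hat) <= loss(b1_hat + d) for d in (-0.5, -0.01, 0.01, 0.5)))
```

Here `b1_hat` equals 14.9 / 14, and because the loss is quadratic in $\beta_1$, no other slope can do better.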
Recapping what we know

∙ If we define $\mu_i = \beta_0$ then $\hat \beta_0 = \bar Y$.
  If we only look at horizontal lines, the least squares estimate of the intercept of that line is the average of the outcomes.
∙ If we define $\mu_i = X_i \beta_1$ then $\hat \beta_1 = \frac{\sum_{i=1}^n Y_i X_i}{\sum_{i=1}^n X_i^2}$.
  If we only look at lines through the origin, the estimated slope is the cross product of the X and Ys divided by the cross product of the Xs with themselves.
∙ What about when $\mu_i = \beta_0 + \beta_1 X_i$? That is, we don't want to restrict ourselves to horizontal lines or lines through the origin.
Let's figure it out

$$\sum_{i=1}^n (Y_i - \hat \mu_i)(\hat \mu_i - \mu_i) = \sum_{i=1}^n (Y_i - \hat \beta_0 - \hat \beta_1 X_i)(\hat \beta_0 + \hat \beta_1 X_i - \beta_0 - \beta_1 X_i)$$
$$= (\hat \beta_0 - \beta_0) \sum_{i=1}^n (Y_i - \hat \beta_0 - \hat \beta_1 X_i) + (\hat \beta_1 - \beta_1) \sum_{i=1}^n (Y_i - \hat \beta_0 - \hat \beta_1 X_i) X_i$$

Note that
$$0 = \sum_{i=1}^n (Y_i - \hat \beta_0 - \hat \beta_1 X_i) = n \bar Y - n \hat \beta_0 - n \hat \beta_1 \bar X \quad \mbox{implies that} \quad \hat \beta_0 = \bar Y - \hat \beta_1 \bar X$$

Then
$$\sum_{i=1}^n (Y_i - \hat \beta_0 - \hat \beta_1 X_i) X_i = \sum_{i=1}^n (Y_i - \bar Y + \hat \beta_1 \bar X - \hat \beta_1 X_i) X_i$$
Continued

$$= \sum_{i=1}^n \{(Y_i - \bar Y) - \hat \beta_1 (X_i - \bar X)\} X_i$$

And thus
$$\sum_{i=1}^n (Y_i - \bar Y) X_i - \hat \beta_1 \sum_{i=1}^n (X_i - \bar X) X_i = 0.$$

So we arrive at
$$\hat \beta_1 = \frac{\sum_{i=1}^n (Y_i - \bar Y) X_i}{\sum_{i=1}^n (X_i - \bar X) X_i} = \frac{\sum_{i=1}^n (Y_i - \bar Y)(X_i - \bar X)}{\sum_{i=1}^n (X_i - \bar X)(X_i - \bar X)} = \mathrm{Cor}(Y, X) \frac{\mathrm{Sd}(Y)}{\mathrm{Sd}(X)}.$$

And recall
$$\hat \beta_0 = \bar Y - \hat \beta_1 \bar X.$$
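The two conditions used in the derivation, that residuals sum to zero and are orthogonal to the $X_i$, can be checked directly once the fit is computed. A Python sketch (illustration only, with made-up toy data, not the slides' R):

```python
# Full least squares fit from the slide's formulas:
# b1 = sum (Yi - ybar)(Xi - xbar) / sum (Xi - xbar)^2,  b0 = ybar - b1 * xbar.
# Toy data (hypothetical numbers).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 8.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

# The two normal equations: residuals sum to zero and are orthogonal to x.
res = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(abs(sum(res)) < 1e-9)                                   # True
print(abs(sum(r * xi for r, xi in zip(res, x))) < 1e-9)       # True
```

These are exactly the equations that set the cross term $\sum (Y_i - \hat \mu_i)(\hat \mu_i - \mu_i)$ to zero in the derivation above.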
Consequences

The least squares model fit to the line $Y = \beta_0 + \beta_1 X$ through the data pairs $(X_i, Y_i)$ with $Y_i$ as the outcome obtains the line $Y = \hat \beta_0 + \hat \beta_1 X$ where
$$\hat \beta_1 = \mathrm{Cor}(Y, X) \frac{\mathrm{Sd}(Y)}{\mathrm{Sd}(X)}, \quad \hat \beta_0 = \bar Y - \hat \beta_1 \bar X.$$

∙ $\hat \beta_1$ has the units of $Y / X$; $\hat \beta_0$ has the units of $Y$.
∙ The line passes through the point $(\bar X, \bar Y)$.
∙ The slope of the regression line with $X$ as the outcome and $Y$ as the predictor is $\mathrm{Cor}(Y, X) \, \mathrm{Sd}(X) / \mathrm{Sd}(Y)$.
∙ The slope is the same one you would get if you centered the data, $(X_i - \bar X, Y_i - \bar Y)$, and did regression through the origin.
∙ If you normalized the data, $\left\{ \frac{X_i - \bar X}{\mathrm{Sd}(X)}, \frac{Y_i - \bar Y}{\mathrm{Sd}(Y)} \right\}$, the slope is $\mathrm{Cor}(Y, X)$.
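The centering consequence can be sketched numerically before the R checks that follow. A Python illustration (made-up toy data, not Galton's):

```python
# Regressing the centered data through the origin reproduces the
# full-fit slope. Toy data (hypothetical numbers).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 8.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Slope of the full fit.
b1_full = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
          sum((xi - xbar) ** 2 for xi in x)

# Through-origin slope on the centered data (formula from earlier slides).
xc = [xi - xbar for xi in x]
yc = [yi - ybar for yi in y]
b1_centered = sum(a * b for a, b in zip(xc, yc)) / sum(a ** 2 for a in xc)

print(abs(b1_full - b1_centered) < 1e-12)  # True: the slopes agree
```

The agreement is no accident: after centering, $\sum x_c = 0$, so the through-origin formula and the full-fit slope formula are the same expression.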
Revisiting Galton's data

Double check our calculations using R:

y <- galton$child
x <- galton$parent
beta1 <- cor(y, x) * sd(y) / sd(x)
beta0 <- mean(y) - beta1 * mean(x)
rbind(c(beta0, beta1), coef(lm(y ~ x)))

     (Intercept)      x
[1,]       23.94 0.6463
[2,]       23.94 0.6463
Revisiting Galton's data

Reversing the outcome/predictor relationship:

beta1 <- cor(y, x) * sd(x) / sd(y)
beta0 <- mean(x) - beta1 * mean(y)
rbind(c(beta0, beta1), coef(lm(x ~ y)))

     (Intercept)      y
[1,]       46.14 0.3256
[2,]       46.14 0.3256
Revisiting Galton's data

Regression through the origin yields an equivalent slope if you center the data first:

yc <- y - mean(y)
xc <- x - mean(x)
beta1 <- sum(yc * xc) / sum(xc^2)
c(beta1, coef(lm(y ~ x))[2])

            x
0.6463 0.6463
Revisiting Galton's data

Normalizing variables results in the slope being the correlation:

yn <- (y - mean(y)) / sd(y)
xn <- (x - mean(x)) / sd(x)
c(cor(y, x), cor(yn, xn), coef(lm(yn ~ xn))[2])

                   xn
0.4588 0.4588 0.4588
Plotting the fit

∙ Size of points are frequencies at that X, Y combination.
∙ For the red line the child is the outcome.
∙ For the blue line, the parent is the outcome (accounting for the fact that the response is plotted on the horizontal axis).
∙ The black line assumes $\mathrm{Cor}(Y, X) = 1$ (slope is $\mathrm{Sd}(Y) / \mathrm{Sd}(X)$).
∙ The big black dot is $(\bar X, \bar Y)$.
The code to add the lines

abline(mean(y) - mean(x) * cor(y, x) * sd(y) / sd(x),
       sd(y) / sd(x) * cor(y, x), lwd = 3, col = "red")
abline(mean(y) - mean(x) * sd(y) / sd(x) / cor(y, x),
       sd(y) / sd(x) / cor(y, x), lwd = 3, col = "blue")
abline(mean(y) - mean(x) * sd(y) / sd(x),
       sd(y) / sd(x), lwd = 2)
points(mean(x), mean(y), cex = 2, pch = 19)