Least Squares Regression Fitting a Line to Bivariate Data.
loren-norrie
Least Squares Regression
Fitting a Line to Bivariate Data
Linear Relationships
Avg. occupants per car
1980: 6/car; 1990: 3/car; 2000: 1.5/car
By the year 2010 every fourth car will have nobody in it!
Food for Thought
What kind of mathematical relationship is there between year and avg. no. of occupants per car?
Why might the relationship break down by 2010?
Basic Terminology
Scatterplots, correlation: interested in association between 2 variables (assign x and y arbitrarily)
Least squares regression: does one quantitative variable explain or cause changes in another variable?
Basic Terminology (cont.)
Explanatory variable: explains or causes changes in the other variable; the x-variable (independent variable).
Response variable: the y-variable; it responds to changes in the x-variable (dependent variable).
Examples
Fertilizer (x) → corn yield (y)
Advertising $ (x) → store income (y)
Drug dose (x) → blood pressure (y)
Daily temperature (x) → natural gas demand (y)
Change in min wage (x) → unemployment rate (y)
Simplest Relationship
Simplest equation that describes the dependence of variable y on variable x:
\(y = b_0 + b_1 x\)
Linear equation; its graph is a line with slope \(b_1\) and y-intercept \(b_0\).
[Figure: graph of the line \(y = b_0 + b_1 x\), with y-intercept \(b_0\) and slope \(b_1\) = rise/run]
Notation
Observations: \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\)
Draw the line \(y = b_0 + b_1 x\) through the scatterplot; the point on the line corresponding to \(x_i\) is
\(\hat{y}_i = b_0 + b_1 x_i\)
\(\hat{y}_i\) is the value of y predicted by the line \(\hat{y} = b_0 + b_1 x\) when \(x = x_i\);
\(y_i\) is the observed value of y when \(x = x_i\).
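Observed y, Predicted y
[Figure: scatterplot with fitted line, marking the observed y and the predicted y at x = 2.7]
Predicted y when x = 2.7: \(\hat{y} = a + bx = a + b(2.7)\) (here a and b denote the intercept and slope)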
Scatterplot: Fuel Consumption vs Car Weight
[Figure: scatterplot of fuel consumption (gal/100 miles) vs car weight (1000 lbs)]
“Best” line?
Scatterplot with least squares prediction line
[Figure: FUEL CONSUMPTION vs CAR WEIGHT; fuel consumption (gal/100 miles) vs weight (1000 lbs), with fitted line y = 1.639x - 0.3631, \(r^2\) = 0.9538]
How do we draw the line? Residuals
The ith residual is the vertical deviation of the ith data point from the line:
ith residual = ith observed y - ith predicted y
\(y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i)\)
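Residuals: graphically
[Figure: graphical display of residuals; at \(X_i\), the residual \(e_i = Y_i - \hat{Y}_i\) is the vertical distance between the observed \(Y_i\) and the predicted \(\hat{Y}_i\); a positive residual and a negative residual are marked]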
Criterion for choosing what line to draw: method of least squares
The method of least squares chooses the line that makes the sum of squares of the residuals as small as possible.
This line has slope \(b_1\) and intercept \(b_0\) that minimize
\(\sum_{i=1}^{n} [y_i - (b_0 + b_1 x_i)]^2\)
for the given observations \((x_i, y_i)\).
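For reference, the minimizing values can be found with calculus. The derivation below is a standard sketch and is not part of the original slides; the name \(Q\) for the sum of squares is introduced here only for illustration. Setting both partial derivatives to zero, and substituting the first result into the second, gives:

```latex
\begin{aligned}
Q(b_0, b_1) &= \sum_{i=1}^{n} \bigl[y_i - (b_0 + b_1 x_i)\bigr]^2 \\
\frac{\partial Q}{\partial b_0} &= -2 \sum_{i=1}^{n} \bigl[y_i - (b_0 + b_1 x_i)\bigr] = 0
  \;\Longrightarrow\; b_0 = \bar{y} - b_1 \bar{x} \\
\frac{\partial Q}{\partial b_1} &= -2 \sum_{i=1}^{n} x_i \bigl[y_i - (b_0 + b_1 x_i)\bigr] = 0
  \;\Longrightarrow\; b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\end{aligned}
```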
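Least Squares Line \(\hat{y} = b_0 + b_1 x\): Slope \(b_1\) and Intercept \(b_0\)
Given observations \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\):
slope: \(b_1 = r \frac{s_y}{s_x}\)
y-intercept: \(b_0 = \bar{y} - b_1 \bar{x}\)
where
\(s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}\) is the standard deviation of \(x_1, x_2, \ldots, x_n\)
\(s_y = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}\) is the standard deviation of \(y_1, y_2, \ldots, y_n\)
\(r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1) s_x s_y}\) is the correlation between x and y
\(SSE = \sum_{i=1}^{n} y_i^2 - b_0 \sum_{i=1}^{n} y_i - b_1 \sum_{i=1}^{n} x_i y_i\)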
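The slope and intercept formulas translate directly into a short computation. The following is a minimal Python sketch, not part of the original slides; the function name `least_squares_line` and the small data set are illustrative assumptions.

```python
from math import sqrt

def least_squares_line(xs, ys):
    """Return (b0, b1, r) for the least squares line y = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divide by n - 1).
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Correlation between x and y.
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b1 = r * s_y / s_x        # slope
    b0 = y_bar - b1 * x_bar   # y-intercept
    return b0, b1, r

# Hypothetical data lying exactly on y = 1 + 2x, so the fit should recover that line.
b0, b1, r = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
print(round(b0, 6), round(b1, 6), round(r, 6))
```

Because the illustrative points fall exactly on a line, the correlation comes out as 1 (up to floating-point rounding) and the fitted line matches the line that generated them.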
Example: Income vs Consumption Expenditure
Income (x), $1,000's    Consumption Expenditure (y), $1,000's
1                       7
5                       6
9                       9
13                      8
17                      10
Questions
Construct scatterplot; determine if linear model is appropriate. If so …
… find the least squares prediction line.
Estimate consumption expenditure in a household with an income of (i) $6,000 (ii) $25,000. Comfortable with estimates?
Compute the residuals.
Scatterplot
[Figure: Consumption Expenditure; expenditure ($1,000's) vs household income ($1,000's)]
Solution
Inc. x   Exp. y   xi-xbar   (xi-xbar)^2   yi-ybar   (yi-ybar)^2   (xi-xbar)(yi-ybar)
1        7        -8        64            -1        1             8
5        6        -4        16            -2        4             8
9        9        0         0             1         1             0
13       8        4         16            0         0             0
17       10       8         64            2         4             16
Sum: 45  40       0         160           0         10            32
\(\bar{x} = \frac{45}{5} = 9;\ \bar{y} = \frac{40}{5} = 8\)
\(s_x = \sqrt{\frac{160}{4}} = \sqrt{40} = 6.325;\ s_y = \sqrt{\frac{10}{4}} = \sqrt{2.5} = 1.581\)
\(r = \frac{32}{4(6.325)(1.581)} = .8\)
Calculations
\(b_1 = r \frac{s_y}{s_x} = .8 \cdot \frac{1.581}{6.325} = .2\)
\(b_0 = \bar{y} - b_1 \bar{x} = 8 - .2(9) = 8 - 1.8 = 6.2\)
least squares prediction line: \(\hat{y} = 6.2 + .2x\)
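As a cross-check on this arithmetic, the slope can also be computed straight from the column sums, since \(r \, s_y / s_x\) simplifies to \(\sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2\). A minimal Python sketch; the variable names are illustrative and not from the slides.

```python
# Income (x) and consumption expenditure (y), both in $1,000's, from the example.
income = [1, 5, 9, 13, 17]
expenditure = [7, 6, 9, 8, 10]

n = len(income)
x_bar = sum(income) / n        # 45 / 5
y_bar = sum(expenditure) / n   # 40 / 5

# Column sums from the solution table.
s_xx = sum((x - x_bar) ** 2 for x in income)                                 # 160
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(income, expenditure))   # 32

b1 = s_xy / s_xx          # slope: 32 / 160
b0 = y_bar - b1 * x_bar   # intercept: 8 - 0.2 * 9
print(f"y-hat = {b0:.1f} + {b1:.1f}x")   # prints: y-hat = 6.2 + 0.2x
```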
least squares prediction line: \(\hat{y} = b_0 + b_1 x = 6.2 + .2x\)
income $6,000, x = 6: \(\hat{y} = 6.2 + .2(6) = 7.4\) ($7,400)
income $25,000, x = 25: \(\hat{y} = 6.2 + .2(25) = 11.2\) ($11,200)
Least Squares Prediction Line
[Figure: Consumption Expenditure; expenditure ($1,000's) vs household income ($1,000's), with fitted line y = 6.2 + 0.2x]
Consumption Expenditure Prediction When x=$6,000
[Figure: Consumption Expenditure with fitted line y = 6.2 + 0.2x; prediction marked at income 6, expenditure 7.4 ($1,000's)]
Consumption Expenditure Prediction When x=$25,000
[Figure: Consumption Expenditure with fitted line y = 6.2 + 0.2x; prediction marked at income 25, expenditure 11.2 ($1,000's)]
The least squares line always goes through the point with coordinates \((\bar{x}, \bar{y})\)
\((\bar{x}, \bar{y}) = (9, 8)\)
C. Compute the Residuals
Inc. x   ConE y   y-hat = 6.2 + .2x   y - y-hat   (y - y-hat)^2
1        7        6.4                 .6          .36
5        6        7.2                 -1.2        1.44
9        9        8                   1           1
13       8        8.8                 -.8         .64
17       10       9.6                 .4          .16
Sum of residuals = 0; sum of (residuals)^2 = 3.6
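The residual table can be reproduced in a few lines. This is a minimal Python sketch, not part of the original slides; the variable names are illustrative, and the sum of residuals is zero only up to floating-point rounding.

```python
# Income (x) and consumption expenditure (y), both in $1,000's, from the example.
income = [1, 5, 9, 13, 17]
expenditure = [7, 6, 9, 8, 10]
b0, b1 = 6.2, 0.2   # least squares line from the example

# Predicted value and residual (observed - predicted) for each household.
predicted = [b0 + b1 * x for x in income]
residuals = [y - y_hat for y, y_hat in zip(expenditure, predicted)]

for x, y, y_hat, e in zip(income, expenditure, predicted, residuals):
    print(f"{x:>3} {y:>3} {y_hat:5.1f} {e:5.1f} {e * e:5.2f}")

sum_resid = sum(residuals)
sse = sum(e * e for e in residuals)
print(f"sum of squared residuals = {sse:.2f}")
```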
Residuals
[Figure: Consumption Expenditure; expenditure ($1,000's) vs household income ($1,000's), with fitted line y = 6.2 + 0.2x]
Income Residual Plot
[Figure: Income Residual Plot; residuals (-2 to 2) vs income (0 to 20)]
Σ residuals, Σ (residuals)^2
Note that
* Σ residuals = 0, Σ (residuals)^2 = 3.6
* From formula in box on p. 7:
\(SSE = \sum y_i^2 - b_0 \sum y_i - b_1 \sum x_i y_i = 330 - 6.2(40) - .2(392) = 330 - 248 - 78.4 = 3.6\)
Any other line drawn through the scatterplot will have Σ (residuals)^2 > 3.6
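Both claims can be checked numerically. The sketch below, which is not part of the original slides, compares the shortcut SSE formula with the direct sum of squared residuals, then tries a few arbitrarily chosen alternative lines. The alternatives are illustrative assumptions; trying a handful of lines demonstrates the claim but does not prove it.

```python
# Income (x) and consumption expenditure (y), both in $1,000's, from the example.
income = [1, 5, 9, 13, 17]
expenditure = [7, 6, 9, 8, 10]

def sse(b0, b1):
    """Sum of squared residuals for the line y = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(income, expenditure))

# Shortcut formula: SSE = sum(y^2) - b0*sum(y) - b1*sum(x*y)
sum_y2 = sum(y * y for y in expenditure)                    # 330
sum_y = sum(expenditure)                                    # 40
sum_xy = sum(x * y for x, y in zip(income, expenditure))    # 392
shortcut = sum_y2 - 6.2 * sum_y - 0.2 * sum_xy

best = sse(6.2, 0.2)
print(f"direct SSE = {best:.2f}, shortcut SSE = {shortcut:.2f}")

# A few arbitrarily chosen alternative lines; each gives a larger SSE.
others = [(6.0, 0.2), (6.2, 0.25), (5.0, 0.3), (8.0, 0.0)]
for alt_b0, alt_b1 in others:
    print(f"b0={alt_b0}, b1={alt_b1}: SSE = {sse(alt_b0, alt_b1):.2f}")
```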
Car Weight, Fuel Consumption Example, cont.
[Figure: FUEL CONSUMPTION vs CAR WEIGHT; fuel consumption (gal/100 miles) vs weight (1000 lbs)]
(xi, yi): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)
Wt (x)   Fuel (y)   xi-xbar   (xi-xbar)^2   yi-ybar   (yi-ybar)^2   (xi-xbar)(yi-ybar)
3.4      5.5        .5        .25           1.11      1.2321        .555
3.8      5.9        .9        .81           1.51      2.2801        1.359
4.1      6.5        1.2       1.44          2.11      4.4521        2.532
2.2      3.3        -.7       .49           -1.09     1.1881        .763
2.6      3.6        -.3       .09           -.79      .6241         .237
2.9      4.6        0         0             .21       .0441         0
2.0      2.9        -.9       .81           -1.49     2.2201        1.341
2.7      3.6        -.2       .04           -.79      .6241         .158
1.9      3.1        -1.0      1             -1.29     1.6641        1.29
3.4      4.9        .5        .25           .51       .2601         .255
col. sum: 29   43.9   0       5.18          0         14.589        8.49
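Calculations
\(\bar{x} = 2.9;\ \bar{y} = 4.39\)
\(s_x = \sqrt{\frac{5.18}{9}} = .7587;\ s_y = \sqrt{\frac{14.589}{9}} = 1.2732\)
\(r = \frac{8.49}{9(.7587)(1.2732)} = .9766\)
slope: \(b_1 = r \frac{s_y}{s_x} = .9766 \cdot \frac{1.2732}{.7587} = 1.639\)
intercept: \(b_0 = \bar{y} - b_1 \bar{x} = 4.39 - 1.639(2.9) = -.3631\)
least squares prediction line: \(\hat{y} = b_0 + b_1 x = -.3631 + 1.639x\)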
Scatterplot with least squares prediction line
[Figure: FUEL CONSUMPTION vs CAR WEIGHT; fuel consumption (gal/100 miles) vs weight (1000 lbs), with fitted line y = 1.639x - 0.3631, \(r^2\) = 0.9538]
The Least Squares Line Always Goes Through \((\bar{x}, \bar{y})\)
\((\bar{x}, \bar{y}) = (2.9, 4.39)\)
Using the least squares line for prediction. Fuel consumption of 3,000 lb car? (x=3)
\(\hat{y} = -.3631 + 1.639(3) = 4.5539\)
[Figure: Fuel Consumption vs Car Weight: Scatterplot and Least Squares Line; y = -0.3631 + 1.639x, with the predicted point (3.0, 4.5539) marked]
Be Careful!
Fuel consumption of 500 lb car? (x = .5)
\(\hat{y} = -.3631 + 1.639(.5) = .4564\) gal/100 miles (219 mpg)
[Figure: FUEL CONSUMPTION vs CAR WEIGHT; fuel consumption (gal/100 miles) vs weight (1000 lbs), with fitted line y = 1.639x - 0.3631, \(r^2\) = 0.9538]
x = .5 is outside the range of the x-data that we used to determine the least squares line
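One way to guard against this kind of extrapolation is to report, along with each prediction, whether x lies inside the range of the data used for the fit. The following is a minimal Python sketch, not part of the original slides; the function name `predict_fuel` is a hypothetical helper.

```python
# Car weights (1000 lbs) used to fit the line, from the example data.
weights = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
B0, B1 = -0.3631, 1.639   # fitted intercept and slope from the slides

def predict_fuel(weight):
    """Return (predicted gal/100 miles, True if weight is within the observed x-range)."""
    in_range = min(weights) <= weight <= max(weights)
    return B0 + B1 * weight, in_range

for w in (3.0, 0.5):
    y_hat, ok = predict_fuel(w)
    note = "within data range" if ok else "EXTRAPOLATION: outside 1.9 to 4.1"
    print(f"x = {w}: y-hat = {y_hat:.4f} gal/100 miles ({note})")
```

The 3,000 lb prediction falls inside the observed weights (1.9 to 4.1), while the 500 lb prediction is flagged, matching the warning in the slides.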
Avoid GIGO! Evaluating the least squares line
1. Create scatterplot. Approximately linear?
2. Calculate \(r^2\), the square of the correlation coefficient
3. Examine residual plot
\(r^2\): The Variation Accounted For
The square of the correlation coefficient r gives important information about the usefulness of the least squares line.
\(r^2\) is the fraction of the variation in y that is explained by the least squares regression of y on x (that is, by the variation in x).
\(-1 \le r \le 1\) implies \(0 \le r^2 \le 1\)
Example: car weight, fuel consumption
x=car weight, y=fuel consumption
\(r^2 = (.9766)^2 \approx .95\)
About 95% of the variation in fuel consumption (y) is explained by the linear relationship between car weight (x) and fuel consumption (y).
What else affects fuel consumption?
– Driver, size of engine, tires, road, etc.
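The "fraction of variation explained" reading can be checked directly: for a least squares line, \(r^2\) equals 1 minus the ratio of the residual sum of squares (SSE) to the total variation of y about its mean. The sketch below is not part of the original slides; the name `sst` for the total variation is an assumption introduced for illustration.

```python
# (weight in 1000 lbs, fuel consumption in gal/100 miles) from the example.
data = [(3.4, 5.5), (3.8, 5.9), (4.1, 6.5), (2.2, 3.3), (2.6, 3.6),
        (2.9, 4.6), (2.0, 2.9), (2.7, 3.6), (1.9, 3.1), (3.4, 4.9)]

n = len(data)
x_bar = sum(x for x, _ in data) / n
y_bar = sum(y for _, y in data) / n

# Least squares slope and intercept.
s_xx = sum((x - x_bar) ** 2 for x, _ in data)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in data)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Total variation in y, and the part left unexplained by the line.
sst = sum((y - y_bar) ** 2 for _, y in data)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in data)

r_squared = 1 - sse / sst
print(f"r^2 = {r_squared:.4f}")   # prints: r^2 = 0.9538
```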
Example: SAT scores
[Figure: SAT Mean per State vs % Seniors Taking Test; mean SAT score vs % of seniors taking test, with fitted line y = -2.2375x + 1023.4, \(R^2\) = 0.7542]
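SAT scores: calculations
\(\bar{x} = 33.882,\ s_x = 24.103,\ \bar{y} = 947.549,\ s_y = 62.1,\ r = -.868\)
\(b_1 = r \frac{s_y}{s_x}, \quad b_0 = \bar{y} - b_1 \bar{x}\)
slope: \(b_1 = -.868 \cdot \frac{62.1}{24.103} = -2.23635\)
intercept: \(b_0 = 947.549 - (-2.236)(33.882) = 1023.309\)
least squares prediction line: \(\hat{y} = 1023.309 - 2.236x\)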
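Because the slope and intercept depend only on the two means, the two standard deviations, and r, the line can be computed from summary statistics without the raw data. A minimal Python sketch, not part of the original slides; `line_from_summary` is a hypothetical helper name. The intercept it prints differs from 1023.309 in the second decimal place because the slides round the slope to 2.236 before computing the intercept.

```python
def line_from_summary(x_bar, s_x, y_bar, s_y, r):
    """Least squares intercept and slope from means, standard deviations, and r."""
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Summary statistics from the SAT example (x = % of seniors taking the test).
b0, b1 = line_from_summary(x_bar=33.882, s_x=24.103, y_bar=947.549, s_y=62.1, r=-0.868)
print(f"slope = {b1:.5f}, intercept = {b0:.3f}")

# Predicted mean SAT score when 57% of seniors take the test.
prediction = b0 + b1 * 57
print(f"prediction at x = 57: {prediction:.2f}")
```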
SAT scores: result
[Figure: SAT Mean per State vs % Seniors Taking Test; mean SAT score vs % of seniors taking test, with fitted line y = -2.2375x + 1023.4, \(R^2\) = 0.7542]
\(r^2 = (-.868)^2 = .7534\)
If 57% of NC seniors take the SAT, the predicted mean score is
\(\hat{y} = 1023.309 - 2.23635(57) = 895.84\)
Residuals
residual = observed y - predicted y = \(y - \hat{y}\)
Properties of residuals
1. The residuals always sum to 0 (therefore the mean of the residuals is 0)
2. The least squares line always goes through the point \((\bar{x}, \bar{y})\)
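Both properties follow from the intercept formula \(b_0 = \bar{y} - b_1 \bar{x}\). The short derivation below is a standard sketch added for reference and is not part of the original slides:

```latex
\begin{aligned}
\sum_{i=1}^{n} (y_i - \hat{y}_i)
  &= \sum_{i=1}^{n} y_i - n b_0 - b_1 \sum_{i=1}^{n} x_i
   = n\bar{y} - n(\bar{y} - b_1 \bar{x}) - n b_1 \bar{x} = 0 \\
\hat{y}\,\big|_{x = \bar{x}}
  &= b_0 + b_1 \bar{x} = (\bar{y} - b_1 \bar{x}) + b_1 \bar{x} = \bar{y}
\end{aligned}
```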
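Graphically
[Figure: residual = \(y - \hat{y}\); at \(x_i\), the residual \(e_i = y_i - \hat{y}_i\) is the vertical distance between the observed \(y_i\) and the predicted \(\hat{y}_i\)]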
Residual Plot
Residuals help us determine if fitting a least squares line to the data makes sense
When a least squares line is appropriate, it should model the underlying relationship; nothing interesting should be left behind
We make a scatterplot of the residuals in the hope of finding…
NOTHING!
Car Wt/ Fuel Consump: Residuals
CAR WT.   FUEL CONSUMP.   Pred FUEL CONSUMP.   Residuals
3.4       5.5             5.209498069          0.290501931
3.8       5.9             5.865096525          0.034903475
4.1       6.5             6.356795367          0.143204633
2.2       3.3             3.242702703          0.057297297
2.6       3.6             3.898301158          -0.29830115
2.9       4.6             4.39                 0.21
2         2.9             2.914903475          -0.01490347
2.7       3.6             4.062200772          -0.46220077
1.9       3.1             2.751003861          0.348996139
3.4       4.9             5.209498069          -0.309498069
Example: Car wt/fuel consump. residual plot page 13
[Figure: RESIDUALS vs WT(X); residuals (-0.6 to 0.4) vs car weight (1.5 to 4.5)]
SAT Residuals
[Figure: %TAKE Residual Plot; residuals (-100 to 100) vs %TAKE (0 to 80)]
Linear Relationship?
[Figure: scatterplot titled "Linear(?)"; Y (0 to 60) vs X (-4 to 8)]
Garbage In Garbage Out
[Figure: GIGO; scatterplot of Y (0 to 60) vs X (-4 to 8) with fitted line y = 4x + 11]
Residual Plot – Clue to GIGO
[Figure: Residual Plot; residuals (-20 to 20) vs X Variable (-4 to 8)]
[Figure: GIGO scatterplot with fitted line y = 4x + 11, shown together with its Residual Plot; residuals (-20 to 20) vs X Variable (-4 to 8)]
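The pattern a residual plot reveals can be illustrated with made-up curved data. The sketch below is not part of the original slides and does not use the data behind the GIGO figures; it fits a least squares line to hypothetical points on \(y = x^2\) and lists the residuals, which are positive at both ends and negative in the middle.

```python
# Hypothetical curved data (y = x^2), NOT the data behind the slides.
xs = list(range(-3, 8))   # -3, -2, ..., 7
ys = [x * x for x in xs]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least squares slope and intercept.
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
print(f"fitted line: y = {b1:.1f}x + {b0:.1f}")

# Residuals follow a U shape: a clear sign that a line is the wrong model.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
for x, e in zip(xs, residuals):
    print(f"{x:>3} {e:6.1f}")
```

A line can always be fitted, and the residuals still sum to zero, but the systematic U shape shows that something interesting was left behind.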