Regression and Prediction - UW Faculty Web Server (faculty.washington.edu/grover4/class11.pdf)
Regression and Prediction
Chapter 15 plus extra
May 2, 2012
Prediction
Vertical Chimneys
Regression Line
Equation of the Regression Line
Regression and Least Squares
Regression Fallacy
1.0 Prediction
If we have two quantitative variables X and Y that are linearly related to each other, then knowing the particular value of X for one individual can help us to estimate (or predict) the value of Y for that individual.

We will explore the best prediction of the response variable (Y) given a value of the explanatory variable (X).
What is the likely size of the prediction error?
1.1 Fundamental Principle of Prediction
Incoming students at a large law school have an average L.S.A.T. score of 163 and an S.D. of 8. You may assume the histogram of these data follows a normal curve approximately. Tomorrow one of these students will be chosen at random.
What is your best guess for their score?
The guess will be compared to their actual score to see how far off it is. What is the likely size for the error in your guess?
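The principle here (guess the mean; expect an error about the size of the S.D.) can be checked by simulation. The sketch below is illustrative Python, not part of the original slides: it draws scores from a normal curve with mean 163 and S.D. 8 and measures the r.m.s. error of always guessing the mean.

```python
# Illustrative sketch: best single guess for a randomly chosen student is the
# mean (163); the likely size of the error is the S.D. (8). We check this by
# simulating many students and computing the r.m.s. error of guessing the mean.
import random
import math

random.seed(0)
scores = [random.gauss(163, 8) for _ in range(100_000)]

guess = 163  # the average: the best single guess
rms_error = math.sqrt(sum((s - guess) ** 2 for s in scores) / len(scores))

print(round(rms_error, 1))  # close to the S.D. of 8
```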
2.0 Vertical Chimneys in a Scatterplot
[Two scatterplots of son's height (inches) vs. father's height (inches); the second highlights vertical strips (chimneys) of the data.]
[Scatterplot of son's height (inches) vs. father's height (inches) with the graph of averages overlaid.]
The graph of averages shows the average son's height for each father's height.

It is close to a straight line in the middle.

At the ends, it is quite bumpy.
2.1 Prediction in a Scatterplot
Use the mean of the relevant sub-group of data as our predictor.

S.D. of the group gives the "likely size" of the error in our prediction.
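To make the chimney idea concrete, here is a small illustrative sketch using synthetic father-son heights (the numbers are assumptions chosen to resemble the data later in these slides, not the real data set): it slices the scatter into one-inch vertical strips and takes the average Y in each strip, which is exactly the graph of averages.

```python
# Illustrative sketch with synthetic data: the "graph of averages" is the mean
# of Y inside each vertical strip (chimney) of X values.
import random
from collections import defaultdict

random.seed(1)
# Synthetic father/son heights with a positive association (assumed numbers).
fathers = [random.gauss(68, 2.7) for _ in range(5_000)]
sons = [69 + 0.5 * (2.8 / 2.7) * (f - 68) + random.gauss(0, 2.4) for f in fathers]

strips = defaultdict(list)
for f, s in zip(fathers, sons):
    strips[round(f)].append(s)  # one chimney per inch of father's height

graph_of_averages = {x: sum(ys) / len(ys) for x, ys in sorted(strips.items())}
for x in (64, 68, 72):
    print(x, round(graph_of_averages[x], 1))  # averages rise with father's height
```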
3.0 Regression Line
[Scatterplot of son's height (inches) vs. father's height (inches) with the regression line drawn.]
The regression line is a line fit to the graph of averages.

It smooths away some of the chance variation in the data.

If the graph of averages is close to a straight line, then we use the regression line to predict Y for a given X.

If the graph of averages is non-linear, it is better to use the graph of averages itself instead.
3.1 Predicting using a Regression Line
Estimate the average weight of the men whose height is 69 inches.

If you used the regression method to estimate weight from height, would your estimates generally be a little too high, low, or about right, for men in the sample with height between 72 in. and 74 in.?
4.0 The Regression Line
The regression line for predicting Y from X passes through the point of averages (X̄, Ȳ) and has

slope = r × (S.D. of Y / S.D. of X).
5.0 The Equation of the Regression Line
The regression line for predicting Y from X has the form:

Y = a + b X
  = intercept + slope × X.

Here

b = slope = r × (S.D. of Y / S.D. of X),

a = intercept = Ȳ − b X̄
  = Ȳ − (r × (S.D. of Y / S.D. of X)) × X̄.
5.1 Prediction from a Regression Line
The predicted value of Y for a given value of X, say X*, has the form:

Y = a + b X*
  = (Ȳ − (r × (S.D. of Y / S.D. of X)) × X̄) + (r × (S.D. of Y / S.D. of X)) × X*.
5.2 Predicting Sons' Heights

1,078 father-son pairs had their heights measured.

- Average height of fathers is ≈ 68 in.
- S.D. of height of fathers is ≈ 2.7 in.
- Average height of sons is ≈ 69 in.
- S.D. of height of sons is ≈ 2.8 in.
- r is ≈ 0.5.
What are the co-ordinates for the point of averages?
What is the slope of the regression line?
What is the intercept of the regression line?
Write the equation of the regression line.
Suppose a father has a height of 72 inches. What would you predict for his son's height?
Suppose a father has a height of 62 inches. What wouldyou predict for his son’s height?
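One way to check the arithmetic for these questions is to code the formulas directly. This illustrative sketch just plugs the summary statistics from the slide into the slope and intercept formulas.

```python
# Checking the arithmetic with the summary statistics from the slide:
# fathers: mean 68, S.D. 2.7; sons: mean 69, S.D. 2.8; r = 0.5.
mean_x, sd_x = 68, 2.7
mean_y, sd_y = 69, 2.8
r = 0.5

slope = r * sd_y / sd_x              # r × (S.D. of Y / S.D. of X) ≈ 0.519
intercept = mean_y - slope * mean_x  # Ȳ − b X̄ ≈ 33.7

def predict(x):
    """Regression prediction of son's height for a father of height x inches."""
    return intercept + slope * x

print(round(slope, 3), round(intercept, 1))
print(round(predict(72), 1))  # ≈ 71.1 in
print(round(predict(62), 1))  # ≈ 65.9 in
```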
5.3 Interpreting the Regression Coefficients
Associated with a unit increase in X , there is some averagechange in Y . The slope of the regression line estimates thischange. The formula for the slope is:
r × (S.D. of Y / S.D. of X).
That is, associated with an increase of one S.D. in X , there isan increase of r S.D.s in Y , on the average.
The intercept is just the predicted value of Y when X equals zero. Be wary of extrapolation: X = 0 may lie far outside the range of the data.
6.0 Regression and Least Squares
The regression line is familiarly referred to as the least squares line. This is because it minimizes the sum of the squares of the vertical distances from the data points to the line.
[Diagram: a regression line through a scatter of points, with the vertical distance from one data point to the line marked.]
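The least squares property can be verified numerically: nudging the fitted slope or intercept can only increase the sum of squared vertical distances. A small illustrative sketch (the data points are made up for the demonstration):

```python
# Illustrative sketch: the least squares line beats any nearby perturbed line
# on the sum of squared vertical distances.
def sse(a, b, xs, ys):
    """Sum of squared vertical distances from the points to the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]          # made-up data for the demonstration
ys = [2.1, 3.9, 6.2, 8.0, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

best = sse(a, b, xs, ys)
for da in (-0.5, 0.5):
    for db in (-0.1, 0.1):
        assert sse(a + da, b + db, xs, ys) >= best  # any tweak does worse

print(round(a, 2), round(b, 2))
```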
7.0 The Regression Fallacy
In virtually every scatterplot with less than perfect correlation, the data points that are extreme along the x axis tend not to be as extreme on the y axis. This is called the regression effect.
Definition: Thinking that the regression effect must be due to something important, not just chance error, is called the regression fallacy.
7.1 Example
An instructor standardizes both her midterm and the final each semester so the class average is 50 and the S.D. is 10 on both tests. The correlation between the tests is around 0.5. One semester she took all the students who scored below 30 on the midterm and gave them special tutoring. On average, they gained 10 points on the final. She claims that her tutoring worked. Can you give her an alternate explanation?
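The regression effect alone can produce a gain of this size, and a simulation makes that vivid. The sketch below is illustrative (the bivariate-normal model is an assumption): it generates test scores with mean 50, S.D. 10, and correlation 0.5, applies no tutoring at all, and still sees the below-30 group gain on the final.

```python
# Illustrative simulation: with mean 50, S.D. 10, r = 0.5 on both tests and
# NO tutoring, students selected for scoring below 30 on the midterm still
# gain on the final, purely from the regression effect.
import random
import math

random.seed(2)
gains = []
for _ in range(200_000):
    z = random.gauss(0, 1)
    midterm = 50 + 10 * z
    # Bivariate-normal final with correlation 0.5 and the same mean/S.D.
    final = 50 + 10 * (0.5 * z + math.sqrt(1 - 0.5 ** 2) * random.gauss(0, 1))
    if midterm < 30:
        gains.append(final - midterm)

print(round(sum(gains) / len(gains), 1))  # around a 12-point gain, by chance alone
```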