Regression and Prediction - UW Faculty Web Server (faculty.washington.edu/grover4/class11.pdf)
Regression and Prediction
Chapter 15 plus extra
May 2, 2012
Prediction
Vertical Chimneys
Regression Line
Equation of the Regression Line
Regression and Least Squares
Regression Fallacy
1.0 Prediction
If we have two quantitative variables X and Y that are linearly related to each other, then knowing the particular value of X for one individual can help us to estimate (or predict) the value of Y for that individual.

We will explore the best prediction of the response variable (Y) given a value of the explanatory variable (X).
What is the likely size of the prediction error?
1.1 Fundamental Principle of Prediction
Incoming students at a large law school have an average L.S.A.T. score of 163 and an S.D. of 8. You may assume the histogram of these data follows a normal curve approximately. Tomorrow one of these students will be chosen at random.
What is your best guess for their score?
The guess will be compared to their actual score to see how far off it is. What is the likely size for the error in your guess?
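The principle here (guess the mean; expect an error about the size of the S.D.) can be checked by simulation. The sketch below is illustrative Python, not part of the original slides: it draws scores from a normal curve with mean 163 and S.D. 8 and measures the r.m.s. error of always guessing the mean.

```python
# Illustrative sketch: best single guess for a randomly chosen student is the
# mean (163); the likely size of the error is the S.D. (8). We check this by
# simulating many students and computing the r.m.s. error of guessing the mean.
import random
import math

random.seed(0)
scores = [random.gauss(163, 8) for _ in range(100_000)]

guess = 163  # the average: the best single guess
rms_error = math.sqrt(sum((s - guess) ** 2 for s in scores) / len(scores))

print(round(rms_error, 1))  # close to the S.D. of 8
```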
2.0 Vertical Chimneys in a Scatterplot
[Two scatterplots of son's height (inches) vs. father's height (inches); the second highlights vertical strips (chimneys) of the data.]
[Scatterplot of son's height (inches) vs. father's height (inches) with the graph of averages overlaid.]
The graph of averages shows the average son's height for each father's height.

It is close to a straight line in the middle.

At the ends, it is quite bumpy.
2.1 Prediction in a Scatterplot
Use the mean of the relevant sub-group of data as our predictor.

S.D. of the group gives the "likely size" of the error in our prediction.
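To make the chimney idea concrete, here is a small illustrative sketch using synthetic father-son heights (the numbers are assumptions chosen to resemble the data later in these slides, not the real data set): it slices the scatter into one-inch vertical strips and takes the average Y in each strip, which is exactly the graph of averages.

```python
# Illustrative sketch with synthetic data: the "graph of averages" is the mean
# of Y inside each vertical strip (chimney) of X values.
import random
from collections import defaultdict

random.seed(1)
# Synthetic father/son heights with a positive association (assumed numbers).
fathers = [random.gauss(68, 2.7) for _ in range(5_000)]
sons = [69 + 0.5 * (2.8 / 2.7) * (f - 68) + random.gauss(0, 2.4) for f in fathers]

strips = defaultdict(list)
for f, s in zip(fathers, sons):
    strips[round(f)].append(s)  # one chimney per inch of father's height

graph_of_averages = {x: sum(ys) / len(ys) for x, ys in sorted(strips.items())}
for x in (64, 68, 72):
    print(x, round(graph_of_averages[x], 1))  # averages rise with father's height
```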
3.0 Regression Line
[Scatterplot of son's height (inches) vs. father's height (inches) with the regression line drawn.]
The regression line is a line fit to the graph of averages.

It smooths away some of the chance variation in the data.

If the graph of averages is close to a straight line, then we use the regression line to predict Y for a given X.

If the graph of averages is non-linear, it is better to use the graph of averages itself instead.
3.1 Predicting using a Regression Line
Estimate the average weight of the men whose height is 69 inches.

If you used the regression method to estimate weight from height, would your estimates generally be a little too high, low, or about right, for men in the sample with height between 72 in. and 74 in.?
4.0 The Regression Line
The regression line for predicting Y from X passes through the point of averages (X̄, Ȳ) and has

slope = r × (S.D. of Y / S.D. of X).
5.0 The Equation of the Regression Line
The regression line for predicting Y from X has the form:

Y = a + b X
  = intercept + slope × X.

Here

b = slope = r × (S.D. of Y / S.D. of X),

a = intercept = Ȳ − b X̄
  = Ȳ − (r × (S.D. of Y / S.D. of X)) × X̄.
5.1 Prediction from a Regression Line
The predicted value of Y for a given value of X, say X*, has the form:

Y = a + b X*
  = (Ȳ − (r × (S.D. of Y / S.D. of X)) × X̄) + (r × (S.D. of Y / S.D. of X)) × X*.
5.2 Predicting Sons' Heights

1,078 father-son pairs had their heights measured.

- Average height of fathers is ≈ 68 in.
- S.D. of height of fathers is ≈ 2.7 in.
- Average height of sons is ≈ 69 in.
- S.D. of height of sons is ≈ 2.8 in.
- r is ≈ 0.5.
What are the co-ordinates for the point of averages?
What is the slope of the regression line?
What is the intercept of the regression line?
Write the equation of the regression line.
Suppose a father has a height of 72 inches. What would you predict for his son's height?
Suppose a father has a height of 62 inches. What wouldyou predict for his son’s height?
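One way to check the arithmetic for these questions is to code the formulas directly. This illustrative sketch just plugs the summary statistics from the slide into the slope and intercept formulas.

```python
# Checking the arithmetic with the summary statistics from the slide:
# fathers: mean 68, S.D. 2.7; sons: mean 69, S.D. 2.8; r = 0.5.
mean_x, sd_x = 68, 2.7
mean_y, sd_y = 69, 2.8
r = 0.5

slope = r * sd_y / sd_x              # r × (S.D. of Y / S.D. of X) ≈ 0.519
intercept = mean_y - slope * mean_x  # Ȳ − b X̄ ≈ 33.7

def predict(x):
    """Regression prediction of son's height for a father of height x inches."""
    return intercept + slope * x

print(round(slope, 3), round(intercept, 1))
print(round(predict(72), 1))  # ≈ 71.1 in
print(round(predict(62), 1))  # ≈ 65.9 in
```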
5.3 Interpreting the Regression Coefficients
Associated with a unit increase in X , there is some averagechange in Y . The slope of the regression line estimates thischange. The formula for the slope is:
r × (S.D. of Y / S.D. of X).
That is, associated with an increase of one S.D. in X , there isan increase of r S.D.s in Y , on the average.
The intercept is just the predicted value of Y when X equals zero. Be wary of extrapolation: X = 0 may lie far outside the range of the data.
6.0 Regression and Least Squares
The regression line is familiarly referred to as the least squares line. This is because it minimizes the sum of the squares of the vertical distances from the data points to the line.
[Diagram: a regression line through a scatter of points, with the vertical distance from one data point to the line marked.]
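The least squares property can be verified numerically: nudging the fitted slope or intercept can only increase the sum of squared vertical distances. A small illustrative sketch (the data points are made up for the demonstration):

```python
# Illustrative sketch: the least squares line beats any nearby perturbed line
# on the sum of squared vertical distances.
def sse(a, b, xs, ys):
    """Sum of squared vertical distances from the points to the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]          # made-up data for the demonstration
ys = [2.1, 3.9, 6.2, 8.0, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

best = sse(a, b, xs, ys)
for da in (-0.5, 0.5):
    for db in (-0.1, 0.1):
        assert sse(a + da, b + db, xs, ys) >= best  # any tweak does worse

print(round(a, 2), round(b, 2))
```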
7.0 The Regression Fallacy
In virtually every scatterplot with less than perfect correlation, the data points that are extreme along the x axis tend not to be as extreme on the y axis. This is called the regression effect.
Definition: Thinking that the regression effect must be due to something important, not just chance error, is called the regression fallacy.
7.1 Example
An instructor standardizes both her midterm and the final each semester so the class average is 50 and the S.D. is 10 on both tests. The correlation between the tests is around 0.5. One semester she took all the students who scored below 30 on the midterm and gave them special tutoring. On average, they gained 10 points on the final. She claims that her tutoring worked. Can you give her an alternate explanation?
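The regression effect alone can produce a gain of this size, and a simulation makes that vivid. The sketch below is illustrative (the bivariate-normal model is an assumption): it generates test scores with mean 50, S.D. 10, and correlation 0.5, applies no tutoring at all, and still sees the below-30 group gain on the final.

```python
# Illustrative simulation: with mean 50, S.D. 10, r = 0.5 on both tests and
# NO tutoring, students selected for scoring below 30 on the midterm still
# gain on the final, purely from the regression effect.
import random
import math

random.seed(2)
gains = []
for _ in range(200_000):
    z = random.gauss(0, 1)
    midterm = 50 + 10 * z
    # Bivariate-normal final with correlation 0.5 and the same mean/S.D.
    final = 50 + 10 * (0.5 * z + math.sqrt(1 - 0.5 ** 2) * random.gauss(0, 1))
    if midterm < 30:
        gains.append(final - midterm)

print(round(sum(gains) / len(gains), 1))  # around a 12-point gain, by chance alone
```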