Chapter 10 - Regression
PART III: CORRELATION & REGRESSION

Dr. Joseph Brennan

Math 148, BU


What is Regression?

If a scatter diagram shows a linear relationship, we would like to summarize the overall pattern with a line on the scatter diagram.

A drawn line may be used to predict the values of y (dependent variable) from the values of x (independent variable).

Applications

Trend Estimation: Predicting trends in business analytics.

Epidemiology: Relating tobacco smoking to mortality and morbidity.

Finance: Analyzing the systematic risk of investments.

Economics: The predominant empirical tool in economics.

No straight line passes through all the points. To the naked eye, many lines appear potentially optimal.


Many Optimal Lines

Which line best represents the linear trend of the data?


A Brief Review: Lines

A line is characterized by having a constant slope:

m = slope = rise / run

Points on a line can be generated by the slope-intercept formula:

y = mx + b

where m is the slope and b is the y-intercept. The y-intercept is the y value when x is 0. The line y = −2x + 4:


Regression Line

We need a formal way to draw an optimal line that will go as close to the points as possible!

The least squares regression line is the UNIQUE line fitted by the least squares method and which passes as close to the data as possible in the vertical direction. (More soon!)


Regression Line

Assume that y and x are the dependent and independent variables of a study. Denote:

ŷ to be the predicted (by regression) value of y for a given x,

r to be the correlation coefficient between x and y,

ȳ and s_y to be the average and standard deviation of the dependent (response) variable y,

x̄ and s_x to be the average and standard deviation of the independent (explanatory) variable x.

The optimal least squares regression line of y on x, derived mathematically, is defined as:

ŷ = mx + b

with slope and intercept:

m = r · (s_y / s_x)

b = ȳ − m · x̄
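The slope and intercept formulas above translate directly into a few lines of code. A minimal sketch in Python (not part of the slides; the function name and the example numbers are my own):

```python
def regression_line(r, x_bar, s_x, y_bar, s_y):
    """Least-squares regression line of y on x from summary statistics.

    Returns (m, b) so that y_hat = m * x + b, using
    m = r * s_y / s_x  and  b = y_bar - m * x_bar.
    """
    m = r * s_y / s_x
    b = y_bar - m * x_bar
    return m, b


# Made-up summary statistics, purely for illustration:
m, b = regression_line(r=0.5, x_bar=10, s_x=2, y_bar=20, s_y=4)
print(m, b)   # 1.0 10.0
```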


Equations of the Regression Line: Z-Score Independent

The optimal least squares regression line of y on x, derived mathematically, is defined as:

ŷ = mx + b

with slope and intercept:

m = r · (s_y / s_x)

b = ȳ − m · x̄.

If we substitute the formulas for the slope and intercept into the equation of the regression line and work some algebra, we will get the following form of the regression line equation:

ŷ = ȳ + r · s_y · (x − x̄) / s_x = ȳ + r · s_y · z_x ,   (1)

where z_x is the z-score for x.


Equations of the Regression Line: Z-Score Dependent

Recall that the z-scores are found via the equations:

z_x = (x − x̄) / s_x        z_y = (y − ȳ) / s_y

With some algebraic manipulation:

ŷ = ȳ + r · s_y · z_x   ⇒   ẑ_y = (ŷ − ȳ) / s_y = r · z_x

Interpretation:

The correlation coefficient helps predict the z-score for y using only the z-score for x. This equation is considered the regression equation for the standardized data.

The data pairs (x_i, y_i) can be transformed to (z_xi, z_yi) by a linear transformation. The set of standardized z-scores has a mean of 0 and a standard deviation of 1. They also have a correlation coefficient of r.
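As a numerical check (my own sketch, with made-up data), the two forms of the prediction, form (1) in the original units and ẑ_y = r · z_x in standard units, give the same answer. Note that pstdev computes the divide-by-n standard deviation, which appears to be the convention used in these slides:

```python
from statistics import mean, pstdev

# Made-up data, purely for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 6]
n = len(xs)

x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = pstdev(xs), pstdev(ys)          # population (divide-by-n) SDs

# Correlation coefficient as the average product of z-scores.
r = sum((x - x_bar) / s_x * (y - y_bar) / s_y for x, y in zip(xs, ys)) / n

x_new = 4.2
z_x = (x_new - x_bar) / s_x

y_hat = y_bar + r * s_y * z_x              # form (1), in the original units
z_y_hat = r * z_x                          # standardized form
print(abs(y_hat - (y_bar + s_y * z_y_hat)) < 1e-12)   # True: the forms agree
```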


Exercise 1, page 213 of the text.

Find the regression equations for predicting final score from midterm score, based on the following information:

average midterm score = 70, SD = 10
average final score = 55, SD = 20, r = 0.6

Solution: First, let’s rewrite all the given information in our notation:

The dependent (response) variable, y, is the final score.

The independent variable, x, is the midterm score.

The x average is x̄ = 70 and the standard deviation is s_x = 10.

The y average is ȳ = 55 and the standard deviation is s_y = 20.


Exercise 1, page 213 of the text.

Compute the slope of the regression line:

m = r · (s_y / s_x) = 0.6 · (20 / 10) = 1.2

Then find the intercept:

b = ȳ − m · x̄ = 55 − 1.2 · 70 = −29

The equation of the least squares regression line is:

ŷ = 1.2x − 29

The equation for the predicted z-score of y:

ẑ_y = 0.6 · z_x
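The same arithmetic can be checked in Python (a sketch of my own, reusing the summary statistics given in the exercise):

```python
r, s_x, s_y = 0.6, 10, 20      # correlation and the two SDs from the exercise
x_bar, y_bar = 70, 55          # average midterm and final scores

m = r * s_y / s_x              # 0.6 * 20 / 10 = 1.2
b = y_bar - m * x_bar          # 55 - 1.2 * 70 = -29.0

print(f"y_hat = {m}x + ({b})")   # y_hat = 1.2x + (-29.0)
print(f"z_y_hat = {r} * z_x")    # standardized form: z_y_hat = 0.6 * z_x
```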


Interpreting the Regression Line

1. By saying we regress y on x, we mean that we want to predict y from x.

2. The best use of the regression line is to estimate the AVERAGE value of y for a given value of x.

Using the regression line formula of Exercise 1:

ŷ = 1.2x − 29

if we assume x = 78, then we find

ŷ = 1.2 · 78 − 29 = 64.6

We interpret this as follows: the average final exam score for students who score 78 points on the midterm exam is predicted to be 64.6.


Example (HANES5: Weight and Height)

The scatter diagram of the weight and height measurements for the 471 men in the HANES5 survey is shown below:

The solid line in the figure is the regression line. The three crosses on the scatter diagram estimate the average weights of men whose heights x equal 64, 73, and 76 inches.


Example (HANES5: Weight and Height)

The graph of averages for the 471 men aged 18-24 in the HANES5 sample. The regression line smooths the graph:

In general, if the relationship between x and y is linear and there are no extreme outliers, then the average points follow the regression line very closely.


Using Regression Line for Individual Predictions

Although the best use of the regression line is to predict the average outcomes, it may also be used to predict individual outcomes, but the prediction error may be quite large. More to come on prediction error . . .

From Exercise 1: We will use the regression line to predict the final score for a student with a midterm score of 50 points. The regression line is

ŷ = 1.2x − 29

and the prediction is

ŷ = 1.2 · 50 − 29 = 31.

This prediction may be quite far from the true value.


The Role of r in Regression

The correlation coefficient r measures the amount of scattering of points about the regression line.

Case 1 (extreme): A correlation of r = −1 or r = 1 corresponds to a perfect linear relationship. The scatter diagram is a perfect line containing all the points, and that line coincides with the regression line.

Case 2 (extreme): A correlation of r = 0 corresponds to a chaotic scattering of points, which means there is no linear relationship between x and y.

In this case the slope of the regression line is m = r · (s_y / s_x) = 0 and the y-intercept is b = ȳ − m · x̄ = ȳ. Therefore, when r = 0 the regression line is horizontal. Illustrated on the next slide, and in the short simulation after these cases.

Case 3: The closer r is to −1 or 1, the closer the points are to the regression line, and the greater the success of regression in explaining individual responses y for given values of x.
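A small simulation (my own sketch, not from the slides) illustrating Case 2: when the two variables are generated independently, r comes out near 0, the fitted slope is near 0, and the fitted line sits near the horizontal line y = ȳ.

```python
import random
from statistics import mean, pstdev

random.seed(1)
xs = [random.uniform(0, 10) for _ in range(2000)]
ys = [random.uniform(0, 10) for _ in range(2000)]   # generated independently of xs

x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = pstdev(xs), pstdev(ys)
r = sum((x - x_bar) / s_x * (y - y_bar) / s_y for x, y in zip(xs, ys)) / len(xs)

m = r * s_y / s_x          # slope is near 0 because r is near 0
b = y_bar - m * x_bar      # so the intercept is close to y_bar
print(round(r, 3), round(m, 3), round(b, 2), round(y_bar, 2))
```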


Horizontal Regression Line

There is no clear pattern for the points to drift up or down. As a result, the correlation coefficient is close to zero and the regression line is horizontal.


Interpretation of the Slope

The slope of the regression line shows how the average value of y changes when x increases by 1 unit.

The units of measurement for the slope: (units of y) / (units of x).

The expression for the slope, m = r · (s_y / s_x), implies that when x changes by one standard deviation, the predicted value ŷ changes by r standard deviations of y.

Because −1 ≤ r ≤ 1, the change in ŷ, measured in standard deviation units, is less than or equal to the change in x. As the correlation between x and y decreases, the prediction ŷ changes more slowly in response to changes in x. This effect is sometimes called attenuation.


Interpretation of the Slope

The slope of the regression line is proportional to the correlation coefficient r, but not equal to it unless s_y = s_x.


Interpretation of the Intercept

The intercept of the regression line is the predicted value for y when x equals 0.

Quite often the intercept does not have any physical interpretation!

In Exercise 1 the intercept of the line is −29. Does this mean that the average final score for students who got 0 on the midterm will be −29?!

NO WAY!!! First of all, 0 on the midterm is not truly a score; it just means that a student missed the test. Everyone who showed up is expected to earn some points on the test. A reasonable range of x values which we can use to predict values of y would be, say, from 25 to 100 points. Going beyond this range, or extrapolating, is risky: we can obtain a nonsensical prediction!


The Graph of Averages

The Graph of Averages is constructed from the scatter diagram. For a given x-value, the y-value on the graph of averages is the average of the y-values associated to x on the scatter diagram.

Example: Assume that three points are plotted on the scatter diagram with x-value 7:

(7, 2)   (7, 5)   (7, 8)

The graph of averages will have the point (7, 5), as 5 is the mean of {2, 5, 8}. (The sketch at the end of this slide shows the same construction in code.)

The regression line for the graph of averages coincides with the regression line of the original scatter diagram.

You do not lose information by pre-smoothing data by constructing the graph of averages.

If the graph of averages does not follow a straight line:

There may be extreme outliers.

The relationship may be non-linear.
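A minimal sketch (my own, with made-up points) of the construction: group the y-values by x-value and average each group.

```python
from collections import defaultdict
from statistics import mean

# Made-up scatter-diagram points (x, y), including the example above.
points = [(7, 2), (7, 5), (7, 8), (9, 4), (9, 7)]

groups = defaultdict(list)
for x, y in points:
    groups[x].append(y)

# For each x-value, average the y-values associated with it.
graph_of_averages = {x: mean(ys) for x, ys in sorted(groups.items())}
print(graph_of_averages)   # {7: 5, 9: 5.5}
```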


Example: Non-linear Association

Year          1999     2000     2001     2002     2003
AIDS Cases    41,356   41,267   40,833   41,289   43,171

The following parabola seems a nice fit!

Figure: Nonlinear relationship.

For those curious, the equation of the parabola used above is

y = 345.1428571x² − 1705.657143x + 42903.6.

SOURCE: US Dept. of Health and Human Services, Center for Disease Control and Prevention, HIV/AIDS Surveillance, 2003.
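For those who want to reproduce the fit, a sketch using numpy.polyfit (my own; coding the years as x = 1 through x = 5 is an assumption that appears to match the quoted equation):

```python
import numpy as np

# AIDS cases from the table, with x = 1 for 1999 through x = 5 for 2003.
x = np.array([1, 2, 3, 4, 5])
y = np.array([41356, 41267, 40833, 41289, 43171])

a, b, c = np.polyfit(x, y, deg=2)   # least-squares parabola y = a*x^2 + b*x + c
print(a, b, c)                      # roughly 345.14, -1705.66, 42903.6
```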


Regression Effect

The Regression Effect describes the tendency of individuals with extreme values to move back towards the mean when retested.

The regression effect is witnessed when an experiment is repeated and an individual's progress is tracked.

On average the top group will score lower on a second experiment, and on average the bottom group will score higher on a second experiment.

Example: One would expect a student who received a 95 on a midterm with a class average of 72 and a standard deviation of 8 points to score significantly lower on the final. Outliers tend towards the mean!


Regression Effect

Example (Page 167): How would you predict a student's rank in a mathematics class?

Without any additional knowledge we would be safe to assume that a student earns the mean or median among all grades; the expected (central) values.

Correlation enters into our predictions as additional information. As physics and mathematics can be considered similar subjects, one can assume that a student's success in physics would correlate to a student's success in mathematics. Therefore, we are able to confidently predict the rank of a mathematics student by their rank in a physics class.

On the other hand, pottery and mathematics are not similar subjects and one wouldn't expect a correlation in student success. Therefore, we are unable to confidently predict the rank of a mathematics student by their rank in a pottery class.


Example: (Father-son heights)

In this data the average height of fathers is x̄ = 68 inches and the average height of sons is ȳ = 69 inches.

One of the vertical strips in the figure corresponds to the fathers who are 72 inches tall. The average height of their sons is 71 inches.

The other vertical strip corresponds to fathers who are 64 inches tall. The average height of their sons is 67 inches.


What’s wrong with this?

In the following figure, we see a near-perfect fit of a positive correlation for Android's market share plotted against time.

How could such a neat fit go wrong?

Following the green line, Android would have 120% of the market share by 2014!


Regression Fallacy

The Regression Effect assures us that it is natural for extremes to become average.

The Regression Fallacy is a fallacy by which individuals conjecture a cause for an extreme to become average.

For example: being ill-prepared for an extreme event, surviving it, becoming prepared for another occurrence of the extreme event, and then conjecturing that the current preparations are what prevent the extreme event from repeating.


Example 2, p. 166 of the textbook

A university has made a statistical analysis of the relationship between Math SAT scores (ranging from 200 to 800) and first-year GPAs (ranging from 0 to 4.0), for students who complete the first year.

average SAT score = 550, SD = 80
average first-year GPA = 2.6, SD = 0.6, r = 0.4

The scatter diagram is football-shaped. Suppose the percentile rank of one student on the SAT is 90%, among the first-year students. Predict his percentile rank on first-year GPA.

We will make the following assumptions:

The distribution of SAT scores is approximately normal with mean x̄ = 550 and standard deviation s_x = 80.

The distribution of GPA values is approximately normal with mean ȳ = 2.6 and s_y = 0.6.


Example 2, p. 166 of the textbook, continued

Solution: Let's find the z-score for a student who is placed in the 90th percentile. From the normal table z_x ≈ 1.3. We will use the regression method to predict the student's GPA value from his SAT score. From the equation of the regression line,

ẑ_y = r · z_x = 0.4 · 1.3 = 0.52 ≈ 0.5,

which is the predicted student's standard score. From the normal table we find that the point with the z-score equal to 0.50 is approximately the 69th percentile.

The first-year GPA percentile rank of a student who is at the 90th percentile of the SAT distribution is predicted (by regression) to be 69%.

90th percentile of the SAT distribution but only 69th percentile of the GPA distribution. WHY? It's due to the regression effect.
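The same calculation can be done without a printed normal table, for example with Python's statistics.NormalDist (my own sketch; the result differs slightly from the slide because the table values above are rounded):

```python
from statistics import NormalDist

r = 0.4
std_normal = NormalDist()               # standard normal: mean 0, sd 1

z_x = std_normal.inv_cdf(0.90)          # about 1.28 (the slides round to 1.3)
z_y_hat = r * z_x                       # predicted standardized GPA, about 0.51
percentile = std_normal.cdf(z_y_hat)    # about 0.70, close to the 69% above
print(round(z_x, 2), round(z_y_hat, 2), round(percentile, 2))
```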


Notes on Linear Regression

The least squares regression line always passes through the center of the data: the point of averages (x̄, ȳ).

By swapping the dependent and independent status of the variables, a second regression line can be found.

We have seen that the correlation coefficient r is symmetric: if we switch the axes, we will get the same correlation.

This is not true for regression. When switching the roles of x and y, you get a different regression equation, as there is not necessarily an equality between x̄ and ȳ or between s_x and s_y.

The equations for y regressed on x and for x regressed on y are generally different!


Twin Regression Lines

Figure: Data on lean body mass and metabolic rate. The lines are the least-squares regression lines of the rate on the mass (solid/red) and of the mass on the rate (dashed/black).


Boston Marathon versus Temperature

The average finish time in minutes and the temperature during the race in Fahrenheit for the Boston Marathon are listed below:

Year   Avg. Finish Time (minutes)   Temperature (F)
2000   221                          49
2001   226                          54
2002   221                          55
2003   235                          65
2004   253                          85
2005   237                          68
2006   230                          54
2007   234                          49
2008   231                          53
2009   229                          50
2010   230                          53


Boston Marathon versus Temperature

                      Finish Time   Temperature
Mean                  231.5         57.7
Standard Deviation    8.4           10.4

We have a properly scaled scatter plot:


Boston Marathon versus Temperature

We have two possible regression lines:

y = 1.05x − 185        x = 0.68y + 192
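These two lines can be recomputed directly from the table of finish times and temperatures (a sketch of my own). Here x denotes finish time and y temperature, which appears to be the labeling used above, since it is the one for which both quoted lines pass through the point of averages; the small differences from 1.05 and −185 evidently come from the rounding of r, the means, and the SDs on the previous slides.

```python
from statistics import mean, pstdev

finish = [221, 226, 221, 235, 253, 237, 230, 234, 231, 229, 230]   # minutes
temp   = [49, 54, 55, 65, 85, 68, 54, 49, 53, 50, 53]              # degrees F

x_bar, s_x = mean(finish), pstdev(finish)    # about 231.5 and 8.4
y_bar, s_y = mean(temp), pstdev(temp)        # about 57.7 and 10.4
n = len(finish)
r = sum((x - x_bar) / s_x * (y - y_bar) / s_y
        for x, y in zip(finish, temp)) / n   # about 0.86

# Temperature regressed on finish time (y on x):
m_yx = r * s_y / s_x
b_yx = y_bar - m_yx * x_bar                  # roughly y = 1.06x - 188
# Finish time regressed on temperature (x on y):
m_xy = r * s_x / s_y
b_xy = x_bar - m_xy * y_bar                  # roughly x = 0.69y + 192
print(round(m_yx, 2), round(b_yx, 1), round(m_xy, 2), round(b_xy, 1))
```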


Football-Shaped Clustering

The data in a scatter plot for variables with a linear correlation clusters in a football (elliptical) shape estimated by three lines:

The solid line is the regression line for y on x.

The dashed line is the SD (standard deviation) line.

The dotted line is the regression line for x on y.


The SD Line

The SD line passes through the point of averages (x̄, ȳ).

In fact, the SD line, the regression line of y on x, and the regression line of x on y all pass through the point of averages.

The slope of the SD line is the ratio of the standard deviations, s_y / s_x (taken with the same sign as r).

Compare this slope to the regression line slope of r · (s_y / s_x) and recall that −1 ≤ r ≤ 1. This implies that the regression line of y on x is flatter (has a smaller absolute slope) than the SD line. Likewise, measured in its own orientation, the regression line of x on y has slope r · (s_x / s_y), which is flatter than that orientation's SD line slope s_x / s_y.
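A tiny numeric check of this comparison (my own sketch, reusing the rounded marathon summary statistics as example values):

```python
r, s_x, s_y = 0.85, 8.4, 10.4          # rounded marathon summary statistics

sd_slope = s_y / s_x                   # about 1.24
reg_slope_y_on_x = r * s_y / s_x       # about 1.05, flatter than the SD line
reg_slope_x_on_y = r * s_x / s_y       # about 0.69, flatter than s_x / s_y = 0.81

print(abs(reg_slope_y_on_x) <= abs(sd_slope))    # True, since |r| <= 1
print(abs(reg_slope_x_on_y) <= abs(s_x / s_y))   # True, since |r| <= 1
```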
