Chapter 13 Simple Linear Regression and Correlation: Inferential Methods.


Suppose we were to investigate the relationship between y = the first-year college grade point average and x = high school grade point average.

Is the first-year college grade point average determined solely by the high school grade point average?

A relationship in which the value of y is completely determined by the value of an independent variable x is called a deterministic relationship.

The first-year college grade point average and the high school grade point average do NOT have a deterministic relationship.

A description of the relationship between two variables that are not deterministically related can be given by a probabilistic model.

The equation for an additive probabilistic model is:

$y = \text{deterministic function of } x + \text{random deviation} = f(x) + e$

where e is an “error” variable.

The simple linear regression model assumes that there is a line with y-intercept α and slope β, called the population regression line.

When a value of the independent variable x is fixed and an observation on the dependent variable y is made,

$y = \alpha + \beta x + e$

[Figure: population regression line (slope β, y-intercept α), with the random deviations e1 and e2 of two observations from the line.]

Without the random deviation e in the equation, all observed (x, y) points would fall exactly on the population regression line.

Basic Assumptions of the Simple Linear Regression Model

1. The distribution of e at any particular x value has mean value 0. That is, μ_e = 0.

2. The standard deviation of e is the same for any particular value of x. This standard deviation is denoted by σ.

3. The distribution of e at any particular value of x is normal.

4. The random deviations e1, e2, . . ., en associated with different observations are independent of one another.

Let’s look at the heights and weights of a population of adult women.

[Figure: weight versus height (60 to 68) for adult women, with a normal curve of weights drawn at each height.]

How much would an adult female weigh if she were 5 feet tall?

Weights of women who are 5 feet tall will vary. In other words, there is a distribution of weights for adult females who are 5 feet tall.

Are some of these weights more likely than others? What would this distribution look like?

This distribution is normal.

What would you expect for other heights?

We want the standard deviations of all these normal distributions to be the same.

Where would you expect the population regression line to be?

Basic Assumptions of the Simple Linear Regression Model Revisited

1. The distribution of e at any particular x value has mean value 0. That is, μ_e = 0.

2. The standard deviation of e is the same for any particular value of x. This standard deviation is denoted by σ.

3. The distribution of e at any particular value of x is normal.

4. The random deviations e1, e2, . . ., en associated with different observations are independent of one another.

Remember that the variable e is a measure of the extent to which individual y values deviate from the population regression line.

For any particular x value, the standard deviation of y equals the standard deviation of e.

The distribution of y at any particular value of x is normal.
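A small simulation can make these assumptions concrete. The sketch below is illustrative only: the parameter values and the helper name `simulate_y` are hypothetical choices, not taken from the chapter. It draws many y values at one fixed x and checks that their mean is near the population regression line and that their standard deviation is near the standard deviation of e.

```python
import random
import statistics

# Hypothetical population parameters (illustrative assumptions only).
ALPHA, BETA, SIGMA = -100.0, 3.5, 10.0

def simulate_y(x, n, seed=0):
    """Draw n observations y = ALPHA + BETA*x + e, with e ~ Normal(0, SIGMA)."""
    rng = random.Random(seed)
    return [ALPHA + BETA * x + rng.gauss(0.0, SIGMA) for _ in range(n)]

ys = simulate_y(x=60, n=20000)
mean_y = statistics.mean(ys)   # should be close to ALPHA + BETA*60 = 110
sd_y = statistics.stdev(ys)    # should be close to SIGMA = 10
print(round(mean_y, 1), round(sd_y, 1))
```

Repeating the call with a different x shifts the mean along the line but leaves the spread essentially unchanged, which is what assumptions 1 and 2 describe.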

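We use $\hat{y} = a + bx$ to estimate the true population regression line.

b = point estimate of β = $\frac{S_{xy}}{S_{xx}}$

a = point estimate of α = $\bar{y} - b\bar{x}$

where

$S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}$ and $S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$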

Medical researchers have noted that adolescent females are much more likely to deliver low-birth-weight babies than are adult females. Because low-birth-weight babies have higher mortality rates, a number of studies have examined the relationship between birth weight and mother’s age for babies born to young mothers.

The following data are on x = maternal age (in years) and y = birth weight of baby (in grams).

x:   15    17    18    15    16    19    17    16    18    19
y: 2289  3393  3271  2648  2897  3327  2970  2535  3138  3573

Sketch a scatterplot of these data.

[Figure: scatterplot of baby’s weight (g) versus mother’s age (yrs).]

The scatterplot shows a linear pattern and the spread in the y values appears to be similar across the range of x values. This supports the appropriateness of the simple linear regression model.

Birth Weight Continued . . .

The following data are on x = maternal age (in years) and y = birth weight of baby (in grams).

x:   15    17    18    15    16    19    17    16    18    19
y: 2289  3393  3271  2648  2897  3327  2970  2535  3138  3573

The estimated regression line is

$\hat{y} = -1163.45 + 245.15x$

[Figure: scatterplot of baby’s weight (g) versus mother’s age (yrs) with the estimated regression line.]

The mean weight of babies increases by approximately 245.15 grams for each increase of 1 year in the mother’s age.

What is the point estimate for the mean weight of babies born to 18-year-old mothers?

$\hat{y} = -1163.45 + 245.15(18) = 3249.25$ grams

This is the point estimate for the mean weight of all babies born to 18-year-old mothers.

This is also the prediction of the weight of a single baby born to a mother 18 years of age.
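The estimated line for the birth weight data can be checked with a few lines of code. This is a minimal sketch using only the standard library; the function name `least_squares` is a hypothetical helper, and the formulas are the computational forms of S_xy and S_xx.

```python
# Birth weight data from the example (x = maternal age, y = birth weight in grams).
x = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
y = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]

def least_squares(x, y):
    """Return (a, b), the intercept and slope of the least-squares line."""
    n = len(x)
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    sxx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
    b = sxy / sxx                      # point estimate of the slope
    a = sum(y) / n - b * sum(x) / n    # point estimate of the intercept
    return a, b

a, b = least_squares(x, y)
pred_18 = a + b * 18   # point estimate of mean weight for 18-year-old mothers
print(round(a, 2), round(b, 2), round(pred_18, 2))
```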

The statistic for estimating the variance σ² is

$s_e^2 = \frac{SSResid}{n - 2}$

where

$SSResid = \sum (y - \hat{y})^2$

The estimate for the standard deviation σ is

$s_e = \sqrt{s_e^2}$

The subscript e reminds us that we are estimating the variance of the “errors” or residuals.

Note that the number of degrees of freedom associated with estimating σ² or σ in simple linear regression is df = n – 2.

Why n – 2? Since we must estimate both α and β in the regression line, we reduce the sample size n by 2.

Recall that the coefficient of determination, r², is the proportion of observed y variation that is attributed to the model relationship.

Birth Weight Revisited . . .

The following data are on x = maternal age (in years) and y = birth weight of baby (in grams).

x:   15    17    18    15    16    19    17    16    18    19
y: 2289  3393  3271  2648  2897  3327  2970  2535  3138  3573

[Figure: scatterplot of baby’s weight (g) versus mother’s age (yrs) with the estimated regression line.]

$s_e = 205.308$

For a particular mother’s age, the typical deviation of a baby’s weight from the population regression line is approximately 205 grams.

$r^2 = .781$

Approximately 78% of the observed variability in the weight of babies can be explained by this model.
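Both summary quantities for the birth weight data can be recomputed directly from the ten data pairs. The sketch below is a plain standard-library calculation; the variable names are illustrative.

```python
import math

x = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
y = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = sxy / sxx
a = y_bar - b * x_bar

# Residual and total sums of squares.
ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
ss_total = sum((yi - y_bar) ** 2 for yi in y)

s_e = math.sqrt(ss_resid / (n - 2))   # estimate of sigma, df = n - 2
r_sq = 1 - ss_resid / ss_total        # coefficient of determination
print(round(s_e, 3), round(r_sq, 3))
```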

Properties of the Sampling Distribution of b

When the four basic assumptions of the simple linear regression model are satisfied, the following statements are true:

1. The mean value of b is β. That is, μ_b = β.

2. The standard deviation of the statistic b is $\sigma_b = \frac{\sigma}{\sqrt{S_{xx}}}$

3. The statistic b has a normal distribution (a consequence of the model assumption that the random deviation e is normally distributed).

Since β is almost always unknown, it must be estimated from independently selected observations. The slope b of the least-squares line gives a point estimate for β.

Since σ is usually unknown, the estimated standard deviation of the statistic b is

$s_b = \frac{s_e}{\sqrt{S_{xx}}}$

Confidence Interval for β

When the four basic assumptions of the simple linear regression model are satisfied, a confidence interval for β, the slope of the population regression line, has the form

$b \pm (t \text{ critical value}) \cdot s_b$

where the t critical value is based on df = n – 2.

Is cardiovascular fitness (as measured by time to exhaustion from running on a treadmill) related to an athlete’s performance in a 20-km ski race?

The following data on x = treadmill time to exhaustion (in minutes) and y = 20-km ski time (in minutes) were taken from the article “Physiological Characteristics and Performance of Top U.S. Biathletes” (Medicine and Science in Sports and Exercise, 1995):

x:  7.7   8.4   8.7   9.0   9.6   9.6  10.0  10.2  10.4  11.0  11.7
y: 71.0  71.4  65.0  68.7  64.4  69.4  63.0  64.6  66.9  62.6  61.7

Sketch a scatterplot for the data.

[Figure: scatterplot of ski time (min) versus treadmill time (min).]

The plot shows a linear pattern, and the vertical spread of points does not appear to be changing over the range of x values in the sample. If we assume that the distribution of errors at any given x value is approximately normal, then the simple linear regression model seems appropriate.

Biathletes Continued . . .

x = treadmill exhaustion time
y = ski time

Find a 95% confidence interval for the slope of the true regression line.

$b \pm (t \text{ critical value}) \cdot s_b = -2.3335 \pm (2.26)(.591) = (-3.671, -.999)$

We are 95% confident that the true average decrease in ski time associated with a 1-minute increase in treadmill exhaustion time is between 1 minute and 3.7 minutes.
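The interval for the slope can be reproduced from the raw biathlete data. The sketch below uses only the standard library and takes the t critical value 2.26 (95% confidence, df = 9) from the slide rather than computing it; variable names are illustrative.

```python
import math

x = [7.7, 8.4, 8.7, 9.0, 9.6, 9.6, 10.0, 10.2, 10.4, 11.0, 11.7]
y = [71.0, 71.4, 65.0, 68.7, 64.4, 69.4, 63.0, 64.6, 66.9, 62.6, 61.7]
T_CRIT = 2.26  # tabled t critical value for 95% confidence, df = 9

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = sxy / sxx
a = y_bar - b * x_bar

ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s_e = math.sqrt(ss_resid / (n - 2))
s_b = s_e / math.sqrt(sxx)            # estimated standard deviation of b

lower, upper = b - T_CRIT * s_b, b + T_CRIT * s_b
print(round(b, 4), round(s_b, 4), round(lower, 3), round(upper, 3))
```

Small differences in the last digit relative to the slide come from rounding of the critical value and of s_b.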

Biathletes Continued . . .

Partial Minitab Output

The regression equation is
Ski time = 88.8 – 2.33 treadmill time

Predictor     Coef      StDev     T       P
Constant      88.796    5.750     15.44   0.000
Treadmill     -2.3335   0.5911    -3.95   0.003

S = 2.188    R-Sq = 63.4%    R-Sq (adj) = 59.3%

Analysis of Variance
Source            DF    SS         MS        F       P
Regression        1     74.630     74.630    15.58   0.003
Residual Error    9     43.097     4.789
Total             10    117.727

Reading the output:

• “Ski time = 88.8 – 2.33 treadmill time” is the equation of the estimated regression line.
• 88.796 is the estimated y-intercept a, and -2.3335 is the estimated slope b.
• 0.5911 is s_b, the estimated standard deviation of b.
• S = 2.188 is s_e.
• R-Sq = 63.4% is 100 × r². R-Sq (adj) is not used in simple linear regression.
• 43.097 is SSResid, 117.727 is SSTo, 4.789 is s_e², and the residual df of 9 is n – 2.
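The entries in the Minitab output are tied together by the formulas introduced earlier. As a rough consistency check (a sketch with illustrative variable names, using only numbers printed in the output), the following recomputes several entries from the others.

```python
import math

# Values read from the partial Minitab output for the biathlete data.
coef_b, stdev_b = -2.3335, 0.5911
ss_regr, ss_resid, ss_total = 74.630, 43.097, 117.727
df_resid = 9  # n - 2 with n = 11

s_e_sq = ss_resid / df_resid      # "MS" for Residual Error
s_e = math.sqrt(s_e_sq)           # "S"
r_sq = 1 - ss_resid / ss_total    # "R-Sq" expressed as a proportion
t_stat = coef_b / stdev_b         # "T" in the Treadmill row
f_stat = ss_regr / s_e_sq         # "F" in the Regression row

print(round(s_e_sq, 3), round(s_e, 3), round(r_sq, 3),
      round(t_stat, 2), round(f_stat, 2))
```

In simple linear regression the F statistic equals the square of the t statistic for the slope, which is why both rows report the same P-value.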

Summary of Hypothesis Tests Concerning β

Null hypothesis: H0: β = hypothesized value

Test statistic: $t = \frac{b - \text{hypothesized value}}{s_b}$

The test is based on df = n – 2.

Alternative hypothesis and P-value:

Ha: β > hypothesized value: area to the right of t under the appropriate t curve

Ha: β < hypothesized value: area to the left of t under the appropriate t curve

Ha: β ≠ hypothesized value: 2(area to the right of t) if t is positive, or 2(area to the left of t) if t is negative

Often the hypothesized value is zero – this is called the model utility test for simple linear regression.

Summary of Hypothesis Tests Concerning β Continued . . .

Assumptions: For this test to be appropriate, the four basic assumptions of the simple linear regression model must be met:

1. The distribution of e at any particular x value has a mean of 0 (μ_e = 0).

2. The standard deviation of e is σ, which does not depend on x.

3. The distribution of e at any particular x value is normal.

4. The random deviations e1, e2, …, en associated with different observations are independent of one another.

[Figure: weight versus height (60 to 68) with a horizontal least-squares line.]

Suppose the least-squares line is horizontal – would height be useful in predicting weight?

What is the slope of a horizontal line?

A slope of zero means that there is NO linear relationship between x and y!

The Model Utility Test for Simple Linear Regression

The model utility test for simple linear regression is the test of

H0: β = 0

Ha: β ≠ 0

Test statistic: $t = \frac{b}{s_b}$

The null hypothesis specifies that there is no useful linear relationship between x and y.

Biathletes Revisited . . .

x = treadmill exhaustion time
y = ski time

x:  7.7   8.4   8.7   9.0   9.6   9.6  10.0  10.2  10.4  11.0  11.7
y: 71.0  71.4  65.0  68.7  64.4  69.4  63.0  64.6  66.9  62.6  61.7

[Figure: scatterplot of ski time (min) versus treadmill time (min).]

Even though the scatterplot indicates a linear relationship between ski time and treadmill time, let’s perform the model utility test.

H0: β = 0

Ha: β ≠ 0

where β is the slope of the population regression line between treadmill time and ski time

α = .05, df = 9

$t = \frac{-2.3335}{0.5911} = -3.95$

P-value = .003

Since the P-value < α, we reject H0. There is sufficient evidence of a linear relationship between treadmill time and ski time.
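The test statistic and its two-sided P-value can be approximated without statistical software. The sketch below is one possible approach, not the method used by Minitab: it integrates the t density numerically with Simpson's rule. The helper names `t_pdf` and `t_upper_tail` are hypothetical, and b and s_b are the values from the Minitab output.

```python
import math

def t_pdf(t, df):
    """Density of the t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Area to the right of t >= 0: 0.5 minus the Simpson's-rule integral over [0, t]."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

b, s_b, df = -2.3335, 0.5911, 9
t_stat = (b - 0) / s_b                         # hypothesized value is 0
p_value = 2 * t_upper_tail(abs(t_stat), df)    # two-sided alternative
print(round(t_stat, 2), round(p_value, 4))
```

Because the P-value is well below .05, the code leads to the same decision as the slide: reject H0.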

Biathletes Revisited . . .

Partial Minitab Output

The regression equation is
Ski time = 88.8 – 2.33 treadmill time

Predictor     Coef      StDev     T       P
Constant      88.796    5.750     15.44   0.000
Treadmill     -2.3335   0.5911    -3.95   0.003

S = 2.188    R-Sq = 63.4%    R-Sq (adj) = 59.3%

Analysis of Variance
Source            DF    SS         MS        F       P
Regression        1     74.630     74.630    15.58   0.003
Residual Error    9     43.097     4.789
Total             10    117.727

In the Treadmill row, Coef ÷ StDev = T: the t test statistic is -2.3335 ÷ 0.5911 = -3.95, and the P column gives the P-value.

Statistical software usually performs the model utility test with

H0: β = 0 versus Ha: β ≠ 0

Checking Model Adequacy

The simple linear regression model is

y = α + βx + e

where e represents the random deviation of an observed y value from the population regression line α + βx.

The assumptions for simple linear regression are based on this random deviation e.

If we knew the deviations e1, e2, …, en, we could examine them for any inconsistencies with model assumptions.

However, we do not know the deviations e1, e2, …, en because the population regression line is unknown.

Therefore, we must estimate these deviations using the residuals from the estimated line. Thus, we use the residuals to check our assumptions.

Residual Analysis

• Standardize the residuals to look at their magnitudes:

$\text{standardized residual} = \frac{\text{residual}}{\text{estimated standard deviation of residual}}$

Most statistical software will perform this calculation. It is tedious to do by hand.

• Create a residual plot (from Chapter 5) or a standardized residual plot (which is a plot of the (x, standardized residual) pairs).

A desirable plot is one that exhibits no particular pattern (such as curvature or much greater spread in one part of the plot than the other) and that has no point that is far removed from all the others.

Any observation with a large positive or negative residual should be examined carefully for any error in recording data, nonstandard experimental condition, or atypical experimental unit.

A Look at Standardized Residual Plots

This is a desirable plot in that it exhibits no pattern and has no point that lies far away from the other points.

This plot exhibits a curved pattern, which indicates that the fitted model should be changed to incorporate the curvature.

In this plot, the standard deviation of the residuals increases as the x values increase. While a straight-line model might still be appropriate, the best-fit line should be found using weighted least squares. Consult your local statistician!

Both of these plots contain points far away from the others. These points can have substantial effects on estimates of α and β as well as other quantities.

Biathletes Revisited . . .

r = residuals
sr = standardized residuals (from Minitab)

x    7.7    8.4    8.7    9.0    9.6    9.6   10.0   10.2   10.4   11.0   11.7
y   71.0   71.4   65.0   68.7   64.4   69.4   63.0   64.6   66.9   62.6   61.7
r   0.17   2.21  -3.49   0.91  -1.99   3.01  -2.46  -0.39   2.37  -0.53   0.21
sr  0.10   1.13  -1.74   0.44  -0.96   1.44  -1.18  -0.19   1.16  -0.27   0.12

[Figure: scatterplot of ski time (min) versus treadmill time.]

Let’s look at a normal probability plot of the standardized residuals.

[Figure: normal probability plot of standardized residual versus normal score.]

The normal probability plot of the standardized residuals is quite straight. There is no reason to doubt the plausibility that the random deviations e are normally distributed.

Biathletes Continued . . .

r = residuals
sr = standardized residuals (from Minitab)

x    7.7    8.4    8.7    9.0    9.6    9.6   10.0   10.2   10.4   11.0   11.7
y   71.0   71.4   65.0   68.7   64.4   69.4   63.0   64.6   66.9   62.6   61.7
r   0.17   2.21  -3.49   0.91  -1.99   3.01  -2.46  -0.39   2.37  -0.53   0.21
sr  0.10   1.13  -1.74   0.44  -0.96   1.44  -1.18  -0.19   1.16  -0.27   0.12

Sketch a standardized residual plot.

[Figure: standardized residuals versus treadmill time.]

The standardized residual plot does not show evidence of any pattern or of increasing spread.

Sketch a residual plot.

[Figure: residuals versus treadmill time.]

Notice that these two plots have similar appearances.

Remember that residuals can also be plotted against the predicted values ŷ.
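The r and sr rows of the table can be recomputed from the data. The slides do not state how the estimated standard deviation of a residual is obtained; the sketch below assumes the usual simple-linear-regression expression s_e·sqrt(1 − 1/n − (x_i − x̄)²/S_xx), which appears to reproduce the Minitab values to two decimals. Variable names are illustrative.

```python
import math

x = [7.7, 8.4, 8.7, 9.0, 9.6, 9.6, 10.0, 10.2, 10.4, 11.0, 11.7]
y = [71.0, 71.4, 65.0, 68.7, 64.4, 69.4, 63.0, 64.6, 66.9, 62.6, 61.7]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = sxy / sxx
a = y_bar - b * x_bar

# Residual = observed y minus predicted y.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
s_e = math.sqrt(sum(e * e for e in residuals) / (n - 2))

# Assumed formula for the estimated standard deviation of the i-th residual.
std_resid = [
    e / (s_e * math.sqrt(1 - 1 / n - (xi - x_bar) ** 2 / sxx))
    for xi, e in zip(x, residuals)
]

for xi, e, sr in zip(x, residuals, std_resid):
    print(f"{xi:5.1f} {e:6.2f} {sr:6.2f}")
```

Plotting `std_resid` against `x` gives the standardized residual plot described above.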

Optional Topics

Inferences Based on the Estimated Regression Line and Inference about the Population Correlation Coefficient

Properties of the Sampling Distribution of a + bx for a Fixed Value of x

Let x* denote a particular value of the independent variable x. When the four basic assumptions of the simple linear regression model are satisfied, the sampling distribution of the statistic a + bx* has the following properties:

1) The mean value of a + bx* is α + βx*, so a + bx* is an unbiased statistic for estimating the mean y value when x = x*.

2) The standard deviation of the statistic a + bx*, denoted by σ_{a+bx*}, is given by

$\sigma_{a+bx^*} = \sigma \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$

3) The distribution of a + bx* is normal.

The farther x* is from the center x̄, the larger σ_{a+bx*} is.

Since σ is unknown, σ_{a+bx*} can be estimated by s_{a+bx*}, which substitutes s_e in place of σ.

Confidence Interval for a Mean y Value

When the basic assumptions of the simple linear regression model are met, a confidence interval for α + βx*, the mean y value when x has value x*, is

$a + bx^* \pm (t \text{ critical value}) \cdot s_{a+bx^*}$

Because s_{a+bx*} is larger the farther x* is from x̄, the confidence interval becomes wider as x* moves away from the center of the data.

Physical characteristics of sharks are of interest to surfers and scuba divers as well as to marine researchers. The data on x = length (in feet) and y = jaw width (in inches) for 44 sharks were found in various articles appearing in the magazines Skin Diver and Scuba News. (These data are found on page 778 of the text.)

Because it is difficult to measure jaw width in living sharks, researchers would like to determine whether it is possible to estimate jaw width from body length, which is more easily measured.

This scatterplot of the data shows a linear pattern and is consistent with use of the simple linear regression model.

Jaws Continued . . .

The regression equation is
Jaw Width = 0.69 + 0.963 Length

Predictor     Coef       StDev      T       P
Constant      0.688      1.299      0.53    0.599
Length        0.96345    0.08228    11.71   0.000

S = 1.376    R-Sq = 76.6%    R-Sq (adj) = 76.0%

The simple linear regression model explains 76.6% of the variability in jaw width.

The model utility test confirms the usefulness of this model.

Let’s use the data to compute a 90% confidence interval for the mean jaw width for 15-foot-long sharks.

The point estimate is

$a + b(15) = .688 + .96345(15) = 15.140$ in.

The estimated standard deviation of a + b(15) is

$s_{a+b(15)} = 1.376 \sqrt{\frac{1}{44} + \frac{(15 - 15.586)^2}{279.8718}} = .213$

The 90% confidence interval is

$a + b(15) \pm (t \text{ critical value}) \cdot s_{a+b(15)} = 15.140 \pm (1.68)(.213) = (14.782, 15.498)$

Based on these sample data, we can be 90% confident that the mean jaw width for sharks of length 15 feet is between 14.782 and 15.498 inches.
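Because the raw shark data are not reproduced here, the interval can only be checked from the summary statistics quoted in the example. The sketch below does that with the standard library; the function name `mean_ci` is a hypothetical helper, and the t critical value 1.68 (90% confidence, df = 42) is taken from the slide.

```python
import math

# Summary statistics quoted in the shark example.
n, x_bar, sxx = 44, 15.586, 279.8718
a, b, s_e = 0.688, 0.96345, 1.376
T_CRIT = 1.68  # tabled t critical value for 90% confidence, df = 42

def mean_ci(x_star):
    """Confidence interval for the mean y value when x = x_star."""
    estimate = a + b * x_star
    s_fit = s_e * math.sqrt(1 / n + (x_star - x_bar) ** 2 / sxx)
    margin = T_CRIT * s_fit
    return estimate - margin, estimate + margin

lower, upper = mean_ci(15)
print(round(lower, 3), round(upper, 3))
```

Calling `mean_ci` with a length farther from 15.586 feet gives a wider interval, which illustrates why the interval widens away from the center of the data.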

Prediction Interval for a Single y Value

When the basic assumptions of the simple linear regression model are met, a prediction interval for y*, a single y observation made when x = x*, has the form

$a + bx^* \pm (t \text{ critical value}) \sqrt{s_e^2 + s_{a+bx^*}^2}$

The prediction interval and the confidence interval are centered at exactly the same place, a + bx*.

The prediction interval is wider than the confidence interval due to the addition of s_e² under the square-root symbol.

Jaws Revisited . . .

Suppose that we were interested in predicting the jaw width of a single shark of length 15 feet.

$a + b(15) = .688 + .96345(15) = 15.140$

$s_e^2 = 1.376^2 = 1.8934$

$s_{a+b(15)}^2 = .213^2 = .0454$

The 90% prediction interval is

$a + b(15) \pm (t \text{ critical value}) \sqrt{s_e^2 + s_{a+b(15)}^2} = 15.140 \pm (1.68)\sqrt{1.9388} = (12.801, 17.479)$

We can be 90% confident that an individual shark of length 15 feet will have a jaw width between 12.801 and 17.479 inches.

Notice that this interval is much wider than the confidence interval for the mean jaw width.
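The prediction interval can be checked the same way from the quoted summary statistics. This is a sketch under the same assumptions as the slide (t critical value 1.68 for df = 42); `prediction_interval` is a hypothetical helper name.

```python
import math

# Summary statistics quoted in the shark example.
n, x_bar, sxx = 44, 15.586, 279.8718
a, b, s_e = 0.688, 0.96345, 1.376
T_CRIT = 1.68  # tabled t critical value for 90% confidence, df = 42

def prediction_interval(x_star):
    """Prediction interval for a single y observation made when x = x_star."""
    estimate = a + b * x_star
    s_fit_sq = s_e ** 2 * (1 / n + (x_star - x_bar) ** 2 / sxx)
    margin = T_CRIT * math.sqrt(s_e ** 2 + s_fit_sq)   # adds s_e^2 under the root
    return estimate - margin, estimate + margin

pi_lower, pi_upper = prediction_interval(15)
print(round(pi_lower, 3), round(pi_upper, 3))
```

The s_e² term dominates the sum under the square root, so the prediction interval stays wide even when the sample is large.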

Below is a Regression Plot from Minitab showing the confidence interval and the prediction interval for the shark data.

[Figure: Minitab regression plot for the shark data with confidence and prediction bands.]

Notice that the prediction interval is substantially wider than the confidence interval.

Also notice that the confidence interval is very narrow close to x̄, but widens the farther it is from the mean.

A Test for Independence in a Bivariate Normal Population

Null hypothesis: H0: ρ = 0

Test statistic: $t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}}$

The test is based on df = n – 2.

Alternative hypothesis and P-value:

Ha: ρ > 0 (positive dependence): area to the right of t

Ha: ρ < 0 (negative dependence): area to the left of t

Ha: ρ ≠ 0 (dependence): 2(area to the right of t) if t is positive, or 2(area to the left of t) if t is negative

ρ (the Greek letter “rho”) is the population correlation coefficient. It assesses the extent of any linear relationship in the population. ρ must be between -1 and 1.

Many investigators are interested in whether ANY relationship exists between x and y. That is, are x and y independent of each other?

However, ρ = 0 is NOT equivalent to x and y being independent except in the case of a bivariate normal population.

A bivariate normal population is one where for any fixed x value, the distribution of associated y values is normal, and for any fixed y value, the distribution of x values is normal. An example would be the height x and weight y of American adult males.

A Test for Independence in a Bivariate Normal Population

Assumptions: r is the correlation coefficient for a random sample from a bivariate normal population.

One way to check whether the population is plausibly bivariate normal is to construct individual normal probability plots of the x and y variables.

The relationship between sleep duration and the level of the hormone leptin (a hormone related to energy intake and energy expenditure) in the blood was investigated. Average nightly sleep (x, in hours) and blood leptin level (y) were recorded for each person in a sample of 716 participants in the Wisconsin Sleep Cohort Study. The sample correlation coefficient was r = 0.11. Does this support the claim that short sleep duration is associated with reduced leptin? Use α = .01.

State the hypotheses.

H0: ρ = 0

Ha: ρ > 0

where ρ = the correlation between average nightly sleep and blood leptin level for the population of adult Americans

To verify the assumptions, we would look at normal probability plots of the x values and of the y values. However, the data are not available, so we will assume that a bivariate normal population is reasonable. We will also assume that it is reasonable to regard the sample of participants as representative of the population of adult Americans.

Test statistic: $t = \frac{.11}{\sqrt{\frac{1 - (.11)^2}{714}}} = 2.96$

P-value = .0015, df = 714, α = .01

Sleepless Nights Continued . . .

H0: ρ = 0

Ha: ρ > 0

where ρ = the correlation between average nightly sleep and blood leptin level for the population of adult Americans

Test statistic: $t = \frac{.11}{\sqrt{\frac{1 - (.11)^2}{714}}} = 2.96$

P-value = .0015, df = 714, α = .01

Since the P-value < .01, we reject H0. There is evidence to suggest that there is a positive association (perhaps a weak one, since r = .11) between sleep duration and blood leptin level.

Note: the hypothesis of no linear relationship (H0: β = 0) can also be used to test for independence in a bivariate normal population.
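The test statistic for the sleep and leptin example depends only on r and n, so it is easy to verify. In the sketch below the P-value is approximated with the standard normal curve rather than the t curve; that substitution is an assumption made here for simplicity, reasonable only because df = 714 is large.

```python
import math

r, n = 0.11, 716
df = n - 2

# t = r / sqrt((1 - r^2) / (n - 2))
t_stat = r / math.sqrt((1 - r ** 2) / df)

# Normal approximation to the upper-tail area (assumption: df is large enough
# that the t curve is nearly indistinguishable from the standard normal curve).
p_approx = 0.5 * math.erfc(t_stat / math.sqrt(2))

print(round(t_stat, 2), round(p_approx, 4))
```

With a sample this large, even a weak correlation such as r = .11 produces a small P-value, which is why the conclusion describes the association as statistically detectable but possibly weak.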
