Chapter 9 - National Paralegal College for...Copyright © 2015, 2012, and 2009 Pearson Education,...

Post on 11-Apr-2020



Copyright © 2015, 2012, and 2009 Pearson Education, Inc. 1

Chapter 9: Correlation and Regression

Correlation

Correlation

• A relationship between two variables.

• The data can be represented by ordered pairs (x, y), where x is the independent (or explanatory) variable and y is the dependent (or response) variable.

Correlation

A scatter plot can be used to determine whether a linear (straight line) correlation exists between two variables.

Example:

x    1    2    3    4    5
y    –4    –2    –1    0    2

[Figure: scatter plot of the five (x, y) pairs]

Types of Correlation

• Negative linear correlation: as x increases, y tends to decrease.
• Positive linear correlation: as x increases, y tends to increase.
• No correlation
• Nonlinear correlation

[Figure: four scatter plots, one for each type of correlation]

Example: Constructing a Scatter Plot

An economist wants to determine whether there is a linear relationship between a country’s gross domestic product (GDP) and carbon dioxide (CO2) emissions. The data are shown in the table. Display the data in a scatter plot and determine whether there appears to be a positive or negative linear correlation or no linear correlation. (Source: World Bank and U.S. Energy Information Administration)

GDP (trillions of $), x    CO2 emissions (millions of metric tons), y
1.6    428.2
3.6    828.8
4.9    1214.2
1.1    444.6
0.9    264.0
2.9    415.3
2.7    571.8
2.3    454.9
1.6    358.7
1.5    573.5

Solution: Constructing a Scatter Plot

There appears to be a positive linear correlation. As the gross domestic products increase, the carbon dioxide emissions tend to increase.

Example: Constructing a Scatter Plot Using Technology

Old Faithful, located in Yellowstone National Park, is the world’s most famous geyser. The duration (in minutes) of several of Old Faithful’s eruptions and the times (in minutes) until the next eruption are shown in the table. Using a TI-83/84, display the data in a scatter plot. Determine the type of correlation.

Duration, x    Time, y    Duration, x    Time, y
1.8    56    3.78    79
1.82    58    3.83    85
1.9    62    3.88    80
1.93    56    4.1    89
1.98    57    4.27    90
2.05    57    4.3    89
2.13    60    4.43    89
2.3    57    4.47    86
2.37    61    4.53    89
2.82    73    4.55    86
3.13    76    4.6    92
3.27    77    4.63    91
3.65    77

Solution: Constructing a Scatter Plot Using Technology

From the scatter plot, it appears that the variables have a positive linear correlation: as the durations of the eruptions increase, the times until the next eruption tend to increase.

Correlation Coefficient

Correlation coefficient

• A measure of the strength and the direction of a linear relationship between two variables.

• The symbol r represents the sample correlation coefficient.

• A formula for r is

$$ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}} $$

where n is the number of data pairs.

• The population correlation coefficient is represented by ρ (rho).

Correlation Coefficient

• The range of the correlation coefficient is -1 to 1.

• If r = -1, there is a perfect negative correlation.

• If r is close to 0, there is no linear correlation.

• If r = 1, there is a perfect positive correlation.

Linear Correlation

• Strong negative correlation: r = –0.91

• Strong positive correlation: r = 0.88

• Weak positive correlation: r = 0.42

• No correlation: r = 0.07

[Figure: four scatter plots, one for each value of r]

Calculating a Correlation Coefficient

1. Find the sum of the x-values.
   In symbols: Σx

2. Find the sum of the y-values.
   In symbols: Σy

3. Multiply each x-value by its corresponding y-value and find the sum.
   In symbols: Σxy

Calculating a Correlation Coefficient

4. Square each x-value and find the sum.
   In symbols: Σx²

5. Square each y-value and find the sum.
   In symbols: Σy²

6. Use these five sums to calculate the correlation coefficient.
   In symbols:

$$ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}} $$

Example: Calculating the Correlation Coefficient

Calculate the correlation coefficient for the gross domestic products and carbon dioxide emissions data. What can you conclude?

GDP (trillions of $), x    CO2 emissions (millions of metric tons), y
1.6    428.2
3.6    828.8
4.9    1214.2
1.1    444.6
0.9    264.0
2.9    415.3
2.7    571.8
2.3    454.9
1.6    358.7
1.5    573.5

Solution: Calculating the Correlation Coefficient

x    y    xy    x²    y²
1.6    428.2    685.12    2.56    183,355.24
3.6    828.8    2983.68    12.96    686,909.44
4.9    1214.2    5949.58    24.01    1,474,281.64
1.1    444.6    489.06    1.21    197,669.16
0.9    264.0    237.6    0.81    69,696
2.9    415.3    1204.37    8.41    172,474.09
2.7    571.8    1543.86    7.29    326,955.24
2.3    454.9    1046.27    5.29    206,934.01
1.6    358.7    573.92    2.56    128,665.69
1.5    573.5    860.25    2.25    328,902.25

Σx = 23.1    Σy = 5554    Σxy = 15,573.71    Σx² = 67.35    Σy² = 3,775,842.76

Solution: Calculating the Correlation Coefficient

Σx = 23.1    Σy = 5554    Σxy = 15,573.71    Σx² = 67.35    Σy² = 3,775,842.76

$$ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}} = \frac{10(15{,}573.71) - (23.1)(5554)}{\sqrt{10(67.35) - 23.1^2}\,\sqrt{10(3{,}775{,}842.76) - 5554^2}} = \frac{27{,}439.7}{\sqrt{139.89}\,\sqrt{6{,}911{,}511.6}} \approx 0.882 $$

r ≈ 0.882 suggests a strong positive linear correlation. As the gross domestic product increases, the carbon dioxide emissions also tend to increase.
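The five-sum calculation can also be sketched in Python. This is a minimal illustration rather than part of the original slides; the function name pearson_r and the list names gdp and co2 are assumptions made for the example.

```python
import math

# GDP (trillions of $) and CO2 emissions (millions of metric tons) from the table
gdp = [1.6, 3.6, 4.9, 1.1, 0.9, 2.9, 2.7, 2.3, 1.6, 1.5]
co2 = [428.2, 828.8, 1214.2, 444.6, 264.0, 415.3, 571.8, 454.9, 358.7, 573.5]


def pearson_r(x, y):
    """Sample correlation coefficient computed from the five sums."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(gdp, co2)
print(round(r, 3))  # 0.882
```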


Example: Using Technology to Find a Correlation Coefficient

Use a technology tool to calculate the correlation coefficient for the Old Faithful data. What can you conclude?

Duration, x    Time, y    Duration, x    Time, y
1.8    56    3.78    79
1.82    58    3.83    85
1.9    62    3.88    80
1.93    56    4.1    89
1.98    57    4.27    90
2.05    57    4.3    89
2.13    60    4.43    89
2.3    57    4.47    86
2.37    61    4.53    89
2.82    73    4.55    86
3.13    76    4.6    92
3.27    77    4.63    91
3.65    77

Solution: Using Technology to Find a Correlation Coefficient

Use the STAT > Calc menu. To calculate r, you must first enter the DiagnosticOn command found in the Catalog menu.

r ≈ 0.979 suggests a strong positive correlation.


Using a Table to Test a Population Correlation Coefficient ρ

• Once the sample correlation coefficient r has been calculated, we need to determine whether there is enough evidence to decide that the population correlation coefficient ρ is significant at a specified level of significance.

• Use Table 11 in Appendix B.

• If |r| is greater than the critical value, there is enough evidence to decide that the correlation coefficient ρ is significant.

Using a Table to Test a Population Correlation Coefficient ρ

• Determine whether ρ is significant for five pairs of data (n = 5) at a level of significance of α = 0.01.

• If |r| > 0.959, the correlation is significant. Otherwise, there is not enough evidence to conclude that the correlation is significant.

[Table excerpt: critical values listed by the number of pairs of data in the sample and the level of significance]

Using a Table to Test a Population Correlation Coefficient ρ

1. Determine the number of pairs of data in the sample.
   In symbols: Determine n.

2. Specify the level of significance.
   In symbols: Identify α.

3. Find the critical value.
   In symbols: Use Table 11 in Appendix B.

Using a Table to Test a Population Correlation Coefficient ρ

4. Decide if the correlation is significant.
   In symbols: If |r| > critical value, the correlation is significant. Otherwise, there is not enough evidence to support that the correlation is significant.

5. Interpret the decision in the context of the original claim.

Example: Using a Table to Test a Population Correlation Coefficient ρ

Using the Old Faithful data, you used 25 pairs of data to find r ≈ 0.979. Is the correlation coefficient significant? Use α = 0.05.

Duration, x    Time, y    Duration, x    Time, y
1.8    56    3.78    79
1.82    58    3.83    85
1.9    62    3.88    80
1.93    56    4.1    89
1.98    57    4.27    90
2.05    57    4.3    89
2.13    60    4.43    89
2.3    57    4.47    86
2.37    61    4.53    89
2.82    73    4.55    86
3.13    76    4.6    92
3.27    77    4.63    91
3.65    77

Solution: Using a Table to Test a Population Correlation Coefficient ρ

• n = 25, α = 0.05

• |r| ≈ 0.979 > 0.396

• There is enough evidence at the 5% level of significance to conclude that there is a significant linear correlation between the duration of Old Faithful’s eruptions and the time between eruptions.
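The table-based decision rule can be sketched as a simple lookup. This is a hedged illustration, not the textbook's procedure in code: the dictionary CRITICAL_VALUES and the function is_significant are hypothetical names, and the dictionary holds only the two critical values quoted in these slides rather than the full Table 11.

```python
# Critical values quoted on the slides, keyed by (n, alpha).
# Only the two entries that appear in the text are included (assumption:
# a real implementation would load the complete table).
CRITICAL_VALUES = {
    (5, 0.01): 0.959,
    (25, 0.05): 0.396,
}


def is_significant(r, n, alpha):
    """Return True when |r| exceeds the tabulated critical value."""
    critical = CRITICAL_VALUES[(n, alpha)]
    return abs(r) > critical


print(is_significant(0.979, 25, 0.05))  # True
```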


Hypothesis Testing for a Population Correlation Coefficient ρ

• A hypothesis test can also be used to determine whether the sample correlation coefficient r provides enough evidence to conclude that the population correlation coefficient ρ is significant at a specified level of significance.

• A hypothesis test can be one-tailed or two-tailed.

Hypothesis Testing for a Population Correlation Coefficient ρ

• Left-tailed test
  H0: ρ ≥ 0 (no significant negative correlation)
  Ha: ρ < 0 (significant negative correlation)

• Right-tailed test
  H0: ρ ≤ 0 (no significant positive correlation)
  Ha: ρ > 0 (significant positive correlation)

• Two-tailed test
  H0: ρ = 0 (no significant correlation)
  Ha: ρ ≠ 0 (significant correlation)

The t-Test for the Correlation Coefficient

• Can be used to test whether the correlation between two variables is significant.

• The test statistic is r.

• The standardized test statistic

$$ t = \frac{r}{\sigma_r} = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} $$

follows a t-distribution with d.f. = n – 2.

• In this text, only two-tailed hypothesis tests for ρ are considered.

Using the t-Test for ρ

1. State the null and alternative hypotheses.
   In symbols: State H0 and Ha.

2. Specify the level of significance.
   In symbols: Identify α.

3. Identify the degrees of freedom.
   In symbols: d.f. = n – 2.

4. Determine the critical value(s) and rejection region(s).
   In symbols: Use Table 5 in Appendix B.

Using the t-Test for ρ

5. Find the standardized test statistic.
   In symbols:

$$ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} $$

6. Make a decision to reject or fail to reject the null hypothesis.
   In symbols: If t is in the rejection region, reject H0. Otherwise, fail to reject H0.

7. Interpret the decision in the context of the original claim.

Example: t-Test for a Correlation Coefficient

Previously you calculated r ≈ 0.882. Test the significance of this correlation coefficient. Use α = 0.05.

GDP (trillions of $), x    CO2 emissions (millions of metric tons), y
1.6    428.2
3.6    828.8
4.9    1214.2
1.1    444.6
0.9    264.0
2.9    415.3
2.7    571.8
2.3    454.9
1.6    358.7
1.5    573.5

Solution: t-Test for a Correlation Coefficient

• H0: ρ = 0

• Ha: ρ ≠ 0

• α = 0.05

• d.f. = 10 – 2 = 8

• Rejection Region: t < –2.306 or t > 2.306

• Test Statistic:

$$ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{0.882}{\sqrt{\dfrac{1 - (0.882)^2}{10 - 2}}} \approx 5.294 $$

• Decision: Reject H0

At the 5% level of significance, there is enough evidence to conclude that there is a significant linear correlation between gross domestic products and carbon dioxide emissions.
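The test statistic and the decision can be checked with a few lines of Python. This is a sketch for illustration only: the function name t_statistic is an assumption, and the critical value 2.306 is typed in from a t-table (d.f. = 8, two-tailed, α = 0.05) rather than computed.

```python
import math


def t_statistic(r, n):
    """Standardized test statistic t = r / sqrt((1 - r^2) / (n - 2))."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))


r, n = 0.882, 10
critical_value = 2.306  # t-table value for d.f. = 8, two-tailed, alpha = 0.05

t = t_statistic(r, n)
reject_h0 = abs(t) > critical_value  # two-tailed rejection region
print(round(t, 3), reject_h0)  # 5.294 True
```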


Correlation and Causation

• The fact that two variables are strongly correlated does not in itself imply a cause-and-effect relationship between the variables.

• If there is a significant correlation between two variables, you should consider the following possibilities.

1. Is there a direct cause-and-effect relationship between the variables?
   • Does x cause y?

Correlation and Causation

2. Is there a reverse cause-and-effect relationship between the variables?
   • Does y cause x?

3. Is it possible that the relationship between the variables can be caused by a third variable or by a combination of several other variables?

4. Is it possible that the relationship between two variables may be a coincidence?

Regression lines

• After verifying that the linear correlation between two variables is significant, the next step is to determine the equation of the line that best models the data (the regression line).

• A regression line can be used to predict the value of y for a given value of x.

Residuals

Residual

• The difference between the observed y-value and the predicted y-value for a given x-value on the line.

For a given x-value,

di = (observed y-value) – (predicted y-value)

[Figure: scatter plot with a regression line, showing residuals d1 through d6 as the vertical distances between each observed y-value and the predicted y-value]

Regression line (line of best fit)

• The line for which the sum of the squares of the residuals is a minimum.

• The equation of a regression line for an independent variable x and a dependent variable y is

ŷ = mx + b

where ŷ is the predicted y-value for a given x-value, m is the slope, and b is the y-intercept.

The Equation of a Regression Line

• ŷ = mx + b, where

$$ m = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} $$

$$ b = \bar{y} - m\bar{x} = \frac{\sum y}{n} - m\,\frac{\sum x}{n} $$

• $\bar{y}$ is the mean of the y-values in the data

• $\bar{x}$ is the mean of the x-values in the data

• The regression line always passes through the point $(\bar{x}, \bar{y})$.

Example: Finding the Equation of a Regression Line

Find the equation of the regression line for the gross domestic products and carbon dioxide emissions data.

GDP (trillions of $), x    CO2 emissions (millions of metric tons), y
1.6    428.2
3.6    828.8
4.9    1214.2
1.1    444.6
0.9    264.0
2.9    415.3
2.7    571.8
2.3    454.9
1.6    358.7
1.5    573.5

Solution: Finding the Equation of a Regression Line

Recall from section 9.1:

x    y    xy    x²    y²
1.6    428.2    685.12    2.56    183,355.24
3.6    828.8    2983.68    12.96    686,909.44
4.9    1214.2    5949.58    24.01    1,474,281.64
1.1    444.6    489.06    1.21    197,669.16
0.9    264.0    237.6    0.81    69,696
2.9    415.3    1204.37    8.41    172,474.09
2.7    571.8    1543.86    7.29    326,955.24
2.3    454.9    1046.27    5.29    206,934.01
1.6    358.7    573.92    2.56    128,665.69
1.5    573.5    860.25    2.25    328,902.25

Σx = 23.1    Σy = 5554    Σxy = 15,573.71    Σx² = 67.35    Σy² = 3,775,842.76

Solution: Finding the Equation of a Regression Line

Σx = 23.1    Σy = 5554    Σxy = 15,573.71    Σx² = 67.35

$$ m = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = \frac{10(15{,}573.71) - (23.1)(5554)}{10(67.35) - 23.1^2} = \frac{27{,}439.7}{139.89} \approx 196.151977 $$

$$ b = \bar{y} - m\bar{x} \approx \frac{5554}{10} - (196.151977)\,\frac{23.1}{10} = 555.4 - (196.151977)(2.31) \approx 102.2889 $$

Equation of the regression line: ŷ = 196.152x + 102.289
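The slope and intercept formulas can be sketched in Python as follows. This is an illustrative sketch, not the textbook's own code; the function name regression_line and the list names are assumptions for the example.

```python
# GDP (trillions of $) and CO2 emissions (millions of metric tons) from the table
gdp = [1.6, 3.6, 4.9, 1.1, 0.9, 2.9, 2.7, 2.3, 1.6, 1.5]
co2 = [428.2, 828.8, 1214.2, 444.6, 264.0, 415.3, 571.8, 454.9, 358.7, 573.5]


def regression_line(x, y):
    """Return (m, b) of the least-squares line y-hat = m*x + b."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - m * sum_x / n  # b = y-bar - m * x-bar
    return m, b


m, b = regression_line(gdp, co2)
print(round(m, 3), round(b, 3))  # 196.152 102.289
```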


Solution: Finding the Equation of a Regression Line

• To sketch the regression line, use any two x-values within the range of the data and calculate the corresponding y-values from the regression line.

Example: Using Technology to Find a Regression Equation

Use a technology tool to find the equation of the regression line for the Old Faithful data.

Duration, x    Time, y    Duration, x    Time, y
1.8    56    3.78    79
1.82    58    3.83    85
1.9    62    3.88    80
1.93    56    4.1    89
1.98    57    4.27    90
2.05    57    4.3    89
2.13    60    4.43    89
2.3    57    4.47    86
2.37    61    4.53    89
2.82    73    4.55    86
3.13    76    4.6    92
3.27    77    4.63    91
3.65    77

Solution: Using Technology to Find a Regression Equation

[Figure: technology output and graph; not recoverable from the transcript]

Example: Predicting y-Values Using Regression Equations

The regression equation for the gross domestic products (in trillions of dollars) and carbon dioxide emissions (in millions of metric tons) data is ŷ = 196.152x + 102.289. Use this equation to predict the expected carbon dioxide emissions for the following gross domestic products. (Recall from section 9.1 that x and y have a significant linear correlation.)

1. 1.2 trillion dollars

2. 2.0 trillion dollars

3. 2.5 trillion dollars

Solution: Predicting y-Values Using Regression Equations

ŷ = 196.152x + 102.289

1. 1.2 trillion dollars

ŷ = 196.152(1.2) + 102.289 ≈ 337.671

When the gross domestic product is $1.2 trillion, the CO2 emissions are about 337.671 million metric tons.

2. 2.0 trillion dollars

ŷ = 196.152(2.0) + 102.289 ≈ 494.593

When the gross domestic product is $2.0 trillion, the CO2 emissions are about 494.593 million metric tons.

Solution: Predicting y-Values Using Regression Equations

3. 2.5 trillion dollars

ŷ = 196.152(2.5) + 102.289 ≈ 592.669

When the gross domestic product is $2.5 trillion, the CO2 emissions are about 592.669 million metric tons.

Prediction values are meaningful only for x-values in (or close to) the range of the data. The x-values in the original data set range from 0.9 to 4.9. So, it would not be appropriate to use the regression line to predict carbon dioxide emissions for gross domestic products such as $0.2 trillion or $14.5 trillion.
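The prediction step, together with the range restriction, can be sketched as below. This is an illustration under stated assumptions: the function name predict_co2 is hypothetical, and refusing every x-value outside 0.9 to 4.9 is a stricter rule than the text's "in (or close to) the range", chosen only to keep the example simple.

```python
# Rounded regression coefficients from the slides
M, B = 196.152, 102.289
# Range of the GDP values in the original data set
X_MIN, X_MAX = 0.9, 4.9


def predict_co2(gdp):
    """Predict CO2 emissions (millions of metric tons) for a GDP in trillions of $."""
    if not X_MIN <= gdp <= X_MAX:
        # Assumption: treat anything outside the observed range as not meaningful
        raise ValueError(f"GDP {gdp} is outside the data range {X_MIN} to {X_MAX}")
    return M * gdp + B


for gdp in (1.2, 2.0, 2.5):
    print(gdp, round(predict_co2(gdp), 3))
# 1.2 337.671
# 2.0 494.593
# 2.5 592.669
```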


Variation About a Regression Line

• Three types of variation about a regression line
  – Total variation
  – Explained variation
  – Unexplained variation

• To find the total variation, you must first calculate
  – The total deviation
  – The explained deviation
  – The unexplained deviation

Variation About a Regression Line

Total Deviation = $y_i - \bar{y}$

Explained Deviation = $\hat{y}_i - \bar{y}$

Unexplained Deviation = $y_i - \hat{y}_i$

[Figure: a data point (xi, yi), the point (xi, ŷi) on the regression line, and the mean of y, with the total, explained, and unexplained deviations marked]

Variation About a Regression Line

Total variation

• The sum of the squares of the differences between the y-value of each ordered pair and the mean of y.

Total variation = $\sum (y_i - \bar{y})^2$

Explained variation

• The sum of the squares of the differences between each predicted y-value and the mean of y.

Explained variation = $\sum (\hat{y}_i - \bar{y})^2$

Variation About a Regression Line

Unexplained variation

• The sum of the squares of the differences between the y-value of each ordered pair and each corresponding predicted y-value.

Unexplained variation = $\sum (y_i - \hat{y}_i)^2$

The sum of the explained and unexplained variation is equal to the total variation.

Total variation = Explained variation + Unexplained variation
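The identity can be checked numerically on the GDP and CO2 data. The sketch below is illustrative and not from the slides; it fits the line with unrounded coefficients (an assumption made so that the identity holds up to floating-point error), and all variable names are made up for the example.

```python
import math

gdp = [1.6, 3.6, 4.9, 1.1, 0.9, 2.9, 2.7, 2.3, 1.6, 1.5]
co2 = [428.2, 828.8, 1214.2, 444.6, 264.0, 415.3, 571.8, 454.9, 358.7, 573.5]

n = len(gdp)
x_bar = sum(gdp) / n
y_bar = sum(co2) / n

# Least-squares slope and intercept, kept unrounded
m = (sum((x - x_bar) * (y - y_bar) for x, y in zip(gdp, co2))
     / sum((x - x_bar) ** 2 for x in gdp))
b = y_bar - m * x_bar
predicted = [m * x + b for x in gdp]

total = sum((y - y_bar) ** 2 for y in co2)
explained = sum((p - y_bar) ** 2 for p in predicted)
unexplained = sum((y - p) ** 2 for y, p in zip(co2, predicted))

# Total variation = explained variation + unexplained variation
print(math.isclose(total, explained + unexplained, rel_tol=1e-9))  # True
```

With the rounded coefficients from the slides the two sides would agree only approximately, which is why the sketch refits the line.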


Coefficient of Determination

Coefficient of determination

• The ratio of the explained variation to the total variation.

• Denoted by r²

$$ r^2 = \frac{\text{Explained variation}}{\text{Total variation}} $$

Example: Coefficient of Determination

The correlation coefficient for the gross domestic products and carbon dioxide emissions data is r ≈ 0.882. Find the coefficient of determination. What does this tell you about the explained variation of the data about the regression line? About the unexplained variation?

Solution:

r² ≈ (0.882)² ≈ 0.778

About 78% of the variation in the carbon dioxide emissions can be explained by the variation in the gross domestic products. About 22% of the variation is unexplained.

The Standard Error of Estimate

Standard error of estimate

• The standard deviation of the observed yi-values about the predicted ŷ-value for a given xi-value.

• Denoted by se.

$$ s_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}} $$

where n is the number of ordered pairs in the data set.

• The closer the observed y-values are to the predicted y-values, the smaller the standard error of estimate will be.

The Standard Error of Estimate

1. Make a table that includes the column headings shown.
   In symbols: $x_i,\; y_i,\; \hat{y}_i,\; (y_i - \hat{y}_i),\; (y_i - \hat{y}_i)^2$

2. Use the regression equation to calculate the predicted y-values.
   In symbols: $\hat{y}_i = m x_i + b$

3. Calculate the sum of the squares of the differences between each observed y-value and the corresponding predicted y-value.
   In symbols: $\sum (y_i - \hat{y}_i)^2$

4. Find the standard error of estimate.
   In symbols:

$$ s_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}} $$

Example: Standard Error of Estimate

The regression equation for the gross domestic products and carbon dioxide emissions data as calculated in section 9.2 is

ŷ = 196.152x + 102.289

Find the standard error of estimate.

Solution:

Use a table to calculate the sum of the squared differences of each observed y-value and the corresponding predicted y-value.

Solution: Standard Error of Estimate

xi    yi    ŷi    (yi – ŷi)²
1.6    428.2    416.1322    145.63179684
3.6    828.8    808.4362    414.68435044
4.9    1214.2    1063.4338    22,730.44706244
1.1    444.6    318.0562    16,013.33331844
0.9    264.0    278.8258    219.80434564
2.9    415.3    671.1298    65,448.88656804
2.7    571.8    631.8994    3611.93788036
2.3    454.9    553.4386    9709.85568996
1.6    358.7    416.1322    3298.45759684
1.5    573.5    396.517    31,322.982289

Σ(yi – ŷi)² = 152,916.020898 (the unexplained variation)

Solution: Standard Error of Estimate

• n = 10, Σ(yi – ŷi)² = 152,916.020898

$$ s_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}} = \sqrt{\frac{152{,}916.020898}{10 - 2}} \approx 138.255 $$

The standard error of estimate of the carbon dioxide emissions for a specific gross domestic product is about 138.255 million metric tons.
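The same result can be reproduced with a short Python sketch. It is offered as an illustration only; the function name standard_error_of_estimate is an assumption, and the rounded regression coefficients from the slides are used so that the numbers match the table.

```python
import math

gdp = [1.6, 3.6, 4.9, 1.1, 0.9, 2.9, 2.7, 2.3, 1.6, 1.5]
co2 = [428.2, 828.8, 1214.2, 444.6, 264.0, 415.3, 571.8, 454.9, 358.7, 573.5]

# Rounded regression coefficients, as used in the slides
M, B = 196.152, 102.289


def standard_error_of_estimate(x, y, m, b):
    """s_e = sqrt( sum((y_i - y_hat_i)^2) / (n - 2) )."""
    n = len(x)
    unexplained = sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(unexplained / (n - 2))


s_e = standard_error_of_estimate(gdp, co2, M, B)
print(round(s_e, 3))  # 138.255
```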


Prediction Intervals

• Two variables have a bivariate normal distribution if, for any fixed value of x, the corresponding values of y are normally distributed and, for any fixed value of y, the corresponding x-values are normally distributed.

Prediction Intervals

• A prediction interval can be constructed for the true value of y.

• Given a linear regression equation ŷ = mx + b and x0, a specific value of x, a c-prediction interval for y is

ŷ – E < y < ŷ + E

where

$$ E = t_c\, s_e \sqrt{1 + \frac{1}{n} + \frac{n(x_0 - \bar{x})^2}{n\sum x^2 - (\sum x)^2}} $$

• The point estimate is ŷ and the margin of error is E. The probability that the prediction interval contains y is c.

Constructing a Prediction Interval for y for a Specific Value of x

1. Identify the number of ordered pairs in the data set n and the degrees of freedom.
   In symbols: d.f. = n – 2

2. Use the regression equation and the given x-value to find the point estimate ŷ.
   In symbols: $\hat{y}_i = m x_i + b$

3. Find the critical value tc that corresponds to the given level of confidence c.
   In symbols: Use Table 5 in Appendix B.

Constructing a Prediction Interval for y for a Specific Value of x

4. Find the standard error of estimate se.
   In symbols:

$$ s_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}} $$

5. Find the margin of error E.
   In symbols:

$$ E = t_c\, s_e \sqrt{1 + \frac{1}{n} + \frac{n(x_0 - \bar{x})^2}{n\sum x^2 - (\sum x)^2}} $$

6. Find the left and right endpoints and form the prediction interval.
   In symbols:
   Left endpoint: ŷ – E
   Right endpoint: ŷ + E
   Interval: ŷ – E < y < ŷ + E

Example: Constructing a Prediction Interval

Construct a 95% prediction interval for the carbon dioxide emission when the gross domestic product is $3.5 trillion. What can you conclude?

Recall, n = 10, ŷ = 196.152x + 102.289, se = 138.255

Σx = 23.1, Σx² = 67.35, x̄ = 2.31

Solution:

Point estimate:

ŷ = 196.152(3.5) + 102.289 ≈ 788.821

Critical value:

d.f. = n – 2 = 10 – 2 = 8, tc = 2.306

Solution: Constructing a Prediction Interval

$$ E = t_c\, s_e \sqrt{1 + \frac{1}{n} + \frac{n(x_0 - \bar{x})^2}{n\sum x^2 - (\sum x)^2}} = (2.306)(138.255)\sqrt{1 + \frac{1}{10} + \frac{10(3.5 - 2.31)^2}{10(67.35) - (23.1)^2}} \approx 349.424 $$

Left Endpoint: ŷ – E = 788.821 – 349.424 ≈ 439.397

Right Endpoint: ŷ + E = 788.821 + 349.424 ≈ 1138.245

439.397 < y < 1138.245

You can be 95% confident that when the gross domestic product is $3.5 trillion, the carbon dioxide emissions will be between 439.397 and 1138.245 million metric tons.
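The margin of error and the interval endpoints can be recomputed with a small Python sketch. This is an illustration, not the textbook's code: the function name prediction_margin is an assumption, and the critical value 2.306 and the standard error 138.255 are taken from the worked example rather than computed here.

```python
import math


def prediction_margin(x0, n, sum_x, sum_x2, t_c, s_e):
    """Margin of error E for a prediction interval at x = x0."""
    x_bar = sum_x / n
    spread = n * sum_x2 - sum_x ** 2
    return t_c * s_e * math.sqrt(1 + 1 / n + n * (x0 - x_bar) ** 2 / spread)


x0 = 3.5
y_hat = 196.152 * x0 + 102.289  # point estimate from the regression equation
E = prediction_margin(x0, n=10, sum_x=23.1, sum_x2=67.35, t_c=2.306, s_e=138.255)

print(round(E, 3))                                  # 349.424
print(round(y_hat - E, 3), round(y_hat + E, 3))     # 439.397 1138.245
```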


Multiple Regression Equation

• In many instances, a better prediction can be found for a dependent (response) variable by using more than one independent (explanatory) variable.

• For example, a more accurate prediction for the company sales discussed in previous sections might be made by considering the number of employees on the sales staff as well as the advertising expenses.

Multiple Regression Equation

Multiple regression equation

• ŷ = b + m1x1 + m2x2 + m3x3 + … + mkxk

• x1, x2, x3,…, xk are independent variables

• b is the y-intercept

• y is the dependent variable

* Because the mathematics associated with this concept is complicated, technology is generally used to calculate the multiple regression equation.

Example: Finding a Multiple Regression Equation

A researcher wants to determine how employee salaries at a certain company are related to the length of employment, previous experience, and education. The researcher selects eight employees from the company and obtains the data shown on the next slide. Use Minitab to find a multiple regression equation that models the data.

Example: Finding a Multiple Regression Equation

Employee    Salary, y    Employment (yrs), x1    Experience (yrs), x2    Education (yrs), x3
A    57,310    10    2    16
B    57,380    5    6    16
C    54,135    3    1    12
D    56,985    6    5    14
E    58,715    8    8    16
F    60,620    20    0    12
G    59,200    8    4    18
H    60,320    14    6    17

Solution: Finding a Multiple Regression Equation

• Enter the y-values in C1 and the x1-, x2-, and x3-values in C2, C3, and C4, respectively.

• Select “Regression > Regression…” from the Stat menu.

• Use the salaries as the response variable and the remaining data as the predictors.

Solution: Finding a Multiple Regression Equation

The regression equation is

ŷ = 49,764 + 364x1 + 228x2 + 267x3

Predicting y-Values

• After finding the equation of the multiple regression line, you can use the equation to predict y-values over the range of the data.

• To predict y-values, substitute the given value for each independent variable into the equation, then calculate ŷ.

Example: Predicting y-Values

Use the regression equation

ŷ = 49,764 + 364x1 + 228x2 + 267x3

to predict an employee’s salary given 12 years of current employment, 5 years of experience, and 16 years of education.

Solution:

ŷ = 49,764 + 364(12) + 228(5) + 267(16) = 59,544

The employee’s predicted salary is $59,544.
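The substitution step can be sketched in Python. This is a minimal illustration using the coefficients reported on the slides; the names INTERCEPT, COEFFICIENTS, and predict_salary are assumptions made for the example, and the sketch does not refit the model from the data.

```python
# Coefficients of the multiple regression equation reported on the slides
INTERCEPT = 49764
COEFFICIENTS = (364, 228, 267)  # employment, experience, education (all in years)


def predict_salary(employment, experience, education):
    """Substitute each independent variable into the regression equation."""
    values = (employment, experience, education)
    return INTERCEPT + sum(m * x for m, x in zip(COEFFICIENTS, values))


print(predict_salary(12, 5, 16))  # 59544
```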
