
Chapter 3: Simple Regression Analysis (Part 1)

Terry Dielman, Applied Regression Analysis: A Second Course in Business and Economic Statistics, fourth edition


3.1 Using Simple Regression to Describe a Relationship

• Regression analysis is a statistical technique used to describe relationships among variables.

• The simplest case is one where a dependent variable y may be related to an independent or explanatory variable x.

• The equation expressing this relationship is the line:

$y = b_0 + b_1 x$


Slope and Intercept

• For a given set of data, we need to calculate values for the slope b1 and the intercept b0.

• Figure 3.1 shows the graph of a set of six (x, y) pairs that have an exact relationship.

• Ordinary algebra is all you need to compute y = 1 + 2x.


Figure 3.1 Graph of An Exact Relationship

[Plot of the six (x, y) pairs, which fall exactly on the line y = 1 + 2x]

x   y
1   3
2   5
3   7
4   9
5   11
6   13


Error in the Relationship

• In real life, we usually do not have exact relationships.

• Figure 3.2 shows a situation where y and x have a strong tendency to increase together, but the relationship is not perfect.

• You can use a ruler to put a line in approximately the "right place" and use algebra again.

• A good guess might be ŷ = -1 + 2.5x


Figure 3.2 Graph of a Relationship That is NOT Exact

x   y
1   3
2   2
3   8
4   8
5   11
6   13

[Scatter plot of the six pairs with the fitted line]

Regression Plot: y = -0.2 + 2.2x
S = 1.48324   R-Sq = 90.6%   R-Sq(adj) = 88.2%


Everybody Is Different

• The drawback to this technique is that everybody will have their own opinion about where the line goes.

• There would be ever greater differences if there were more data with a wider scatter.

• We need a precise mathematical technique to use for this task.


Residuals

• Figure 3.3 shows the previous graph where the "fit error" of each point is indicated.

• These residuals are positive if the point is above the line and negative if the line is above the point.

• We want a technique that will make the + and - even out.


Figure 3.3 Deviations From the Line

[Regression plot of the same six points with the line y = -0.2 + 2.2x; points above the line show + deviations, points below the line show - deviations]

S = 1.48324   R-Sq = 90.6%   R-Sq(adj) = 88.2%


Computation Ideas (1)

We can search for a line that minimizes the sum of the residuals:

$\sum_{i=1}^{n} (y_i - \hat{y}_i)$

While this is a good idea, it can be shown that any line passing through the point (x̄, ȳ) will have this sum = 0.


Computation Ideas (2)

We can work with absolute values and search for a line that minimizes:

$\sum_{i=1}^{n} |y_i - \hat{y}_i|$

Such a procedure, called LAV or least absolute value regression, does exist but is usually found only in specialized software.


Computation Ideas (3)

By far the most popular approach is to square the residuals and minimize:

$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

This procedure is called least squares and is widely available in software. It uses calculus to solve for the b0 and b1 terms and gives a unique solution.


Least Squares Estimators

• There are several formulas for the b1 term. If doing it by hand, we might want to use:

$b_1 = \dfrac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}$

• The intercept is $b_0 = \bar{y} - b_1 \bar{x}$


Figure 3.5 Computations Required for b1 and b0

 xi   yi   xi²   xi·yi
  1    3    1      3
  2    2    4      4
  3    8    9     24
  4    8   16     32
  5   11   25     55
  6   13   36     78
Totals: 21   45   91   196


Calculations

$b_1 = \dfrac{\sum x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right)}{\sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2} = \dfrac{196 - \frac{(21)(45)}{6}}{91 - \frac{(21)^2}{6}} = \dfrac{38.5}{17.5} = 2.2$

$b_0 = \bar{y} - b_1 \bar{x} = 7.5 - (2.2)(3.5) = -0.2$
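A minimal Python sketch (using only the six (x, y) pairs above, not part of the original slides) that reproduces these hand calculations:

```python
import numpy as np

# The six (x, y) pairs from Figure 3.5
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([3, 2, 8, 8, 11, 13], dtype=float)
n = len(x)

# Hand-calculation formula for the slope
b1 = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / (np.sum(x ** 2) - np.sum(x) ** 2 / n)
# Intercept from the sample means
b0 = y.mean() - b1 * x.mean()
print(b1, b0)  # 2.2 and -0.2
```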


The Unique Minimum

• The line we obtained was:

$\hat{y} = -0.2 + 2.2x$

• The sum of squared errors (SSE) is:

$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 8.80$

• No other linear equation will yield a smaller SSE. For the line ŷ = -1 + 2.5x we guessed earlier, the SSE is 10.75 (see the check below).
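As a quick check, a short sketch (same six points, line coefficients taken from the slides) comparing the two SSE values:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([3, 2, 8, 8, 11, 13], dtype=float)

sse_least_squares = np.sum((y - (-0.2 + 2.2 * x)) ** 2)  # the fitted line
sse_guess = np.sum((y - (-1.0 + 2.5 * x)) ** 2)          # the eyeballed line
print(sse_least_squares, sse_guess)  # 8.8 and 10.75
```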


3.2 Examples of Regression as a Descriptive Technique

Example 3.2 Pricing Communications Nodes

A Ft. Worth manufacturing company was concerned about the cost of adding nodes to a communications network. They obtained data on 14 existing nodes.

They did a regression of cost (the y) on number of ports (x).


[Scatter plot of COST versus NUMPORTS for the 14 nodes, titled "Pricing Communications Nodes", with the fitted line COST = 16594 + 650 NUMPORTS]


Example 3.3 Estimating Residential Real Estate Values

The Tarrant County Appraisal District uses data such as house size, location and depreciation to help appraise property.

Regression can be used to establish a weight for each factor. Here we look at how price depends on size for a set of 100 homes. The data are from 1990.


[Scatter plot of VALUE versus SIZE for the 100 homes, titled "Tarrant County Real Estate", with the fitted line VALUE = -50035 + 72.8 SIZE]


Example 3.4 Forecasting Housing Starts

Forecasts of economic measures are important to the government and to many industries.

Here we analyze the relationship between US housing starts and mortgage rates. The rate used is the US average for new home purchases.

Annual data from 1963 to 2002 is used.


[Scatter plot of STARTS versus RATES, titled "US Housing Starts", with the fitted line STARTS = 1726 - 22.2 RATES]


3.3 Inferences From a Simple Regression Analysis

• So far regression has been used as a way to describe the relationship between the two variables.

• Here we will use our sample data to make inferences about what is going on in the underlying population.

• To do that, we first need some assumptions about how things are.


3.3.1 Assumptions Concerning the Population Regression Line

• Let's use the communications nodes example to illustrate. Costs ranged from roughly $23,000 to $57,000 and the number of ports from 12 to 68.

• Three times we had projects with 24 ports, but the three costs were all different. The same thing occurred with the repeated observations at 52 and 56 ports.

• This illustrates how we view things: at each value of x there is a distribution of potential y values that can occur.


The Conditional Mean

• Our first assumption is that the means of these distributions all lie on a straight line:

$\mu_{y|x} = \beta_0 + \beta_1 x$

• For example, at projects with 30 ports, we have:

$\mu_{y|x=30} = \beta_0 + \beta_1 (30)$

• The actual costs of projects with 30 ports are going to be distributed about this mean. The same thing happens at other sizes of projects, so you might see something like the next slide.


Figure 3.12 Distribution of Costs around the Regression Line

[At each number of ports (for example 12, 30 and 68) there is a distribution of costs centered on the line β0 + β1 Nodes]


The Disturbance Terms

• Because of the variation around the regression line, it is convenient to view the individual costs as:

$y_i = \beta_0 + \beta_1 x_i + e_i$

• The ei are called the disturbances and represent how yi differs from its conditional mean. If yi is above the mean, its disturbance has a + value.


Assumptions

1. We expect the average disturbance ei to be zero so the regression line passes through the conditional mean of y.

2. The ei have constant variance σe2.

3. The ei are normally distributed.

4. The ei are independent.


3.3.2 Inferences About β0 and β1

• We use our sample data to estimate β0 by b0 and β1 by b1. If we had a different sample, we would not be surprised to get different estimates.

• Understanding how much they would vary from sample to sample is an important part of the inference process.

• We use the assumptions, together with our data, to construct the sampling distributions for b0 and b1.


The Sampling Distributions

• The estimators have many good statistical properties. They are unbiased, consistent and minimum variance.

• They have normal distributions with standard errors that are functions of the x values and σe2.

• Full details are in Section 3.3.2.


Estimate of σe2

• This is an unknown quantity that needs to be estimated from data.

• We estimate it by the formula:

$S_e^2 = \dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2} = \dfrac{SSE}{n-2} = MSE$

• The term MSE stands for mean squared error and is more or less the average squared residual.


Standard Error of the Regression

• The divisor n-2 used in the previous calculation follows our general rule that degrees of freedom equal the sample size minus the number of estimates we make (b0 and b1) before estimating the variance.

• The square root of MSE is Se, which we call the standard error of the regression.

• Se can be roughly interpreted as the "typical" amount we miss in estimating each y value (see the sketch below).
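A minimal sketch (six-point example again, fitted line from Section 3.1) showing the chain SSE → MSE → Se:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([3, 2, 8, 8, 11, 13], dtype=float)

residuals = y - (-0.2 + 2.2 * x)
sse = np.sum(residuals ** 2)        # 8.8
mse = sse / (len(x) - 2)            # divisor is n-2
se = np.sqrt(mse)                   # about 1.48, the S reported on the Minitab plots
print(sse, mse, se)
```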


Inference About β1

• Interval estimates and hypothesis tests are constructed using the sampling distribution of b1.

• The standard error of b1 is:

$S_{b_1} = \dfrac{S_e}{\sqrt{(n-1)s_x^2}}$

• Computer programs routinely compute this and report its value.


Interval Estimate

• The distribution we use is a t with n-2 degrees of freedom.

• The interval is:

$b_1 \pm t_{n-2}\, S_{b_1}$

• The value of t, of course, depends on the selected confidence level.


Tests About β1

The most common test is that a change in the x variable does not induce a change in y, which can be stated:

H0: β1 = 0    Ha: β1 ≠ 0

If H0 is true, the population regression equation is a flat line; that is, regardless of the value of x, y has the same distribution.


Test Statistic

The test would be performed by using the standardized test statistic:

$t = \dfrac{b_1 - 0}{S_{b_1}}$

Most computer programs compute this statistic and its associated p-value and include them in the output. The p-value is for the two-sided version of the test.


Inference About β0

• We can also compute confidence intervals and perform hypothesis tests about the intercept in the population equation.

• Details about the tests and intervals are in Section 3.3.2, but in most problems we are not interested in the intercept.

• The intercept is the value of y at x = 0, and in many problems this is not relevant; for example, we never see houses with zero square feet of floor space.

• Sometimes it is relevant, though. If we are estimating costs, we could interpret the intercept as the fixed cost. Even though we never see communication nodes with zero ports, there is likely to be a fixed cost associated with setting up each project.


Example 3.6 Pricing Communications Nodes (continued)

Inference questions:

1. What is the equation relating NUMPORTS to COST?

2. Is the relationship significant?

3. What is an interval estimate of β1?

4. Is the relationship positive?

5. Can we claim each port costs at least $1000?

6. What is our estimate of fixed cost?

7. Is the intercept 0?


Minitab Regression Output

Regression Analysis: COST versus NUMPORTS

The regression equation is
COST = 16594 + 650 NUMPORTS

Predictor      Coef  SE Coef     T      P
Constant      16594     2687  6.18  0.000
NUMPORTS     650.17    66.91  9.72  0.000

S = 4307   R-Sq = 88.7%   R-Sq(adj) = 87.8%

Analysis of Variance
Source          DF          SS          MS      F      P
Regression       1  1751268376  1751268376  94.41  0.000
Residual Error  12   222594146    18549512
Total           13  1973862521


Is the relationship significant?

H0: β1 = 0 (Cost does not change when the number of ports increases)

Ha: β1 ≠ 0 (Cost does change)

We will use a 5% level of significance and the t distribution with (n-2) = 12 degrees of freedom.

Decision rule: Reject H0 if t > 2.179 or if t < -2.179

From the Minitab output, t = 9.72 (p-value = .000).

We conclude that there is a significant relationship between project size and cost.


What is an interval estimate of β1?

The interval is:

$b_1 \pm t_{n-2}\, S_{b_1}$

For a 95% interval use t = 2.179:

650.17 ± 2.179(66.91) = 650.17 ± 145.80

We are 95% sure that the average cost for each additional port is between $504 and $796.
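A short sketch (slope, standard error and n taken from the Minitab output; scipy supplies the t multiplier) that reproduces this interval:

```python
from scipy import stats

b1, se_b1, n = 650.17, 66.91, 14     # from the Minitab output for the node data

t_crit = stats.t.ppf(0.975, df=n - 2)            # two-sided 95%, 12 df, about 2.179
print(b1 - t_crit * se_b1, b1 + t_crit * se_b1)  # roughly 504 to 796
print(b1 / se_b1)                                # t statistic for H0: beta1 = 0, about 9.72
```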


Can we claim a positive relationship?

H0: β1 = 0 (Cost does not change when size increases)

Ha: β1 > 0 (Cost increases when size increases)

We will use a 5% level of significance and the t distribution with (n-2) = 12 degrees of freedom.

Decision rule: Reject H0 if t > 1.782

From the Minitab output, t = 9.72 (the p-value is half of the listed value of .000, which is still .000).

We conclude that project cost does increase with project size.


Is the cost per port at least $1000?

H0: β1 ≥ 1000 (Cost per port is at least $1000)

Ha: β1 < 1000 (Cost per port is less than $1000)

Again we will use a 5% level of significance and 12 degrees of freedom.

Decision rule: Reject H0 if t < -1.782

Here use:

$t = \dfrac{b_1 - 1000}{S_{b_1}} = \dfrac{650.17 - 1000}{66.91} = -5.23$

We conclude that the cost per port is (much) less than $1000.


What is our estimate of fixed cost?

We can interpret the intercept of the equation as the fixed cost, and the slope as the variable cost. For the intercept, an interval is:

$b_0 \pm t_{n-2}\, S_{b_0}$

16594 ± 2.179(2687) = 16594 ± 5855

We are 95% sure the fixed cost is between $10,739 and $22,449.


Is the intercept 0?

H0: β0 = 0 (Fixed cost is 0)

Ha: β0 ≠ 0 (Fixed cost is not 0)

Again, use a 5% level of significance and 12 d.f.

Decision rule: Reject H0 if t > 2.179 or if t < -2.179

From the Minitab output, t = 6.18 (p-value = .000).

We conclude that the fixed cost is not zero.


Chapter 3: Simple Regression Analysis (Part 2)

Terry Dielman, Applied Regression Analysis: A Second Course in Business and Economic Statistics, fourth edition


3.4 Assessing the Fit of the Regression Line

• In some problems, it may not be possible to find a good predictor of the y values.

• We know the least squares procedure finds the best possible fit, but that does not guarantee good predictive power.

• In this section we discuss some methods for summarizing the fit quality.


3.4.1 The ANOVA Table

Let us start by looking at the amount of variation in the y values. The variation about the mean is:

$\sum_{i=1}^{n} (y_i - \bar{y})^2$

which we will call SST, the total sum of squares.

Text equations (3.14) and (3.15) show how this can be split up into two parts.


Partitioning SST

SST can be split into two pieces: the previously introduced SSE and a new quantity, SSR, the regression sum of squares.

$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

SST = SSR + SSE
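A minimal numerical check of the partition (six-point example, fitted line from Section 3.1):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([3, 2, 8, 8, 11, 13], dtype=float)
y_hat = -0.2 + 2.2 * x

sst = np.sum((y - y.mean()) ** 2)      # 93.5
ssr = np.sum((y_hat - y.mean()) ** 2)  # 84.7
sse = np.sum((y - y_hat) ** 2)         # 8.8
print(sst, ssr + sse)                  # the two totals agree
```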


Explained and Unexplained Variation

• We know that SSE is the sum of all the squared residuals, which represent lack of fit in the observations.

• We call this the unexplained variation in the sample.

• Because SSR contains the remainder of the variation in the sample, it is the variation explained by the regression equation.


The ANOVA Table

Most statistics packages organize these quantities in an ANalysis Of VAriance table.

Source       DF   SS    MS    F
Regression    1   SSR   MSR   MSR/MSE
Residual     n-2  SSE   MSE
Total        n-1  SST


3.4.2 The Coefficient of Determination

• If we had an exact relationship between y and x, then SSE would be zero and SSR = SST.

• Since that does not happen often, it is convenient to use the ratio of SSR to SST as a measure of how close we get to an exact relationship.

• This ratio is called the Coefficient of Determination, or R2.


R2

R2 = SSR/SST is a fraction between 0 and 1.

In an exact model, R2 would be 1. Most of the time we multiply by 100 and report it as a percentage.

Thus, R2 is the percentage of the variation in the sample of y values that is explained by the regression equation.


Correlation Coefficient

• Some programs also report the square root of R2 as the correlation between the y and y-hat values.

• When there is only a single predictor variable, as here, R2 is just the square of the correlation between y and x.


3.4.3 The F Test

• An additional measure of fit is provided by the F statistic, which is the ratio of MSR to MSE.

• This can be used as another way to test the hypothesis that β1 = 0.

• This test is not very important in simple regression because it is redundant with the t test on the slope.

• In multiple regression (next chapter) it is much more important.


F Test Setup

The hypotheses are:

H0: β1 = 0    Ha: β1 ≠ 0

The F ratio has 1 numerator degree of freedom and n-2 denominator degrees of freedom.

A critical value for the test is selected from that distribution and H0 is rejected if the computed F ratio exceeds the critical value.


Example 3.8 Pricing Communications Nodes (continued)

Below we see the portion of the Minitab output that lists the statistics we have just discussed.

S = 4307   R-Sq = 88.7%   R-Sq(adj) = 87.8%

Analysis of Variance
Source          DF          SS          MS      F      P
Regression       1  1751268376  1751268376  94.41  0.000
Residual Error  12   222594146    18549512
Total           13  1973862521


R2 and F

R2 = SSR/SST = 1751268376/1973862521 = .8872, or 88.7%

F = MSR/MSE = 1751268376/18549512 = 94.41

From the F1,12 distribution, the critical value at a 5% significance level is 4.75.
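A short sketch (sums of squares copied from the ANOVA table; scipy supplies the F critical value) that reproduces these numbers:

```python
from scipy import stats

ssr, sse, sst, n = 1751268376, 222594146, 1973862521, 14   # node data ANOVA table

r_sq = ssr / sst                       # about 0.887
f_stat = (ssr / 1) / (sse / (n - 2))   # MSR / MSE, about 94.4
f_crit = stats.f.ppf(0.95, 1, n - 2)   # about 4.75
print(r_sq, f_stat, f_crit)
```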


3.5 Prediction or Forecasting With a Simple Linear Regression Equation

• Suppose we are interested in predicting the cost of a new communications node that had 40 ports.

• If this size project is something we would see often, we might be interested in estimating the average cost of all projects with 40 ports.

• If it were something we expect to see only once, we would be interested in predicting the cost of the individual project.


3.5.1 Estimating the Conditional Mean of y Given x.

At xm = 40 ports, the quantity we are estimating is:

$\mu_{y|x=40} = \beta_0 + \beta_1 (40)$

Our best guess of this is just the point on the regression line:

$\hat{y}_m = b_0 + b_1 (40)$


Standard Error of the Mean

• We will want to make an interval estimate, so we need some kind of standard error.

• Because our point estimate is a function of the random variables b0 and b1, their standard errors figure into our computation.

• The result is:

$S_m = S_e \sqrt{\dfrac{1}{n} + \dfrac{(x_m - \bar{x})^2}{(n-1)s_x^2}}$


Where Are We Most Accurate?

• For estimating the mean at the point xm, the standard error is Sm.

• If you examine the formula:

$S_m = S_e \sqrt{\dfrac{1}{n} + \dfrac{(x_m - \bar{x})^2}{(n-1)s_x^2}}$

you can see that the second term will be zero if we predict at the mean value of x.

• That makes sense: it says you do your best prediction right in the center of your data.


Interval Estimate

• For estimating the conditional mean of y that occurs at xm we use:

$\hat{y}_m \pm t_{n-2}\, S_m$

• We call this a confidence interval for the mean value of y at xm.


Hypothesis Test

• We could also perform a hypothesis test about the conditional mean.

• The hypothesis would be:

H0: µy|x=40 = (some value)

and we would construct a t ratio from the point estimate and standard error.


3.5.2 Predicting an Individual Value of y Given x

• If we are trying to say something about an individual value of y, it is a little bit harder.

• We not only have to first estimate the conditional mean, but we also have to tack on an allowance for y being above or below its mean.

• We use the same point estimate, but our standard error is larger.


Prediction Standard Error

• It can be shown that the prediction standard error is:

$S_p = S_e \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_m - \bar{x})^2}{(n-1)s_x^2}}$

• This looks a lot like the previous one but has an additional term under the square root sign.

• The relationship is:

$S_p^2 = S_m^2 + S_e^2$


Predictive Inference

• Although we could be interested in a hypothesis test, the most common type of predictive inference is a prediction interval.

• The interval is just like the one for the conditional mean, except that Sp is used in the computation (see the sketch below).
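A minimal sketch of both intervals (six-point example; the prediction point x_m = 4 is an arbitrary choice for illustration):

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([3, 2, 8, 8, 11, 13], dtype=float)
n = len(x)

s_e = np.sqrt(np.sum((y - (-0.2 + 2.2 * x)) ** 2) / (n - 2))
x_m = 4.0                                   # point at which we predict
y_hat_m = -0.2 + 2.2 * x_m

spread = (x_m - x.mean()) ** 2 / ((n - 1) * x.var(ddof=1))
s_m = s_e * np.sqrt(1 / n + spread)         # standard error for the conditional mean
s_p = s_e * np.sqrt(1 + 1 / n + spread)     # standard error for an individual prediction

t_crit = stats.t.ppf(0.975, df=n - 2)
print(y_hat_m - t_crit * s_m, y_hat_m + t_crit * s_m)   # confidence interval for the mean
print(y_hat_m - t_crit * s_p, y_hat_m + t_crit * s_p)   # wider prediction interval
```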


Example 3.10 Pricing Communications Nodes (one last time)

What do we get when there are 40 ports?

Many statistics packages have a way for you to do the prediction. Here is Minitab's output:

Predicted Values for New Observations
New Obs    Fit  SE Fit        95.0% CI        95.0% PI
      1  42600    1178  (40035, 45166)  (32872, 52329)

Values of Predictors for New Observations
New Obs  NUMPORTS
      1      40.0


From the Output

ŷm = 42600    Sm = 1178

Confidence interval: 40035 to 45166, computed as 42600 ± 2.179(1178)

Prediction interval: 32872 to 52329, computed as 42600 ± 2.179(Sp); the output does not list Sp.
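Sp can be backed out of the output; a short check (all numbers from the Minitab output above) using the relationship Sp² = Sm² + Se²:

```python
import numpy as np

se_fit, s_e, t_crit = 1178, 4307, 2.179       # SE Fit, S and the t multiplier from the output
pi_lo, pi_hi = 32872, 52329                   # the 95% prediction interval

s_p_from_interval = (pi_hi - pi_lo) / 2 / t_crit    # half-width divided by t
s_p_from_formula = np.sqrt(se_fit ** 2 + s_e ** 2)  # Sp^2 = Sm^2 + Se^2
print(s_p_from_interval, s_p_from_formula)          # both about 4465
```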


Interpretations

For all projects with 40 ports, we are 95% sure that the average cost is between $40,035 and $45,166.

We are 95% sure that any individual project will have a cost between $32,872 and $52,329.


3.5.3 Assessing Quality of Prediction

• We use the model's R2 as a measure of how well it fits, but this may overestimate the model's ability to predict.

• The reason is that R2 is optimized by the least squares procedure for the data in our sample.

• It is not necessarily optimal for data outside our sample, which is what we are predicting.


Data Splitting

• We can split the data into two pieces. Use the first part to obtain the equation and use it to predict the data in the second part.

• By comparing the actual y values in the second part to their corresponding predicted values, you get an idea of how well you predict data that is not in the "fit" sample.

• The biggest drawback to this is that it won't work too well unless we have a lot of data. To be really reliable we should have at least 25 to 30 observations in both samples.


The PRESS Statistic

• Suppose you temporarily deleted observation i from the data set, fit a new equation, then used it to predict the yi value.

• Because the new equation did not use any information from this data point, we get a clearer picture of the model's ability to predict it.

• The sum of these squared prediction errors is the PRESS statistic.


Prediction R2

• It sounds like a lot of work to do by hand, but most statistics packages will do it for you.

• You can then compute an R2-like measure called the prediction R2 (a leave-one-out sketch follows below):

$R^2_{PRED} = 1 - \dfrac{PRESS}{SST}$
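A minimal leave-one-out sketch of PRESS (six-point example; the helper fit_line is just the hand formula from Section 3.1):

```python
import numpy as np

def fit_line(x, y):
    """Least squares intercept and slope."""
    n = len(x)
    b1 = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / (np.sum(x ** 2) - np.sum(x) ** 2 / n)
    return y.mean() - b1 * x.mean(), b1

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([3, 2, 8, 8, 11, 13], dtype=float)

press = 0.0
for i in range(len(x)):
    keep = np.arange(len(x)) != i            # temporarily delete observation i
    b0, b1 = fit_line(x[keep], y[keep])      # refit without it
    press += (y[i] - (b0 + b1 * x[i])) ** 2  # squared prediction error for the held-out point

sst = np.sum((y - y.mean()) ** 2)
print(press, 1 - press / sst)                # PRESS and the prediction R2
```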


In Our Example

For the communications node data we have been using, SSE = 222594146, SST = 1973862521 and R2 = 88.7%.

Minitab reports that PRESS = 345066019.

Our prediction R2:

1 - (345066019/1973862521) = 1 - .175 = .825, or 82.5%

Although there is a little loss, it implies we still have good prediction ability.


3.6 Fitting a Linear Trend Model to Time-Series Data

• Data gathered on different units at the same point in time are called cross-sectional data.

• Data gathered on a single unit (person, firm, etc.) over a sequence of time periods are called time-series data.

• With this type of data, the primary goal is often building a model that can forecast the future.


Time Series Models

• There are many types of models that attempt to identify patterns of behavior in a time series in order to extrapolate it into the future.

• Some of these will be examined in Chapter 11, but here we will just employ a simple linear trend model.


The Linear Trend Model

We assume the series displays a steady upward or downward behavior over time that can be described by:

$y_t = \beta_0 + \beta_1 t + e_t$

where t is the time index (t = 1 for the first observation, t = 2 for the second, and so forth).

The forecast for this model is quite simple:

$\hat{y}_T = b_0 + b_1 T$

You just insert the appropriate value for T into the regression equation.
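A minimal sketch of fitting and forecasting a linear trend (the quarterly series here is made up; the actual ABX figures are not reproduced in the slides):

```python
import numpy as np

sales = np.array([210.0, 195.0, 220.0, 230.0, 225.0, 215.0, 240.0, 255.0])  # hypothetical data
t = np.arange(1, len(sales) + 1)       # time index: 1, 2, ..., n

b1, b0 = np.polyfit(t, sales, 1)       # slope and intercept of the trend line
print(b0, b1)

future_t = np.arange(len(sales) + 1, len(sales) + 5)   # next four periods
print(b0 + b1 * future_t)              # forecasts from the trend equation
```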


Example 3.11 ABX Company Sales

• The ABX Company sells winter sports merchandise including skates and skis. The quarterly sales (in $1000s) from first quarter 1994 through fourth quarter 2003 are graphed on the next slide.

• The time-series plot shows a strong upward trend. There are also some seasonal fluctuations, which will be addressed in Chapter 7.


[Time-series plot of quarterly SALES (roughly 200 to 300) against the observation index 1-40, titled "ABX Company Sales"]


Obtaining the Trend Equation

• We first need to create the time index variable, which is equal to 1 for first quarter 1994 and 40 for fourth quarter 2003.

• Once this is created we can obtain the trend equation by linear regression.


Trend Line Estimation

The regression equation is
SALES = 199 + 2.56 TIME

Predictor      Coef  SE Coef      T      P
Constant    199.017    5.128  38.81  0.000
TIME         2.5559   0.2180  11.73  0.000

S = 15.91   R-Sq = 78.3%   R-Sq(adj) = 77.8%

Analysis of Variance
Source          DF     SS     MS       F      P
Regression       1  34818  34818  137.50  0.000
Residual Error  38   9622    253
Total           39  44440


The Slope Coefficient

The slope in the equation is 2.5559. This implies that over this 10-year period, we saw an average growth in sales of $2,556 per quarter.

The hypothesis test on the slope has a t value of 11.73, so this is indeed significantly greater than zero.


Forecasts For 2004

• Forecasts for 2004 can be obtained by evaluating the equation at t = 41, 42, 43 and 44.

• For example, the sales in the fourth quarter are forecast as:

SALES = 199.017 + 2.5559(44) = 311.48

• A graph of the data, the estimated trend and the forecasts is next.


[Plot of SALES against TIME showing the data, the estimated trend line (solid) and the 2004 forecasts (dashed)]


3.7 Some Cautions in Interpreting Regression Results

Two common mistakes made when using regression analysis are:

1. Assuming that x causes y to happen, and

2. Assuming that you can use the equation to predict y for any value of x.


3.7.1 Association Versus Causality

• If you have a model with a high R2, it does not automatically mean that a change in x causes y to change in a very predictable way.

• It could be just the opposite, that y causes x to change. A high correlation goes both ways.

• It could also be that both y and x are changing in response to a third variable that we don't know about.


The Third Factor

• One example of this third factor is the price and gasoline mileage of automobiles. As price increases, there is a sharp drop in mpg. This is caused by size: larger cars cost more and get less mileage.

• Another is mortality rate in a country versus the percentage of homes with television. As TV ownership increases, mortality rate drops. This is probably due to better economic conditions improving quality of life and simultaneously allowing for greater TV ownership.


3.7.2 Forecasting Outside the Range of the Explanatory Variable

• When we have a model with a high R2, it means we know a good deal about the relationship of y and x for the range of x values in our study.

• Think of our communication nodes example, where the number of ports ranged from 12 to 68. Does our model even hold if we wanted to price a massive project of 200 ports?


An Extrapolation Penalty

• Recall that our prediction intervals were always narrowest when we predicted right in the middle of our data set.

• As we go farther and farther outside the range of our data, the interval gets wider and wider, implying we know less and less about what is going on.


Chapter 4: Multiple Regression Analysis (Part 1)

Terry Dielman, Applied Regression Analysis: A Second Course in Business and Economic Statistics, fourth edition


4.1 Using Multiple Regression

• In Chapter 3, the method of least squares was used to describe the relationship between a dependent variable y and an explanatory variable x.

• Here we extend that to two or more predictor variables, using an equation of the form:

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$


Basic Exploration

• In Chapter 3 our main graphic tool was the X-Y scatter plot.

• Exploratory graphics are a bit harder to produce here because they need to be multidimensional.

• Even if there were just two x variables, a 3-D display is needed.


Estimation of Coefficients

We want an equation of the form:

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$

As before we use least squares: the coefficients b0, b1, b2, ..., bk are determined by minimizing the sum of squared residuals.


Formulae are Very Complex

• We can show an exact formula when k = 1 (simple regression); refer to Section 3.1.

• Few texts show the formulae for k = 2 (the simplest of multiple regressions).

• Appendix D shows the formula in matrix notation.

• This is totally a computer problem (see the sketch below).
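For illustration, a minimal sketch of what the software does (the small data set here is made up, not the Meddicorp file):

```python
import numpy as np

# Made-up observations: y and two predictors
y  = np.array([10.0, 12.0, 15.0, 18.0, 21.0, 24.0])
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares: choose b0, b1, b2 to minimize the sum of squared residuals
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)    # [b0, b1, b2]
```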


Example 4.1 Meddicorp Sales

n = 25 sales territories

Y = Sales ($1000s) in each territory

X1 = Advertising ($100s) in the territory

X2 = Amount of bonuses ($100s) paid to salespersons in the territory

Data set: MEDDICORP4


Plots and Correlations

[Scatter plots of SALES versus ADV and SALES versus BONUS]

Correlations
        SALES    ADV
ADV     0.900
BONUS   0.568  0.419


3D Graphics

[3-D scatter plot of SALES against ADV and BONUS, titled "Meddicorp Sales"]


Minitab Regression Output

The regression equation is
SALES = -516 + 2.47 ADV + 1.86 BONUS

Predictor      Coef  SE Coef      T      P
Constant     -516.4    189.9  -2.72  0.013
ADV          2.4732   0.2753   8.98  0.000
BONUS        1.8562   0.7157   2.59  0.017

S = 90.75   R-Sq = 85.5%   R-Sq(adj) = 84.2%

Analysis of Variance
Source          DF       SS      MS      F      P
Regression       2  1067797  533899  64.83  0.000
Residual Error  22   181176    8235
Total           24  1248974


3D Surface Graph

[3-D surface plot of estimated sales over ADV and BONUS, titled "Estimated Meddicorp Sales"]

The regression equation is
SALES = -516 + 2.47 ADV + 1.86 BONUS


Interpretation of Coefficients

The regression equation is
SALES = -516 + 2.47 ADV + 1.86 BONUS

• Recall that sales is in $1000s and advertising and bonus are in $100s.

• If advertising is held fixed, sales increase $1,860 for each $100 of bonus paid.

• If bonus is held fixed, sales increase $2,470 for each $100 spent on advertising.


4.2 Inferences From a Multiple Regression Analysis

In general, the population regression equation involving K predictors is:

$\mu_{y|x_1, x_2, \ldots, x_K} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_K x_K$

This says the mean value of y at a given set of x values is a point on the surface described by the terms on the right-hand side of the equation.


4.2.1 Assumptions Concerning the Population Regression Line

An alternative way of writing the relationship is:

$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_K x_{Ki} + e_i$

where i denotes the ith observation and ei denotes a random error or disturbance (deviation from the mean).

We make certain assumptions about the ei.


Assumptions

1. We expect the average disturbance ei to be zero so the regression line passes through the average value of y.

2. The ei have constant variance σe2.

3. The ei are normally distributed.

4. The ei are independent.


Inferences

• The assumptions allow inferences about the population relationship to be made from a sample equation.

• The first inferences considered are those about the individual population coefficients β1, β2, ..., βK.

• Chapter 6 examines what happens when the assumptions are violated.


4.2.2 Inferences about the Population Regression Coefficients

If we wish to make an estimate of the effect of a change in one of the x variables on y, use the interval:

$b_j \pm t_{n-K-1}\, S_{b_j}$

This refers to the jth of the K+1 regression coefficients. The multiplier t is selected from the t-distribution with n-K-1 degrees of freedom.


Tests About the Coefficients

A test about the marginal effect of xj on y may be obtained from:

H0: βj = βj*
Ha: βj ≠ βj*

where βj* is some specific value that is relevant for the jth coefficient.


Test Statistic

The test would be performed by using the standardized test statistic:

$t = \dfrac{b_j - \beta_j^*}{S_{b_j}}$

The most common form of this test is for the parameter to be 0. In this case the test statistic is just the estimate divided by its standard error.


Example 4.2 Meddicorp (Continued)

Refer again to the portion of the regression output about the individual regression coefficients:

Predictor      Coef  SE Coef      T      P
Constant     -516.4    189.9  -2.72  0.013
ADV          2.4732   0.2753   8.98  0.000
BONUS        1.8562   0.7157   2.59  0.017

This lists the estimates, their standard errors and the ratio of the estimates to their standard errors.


Tests For Effect of Advertising

To see if an increase in advertising expenditure affects sales, we can test:

H0: βADV = 0 (An increase in advertising has no effect on sales)

Ha: βADV ≠ 0 (Sales do change when advertising increases)

The df are n-K-1 = 25-2-1 = 22. At a 5% significance level, the critical point from the t-table is 2.074.


Test Result

From the output we get:

t = ( 2.4732 – 0)/.2753 = 8.98

This is above the critical value of 2.074, so we reject H0.

Note that we could also make use of the p-value (.000) for the test.


One-Sided Test on Bonus

We can modify the test to make it one-sided:

H0: βBONUS = 0 (Increased bonuses do not affect sales)

Ha: βBONUS > 0 (Sales increase when bonuses are higher)

At a 5% significance level, the (one-sided) critical point is 1.717.


One-Sided Test Result

From the output we get:

t = 1.8562/.7157 = 2.59 which is > 1.717

We reject H0 but this time make a more specific conclusion.

The listed p-value (.017) is for a two-sided test. For our one-sided test, cut it in half.


Interval Estimate for the Effect of Advertising

Recall that sales are measured in $1000s and ADV in $100s.

bADV = 2.4732, with standard error .2753

2.4732 ± 2.074(.2753) = 2.4732 ± .5709 = 1.902 to 3.044

Each $100 spent on advertising returns $1,902 to $3,044 in sales.
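A short sketch (coefficient and standard error from the Minitab output; n = 25 and K = 2) reproducing this interval with n-K-1 degrees of freedom:

```python
from scipy import stats

n, K = 25, 2
b_adv, se_adv = 2.4732, 0.2753                 # ADV coefficient and its standard error

t_crit = stats.t.ppf(0.975, df=n - K - 1)      # 22 df, about 2.074
print(b_adv - t_crit * se_adv, b_adv + t_crit * se_adv)   # roughly 1.902 to 3.044
```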


4.3 Assessing the Fit of the Model

Recall how we partitioned the variation in the previous chapter:

SST = Total variation in the sample of Y values, split up into two components, SSE and SSR

SSE = Error or unexplained variation

SSR = Variation explained by the Y-hat function


4.3.1 The ANOVA Table and R2

• These are the same statistics we briefly examined in simple regression.

• They are perhaps more important here because they measure how well all the variables in the equation work together.

S = 90.75   R-Sq = 85.5%   R-Sq(adj) = 84.2%

Analysis of Variance
Source          DF       SS      MS      F      P
Regression       2  1067797  533899  64.83  0.000
Residual Error  22   181176    8235
Total           24  1248974


R2 – a Universal Measure of Fit

R2 = SSR/SST = proportion of variation explained by the regression equation.

If multiplied by 100, interpret it as a percentage.

If there is only one x, R2 is the square of the correlation between y and x.

For multiple regression, R2 is the square of the correlation between the Y values and the Y-hat values.


For our example

S = 90.75   R-Sq = 85.5%   R-Sq(adj) = 84.2%

Analysis of Variance
Source          DF       SS      MS      F      P
Regression       2  1067797  533899  64.83  0.000
Residual Error  22   181176    8235
Total           24  1248974

R2 = 1067797/1248974 = .85494

85.5% of the variation in sales in the 25 territories is explained by the different levels of advertising and bonus.


Adjusted R2

• If there are many predictor variables to choose from, the best R2 is always obtained by throwing them all in the model.

• Some of these predictors could be insignificant, suggesting they contribute little to the model's R2.

• Adjusted R2 is a way to balance the desire for a high R2 against the desire to include only important variables.


Computation

The "adjustment" is for the number of variables in the model.

Although regular R2 may decrease when you remove a variable, the adjusted version may actually increase if that variable did not have much significance.

)1/(

)1/(12

−−−=

nSST

KnSSERadj
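A quick check of the formula (SSE and SST taken from the Meddicorp ANOVA table):

```python
sse, sst, n, K = 181176, 1248974, 25, 2

r_sq = 1 - sse / sst
r_sq_adj = 1 - (sse / (n - K - 1)) / (sst / (n - 1))
print(r_sq, r_sq_adj)   # about 0.855 and 0.842, matching the Minitab output
```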


4.3.2 The F Statistic

• Since R2 is so high, you would certainly think that the model contains significant predictive power.

• In other problems it is perhaps not so obvious. For example, would an R2 of 20% show any prediction ability at all?

• We can test for the predictive power of the entire model using the F statistic.


F Tests

• Generally these compare two sources of variation.

• F = V1/V2 and has two df parameters.

• Here V1 = SSR/K has K df.

• And V2 = SSE/(n-K-1) has n-K-1 df.


F Tables

Usually you will see several pages of these, one or two pages at each specific level of significance (.10, .05, .01). The rows are indexed by denominator d.f. and the columns by numerator d.f.; each entry is the value of F at that specific significance level.


F Test Hypotheses

H0: β1 = β2 = … = βK = 0 (None of the Xs help explain Y)

Ha: Not all βs are 0 (At least one X is useful)

H0: R2 = 0 is an equivalent hypothesis.


F test for our example

S = 90.75   R-Sq = 85.5%   R-Sq(adj) = 84.2%

Analysis of Variance
Source          DF       SS      MS      F      P
Regression       2  1067797  533899  64.83  0.000
Residual Error  22   181176    8235
Total           24  1248974

F = 533899/8235 = 64.83, which has p-value = 0.000

From tables, F2,22,.05 = 3.44 and F2,22,.01 = 5.72

This confirms that R2 = 85.5% is not near zero.
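The same F test in a short sketch (sums of squares from the ANOVA table; scipy supplies the critical values and p-value):

```python
from scipy import stats

ssr, sse, n, K = 1067797, 181176, 25, 2

f_stat = (ssr / K) / (sse / (n - K - 1))       # MSR / MSE, about 64.8
print(f_stat)
print(stats.f.ppf(0.95, K, n - K - 1))         # about 3.44
print(stats.f.ppf(0.99, K, n - K - 1))         # about 5.72
print(stats.f.sf(f_stat, K, n - K - 1))        # p-value, essentially 0
```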

Chapter 4: Multiple Regression Analysis (Part 2)

Terry Dielman, Applied Regression Analysis: A Second Course in Business and Economic Statistics, fourth edition


4.4 Comparing Two Regression Models

So far we have looked at two types of hypothesis tests. One was about the overall fit:

H0: β1 = β2 = …= βK = 0

The other was about individual terms:

H0: βj = 0

Ha: βj ≠ 0

4.4.1 Full and Reduced Model Using Separate Regressions

• Suppose we wanted to test a subset of the x variables for significance as a group.

• We could do this by comparing two models.

• The first (Full Model) has K variables in it.

• The second (Reduced Model) contains only the L variables that are NOT in our group.

The Two Models

For convenience, let's assume the group is the last (K - L) variables. The Full Model is:

y = β0 + β1x1 + … + βLxL + βL+1xL+1 + … + βKxK + e

The Reduced Model is just:

y = β0 + β1x1 + … + βLxL + e


The Partial F Test

We test the group for significance with another F test. The hypothesis is:

H0: βL+1 = βL+2 = …= βK = 0

Ha: At least one β ≠ 0

The test is performed by seeing how much SSE changes between models.

The Partial F Statistic

Let SSEF and SSER denote the SSE in the full and reduced models.

F = [(SSER - SSEF) / (K - L)] / [SSEF / (n - K - 1)]

The statistic has (K-L) numerator and (n-K-1) denominator d.f.
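If the two models are fit in software, the partial F test drops out directly. A hedged sketch in Python with statsmodels, assuming a DataFrame df holding the Meddicorp variables (SALES, ADV, BONUS, MKTSHR, COMPET), which is not shown in these slides:

import statsmodels.formula.api as smf

# df is assumed to hold the Meddicorp data
full    = smf.ols("SALES ~ ADV + BONUS + MKTSHR + COMPET", data=df).fit()
reduced = smf.ols("SALES ~ ADV + BONUS", data=df).fit()

# Partial F test: does the MKTSHR/COMPET group add anything beyond ADV and BONUS?
F, p_value, df_diff = full.compare_f_test(reduced)

# The same statistic by hand from the two SSEs
K, L, n = 4, 2, len(df)
F_manual = ((reduced.ssr - full.ssr) / (K - L)) / (full.ssr / (n - K - 1))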

The "Group"

� In many problems the group of variables has a natural definition.

� In later chapters we look at groups that provide curvature, measure location and model seasonal variation.

� Here we are just going to look at the effect of adding two new variables.


Example 4.4 Meddicorp (yet again)

In addition to the variables for advertising and bonuses paid, we now consider variables for market share and competition.

x3 = Meddicorp market share in each area

x4 = largest competitor's sales in each area

The New Regression Model

The regression equation is

SALES = - 594 + 2.51 ADV + 1.91 BONUS + 2.65 MKTSHR - 0.121 COMPET

Predictor Coef SE Coef T P

Constant -593.5 259.2 -2.29 0.033

ADV 2.5131 0.3143 8.00 0.000

BONUS 1.9059 0.7424 2.57 0.018

MKTSHR 2.651 4.636 0.57 0.574

COMPET -0.1207 0.3718 -0.32 0.749

S = 93.77 R-Sq = 85.9% R-Sq(adj) = 83.1%

Analysis of Variance

Source DF SS MS F P

Regression 4 1073119 268280 30.51 0.000

Residual Error 20 175855 8793

Total 24 1248974

Did We Gain Anything?

� The old model had R2 = 85.5% so we gained only .4%.

� The t ratios for the two new variables are .57 and -.32.

� It does not look like we have an improvement, but we really need the F test to be sure.


The Formal Test

Numerator df = (K - L) = 4 - 2 = 2
Denominator df = (n - K - 1) = 20

At a 5% level, F2,20 = 3.49

H0: βMKTSHR = βCOMPET = 0
Ha: At least one is ≠ 0

Reject H0 if F > 3.49

Things We Need

Full Model (K = 4): SSEF = 175855, n - K - 1 = 20

Reduced Model (L = 2): Analysis of Variance

Source DF SS MS F P

Regression 2 1067797 533899 64.83 0.000

Residual Error 22 181176 8235

Total 24 1248974

SSER = 181176

Computations

F = [(SSER - SSEF) / (K - L)] / [SSEF / (n - K - 1)]
  = [(181176 - 175855) / (4 - 2)] / [175855 / (25 - 4 - 1)]
  = (5321 / 2) / 8793 = 0.3026
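A quick check of this arithmetic, and of how far the statistic falls below the 5% cutoff (scipy assumed):

from scipy import stats

SSE_R, SSE_F = 181176, 175855
K, L, n = 4, 2, 25

F = ((SSE_R - SSE_F) / (K - L)) / (SSE_F / (n - K - 1))
print(F)                                    # about 0.30
print(stats.f.ppf(0.95, K - L, n - K - 1))  # critical value, about 3.49
print(stats.f.sf(F, K - L, n - K - 1))      # p-value, far above .05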


4.4.2 Full and Reduced Model Comparisons Using Conditional Sums of Squares

� In the standard ANOVA table, SSR shows the amount of variation explained by all variables together.

� Alternate forms of the table break SSR down into components.

� For example, Minitab shows sequential SSR which shows how much SSR increases as each new term is added.

Sequential SSR for Meddicorp

S = 93.77 R-Sq = 85.9% R-Sq(adj) = 83.1%

Analysis of Variance

Source DF SS MS F P

Regression 4 1073119 268280 30.51 0.000

Residual Error 20 175855 8793

Total 24 1248974

Source DF Seq SS

ADV 1 1012408

BONUS 1 55389

MKTSHR 1 4394

COMPET 1 927

Meaning What?

1. If ADV was added to the model first, SSR would rise from 0 to 1012408.

2. Addition of BONUS would yield a nice increase of 55389.

3. If MKTSHR entered third, SSR would rise a paltry 4394.

4. Finally, if COMPET came in last, SSR would barely budge by 927.


Implications

� This is another way of showing that once you account for advertising and bonuses paid, you do not get much more from the last two variables.

� The last two sequential SSR values add up to 5321, which was the same as the (SSER – SSEF) quantity computed in the partial F test.

� Given that, it is not surprising to learn that the partial F test can be stated in terms of sequential sums of squares.

4.5 Prediction With a Multiple Regression Equation

As in simple regression, we will look at two types of computations:

1. Estimating the mean y that can occur at a set of x values.

2. Predicting an individual value of y that can occur at a set of x values.

4.5.1 Estimating the Conditional Mean of y Given x1, x2, ..., xK

This is our estimate of the point on our regression surface that occurs at a specific set of x values.

For two x variables, we are estimating:

µy|x1,x2 = β0 + β1x1 + β2x2


Computations

The point estimate is straightforward, just plug in the x values.

The difficult part is computing a standard error to use in a confidence interval. Thankfully, most computer programs can do that.

ŷm = b0 + b1x1 + b2x2

4.5.2 Predicting an Individual Value of y Given x1, x2, ..., xK

Now the quantity we are trying to estimate is:

Our interval will have to account for the extra term ( ei ) in the equation, thus will be wider than the interval for the mean.

yi = β0 + β1x1i + β2x2i + ei

Prediction in Minitab

Here we predict sales for a territory with 500 units of advertising and 250 units of bonus

Predicted Values for New Observations

New Obs Fit SE Fit 95.0% CI 95.0% PI

1 1184.2 25.2 (1131.8, 1236.6) ( 988.8, 1379.5)

Values of Predictors for New Observations

New Obs ADV BONUS

1 500 250
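The same two intervals can be produced with statsmodels; a sketch assuming a DataFrame df with the Meddicorp SALES, ADV and BONUS columns (the data file itself is not shown in these slides):

import pandas as pd
import statsmodels.formula.api as smf

model = smf.ols("SALES ~ ADV + BONUS", data=df).fit()   # df assumed as above

new = pd.DataFrame({"ADV": [500], "BONUS": [250]})
pred = model.get_prediction(new).summary_frame(alpha=0.05)

# mean_ci_lower / mean_ci_upper -> 95% CI for the conditional mean
# obs_ci_lower  / obs_ci_upper  -> 95% PI for an individual territory
print(pred[["mean", "mean_ci_lower", "mean_ci_upper",
            "obs_ci_lower", "obs_ci_upper"]])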


Interpretations

We are 95% sure that the average sales in territories with $50,000 advertising and $25,000 of bonuses will be between $1,131,800 and $1,236,600.

We are 95% sure that any individual territory with this level of advertising and bonuses will have between $988,800 and $1,379,500 of sales.

4.6 Multicollinearity: A Potential Problemin Multiple Regression

� In multiple regression, we like the x variables to be highly correlated with y because this implies good prediction ability.

� If the x variables are highly correlated among themselves, however, much of this prediction ability is redundant.

� Sometimes this redundancy is so severe that it causes some instability in the coefficient estimation. When that happens we say multicollinearity has occurred.

4.6.1 Consequences of Multicollinearity

1. The standard errors of the bj are larger than they should be. This could cause all the t statistics to be near 0 even though the F is large.

2. It is hard to get good estimates of the βj. The bj may have the wrong sign. They may have large changes in value if another variable is dropped from or added to the regression.


4.6.2 Detecting Multicollinearity

Several methods appear in the literature. Some of these are:

1. Examining pairwise correlations

2. Seeing large F but small t ratios

3. Computing Variance Inflation Factors

Examining Pairwise Correlations

� If it is only a pairwise collinearity problem, you can detect it by examining the correlations for pairs of x values.

� How large the correlation needs to be before it suggests a problem is debatable. One rule of thumb is .5, another is the maximum correlation between y and the various x values.

� The major limitation of this is that it will not help if there is a linear relationship involving several x values, for example,

x1 = 2x2 - .07x3 + a small random error

Large F, Small t

� With a significant F statistic you would expect to see at least one significant predictor, but that may not happen if all the variables are fighting each other for significance.

� This method of detection may not work if there are, say, six good predictors but the multicollinearity only involves four of them.

� This method also may not help identify what variables are involved.


Variance Inflation Factors

� This is probably the most reliable method for detection because it both shows the problem exists and what variables are involved.

� We can compute a VIF for each variable. A high VIF is an indication that the variable's standard error is "inflated" by its relationship to the other x variables.

Auxiliary Regressions

Suppose we regressed each x value, in turn, on all of the other x variables.

Let Rj2 denote the model's R2 we get

when xj was the "temporary y".

The variable's VIF is VIFj = 1 / (1 - Rj²)
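Most packages do the auxiliary regressions internally. A sketch with statsmodels, assuming X_predictors is a DataFrame holding only the x variables (an assumed name, not from the slides):

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(X_predictors)    # e.g. ADV, BONUS, MKTSHR, COMPET
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}
print(vifs)                          # VIF near 1 means little redundancy with the other x's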

VIFj and Rj2

If xj were totally uncorrelated with the other x variables, its VIF would be 1.

This table shows some other values.

Rj²    VIFj
 0%      1
50%      2
80%      5
90%     10
99%    100


Auxiliary Regressions: A Lot of Work?

� If there were a large number of x variables in the model, obtaining the auxiliaries would be tedious.

� Most statistics packages will compute the VIF statistics for you and report them with the coefficient output.

� You can then do the auxiliary regressions, if needed, for the variables with high VIF.

Using VIFs

� A general rule is that any VIF > 10 is a problem.

� Another is that if the average VIF is considerably larger than 1, SSE may be inflated.

� The average VIF indicates how many times larger SSE is, because of multicollinearity, than it would be if the predictors were uncorrelated.

� Freund and Wilson suggest comparing the VIF to 1/(1-R2) for the main model. If the VIF are less than this, multicollinearity is not a problem.

Our Example

Pairwise correlations

The maximum correlation among the x variables is .452 so if multicollinearity exists it is well hidden.

Correlations: SALES, ADV, BONUS, MKTSHR, COMPET

SALES ADV BONUS MKTSHR

ADV 0.900

BONUS 0.568 0.419

MKTSHR 0.023 -0.020 -0.085

COMPET 0.377 0.452 0.229 -0.287


VIFs in Minitab

The regression equation is

SALES = - 594 + 2.51 ADV + 1.91 BONUS + 2.65 MKTSHR - .121 COMPET

Predictor Coef SE Coef T P VIF

Constant -593.5 259.2 -2.29 0.033

ADV 2.5131 0.3143 8.00 0.000 1.5

BONUS 1.9059 0.7424 2.57 0.018 1.2

MKTSHR 2.651 4.636 0.57 0.574 1.1

COMPET -0.1207 0.3718 -0.32 0.749 1.4

S = 93.77 R-Sq = 85.9% R-Sq(adj) = 83.1%

No Problem!

4.6.3 Correction for Multicollinearity

� One solution would be to leave out one or more of the redundant predictors.

� Another would be to use the variables differently. If x1 and x2 are collinear, you might try using x1 and the ratio x2/x1 instead.

� Finally, there are specialized statistical procedures that can be used in place of ordinary least squares.

4.7 Lagged Variables as Explanatory Variables in Time-Series Regression

� When using time series data in a regression, the relationship between y and x may be concurrent, or x may serve as a leading indicator.

� In the latter, a past value of x appears as a predictor, either with or without the current value of x.

� An example would be the relationship between housing starts as y and interest rates as x. When rates drop, it is several months before housing starts increase.


Lagged Variables

The effect of advertising on sales is often cumulative, so it would not be surprising to see it modeled as:

yt = β0 + β1xt + β2xt-1 + β3xt-2 + et

Here xt is advertising in the current month and the lagged variables xt-1 and xt-2 represent advertising in the two previous months.

Potential Pitfalls

� If several lags of the same variable are used, it could cause multicollinearity if xt was highly autocorrelated (correlated with its own past values).

� Lagging causes lost data. If xt-2 is included in the model, the first time it can be computed is at time period t = 3. We lose any information in the first two observations.

Lagged y Values

� Sometimes a past value of y is used as a predictor as well. A relationship of this type might be:

� This implies that this month's sales yt are related to two months of advertising expense, xt and xt-1, plus last month's sales yt-1.

yt = β0 + β1yt-1 + β2xt + β3xt-1 + et
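Building the lags is usually a one-line operation. A sketch with pandas and statsmodels, assuming a DataFrame df with monthly columns sales and adv in time order (hypothetical names, not from the slides):

import statsmodels.formula.api as smf

df["adv_lag1"]   = df["adv"].shift(1)     # x_{t-1}
df["adv_lag2"]   = df["adv"].shift(2)     # x_{t-2}
df["sales_lag1"] = df["sales"].shift(1)   # y_{t-1}

# Distributed-lag model: y_t on x_t, x_{t-1}, x_{t-2}
dist_lag = smf.ols("sales ~ adv + adv_lag1 + adv_lag2", data=df.dropna()).fit()

# Model with a lagged dependent variable: y_t on y_{t-1}, x_t, x_{t-1}
lagged_y = smf.ols("sales ~ sales_lag1 + adv + adv_lag1", data=df.dropna()).fit()
# dropna() discards the first rows lost to lagging, as described above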


Example 4.6 Unemployment Rate

� The file UNEMP4 contains the national unemployment rates (seasonally-adjusted) from January 1983 through December 2002.

� On the next few slides are a time series plot of the data and regression models employing first and second lags of the rates.

Time Series Plot
[Time series plot of UNEMP versus Date/Time]
Autocorrelation is .97 at lag 1 and .94 at lag 2

Regression With First Lag

The regression equation is

UNEMP = 0.153 + 0.971 Unemp1

239 cases used 1 cases contain missing values

Predictor Coef SE Coef T P

Constant 0.15319 0.04460 3.44 0.001

Unemp1 0.971495 0.007227 134.43 0.000

S = 0.1515 R-Sq = 98.7% R-Sq(adj) = 98.7%

Analysis of Variance

Source DF SS MS F P

Regression 1 414.92 414.92 18070.47 0.000

Residual Error 237 5.44 0.02

Total 238 420.36

High R2 because of autocorrelation


Regression With Two Lags

The regression equation is

UNEMP = 0.168 + 0.890 Unemp1 + 0.0784 Unemp2

238 cases used 2 cases contain missing values

Predictor Coef SE Coef T P VIF

Constant 0.16764 0.04565 3.67 0.000

Unemp1 0.89032 0.06497 13.70 0.000 77.4

Unemp2 0.07842 0.06353 1.23 0.218 77.4

S = 0.1514 R-Sq = 98.7% R-Sq(adj) = 98.6%

Analysis of Variance

Source DF SS MS F P

Regression 2 395.55 197.77 8630.30 0.000

Residual Error 235 5.39 0.02

Total 237 400.93

Comments

� It does not appear that the second lag term is needed. Its t statistic is 1.23.

� Because we got R2 = 98.7% from the model with just one term, there was not much variation left for the second lag term to explain.

� Note that the second model also had a lot of multicollinearity.

Fitting Curves to Data ١٧١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Chapter 5: Fitting Curves to Data

Terry Dielman
Applied Regression Analysis: A Second Course in Business and Economic Statistics, fourth edition


Fitting Curves to Data ١٧٢

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

5.1 Introduction

In Chapter 4 , the model was presented as:

where we assumed linear relationships between y and the x variables.

In this chapter we find that this may not be true and consider curvilinear relationships between the variables.

yi = β0 + β1x1i + β2x2i + … + βKxKi + ei

Fitting Curves to Data ١٧٣

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Modeling

� In general, we regress Y on some function of X that is not linear.

� Common functions are X², 1/X, or log(X)

� In economics, sometimes regress log(y) on log(x)

Fitting Curves to Data ١٧٤

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

5.2 Fitting Curvilinear Relationships

� Polynomial Regression – a common correction for nonlinearity is to add powers of the explanatory variable

� In practice a second-order model is often sufficient to describe the relationship

yi = β0 + β1xi + β2xi² + … + βkxi^k + ei
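Fitting a polynomial is just a multiple regression on powers of x. A sketch with statsmodels, assuming a DataFrame df with columns y and x (hypothetical names):

import statsmodels.formula.api as smf

# I(x**2) adds the squared term, giving the second-order model above
quad = smf.ols("y ~ x + I(x**2)", data=df).fit()
print(quad.params)      # b0, b1, b2
print(quad.rsquared)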


Fitting Curves to Data ١٧٥

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example 5.1: Telemarketing

n = 20 telemarketing employees

Y = average calls per day over 20 workdays

X = Months on the job

Data set TELEMARKET5

Fitting Curves to Data ١٧٦

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Plot of Calls versus Months

[Scatterplot of CALLS versus MONTHS]

There is an increase in calls with experience, but the rate of increase slows over time.

Fitting Curves to Data ١٧٧

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Fit of a First-Order Model

� For comparison purposes, we first fit the linear equation and obtained:

CALLS = 13.6708 + .7435 MONTHS

� This equation, which has an R2 of 87.4%, implies that each month of experience leads to .7435 more calls per day.


Fitting Curves to Data ١٧٨

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Fitting a Second-Order Model

Regression Plot
[Scatterplot of CALLS versus MONTHS with the fitted quadratic curve]

CALLS = -0.140471 + 2.31020 MONTHS - 0.0401182 MONTHS**2
S = 1.00325 R-Sq = 96.2% R-Sq(adj) = 95.8%

Fitting Curves to Data ١٧٩

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Regression Output

Regression Analysis: CALLS versus MONTHS, MonthSQ

The regression equation is

CALLS = - 0.14 + 2.31 MONTHS - 0.0401 MonthSQ

Predictor Coef SE Coef T P

Constant -0.140 2.323 -0.06 0.952

MONTHS 2.3102 0.2501 9.24 0.000

MonthSQ -0.040118 0.006333 -6.33 0.000

S = 1.003 R-Sq = 96.2% R-Sq(adj) = 95.8%

Analysis of Variance

Source DF SS MS F P

Regression 2 437.84 218.92 217.50 0.000

Residual Error 17 17.11 1.01

Total 19 454.95

Fitting Curves to Data ١٨٠

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Hypothesis Test on β2

H0: β2 = 0 (use the linear equation)
Ha: β2 ≠ 0 (the quadratic improves the fit)

Test as usual with t = b2/SE(b2)

Here t = -.0402/.00633 = -6.33 is significant with p-value = .000

Not surprising, since R2 increased by about 9 percentage points (from 87.4% to 96.2%)


Fitting Curves to Data ١٨١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Hypothesis Tests "Top Down"

� The usual practice is to keep lower-order terms when a high-order term is significant.

� In b0 + b1 x + b2 x2 we would retain the b1 term even if it had an insignificant t-ratio, if the b2 term was significant.

Fitting Curves to Data ١٨٢

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Higher and higher?

� To see whether an even higher-order polynomial is needed, we fit a cubic equation.

� The table below shows the second-order model was sufficient.

Model      p for highest-order term   R2      Adj R2   Se
Linear     0.000                       87.4%   86.7%    1.787
Quadratic  0.000                       96.2%   95.8%    1.003
Cubic      0.509                       96.3%   95.7%    1.020

Fitting Curves to Data ١٨٣

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Centering the X

� When polynomial regression is used, multicollinearity often results because x and x2 are correlated.

� This can be eliminated by subtracting x-bar (the mean) from each x

Use (x - x̄) and (x - x̄)²
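A small numeric illustration of why centering helps, with made-up x values (not from the text):

import numpy as np

x = np.array([10.0, 12.0, 15.0, 18.0, 22.0, 25.0, 28.0, 30.0])  # hypothetical
xc = x - x.mean()                                               # centered predictor

print(np.corrcoef(x, x**2)[0, 1])     # close to 1: x and x^2 nearly collinear
print(np.corrcoef(xc, xc**2)[0, 1])   # much closer to 0 after centering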


Fitting Curves to Data ١٨٤

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

5.2.2 Reciprocal Transformation of the x Variable

� Another curvilinear relationship that is in common use is:

� Here y and x are inversely related but the relationship is not linear.

yi = β0 + β1(1/xi) + ei

Fitting Curves to Data ١٨٥

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example 5.2

� We are interested in the relationship between gas mileage and a car's horsepower.

� On the next page is a plot of the highway mpg (HWYMPG) and horsepower (HP) for 147 cars listed in the October 2002 Road and Track.

Fitting Curves to Data ١٨٦

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Highway MPG versus Horsepower
[Scatterplot of HWYMPG versus HP]


Fitting Curves to Data ١٨٧

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Modeling the Relationship

� A regression of HWYMPG on HP yields HWYMPG = 38.73 - .0477 HP with R2 = 59.4%

� This does not fit too well because as horsepower increases, mileage decreases, but the rate of decrease is slower for more-powerful cars.

� Although other models, including a quadratic, might work, we regressed HWYMPG on 1/HP.
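The reciprocal fit only requires creating the transformed variable first. A sketch assuming a DataFrame cars with columns HWYMPG and HP (the data file itself is not shown here):

import statsmodels.formula.api as smf

cars["HPINV"] = 1.0 / cars["HP"]          # reciprocal of horsepower
recip = smf.ols("HWYMPG ~ HPINV", data=cars).fit()
print(recip.params, recip.rsquared)       # the slides report R-sq of about 80%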

Fitting Curves to Data ١٨٨

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Regression Results

The regression equation is

HWYMPG = 13.6 + 2692 HPINV

Predictor Coef SE Coef T P

Constant 13.6310 0.6493 20.99 0.000

HPINV 2692.4675 111.7526 24.09 0.000

S = 2.93107 R-Sq = 80.0% R-Sq(adj) = 79.9%

Analysis of Variance

Source DF SS MS F P

Regression 1 4987.0 4987.0 580.48 0.000

Residual Error 145 1245.1 8.6

Total 146 6232.7

Fitting Curves to Data ١٨٩

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Data and Reciprocal Fit
[Scatterplot of HWYMPG versus HP with the fitted reciprocal curve]


Fitting Curves to Data ١٩٠

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

5.2.3 Log Transformation of the x Variable

� Yet another curvilinear equation is:

where ln(x) is the natural logarithm of x.

� It is assumed that the x values are positive because ln(0) is undefined.

yi = β0 + β1 ln(xi) + ei

Fitting Curves to Data ١٩١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example 5.4 Fuel Consumption

n = 51 (50 states plus Washington, D.C.)

FUELCON = fuel consumption per capita

POP = state population

AREA = area of state in square miles

POPDENS = population density

Data Set FUELCON5

Fitting Curves to Data ١٩٢

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Plot of Fuelcon versus Density
[Scatterplot of FUELCON versus DENSITY]
r = -.454


Fitting Curves to Data ١٩٣

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Effect of the Transformation

� The graph has one point (D.C.) on the right with all others clumped to the left.

� It is hard to see what type of relationship there is until some adjustments are made.

� Here take the natural log of density to "pull" the extreme point back in.

Fitting Curves to Data ١٩٤

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Consumption versus Logdensity
[Scatterplot of FUELCON versus LogDensity]
r = -.527

Fitting Curves to Data ١٩٥

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Linear and Log Regressions

The regression equation is

FUELCON = 495 - 0.025 DENSITY

Predictor Coef SE Coef T P

Constant 495.628 9.481 52.28 0.000

DENSITY -0.025 0.007 -3.56 0.001

S = 65.1675 R-Sq = 20.6% R-Sq(adj) = 19.0%

The regression equation is

FUELCON = 597 – 24.5 LOGDENS

Predictor Coef SE Coef T P

Constant 597.19 29.96 22.15 0.000

LOGDENS -24.53 5.65 -4.34 0.000

S = 62.1561 R-Sq = 27.8% R-Sq(adj) = 26.3%


Fitting Curves to Data ١٩٦

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

5.2.4 Log Transformations of Both the y and x Variables

� Here the natural log of y is the dependent variable and the natural log of x is the independent variable:

� Comparing results with other models may be difficult since we are not modeling y itself.

� Economists sometimes use this to estimate price elasticity (y is demand and x is price; b1 is the estimated elasticity).

ln(yi) = β0 + β1 ln(xi) + ei
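A sketch of the log-log fit with statsmodels, assuming a DataFrame df with positive IMPORTS and GDP columns as in the example that follows (the data file itself is not shown):

import numpy as np
import statsmodels.formula.api as smf

df["LogImp"] = np.log(df["IMPORTS"])
df["LogGDP"] = np.log(df["GDP"])

loglog = smf.ols("LogImp ~ LogGDP", data=df).fit()
elasticity = loglog.params["LogGDP"]   # slope b1 estimates the elasticity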

Fitting Curves to Data ١٩٧

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example 5.4 Imports and GDP

The gross domestic product (GDP) and dollar amount of total imports (IMPORTS) for 25 countries were obtained from the World Fact Book.

For both variables, low values clump together and higher values spread out, suggesting log transformations for both.

Fitting Curves to Data ١٩٨

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Scatterplot of Imports vs GDP
[Scatterplot of IMPORTS versus GDP]


Fitting Curves to Data ١٩٩

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Scatterplot of LogImp vs LogGDP
[Scatterplot of LogImp versus LogGDP]

Fitting Curves to Data ٢٠٠

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Two Regression Models

Regression Analysis: IMPORTS versus GDP

Predictor Coef SE Coef T P

Constant 22.32 19.24 1.16 0.258

GDP 0.105671 0.008452 12.50 0.000

S = 87.00 R-Sq = 87.2% R-Sq(adj) = 86.6%

Regression Analysis: LogImp versus LogGDP

Predictor Coef SE Coef T P

Constant -1.1275 0.4346 -2.59 0.016

LogGDP 0.86703 0.07877 11.01 0.000

S = 0.9142 R-Sq = 84.0% R-Sq(adj) = 83.4%

Not directly comparable

Fitting Curves to Data ٢٠١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

The R2 Compare Different Things

� The 87.2 % R2 for the no-log model is the percentage of variation in Imports explained.

� The 84.0% for the second model is the percentage of variation in ln(Imports) explained.

� If you converted the fitted values of the second model back to Imports you might find the log model better.


Fitting Curves to Data ٢٠٢

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

What Transformation to Use

� It is probably best to try several.

� A quadratic is the most flexible because it uses two parameters to fit the relationship between y and x.

� Some further analysis is in Chapter 6 where tests for nonlinearity are discussed.

Fitting Curves to Data ٢٠٣

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

5.2.5 Fitting Curved Trends

If the data is collected over time, we may want to consider variations on the linear trend model of Chapter 3.

Quadratic trend: yt = β0 + β1t + β2t² + et

Another is the S-curve trend: yt = exp(β0 + β1(1/t)) + et

Fitting Curves to Data ٢٠٤

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

S Curve Model

Many products have a demand curve like this.

1. Initial demand increases slowly

2. As product matures, demand picks up and steadily grows.

3. At some saturation point demand levels off.


Fitting Curves to Data ٢٠٥

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Exponential Growth Model

Another alternative is an exponential trend:

This can be fit by least squares if you model ln(y).

yt = exp(β0 + β1t + et)
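A sketch of fitting the exponential trend by regressing ln(y) on t, assuming y is a positive numpy array in time order (an assumed variable, not from the slides):

import numpy as np
import statsmodels.api as sm

t = np.arange(1, len(y) + 1)           # time index 1, 2, ..., n
X = sm.add_constant(t)

fit = sm.OLS(np.log(y), X).fit()       # ln(y_t) = b0 + b1*t
b0, b1 = fit.params
trend = np.exp(b0 + b1 * t)            # fitted trend back on the original scale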

Checking Assumptions ٢٠٦

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Chapter 6: Assessing the Assumptions of the Regression Model

Terry Dielman
Applied Regression Analysis for Business and Economics

Checking Assumptions ٢٠٧

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.1 Introduction

In Chapter 4 the multiple linear regression model was presented as

Certain assumptions were made about how the errors ei behaved. In this chapter we will check to see if those assumptions appear reasonable.

yi = β0 + β1x1i + β2x2i + … + βKxKi + ei


Checking Assumptions ٢٠٨

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.2 Assumptions of the Multiple Linear Regression Model

a. We expect the average disturbance ei to be zero, so the regression line passes through the average value of Y.
b. The disturbances have constant variance σe².
c. The disturbances are normally distributed.
d. The disturbances are independent.

Checking Assumptions ٢٠٩

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.3 The Regression Residuals

� We cannot check to see if the disturbances ei behave correctly because they are unknown.

� Instead, we work with their sample counterpart, the residuals

which represent the unexplained variation in the y values.

êi = yi - ŷi

Checking Assumptions ٢١٠

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Properties

Property 1: They will always average 0

because the least squares estimation procedure makes that happen.

Property 2: If assumptions a, b and d of Section 6.2 are true then the residuals should be randomly distributed around their mean of 0. There should be no systematic pattern in a residual plot.

Property 3: If assumptions a through d hold, the residuals should look like a random sample from a normal distribution.


Checking Assumptions ٢١١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Suggested Residual Plots

1. Plot the residuals versus each explanatory variable.

2. Plot the residuals versus the predicted values.

3. For data collected over time or in any other sequence, plot the residuals in that sequence.

In addition, a histogram and box plot are useful for assessing normality.

Checking Assumptions ٢١٢

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Standardized residuals

� The residuals can be standardized by dividing by their standard error.

� This will not change the pattern in a plot but will affect the vertical scale.

� Standardized residuals are always scaled so that most are between -2 and +2 as in a standard normal distribution.
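With statsmodels, standardized residuals can be obtained from a fitted model; a sketch assuming results is a fitted OLS results object (an assumed name):

import numpy as np

resid = results.resid
simple = resid / np.sqrt(results.mse_resid)      # residual divided by root MSE

# Leverage-adjusted (internally studentized) version, closer to what Minitab plots
std_resid = results.get_influence().resid_studentized_internal

print(np.sum(np.abs(std_resid) > 2), "residuals beyond +/- 2")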

Checking Assumptions ٢١٣

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

A plot meeting property 2

[Plot of residuals versus X showing: a. mean of 0, b. same scatter, d. no pattern with X]


Checking Assumptions ٢١٤

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

A plot showing a violation

[Plot of standardized residuals versus MONTHS (response is CALLS)]

Checking Assumptions ٢١٥

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.4 Checking Linearity

� Although sometimes we can see evidence of nonlinearity in an X-Y scatterplot, in other cases we can only see it in a plot of the residuals versus X.

� If the plot of the residuals versus an X shows any kind of pattern, it both shows a violation and a way to improve the model.

Checking Assumptions ٢١٦

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example 6.1: Telemarketing

n = 20 telemarketing employees

Y = average calls per day over 20 workdays

X = Months on the job

Data set TELEMARKET6


Checking Assumptions ٢١٧

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Plot of Calls versus Months

[Scatterplot of CALLS versus MONTHS]

There is some curvature, but it is masked by the more obvious linearity.

Checking Assumptions ٢١٨

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

If you are not sure, fit the linear model and save the residuals

The regression equation is

CALLS = 13.7 + 0.744 MONTHS

Predictor Coef SE Coef T P

Constant 13.671 1.427 9.58 0.000

MONTHS 0.74351 0.06666 11.15 0.000

S = 1.787 R-Sq = 87.4% R-Sq(adj) = 86.7%

Analysis of Variance

Source DF SS MS F P

Regression 1 397.45 397.45 124.41 0.000

Residual Error 18 57.50 3.19

Total 19 454.95

Checking Assumptions ٢١٩

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Residuals from model

With the linearity "taken out" the curvature is more obvious


Checking Assumptions ٢٢٠

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.4.2 Tests for lack of fit

� The residuals contain the variation in the sample of Y values that is not explained by the Yhat equation.

� This variation can be attributed to many things, including:

• natural variation (random error)

• omitted explanatory variables

• incorrect form of model

Checking Assumptions ٢٢١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Lack of fit

� If nonlinearity is suspected, there are tests available for lack of fit.

� Minitab has two versions of this test, one requiring there to be repeated observations at the same X values.

� These are on the Options submenu off the Regression menu

Checking Assumptions ٢٢٢

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

The pure error lack of fit test

� In the 20 observations for the telemarketing data, there are two at 10, 20 and 22 months, and four at 25 months.

� These replicates allow the SSE to be decomposed into two portions, "pure error" and "lack of fit".


Checking Assumptions ٢٢٣

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

The test

H0: The relationship is linear

Ha: The relationship is not linear

The test statistic follows an F distribution with c – k – 1 numerator df and n – c denominator df

c = number of distinct levels of X

n = 20 and there were 6 replicates so c = 14

Checking Assumptions ٢٢٤

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Minitab's output

The regression equation is

CALLS = 13.7 + 0.744 MONTHS

Predictor Coef SE Coef T P

Constant 13.671 1.427 9.58 0.000

MONTHS 0.74351 0.06666 11.15 0.000

S = 1.787 R-Sq = 87.4% R-Sq(adj) = 86.7%

Analysis of Variance

Source DF SS MS F P

Regression 1 397.45 397.45 124.41 0.000

Residual Error 18 57.50 3.19

Lack of Fit 12 52.50 4.38 5.25 0.026

Pure Error 6 5.00 0.83

Total 19 454.95

Checking Assumptions ٢٢٥

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Test results

At a 5% level of significance, the critical value (from F12, 6 distribution) is 4.00.

The computed F of 5.25 is significant (p-value of .026), so we conclude the relationship is not linear.
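The arithmetic behind this test is easy to verify from the decomposition above (scipy assumed):

from scipy import stats

SS_lack, df_lack = 52.50, 12      # lack-of-fit sum of squares and df
SS_pure, df_pure = 5.00, 6        # pure-error sum of squares and df

F = (SS_lack / df_lack) / (SS_pure / df_pure)    # about 5.25
print(stats.f.ppf(0.95, df_lack, df_pure))       # critical value, about 4.00
print(stats.f.sf(F, df_lack, df_pure))           # p-value, about .026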


Checking Assumptions ٢٢٦

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Tests without replication

� Minitab also has a series of lack of fit tests that can be applied when there is no replication.

� When they are applied here, these messages appear:

� The small p values suggest lack of fit.

Lack of fit test

Possible curvature in variable MONTHS (P-Value = 0.000)

Possible lack of fit at outer X-values (P-Value = 0.097)

Overall lack of fit test is significant at P = 0.000

Checking Assumptions ٢٢٧

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.4.3 Corrections for nonlinearity

� If the linearity assumption is violated, the appropriate correction is not always obvious.

� Several alternative models were presented in Chapter 5.

� In this case, it is not too hard to see that adding an X2 term works well.

Checking Assumptions ٢٢٨

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Quadratic model

The regression equation is

CALLS = - 0.14 + 2.31 MONTHS - 0.0401 MonthSQ

Predictor Coef SE Coef T P

Constant -0.140 2.323 -0.06 0.952

MONTHS 2.3102 0.2501 9.24 0.000

MonthSQ -0.040118 0.006333 -6.33 0.000

S = 1.003 R-Sq = 96.2% R-Sq(adj) = 95.8%

Analysis of Variance

Source DF SS MS F P

Regression 2 437.84 218.92 217.50 0.000

Residual Error 17 17.11 1.01

Total 19 454.95

No evidence of lack of fit (P > 0.1)


Checking Assumptions ٢٢٩

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Residuals from quadratic model

[Plot of residuals versus MONTHS for the quadratic model]

No violations evident

Checking Assumptions ٢٣٠

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.5 Check for constant variance

� Assumption b states that the errors ei should have the same variance everywhere.

� This implies that if residuals are plotted against an explanatory variable, the scatter should be the same at each value of the X variable.

� In economic data, however, it is fairly common for a variable that increases in value to also increase in scatter.

Checking Assumptions ٢٣١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example 6.3 FOC Sales

n = 265 months of sales data for a fibre-optic company

Y = Sales

X = Mon (1 through 265)

Data set FOCSALES6


Checking Assumptions ٢٣٢

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Data over time

[Time series plot of SALES versus Index]

Note: This uses Minitab’s Time Series Plot

Checking Assumptions ٢٣٣

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Residual plot

[Plot of residuals versus Mon (response is SALES)]

Checking Assumptions ٢٣٤

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Implications

� When the errors ei do not have a constant variance, the usual statistical properties of the least squares estimates may not hold.

� In particular, the hypothesis tests on the model may provide misleading results.


Checking Assumptions ٢٣٥

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.5.2 A Test for Nonconstant Variance

� Szroeter developed a test that can be applied if the observations appear to increase in variance according to some sequence (often, over time).

� To perform it, save the residuals, square them, then multiply by i (the observation number).

� Details are in the text.

Checking Assumptions ٢٣٦

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.5.3 Corrections for Nonconstant Variance

Several common approaches for correcting nonconstant variance are:

1. Use ln(y) instead of y

2. Use √y instead of y

3. Use some other power of y, yp, where the Box-Cox method is used to determine the value for p.

4. Regress (y/x) on (1/x)

Checking Assumptions ٢٣٧

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

LogSales over time

[Time series plot of LogSales versus Index]


Checking Assumptions ٢٣٨

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Residuals from Regression

[Plot of residuals versus Mon (response is LogSales)]

This looks real good after I put this text box on top of those six large outliers.

Checking Assumptions ٢٣٩

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.6 Assessing the Assumption That the Disturbances are Normally Distributed

� There are many tools available to check the assumption that the disturbances are normally distributed.

� If the assumption holds, the standardized residuals should behave like they came from a standard normal distribution.

– about 68% between -1 and +1

– about 95% between -2 and +2

– about 99% between -3 and +3

Checking Assumptions ٢٤٠

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.6.1 Using Plots to Assess Normality

� You can plot the standardized residuals versus fitted values and count how many are beyond -2 and +2; about 1 in 20 would be the usual case.

� Minitab will do this for you if you ask it to check for unusual observations (those flagged by an R have a standardized residual beyond ±2).


Checking Assumptions ٢٤١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Other tools

� Use a Normal Probability plot to test for normality.

� Use a histogram (perhaps with a superimposed normal curve) to look at shape.

� Use a Boxplot for outlier detection. It will show all outliers with an *.

Checking Assumptions ٢٤٢

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example 6.5 Communication Nodes

Data in COMNODE6

n = 14 communication networks

Y = Cost

X1 = Number of ports

X2 = Bandwidth

Checking Assumptions ٢٤٣

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Regression with unusuals flagged

The regression equation is

COST = 17086 + 469 NUMPORTS + 81.1 BANDWIDTH

Predictor Coef SE Coef T P

Constant 17086 1865 9.16 0.000

NUMPORTS 469.03 66.98 7.00 0.000

BANDWIDT 81.07 21.65 3.74 0.003

S = 2983 R-Sq = 95.0% R-Sq(adj) = 94.1%

Analysis of Variance

(deleted)

Unusual Observations

Obs NUMPORTS COST Fit SE Fit Residual St Resid

1 68.0 52388 53682 2532 -1294 -0.82 X

10 24.0 23444 29153 1273 -5709 -2.12R

R denotes an observation with a large standardized residual

X denotes an observation whose X value gives it large influence.


Checking Assumptions ٢٤٤

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Residuals versus fits (from regression graphs)
[Plot of standardized residuals versus fitted values (response is COST)]

Checking Assumptions ٢٤٥

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.6.2 Tests for normality

� There are several formal tests for the hypothesis that the disturbances ei are normal versus nonnormal.

� These are often accompanied by graphs* which are scaled so that data which are normally-distributed appear in a straight line.

* Your Minitab output may appear a little different depending on whether you have the student or professional version, and which release you have.

Checking Assumptions ٢٤٦

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Normal plot (from regression graphs)
[Normal probability plot of the standardized residuals (response is COST)]
If normal, the points should follow a straight line.


Checking Assumptions ٢٤٧

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Normal probability plot (graph menu)

[Normal probability plot for SRES1, ML estimates with 95% CI: Mean = -0.0547797, StDev = 1.02044, AD* = 1.187]

Checking Assumptions ٢٤٨

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Test for Normality (Basic Statistics Menu)

[Anderson-Darling normality test on SRES1: N = 14, Average = -0.0547797, StDev = 1.05896, A-Squared = 0.463, P-Value = 0.216]

Accept H0: Normality

Checking Assumptions ٢٤٩

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example 6.7 S&L Rate of Return

Data set SL6

n = 35 Savings and Loan stocks
Y = rate of return for 5 years ending 1982
X1 = the "Beta" of the stock
X2 = the "Sigma" of the stock

Beta is a measure of nondiversifiable risk and Sigma a measure of total risk


Checking Assumptions ٢٥٠

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Basic exploration

[Scatterplots of RETURN versus BETA and RETURN versus SIGMA]

Correlations: RETURN, BETA, SIGMA

RETURN BETA

BETA 0.180

SIGMA 0.351 0.406

Checking Assumptions ٢٥١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Not much explanatory power

The regression equation is

RETURN = - 1.33 + 0.30 BETA + 0.231 SIGMA

Predictor Coef SE Coef T P

Constant -1.330 2.012 -0.66 0.513

BETA 0.300 1.198 0.25 0.804

SIGMA 0.2307 0.1255 1.84 0.075

S = 2.377 R-Sq = 12.5% R-Sq(adj) = 7.0%

Analysis of Variance

(deleted)

Unusual Observations

Obs BETA RETURN Fit SE Fit Residual St Resid

19 2.22 0.300 -0.231 2.078 0.531 0.46 X

29 1.30 13.050 2.130 0.474 10.920 4.69R

R denotes an observation with a large standardized residual

X denotes an observation whose X value gives it large influence.

Checking Assumptions ٢٥٢

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

One in every crowd?

[Plot of standardized residuals versus fitted values (response is RETURN)]


Checking Assumptions ٢٥٣

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Normality Test

[Anderson-Darling normality test on RESI1: N = 35, Average = 0.0000000, StDev = 2.30610, A-Squared = 2.235, P-Value = 0.000]

Reject H0: Normality

Checking Assumptions ٢٥٤

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.6.3 Corrections for Nonnormality

� Normality is not necessary for making inference with large samples.

� It is required for inference with small samples.

� The remedies are similar to those used to correct for nonconstant variance.

Checking Assumptions ٢٥٥

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.7 Influential Observations

� In minimizing SSE, the least squares procedure tries to avoid large residuals.

� It thus "pays a lot of attention" to y values that don't fit the usual pattern in the data. Refer to the example in Figures 6.42(a) and 6.42(b).

� That probably also happened in the S&L data where the one very high return masked the relationship between rate of return, beta and sigma for the other 34 stocks.


Checking Assumptions ٢٥٦

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.7.2 Identifying outliers

� Minitab flags any residual bigger than 2 in absolute value as a potential outlier.

� A boxplot of the residuals uses a slightly different rule, but should give similar results.

� There is also a third type of residual that is often used for this purpose.

Checking Assumptions ٢٥٧

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Deleted residuals

� If you (temporarily) eliminate the ith observation from the data set, it cannot influence the estimation process.

� You can then compute a "deleted" residual to see if this point fits the pattern in the other observations.

Checking Assumptions ٢٥٨

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Deleted Residual Illustration

The regression equation is

ReturnWO29 = - 2.51 + 0.846 BETA + 0.232 SIGMA

34 cases used 1 cases contain missing values

Predictor Coef SE Coef T P

Constant -2.510 1.153 -2.18 0.037

BETA 0.8463 0.6843 1.24 0.225

SIGMA 0.23220 0.07135 3.25 0.003

S = 1.352 R-Sq = 37.2% R-Sq(adj) = 33.1%

Without observation 29, we get a much better fit.

Predicted Y29 = -2.51 + .846(1.2973) + .232(13.3110) = 1.678

Prediction SE is 1.379

Deleted residual29 = (13.05 – 1.678)/1.379 = 8.24
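statsmodels provides a closely related diagnostic, the externally studentized residual, in which each observation is left out of the fit used to scale its own residual; a sketch assuming results is the fitted OLS results for RETURN on BETA and SIGMA (an assumed name, and not the exact hand computation shown above):

influence = results.get_influence()
deleted = influence.resid_studentized_external   # "leave-one-out" style residuals

worst = deleted.argmax()             # 0-based index of the largest value
print(worst, deleted[worst])         # in this example observation 29 (index 28) stands out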


Checking Assumptions ٢٥٩

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

The influence of observation 29

� When it was temporarily removed, the R2 went from 12.5% to 37.2% and we got a very different equation

� The deleted residual for this observation was a whopping 8.24, which shows it had a lot of weight in determining the original equation.

Checking Assumptions ٢٦٠

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.7.3 Identifying Leverage Points

� Outliers have unusual y values; data points with unusual X values are said to have leverage. Minitab flags these with an X.

� These points can have a lot of influence in determining the Yhatequation, particularly if they don't fit well. Minitab would flag these with both an R and an X.

Checking Assumptions ٢٦١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Leverage

� The leverage of the ith observation is hi (it is hard to show where this comes from without matrix algebra).

� If h > 2(K+1)/n it has high leverage.

� For the S&L returns, K = 2 and n = 35, so the benchmark is 2(3)/35 = .171

� Observation 19 has a very small value for Sigma, which is why it has h19 = .764
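Leverages come straight out of the fitted model; a sketch assuming results is the fitted OLS results object for this example (K = 2, n = 35):

influence = results.get_influence()
h = influence.hat_matrix_diag            # leverage h_i for each observation

K, n = 2, len(h)
cutoff = 2 * (K + 1) / n                 # 2(K+1)/n = .171 here
flagged = (h > cutoff).nonzero()[0]      # 0-based indices of high-leverage points
print(flagged, h[flagged])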


Checking Assumptions ٢٦٢

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.7.4 Combined Measures

� The effect of an observation on the regression line is a function of both the y and X values.

� Several statistics have been developed that attempt to measure combined influence.

� The DFIT statistic and Cook's D are two more-popular measures.

Checking Assumptions ٢٦٣

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

The DFIT statistic

� The DFIT statistic is a function of both the residual and the leverage.

� Minitab can compute and save these under "Storage".

� Sometimes a cutoff is used, but it is perhaps best just to look for values that are high.

Checking Assumptions ٢٦٤

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

DFIT Graphed

[Plot of DFIT values by observation number; observations 19 and 29 stand out]


Checking Assumptions ٢٦٥

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Cook's D

� Often called Cook's Distance

� Minitab also will compute these and store them.

� Again, it might be best just to look for high values rather than use a cutoff.
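Both measures are available from the same influence object in statsmodels; a sketch assuming results is a fitted OLS results object (an assumed name):

influence = results.get_influence()

dffits, _  = influence.dffits           # DFFITS values (second item is a suggested cutoff)
cooks_d, _ = influence.cooks_distance   # Cook's D values (second item is p-values)

# Rather than a hard cutoff, list the few observations that stand out most
for i in cooks_d.argsort()[::-1][:3]:
    print(i, dffits[i], cooks_d[i])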

Checking Assumptions ٢٦٦

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Cook's D Graphed

[Plot of Cook's D values by observation number; observations 19 and 29 stand out]

Checking Assumptions ٢٦٧

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.7.5 What to do with Unusual Observations

� Observation 19 (First Lincoln Financial Bank) has high influence because of its very low Sigma.

� Observation 29 (Mercury Saving) had a very high return of 13.05 but its Beta and Sigma were not unusual.

� Since both values are out of line with the other S&L banks, they may represent data recording errors.


Checking Assumptions ٢٦٨

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Eliminate? Adjust?

� If you can do further research you might find out the true story.

� You should eliminate an outlier data point only when you are convinced it does not belong with the others (for example, if Mercury was speculating wildly).

� An alternative is to keep the data point but add an indicator variable to the model that signals there is something unusual about this observation.

Checking Assumptions ٢٦٩

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.8 Assessing the Assumption That the Disturbances are Independent

� If the disturbances are independent, the residuals should not display any patterns.

� One such pattern was the curvature in the residuals from the linear model in the telemarketing example.

� Another pattern occurs frequently in data collected over time.

Checking Assumptions ٢٧٠

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.8.1 Autocorrelation

� In time series data we often find that the disturbances tend to stay at the same level over consecutive observations.

� If this feature, called autocorrelation, is present, all our model inferences may be misleading.


Checking Assumptions ٢٧١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

First-order autocorrelation

If the disturbances have first-order autocorrelation, they behave as:

ei = ρ ei-1 + μi

where μi is a disturbance with expected value 0 that is independent over time.

Checking Assumptions ٢٧٢

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

The effect of autocorrelation

If you knew that e56 was 10 and ρ was .7, you would expect e57 to be 7 instead of zero.

This dependence can lead to high standard errors for the bj coefficients and wider confidence intervals.

Checking Assumptions ٢٧٣

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.8.2 A Test for First-Order Autocorrelation

Durbin and Watson developed a test for positive autocorrelation of the form:

H0: ρ = 0
Ha: ρ > 0

Their test statistic d is scaled so that it is near 2 when no autocorrelation is present and near 0 when positive autocorrelation is very strong.


Checking Assumptions ٢٧٤

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

A Three-Part Decision Rule

The Durbin-Watson test distribution depends on n and K. The tables (Table B.7) list two decision points dL and dU.

If d < dL reject H0 and conclude there is positive autocorrelation.

If d > dU accept H0 and conclude there is no autocorrelation.

If dL ≤ d ≤ dU the test is inconclusive.

Checking Assumptions ٢٧٥

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example 6.10 Sales and Advertising

n = 36 years of annual data

Y = Sales (in million $)

X = Advertising expenditures ($1000s)

Data in Table 6.6

Checking Assumptions ٢٧٦

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

The Test

n = 36 and K = 1 X-variable

At a 5% level of significance, Table B.7 gives dL = 1.41 and dU = 1.52

Decision Rule:
Reject H0 if d < 1.41
Accept H0 if d > 1.52
Inconclusive if 1.41 ≤ d ≤ 1.52
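As a check on the Minitab value reported on the next slide, the statistic and the three-part rule are easy to compute by hand. A sketch, assuming `resid` holds the 36 residuals from the fitted sales regression:

# Sketch: Durbin-Watson d and the three-part decision rule for this example.
import numpy as np

e = np.asarray(resid)                        # residuals from the fitted sales regression
d = np.sum(np.diff(e) ** 2) / np.sum(e ** 2) # d = sum (e_t - e_{t-1})^2 / sum e_t^2

dL, dU = 1.41, 1.52                          # Table B.7 values for n = 36, K = 1
if d < dL:
    print(f"d = {d:.2f} < {dL}: reject H0 -- positive autocorrelation")
elif d > dU:
    print(f"d = {d:.2f} > {dU}: accept H0 -- no autocorrelation")
else:
    print(f"d = {d:.2f}: test is inconclusive")

(statsmodels also provides durbin_watson in statsmodels.stats.stattools, which computes the same d.)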


Checking Assumptions ٢٧٧

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Regression With DW Statistic

The regression equation is

Sales = - 633 + 0.177 Adv

Predictor Coef SE Coef T P

Constant -632.69 47.28 -13.38 0.000

Adv 0.177233 0.007045 25.16 0.000

S = 36.49 R-Sq = 94.9% R-Sq(adj) = 94.8%

Analysis of Variance

Source DF SS MS F P

Regression 1 842685 842685 632.81 0.000

Residual Error 34 45277 1332

Total 35 887961

Unusual Observations

Obs Adv Sales Fit SE Fit Residual St Resid

1 5317 381.00 309.62 11.22 71.38 2.06R

15 6272 376.10 478.86 6.65 -102.76 -2.86R

R denotes an observation with a large standardized residual

Durbin-Watson statistic = 0.47 Significant autocorrelation

Checking Assumptions ٢٧٨

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Plot of Residuals over Time

[Time plot of the standardized residuals (SRES1) against observation index.]

Shows first-order autocorrelation with r = .71

Checking Assumptions ٢٧٩

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.8.3 Correction for First-Order Autocorrelation

One popular approach creates a new y and x variable.

First, obtain an estimate of ρ. Here we use r = .71 from Minitab's Autocorrelation analysis.

Then compute yi* = yi – r yi-1

and xi* = xi – r xi-1


Checking Assumptions ٢٨٠

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

First Observation Missing

Because the transformation depends on lagged y and x values, the first observation requires special handling.

The text suggests y1* = √(1 – r²) y1

and a similar computation for x1*
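The transformation described on these two slides can be carried out with a few lines of numpy. A sketch, assuming y and x are arrays holding the original series and r = .71 is the estimated autocorrelation:

# Sketch: the autocorrelation correction described above.
import numpy as np

r = 0.71
y_star = y[1:] - r * y[:-1]                  # y_i* = y_i - r*y_{i-1}
x_star = x[1:] - r * x[:-1]                  # x_i* = x_i - r*x_{i-1}

# Special handling for the first observation, as the text suggests:
y1_star = np.sqrt(1 - r ** 2) * y[0]
x1_star = np.sqrt(1 - r ** 2) * x[0]

y_star = np.concatenate(([y1_star], y_star))
x_star = np.concatenate(([x1_star], x_star))
# Regress y_star on x_star to estimate the transformed model.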

Checking Assumptions ٢٨١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Other Approaches

� An alternative is to use an estimation technique (such as SAS's Autoreg procedure) that automatically adjusts for autocorrelation.

� A third option is to include a lagged value of y as an explanatory variable. In this model, the DW test is no longer appropriate.
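The lagged-dependent-variable option is also straightforward to set up. A sketch, assuming a DataFrame `df` with Sales and Adv columns ordered by year (column names taken from the example, DataFrame name hypothetical):

# Sketch: include last period's sales as an additional predictor.
import pandas as pd
import statsmodels.formula.api as smf

df["LagSales"] = df["Sales"].shift(1)        # the first year has no lag and becomes missing
lag_fit = smf.ols("Sales ~ Adv + LagSales", data=df.dropna()).fit()
print(lag_fit.summary())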

Checking Assumptions ٢٨٢

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Regression With Lagged Sales as a Predictor

The regression equation is

Sales = - 234 + 0.0631 Adv + 0.675 LagSales

35 cases used 1 cases contain missing values

Predictor Coef SE Coef T P

Constant -234.48 78.07 -3.00 0.005

Adv 0.06307 0.02023 3.12 0.004

LagSales 0.6751 0.1123 6.01 0.000

S = 24.12 R-Sq = 97.8% R-Sq(adj) = 97.7%

Analysis of Variance

(deleted)

Unusual Observations

Obs Adv Sales Fit SE Fit Residual St Resid

15 6272 376.10 456.24 5.54 -80.14 -3.41R

16 6383 454.60 422.02 12.95 32.58 1.60 X

21 6794 512.00 559.41 4.46 -47.41 -2.00R

R denotes an observation with a large standardized residual

X denotes an observation whose X value gives it large influence.


Checking Assumptions ٢٨٣

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Residuals From Model With Lagged Sales

[Time plot of the standardized residuals (SRES2) from the lagged-sales model against observation index.]

Now r = -.23 is not significant

Indicator Variables ٢٨٤

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Chapter 7Using Indicator and Interaction Variables

Terry DielmanApplied Regression Analysis:

A Second Course in Business and Economic Statistics, fourth edition

Indicator Variables ٢٨٥

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

7.1 Using and Interpreting Indicator Variables

�Suppose some observations have a particular characteristic or attribute, while others do not.

�We can include this information in the regression model by using dummy or indicator variables.


Indicator Variables ٢٨٦

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Add the info thru a coding scheme

Use a binary (dummy) variable to “indicate” when the characteristic is present

Di = 1 if observation i has the attribute

Di = 0 if observation i does not have it

Indicator Variables ٢٨٧

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

An Example

Di = 1 if individual i is employed

Di = 0 if individual i is not employed

We could do it the other way and use the "1" to indicate an unemployed individual.

Indicator Variables ٢٨٨

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Multiple Categories

� For multiple categories, use multiple indicators.

� For example, to indicate where a firm's stock is listed, we could define 3 indicator variables; one each for the NYSE, AMEX and NASDAQ.

� For computational reasons, we would include only two of these in the regression.
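If the data are in a DataFrame, the m − 1 coding can be generated automatically. A sketch, where "Exchange" is a hypothetical column holding the NYSE / AMEX / NASDAQ labels:

# Sketch: build indicator variables for a categorical column, keeping only m - 1 of them.
import pandas as pd

dummies = pd.get_dummies(df["Exchange"], drop_first=True)  # the dropped level is the base
df = pd.concat([df, dummies], axis=1)                      # adds two 0/1 columns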


Indicator Variables ٢٨٩

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example 7.1 Employment Discrimination

If two groups have apparently different salary structures, you first need to account for differences in education, training and experience before any claim of discrimination can be made.

Regression analysis with an indicator variable for the group is a way to investigate this.

Indicator Variables ٢٩٠

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Treasury Versus Harris

The data set HARRIS7 contains information on the salaries of 93 employees of the Harris Trust and Savings Bank. The bank was sued by the U.S. Department of the Treasury in 1981.

Here we examine how salary depends on education, also accounting for gender.

Indicator Variables ٢٩١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Salary Versus Years of Education

[Scatter plot of SALARY (about 4000–8000) against EDUCAT (8–16), with points labeled 0 for females and 1 for males.]

At all levels of education, the male salaries appear higher.


Indicator Variables ٢٩٢

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Regression Analysis

The regression equation is

SALARY = 4173 + 80.7 EDUCAT + 692 MALES

Predictor Coef SE Coef T P

Constant 4173.1 339.2 12.30 0.000

EDUCAT 80.70 27.67 2.92 0.004

MALES 691.8 132.2 5.23 0.000

S = 572.4 R-Sq = 36.3% R-Sq(adj) = 34.9%

How do we interpret this equation?

Indicator Variables ٢٩٣

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

An Intercept Adjuster

For an indicator variable, the coefficient bj is not really a slope. To see this, evaluate the equation for the two groups.

FEMALES (MALES = 0):
SALARY = 4173 + 80.7 EDUCAT + 692 (0)
       = 4173 + 80.7 EDUCAT

MALES (MALES = 1):
SALARY = 4173 + 80.7 EDUCAT + 692 (1)
       = 4865 + 80.7 EDUCAT

Indicator Variables ٢٩٤

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Parallel Salary Equations

[Scatter plot of SALARY against EDUCAT showing the two parallel fitted lines; the male line lies 692 above the female line at every education level.]


Indicator Variables ٢٩٥

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Is The Difference Significant?

H0: βMALES = 0 (after accounting for years of education, there is no salary difference)

Ha: βMALES ≠ 0 (after accounting for education, there IS a salary difference)

Use t = b/SEb as usual.

t = 5.23 is significant.

Indicator Variables ٢٩٦

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

What if the Coding Was Different?

� If we had an indicator for females and used it, the equation would be:

SALARY = 4865 + 80.7 EDUCAT - 692 FEMALES

� The difference between the groups is the same. For females, the intercept in the equation is 4865 – 692 = 4173

Indicator Variables ٢٩٧

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Multiple Categories

� Pick one category as the "base category".

� Create one indicator variable for each other category.

� In general, if there are m categories, use m – 1 indicator variables.


Indicator Variables ٢٩٨

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example 7.3 Meddicorp Sales

Y = Sales in one of 25 territories

X1 = advertising in territory

X2 = bonuses paid in territory

Also Region: 1 = South

2 = West

3 = Midwest

Indicator Variables ٢٩٩

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

How do you use region?

What happens if you just put it in the model?

Sales = -84 + 1.55 ADV + 1.11 BONUS + 119 Region

R2 = 92.0% and Se = 68.89

SE(Region) = 28.69, so the t-stat = 4.14 is significant

Indicator Variables ٣٠٠

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Region as an X

This implies the difference between Region 3 (MW) and Region 2 (W) = b3 = 119

And the difference between Region 2 (W) and Region 1 (S) is also 119

The sales differences may not be equal but this forces them to be estimated that way


Indicator Variables ٣٠١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

A more flexible approach

� Use two indicator variables to tell the three regions apart

� Can use any one of the three as the “base” category.

� Here is what it looks like if Midwest is selected as the base.

Indicator Variables ٣٠٢

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Coding scheme

Region      D1 (South)   D2 (West)
SOUTH            1            0
WEST             0            1
MIDWEST          0            0

Indicator Variables ٣٠٣

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Results

SALES = 435 + 1.37 ADV + .975 BONUS − 258 SOUTH − 210 WEST

R2 = 94.7% and Se = 57.63

Both indicators are significant


Indicator Variables ٣٠٤

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

This Defines Three Equations

SALES = 435 + 1.37 ADV + .975 BONUS − 258 SOUTH − 210 WEST

S:  SALES = 177 + 1.37 ADV + .975 BONUS

W:  SALES = 225 + 1.37 ADV + .975 BONUS

MW: SALES = 435 + 1.37 ADV + .975 BONUS

Indicator Variables ٣٠٥

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Is Location Significant?

� Because location is measured by two variables in a group, we need to do a partial F test.

� The full Model has ADV, BONUS, SOUTH and WEST and has R2 = 94.7

� The reduced model has only ADV and BONUS, with R2 = 85.5

Indicator Variables ٣٠٦

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Output For F-Test

FULL MODEL

S = 57.63 R-Sq = 94.7% R-Sq(adj) = 93.6%

Analysis of Variance

Source DF SS MS F P

Regression 4 1182560 295640 89.03 0.000

Residual Error 20 66414 3321

Total 24 1248974

REDUCED MODEL

S = 90.75 R-Sq = 85.5% R-Sq(adj) = 84.2%

Analysis of Variance

Source DF SS MS F P

Regression 2 1067797 533899 64.83 0.000

Residual Error 22 181176 8235

Total 24 1248974


Indicator Variables ٣٠٧

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Partial F Computations

F = [(SSER – SSEF) / (K – L)] / MSEF

  = [(181176 – 66414) / (4 – 2)] / 3321 = 17.3
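The same computation, with a p-value attached, takes a few lines of Python; the sums of squares come from the two ANOVA tables above (scipy is used only for the F distribution):

# Sketch: partial F test for the two location indicators.
from scipy.stats import f as f_dist

sse_reduced, sse_full = 181176, 66414
k_full, k_reduced = 4, 2                     # predictors in the full and reduced models
n = 25
mse_full = sse_full / (n - k_full - 1)       # 66414 / 20 = 3321

F = ((sse_reduced - sse_full) / (k_full - k_reduced)) / mse_full   # about 17.3
p_value = f_dist.sf(F, k_full - k_reduced, n - k_full - 1)
print(f"F = {F:.1f}, p = {p_value:.4f}")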

Indicator Variables ٣٠٨

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

7.2 Interaction Variables

� Another type of variable used in regression models is an interaction variable.

� This is usually formulated as the product of two variables; for example, x3 = x1·x2

� With this variable in the model, it means the level of x2 changes how x1 affects y

Indicator Variables ٣٠٩

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Interaction Model

With two x variables the model is:

y = β0 + β1 x1 + β2 x2 + β3 x1x2 + e

If we factor out x1 we get:

y = β0 + (β1 + β3 x2) x1 + β2 x2 + e

so each value of x2 yields a different slope in the relationship between y and x1


Indicator Variables ٣١٠

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Interaction Involving an Indicator

If one of the two variables is binary, the interaction produces a model with two different slopes.

When x2 = 0:   y = β0 + β1 x1 + e

When x2 = 1:   y = (β0 + β2) + (β1 + β3) x1 + e

Indicator Variables ٣١١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example 7.4 Discrimination (again)

� In the Harris Bank case, suppose we suspected that the salary difference by gender changed with different levels of education.

� To investigate this, we created a new variable MSLOPE = EDUCAT*MALES and added it to the model.
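Creating the product term takes one line. A sketch, where `harris` is a hypothetical DataFrame with the SALARY, EDUCAT and MALES columns:

# Sketch: the interaction (slope-adjuster) model for the Harris data.
import statsmodels.formula.api as smf

harris["MSLOPE"] = harris["EDUCAT"] * harris["MALES"]      # interaction = product of the two
interaction_fit = smf.ols("SALARY ~ EDUCAT + MALES + MSLOPE", data=harris).fit()
print(interaction_fit.params)                              # MSLOPE coefficient adjusts the slope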

Indicator Variables ٣١٢

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Regression Output

The regression equation is

SALARY = 4395 + 62.1 EDUCAT - 275 MALES + 73.6 MSLOPE

Predictor Coef SE Coef T P

Constant 4395.3 389.2 11.29 0.000

EDUCAT 62.13 31.94 1.95 0.055

MALES -274.9 845.7 -0.32 0.746

MSLOPE 73.59 63.59 1.16 0.250

S = 571.4 R-Sq = 37.3% R-Sq(adj) = 35.2%

How do we interpret the equation this time?


Indicator Variables ٣١٣

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

A Slope Adjuster

To see the interaction effect, once again evaluate the equation for the two groups.

FEMALES (MALES = 0):
SALARY = 4395 + 62.1 EDUCAT − 275 (0) + 73.6 (EDUCAT × 0)
       = 4395 + 62.1 EDUCAT

MALES (MALES = 1):
SALARY = 4395 + 62.1 EDUCAT − 275 (1) + 73.6 (EDUCAT × 1)
       = 4120 + 135.7 EDUCAT

Indicator Variables ٣١٤

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Lines With Two Different Slopes

[Scatter plot of SALARY against EDUCAT with separate fitted lines for males (1) and females (0); the male line has the steeper slope.]

A bigger gap occurs at higher education levels

Indicator Variables ٣١٥

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Tests in This Model

� Although the slope adjuster implies the salary gap increases with education, this effect is not really significant (t for MSLOPE = 1.16).

� The overall effect of gender is now contained in two variables, so a partial F test would be needed to test for differences between male and female salaries.


Indicator Variables ٣١٦

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

7.3 Seasonal Effects in Time Series Regression

� Data collected over time (say quarterly)

� If we think the Y variable depends on the calendar, we can do a kind of “seasonal adjustment” by adding quarter dummies

� Q1 = 1 if this was first quarter, Q2 = 1 if a second quarter, Q3 = 1 if third

� Don’t use Q4 since that is the “base”
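One way to build the quarter dummies is sketched below for a hypothetical DataFrame `abx` with SALES and a TIME index 1, 2, 3, ..., assuming the series starts in a first quarter:

# Sketch: quarterly indicators with Q4 as the base category.
import statsmodels.formula.api as smf

abx["QTR"] = ((abx["TIME"] - 1) % 4) + 1                   # 1, 2, 3, 4, 1, 2, ...
for q in (1, 2, 3):                                        # no Q4 dummy -- it is the base
    abx[f"Q{q}"] = (abx["QTR"] == q).astype(int)

seasonal_fit = smf.ols("SALES ~ TIME + Q1 + Q2 + Q3", data=abx).fit()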

Indicator Variables ٣١٧

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example 7.5 ABX Company Sales

� We fit a trend to these sales in Example 3.11 by regressing sales on a time index variable.

� Because this company sells winter sports merchandise, including seasonal effects should markedly improve the fit.

Indicator Variables ٣١٨

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

ABX Company Sales

[Time plot of SALES (about 200–300) against TIME (0–40); the fourth-quarter observations are highlighted.]


Indicator Variables ٣١٩

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Two Regressions

The regression equation is

SALES = 199 + 2.56 TIME

Predictor Coef SE Coef T P

Constant 199.017 5.128 38.81 0.000

TIME 2.5559 0.2180 11.73 0.000

S = 15.91 R-Sq = 78.3% R-Sq(adj) = 77.8%

The regression equation is

SALES = 211 + 2.57 TIME + 3.75 Q1 - 26.1 Q2 - 25.8 Q3

Predictor Coef SE Coef T P

Constant 210.846 3.148 66.98 0.000

TIME 2.56610 0.09895 25.93 0.000

Q1 3.748 3.229 1.16 0.254

Q2 -26.118 3.222 -8.11 0.000

Q3 -25.784 3.217 -8.01 0.000

S = 7.190 R-Sq = 95.9% R-Sq(adj) = 95.5%

Indicator Variables ٣٢٠

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Are the Seasonal Effects Significant?

� The strong t-ratios for Q2 and Q3 say "yes", and the model R2 increased by 17.6 percentage points when we added the seasonal indicators.

� With evidence this strong we probably don't need to test further.

� In general, however, we would need another partial F test to see if the overall seasonal effect is significant.

Indicator Variables ٣٢١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Partial F Computations

F = [(SSER – SSEF) / (K – L)] / MSEF

  = [(9622 – 1810) / (4 – 1)] / (1810/35 ≈ 52) ≈ 50

F(0.05, 3, 35) = 2.92, so the seasonal effect is clearly significant.


Variable Selection ٣٢٢

Chapter 8Variable Selection

Terry DielmanApplied Regression Analysis:

A Second Course in Business and Economic Statistics, fourth edition

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٢٣

8.1 Introduction

� Previously we discussed some tests (t-test and partial F) that helped us determine whether certain variables should be in the regression.

� Here we will look at several variable selection strategies that expand on this idea.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٢٤

Why is This Important?

� If an important variable is omitted, the estimated regression coefficients can become biased (systematically too high or low).

� Their standard errors can become inflated, leading to imprecise intervals and poor power in hypothesis tests.


Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٢٥

Strategies

� All possible regressions: computer procedures that briefly examine every possible combination of Xs and report summaries of fit ability.

� Selection algorithms: rules for deciding when to drop or add variables

1. Backwards Elimination

2. Forward Selection

3. Stepwise Regression

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٢٦

Words of Caution

� None guarantee you get the right model because they do not check assumptions or search for omitted factors like curvature.

� None have the ability to use a researcher's knowledge about the business or economic situation being analyzed.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٢٧

8.2 All Possible Regressions

� If there are k x variables to consider using, there are 2^k possible subsets. For example, with only k = 5, there are 32 regression equations.

� Obtaining these sounds like a ton of work but programs like SAS or Minitab have algorithms that can measure fit ability without really producing the equation.


Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٢٨

Typical Output

� The program will usually give you a summary table.

� Each line on the table will tell you which variables were in the model, plus measures of fit ability.

� These measures include R2, adjusted R2, Se and a new one, Cp

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٢٩

The Cp Statistic

p = k + 1 is the number of terms in the model, including the intercept.

SSEp is the SSE of this model

MSEF is the MSE in the "full model" (with all the variables)

Cp = SSEp / MSEF − (n – 2p)

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٣٠

Using The Cp Statistic

Theory says that in a model with bias, Cp will be large.

It also says that in a model with no bias, Cp should be equal to p.

It is thus recommended that we consider models with a small Cp and those with Cp near p = k + 1.
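The statistic itself is simple to compute once the subset SSE and the full-model MSE are in hand; a small helper:

# Sketch: Mallows' Cp as defined on the previous slide.
def mallows_cp(sse_p, mse_full, n, p):
    """p = k + 1 terms in the subset model, including the intercept."""
    return sse_p / mse_full - (n - 2 * p)

# Using values from the Meddicorp output shown elsewhere in these slides
# (ADV + BONUS subset: SSE = 181176, full-model S = 93.77, n = 25, p = 3):
# mallows_cp(181176, 93.77 ** 2, 25, 3) is about 1.6, matching the Cp column.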


Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٣١

Example 8.1 Meddicorp Revisited

n = 25 sales territories
y = Sales (in $1000s) in each territory
x1 = Advertising ($100s) in the territory
x2 = Bonuses paid (in $100s) in the territory
x3 = Market share in the territory
x4 = Largest competitor's sales ($1000s)
x5 = Region code (1 = S, 2 = W, 3 = MW)

We are not using region here because it should be converted to indicator variables which should be examined as a group.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٣٢

Summary Results For All Possible Regressions

Variables in the Regression          R2    R2-adj    Cp      Se

ADV 81.1 80.2 5.90 101.42

BONUS 32.3 29.3 75.19 191.76

COMPET 14.2 10.5 100.85 215.83

MKTSHR 0.1 0.0 120.97 232.97

ADV, BONUS 85.5 84.2 1.61 90.75

ADV, MKTSHR 81.2 79.5 7.66 103.23

ADV, COMPET 81.2 79.5 7.74 103.38

BONUS, COMPET 38.7 33.2 68.03 186.51

BONUS, MKTSHR 32.8 26.7 76.46 195.33

COMPET, MKTSHR 16.1 8.5 100.18 218.20

ADV, BONUS, MKTSHR 85.8 83.8 3.10 91.75

ADV, BONUS, COMPET 85.7 83.6 3.30 92.26

ADV, MKTSHR, COMPET 81.3 78.6 9.60 105.52

BONUS, MKTSHR, COMPET 40.9 32.5 66.90 187.48

ADV, BONUS, MKTSHR, COMPET 85.9 83.1 5.00 93.77

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٣٣

The Best Model?

� The two variable model with ADV and BONUS has the smallest Cp and highest adjusted R2.

� The three variable models adding either MKTSHR or COMPET also have small Cp values but only modest increases in R2.

� The two-variable model is probably the best.


Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٣٤

Minitab Results

Vars  R-Sq  R-Sq(adj)   C-p      S      ADV  BONUS  MKTSHR  COMPET
  1   81.1    80.2      5.9   101.42     X
  1   32.3    29.3     75.2   191.76           X
  2   85.5    84.2      1.6    90.749    X     X
  2   81.2    79.5      7.7   103.23     X            X
  3   85.8    83.8      3.1    91.751    X     X      X
  3   85.7    83.6      3.3    92.255    X     X              X
  4   85.9    83.1      5.0    93.770    X     X      X       X

By default, the Best Subsets procedure prints two models for each number of X variables. This can be increased up to 5.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٣٥

Limitations

� With a large number of potential x variables, the all possible approach becomes unwieldy.

� Minitab can use up to 31 predictors, but warns that computational time can be long when as few as 15 are used.

� "Obviously good" predictors can be forced into the model, thus reducing search time, but this is not always what you want.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٣٦

8.3 Other Variable Selection Techniques

� With a large number of potential x variables, it may be best to use one of the iterative selection methods.

� These will look at only the set of models that their rules lead them to, so they may not yield a model as good as that returned by the all possible regressions approach.


Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٣٧

8.3.1 Backwards Elimination

1. Start with all variables in the equation.

2. Examine the variables in the model for significance and identify the least significant one.

3. Remove this variable if it does not meet some minimum significance level.

4. Run a new regression and repeat until all remaining variables are significant.
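These four steps translate directly into a short loop; a sketch using statsmodels, where X is a DataFrame of candidate predictors and y the response (both hypothetical):

# Sketch: backwards elimination driven by p-values.
import statsmodels.api as sm

def backwards_eliminate(y, X, alpha_remove=0.15):
    predictors = list(X.columns)
    while predictors:
        fit = sm.OLS(y, sm.add_constant(X[predictors])).fit()
        pvals = fit.pvalues.drop("const")        # ignore the intercept
        worst = pvals.idxmax()                   # least significant variable
        if pvals[worst] <= alpha_remove:         # everything left is significant
            return fit
        predictors.remove(worst)                 # drop it and refit
    return None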

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٣٨

No Search Routine Needed?

� Although most software packages have automatic procedures for backwards elimination, it is fairly easy to do interactively.

� Run a model, check its t-tests for significance, and identify the variable to drop.

� Run again with one less variable and repeat the steps.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٣٩

Step 1 – All Variables

Regression Analysis: SALES versus ADV, BONUS, MKTSHR, COMPET

The regression equation is

SALES = - 594 + 2.51 ADV + 1.91 BONUS + 2.65 MKTSHR - 0.121 COMPET

Predictor Coef SE Coef T P

Constant -593.5 259.2 -2.29 0.033

ADV 2.5131 0.3143 8.00 0.000

BONUS 1.9059 0.7424 2.57 0.018

MKTSHR 2.651 4.636 0.57 0.574

COMPET -0.1207 0.3718 -0.32 0.749

S = 93.77 R-Sq = 85.9% R-Sq(adj) = 83.1%

Least significant: COMPET (P = 0.749)


Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٤٠

Step 2 – COMPET Eliminated

Regression Analysis: SALES versus ADV, BONUS, MKTSHR

The regression equation is

SALES = - 621 + 2.47 ADV + 1.90 BONUS + 3.12 MKTSHR

Predictor Coef SE Coef T P

Constant -620.6 240.1 -2.58 0.017

ADV 2.4698 0.2784 8.87 0.000

BONUS 1.9003 0.7262 2.62 0.016

MKTSHR 3.116 4.314 0.72 0.478

S = 91.75 R-Sq = 85.8% R-Sq(adj) = 83.8%

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٤١

Step 3 – MKTSHR Eliminated

Regression Analysis: SALES versus ADV, BONUS

The regression equation is

SALES = - 516 + 2.47 ADV + 1.86 BONUS

Predictor Coef SE Coef T P

Constant -516.4 189.9 -2.72 0.013

ADV 2.4732 0.2753 8.98 0.000

BONUS 1.8562 0.7157 2.59 0.017

S = 90.75 R-Sq = 85.5% R-Sq(adj) = 84.2%

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٤٢

8.3.2 Forward Selection

� At each stage, it looks at the x variables not in the current equation and tests to see if they will be significant if they are added.

� In the first stage, the x with the highest correlation with y is added.

� At later stages it is much harder to see how the next x is selected.


Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٤٣

Minitab Output for Forward Selection (an option in the Stepwise procedure)

Forward selection. Alpha-to-Enter: 0.25

Response is SALES on 4 predictors, with N = 25

Step 1 2

Constant -157.3 -516.4

ADV 2.77 2.47

T-Value 9.92 8.98

P-Value 0.000 0.000

BONUS 1.86

T-Value 2.59

P-Value 0.017

S 101 90.7

R-Sq 81.06 85.49

R-Sq(adj) 80.24 84.18

C-p 5.9 1.6

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٤٤

Same Model as Backwards

� This data set is not too complex, so both procedures returned the same model.

� With larger data sets, particularly when the x variables are correlated among themselves, results can be different.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٤٥

8.3.3 Stepwise Regression

� A limitation with the backwards procedure is that a variable that gets eliminated is never considered again.

� With forward selection, variables entering stay in, even if they lose significance.

� Stepwise regression corrects these flaws. A variable entering can later leave. A variable eliminated can later go back in.


Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٤٦

Minitab Output for Stepwise Regression

Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15

Response is SALES on 4 predictors, with N = 25

Step 1 2

Constant -157.3 -516.4

ADV 2.77 2.47

T-Value 9.92 8.98

P-Value 0.000 0.000

BONUS 1.86

T-Value 2.59

P-Value 0.017

S 101 90.7

R-Sq 81.06 85.49

R-Sq(adj) 80.24 84.18

C-p 5.9 1.6

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٤٧

Selection Parameters

� For backwards elimination, the user specifies "Alpha to Remove", which is the maximum p-value a variable can have and stay in the equation.

� For forward selection, the user specifies "Alpha to Enter", which is the maximum p-value a variable can have and still be allowed to enter the equation.

� Stepwise regression gets both.

� Often we use values like .15 or .20 because this encourages the procedures to look at models with more variables.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٤٨

8.4 Which Procedure is Best?

� Unless there are too many x variables, the all possible models approach is favored because it looks at all combinations of variables.

� Of the other strategies, stepwise regression is probably best.

� If no search programs are available, backwards elimination can still provide a useful sifting of the data.


Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٤٩

No Guarantees

� Because they do not check assumptions or examine the model residuals, there is no guarantee of returning the right model.

� Nonetheless, these can be effective tools for filtering the data and identifying which variables to pay more attention to.