
Chapter 3: Simple Regression Analysis (Part 1)

Terry Dielman, Applied Regression Analysis: A Second Course in Business and Economic Statistics, fourth edition


3.1 Using Simple Regression to Describe a Relationship

• Regression analysis is a statistical technique used to describe relationships among variables.

• The simplest case is one where a dependent variable y may be related to an independent or explanatory variable x.

• The equation expressing this relationship is the line:

$y = b_0 + b_1 x$


Slope and Intercept

• For a given set of data, we need to calculate values for the slope b1 and the intercept b0.

• Figure 3.1 shows the graph of a set of six (x, y) pairs that have an exact relationship.

• Ordinary algebra is all you need to compute y = 1 + 2x.


Figure 3.1 Graph of An Exact Relationship

[Plot of the six (x, y) pairs, which fall exactly on the line y = 1 + 2x]

x   y
1   3
2   5
3   7
4   9
5   11
6   13


Error in the Relationship

• In real life, we usually do not have exact relationships.

• Figure 3.2 shows a situation where y and x have a strong tendency to increase together, but the relationship is not perfect.

• You can use a ruler to put a line in approximately the "right place" and use algebra again.

• A good guess might be ŷ = -1 + 2.5x


Figure 3.2 Graph of a Relationship That is NOT Exact

x   y
1   3
2   2
3   8
4   8
5   11
6   13

[Scatter plot of the six pairs with the fitted line]

Regression Plot: y = -0.2 + 2.2x
S = 1.48324   R-Sq = 90.6%   R-Sq(adj) = 88.2%


Everybody Is Different

• The drawback to this technique is that everybody will have their own opinion about where the line goes.

• There would be ever greater differences if there were more data with a wider scatter.

• We need a precise mathematical technique to use for this task.


Residuals

• Figure 3.3 shows the previous graph where the "fit error" of each point is indicated.

• These residuals are positive if the point is above the line and negative if the line is above the point.

• We want a technique that will make the + and - even out.


Figure 3.3 Deviations From the Line

[Regression plot of the same six points with the line y = -0.2 + 2.2x; points above the line show + deviations, points below the line show - deviations]

S = 1.48324   R-Sq = 90.6%   R-Sq(adj) = 88.2%


Computation Ideas (1)

We can search for a line that minimizes the sum of the residuals:

$\sum_{i=1}^{n} (y_i - \hat{y}_i)$

While this is a good idea, it can be shown that any line passing through the point (x̄, ȳ) will have this sum = 0.


Computation Ideas (2)

We can work with absolute values and search for a line that minimizes:

$\sum_{i=1}^{n} |y_i - \hat{y}_i|$

Such a procedure, called LAV or least absolute value regression, does exist but is usually found only in specialized software.


Computation Ideas (3)

By far the most popular approach is to square the residuals and minimize:

$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

This procedure is called least squares and is widely available in software. It uses calculus to solve for the b0 and b1 terms and gives a unique solution.


Least Squares Estimators

• There are several formulas for the b1 term. If doing it by hand, we might want to use:

$b_1 = \dfrac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}$

• The intercept is $b_0 = \bar{y} - b_1 \bar{x}$


Figure 3.5 Computations Required for b1 and b0

 xi   yi   xi²   xi·yi
  1    3    1      3
  2    2    4      4
  3    8    9     24
  4    8   16     32
  5   11   25     55
  6   13   36     78
Totals: 21   45   91   196


Calculations

$b_1 = \dfrac{\sum x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right)}{\sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2} = \dfrac{196 - \frac{(21)(45)}{6}}{91 - \frac{(21)^2}{6}} = \dfrac{38.5}{17.5} = 2.2$

$b_0 = \bar{y} - b_1 \bar{x} = 7.5 - (2.2)(3.5) = -0.2$
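A minimal Python sketch (using only the six (x, y) pairs above, not part of the original slides) that reproduces these hand calculations:

```python
import numpy as np

# The six (x, y) pairs from Figure 3.5
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([3, 2, 8, 8, 11, 13], dtype=float)
n = len(x)

# Hand-calculation formula for the slope
b1 = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / (np.sum(x ** 2) - np.sum(x) ** 2 / n)
# Intercept from the sample means
b0 = y.mean() - b1 * x.mean()
print(b1, b0)  # 2.2 and -0.2
```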


The Unique Minimum

• The line we obtained was:

$\hat{y} = -0.2 + 2.2x$

• The sum of squared errors (SSE) is:

$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 8.80$

• No other linear equation will yield a smaller SSE. For the line ŷ = -1 + 2.5x we guessed earlier, the SSE is 10.75 (see the check below).
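As a quick check, a short sketch (same six points, line coefficients taken from the slides) comparing the two SSE values:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([3, 2, 8, 8, 11, 13], dtype=float)

sse_least_squares = np.sum((y - (-0.2 + 2.2 * x)) ** 2)  # the fitted line
sse_guess = np.sum((y - (-1.0 + 2.5 * x)) ** 2)          # the eyeballed line
print(sse_least_squares, sse_guess)  # 8.8 and 10.75
```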


3.2 Examples of Regression as a Descriptive Technique

Example 3.2 Pricing Communications Nodes

A Ft. Worth manufacturing company was concerned about the cost of adding nodes to a communications network. They obtained data on 14 existing nodes.

They did a regression of cost (the y) on number of ports (x).


[Scatter plot of COST versus NUMPORTS for the 14 nodes, titled "Pricing Communications Nodes", with the fitted line COST = 16594 + 650 NUMPORTS]


Example 3.3 Estimating Residential Real Estate Values

The Tarrant County Appraisal District uses data such as house size, location and depreciation to help appraise property.

Regression can be used to establish a weight for each factor. Here we look at how price depends on size for a set of 100 homes. The data are from 1990.


[Scatter plot of VALUE versus SIZE for the 100 homes, titled "Tarrant County Real Estate", with the fitted line VALUE = -50035 + 72.8 SIZE]


Example 3.4 Forecasting Housing Starts

Forecasts of economic measures are important to the government and to many industries.

Here we analyze the relationship between US housing starts and mortgage rates. The rate used is the US average for new home purchases.

Annual data from 1963 to 2002 is used.


[Scatter plot of STARTS versus RATES, titled "US Housing Starts", with the fitted line STARTS = 1726 - 22.2 RATES]


3.3 Inferences From a Simple Regression Analysis

• So far regression has been used as a way to describe the relationship between the two variables.

• Here we will use our sample data to make inferences about what is going on in the underlying population.

• To do that, we first need some assumptions about how things are.


3.3.1 Assumptions Concerning the Population Regression Line

• Let's use the communications nodes example to illustrate. Costs ranged from roughly $23,000 to $57,000 and the number of ports from 12 to 68.

• Three times we had projects with 24 ports, but the three costs were all different. The same thing occurred with the repeated observations at 52 and 56 ports.

• This illustrates how we view things: at each value of x there is a distribution of potential y values that can occur.


The Conditional Mean

• Our first assumption is that the means of these distributions all lie on a straight line:

$\mu_{y|x} = \beta_0 + \beta_1 x$

• For example, at projects with 30 ports, we have:

$\mu_{y|x=30} = \beta_0 + \beta_1 (30)$

• The actual costs of projects with 30 ports are going to be distributed about this mean. The same thing happens at other sizes of projects, so you might see something like the next slide.


Figure 3.12 Distribution of Costs around the Regression Line

[At each number of ports (for example 12, 30 and 68) there is a distribution of costs centered on the line β0 + β1 Nodes]


The Disturbance Terms

• Because of the variation around the regression line, it is convenient to view the individual costs as:

$y_i = \beta_0 + \beta_1 x_i + e_i$

• The ei are called the disturbances and represent how yi differs from its conditional mean. If yi is above the mean, its disturbance has a + value.


Assumptions

1. We expect the average disturbance ei to be zero so the regression line passes through the conditional mean of y.

2. The ei have constant variance σe2.

3. The ei are normally distributed.

4. The ei are independent.


3.3.2 Inferences About β0 and β1

• We use our sample data to estimate β0 by b0 and β1 by b1. If we had a different sample, we would not be surprised to get different estimates.

• Understanding how much they would vary from sample to sample is an important part of the inference process.

• We use the assumptions, together with our data, to construct the sampling distributions for b0 and b1.


The Sampling Distributions

• The estimators have many good statistical properties. They are unbiased, consistent and minimum variance.

• They have normal distributions with standard errors that are functions of the x values and σe2.

• Full details are in Section 3.3.2.


Estimate of σe2

• This is an unknown quantity that needs to be estimated from data.

• We estimate it by the formula:

$S_e^2 = \dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2} = \dfrac{SSE}{n-2} = MSE$

• The term MSE stands for mean squared error and is more or less the average squared residual.


Standard Error of the Regression

• The divisor n-2 used in the previous calculation follows our general rule that degrees of freedom equal the sample size minus the number of estimates we make (b0 and b1) before estimating the variance.

• The square root of MSE is Se, which we call the standard error of the regression.

• Se can be roughly interpreted as the "typical" amount we miss in estimating each y value (see the sketch below).
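A minimal sketch (six-point example again, fitted line from Section 3.1) showing the chain SSE → MSE → Se:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([3, 2, 8, 8, 11, 13], dtype=float)

residuals = y - (-0.2 + 2.2 * x)
sse = np.sum(residuals ** 2)        # 8.8
mse = sse / (len(x) - 2)            # divisor is n-2
se = np.sqrt(mse)                   # about 1.48, the S reported on the Minitab plots
print(sse, mse, se)
```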


Inference About β1

• Interval estimates and hypothesis tests are constructed using the sampling distribution of b1.

• The standard error of b1 is:

$S_{b_1} = \dfrac{S_e}{\sqrt{(n-1)s_x^2}}$

• Computer programs routinely compute this and report its value.


Interval Estimate

• The distribution we use is a t with n-2 degrees of freedom.

• The interval is:

$b_1 \pm t_{n-2}\, S_{b_1}$

• The value of t, of course, depends on the selected confidence level.


Tests About β1

The most common test is that a change in the x variable does not induce a change in y, which can be stated:

H0: β1 = 0    Ha: β1 ≠ 0

If H0 is true, the population regression equation is a flat line; that is, regardless of the value of x, y has the same distribution.


Test Statistic

The test would be performed by using the standardized test statistic:

$t = \dfrac{b_1 - 0}{S_{b_1}}$

Most computer programs compute this statistic and its associated p-value and include them in the output. The p-value is for the two-sided version of the test.


Inference About β0

• We can also compute confidence intervals and perform hypothesis tests about the intercept in the population equation.

• Details about the tests and intervals are in Section 3.3.2, but in most problems we are not interested in the intercept.

• The intercept is the value of y at x = 0, and in many problems this is not relevant; for example, we never see houses with zero square feet of floor space.

• Sometimes it is relevant, though. If we are estimating costs, we could interpret the intercept as the fixed cost. Even though we never see communication nodes with zero ports, there is likely to be a fixed cost associated with setting up each project.


Example 3.6 Pricing Communications Nodes (continued)

Inference questions:

1. What is the equation relating NUMPORTS to COST?

2. Is the relationship significant?

3. What is an interval estimate of β1?

4. Is the relationship positive?

5. Can we claim each port costs at least $1000?

6. What is our estimate of fixed cost?

7. Is the intercept 0?


Minitab Regression Output

Regression Analysis: COST versus NUMPORTS

The regression equation is
COST = 16594 + 650 NUMPORTS

Predictor      Coef  SE Coef     T      P
Constant      16594     2687  6.18  0.000
NUMPORTS     650.17    66.91  9.72  0.000

S = 4307   R-Sq = 88.7%   R-Sq(adj) = 87.8%

Analysis of Variance
Source          DF          SS          MS      F      P
Regression       1  1751268376  1751268376  94.41  0.000
Residual Error  12   222594146    18549512
Total           13  1973862521


Is the relationship significant?

H0: β1 = 0 (Cost does not change when the number of ports increases)

Ha: β1 ≠ 0 (Cost does change)

We will use a 5% level of significance and the t distribution with (n-2) = 12 degrees of freedom.

Decision rule: Reject H0 if t > 2.179 or if t < -2.179

From the Minitab output, t = 9.72 (p-value = .000).

We conclude that there is a significant relationship between project size and cost.


What is an interval estimate of β1?

The interval is:

$b_1 \pm t_{n-2}\, S_{b_1}$

For a 95% interval use t = 2.179:

650.17 ± 2.179(66.91) = 650.17 ± 145.80

We are 95% sure that the average cost for each additional port is between $504 and $796.
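A short sketch (slope, standard error and n taken from the Minitab output; scipy supplies the t multiplier) that reproduces this interval:

```python
from scipy import stats

b1, se_b1, n = 650.17, 66.91, 14     # from the Minitab output for the node data

t_crit = stats.t.ppf(0.975, df=n - 2)            # two-sided 95%, 12 df, about 2.179
print(b1 - t_crit * se_b1, b1 + t_crit * se_b1)  # roughly 504 to 796
print(b1 / se_b1)                                # t statistic for H0: beta1 = 0, about 9.72
```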


Can we claim a positive relationship?

H0: β1 = 0 (Cost does not change when size increases)

Ha: β1 > 0 (Cost increases when size increases)

We will use a 5% level of significance and the t distribution with (n-2) = 12 degrees of freedom.

Decision rule: Reject H0 if t > 1.782

From the Minitab output, t = 9.72 (the p-value is half of the listed value of .000, which is still .000).

We conclude that project cost does increase with project size.


Is the cost per port at least $1000?

H0: β1 ≥ 1000 (Cost per port is at least $1000)

Ha: β1 < 1000 (Cost per port is less than $1000)

Again we will use a 5% level of significance and 12 degrees of freedom.

Decision rule: Reject H0 if t < -1.782

Here use:

$t = \dfrac{b_1 - 1000}{S_{b_1}} = \dfrac{650.17 - 1000}{66.91} = -5.23$

We conclude that the cost per port is (much) less than $1000.


What is our estimate of fixed cost?

We can interpret the intercept of the equation as the fixed cost, and the slope as the variable cost. For the intercept, an interval is:

$b_0 \pm t_{n-2}\, S_{b_0}$

16594 ± 2.179(2687) = 16594 ± 5855

We are 95% sure the fixed cost is between $10,739 and $22,449.


Is the intercept 0?

H0: β0 = 0 (Fixed cost is 0)

Ha: β0 ≠ 0 (Fixed cost is not 0)

Again, use a 5% level of significance and 12 d.f.

Decision rule: Reject H0 if t > 2.179 or if t < -2.179

From the Minitab output, t = 6.18 (p-value = .000).

We conclude that the fixed cost is not zero.


Chapter 3: Simple Regression Analysis (Part 2)

Terry Dielman, Applied Regression Analysis: A Second Course in Business and Economic Statistics, fourth edition


3.4 Assessing the Fit of the Regression Line

• In some problems, it may not be possible to find a good predictor of the y values.

• We know the least squares procedure finds the best possible fit, but that does not guarantee good predictive power.

• In this section we discuss some methods for summarizing the fit quality.


3.4.1 The ANOVA Table

Let us start by looking at the amount of variation in the y values. The variation about the mean is:

$\sum_{i=1}^{n} (y_i - \bar{y})^2$

which we will call SST, the total sum of squares.

Text equations (3.14) and (3.15) show how this can be split up into two parts.


Partitioning SST

SST can be split into two pieces: the previously introduced SSE and a new quantity, SSR, the regression sum of squares.

$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

SST = SSR + SSE
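A minimal numerical check of the partition (six-point example, fitted line from Section 3.1):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([3, 2, 8, 8, 11, 13], dtype=float)
y_hat = -0.2 + 2.2 * x

sst = np.sum((y - y.mean()) ** 2)      # 93.5
ssr = np.sum((y_hat - y.mean()) ** 2)  # 84.7
sse = np.sum((y - y_hat) ** 2)         # 8.8
print(sst, ssr + sse)                  # the two totals agree
```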


Explained and Unexplained Variation

• We know that SSE is the sum of all the squared residuals, which represent lack of fit in the observations.

• We call this the unexplained variation in the sample.

• Because SSR contains the remainder of the variation in the sample, it is the variation explained by the regression equation.


The ANOVA Table

Most statistics packages organize these quantities in an ANalysis Of VAriance table.

Source       DF   SS    MS    F
Regression    1   SSR   MSR   MSR/MSE
Residual     n-2  SSE   MSE
Total        n-1  SST


3.4.2 The Coefficient of Determination

• If we had an exact relationship between y and x, then SSE would be zero and SSR = SST.

• Since that does not happen often, it is convenient to use the ratio of SSR to SST as a measure of how close we get to an exact relationship.

• This ratio is called the Coefficient of Determination, or R2.


R2

R2 = SSR/SST is a fraction between 0 and 1.

In an exact model, R2 would be 1. Most of the time we multiply by 100 and report it as a percentage.

Thus, R2 is the percentage of the variation in the sample of y values that is explained by the regression equation.


Correlation Coefficient

• Some programs also report the square root of R2 as the correlation between the y and y-hat values.

• When there is only a single predictor variable, as here, R2 is just the square of the correlation between y and x.


3.4.3 The F Test

• An additional measure of fit is provided by the F statistic, which is the ratio of MSR to MSE.

• This can be used as another way to test the hypothesis that β1 = 0.

• This test is not very important in simple regression because it is redundant with the t test on the slope.

• In multiple regression (next chapter) it is much more important.


F Test Setup

The hypotheses are:

H0: β1 = 0    Ha: β1 ≠ 0

The F ratio has 1 numerator degree of freedom and n-2 denominator degrees of freedom.

A critical value for the test is selected from that distribution and H0 is rejected if the computed F ratio exceeds the critical value.


Example 3.8 Pricing Communications Nodes (continued)

Below we see the portion of the Minitab output that lists the statistics we have just discussed.

S = 4307   R-Sq = 88.7%   R-Sq(adj) = 87.8%

Analysis of Variance
Source          DF          SS          MS      F      P
Regression       1  1751268376  1751268376  94.41  0.000
Residual Error  12   222594146    18549512
Total           13  1973862521


R2 and F

R2 = SSR/SST = 1751268376/1973862521 = .8872, or 88.7%

F = MSR/MSE = 1751268376/18549512 = 94.41

From the F1,12 distribution, the critical value at a 5% significance level is 4.75.
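A short sketch (sums of squares copied from the ANOVA table; scipy supplies the F critical value) that reproduces these numbers:

```python
from scipy import stats

ssr, sse, sst, n = 1751268376, 222594146, 1973862521, 14   # node data ANOVA table

r_sq = ssr / sst                       # about 0.887
f_stat = (ssr / 1) / (sse / (n - 2))   # MSR / MSE, about 94.4
f_crit = stats.f.ppf(0.95, 1, n - 2)   # about 4.75
print(r_sq, f_stat, f_crit)
```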


3.5 Prediction or Forecasting With a Simple Linear Regression Equation

• Suppose we are interested in predicting the cost of a new communications node that had 40 ports.

• If this size project is something we would see often, we might be interested in estimating the average cost of all projects with 40 ports.

• If it were something we expect to see only once, we would be interested in predicting the cost of the individual project.


3.5.1 Estimating the Conditional Mean of y Given x.

At xm = 40 ports, the quantity we are estimating is:

$\mu_{y|x=40} = \beta_0 + \beta_1 (40)$

Our best guess of this is just the point on the regression line:

$\hat{y}_m = b_0 + b_1 (40)$


Standard Error of the Mean

• We will want to make an interval estimate, so we need some kind of standard error.

• Because our point estimate is a function of the random variables b0 and b1, their standard errors figure into our computation.

• The result is:

$S_m = S_e \sqrt{\dfrac{1}{n} + \dfrac{(x_m - \bar{x})^2}{(n-1)s_x^2}}$


Where Are We Most Accurate?

• For estimating the mean at the point xm, the standard error is Sm.

• If you examine the formula:

$S_m = S_e \sqrt{\dfrac{1}{n} + \dfrac{(x_m - \bar{x})^2}{(n-1)s_x^2}}$

you can see that the second term will be zero if we predict at the mean value of x.

• That makes sense: it says you do your best prediction right in the center of your data.


Interval Estimate

• For estimating the conditional mean of y that occurs at xm we use:

$\hat{y}_m \pm t_{n-2}\, S_m$

• We call this a confidence interval for the mean value of y at xm.


Hypothesis Test

• We could also perform a hypothesis test about the conditional mean.

• The hypothesis would be:

H0: µy|x=40 = (some value)

and we would construct a t ratio from the point estimate and standard error.


3.5.2 Predicting an Individual Value of y Given x

• If we are trying to say something about an individual value of y, it is a little bit harder.

• We not only have to first estimate the conditional mean, but we also have to tack on an allowance for y being above or below its mean.

• We use the same point estimate, but our standard error is larger.


Prediction Standard Error

• It can be shown that the prediction standard error is:

$S_p = S_e \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_m - \bar{x})^2}{(n-1)s_x^2}}$

• This looks a lot like the previous one but has an additional term under the square root sign.

• The relationship is:

$S_p^2 = S_m^2 + S_e^2$


Predictive Inference

• Although we could be interested in a hypothesis test, the most common type of predictive inference is a prediction interval.

• The interval is just like the one for the conditional mean, except that Sp is used in the computation (see the sketch below).
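A minimal sketch of both intervals (six-point example; the prediction point x_m = 4 is an arbitrary choice for illustration):

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([3, 2, 8, 8, 11, 13], dtype=float)
n = len(x)

s_e = np.sqrt(np.sum((y - (-0.2 + 2.2 * x)) ** 2) / (n - 2))
x_m = 4.0                                   # point at which we predict
y_hat_m = -0.2 + 2.2 * x_m

spread = (x_m - x.mean()) ** 2 / ((n - 1) * x.var(ddof=1))
s_m = s_e * np.sqrt(1 / n + spread)         # standard error for the conditional mean
s_p = s_e * np.sqrt(1 + 1 / n + spread)     # standard error for an individual prediction

t_crit = stats.t.ppf(0.975, df=n - 2)
print(y_hat_m - t_crit * s_m, y_hat_m + t_crit * s_m)   # confidence interval for the mean
print(y_hat_m - t_crit * s_p, y_hat_m + t_crit * s_p)   # wider prediction interval
```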


Example 3.10 Pricing Communications Nodes (one last time)

What do we get when there are 40 ports?

Many statistics packages have a way for you to do the prediction. Here is Minitab's output:

Predicted Values for New Observations
New Obs    Fit  SE Fit        95.0% CI        95.0% PI
      1  42600    1178  (40035, 45166)  (32872, 52329)

Values of Predictors for New Observations
New Obs  NUMPORTS
      1      40.0


From the Output

ŷm = 42600    Sm = 1178

Confidence interval: 40035 to 45166, computed as 42600 ± 2.179(1178)

Prediction interval: 32872 to 52329, computed as 42600 ± 2.179(Sp); the output does not list Sp.
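Sp can be backed out of the output; a short check (all numbers from the Minitab output above) using the relationship Sp² = Sm² + Se²:

```python
import numpy as np

se_fit, s_e, t_crit = 1178, 4307, 2.179       # SE Fit, S and the t multiplier from the output
pi_lo, pi_hi = 32872, 52329                   # the 95% prediction interval

s_p_from_interval = (pi_hi - pi_lo) / 2 / t_crit    # half-width divided by t
s_p_from_formula = np.sqrt(se_fit ** 2 + s_e ** 2)  # Sp^2 = Sm^2 + Se^2
print(s_p_from_interval, s_p_from_formula)          # both about 4465
```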


Interpretations

For all projects with 40 ports, we are 95% sure that the average cost is between $40,035 and $45,166.

We are 95% sure that any individual project will have a cost between $32,872 and $52,329.


3.5.3 Assessing Quality of Prediction

• We use the model's R2 as a measure of how well it fits, but this may overestimate the model's ability to predict.

• The reason is that R2 is optimized by the least squares procedure for the data in our sample.

• It is not necessarily optimal for data outside our sample, which is what we are predicting.


Data Splitting

• We can split the data into two pieces. Use the first part to obtain the equation and use it to predict the data in the second part.

• By comparing the actual y values in the second part to their corresponding predicted values, you get an idea of how well you predict data that is not in the "fit" sample.

• The biggest drawback to this is that it won't work too well unless we have a lot of data. To be really reliable we should have at least 25 to 30 observations in both samples.


The PRESS Statistic

• Suppose you temporarily deleted observation i from the data set, fit a new equation, then used it to predict the yi value.

• Because the new equation did not use any information from this data point, we get a clearer picture of the model's ability to predict it.

• The sum of these squared prediction errors is the PRESS statistic.


Prediction R2

• It sounds like a lot of work to do by hand, but most statistics packages will do it for you.

• You can then compute an R2-like measure called the prediction R2 (a leave-one-out sketch follows below):

$R^2_{PRED} = 1 - \dfrac{PRESS}{SST}$
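A minimal leave-one-out sketch of PRESS (six-point example; the helper fit_line is just the hand formula from Section 3.1):

```python
import numpy as np

def fit_line(x, y):
    """Least squares intercept and slope."""
    n = len(x)
    b1 = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / (np.sum(x ** 2) - np.sum(x) ** 2 / n)
    return y.mean() - b1 * x.mean(), b1

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([3, 2, 8, 8, 11, 13], dtype=float)

press = 0.0
for i in range(len(x)):
    keep = np.arange(len(x)) != i            # temporarily delete observation i
    b0, b1 = fit_line(x[keep], y[keep])      # refit without it
    press += (y[i] - (b0 + b1 * x[i])) ** 2  # squared prediction error for the held-out point

sst = np.sum((y - y.mean()) ** 2)
print(press, 1 - press / sst)                # PRESS and the prediction R2
```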


In Our Example

For the communications node data we have been using, SSE = 222594146, SST = 1973862521 and R2 = 88.7%.

Minitab reports that PRESS = 345066019.

Our prediction R2:

1 - (345066019/1973862521) = 1 - .175 = .825, or 82.5%

Although there is a little loss, it implies we still have good prediction ability.


3.6 Fitting a Linear Trend Model to Time-Series Data

• Data gathered on different units at the same point in time are called cross-sectional data.

• Data gathered on a single unit (person, firm, etc.) over a sequence of time periods are called time-series data.

• With this type of data, the primary goal is often building a model that can forecast the future.


Time Series Models

• There are many types of models that attempt to identify patterns of behavior in a time series in order to extrapolate it into the future.

• Some of these will be examined in Chapter 11, but here we will just employ a simple linear trend model.


The Linear Trend Model

We assume the series displays a steady upward or downward behavior over time that can be described by:

$y_t = \beta_0 + \beta_1 t + e_t$

where t is the time index (t = 1 for the first observation, t = 2 for the second, and so forth).

The forecast for this model is quite simple:

$\hat{y}_T = b_0 + b_1 T$

You just insert the appropriate value for T into the regression equation.
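A minimal sketch of fitting and forecasting a linear trend (the quarterly series here is made up; the actual ABX figures are not reproduced in the slides):

```python
import numpy as np

sales = np.array([210.0, 195.0, 220.0, 230.0, 225.0, 215.0, 240.0, 255.0])  # hypothetical data
t = np.arange(1, len(sales) + 1)       # time index: 1, 2, ..., n

b1, b0 = np.polyfit(t, sales, 1)       # slope and intercept of the trend line
print(b0, b1)

future_t = np.arange(len(sales) + 1, len(sales) + 5)   # next four periods
print(b0 + b1 * future_t)              # forecasts from the trend equation
```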


Example 3.11 ABX Company Sales

• The ABX Company sells winter sports merchandise including skates and skis. The quarterly sales (in $1000s) from first quarter 1994 through fourth quarter 2003 are graphed on the next slide.

• The time-series plot shows a strong upward trend. There are also some seasonal fluctuations, which will be addressed in Chapter 7.


[Time-series plot of quarterly SALES (roughly 200 to 300) against the observation index 1-40, titled "ABX Company Sales"]


Obtaining the Trend Equation

• We first need to create the time index variable, which is equal to 1 for first quarter 1994 and 40 for fourth quarter 2003.

• Once this is created we can obtain the trend equation by linear regression.


Trend Line Estimation

The regression equation is
SALES = 199 + 2.56 TIME

Predictor      Coef  SE Coef      T      P
Constant    199.017    5.128  38.81  0.000
TIME         2.5559   0.2180  11.73  0.000

S = 15.91   R-Sq = 78.3%   R-Sq(adj) = 77.8%

Analysis of Variance
Source          DF     SS     MS       F      P
Regression       1  34818  34818  137.50  0.000
Residual Error  38   9622    253
Total           39  44440


The Slope Coefficient

The slope in the equation is 2.5559. This implies that over this 10-year period, we saw an average growth in sales of $2,556 per quarter.

The hypothesis test on the slope has a t value of 11.73, so this is indeed significantly greater than zero.


Forecasts For 2004

• Forecasts for 2004 can be obtained by evaluating the equation at t = 41, 42, 43 and 44.

• For example, the sales in the fourth quarter are forecast as:

SALES = 199.017 + 2.5559(44) = 311.48

• A graph of the data, the estimated trend and the forecasts is next.


[Plot of SALES against TIME showing the data, the estimated trend line (solid) and the 2004 forecasts (dashed)]


3.7 Some Cautions in Interpreting Regression Results

Two common mistakes made when using regression analysis are:

1. Assuming that x causes y to happen, and

2. Assuming that you can use the equation to predict y for any value of x.


3.7.1 Association Versus Causality

• If you have a model with a high R2, it does not automatically mean that a change in x causes y to change in a very predictable way.

• It could be just the opposite, that y causes x to change. A high correlation goes both ways.

• It could also be that both y and x are changing in response to a third variable that we don't know about.


The Third Factor

• One example of this third factor is the price and gasoline mileage of automobiles. As price increases, there is a sharp drop in mpg. This is caused by size: larger cars cost more and get less mileage.

• Another is mortality rate in a country versus the percentage of homes with television. As TV ownership increases, mortality rate drops. This is probably due to better economic conditions improving quality of life and simultaneously allowing for greater TV ownership.


3.7.2 Forecasting Outside the Range of the Explanatory Variable

• When we have a model with a high R2, it means we know a good deal about the relationship of y and x for the range of x values in our study.

• Think of our communication nodes example, where the number of ports ranged from 12 to 68. Does our model even hold if we wanted to price a massive project of 200 ports?


An Extrapolation Penalty

• Recall that our prediction intervals were always narrowest when we predicted right in the middle of our data set.

• As we go farther and farther outside the range of our data, the interval gets wider and wider, implying we know less and less about what is going on.


Chapter 4: Multiple Regression Analysis (Part 1)

Terry Dielman, Applied Regression Analysis: A Second Course in Business and Economic Statistics, fourth edition


4.1 Using Multiple Regression

• In Chapter 3, the method of least squares was used to describe the relationship between a dependent variable y and an explanatory variable x.

• Here we extend that to two or more predictor variables, using an equation of the form:

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$


Basic Exploration

• In Chapter 3 our main graphic tool was the X-Y scatter plot.

• Exploratory graphics are a bit harder to produce here because they need to be multidimensional.

• Even if there were just two x variables, a 3-D display is needed.


Estimation of Coefficients

We want an equation of the form:

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$

As before we use least squares: the coefficients b0, b1, b2, ..., bk are determined by minimizing the sum of squared residuals.


Formulae are Very Complex

• We can show an exact formula when k = 1 (simple regression); refer to Section 3.1.

• Few texts show the formulae for k = 2 (the simplest of multiple regressions).

• Appendix D shows the formula in matrix notation.

• This is totally a computer problem (see the sketch below).
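For illustration, a minimal sketch of what the software does (the small data set here is made up, not the Meddicorp file):

```python
import numpy as np

# Made-up observations: y and two predictors
y  = np.array([10.0, 12.0, 15.0, 18.0, 21.0, 24.0])
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares: choose b0, b1, b2 to minimize the sum of squared residuals
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)    # [b0, b1, b2]
```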


Example 4.1 Meddicorp Sales

n = 25 sales territories

Y = Sales ($1000s) in each territory

X1 = Advertising ($100s) in the territory

X2 = Amount of bonuses ($100s) paid to salespersons in the territory

Data set: MEDDICORP4


Plots and Correlations

[Scatter plots of SALES versus ADV and SALES versus BONUS]

Correlations
        SALES    ADV
ADV     0.900
BONUS   0.568  0.419


3D Graphics

[3-D scatter plot of SALES against ADV and BONUS, titled "Meddicorp Sales"]


Minitab Regression Output

The regression equation is
SALES = -516 + 2.47 ADV + 1.86 BONUS

Predictor      Coef  SE Coef      T      P
Constant     -516.4    189.9  -2.72  0.013
ADV          2.4732   0.2753   8.98  0.000
BONUS        1.8562   0.7157   2.59  0.017

S = 90.75   R-Sq = 85.5%   R-Sq(adj) = 84.2%

Analysis of Variance
Source          DF       SS      MS      F      P
Regression       2  1067797  533899  64.83  0.000
Residual Error  22   181176    8235
Total           24  1248974


3D Surface Graph

[3-D surface plot of estimated sales over ADV and BONUS, titled "Estimated Meddicorp Sales"]

The regression equation is
SALES = -516 + 2.47 ADV + 1.86 BONUS


Interpretation of Coefficients

The regression equation is
SALES = -516 + 2.47 ADV + 1.86 BONUS

• Recall that sales is in $1000s and advertising and bonus are in $100s.

• If advertising is held fixed, sales increase $1,860 for each $100 of bonus paid.

• If bonus is held fixed, sales increase $2,470 for each $100 spent on advertising.


4.2 Inferences From a Multiple Regression Analysis

In general, the population regression equation involving K predictors is:

$\mu_{y|x_1, x_2, \ldots, x_K} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_K x_K$

This says the mean value of y at a given set of x values is a point on the surface described by the terms on the right-hand side of the equation.


4.2.1 Assumptions Concerning the Population Regression Line

An alternative way of writing the relationship is:

$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_K x_{Ki} + e_i$

where i denotes the ith observation and ei denotes a random error or disturbance (deviation from the mean).

We make certain assumptions about the ei.


Assumptions

1. We expect the average disturbance ei to be zero so the regression line passes through the average value of y.

2. The ei have constant variance σe2.

3. The ei are normally distributed.

4. The ei are independent.


Inferences

• The assumptions allow inferences about the population relationship to be made from a sample equation.

• The first inferences considered are those about the individual population coefficients β1, β2, ..., βK.

• Chapter 6 examines what happens when the assumptions are violated.


4.2.2 Inferences about the Population Regression Coefficients

If we wish to make an estimate of the effect of a change in one of the x variables on y, use the interval:

$b_j \pm t_{n-K-1}\, S_{b_j}$

This refers to the jth of the K+1 regression coefficients. The multiplier t is selected from the t-distribution with n-K-1 degrees of freedom.


Tests About the Coefficients

A test about the marginal effect of xj on y may be obtained from:

H0: βj = βj*
Ha: βj ≠ βj*

where βj* is some specific value that is relevant for the jth coefficient.


Test Statistic

The test would be performed by using the standardized test statistic:

$t = \dfrac{b_j - \beta_j^*}{S_{b_j}}$

The most common form of this test is for the parameter to be 0. In this case the test statistic is just the estimate divided by its standard error.


Example 4.2 Meddicorp (Continued)

Refer again to the portion of the regression output about the individual regression coefficients:

Predictor      Coef  SE Coef      T      P
Constant     -516.4    189.9  -2.72  0.013
ADV          2.4732   0.2753   8.98  0.000
BONUS        1.8562   0.7157   2.59  0.017

This lists the estimates, their standard errors and the ratio of the estimates to their standard errors.


Tests For Effect of Advertising

To see if an increase in advertising expenditure affects sales, we can test:

H0: βADV = 0 (An increase in advertising has no effect on sales)

Ha: βADV ≠ 0 (Sales do change when advertising increases)

The df are n-K-1 = 25-2-1 = 22. At a 5% significance level, the critical point from the t-table is 2.074.


Test Result

From the output we get:

t = ( 2.4732 – 0)/.2753 = 8.98

This is above the critical value of 2.074, so we reject H0.

Note that we could also make use of the p-value (.000) for the test.


One-Sided Test on Bonus

We can modify the test to make it one-sided:

H0: βBONUS = 0 (Increased bonuses do not affect sales)

Ha: βBONUS > 0 (Sales increase when bonuses are higher)

At a 5% significance level, the (one-sided) critical point is 1.717.


One-Sided Test Result

From the output we get:

t = 1.8562/.7157 = 2.59 which is > 1.717

We reject H0 but this time make a more specific conclusion.

The listed p-value (.017) is for a two-sided test. For our one-sided test, cut it in half.


Interval Estimate for the Effect of Advertising

Recall that sales are measured in $1000s and ADV in $100s.

bADV = 2.4732, with standard error .2753

2.4732 ± 2.074(.2753) = 2.4732 ± .5709 = 1.902 to 3.044

Each $100 spent on advertising returns $1,902 to $3,044 in sales.
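A short sketch (coefficient and standard error from the Minitab output; n = 25 and K = 2) reproducing this interval with n-K-1 degrees of freedom:

```python
from scipy import stats

n, K = 25, 2
b_adv, se_adv = 2.4732, 0.2753                 # ADV coefficient and its standard error

t_crit = stats.t.ppf(0.975, df=n - K - 1)      # 22 df, about 2.074
print(b_adv - t_crit * se_adv, b_adv + t_crit * se_adv)   # roughly 1.902 to 3.044
```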


4.3 Assessing the Fit of the Model

Recall how we partitioned the variation in the previous chapter:

SST = Total variation in the sample of Y values, split up into two components, SSE and SSR

SSE = Error or unexplained variation

SSR = Variation explained by the Y-hat function


4.3.1 The ANOVA Table and R2

• These are the same statistics we briefly examined in simple regression.

• They are perhaps more important here because they measure how well all the variables in the equation work together.

S = 90.75   R-Sq = 85.5%   R-Sq(adj) = 84.2%

Analysis of Variance
Source          DF       SS      MS      F      P
Regression       2  1067797  533899  64.83  0.000
Residual Error  22   181176    8235
Total           24  1248974


R2 – a Universal Measure of Fit

R2 = SSR/SST = proportion of variation explained by the regression equation.

If multiplied by 100, interpret it as a percentage.

If there is only one x, R2 is the square of the correlation between y and x.

For multiple regression, R2 is the square of the correlation between the Y values and the Y-hat values.


For our example

S = 90.75   R-Sq = 85.5%   R-Sq(adj) = 84.2%

Analysis of Variance
Source          DF       SS      MS      F      P
Regression       2  1067797  533899  64.83  0.000
Residual Error  22   181176    8235
Total           24  1248974

R2 = 1067797/1248974 = .85494

85.5% of the variation in sales in the 25 territories is explained by the different levels of advertising and bonus.


Adjusted R2

• If there are many predictor variables to choose from, the best R2 is always obtained by throwing them all in the model.

• Some of these predictors could be insignificant, suggesting they contribute little to the model's R2.

• Adjusted R2 is a way to balance the desire for a high R2 against the desire to include only important variables.


Computation

The "adjustment" is for the number of variables in the model.

Although regular R2 may decrease when you remove a variable, the adjusted version may actually increase if that variable did not have much significance.

)1/(

)1/(12

−−−=

nSST

KnSSERadj
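A quick check of the formula (SSE and SST taken from the Meddicorp ANOVA table):

```python
sse, sst, n, K = 181176, 1248974, 25, 2

r_sq = 1 - sse / sst
r_sq_adj = 1 - (sse / (n - K - 1)) / (sst / (n - 1))
print(r_sq, r_sq_adj)   # about 0.855 and 0.842, matching the Minitab output
```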


4.3.2 The F Statistic

• Since R2 is so high, you would certainly think that the model contains significant predictive power.

• In other problems it is perhaps not so obvious. For example, would an R2 of 20% show any prediction ability at all?

• We can test for the predictive power of the entire model using the F statistic.


F Tests

• Generally these compare two sources of variation.

• F = V1/V2 and has two df parameters.

• Here V1 = SSR/K has K df.

• And V2 = SSE/(n-K-1) has n-K-1 df.


F Tables

Usually you will see several pages of these, one or two pages at each specific level of significance (.10, .05, .01). The rows are indexed by denominator d.f. and the columns by numerator d.f.; each entry is the value of F at that specific significance level.


F Test Hypotheses

H0: β1 = β2 = … = βK = 0 (None of the Xs help explain Y)

Ha: Not all βs are 0 (At least one X is useful)

H0: R2 = 0 is an equivalent hypothesis.


F test for our example

S = 90.75   R-Sq = 85.5%   R-Sq(adj) = 84.2%

Analysis of Variance
Source          DF       SS      MS      F      P
Regression       2  1067797  533899  64.83  0.000
Residual Error  22   181176    8235
Total           24  1248974

F = 533899/8235 = 64.83, which has p-value = 0.000

From tables, F2,22,.05 = 3.44 and F2,22,.01 = 5.72

This confirms that R2 = 85.5% is not near zero.
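The same F test in a short sketch (sums of squares from the ANOVA table; scipy supplies the critical values and p-value):

```python
from scipy import stats

ssr, sse, n, K = 1067797, 181176, 25, 2

f_stat = (ssr / K) / (sse / (n - K - 1))       # MSR / MSE, about 64.8
print(f_stat)
print(stats.f.ppf(0.95, K, n - K - 1))         # about 3.44
print(stats.f.ppf(0.99, K, n - K - 1))         # about 5.72
print(stats.f.sf(f_stat, K, n - K - 1))        # p-value, essentially 0
```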

Chapter 4: Multiple Regression Analysis (Part 2)

Terry Dielman, Applied Regression Analysis: A Second Course in Business and Economic Statistics, fourth edition


4.4 Comparing Two Regression Models

So far we have looked at two types of hypothesis tests. One was about the overall fit:

H0: β1 = β2 = …= βK = 0

The other was about individual terms:

H0: βj = 0

Ha: βj ≠ 0

4.4.1 Full and Reduced Model Using Separate Regressions

• Suppose we wanted to test a subset of the x variables for significance as a group.

• We could do this by comparing two models.

• The first (Full Model) has K variables in it.

• The second (Reduced Model) contains only the L variables that are NOT in our group.

The Two Models

For convenience, let's assume the group is the last (K - L) variables. The Full Model is:

y = β0 + β1x1 + … + βLxL + βL+1xL+1 + … + βKxK + e

The Reduced Model is just:

y = β0 + β1x1 + … + βLxL + e


The Partial F Test

We test the group for significance with another F test. The hypothesis is:

H0: βL+1 = βL+2 = …= βK = 0

Ha: At least one β ≠ 0

The test is performed by seeing how much SSE changes between models.

The Partial F Statistic

Let SSEF and SSER denote the SSE in the full and reduced models.

F = [(SSER - SSEF) / (K - L)] / [SSEF / (n - K - 1)]

The statistic has (K-L) numerator and (n-K-1) denominator d.f.
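If the two models are fit in software, the partial F test drops out directly. A hedged sketch in Python with statsmodels, assuming a DataFrame df holding the Meddicorp variables (SALES, ADV, BONUS, MKTSHR, COMPET), which is not shown in these slides:

import statsmodels.formula.api as smf

# df is assumed to hold the Meddicorp data
full    = smf.ols("SALES ~ ADV + BONUS + MKTSHR + COMPET", data=df).fit()
reduced = smf.ols("SALES ~ ADV + BONUS", data=df).fit()

# Partial F test: does the MKTSHR/COMPET group add anything beyond ADV and BONUS?
F, p_value, df_diff = full.compare_f_test(reduced)

# The same statistic by hand from the two SSEs
K, L, n = 4, 2, len(df)
F_manual = ((reduced.ssr - full.ssr) / (K - L)) / (full.ssr / (n - K - 1))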

The "Group"

� In many problems the group of variables has a natural definition.

� In later chapters we look at groups that provide curvature, measure location and model seasonal variation.

� Here we are just going to look at the effect of adding two new variables.


Example 4.4 Meddicorp (yet again)

In addition to the variables for advertising and bonuses paid, we now consider variables for market share and competition.

x3 = Meddicorp market share in each area

x4 = largest competitor's sales in each area

The New Regression Model

The regression equation is

SALES = - 594 + 2.51 ADV + 1.91 BONUS + 2.65 MKTSHR - 0.121 COMPET

Predictor Coef SE Coef T P

Constant -593.5 259.2 -2.29 0.033

ADV 2.5131 0.3143 8.00 0.000

BONUS 1.9059 0.7424 2.57 0.018

MKTSHR 2.651 4.636 0.57 0.574

COMPET -0.1207 0.3718 -0.32 0.749

S = 93.77 R-Sq = 85.9% R-Sq(adj) = 83.1%

Analysis of Variance

Source DF SS MS F P

Regression 4 1073119 268280 30.51 0.000

Residual Error 20 175855 8793

Total 24 1248974

Did We Gain Anything?

� The old model had R2 = 85.5% so we gained only .4%.

� The t ratios for the two new variables are .57 and -.32.

� It does not look like we have an improvement, but we really need the F test to be sure.


The Formal Test

Numerator df = (K - L) = 4 - 2 = 2
Denominator df = (n - K - 1) = 20

At a 5% level, F2,20 = 3.49

H0: βMKTSHR = βCOMPET = 0
Ha: At least one is ≠ 0

Reject H0 if F > 3.49

Things We Need

Full Model (K = 4): SSEF = 175855, n - K - 1 = 20

Reduced Model (L = 2): Analysis of Variance

Source DF SS MS F P

Regression 2 1067797 533899 64.83 0.000

Residual Error 22 181176 8235

Total 24 1248974

SSER = 181176

Computations

F = [(SSER - SSEF) / (K - L)] / [SSEF / (n - K - 1)]
  = [(181176 - 175855) / (4 - 2)] / [175855 / (25 - 4 - 1)]
  = (5321 / 2) / 8793 = 0.3026
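A quick check of this arithmetic, and of how far the statistic falls below the 5% cutoff (scipy assumed):

from scipy import stats

SSE_R, SSE_F = 181176, 175855
K, L, n = 4, 2, 25

F = ((SSE_R - SSE_F) / (K - L)) / (SSE_F / (n - K - 1))
print(F)                                    # about 0.30
print(stats.f.ppf(0.95, K - L, n - K - 1))  # critical value, about 3.49
print(stats.f.sf(F, K - L, n - K - 1))      # p-value, far above .05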


4.4.2 Full and Reduced Model Comparisons Using Conditional Sums of Squares

� In the standard ANOVA table, SSR shows the amount of variation explained by all variables together.

� Alternate forms of the table break SSR down into components.

� For example, Minitab shows sequential SSR which shows how much SSR increases as each new term is added.

Sequential SSR for Meddicorp

S = 93.77 R-Sq = 85.9% R-Sq(adj) = 83.1%

Analysis of Variance

Source DF SS MS F P

Regression 4 1073119 268280 30.51 0.000

Residual Error 20 175855 8793

Total 24 1248974

Source DF Seq SS

ADV 1 1012408

BONUS 1 55389

MKTSHR 1 4394

COMPET 1 927

Meaning What?

1. If ADV was added to the model first, SSR would rise from 0 to 1012408.

2. Addition of BONUS would yield a nice increase of 55389.

3. If MKTSHR entered third, SSR would rise a paltry 4394.

4. Finally, if COMPET came in last, SSR would barely budge by 927.


Implications

� This is another way of showing that once you account for advertising and bonuses paid, you do not get much more from the last two variables.

� The last two sequential SSR values add up to 5321, which was the same as the (SSER – SSEF) quantity computed in the partial F test.

� Given that, it is not surprising to learn that the partial F test can be stated in terms of sequential sums of squares.

4.5 Prediction With a Multiple Regression Equation

As in simple regression, we will look at two types of computations:

1. Estimating the mean y that can occur at a set of x values.

2. Predicting an individual value of y that can occur at a set of x values.

4.5.1 Estimating the Conditional Mean of y Given x1, x2, ..., xK

This is our estimate of the point on our regression surface that occurs at a specific set of x values.

For two x variables, we are estimating:

µy|x1,x2 = β0 + β1x1 + β2x2


Computations

The point estimate is straightforward, just plug in the x values.

The difficult part is computing a standard error to use in a confidence interval. Thankfully, most computer programs can do that.

ŷm = b0 + b1x1 + b2x2

4.5.2 Predicting an Individual Value of y Given x1, x2, ..., xK

Now the quantity we are trying to estimate is:

Our interval will have to account for the extra term ( ei ) in the equation, thus will be wider than the interval for the mean.

yi = β0 + β1x1i + β2x2i + ei

Prediction in Minitab

Here we predict sales for a territory with 500 units of advertising and 250 units of bonus

Predicted Values for New Observations

New Obs Fit SE Fit 95.0% CI 95.0% PI

1 1184.2 25.2 (1131.8, 1236.6) ( 988.8, 1379.5)

Values of Predictors for New Observations

New Obs ADV BONUS

1 500 250
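The same two intervals can be produced with statsmodels; a sketch assuming a DataFrame df with the Meddicorp SALES, ADV and BONUS columns (the data file itself is not shown in these slides):

import pandas as pd
import statsmodels.formula.api as smf

model = smf.ols("SALES ~ ADV + BONUS", data=df).fit()   # df assumed as above

new = pd.DataFrame({"ADV": [500], "BONUS": [250]})
pred = model.get_prediction(new).summary_frame(alpha=0.05)

# mean_ci_lower / mean_ci_upper -> 95% CI for the conditional mean
# obs_ci_lower  / obs_ci_upper  -> 95% PI for an individual territory
print(pred[["mean", "mean_ci_lower", "mean_ci_upper",
            "obs_ci_lower", "obs_ci_upper"]])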


Interpretations

We are 95% sure that the average sales in territories with $50,000 advertising and $25,000 of bonuses will be between $1,131,800 and $1,236,600.

We are 95% sure that any individual territory with this level of advertising and bonuses will have between $988,800 and $1,379,500 of sales.

4.6 Multicollinearity: A Potential Problemin Multiple Regression

� In multiple regression, we like the x variables to be highly correlated with y because this implies good prediction ability.

� If the x variables are highly correlated among themselves, however, much of this prediction ability is redundant.

� Sometimes this redundancy is so severe that it causes some instability in the coefficient estimation. When that happens we say multicollinearity has occurred.

4.6.1 Consequences of Multicollinearity

1. The standard errors of the bj are larger than they should be. This could cause all the t statistics to be near 0 even though the F is large.

2. It is hard to get good estimates of the βj. The bj may have the wrong sign. They may have large changes in value if another variable is dropped from or added to the regression.


4.6.2 Detecting Multicollinearity

Several methods appear in the literature. Some of these are:

1. Examining pairwise correlations

2. Seeing large F but small t ratios

3. Computing Variance Inflation Factors

Examining Pairwise Correlations

� If it is only a pairwise collinearity problem, you can detect it by examining the correlations for pairs of x values.

� How large the correlation needs to be before it suggests a problem is debatable. One rule of thumb is .5, another is the maximum correlation between y and the various x values.

� The major limitation of this is that it will not help if there is a linear relationship involving several x values, for example,

x1 = 2x2 - .07x3 + a small random error

Large F, Small t

� With a significant F statistic you would expect to see at least one significant predictor, but that may not happen if all the variables are fighting each other for significance.

� This method of detection may not work if there are, say, six good predictors but the multicollinearity only involves four of them.

� This method also may not help identify what variables are involved.


Variance Inflation Factors

� This is probably the most reliable method for detection because it both shows the problem exists and what variables are involved.

� We can compute a VIF for each variable. A high VIF is an indication that the variable's standard error is "inflated" by its relationship to the other x variables.

Auxiliary Regressions

Suppose we regressed each x value, in turn, on all of the other x variables.

Let Rj2 denote the model's R2 we get

when xj was the "temporary y".

The variable's VIF is VIFj = 1 / (1 - Rj²)
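Most packages do the auxiliary regressions internally. A sketch with statsmodels, assuming X_predictors is a DataFrame holding only the x variables (an assumed name, not from the slides):

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(X_predictors)    # e.g. ADV, BONUS, MKTSHR, COMPET
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}
print(vifs)                          # VIF near 1 means little redundancy with the other x's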

VIFj and Rj2

If xj were totally uncorrelated with the other x variables, its VIF would be 1.

This table shows some other values.

Rj²    VIFj
 0%      1
50%      2
80%      5
90%     10
99%    100


Auxiliary Regressions: A Lot of Work?

� If there were a large number of x variables in the model, obtaining the auxiliaries would be tedious.

� Most statistics packages will compute the VIF statistics for you and report them with the coefficient output.

� You can then do the auxiliary regressions, if needed, for the variables with high VIF.

Using VIFs

� A general rule is that any VIF > 10 is a problem.

� Another is that if the average VIF is considerably larger than 1, SSE may be inflated.

� The average VIF indicates how many times larger SSE is, because of multicollinearity, than it would be if the predictors were uncorrelated.

� Freund and Wilson suggest comparing the VIF to 1/(1-R2) for the main model. If the VIF are less than this, multicollinearity is not a problem.

Our Example

Pairwise correlations

The maximum correlation among the x variables is .452 so if multicollinearity exists it is well hidden.

Correlations: SALES, ADV, BONUS, MKTSHR, COMPET

SALES ADV BONUS MKTSHR

ADV 0.900

BONUS 0.568 0.419

MKTSHR 0.023 -0.020 -0.085

COMPET 0.377 0.452 0.229 -0.287


VIFs in Minitab

The regression equation is

SALES = - 594 + 2.51 ADV + 1.91 BONUS + 2.65 MKTSHR - .121 COMPET

Predictor Coef SE Coef T P VIF

Constant -593.5 259.2 -2.29 0.033

ADV 2.5131 0.3143 8.00 0.000 1.5

BONUS 1.9059 0.7424 2.57 0.018 1.2

MKTSHR 2.651 4.636 0.57 0.574 1.1

COMPET -0.1207 0.3718 -0.32 0.749 1.4

S = 93.77 R-Sq = 85.9% R-Sq(adj) = 83.1%

No Problem!

4.6.3 Correction for Multicollinearity

� One solution would be to leave out one or more of the redundant predictors.

� Another would be to use the variables differently. If x1 and x2 are collinear, you might try using x1 and the ratio x2/x1 instead.

� Finally, there are specialized statistical procedures that can be used in place of ordinary least squares.

4.7 Lagged Variables as Explanatory Variables in Time-Series Regression

� When using time series data in a regression, the relationship between y and x may be concurrent, or x may serve as a leading indicator.

� In the latter, a past value of x appears as a predictor, either with or without the current value of x.

� An example would be the relationship between housing starts as y and interest rates as x. When rates drop, it is several months before housing starts increase.


Lagged Variables

The effect of advertising on sales is often cumulative, so it would not be surprising to see it modeled as:

yt = β0 + β1xt + β2xt-1 + β3xt-2 + et

Here xt is advertising in the current month and the lagged variables xt-1 and xt-2 represent advertising in the two previous months.

Potential Pitfalls

� If several lags of the same variable are used, it could cause multicollinearity if xt was highly autocorrelated (correlated with its own past values).

� Lagging causes lost data. If xt-2 is included in the model, the first time it can be computed is at time period t = 3. We lose any information in the first two observations.

Lagged y Values

� Sometimes a past value of y is used as a predictor as well. A relationship of this type might be:

� This implies that this month's sales yt are related to two months of advertising expense, xt and xt-1, plus last month's sales yt-1.

yt = β0 + β1yt-1 + β2xt + β3xt-1 + et
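Building the lags is usually a one-line operation. A sketch with pandas and statsmodels, assuming a DataFrame df with monthly columns sales and adv in time order (hypothetical names, not from the slides):

import statsmodels.formula.api as smf

df["adv_lag1"]   = df["adv"].shift(1)     # x_{t-1}
df["adv_lag2"]   = df["adv"].shift(2)     # x_{t-2}
df["sales_lag1"] = df["sales"].shift(1)   # y_{t-1}

# Distributed-lag model: y_t on x_t, x_{t-1}, x_{t-2}
dist_lag = smf.ols("sales ~ adv + adv_lag1 + adv_lag2", data=df.dropna()).fit()

# Model with a lagged dependent variable: y_t on y_{t-1}, x_t, x_{t-1}
lagged_y = smf.ols("sales ~ sales_lag1 + adv + adv_lag1", data=df.dropna()).fit()
# dropna() discards the first rows lost to lagging, as described above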


Example 4.6 Unemployment Rate

� The file UNEMP4 contains the national unemployment rates (seasonally-adjusted) from January 1983 through December 2002.

� On the next few slides are a time series plot of the data and regression models employing first and second lags of the rates.

Time Series Plot
[Time series plot of UNEMP versus Date/Time]
Autocorrelation is .97 at lag 1 and .94 at lag 2

Regression With First Lag

The regression equation is

UNEMP = 0.153 + 0.971 Unemp1

239 cases used 1 cases contain missing values

Predictor Coef SE Coef T P

Constant 0.15319 0.04460 3.44 0.001

Unemp1 0.971495 0.007227 134.43 0.000

S = 0.1515 R-Sq = 98.7% R-Sq(adj) = 98.7%

Analysis of Variance

Source DF SS MS F P

Regression 1 414.92 414.92 18070.47 0.000

Residual Error 237 5.44 0.02

Total 238 420.36

High R2 because of autocorrelation


Regression With Two Lags

The regression equation is

UNEMP = 0.168 + 0.890 Unemp1 + 0.0784 Unemp2

238 cases used 2 cases contain missing values

Predictor Coef SE Coef T P VIF

Constant 0.16764 0.04565 3.67 0.000

Unemp1 0.89032 0.06497 13.70 0.000 77.4

Unemp2 0.07842 0.06353 1.23 0.218 77.4

S = 0.1514 R-Sq = 98.7% R-Sq(adj) = 98.6%

Analysis of Variance

Source DF SS MS F P

Regression 2 395.55 197.77 8630.30 0.000

Residual Error 235 5.39 0.02

Total 237 400.93

Comments

� It does not appear that the second lag term is needed. Its t statistic is 1.23.

� Because we got R2 = 98.7% from the model with just one term, there was not much variation left for the second lag term to explain.

� Note that the second model also had a lot of multicollinearity.

Fitting Curves to Data ١٧١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Chapter 5: Fitting Curves to Data

Terry Dielman
Applied Regression Analysis: A Second Course in Business and Economic Statistics, fourth edition


Fitting Curves to Data ١٧٢

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

5.1 Introduction

In Chapter 4 , the model was presented as:

where we assumed linear relationships between y and the x variables.

In this chapter we find that this may not be true and consider curvilinear relationships between the variables.

yi = β0 + β1x1i + β2x2i + … + βKxKi + ei

Fitting Curves to Data ١٧٣

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Modeling

� In general, we regress Y on some function of X that is not linear.

� Common functions are X², 1/X, or log(X)

� In economics, sometimes regress log(y) on log(x)

Fitting Curves to Data ١٧٤

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

5.2 Fitting Curvilinear Relationships

� Polynomial Regression – a common correction for nonlinearity is to add powers of the explanatory variable

� In practice a second-order model is often sufficient to describe the relationship

yi = β0 + β1xi + β2xi² + … + βkxi^k + ei
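Fitting a polynomial is just a multiple regression on powers of x. A sketch with statsmodels, assuming a DataFrame df with columns y and x (hypothetical names):

import statsmodels.formula.api as smf

# I(x**2) adds the squared term, giving the second-order model above
quad = smf.ols("y ~ x + I(x**2)", data=df).fit()
print(quad.params)      # b0, b1, b2
print(quad.rsquared)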


Fitting Curves to Data ١٧٥

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example 5.1: Telemarketing

n = 20 telemarketing employees

Y = average calls per day over 20 workdays

X = Months on the job

Data set TELEMARKET5

Fitting Curves to Data ١٧٦

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Plot of Calls versus Months

[Scatterplot of CALLS versus MONTHS]

There is an increase in calls with experience, but the rate of increase slows over time.

Fitting Curves to Data ١٧٧

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Fit of a First-Order Model

� For comparison purposes, we first fit the linear equation and obtained:

CALLS = 13.6708 + .7435 MONTHS

� This equation, which has an R2 of 87.4%, implies that each month of experience leads to .7435 more calls per day.


Fitting Curves to Data ١٧٨

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Fitting a Second-Order Model

Regression Plot
[Scatterplot of CALLS versus MONTHS with the fitted quadratic curve]

CALLS = -0.140471 + 2.31020 MONTHS - 0.0401182 MONTHS**2
S = 1.00325 R-Sq = 96.2% R-Sq(adj) = 95.8%

Fitting Curves to Data ١٧٩

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Regression Output

Regression Analysis: CALLS versus MONTHS, MonthSQ

The regression equation is

CALLS = - 0.14 + 2.31 MONTHS - 0.0401 MonthSQ

Predictor Coef SE Coef T P

Constant -0.140 2.323 -0.06 0.952

MONTHS 2.3102 0.2501 9.24 0.000

MonthSQ -0.040118 0.006333 -6.33 0.000

S = 1.003 R-Sq = 96.2% R-Sq(adj) = 95.8%

Analysis of Variance

Source DF SS MS F P

Regression 2 437.84 218.92 217.50 0.000

Residual Error 17 17.11 1.01

Total 19 454.95

Fitting Curves to Data ١٨٠

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Hypothesis Test on β2

H0: β2 = 0 (use the linear equation)
Ha: β2 ≠ 0 (the quadratic improves the fit)

Test as usual with t = b2/SE(b2)

Here t = -.0402/.00633 = -6.33 is significant with p-value = .000

Not surprising, since R2 increased by about 9 percentage points (from 87.4% to 96.2%)


Fitting Curves to Data ١٨١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Hypothesis Tests "Top Down"

� The usual practice is to keep lower-order terms when a high-order term is significant.

� In b0 + b1 x + b2 x2 we would retain the b1 term even if it had an insignificant t-ratio, if the b2 term was significant.

Fitting Curves to Data ١٨٢

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Higher and higher?

� To see whether an even higher-order polynomial is needed, we fit a cubic equation.

� The table below shows the second-order model was sufficient.

Model      p for highest-order term   R2      Adj R2   Se
Linear     0.000                       87.4%   86.7%    1.787
Quadratic  0.000                       96.2%   95.8%    1.003
Cubic      0.509                       96.3%   95.7%    1.020

Fitting Curves to Data ١٨٣

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Centering the X

� When polynomial regression is used, multicollinearity often results because x and x2 are correlated.

� This can be eliminated by subtracting x-bar (the mean) from each x

Use (x - x̄) and (x - x̄)²
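A small numeric illustration of why centering helps, with made-up x values (not from the text):

import numpy as np

x = np.array([10.0, 12.0, 15.0, 18.0, 22.0, 25.0, 28.0, 30.0])  # hypothetical
xc = x - x.mean()                                               # centered predictor

print(np.corrcoef(x, x**2)[0, 1])     # close to 1: x and x^2 nearly collinear
print(np.corrcoef(xc, xc**2)[0, 1])   # much closer to 0 after centering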


Fitting Curves to Data ١٨٤

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

5.2.2 Reciprocal Transformation of the x Variable

� Another curvilinear relationship that is in common use is:

� Here y and x are inversely related but the relationship is not linear.

yi = β0 + β1(1/xi) + ei

Fitting Curves to Data ١٨٥

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example 5.2

� We are interested in the relationship between gas mileage and a car's horsepower.

� On the next page is a plot of the highway mpg (HWYMPG) and horsepower (HP) for 147 cars listed in the October 2002 Road and Track.

Fitting Curves to Data ١٨٦

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Highway MPG versus Horsepower
[Scatterplot of HWYMPG versus HP]


Fitting Curves to Data ١٨٧

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Modeling the Relationship

� A regression of HWYMPG on HP yields HWYMPG = 38.73 - .0477 HP with R2 = 59.4%

� This does not fit too well because as horsepower increases, mileage decreases, but the rate of decrease is slower for more-powerful cars.

� Although other models, including a quadratic, might work, we regressed HWYMPG on 1/HP.
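The reciprocal fit only requires creating the transformed variable first. A sketch assuming a DataFrame cars with columns HWYMPG and HP (the data file itself is not shown here):

import statsmodels.formula.api as smf

cars["HPINV"] = 1.0 / cars["HP"]          # reciprocal of horsepower
recip = smf.ols("HWYMPG ~ HPINV", data=cars).fit()
print(recip.params, recip.rsquared)       # the slides report R-sq of about 80%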

Fitting Curves to Data ١٨٨

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Regression Results

The regression equation is

HWYMPG = 13.6 + 2692 HPINV

Predictor Coef SE Coef T P

Constant 13.6310 0.6493 20.99 0.000

HPINV 2692.4675 111.7526 24.09 0.000

S = 2.93107 R-Sq = 80.0% R-Sq(adj) = 79.9%

Analysis of Variance

Source DF SS MS F P

Regression 1 4987.0 4987.0 580.48 0.000

Residual Error 145 1245.1 8.6

Total 146 6232.7

Fitting Curves to Data ١٨٩

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Data and Reciprocal Fit
[Scatterplot of HWYMPG versus HP with the fitted reciprocal curve]


Fitting Curves to Data ١٩٠

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

5.2.3 Log Transformation of the x Variable

� Yet another curvilinear equation is:

where ln(x) is the natural logarithm of x.

� It is assumed that the x values are positive because ln(0) is undefined.

yi = β0 + β1 ln(xi) + ei

Fitting Curves to Data ١٩١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example 5.4 Fuel Consumption

n = 51 (50 states plus Washington, D.C.)

FUELCON = fuel consumption per capita

POP = state population

AREA = area of state in square miles

POPDENS = population density

Data Set FUELCON5

Fitting Curves to Data ١٩٢

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Plot of Fuelcon versus Density
[Scatterplot of FUELCON versus DENSITY]
r = -.454


Fitting Curves to Data ١٩٣

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Effect of the Transformation

� The graph has one point (D.C.) on the right with all others clumped to the left.

� It is hard to see what type of relationship there is until some adjustments are made.

� Here take the natural log of density to "pull" the extreme point back in.

Fitting Curves to Data ١٩٤

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Consumption versus Logdensity
[Scatterplot of FUELCON versus LogDensity]
r = -.527

Fitting Curves to Data ١٩٥

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Linear and Log Regressions

The regression equation is

FUELCON = 495 - 0.025 DENSITY

Predictor Coef SE Coef T P

Constant 495.628 9.481 52.28 0.000

DENSITY -0.025 0.007 -3.56 0.001

S = 65.1675 R-Sq = 20.6% R-Sq(adj) = 19.0%

The regression equation is

FUELCON = 597 – 24.5 LOGDENS

Predictor Coef SE Coef T P

Constant 597.19 29.96 22.15 0.000

LOGDENS -24.53 5.65 -4.34 0.000

S = 62.1561 R-Sq = 27.8% R-Sq(adj) = 26.3%


Fitting Curves to Data ١٩٦

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

5.2.4 Log Transformations of Both the y and x Variables

� Here the natural log of y is the dependent variable and the natural log of x is the independent variable:

� Comparing results with other models may be difficult since we are not modeling y itself.

� Economists sometimes use this to estimate price elasticity (y is demand and x is price; b1 is the estimated elasticity).

ln(yi) = β0 + β1 ln(xi) + ei
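A sketch of the log-log fit with statsmodels, assuming a DataFrame df with positive IMPORTS and GDP columns as in the example that follows (the data file itself is not shown):

import numpy as np
import statsmodels.formula.api as smf

df["LogImp"] = np.log(df["IMPORTS"])
df["LogGDP"] = np.log(df["GDP"])

loglog = smf.ols("LogImp ~ LogGDP", data=df).fit()
elasticity = loglog.params["LogGDP"]   # slope b1 estimates the elasticity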

Fitting Curves to Data ١٩٧

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example 5.4 Imports and GDP

The gross domestic product (GDP) and dollar amount of total imports (IMPORTS) for 25 countries were obtained from the World Fact Book.

For both variables, low values clump together and higher values spread out, suggesting log transformations for both.

Fitting Curves to Data ١٩٨

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Scatterplot of Imports vs GDP
[Scatterplot of IMPORTS versus GDP]


Fitting Curves to Data ١٩٩

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Scatterplot of LogImp vs LogGDP
[Scatterplot of LogImp versus LogGDP]

Fitting Curves to Data ٢٠٠

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Two Regression Models

Regression Analysis: IMPORTS versus GDP

Predictor Coef SE Coef T P

Constant 22.32 19.24 1.16 0.258

GDP 0.105671 0.008452 12.50 0.000

S = 87.00 R-Sq = 87.2% R-Sq(adj) = 86.6%

Regression Analysis: LogImp versus LogGDP

Predictor Coef SE Coef T P

Constant -1.1275 0.4346 -2.59 0.016

LogGDP 0.86703 0.07877 11.01 0.000

S = 0.9142 R-Sq = 84.0% R-Sq(adj) = 83.4%

Not directly comparable

Fitting Curves to Data ٢٠١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

The R2 Compare Different Things

� The 87.2 % R2 for the no-log model is the percentage of variation in Imports explained.

� The 84.0% for the second model is the percentage of variation in ln(Imports) explained.

� If you converted the fitted values of the second model back to Imports you might find the log model better.


Fitting Curves to Data ٢٠٢

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

What Transformation to Use

� It is probably best to try several.

� A quadratic is the most flexible because it uses two parameters to fit the relationship between y and x.

� Some further analysis is in Chapter 6 where tests for nonlinearity are discussed.

Fitting Curves to Data ٢٠٣

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

5.2.5 Fitting Curved Trends

If the data is collected over time, we may want to consider variations on the linear trend model of Chapter 3.

Quadratic trend: yt = β0 + β1t + β2t² + et

Another is the S-curve trend: yt = exp(β0 + β1(1/t)) + et

Fitting Curves to Data ٢٠٤

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

S Curve Model

Many products have a demand curve like this.

1. Initial demand increases slowly

2. As product matures, demand picks up and steadily grows.

3. At some saturation point demand levels off.


Fitting Curves to Data ٢٠٥

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Exponential Growth Model

Another alternative is an exponential trend:

This can be fit by least squares if you model ln(y).

yt = exp(β0 + β1t + et)
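A sketch of fitting the exponential trend by regressing ln(y) on t, assuming y is a positive numpy array in time order (an assumed variable, not from the slides):

import numpy as np
import statsmodels.api as sm

t = np.arange(1, len(y) + 1)           # time index 1, 2, ..., n
X = sm.add_constant(t)

fit = sm.OLS(np.log(y), X).fit()       # ln(y_t) = b0 + b1*t
b0, b1 = fit.params
trend = np.exp(b0 + b1 * t)            # fitted trend back on the original scale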

Checking Assumptions ٢٠٦

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Chapter 6: Assessing the Assumptions of the Regression Model

Terry Dielman
Applied Regression Analysis for Business and Economics

Checking Assumptions ٢٠٧

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.1 Introduction

In Chapter 4 the multiple linear regression model was presented as

Certain assumptions were made about how the errors ei behaved. In this chapter we will check to see if those assumptions appear reasonable.

yi = β0 + β1x1i + β2x2i + … + βKxKi + ei


Checking Assumptions ٢٠٨

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.2 Assumptions of the Multiple Linear Regression Model

a. We expect the average disturbance ei to be zero, so the regression line passes through the average value of Y.
b. The disturbances have constant variance σe².
c. The disturbances are normally distributed.
d. The disturbances are independent.

Checking Assumptions ٢٠٩

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.3 The Regression Residuals

� We cannot check to see if the disturbances ei behave correctly because they are unknown.

� Instead, we work with their sample counterpart, the residuals

which represent the unexplained variation in the y values.

êi = yi - ŷi

Checking Assumptions ٢١٠

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Properties

Property 1: They will always average 0

because the least squares estimation procedure makes that happen.

Property 2: If assumptions a, b and d of Section 6.2 are true then the residuals should be randomly distributed around their mean of 0. There should be no systematic pattern in a residual plot.

Property 3: If assumptions a through d hold, the residuals should look like a random sample from a normal distribution.


Checking Assumptions ٢١١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Suggested Residual Plots

1. Plot the residuals versus each explanatory variable.

2. Plot the residuals versus the predicted values.

3. For data collected over time or in any other sequence, plot the residuals in that sequence.

In addition, a histogram and box plot are useful for assessing normality.

Checking Assumptions ٢١٢

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Standardized residuals

� The residuals can be standardized by dividing by their standard error.

� This will not change the pattern in a plot but will affect the vertical scale.

� Standardized residuals are always scaled so that most are between -2 and +2 as in a standard normal distribution.
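With statsmodels, standardized residuals can be obtained from a fitted model; a sketch assuming results is a fitted OLS results object (an assumed name):

import numpy as np

resid = results.resid
simple = resid / np.sqrt(results.mse_resid)      # residual divided by root MSE

# Leverage-adjusted (internally studentized) version, closer to what Minitab plots
std_resid = results.get_influence().resid_studentized_internal

print(np.sum(np.abs(std_resid) > 2), "residuals beyond +/- 2")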

Checking Assumptions ٢١٣

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

A plot meeting property 2

[Plot of residuals versus X showing: a. mean of 0, b. same scatter, d. no pattern with X]


Checking Assumptions ٢١٤

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

A plot showing a violation

[Plot of standardized residuals versus MONTHS (response is CALLS)]

Checking Assumptions ٢١٥

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.4 Checking Linearity

� Although sometimes we can see evidence of nonlinearity in an X-Y scatterplot, in other cases we can only see it in a plot of the residuals versus X.

� If the plot of the residuals versus an X shows any kind of pattern, it both shows a violation and a way to improve the model.

Checking Assumptions ٢١٦

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example 6.1: Telemarketing

n = 20 telemarketing employees

Y = average calls per day over 20 workdays

X = Months on the job

Data set TELEMARKET6


Checking Assumptions ٢١٧

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Plot of Calls versus Months

[Scatterplot of CALLS versus MONTHS]

There is some curvature, but it is masked by the more obvious linearity.

Checking Assumptions ٢١٨

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

If you are not sure, fit the linear model and save the residuals

The regression equation is

CALLS = 13.7 + 0.744 MONTHS

Predictor Coef SE Coef T P

Constant 13.671 1.427 9.58 0.000

MONTHS 0.74351 0.06666 11.15 0.000

S = 1.787 R-Sq = 87.4% R-Sq(adj) = 86.7%

Analysis of Variance

Source DF SS MS F P

Regression 1 397.45 397.45 124.41 0.000

Residual Error 18 57.50 3.19

Total 19 454.95

Checking Assumptions ٢١٩

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Residuals from model

With the linearity "taken out" the curvature is more obvious


Checking Assumptions ٢٢٠

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.4.2 Tests for lack of fit

� The residuals contain the variation in the sample of Y values that is not explained by the Yhat equation.

� This variation can be attributed to many things, including:

• natural variation (random error)

• omitted explanatory variables

• incorrect form of model

Checking Assumptions ٢٢١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Lack of fit

� If nonlinearity is suspected, there are tests available for lack of fit.

� Minitab has two versions of this test, one requiring there to be repeated observations at the same X values.

� These are on the Options submenu off the Regression menu

Checking Assumptions ٢٢٢

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

The pure error lack of fit test

� In the 20 observations for the telemarketing data, there are two at 10, 20 and 22 months, and four at 25 months.

� These replicates allow the SSE to be decomposed into two portions, "pure error" and "lack of fit".


Checking Assumptions ٢٢٣

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

The test

H0: The relationship is linear

Ha: The relationship is not linear

The test statistic follows an F distribution with c – k – 1 numerator df and n – c denominator df

c = number of distinct levels of X

n = 20 and there were 6 replicates so c = 14

Checking Assumptions ٢٢٤

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Minitab's output

The regression equation is

CALLS = 13.7 + 0.744 MONTHS

Predictor Coef SE Coef T P

Constant 13.671 1.427 9.58 0.000

MONTHS 0.74351 0.06666 11.15 0.000

S = 1.787 R-Sq = 87.4% R-Sq(adj) = 86.7%

Analysis of Variance

Source DF SS MS F P

Regression 1 397.45 397.45 124.41 0.000

Residual Error 18 57.50 3.19

Lack of Fit 12 52.50 4.38 5.25 0.026

Pure Error 6 5.00 0.83

Total 19 454.95

Checking Assumptions ٢٢٥

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Test results

At a 5% level of significance, the critical value (from F12, 6 distribution) is 4.00.

The computed F of 5.25 is significant (p-value of .026), so we conclude the relationship is not linear.
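The arithmetic behind this test is easy to verify from the decomposition above (scipy assumed):

from scipy import stats

SS_lack, df_lack = 52.50, 12      # lack-of-fit sum of squares and df
SS_pure, df_pure = 5.00, 6        # pure-error sum of squares and df

F = (SS_lack / df_lack) / (SS_pure / df_pure)    # about 5.25
print(stats.f.ppf(0.95, df_lack, df_pure))       # critical value, about 4.00
print(stats.f.sf(F, df_lack, df_pure))           # p-value, about .026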


Checking Assumptions ٢٢٦

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Tests without replication

� Minitab also has a series of lack of fit tests that can be applied when there is no replication.

� When they are applied here, these messages appear:

� The small p values suggest lack of fit.

Lack of fit test

Possible curvature in variable MONTHS (P-Value = 0.000)

Possible lack of fit at outer X-values (P-Value = 0.097)

Overall lack of fit test is significant at P = 0.000

Checking Assumptions ٢٢٧

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.4.3 Corrections for nonlinearity

� If the linearity assumption is violated, the appropriate correction is not always obvious.

� Several alternative models were presented in Chapter 5.

� In this case, it is not too hard to see that adding an X2 term works well.

Checking Assumptions ٢٢٨

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Quadratic model

The regression equation is

CALLS = - 0.14 + 2.31 MONTHS - 0.0401 MonthSQ

Predictor Coef SE Coef T P

Constant -0.140 2.323 -0.06 0.952

MONTHS 2.3102 0.2501 9.24 0.000

MonthSQ -0.040118 0.006333 -6.33 0.000

S = 1.003 R-Sq = 96.2% R-Sq(adj) = 95.8%

Analysis of Variance

Source DF SS MS F P

Regression 2 437.84 218.92 217.50 0.000

Residual Error 17 17.11 1.01

Total 19 454.95

No evidence of lack of fit (P > 0.1)


Checking Assumptions ٢٢٩

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Residuals from quadratic model

[Plot of residuals versus MONTHS for the quadratic model]

No violations evident

Checking Assumptions ٢٣٠

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.5 Check for constant variance

� Assumption b states that the errors ei should have the same variance everywhere.

� This implies that if residuals are plotted against an explanatory variable, the scatter should be the same at each value of the X variable.

� In economic data, however, it is fairly common for a variable that increases in value to also increase in scatter.

Checking Assumptions ٢٣١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example 6.3 FOC Sales

n = 265 months of sales data for a fibre-optic company

Y = Sales

X = Mon (1 through 265)

Data set FOCSALES6


Checking Assumptions ٢٣٢

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Data over time

[Time series plot of SALES versus Index]

Note: This uses Minitab’s Time Series Plot

Checking Assumptions ٢٣٣

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Residual plot

[Plot of residuals versus Mon (response is SALES)]

Checking Assumptions ٢٣٤

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Implications

� When the errors ei do not have a constant variance, the usual statistical properties of the least squares estimates may not hold.

� In particular, the hypothesis tests on the model may provide misleading results.


Checking Assumptions ٢٣٥

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.5.2 A Test for Nonconstant Variance

� Szroeter developed a test that can be applied if the observations appear to increase in variance according to some sequence (often, over time).

� To perform it, save the residuals, square them, then multiply by i (the observation number).

� Details are in the text.

Checking Assumptions ٢٣٦

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.5.3 Corrections for Nonconstant Variance

Several common approaches for correcting nonconstant variance are:

1. Use ln(y) instead of y

2. Use √y instead of y

3. Use some other power of y, yp, where the Box-Cox method is used to determine the value for p.

4. Regress (y/x) on (1/x)

Checking Assumptions ٢٣٧

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

LogSales over time

[Time series plot of LogSales versus Index]


Checking Assumptions ٢٣٨

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Residuals from Regression

[Plot of residuals versus Mon (response is LogSales)]

This looks real good after I put this text box on top of those six large outliers.

Checking Assumptions ٢٣٩

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.6 Assessing the Assumption That the Disturbances are Normally Distributed

� There are many tools available to check the assumption that the disturbances are normally distributed.

� If the assumption holds, the standardized residuals should behave like they came from a standard normal distribution.

– about 68% between -1 and +1

– about 95% between -2 and +2

– about 99% between -3 and +3

Checking Assumptions ٢٤٠

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.6.1 Using Plots to Assess Normality

� You can plot the standardized residuals versus fitted values and count how many are beyond -2 and +2; about 1 in 20 would be the usual case.

� Minitab will do this for you if you ask it to check for unusual observations (those flagged by an R have a standardized residual beyond ±2).


Checking Assumptions ٢٤١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Other tools

� Use a Normal Probability plot to test for normality.

� Use a histogram (perhaps with a superimposed normal curve) to look at shape.

� Use a Boxplot for outlier detection. It will show all outliers with an *.

Checking Assumptions ٢٤٢

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example 6.5 Communication Nodes

Data in COMNODE6

n = 14 communication networks

Y = Cost

X1 = Number of ports

X2 = Bandwidth

Checking Assumptions ٢٤٣

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Regression with unusuals flagged

The regression equation is

COST = 17086 + 469 NUMPORTS + 81.1 BANDWIDTH

Predictor Coef SE Coef T P

Constant 17086 1865 9.16 0.000

NUMPORTS 469.03 66.98 7.00 0.000

BANDWIDT 81.07 21.65 3.74 0.003

S = 2983 R-Sq = 95.0% R-Sq(adj) = 94.1%

Analysis of Variance

(deleted)

Unusual Observations

Obs NUMPORTS COST Fit SE Fit Residual St Resid

1 68.0 52388 53682 2532 -1294 -0.82 X

10 24.0 23444 29153 1273 -5709 -2.12R

R denotes an observation with a large standardized residual

X denotes an observation whose X value gives it large influence.


Checking Assumptions ٢٤٤

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Residuals versus fits (from regression graphs)
[Plot of standardized residuals versus fitted values (response is COST)]

Checking Assumptions ٢٤٥

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.6.2 Tests for normality

� There are several formal tests for the hypothesis that the disturbances ei are normal versus nonnormal.

� These are often accompanied by graphs* which are scaled so that data which are normally-distributed appear in a straight line.

* Your Minitab output may appear a little different depending on whether you have the student or professional version, and which release you have.

Checking Assumptions ٢٤٦

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Normal plot (from regression graphs)
[Normal probability plot of the standardized residuals (response is COST)]
If normal, the points should follow a straight line.


Checking Assumptions ٢٤٧

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Normal probability plot (graph menu)

[Normal probability plot for SRES1, ML estimates with 95% CI: Mean = -0.0547797, StDev = 1.02044, AD* = 1.187]

Checking Assumptions ٢٤٨

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Test for Normality (Basic Statistics Menu)

[Anderson-Darling normality test on SRES1: N = 14, Average = -0.0547797, StDev = 1.05896, A-Squared = 0.463, P-Value = 0.216]

Accept H0: Normality

Checking Assumptions ٢٤٩

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example 6.7 S&L Rate of Return

Data set SL6

n = 35 Savings and Loan stocks
Y = rate of return for 5 years ending 1982
X1 = the "Beta" of the stock
X2 = the "Sigma" of the stock

Beta is a measure of nondiversifiable risk and Sigma a measure of total risk


Checking Assumptions ٢٥٠

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Basic exploration

[Scatterplots of RETURN versus BETA and RETURN versus SIGMA]

Correlations: RETURN, BETA, SIGMA

RETURN BETA

BETA 0.180

SIGMA 0.351 0.406

Checking Assumptions ٢٥١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Not much explanatory power

The regression equation is

RETURN = - 1.33 + 0.30 BETA + 0.231 SIGMA

Predictor Coef SE Coef T P

Constant -1.330 2.012 -0.66 0.513

BETA 0.300 1.198 0.25 0.804

SIGMA 0.2307 0.1255 1.84 0.075

S = 2.377 R-Sq = 12.5% R-Sq(adj) = 7.0%

Analysis of Variance

(deleted)

Unusual Observations

Obs BETA RETURN Fit SE Fit Residual St Resid

19 2.22 0.300 -0.231 2.078 0.531 0.46 X

29 1.30 13.050 2.130 0.474 10.920 4.69R

R denotes an observation with a large standardized residual

X denotes an observation whose X value gives it large influence.

Checking Assumptions ٢٥٢

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

One in every crowd?

[Plot of standardized residuals versus fitted values (response is RETURN)]


Checking Assumptions ٢٥٣

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Normality Test

[Anderson-Darling normality test on RESI1: N = 35, Average = 0.0000000, StDev = 2.30610, A-Squared = 2.235, P-Value = 0.000]

Reject H0: Normality

Checking Assumptions ٢٥٤

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.6.3 Corrections for Nonnormality

� Normality is not necessary for making inference with large samples.

� It is required for inference with small samples.

� The remedies are similar to those used to correct for nonconstant variance.

Checking Assumptions ٢٥٥

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.7 Influential Observations

� In minimizing SSE, the least squares procedure tries to avoid large residuals.

� It thus "pays a lot of attention" to y values that don't fit the usual pattern in the data. Refer to the example in Figures 6.42(a) and 6.42(b).

� That probably also happened in the S&L data where the one very high return masked the relationship between rate of return, beta and sigma for the other 34 stocks.


Checking Assumptions ٢٥٦

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.7.2 Identifying outliers

� Minitab flags any residual bigger than 2 in absolute value as a potential outlier.

� A boxplot of the residuals uses a slightly different rule, but should give similar results.

� There is also a third type of residual that is often used for this purpose.

Checking Assumptions ٢٥٧

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Deleted residuals

� If you (temporarily) eliminate the ith observation from the data set, it cannot influence the estimation process.

� You can then compute a "deleted" residual to see if this point fits the pattern in the other observations.

Checking Assumptions ٢٥٨

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Deleted Residual Illustration

The regression equation is

ReturnWO29 = - 2.51 + 0.846 BETA + 0.232 SIGMA

34 cases used 1 cases contain missing values

Predictor Coef SE Coef T P

Constant -2.510 1.153 -2.18 0.037

BETA 0.8463 0.6843 1.24 0.225

SIGMA 0.23220 0.07135 3.25 0.003

S = 1.352 R-Sq = 37.2% R-Sq(adj) = 33.1%

Without observation 29, we get a much better fit.

Predicted Y29 = -2.51 + .846(1.2973) + .232(13.3110) = 1.678

Prediction SE is 1.379

Deleted residual29 = (13.05 – 1.678)/1.379 = 8.24
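statsmodels provides a closely related diagnostic, the externally studentized residual, in which each observation is left out of the fit used to scale its own residual; a sketch assuming results is the fitted OLS results for RETURN on BETA and SIGMA (an assumed name, and not the exact hand computation shown above):

influence = results.get_influence()
deleted = influence.resid_studentized_external   # "leave-one-out" style residuals

worst = deleted.argmax()             # 0-based index of the largest value
print(worst, deleted[worst])         # in this example observation 29 (index 28) stands out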


Checking Assumptions ٢٥٩

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

The influence of observation 29

� When it was temporarily removed, the R2 went from 12.5% to 37.2% and we got a very different equation

� The deleted residual for this observation was a whopping 8.24, which shows it had a lot of weight in determining the original equation.

Checking Assumptions ٢٦٠

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.7.3 Identifying Leverage Points

� Outliers have unusual y values; data points with unusual X values are said to have leverage. Minitab flags these with an X.

� These points can have a lot of influence in determining the Yhatequation, particularly if they don't fit well. Minitab would flag these with both an R and an X.

Checking Assumptions ٢٦١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Leverage

� The leverage of the ith observation is hi (it is hard to show where this comes from without matrix algebra).

� If h > 2(K+1)/n it has high leverage.

� For the S&L returns, K = 2 and n = 35, so the benchmark is 2(3)/35 = .171

� Observation 19 has a very small value for Sigma, which is why it has h19 = .764
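Leverages come straight out of the fitted model; a sketch assuming results is the fitted OLS results object for this example (K = 2, n = 35):

influence = results.get_influence()
h = influence.hat_matrix_diag            # leverage h_i for each observation

K, n = 2, len(h)
cutoff = 2 * (K + 1) / n                 # 2(K+1)/n = .171 here
flagged = (h > cutoff).nonzero()[0]      # 0-based indices of high-leverage points
print(flagged, h[flagged])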


Checking Assumptions ٢٦٢

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.7.4 Combined Measures

� The effect of an observation on the regression line is a function of both the y and X values.

� Several statistics have been developed that attempt to measure combined influence.

� The DFIT statistic and Cook's D are two more-popular measures.

Checking Assumptions ٢٦٣

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

The DFIT statistic

� The DFIT statistic is a function of both the residual and the leverage.

� Minitab can compute and save these under "Storage".

� Sometimes a cutoff is used, but it is perhaps best just to look for values that are high.

Checking Assumptions ٢٦٤

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

DFIT Graphed

[Plot of DFIT values by observation number; observations 19 and 29 stand out]


Checking Assumptions ٢٦٥

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Cook's D

� Often called Cook's Distance

� Minitab also will compute these and store them.

� Again, it might be best just to look for high values rather than use a cutoff.
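Both measures are available from the same influence object in statsmodels; a sketch assuming results is a fitted OLS results object (an assumed name):

influence = results.get_influence()

dffits, _  = influence.dffits           # DFFITS values (second item is a suggested cutoff)
cooks_d, _ = influence.cooks_distance   # Cook's D values (second item is p-values)

# Rather than a hard cutoff, list the few observations that stand out most
for i in cooks_d.argsort()[::-1][:3]:
    print(i, dffits[i], cooks_d[i])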

Checking Assumptions ٢٦٦

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Cook's D Graphed

[Plot of Cook's D values by observation number; observations 19 and 29 stand out]

Checking Assumptions ٢٦٧

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.7.5 What to do with Unusual Observations

� Observation 19 (First Lincoln Financial Bank) has high influence because of its very low Sigma.

� Observation 29 (Mercury Saving) had a very high return of 13.05 but its Beta and Sigma were not unusual.

� Since both values are out of line with the other S&L banks, they may represent data recording errors.


Checking Assumptions ٢٦٨

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Eliminate? Adjust?

� If you can do further research you might find out the true story.

� You should eliminate an outlier data point only when you are convinced it does not belong with the others (for example, if Mercury was speculating wildly).

� An alternative is to keep the data point but add an indicator variable to the model that signals there is something unusual about this observation.

Checking Assumptions ٢٦٩

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.8 Assessing the Assumption That the Disturbances are Independent

� If the disturbances are independent, the residuals should not display any patterns.

� One such pattern was the curvature in the residuals from the linear model in the telemarketing example.

� Another pattern occurs frequently in data collected over time.

Checking Assumptions ٢٧٠

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.8.1 Autocorrelation

� In time series data we often find that the disturbances tend to stay at the same level over consecutive observations.

� If this feature, called autocorrelation, is present, all our model inferences may be misleading.


Checking Assumptions ٢٧١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

First-order autocorrelation

If the disturbances have first-order autocorrelation, they behave as:

ei = ρ ei-1 + μi

where μi is a disturbance with expected value 0 that is independent over time.

Checking Assumptions ٢٧٢

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

The effect of autocorrelation

If you knew that e56 was 10 and ρ was .7, you would expect e57 to be 7 instead of zero.

This dependence can lead to high standard errors for the bj coefficients and wider confidence intervals.

Checking Assumptions ٢٧٣

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.8.2 A Test for First-Order Autocorrelation

Durbin and Watson developed a test for positive autocorrelation of the form:

H0: ρ = 0
Ha: ρ > 0

Their test statistic d is scaled so that it is near 2 when no autocorrelation is present and near 0 when positive autocorrelation is very strong.


Checking Assumptions ٢٧٤

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

A Three-Part Decision Rule

The Durbin-Watson test distribution depends on n and K. The tables (Table B.7) list two decision points dL and dU.

If d < dL reject H0 and conclude there is positive autocorrelation.

If d > dU accept H0 and conclude there is no autocorrelation.

If dL ≤ d ≤ dU the test is inconclusive.

Checking Assumptions ٢٧٥

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example 6.10 Sales and Advertising

n = 36 years of annual data

Y = Sales (in million $)

X = Advertising expenditures ($1000s)

Data in Table 6.6

Checking Assumptions ٢٧٦

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

The Test

n = 36 and K = 1 X-variable

At a 5% level of significance, Table B.7 gives dL = 1.41 and dU = 1.52

Decision Rule:
Reject H0 if d < 1.41
Accept H0 if d > 1.52
Inconclusive if 1.41 ≤ d ≤ 1.52
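As a check on the Minitab value reported on the next slide, the statistic and the three-part rule are easy to compute by hand. A sketch, assuming `resid` holds the 36 residuals from the fitted sales regression:

# Sketch: Durbin-Watson d and the three-part decision rule for this example.
import numpy as np

e = np.asarray(resid)                        # residuals from the fitted sales regression
d = np.sum(np.diff(e) ** 2) / np.sum(e ** 2) # d = sum (e_t - e_{t-1})^2 / sum e_t^2

dL, dU = 1.41, 1.52                          # Table B.7 values for n = 36, K = 1
if d < dL:
    print(f"d = {d:.2f} < {dL}: reject H0 -- positive autocorrelation")
elif d > dU:
    print(f"d = {d:.2f} > {dU}: accept H0 -- no autocorrelation")
else:
    print(f"d = {d:.2f}: test is inconclusive")

(statsmodels also provides durbin_watson in statsmodels.stats.stattools, which computes the same d.)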


Checking Assumptions ٢٧٧

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Regression With DW Statistic

The regression equation is

Sales = - 633 + 0.177 Adv

Predictor Coef SE Coef T P

Constant -632.69 47.28 -13.38 0.000

Adv 0.177233 0.007045 25.16 0.000

S = 36.49 R-Sq = 94.9% R-Sq(adj) = 94.8%

Analysis of Variance

Source DF SS MS F P

Regression 1 842685 842685 632.81 0.000

Residual Error 34 45277 1332

Total 35 887961

Unusual Observations

Obs Adv Sales Fit SE Fit Residual St Resid

1 5317 381.00 309.62 11.22 71.38 2.06R

15 6272 376.10 478.86 6.65 -102.76 -2.86R

R denotes an observation with a large standardized residual

Durbin-Watson statistic = 0.47 Significant autocorrelation

Checking Assumptions ٢٧٨

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Plot of Residuals over Time

[Time plot of the standardized residuals (SRES1) against observation index.]

Shows first-order autocorrelation with r = .71

Checking Assumptions ٢٧٩

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

6.8.3 Correction for First-Order Autocorrelation

One popular approach creates a new y and x variable.

First, obtain an estimate of ρ. Here we use r = .71 from Minitab's Autocorrelation analysis.

Then compute yi* = yi – r yi-1

and xi* = xi – r xi-1


Checking Assumptions ٢٨٠

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

First Observation Missing

Because the transformation depends on lagged y and x values, the first observation requires special handling.

The text suggests y1* = √(1 – r²) y1

and a similar computation for x1*
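The transformation described on these two slides can be carried out with a few lines of numpy. A sketch, assuming y and x are arrays holding the original series and r = .71 is the estimated autocorrelation:

# Sketch: the autocorrelation correction described above.
import numpy as np

r = 0.71
y_star = y[1:] - r * y[:-1]                  # y_i* = y_i - r*y_{i-1}
x_star = x[1:] - r * x[:-1]                  # x_i* = x_i - r*x_{i-1}

# Special handling for the first observation, as the text suggests:
y1_star = np.sqrt(1 - r ** 2) * y[0]
x1_star = np.sqrt(1 - r ** 2) * x[0]

y_star = np.concatenate(([y1_star], y_star))
x_star = np.concatenate(([x1_star], x_star))
# Regress y_star on x_star to estimate the transformed model.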

Checking Assumptions ٢٨١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Other Approaches

� An alternative is to use an estimation technique (such as SAS's Autoreg procedure) that automatically adjusts for autocorrelation.

� A third option is to include a lagged value of y as an explanatory variable. In this model, the DW test is no longer appropriate.
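The lagged-dependent-variable option is also straightforward to set up. A sketch, assuming a DataFrame `df` with Sales and Adv columns ordered by year (column names taken from the example, DataFrame name hypothetical):

# Sketch: include last period's sales as an additional predictor.
import pandas as pd
import statsmodels.formula.api as smf

df["LagSales"] = df["Sales"].shift(1)        # the first year has no lag and becomes missing
lag_fit = smf.ols("Sales ~ Adv + LagSales", data=df.dropna()).fit()
print(lag_fit.summary())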

Checking Assumptions ٢٨٢

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Regression With Lagged Sales as a Predictor

The regression equation is

Sales = - 234 + 0.0631 Adv + 0.675 LagSales

35 cases used 1 cases contain missing values

Predictor Coef SE Coef T P

Constant -234.48 78.07 -3.00 0.005

Adv 0.06307 0.02023 3.12 0.004

LagSales 0.6751 0.1123 6.01 0.000

S = 24.12 R-Sq = 97.8% R-Sq(adj) = 97.7%

Analysis of Variance

(deleted)

Unusual Observations

Obs Adv Sales Fit SE Fit Residual St Resid

15 6272 376.10 456.24 5.54 -80.14 -3.41R

16 6383 454.60 422.02 12.95 32.58 1.60 X

21 6794 512.00 559.41 4.46 -47.41 -2.00R

R denotes an observation with a large standardized residual

X denotes an observation whose X value gives it large influence.


Checking Assumptions ٢٨٣

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Residuals From Model With Lagged Sales

[Time plot of the standardized residuals (SRES2) from the lagged-sales model against observation index.]

Now r = -.23 is not significant

Indicator Variables ٢٨٤

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Chapter 7Using Indicator and Interaction Variables

Terry DielmanApplied Regression Analysis:

A Second Course in Business and Economic Statistics, fourth edition

Indicator Variables ٢٨٥

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

7.1 Using and Interpreting Indicator Variables

�Suppose some observations have a particular characteristic or attribute, while others do not.

�We can include this information in the regression model by using dummy or indicator variables.


Indicator Variables ٢٨٦

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Add the info thru a coding scheme

Use a binary (dummy) variable to “indicate” when the characteristic is present

Di = 1 if observation i has the attribute

Di = 0 if observation i does not have it

Indicator Variables ٢٨٧

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

An Example

Di = 1 if individual i is employed

Di = 0 if individual i is not employed

We could do it the other way and use the "1" to indicate an unemployed individual.

Indicator Variables ٢٨٨

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Multiple Categories

� For multiple categories, use multiple indicators.

� For example, to indicate where a firm's stock is listed, we could define 3 indicator variables; one each for the NYSE, AMEX and NASDAQ.

� For computational reasons, we would include only two of these in the regression.
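If the data are in a DataFrame, the m − 1 coding can be generated automatically. A sketch, where "Exchange" is a hypothetical column holding the NYSE / AMEX / NASDAQ labels:

# Sketch: build indicator variables for a categorical column, keeping only m - 1 of them.
import pandas as pd

dummies = pd.get_dummies(df["Exchange"], drop_first=True)  # the dropped level is the base
df = pd.concat([df, dummies], axis=1)                      # adds two 0/1 columns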


Indicator Variables ٢٨٩

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example 7.1 Employment Discrimination

If two groups have apparently different salary structures, you first need to account for differences in education, training and experience before any claim of discrimination can be made.

Regression analysis with an indicator variable for the group is a way to investigate this.

Indicator Variables ٢٩٠

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Treasury Versus Harris

The data set HARRIS7 contains information on the salaries of 93 employees of the Harris Trust and Savings Bank. The bank was sued by the U.S. Department of the Treasury in 1981.

Here we examine how salary depends on education, also accounting for gender.

Indicator Variables ٢٩١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Salary Versus Years of Education

[Scatter plot of SALARY (about 4000–8000) against EDUCAT (8–16), with points labeled 0 for females and 1 for males.]

At all levels of education, the male salaries appear higher.


Indicator Variables ٢٩٢

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Regression Analysis

The regression equation is

SALARY = 4173 + 80.7 EDUCAT + 692 MALES

Predictor Coef SE Coef T P

Constant 4173.1 339.2 12.30 0.000

EDUCAT 80.70 27.67 2.92 0.004

MALES 691.8 132.2 5.23 0.000

S = 572.4 R-Sq = 36.3% R-Sq(adj) = 34.9%

How do we interpret this equation?

Indicator Variables ٢٩٣

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

An Intercept Adjuster

For an indicator variable, the coefficient bj is not really a slope. To see this, evaluate the equation for the two groups.

FEMALES (MALES = 0):
SALARY = 4173 + 80.7 EDUCAT + 692 (0)
       = 4173 + 80.7 EDUCAT

MALES (MALES = 1):
SALARY = 4173 + 80.7 EDUCAT + 692 (1)
       = 4865 + 80.7 EDUCAT

Indicator Variables ٢٩٤

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Parallel Salary Equations

[Scatter plot of SALARY against EDUCAT showing the two parallel fitted lines; the male line lies 692 above the female line at every education level.]


Indicator Variables ٢٩٥

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Is The Difference Significant?

H0: βMALES = 0 (after accounting for years of education, there is no salary difference)

Ha: βMALES ≠ 0 (after accounting for education, there IS a salary difference)

Use t = b/SEb as usual.

t = 5.23 is significant.

Indicator Variables ٢٩٦

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

What if the Coding Was Different?

� If we had an indicator for females and used it, the equation would be:

SALARY = 4865 + 80.7 EDUCAT - 692 FEMALES

� The difference between the groups is the same. For females, the intercept in the equation is 4865 – 692 = 4173

Indicator Variables ٢٩٧

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Multiple Categories

� Pick one category as the "base category".

� Create one indicator variable for each other category.

� In general, if there are m categories, use m – 1 indicator variables.


Indicator Variables ٢٩٨

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example 7.3 Meddicorp Sales

Y = Sales in one of 25 territories

X1 = advertising in territory

X2 = bonuses paid in territory

Also Region: 1 = South

2 = West

3 = Midwest

Indicator Variables ٢٩٩

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

How do you use region?

What happens if you just put it in the model?

Sales = -84 + 1.55 ADV + 1.11 BONUS + 119 Region

R2 = 92.0% and Se = 68.89

SE(Region) = 28.69, so the t-stat = 4.14 is significant

Indicator Variables ٣٠٠

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Region as an X

This implies the difference between Region 3 (MW) and Region 2 (W) = b3 = 119

And the difference between Region 2 (W) and Region 1 (S) is also 119

The sales differences may not be equal but this forces them to be estimated that way


Indicator Variables ٣٠١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

A more flexible approach

� Use two indicator variables to tell the three regions apart

� Can use any one of the three as the “base” category.

� Here is what it looks like if Midwest is selected as the base.

Indicator Variables ٣٠٢

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Coding scheme

Region      D1 (South)   D2 (West)
SOUTH            1            0
WEST             0            1
MIDWEST          0            0

Indicator Variables ٣٠٣

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Results

SALES = 435 + 1.37 ADV + .975 BONUS − 258 SOUTH − 210 WEST

R2 = 94.7% and Se = 57.63

Both indicators are significant


Indicator Variables ٣٠٤

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

This Defines Three Equations

SALES = 435 + 1.37 ADV + .975 BONUS − 258 SOUTH − 210 WEST

S:  SALES = 177 + 1.37 ADV + .975 BONUS

W:  SALES = 225 + 1.37 ADV + .975 BONUS

MW: SALES = 435 + 1.37 ADV + .975 BONUS

Indicator Variables ٣٠٥

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Is Location Significant?

� Because location is measured by two variables in a group, we need to do a partial F test.

� The full Model has ADV, BONUS, SOUTH and WEST and has R2 = 94.7

� The reduced model has only ADV and BONUS, with R2 = 85.5

Indicator Variables ٣٠٦

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Output For F-Test

FULL MODEL

S = 57.63 R-Sq = 94.7% R-Sq(adj) = 93.6%

Analysis of Variance

Source DF SS MS F P

Regression 4 1182560 295640 89.03 0.000

Residual Error 20 66414 3321

Total 24 1248974

REDUCED MODEL

S = 90.75 R-Sq = 85.5% R-Sq(adj) = 84.2%

Analysis of Variance

Source DF SS MS F P

Regression 2 1067797 533899 64.83 0.000

Residual Error 22 181176 8235

Total 24 1248974


Indicator Variables ٣٠٧

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Partial F Computations

F = [(SSER – SSEF) / (K – L)] / MSEF

  = [(181176 – 66414) / (4 – 2)] / 3321 = 17.3
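The same computation, with a p-value attached, takes a few lines of Python; the sums of squares come from the two ANOVA tables above (scipy is used only for the F distribution):

# Sketch: partial F test for the two location indicators.
from scipy.stats import f as f_dist

sse_reduced, sse_full = 181176, 66414
k_full, k_reduced = 4, 2                     # predictors in the full and reduced models
n = 25
mse_full = sse_full / (n - k_full - 1)       # 66414 / 20 = 3321

F = ((sse_reduced - sse_full) / (k_full - k_reduced)) / mse_full   # about 17.3
p_value = f_dist.sf(F, k_full - k_reduced, n - k_full - 1)
print(f"F = {F:.1f}, p = {p_value:.4f}")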

Indicator Variables ٣٠٨

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

7.2 Interaction Variables

� Another type of variable used in regression models is an interaction variable.

� This is usually formulated as the product of two variables; for example, x3 = x1·x2

� With this variable in the model, it means the level of x2 changes how x1 affects y

Indicator Variables ٣٠٩

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Interaction Model

With two x variables the model is:

y = β0 + β1 x1 + β2 x2 + β3 x1x2 + e

If we factor out x1 we get:

y = β0 + (β1 + β3 x2) x1 + β2 x2 + e

so each value of x2 yields a different slope in the relationship between y and x1


Indicator Variables ٣١٠

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Interaction Involving an Indicator

If one of the two variables is binary, the interaction produces a model with two different slopes.

When x2 = 0:   y = β0 + β1 x1 + e

When x2 = 1:   y = (β0 + β2) + (β1 + β3) x1 + e

Indicator Variables ٣١١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example 7.4 Discrimination (again)

� In the Harris Bank case, suppose we suspected that the salary difference by gender changed with different levels of education.

� To investigate this, we created a new variable MSLOPE = EDUCAT*MALES and added it to the model.
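Creating the product term takes one line. A sketch, where `harris` is a hypothetical DataFrame with the SALARY, EDUCAT and MALES columns:

# Sketch: the interaction (slope-adjuster) model for the Harris data.
import statsmodels.formula.api as smf

harris["MSLOPE"] = harris["EDUCAT"] * harris["MALES"]      # interaction = product of the two
interaction_fit = smf.ols("SALARY ~ EDUCAT + MALES + MSLOPE", data=harris).fit()
print(interaction_fit.params)                              # MSLOPE coefficient adjusts the slope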

Indicator Variables ٣١٢

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Regression Output

The regression equation is

SALARY = 4395 + 62.1 EDUCAT - 275 MALES + 73.6 MSLOPE

Predictor Coef SE Coef T P

Constant 4395.3 389.2 11.29 0.000

EDUCAT 62.13 31.94 1.95 0.055

MALES -274.9 845.7 -0.32 0.746

MSLOPE 73.59 63.59 1.16 0.250

S = 571.4 R-Sq = 37.3% R-Sq(adj) = 35.2%

How do we interpret the equation this time?


Indicator Variables ٣١٣

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

A Slope Adjuster

To see the interaction effect, once again evaluate the equation for the two groups.

FEMALES (MALES = 0):
SALARY = 4395 + 62.1 EDUCAT − 275 (0) + 73.6 (EDUCAT × 0)
       = 4395 + 62.1 EDUCAT

MALES (MALES = 1):
SALARY = 4395 + 62.1 EDUCAT − 275 (1) + 73.6 (EDUCAT × 1)
       = 4120 + 135.7 EDUCAT

Indicator Variables ٣١٤

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Lines With Two Different Slopes

[Scatter plot of SALARY against EDUCAT with separate fitted lines for males (1) and females (0); the male line has the steeper slope.]

A bigger gap occurs at higher education levels

Indicator Variables ٣١٥

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Tests in This Model

� Although the slope adjuster implies the salary gap increases with education, this effect is not really significant (t for MSLOPE = 1.16).

� The overall effect of gender is now contained in two variables, so a partial F test would be needed to test for differences between male and female salaries.


Indicator Variables ٣١٦

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

7.3 Seasonal Effects in Time Series Regression

� Data collected over time (say quarterly)

� If we think the Y variable depends on the calendar, we can do a kind of “seasonal adjustment” by adding quarter dummies

� Q1 = 1 if this was first quarter, Q2 = 1 if a second quarter, Q3 = 1 if third

� Don’t use Q4 since that is the “base”
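One way to build the quarter dummies is sketched below for a hypothetical DataFrame `abx` with SALES and a TIME index 1, 2, 3, ..., assuming the series starts in a first quarter:

# Sketch: quarterly indicators with Q4 as the base category.
import statsmodels.formula.api as smf

abx["QTR"] = ((abx["TIME"] - 1) % 4) + 1                   # 1, 2, 3, 4, 1, 2, ...
for q in (1, 2, 3):                                        # no Q4 dummy -- it is the base
    abx[f"Q{q}"] = (abx["QTR"] == q).astype(int)

seasonal_fit = smf.ols("SALES ~ TIME + Q1 + Q2 + Q3", data=abx).fit()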

Indicator Variables ٣١٧

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Example 7.5 ABX Company Sales

� We fit a trend to these sales in Example 3.11 by regressing sales on a time index variable.

� Because this company sells winter sports merchandise, including seasonal effects should markedly improve the fit.

Indicator Variables ٣١٨

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

ABX Company Sales

[Time plot of SALES (about 200–300) against TIME (0–40); the fourth-quarter observations are highlighted.]


Indicator Variables ٣١٩

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Two Regressions

The regression equation is

SALES = 199 + 2.56 TIME

Predictor Coef SE Coef T P

Constant 199.017 5.128 38.81 0.000

TIME 2.5559 0.2180 11.73 0.000

S = 15.91 R-Sq = 78.3% R-Sq(adj) = 77.8%

The regression equation is

SALES = 211 + 2.57 TIME + 3.75 Q1 - 26.1 Q2 - 25.8 Q3

Predictor Coef SE Coef T P

Constant 210.846 3.148 66.98 0.000

TIME 2.56610 0.09895 25.93 0.000

Q1 3.748 3.229 1.16 0.254

Q2 -26.118 3.222 -8.11 0.000

Q3 -25.784 3.217 -8.01 0.000

S = 7.190 R-Sq = 95.9% R-Sq(adj) = 95.5%

Indicator Variables ٣٢٠

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Are the Seasonal Effects Significant?

� The strong t-ratios for Q2 and Q3 say "yes", and the model R2 increased by 17.6 percentage points when we added the seasonal indicators.

� With evidence this strong we probably don't need to test further.

� In general, however, we would need another partial F test to see if the overall seasonal effect is significant.

Indicator Variables ٣٢١

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Partial F Computations

F = [(SSER – SSEF) / (K – L)] / MSEF

  = [(9622 – 1810) / (4 – 1)] / (1810/35 ≈ 52) ≈ 50

F(0.05, 3, 35) = 2.92, so the seasonal effect is clearly significant.


Variable Selection ٣٢٢

Chapter 8Variable Selection

Terry DielmanApplied Regression Analysis:

A Second Course in Business and Economic Statistics, fourth edition

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٢٣

8.1 Introduction

� Previously we discussed some tests (t-test and partial F) that helped us determine whether certain variables should be in the regression.

� Here we will look at several variable selection strategies that expand on this idea.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٢٤

Why is This Important?

� If an important variable is omitted, the estimated regression coefficients can become biased (systematically too high or low).

� Their standard errors can become inflated, leading to imprecise intervals and poor power in hypothesis tests.


Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٢٥

Strategies

� All possible regressions: computer procedures that briefly examine every possible combination of Xs and report summaries of fit ability.

� Selection algorithms: rules for deciding when to drop or add variables

1. Backwards Elimination

2. Forward Selection

3. Stepwise Regression

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٢٦

Words of Caution

� None guarantee you get the right model because they do not check assumptions or search for omitted factors like curvature.

� None have the ability to use a researcher's knowledge about the business or economic situation being analyzed.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٢٧

8.2 All Possible Regressions

� If there are k x variables to consider using, there are 2^k possible subsets. For example, with only k = 5, there are 32 regression equations.

� Obtaining these sounds like a ton of work but programs like SAS or Minitab have algorithms that can measure fit ability without really producing the equation.


Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٢٨

Typical Output

� The program will usually give you a summary table.

� Each line on the table will tell you which variables were in the model, plus measures of fit ability.

� These measures include R2, adjusted R2, Se and a new one, Cp

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٢٩

The Cp Statistic

p = k + 1 is the number of terms in the model, including the intercept.

SSEp is the SSE of this model

MSEF is the MSE in the "full model" (with all the variables)

Cp = SSEp / MSEF − (n – 2p)

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٣٠

Using The Cp Statistic

Theory says that in a model with bias, Cp will be large.

It also says that in a model with no bias, Cp should be equal to p.

It is thus recommended that we consider models with a small Cp and those with Cp near p = k + 1.
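The statistic itself is simple to compute once the subset SSE and the full-model MSE are in hand; a small helper:

# Sketch: Mallows' Cp as defined on the previous slide.
def mallows_cp(sse_p, mse_full, n, p):
    """p = k + 1 terms in the subset model, including the intercept."""
    return sse_p / mse_full - (n - 2 * p)

# Using values from the Meddicorp output shown elsewhere in these slides
# (ADV + BONUS subset: SSE = 181176, full-model S = 93.77, n = 25, p = 3):
# mallows_cp(181176, 93.77 ** 2, 25, 3) is about 1.6, matching the Cp column.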


Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٣١

Example 8.1 Meddicorp Revisited

n = 25 sales territories
y = Sales (in $1000s) in each territory
x1 = Advertising ($100s) in the territory
x2 = Bonuses paid (in $100s) in the territory
x3 = Market share in the territory
x4 = Largest competitor's sales ($1000s)
x5 = Region code (1 = S, 2 = W, 3 = MW)

We are not using region here because it should be converted to indicator variables which should be examined as a group.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٣٢

Summary Results For All Possible Regressions

Variables in the Regression          R2    R2-adj    Cp      Se

ADV 81.1 80.2 5.90 101.42

BONUS 32.3 29.3 75.19 191.76

COMPET 14.2 10.5 100.85 215.83

MKTSHR 0.1 0.0 120.97 232.97

ADV, BONUS 85.5 84.2 1.61 90.75

ADV, MKTSHR 81.2 79.5 7.66 103.23

ADV, COMPET 81.2 79.5 7.74 103.38

BONUS, COMPET 38.7 33.2 68.03 186.51

BONUS, MKTSHR 32.8 26.7 76.46 195.33

COMPET, MKTSHR 16.1 8.5 100.18 218.20

ADV, BONUS, MKTSHR 85.8 83.8 3.10 91.75

ADV, BONUS, COMPET 85.7 83.6 3.30 92.26

ADV, MKTSHR, COMPET 81.3 78.6 9.60 105.52

BONUS, MKTSHR, COMPET 40.9 32.5 66.90 187.48

ADV, BONUS, MKTSHR, COMPET 85.9 83.1 5.00 93.77

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٣٣

The Best Model?

� The two variable model with ADV and BONUS has the smallest Cp and highest adjusted R2.

� The three variable models adding either MKTSHR or COMPET also have small Cp values but only modest increases in R2.

� The two-variable model is probably the best.


Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٣٤

Minitab Results

Vars  R-Sq  R-Sq(adj)   C-p      S      ADV  BONUS  MKTSHR  COMPET
  1   81.1    80.2      5.9   101.42     X
  1   32.3    29.3     75.2   191.76           X
  2   85.5    84.2      1.6    90.749    X     X
  2   81.2    79.5      7.7   103.23     X            X
  3   85.8    83.8      3.1    91.751    X     X      X
  3   85.7    83.6      3.3    92.255    X     X              X
  4   85.9    83.1      5.0    93.770    X     X      X       X

By default, the Best Subsets procedure prints two models for each number of X variables. This can be increased up to 5.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٣٥

Limitations

� With a large number of potential x variables, the all possible approach becomes unwieldy.

� Minitab can use up to 31 predictors, but warns that computational time can be long when as few as 15 are used.

� "Obviously good" predictors can be forced into the model, thus reducing search time, but this is not always what you want.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٣٦

8.3 Other Variable Selection Techniques

� With a large number of potential x variables, it may be best to use one of the iterative selection methods.

� These will look at only the set of models that their rules lead them to, so they may not yield a model as good as that returned by the all possible regressions approach.


Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٣٧

8.3.1 Backwards Elimination

1. Start with all variables in the equation.

2. Examine the variables in the model for significance and identify the least significant one.

3. Remove this variable if it does not meet some minimum significance level.

4. Run a new regression and repeat until all remaining variables are significant.
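These four steps translate directly into a short loop; a sketch using statsmodels, where X is a DataFrame of candidate predictors and y the response (both hypothetical):

# Sketch: backwards elimination driven by p-values.
import statsmodels.api as sm

def backwards_eliminate(y, X, alpha_remove=0.15):
    predictors = list(X.columns)
    while predictors:
        fit = sm.OLS(y, sm.add_constant(X[predictors])).fit()
        pvals = fit.pvalues.drop("const")        # ignore the intercept
        worst = pvals.idxmax()                   # least significant variable
        if pvals[worst] <= alpha_remove:         # everything left is significant
            return fit
        predictors.remove(worst)                 # drop it and refit
    return None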

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٣٨

No Search Routine Needed?

� Although most software packages have automatic procedures for backwards elimination, it is fairly easy to do interactively.

� Run a model, check its t-tests for significance, and identify the variable to drop.

� Run again with one less variable and repeat the steps.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٣٩

Step 1 – All Variables

Regression Analysis: SALES versus ADV, BONUS, MKTSHR, COMPET

The regression equation is

SALES = - 594 + 2.51 ADV + 1.91 BONUS + 2.65 MKTSHR - 0.121 COMPET

Predictor Coef SE Coef T P

Constant -593.5 259.2 -2.29 0.033

ADV 2.5131 0.3143 8.00 0.000

BONUS 1.9059 0.7424 2.57 0.018

MKTSHR 2.651 4.636 0.57 0.574

COMPET -0.1207 0.3718 -0.32 0.749

S = 93.77 R-Sq = 85.9% R-Sq(adj) = 83.1%

Least significant: COMPET (P = 0.749)


Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٤٠

Step 2 – COMPET Eliminated

Regression Analysis: SALES versus ADV, BONUS, MKTSHR

The regression equation is

SALES = - 621 + 2.47 ADV + 1.90 BONUS + 3.12 MKTSHR

Predictor Coef SE Coef T P

Constant -620.6 240.1 -2.58 0.017

ADV 2.4698 0.2784 8.87 0.000

BONUS 1.9003 0.7262 2.62 0.016

MKTSHR 3.116 4.314 0.72 0.478

S = 91.75 R-Sq = 85.8% R-Sq(adj) = 83.8%

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٤١

Step 3 – MKTSHR Eliminated

Regression Analysis: SALES versus ADV, BONUS

The regression equation is

SALES = - 516 + 2.47 ADV + 1.86 BONUS

Predictor Coef SE Coef T P

Constant -516.4 189.9 -2.72 0.013

ADV 2.4732 0.2753 8.98 0.000

BONUS 1.8562 0.7157 2.59 0.017

S = 90.75 R-Sq = 85.5% R-Sq(adj) = 84.2%

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٤٢

8.3.2 Forward Selection

� At each stage, it looks at the x variables not in the current equation and tests to see if they will be significant if they are added.

� In the first stage, the x with the highest correlation with y is added.

� At later stages it is much harder to see how the next x is selected.


Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٤٣

Minitab Output for Forward Selection (an option in the Stepwise procedure)

Forward selection. Alpha-to-Enter: 0.25

Response is SALES on 4 predictors, with N = 25

Step 1 2

Constant -157.3 -516.4

ADV 2.77 2.47

T-Value 9.92 8.98

P-Value 0.000 0.000

BONUS 1.86

T-Value 2.59

P-Value 0.017

S 101 90.7

R-Sq 81.06 85.49

R-Sq(adj) 80.24 84.18

C-p 5.9 1.6

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٤٤

Same Model as Backwards

� This data set is not too complex, so both procedures returned the same model.

� With larger data sets, particularly when the x variables are correlated among themselves, results can be different.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٤٥

8.3.3 Stepwise Regression

� A limitation with the backwards procedure is that a variable that gets eliminated is never considered again.

� With forward selection, variables entering stay in, even if they lose significance.

� Stepwise regression corrects these flaws. A variable entering can later leave. A variable eliminated can later go back in.


Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٤٦

Minitab Output for Stepwise Regression

Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15

Response is SALES on 4 predictors, with N = 25

Step 1 2

Constant -157.3 -516.4

ADV 2.77 2.47

T-Value 9.92 8.98

P-Value 0.000 0.000

BONUS 1.86

T-Value 2.59

P-Value 0.017

S 101 90.7

R-Sq 81.06 85.49

R-Sq(adj) 80.24 84.18

C-p 5.9 1.6

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٤٧

Selection Parameters

� For backwards elimination, the user specifies "Alpha to Remove", which is the maximum p-value a variable can have and stay in the equation.

� For forward selection, the user specifies "Alpha to Enter", which is the maximum p-value a variable can have and still be allowed to enter the equation.

� Stepwise regression gets both.

� Often we use values like .15 or .20 because this encourages the procedures to look at models with more variables.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٤٨

8.4 Which Procedure is Best?

� Unless there are too many x variables, the all possible models approach is favored because it looks at all combinations of variables.

� Of the other strategies, stepwise regression is probably best.

� If no search programs are available, backwards elimination can still provide a useful sifting of the data.


Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection

٣٤٩

No Guarantees

� Because they do not check assumptions or examine the model residuals, there is no guarantee of returning the right model.

� Nonetheless, these can be effective tools for filtering the data and identifying which variables to pay more attention to.