Chapter 3 Simple Regression Analysis (Part 1)
Terry Dielman, Applied Regression Analysis: A Second Course in Business and Economic Statistics, fourth edition
3.1 Using Simple Regression to Describe a Relationship
• Regression analysis is a statistical technique used to describe relationships among variables.
• The simplest case is one where a dependent variable y may be related to an independent or explanatory variable x.
• The equation expressing this relationship is the line: y = b0 + b1x
Slope and Intercept
• For a given set of data, we need to calculate values for the slope b1 and the intercept b0.
• Figure 3.1 shows the graph of a set of six (x, y) pairs that have an exact relationship.
• Ordinary algebra is all you need to compute y = 1 + 2x.
Figure 3.1 Graph of An Exact Relationship
[Scatter plot of the six (x, y) pairs]
x   y
1   3
2   5
3   7
4   9
5   11
6   13
Error in the Relationship
• In real life, we usually do not have exact relationships.
• Figure 3.2 shows a situation where y and x have a strong tendency to increase together, but the relationship is not perfect.
• You can use a ruler to put a line in approximately the "right place" and use algebra again.
• A good guess might be ŷ = 1 + 2.5x
Figure 3.2 Graph of a Relationship That is NOT Exact
x   y
1   3
2   2
3   8
4   8
5   11
6   13

[Regression plot with fitted line]
y = -0.2 + 2.2x
S = 1.48324   R-Sq = 90.6%   R-Sq(adj) = 88.2%
Everybody Is Different
• The drawback to this technique is that everybody will have their own opinion about where the line goes.
• There would be even greater differences if there were more data with a wider scatter.
• We need a precise mathematical technique to use for this task.
Residuals
• Figure 3.3 shows the previous graph where the "fit error" of each point is indicated.
• These residuals are positive if the point is above the line and negative if the line is above the point.
• We want a technique that will make the + and – even out.
Figure 3.3 Deviations From the Line
[Regression plot showing + deviations above the line and – deviations below it]
y = -0.2 + 2.2x
S = 1.48324   R-Sq = 90.6%   R-Sq(adj) = 88.2%
Computation Ideas (1)
We can search for a line that minimizes the sum of the residuals:

Σ(yi - ŷi)

While this is a good idea, it can be shown that any line passing through the point (x̄, ȳ) will have this sum = 0.
Computation Ideas (2)
We can work with absolute values and search for a line that minimizes:

Σ|yi - ŷi|

Such a procedure, called LAV or least absolute value regression, does exist but usually is found only in specialized software.
Computation Ideas (3)
By far the most popular approach is to square the residuals and minimize:

Σ(yi - ŷi)²

This procedure is called least squares and is widely available in software. It uses calculus to solve for the b0 and b1 terms and gives a unique solution.
Least Squares Estimators
• There are several formulas for the b1 term. If doing it by hand, we might want to use:

b1 = [Σxiyi - (1/n)(Σxi)(Σyi)] / [Σxi² - (1/n)(Σxi)²]

• The intercept is b0 = ȳ - b1x̄
Figure 3.5 Computations Required for b1 and b0

xi    yi    xi²    xiyi
1     3     1      3
2     2     4      4
3     8     9      24
4     8     16     32
5     11    25     55
6     13    36     78
Totals: 21    45    91    196
Calculations
b1 = [Σxiyi - (1/n)(Σxi)(Σyi)] / [Σxi² - (1/n)(Σxi)²]
   = [196 - (1/6)(21)(45)] / [91 - (1/6)(21)²]
   = 38.5 / 17.5 = 2.2

b0 = ȳ - b1x̄ = 7.5 - (2.2)(3.5) = -0.2

(A code sketch of this computation follows.)
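As a cross-check on the hand arithmetic, here is a minimal Python sketch (assuming NumPy is installed) that applies the same formulas to the six (x, y) pairs from Figure 3.5; the array names are only illustrative.

```python
import numpy as np

# The six (x, y) pairs from Figure 3.5
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([3, 2, 8, 8, 11, 13], dtype=float)
n = len(x)

# Hand formula: b1 = [Σxiyi - (1/n)(Σxi)(Σyi)] / [Σxi² - (1/n)(Σxi)²]
b1 = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / (np.sum(x ** 2) - np.sum(x) ** 2 / n)
b0 = y.mean() - b1 * x.mean()            # b0 = ȳ - b1·x̄

print(b0, b1)                            # -0.2 and 2.2, matching y = -0.2 + 2.2x

# The same answer from a general least-squares routine
X = np.column_stack([np.ones(n), x])     # design matrix with an intercept column
print(np.linalg.lstsq(X, y, rcond=None)[0])
```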
The Unique Minimum
• The line we obtained was: ŷ = -0.2 + 2.2x
• The sum of squared errors (SSE) is: Σ(yi - ŷi)² = 8.80
• No other linear equation will yield a smaller SSE. For the line 1 + 2.5x we guessed earlier, the SSE is 10.75.
3.2 Examples of Regression as a Descriptive Technique
Example 3.2 Pricing Communications Nodes
A Ft. Worth manufacturing company was concerned about the cost of adding nodes to a communications network. They obtained data on 14 existing nodes.
They did a regression of cost (the y) on number of ports (x).
[Scatter plot of COST versus NUMPORTS with fitted line]
Pricing Communications Nodes
Cost = 16594 + 650 NUMPORTS
Example 3.3 Estimating Residential Real Estate Values
The Tarrant County Appraisal District uses data such as house size, location and depreciation to help appraise property.
Regression can be used to establish a weight for each factor. Here we look at how price depends on size for a set of 100 homes. The data are from 1990.
[Scatter plot of VALUE versus SIZE with fitted line]
Tarrant County Real Estate
VALUE = -50035 + 72.8 SIZE
Example 3.4 Forecasting Housing Starts
Forecasts of various economic measures are important to the government and to various industries.
Here we analyze the relationship between US housing starts and mortgage rates. The rate used is the US average for new home purchases.
Annual data from 1963 to 2002 is used.
[Scatter plot of STARTS versus RATES with fitted line]
US Housing Starts
STARTS = 1726 - 22.2 RATES
3.3 Inferences From a Simple Regression Analysis
• So far regression has been used as a way to describe the relationship between the two variables.
• Here we will use our sample data to make inferences about what is going on in the underlying population.
• To do that, we first need some assumptions about how the population behaves.
3.3.1 Assumptions Concerning the Population Regression Line
• Let's use the communications nodes example to illustrate. Costs ranged from roughly $23,000 to $57,000 and number of ports from 12 to 68.
• Three times we had projects with 24 ports, but the three costs were all different. The same thing occurred at the repeated observations at 52 and 56 ports.
• This illustrates how we view things: at each value of x there is a distribution of potential y values that can occur.
The Conditional Mean
• Our first assumption is that the means of these distributions all lie on a straight line: μy|x = β0 + β1x
• For example, at projects with 30 ports, we have: μy|x=30 = β0 + β1(30)
• The actual costs of projects with 30 ports are going to be distributed about this mean. This also happens at other sizes of projects, so you might see something like the next slide.
Figure 3.12 Distribution of Costs around the Regression Line
[Plot of cost distributions at 12, 30 and 68 nodes around the line β0 + β1 Nodes]
The Disturbance Terms
• Because of the variation around the regression line, it is convenient to view the individual costs as: yi = β0 + β1xi + ei
• The ei are called the disturbances and represent how yi differs from its conditional mean. If yi is above the mean, its disturbance has a + value.
Assumptions
1. We expect the average disturbance ei to be zero so the regression line passes through the conditional mean of y.
2. The ei have constant variance σe2.
3. The ei are normally distributed.
4. The ei are independent.
3.3.2 Inferences About β0 and β1
• We use our sample data to estimate β0 by b0 and β1 by b1. If we had a different sample, we would not be surprised to get different estimates.
• Understanding how much they would vary from sample to sample is an important part of the inference process.
• We use the assumptions, together with our data, to construct the sampling distributions for b0 and b1.
The Sampling Distributions
• The estimators have many good statistical properties. They are unbiased, consistent and minimum variance.
• They have normal distributions with standard errors that are functions of the x values and σe2.
• Full details are in Section 3.3.2.
Estimate of σe2
• This is an unknown quantity that needs to be estimated from data.
• We estimate it by the formula:

Se² = Σ(yi - ŷi)² / (n - 2) = SSE / (n - 2) = MSE

• The term MSE stands for mean squared error and is more or less the average squared residual.
Standard Error of the Regression
• The divisor n-2 used in the previous calculation follows our general rule that degrees of freedom equal the sample size minus the number of estimates we make (b0 and b1) before estimating the variance.
• The square root of MSE is Se, which we call the standard error of the regression.
• Se can be roughly interpreted as the "typical" amount we miss in estimating each y value.
Inference About β1
• Interval estimates and hypothesis tests are constructed using the sampling distribution of b1.
• The standard error of b1 is:

Sb1 = Se / sqrt((n - 1)sx²)

• Computer programs routinely compute this and report its value.
Interval Estimate
• The distribution we use is a t with n-2 degrees of freedom.
• The interval is: b1 ± tn-2 Sb1
• The value of t, of course, depends on the selected confidence level.
Tests About β1
The most common test is that a change in the x variable does not induce a change in y, which can be stated:

H0: β1 = 0    Ha: β1 ≠ 0

If H0 is true, the population regression equation is a flat line; that is, regardless of the value of x, y has the same distribution.
Test Statistic
The test would be performed by using the standardized test statistic:

t = (b1 - 0) / Sb1

Most computer programs compute this and its associated p-value and include them on the output.

The p-value is for the two-sided version of the test.
Inference About β0
• We can also compute confidence intervals and perform hypothesis tests about the intercept in the population equation.
• Details about the tests and intervals are in Section 3.3.2, but in most problems we are not interested in this.
• The intercept is the value of y at x = 0, and in many problems this is not relevant; for example, we never see houses with zero square feet of floor space.
• Sometimes it is relevant anyway. If we are estimating costs, we could interpret the intercept as the fixed cost. Even though we never see communication nodes with zero ports, there is likely to be a fixed cost associated with setting up each project.
Example 3.6 Pricing Communications Nodes (continued)
Inference questions:
1. What is the equation relating NUMPORTS to COST?
2. Is the relationship significant?
3. What is an interval estimate of β1?
4. Is the relationship positive?
5. Can we claim each port costs at least $1000?
6. What is our estimate of fixed cost?
7. Is the intercept 0?
Minitab Regression Output
Regression Analysis: COST versus NUMPORTS
The regression equation is
COST = 16594 + 650 NUMPORTS
Predictor Coef SE Coef T P
Constant 16594 2687 6.18 0.000
NUMPORTS 650.17 66.91 9.72 0.000
S = 4307 R-Sq = 88.7% R-Sq(adj) = 87.8%
Analysis of Variance
Source DF SS MS F P
Regression 1 1751268376 1751268376 94.41 0.000
Residual Error 12 222594146 18549512
Total 13 1973862521
Is the relationship significant?
H0: β1 = 0 (Cost does not change when the number of ports increases)
Ha: β1 ≠ 0 (Cost does change)
We will use a 5% level of significance and the t distribution with (n-2) = 12 degrees of freedom.
Decision rule: Reject H0 if t > 2.179 or if t < -2.179

From the Minitab output, t = 9.72 (p-value = .000)
We conclude that there is a significant relationship between project size and cost.
What is an interval estimate of β1?
The interval is: b1 ± tn-2 Sb1

For a 95% interval use t = 2.179

650.17 ± 2.179(66.91) = 650.17 ± 145.80

We are 95% sure that the average cost for each additional port is between $504 and $796 (see the sketch below).
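A small sketch of the same interval in Python (SciPy assumed available); the numbers are the slope estimate, its standard error, and the degrees of freedom read off the Minitab output.

```python
from scipy import stats

b1, se_b1, df = 650.17, 66.91, 12          # from the Minitab output; df = n - 2
t_crit = stats.t.ppf(0.975, df)            # two-sided 95% multiplier, about 2.179
lower, upper = b1 - t_crit * se_b1, b1 + t_crit * se_b1
print(lower, upper)                        # roughly 504 to 796
```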
Can we claim a positive relationship?
H0: β1 = 0 (Cost does not change when size increases)
Ha: β1 > 0 (Cost increases when size increases)
We will use a 5% level of significance and the t distribution with (n-2) = 12 degrees of freedom.
Decision rule: Reject H0 if t > 1.782
From Minitab output t = 9.72 (p-value is half of the listed value of .000, which is still .000)
We conclude that the project cost does increase with project size.
Is the cost per port at least $1000?
H0: β1 ≥ 1000 (Cost per port at least $1000)
Ha: β1 < 1000 (Cost is less than $1000)
Again we will use a 5% level of significance and 12 degrees of freedom.

Decision rule: Reject H0 if t < -1.782

Here use t = (b1 - 1000) / Sb1 = (650.17 - 1000) / 66.91 = -5.23

We conclude that the cost per port is (much) less than $1000.
What is our estimate of fixed cost?
We can interpret the intercept of the equation as fixed cost, and the slope as variable cost. For the intercept, an interval is:

b0 ± tn-2 Sb0

16594 ± 2.179(2687) = 16594 ± 5855

We are 95% sure the fixed cost is between $10,739 and $22,449.
Is the intercept 0?
H0: β0 = 0 (Fixed cost is 0)
Ha: β0 ≠ 0 (Fixed cost is not 0)
Again, use a 5% level of significance and 12 d.f.
Decision rule: Reject H0 if t > 2.179 or if t < -2.179

From the Minitab output, t = 6.18 (p-value = .000)
We conclude that the fixed cost is not zero.
Chapter 3 Simple Regression Analysis (Part 2)
Terry Dielman, Applied Regression Analysis: A Second Course in Business and Economic Statistics, fourth edition
3.4 Assessing the Fit of the Regression Line
• In some problems, it may not be possible to find a good predictor of the y values.
• We know the least squares procedure finds the best possible fit, but that does not guarantee good predictive power.
• In this section we discuss some methods for summarizing the fit quality.
3.4.1 The ANOVA Table
Let us start by looking at the amount of variation in the y values. The variation about the mean is:

Σ(yi - ȳ)²

which we will call SST, the total sum of squares.

Text equations (3.14) and (3.15) show how this can be split up into two parts.
Partitioning SST
SST can be split into two pieces which are the previously introduced SSE and a new quantity, SSR, the regression sum of squares.
SST = SSR + SSE
Σ(yi - ȳ)² = Σ(ŷi - ȳ)² + Σ(yi - ŷi)²
Explained and Unexplained Variation
• We know that SSE is the sum of all the squared residuals, which represent lack of fit in the observations.
• We call this the unexplained variation in the sample.
• Because SSR contains the remainder of the variation in the sample, it is thus the variation explained by the regression equation.
The ANOVA Table
Most statistics packages organize these quantities in an ANalysis Of VAriance table.
Source DF SS MS F
Regression 1 SSR MSR MSR/MSE
Residual n-2 SSE MSE
Total n-1 SST
3.4.2 The Coefficient of Determination
• If we had an exact relationship between y and x, then SSE would be zero and SSR = SST.
• Since that does not happen often, it is convenient to use the ratio of SSR to SST as a measure of how close we get to the exact relationship.
• This ratio is called the Coefficient of Determination, or R2.
R2

R2 = SSR / SST is a fraction between 0 and 1.

In an exact model, R2 would be 1. Most of the time we multiply by 100 and report it as a percentage.

Thus, R2 is the percentage of the variation in the sample of y values that is explained by the regression equation.
Correlation Coefficient
• Some programs also report the square root of R2 as the correlation between the y and ŷ values.
• When there is only a single predictor variable, as here, R2 is just the square of the correlation between y and x.
3.4.3 The F Test
• An additional measure of fit is provided by the F statistic, which is the ratio of MSR to MSE.
• This can be used as another way to test the hypothesis that β1 = 0.
• This test is not really important in simple regression because it is redundant with the t test on the slope.
• In multiple regression (next chapter) it is much more important.
F Test Setup
The hypotheses are: H0: β1 = 0    Ha: β1 ≠ 0
The F ratio has 1 numerator degree of freedom and n-2 denominator degrees of freedom.
A critical value for the test is selected from that distribution and H0 is rejected if the computed F ratio exceeds the critical value.
Example 3.8 Pricing Communications Nodes (continued)
Below we see the portion of the Minitab output that lists the statistics we have just discussed.
S = 4307 R-Sq = 88.7% R-Sq(adj) = 87.8%
Analysis of Variance
Source DF SS MS F P
Regression 1 1751268376 1751268376 94.41 0.000
Residual Error 12 222594146 18549512
Total 13 1973862521
R2 and F
R2 = SSR/SST = 1751268376 / 1973862521 = .8872, or 88.7%

F = MSR/MSE = 1751268376 / 18549512 = 94.41
From the F1,12 distribution, the critical value at a 5% significance level is 4.75
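These numbers can be reproduced directly from the ANOVA table; a quick sketch (SciPy assumed) that also looks up the F critical value:

```python
from scipy import stats

ssr, sse, sst, n = 1751268376, 222594146, 1973862521, 14

r_sq = ssr / sst                      # about 0.887
msr, mse = ssr / 1, sse / (n - 2)     # 1 and n-2 degrees of freedom
f_stat = msr / mse                    # about 94.4
f_crit = stats.f.ppf(0.95, 1, n - 2)  # about 4.75
print(r_sq, f_stat, f_stat > f_crit)  # the F test clearly rejects H0: beta1 = 0
```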
3.5 Prediction or Forecasting With a Simple Linear Regression Equation
• Suppose we are interested in predicting the cost of a new communications node that has 40 ports.
• If this size project is something we would see often, we might be interested in estimating the average cost of all projects with 40 ports.
• If it is something we expect to see only once, we would be interested in predicting the cost of the individual project.
3.5.1 Estimating the Conditional Mean of y Given x.
At xm = 40 ports, the quantity we are estimating is:

μy|x=40 = β0 + β1(40)

Our best guess of this is just the point on the regression line:

ŷm = b0 + b1(40)
Standard Error of the Mean
• We will want to make an interval estimate, so we need some kind of standard error.
• Because our point estimate is a function of the random variables b0 and b1, their standard errors figure into our computation.
• The result is:

Sm = Se · sqrt( 1/n + (xm - x̄)² / ((n - 1)sx²) )
Where Are We Most Accurate?
• For estimating the mean at the point xm, the standard error is Sm.
• If you examine the formula:

Sm = Se · sqrt( 1/n + (xm - x̄)² / ((n - 1)sx²) )

you can see that the second term will be zero if we predict at the mean value of x.
• That makes sense: it says you do your best prediction right in the center of your data.
Interval Estimate
• For estimating the conditional mean of y that occurs at xm we use: ŷm ± tn-2 Sm
• We call this a confidence interval for the mean value of y at xm.
Hypothesis Test
• We could also perform a hypothesis test about the conditional mean.
• The hypothesis would be:

H0: μy|x=40 = (some value)

and we would construct a t ratio from the point estimate and standard error.
3.5.2 Predicting an Individual Value of y Given x
• If we are trying to say something about an individual value of y, it is a little bit harder.
• We not only have to estimate the conditional mean, but we also have to tack on an allowance for y being above or below its mean.
• We use the same point estimate, but our standard error is larger.
Prediction Standard Error
• It can be shown that the prediction standard error is:

Sp = Se · sqrt( 1 + 1/n + (xm - x̄)² / ((n - 1)sx²) )

• This looks a lot like the previous one but has an additional term under the square root sign.
• The relationship is: Sp² = Sm² + Se²
Predictive Inference
• Although we could be interested in a hypothesis test, the most common type of predictive inference is a prediction interval.
• The interval is just like the one for the conditional mean, except that Sp is used in the computation.
Example 3.10 Pricing Communications Nodes (one last time)
What do we get when there are 40 ports?
Many statistics packages have a way for you to do the prediction. Here is Minitab's output:
Predicted Values for New Observations
New Obs Fit SE Fit 95.0% CI 95.0% PI
1 42600 1178 ( 40035, 45166) ( 32872, 52329)
Values of Predictors for New Observations
New Obs NUMPORTS
1 40.0
From the Output
ŷm = 42600    Sm = 1178

Confidence interval: 40035 to 45166, computed as 42600 ± 2.179(1178)

Prediction interval: 32872 to 52329, computed as 42600 ± 2.179(????); the output does not list Sp (see the sketch below).
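Sp is not printed, but it can be backed out from the relationship Sp² = Sm² + Se² given earlier; a sketch using the values on this slide (Se = 4307 from the regression output, Sm = 1178, 12 df):

```python
import math
from scipy import stats

s_e, s_m, df, y_hat = 4307.0, 1178.0, 12, 42600.0
s_p = math.sqrt(s_e ** 2 + s_m ** 2)                 # about 4465
t_crit = stats.t.ppf(0.975, df)                      # about 2.179
print(y_hat - t_crit * s_p, y_hat + t_crit * s_p)    # roughly 32870 to 52330
```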
Interpretations
For all projects with 40 ports, we are 95% sure that the average cost is between $40,035 and $45,166.
We are 95% sure that any individual project will have a cost between $32,872 and $52,329.
3.5.3 Assessing Quality of Prediction
• We use the model's R2 as a measure of fit ability, but this may overestimate the model's ability to predict.
• The reason is that R2 is optimized by the least squares procedure for the data in our sample.
• It is not necessarily optimal for data outside our sample, which is what we are predicting.
Data Splitting
• We can split the data into two pieces. Use the first part to obtain the equation and use it to predict the data in the second part (see the sketch below).
• By comparing the actual y values in the second part to their corresponding predicted values, you get an idea of how well you predict data that is not in the "fit" sample.
• The biggest drawback to this is that it won't work too well unless we have a lot of data. To be really reliable we should have at least 25 to 30 observations in both samples.
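A minimal sketch of the data-splitting idea for simple regression (NumPy assumed); the function name and the choice of splitting on the first n_fit observations are illustrative, not from the text.

```python
import numpy as np

def holdout_prediction_r2(x, y, n_fit):
    """Fit on the first n_fit points, then score predictions on the held-out points."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    X_fit = np.column_stack([np.ones(n_fit), x[:n_fit]])
    b0, b1 = np.linalg.lstsq(X_fit, y[:n_fit], rcond=None)[0]
    y_hold, y_pred = y[n_fit:], b0 + b1 * x[n_fit:]
    sse = np.sum((y_hold - y_pred) ** 2)
    sst = np.sum((y_hold - y_hold.mean()) ** 2)
    return 1 - sse / sst          # an R2-like score for the holdout sample
```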
The PRESS Statistic
• Suppose you temporarily deleted observation i from the data set, fit a new equation, then used it to predict the yi value.
• Because the new equation did not use any information from this data point, we get a clearer picture of the model's ability to predict it.
• The sum of these squared prediction errors is the PRESS statistic (a sketch of the computation follows).
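A sketch of the PRESS computation for simple regression (NumPy assumed): each observation is set aside in turn, the line is refit, and the squared prediction error is accumulated.

```python
import numpy as np

def press_statistic(x, y):
    """PRESS for a simple linear regression of y on x."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    total = 0.0
    for i in range(len(x)):
        keep = np.arange(len(x)) != i                       # leave observation i out
        X = np.column_stack([np.ones(keep.sum()), x[keep]])
        b0, b1 = np.linalg.lstsq(X, y[keep], rcond=None)[0]
        total += (y[i] - (b0 + b1 * x[i])) ** 2             # squared prediction error
    return total
```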
Prediction R2
• It sounds like a lot of work to do by hand, but most statistics packages will do it for you.
• You can then compute an R2-like measure called the prediction R2:

R2(PRED) = 1 - PRESS / SST
In Our Example
For the communications node data we have been using, SSE = 222594146, SST = 1973862521 and R2 = 88.7%.

Minitab reports that PRESS = 345066019.

Our prediction R2:
1 - (345066019 / 1973862521) = 1 - .175 = .825, or 82.5%

Although there is a little loss, it implies we still have good prediction ability.
3.6 Fitting a Linear Trend Model to Time-Series Data
• Data gathered on different units at the same point in time are called cross-sectional data.
• Data gathered on a single unit (person, firm, etc.) over a sequence of time periods are called time-series data.
• With this type of data, the primary goal is often building a model that can forecast the future.
Time Series Models
• There are many types of models that attempt to identify patterns of behavior in a time series in order to extrapolate it into the future.
• Some of these will be examined in Chapter 11, but here we will just employ a simple linear trend model.
The Linear Trend Model

We assume the series displays a steady upward or downward behavior over time that can be described by:

yt = β0 + β1t + et

where t is the time index (t = 1 for the first observation, t = 2 for the second, and so forth).

The forecast for this model is quite simple:

ŷT = b0 + b1T

You just insert the appropriate value for T into the regression equation.
Example 3.11 ABX Company Sales
• The ABX Company sells winter sports merchandise including skates and skis. The quarterly sales (in $1000s) from first quarter 1994 through fourth quarter 2003 are graphed on the next slide.
• The time-series plot shows a strong upward trend. There are also some seasonal fluctuations, which will be addressed in Chapter 7.
[Time series plot of quarterly SALES]
ABX Company Sales
Obtaining the Trend Equation
• We first need to create the time index variable, which is equal to 1 for first quarter 1994 and 40 for fourth quarter 2003.
• Once this is created we can obtain the trend equation by linear regression.
Trend Line Estimation
The regression equation is
SALES = 199 + 2.56 TIME
Predictor Coef SE Coef T P
Constant 199.017 5.128 38.81 0.000
TIME 2.5559 0.2180 11.73 0.000
S = 15.91 R-Sq = 78.3% R-Sq(adj) = 77.8%
Analysis of Variance
Source DF SS MS F P
Regression 1 34818 34818 137.50 0.000
Residual Error 38 9622 253
Total 39 44440
The Slope Coefficient
The slope in the equation is 2.5559. This implies that over this 10-year period, we saw an average growth in sales of $2,556 per quarter.
The hypothesis test on the slope has a t value of 11.73, so this is indeed significantly greater than zero.
Forecasts For 2004
• Forecasts for 2004 can be obtained by evaluating the equation at t = 41, 42, 43 and 44.
• For example, the sales in the fourth quarter are forecast as:

SALES = 199.017 + 2.5559(44) = 311.48

• A graph of the data, the estimated trend and the forecasts is next; a short code sketch of the same computation follows.
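A sketch of the trend fit and forecast (NumPy assumed; the quarterly sales series is passed in, since the ABX data themselves are not reproduced here):

```python
import numpy as np

def trend_forecasts(sales, horizon=4):
    """Fit SALES = b0 + b1*TIME by least squares and forecast the next `horizon` periods."""
    y = np.asarray(sales, dtype=float)
    n = len(y)
    t = np.arange(1, n + 1, dtype=float)
    X = np.column_stack([np.ones(n), t])
    b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]   # for the ABX data: about 199.0 and 2.556
    future_t = np.arange(n + 1, n + horizon + 1)
    return b0 + b1 * future_t                       # t = 44 gives roughly 311.5
```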
[Plot of the data, the estimated trend (—) and the forecasts (---) against TIME]
3.7 Some Cautions in Interpreting Regression Results
Two common mistakes that are made when using regression analysis are:
1. Assuming that x causes y to happen, and
2. Assuming that you can use the equation to predict y for any value of x.
3.7.1 Association Versus Causality
• If you have a model with a high R2, it does not automatically mean that a change in x causes y to change in a very predictable way.
• It could be just the opposite, that y causes x to change. A high correlation goes both ways.
• It could also be that both y and x are changing in response to a third variable that we don't know about.
The Third Factor
• One example of this third factor is the price and gasoline mileage of automobiles. As price increases, there is a sharp drop in mpg. This is caused by size: larger cars cost more and get less mileage.
• Another is mortality rate in a country versus the percentage of homes with television. As TV ownership increases, mortality rate drops. This is probably due to better economic conditions improving quality of life and simultaneously allowing for greater ownership.
3.7.2 Forecasting Outside the Range of the Explanatory Variable
• When we have a model with a high R2, it means we know a good deal about the relationship of y and x for the range of x values in our study.
• Think of our communications nodes example, where the number of ports ranged from 12 to 68. Does our model even hold if we wanted to price a massive project of 200 ports?
An Extrapolation Penalty
• Recall that our prediction intervals were always narrowest when we predicted right in the middle of our data set.
• As we go farther and farther outside the range of our data, the interval gets wider and wider, implying we know less and less about what is going on.
Chapter 4 Multiple Regression Analysis (Part 1)
Terry Dielman, Applied Regression Analysis: A Second Course in Business and Economic Statistics, fourth edition
4.1 Using Multiple Regression
• In Chapter 3, the method of least squares was used to describe the relationship between a dependent variable y and an explanatory variable x.
• Here we extend that to two or more predictor variables, using an equation of the form:

ŷ = b0 + b1x1 + b2x2 + ... + bKxK
Basic Exploration
• In Chapter 3 our main graphic tool was the X-Y scatter plot.
• Exploratory graphics are a bit harder to produce here because they need to be multidimensional.
• Even if there were just two x variables, a 3-D display is needed.
Estimation of Coefficients
We want an equation of the form:

ŷ = b0 + b1x1 + b2x2 + ... + bKxK

As before we use least squares. The coefficients b0, b1, b2, ..., bK are determined by minimizing the sum of squared residuals.
Formulae are Very Complex
• An exact formula can be shown when K = 1 (simple regression); refer to Section 3.1.
• Few texts show the formulae for K = 2 (the simplest of multiple regressions).
• Appendix D shows the formula in matrix notation.
• This is totally a computer problem (a small sketch follows).
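Since the coefficient formulas are impractical by hand, here is a minimal sketch of how a computer obtains them, via matrix least squares in NumPy (the function name is illustrative):

```python
import numpy as np

def multiple_regression_coefficients(X, y):
    """Return b0, b1, ..., bK minimizing the sum of squared residuals.
    X is an n-by-K array of predictor columns; y is the length-n response."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])   # add the intercept column
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs
```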
Example 4.1 Meddicorp Sales
n = 25 sales territories
Y = Sales (1000$) in each territory
X1 = Advertising (100$) in the territory
X2 = Amount of bonuses (100$) paid to salespersons in the territory
Data set: MEDDICORP4
Plots and Correlation
[Scatter plots of SALES against ADV and SALES against BONUS]

Correlations
         SALES    ADV
ADV      0.900
BONUS    0.568    0.419
3D Graphics
[3-D scatter plot of SALES against ADV and BONUS]
Meddicorp Sales
Minitab Regression Output
The regression equation is
SALES = - 516 + 2.47 ADV + 1.86 BONUS
Predictor Coef SE Coef T P
Constant -516.4 189.9 -2.72 0.013
ADV 2.4732 0.2753 8.98 0.000
BONUS 1.8562 0.7157 2.59 0.017
S = 90.75 R-Sq = 85.5% R-Sq(adj) = 84.2%
Analysis of Variance
Source DF SS MS F P
Regression 2 1067797 533899 64.83 0.000
Residual Error 22 181176 8235
Total 24 1248974
3D Surface Graph
[3-D surface plot of estimated sales against ADV and BONUS]
Estimated Meddicorp Sales
The regression equation is
SALES = - 516 + 2.47 ADV + 1.86 BONUS
Interpretation of Coefficients
• Recall that sales is in $1000s and advertising and bonus in $100s.
• If advertising is held fixed, sales increase $1860 for each $100 of bonus paid.
• If bonus were fixed, sales increase $2470 for each $100 spent on advertising.
The regression equation is
SALES = - 516 + 2.47 ADV + 1.86 BONUS
4.2 Inferences From a Multiple Regression Analysis

In general, the population regression equation involving K predictors is:

μy|x1,x2,...,xK = β0 + β1x1 + β2x2 + ... + βKxK

This says the mean value of y at a given set of x values is a point on the surface described by the terms on the right-hand side of the equation.
4.2.1 Assumptions Concerning the Population Regression Line
An alternative way of writing the relationship is:

yi = β0 + β1x1i + β2x2i + ... + βKxKi + ei

where i denotes the ith observation and ei denotes a random error or disturbance (deviation from the mean).

We make certain assumptions about the ei.
Assumptions
1. We expect the average disturbance ei to be zero so the regression line passes through the average value of y.
2. The ei have constant variance σe2.
3. The ei are normally distributed.
4. The ei are independent.
Inferences
• The assumptions allow inferences about the population relationship to be made from a sample equation.
• The first inferences considered are those about the individual population coefficients β1, β2, ..., βK.
• Chapter 6 examines what happens when the assumptions are violated.
4.2.2 Inferences about the Population Regression Coefficients
If we wish to make an estimate of the effect of a change in one of the x variables on y, use the interval:

bj ± tn-K-1 Sbj

This refers to the jth of the K+1 regression coefficients. The multiplier t is selected from the t distribution with n-K-1 degrees of freedom.
Tests About the Coefficients
A test about the marginal effect of xj on y may be obtained from:

H0: βj = βj*
Ha: βj ≠ βj*

where βj* is some specific value that is relevant for the jth coefficient.
Test Statistic
The test would be performed by using the standardized test statistic:

t = (bj - βj*) / Sbj

The most common form of this test is for the parameter to be 0. In this case the test statistic is just the estimate divided by its standard error.
Example 4.2 Meddicorp (Continued)
Refer again to the portion of the regression output about the individual regression coefficients:
This lists the estimates, their standard errors and the ratio of the estimates to their standard errors.
Predictor Coef SE Coef T P
Constant -516.4 189.9 -2.72 0.013
ADV 2.4732 0.2753 8.98 0.000
BONUS 1.8562 0.7157 2.59 0.017
Tests For Effect of Advertising
To see if an increase in advertising expenditure affects sales, we can test:
H0: βADV = 0 (An increase in advertising has no effect on sales)
Ha: βADV ≠ 0 (Sales do change when advertising increases)

The df are n-K-1 = 25-2-1 = 22. At a 5% significance level, the critical point from the t table is 2.074.
Test Result
From the output we get:
t = ( 2.4732 – 0)/.2753 = 8.98
This is above the critical value of 2.074, so we reject H0.
Note that we could also make use of the p-value (.000) for the test.
One-Sided Test on Bonus
We can modify the test to make it one sided
H0: βBONUS = 0 (Increased bonuses do not affect sales)
Ha: βBONUS > 0 (Sales increase when bonuses are higher)

At a 5% significance level, the (one-sided) critical point is 1.717.
One-Sided Test Result
From the output we get:
t = 1.8562/.7157 = 2.59 which is > 1.717
We reject H0 but this time make a more specific conclusion.
The listed p-value (.017) is for a two-sided test. For our one-sided test, cut it in half.
Interval Estimate for the Effect of Advertising
Recall that sales are measured in 1000$ and ADV in 100$
badv = 2.4732 and has standard error = .2753
2.4732 ± 2.074(.2753) = 2.4732 ± .5709
= 1.902 to 3.044
Each $100 spent on advertising returns $1902 to $3044 in sales.
4.3 Assessing the Fit of the Model
Recall how we partitioned the variation in the previous chapter:
SST = Total variation in the sample of Y values
Split up into two components SSE, SSR
SSE = Error or unexplained variation
SSR = Explained by the Yhat function
4.3.1 The ANOVA Table and R2
• These are the same statistics we briefly examined in simple regression.
• They are perhaps more important here because they measure how well all the variables in the equation work together.
S = 90.75 R-Sq = 85.5% R-Sq(adj) = 84.2%
Analysis of Variance
Source DF SS MS F P
Regression 2 1067797 533899 64.83 0.000
Residual Error 22 181176 8235
Total 24 1248974
R2 – a Universal Measure of Fit
R2 = SSR / SST = proportion of variation explained by the regression equation.
If multiplied by 100, interpret as %
If only one x, R2 is square of correlation
For multiple, R2 is square of correlation between the Y values and Y-hat values
For our example
S = 90.75 R-Sq = 85.5% R-Sq(adj) = 84.2%
Analysis of Variance
Source DF SS MS F P
Regression 2 1067797 533899 64.83 0.000
Residual Error 22 181176 8235
Total 24 1248974
R2 = 1067797 / 1248974 = .85494
85.5% of the variation in sales in the 25 territories is explained by the different levels of advertising and bonus
Adjusted R2
• If there are many predictor variables to choose from, the best R2 is always obtained by throwing them all in the model.
• Some of these predictors could be insignificant, suggesting they contribute little to the model's R2.
• Adjusted R2 is a way to balance the desire for high R2 against the desire to include only important variables.
Computation
The "adjustment" is for the number of variables in the model.
Although regular R2 may decrease when you remove a variable, the adjusted version may actually increase if that variable did not have much significance.
)1/(
)1/(12
−
−−−=
nSST
KnSSERadj
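A small sketch verifying the reported adjusted R2 for the Meddicorp model from its ANOVA quantities (n = 25 territories, K = 2 predictors):

```python
sse, sst = 181176, 1248974     # from the Meddicorp ANOVA table
n, K = 25, 2

r_sq = 1 - sse / sst                                   # about 0.855
r_sq_adj = 1 - (sse / (n - K - 1)) / (sst / (n - 1))   # about 0.842
print(r_sq, r_sq_adj)                                  # matches 85.5% and 84.2%
```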
4.3.2 The F Statistic
• Since R2 is so high, you would certainly think that the model contains significant predictive power.
• In other problems it is perhaps not so obvious. For example, would an R2 of 20% show any prediction ability at all?
• We can test for the predictive power of the entire model using the F statistic.
F Tests
• Generally these compare two sources of variation.
• F = V1/V2 and has two df parameters.
• Here V1 = SSR/K has K df.
• And V2 = SSE/(n-K-1) has n-K-1 df.
F Tables
Usually you will see several pages of these: one or two pages at each specific level of significance (.10, .05, .01). Each table gives the value of F at a specific significance level, indexed by numerator d.f. and denominator d.f.
F Test Hypotheses
H0: β1 = β2 = ... = βK = 0 (None of the Xs help explain Y)
Ha: Not all βs are 0 (At least one X is useful)

H0: R2 = 0 is an equivalent hypothesis.
F test for our example
S = 90.75 R-Sq = 85.5% R-Sq(adj) = 84.2%
Analysis of Variance
Source DF SS MS F P
Regression 2 1067797 533899 64.83 0.000
Residual Error 22 181176 8235
Total 24 1248974
F = 533899 / 8235 = 64.83 has p-value = 0.000
From tables, F2,22,.05 = 3.44 and F2,22,.01 = 5.72
Confirms that R2 = 85.5% is not near zero
Chapter 4 Multiple Regression Analysis (Part 2)
Terry Dielman, Applied Regression Analysis: A Second Course in Business and Economic Statistics, fourth edition
4.4 Comparing Two Regression Models
So far we have looked at two types of hypothesis tests. One was about the overall fit:
H0: β1 = β2 = …= βK = 0
The other was about individual terms:
H0: βj = 0
Ha: βj ≠ 0
4.4.1 Full and Reduced Model Using Separate Regressions
• Suppose we wanted to test a subset of the x variables for significance as a group.
• We could do this by comparing two models.
• The first (Full Model) has K variables in it.
• The second (Reduced Model) contains only the L variables that are NOT in our group.
The Two Models
For convenience, let's assume the group is the last (K - L) variables. The Full Model is:

y = β0 + β1x1 + ... + βLxL + βL+1xL+1 + ... + βKxK + e

The Reduced Model is just:

y = β0 + β1x1 + ... + βLxL + e
The Partial F Test
We test the group for significance with another F test. The hypothesis is:
H0: βL+1 = βL+2 = …= βK = 0
Ha: At least one β ≠ 0
The test is performed by seeing how much SSE changes between models.
The Partial F Statistic
Let SSEF and SSER denote the SSE in the full and reduced models.
F = [(SSER - SSEF) / (K - L)] / [SSEF / (n - K - 1)]
The statistic has (K-L) numerator and (n-K-1) denominator d.f.
The "Group"
• In many problems the group of variables has a natural definition.
• In later chapters we look at groups that provide curvature, measure location and model seasonal variation.
• Here we are just going to look at the effect of adding two new variables.
Example 4.4 Meddicorp (yet again)
In addition to the variables for advertising and bonuses paid, we now consider variables for market share and competition.
x3 = Meddicorp market share in each area
x4 = largest competitor's sales in each area
The New Regression Model
The regression equation is
SALES = - 594 + 2.51 ADV + 1.91 BONUS + 2.65 MKTSHR - 0.121 COMPET
Predictor Coef SE Coef T P
Constant -593.5 259.2 -2.29 0.033
ADV 2.5131 0.3143 8.00 0.000
BONUS 1.9059 0.7424 2.57 0.018
MKTSHR 2.651 4.636 0.57 0.574
COMPET -0.1207 0.3718 -0.32 0.749
S = 93.77 R-Sq = 85.9% R-Sq(adj) = 83.1%
Analysis of Variance
Source DF SS MS F P
Regression 4 1073119 268280 30.51 0.000
Residual Error 20 175855 8793
Total 24 1248974
Did We Gain Anything?
• The old model had R2 = 85.5%, so we gained only 0.4%.
• The t ratios for the two new variables are 0.57 and -0.32.
• It does not look like we have an improvement, but we really need the F test to be sure.
The Formal Test
Numerator df = (K - L) = 4 - 2 = 2
Denominator df = (n - K - 1) = 20

At a 5% level, F2,20 = 3.49

H0: βMKTSHR = βCOMPET = 0
Ha: At least one is ≠ 0
Reject H0 if F > 3.49
Things We Need
Full Model (K = 4): SSEF = 175855, (n - K - 1) = 20

Reduced Model (L = 2):
Analysis of Variance
Source          DF        SS        MS       F      P
Regression       2   1067797    533899   64.83  0.000
Residual Error  22    181176      8235
Total           24   1248974
SSER = 181176
Computations
F = [(SSER - SSEF) / (K - L)] / [SSEF / (n - K - 1)]
  = [(181176 - 175855) / (4 - 2)] / [175855 / (25 - 4 - 1)]
  = (5321 / 2) / 8793
  = 2660.5 / 8793 = 0.3026

Since 0.3026 < 3.49, we do not reject H0; MKTSHR and COMPET do not add significantly to the model (a code sketch of this computation follows).
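The same partial F computation and decision in a short Python sketch (SciPy assumed):

```python
from scipy import stats

sse_r, sse_f = 181176, 175855      # reduced (2-variable) and full (4-variable) SSE
n, K, L = 25, 4, 2

f_stat = ((sse_r - sse_f) / (K - L)) / (sse_f / (n - K - 1))   # about 0.30
f_crit = stats.f.ppf(0.95, K - L, n - K - 1)                   # about 3.49
print(f_stat, f_stat > f_crit)     # False: MKTSHR and COMPET add nothing significant
```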
4.4.2 Full and Reduced Model Comparisons Using Conditional Sums of Squares
• In the standard ANOVA table, SSR shows the amount of variation explained by all variables together.
• Alternate forms of the table break SSR down into components.
• For example, Minitab shows sequential SSR, which shows how much SSR increases as each new term is added.
Sequential SSR for Meddicorp
S = 93.77 R-Sq = 85.9% R-Sq(adj) = 83.1%
Analysis of Variance
Source DF SS MS F P
Regression 4 1073119 268280 30.51 0.000
Residual Error 20 175855 8793
Total 24 1248974
Source DF Seq SS
ADV 1 1012408
BONUS 1 55389
MKTSHR 1 4394
COMPET 1 927
Meaning What?
1. If ADV was added to the model first, SSR would rise from 0 to 1012408.
2. Addition of BONUS would yield a nice increase of 55389.
3. If MKTSHR entered third, SSR would rise a paltry 4394.
4. Finally, if COMPET came in last, SSR would barely budge by 927.
Implications
• This is another way of showing that once you account for advertising and bonuses paid, you do not get much more from the last two variables.
• The last two sequential SSR values add up to 5321, which is the same as the (SSER - SSEF) quantity computed in the partial F test.
• Given that, it is not surprising to learn that the partial F test can be stated in terms of sequential sums of squares.
4.5 Prediction With a Multiple Regression Equation
As in simple regression, we will look at two types of computations:
1. Estimating the mean y that can occur at a set of x values.
2. Predicting an individual value of y that can occur at a set of x values.
4.5.1 Estimating the Conditional Mean of y Given x1, x2, ..., xK
This is our estimate of the point on our regression surface that occurs at a specific set of x values.
For two x variables, we are estimating:

μy|x1,x2 = β0 + β1x1 + β2x2
Computations
The point estimate is straightforward; just plug in the x values:

ŷm = b0 + b1x1 + b2x2

The difficult part is computing a standard error to use in a confidence interval. Thankfully, most computer programs can do that.
4.5.2 Predicting an Individual Value of y Given x1, x2, ..., xK
Now the quantity we are trying to estimate is:

yi = β0 + β1x1i + β2x2i + ei

Our interval will have to account for the extra term (ei) in the equation, and thus will be wider than the interval for the mean.
Prediction in Minitab
Here we predict sales for a territory with 500 units of advertising and 250 units of bonus
Predicted Values for New Observations
New Obs Fit SE Fit 95.0% CI 95.0% PI
1 1184.2 25.2 (1131.8, 1236.6) ( 988.8, 1379.5)
Values of Predictors for New Observations
New Obs ADV BONUS
1 500 250
Interpretations
We are 95% sure that the average sales in territories with $50,000 advertising and $25,000 of bonuses will be between $1,131,800 and $1,236,600.
We are 95% sure that any individual territory with this level of advertising and bonuses will have between $988,800 and $1,379,500 of sales.
4.6 Multicollinearity: A Potential Problem in Multiple Regression
• In multiple regression, we like the x variables to be highly correlated with y because this implies good prediction ability.
• If the x variables are highly correlated among themselves, however, much of this prediction ability is redundant.
• Sometimes this redundancy is so severe that it causes instability in the coefficient estimation. When that happens we say multicollinearity has occurred.
4.6.1 Consequences of Multicollinearity
1. The standard errors of the bj are larger than they should be. This could cause all the t statistics to be near 0 even though the F is large.
2. It is hard to get good estimates of the βj. The bj may have the wrong sign. They may have large changes in value if another variable is dropped from or added to the regression.
4.6.2 Detecting Multicollinearity
Several methods appear in the literature. Some of these are:
1. Examining pairwise correlations
2. Seeing large F but small t ratios
3. Computing Variance Inflation Factors
Examining Pairwise Correlations
• If it is only a collinearity problem, you can detect it by examining the correlations for pairs of x values.
• How large the correlation needs to be before it suggests a problem is debatable. One rule of thumb is .5; another is the maximum correlation between y and the various x values.
• The major limitation of this is that it will not help if there is a linear relationship involving several x values, for example,
x1 = 2x2 - .07x3 + a small random error
Large F, Small t
• With a significant F statistic you would expect to see at least one significant predictor, but that may not happen if all the variables are fighting each other for significance.
• This method of detection may not work if there are, say, six good predictors but the multicollinearity only involves four of them.
• This method also may not help identify which variables are involved.
Variance Inflation Factors
• This is probably the most reliable method for detection because it both shows that the problem exists and which variables are involved.
• We can compute a VIF for each variable. A high VIF is an indication that the variable's standard error is "inflated" by its relationship to the other x variables.
Auxiliary Regressions
Suppose we regressed each x variable, in turn, on all of the other x variables. Let Rj2 denote the model's R2 we get when xj is the "temporary y".

The variable's VIF is: VIFj = 1 / (1 - Rj2)

A sketch of this computation appears below.
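A sketch of the auxiliary-regression computation of the VIFs (NumPy assumed; X is an n-by-K array whose columns are the predictors):

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF for each column of X, from the R2 of its regression on the other columns."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y_j = X[:, j]                                             # the "temporary y"
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        fitted = others @ np.linalg.lstsq(others, y_j, rcond=None)[0]
        r2_j = 1 - np.sum((y_j - fitted) ** 2) / np.sum((y_j - y_j.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2_j))
    return vifs
```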
VIFj and Rj2

If xj were totally uncorrelated with the other x variables, its VIF would be 1. This table shows some other values.

Rj2     VIFj
0%      1
50%     2
80%     5
90%     10
99%     100
Auxiliary Regressions: A Lot of Work?
• If there were a large number of x variables in the model, obtaining the auxiliaries would be tedious.
• Most statistics packages will compute the VIF statistics for you and report them with the coefficient output.
• You can then do the auxiliary regressions, if needed, for the variables with high VIF.
Using VIFs
• A general rule is that any VIF > 10 is a problem.
• Another is that if the average VIF is considerably larger than 1, SSE may be inflated.
• The average VIF indicates how many times larger SSE is due to multicollinearity than if the predictors were uncorrelated.
• Freund and Wilson suggest comparing the VIF to 1/(1 - R2) for the main model. If the VIFs are less than this, multicollinearity is not a problem.
Our Example
Pairwise correlations
The maximum correlation among the x variables is .452 so if multicollinearity exists it is well hidden.
Correlations: SALES, ADV, BONUS, MKTSHR, COMPET
SALES ADV BONUS MKTSHR
ADV 0.900
BONUS 0.568 0.419
MKTSHR 0.023 -0.020 -0.085
COMPET 0.377 0.452 0.229 -0.287
VIFs in Minitab
The regression equation is
SALES = - 594 + 2.51 ADV + 1.91 BONUS + 2.65 MKTSHR - .121 COMPET
Predictor Coef SE Coef T P VIF
Constant -593.5 259.2 -2.29 0.033
ADV 2.5131 0.3143 8.00 0.000 1.5
BONUS 1.9059 0.7424 2.57 0.018 1.2
MKTSHR 2.651 4.636 0.57 0.574 1.1
COMPET -0.1207 0.3718 -0.32 0.749 1.4
S = 93.77 R-Sq = 85.9% R-Sq(adj) = 83.1%
No Problem!
4.6.3 Correction for Multicollinearity
� One solution would be to leave out one or more of the redundant predictors.
� Another would be to use the variables differently. If x1 and x2 are collinear, you might try using x1 and the ratio x2/x1 instead.
� Finally, there are specialized statistical procedures that can be used in place of ordinary least squares.
4.7 Lagged Variables as Explanatory Variables in Time-Series Regression
� When using time series data in a regression, the relationship between y and x may be concurrent, or x may serve as a leading indicator.
� In the latter, a past value of x appears as a predictor, either with or without the current value of x.
� An example would be the relationship between housing starts as y and interest rates as x. When rates drop, it is several months before housing starts increase.
Multiple Regression I
Lagged Variables
The effect of advertising on sales is often cumulative, so it would not be surprising to see it modeled as:
yt = β0 + β1 xt + β2 xt-1 + β3 xt-2 + et
Here xt is advertising in the current month and the lagged variables xt-1 and xt-2 represent advertising in the two previous months.
Potential Pitfalls
� If several lags of the same variable are used, it could cause multicollinearity if xt was highly autocorrelated (correlated with its own past values).
� Lagging causes lost data. If xt-2 is included in the model, the first time it can be computed is at time period t = 3. We lose any information in the first two observations.
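A rough Python/pandas sketch of how lagged predictors might be built (the column names and numbers are illustrative, not a data set from the text); note how the shift creates missing values in the first rows, which is exactly the lost-data issue described above:

import pandas as pd

df = pd.DataFrame({"sales": [110, 115, 123, 130, 128, 140],
                   "adv":   [10, 12, 11, 14, 13, 15]})
df["adv_lag1"] = df["adv"].shift(1)   # x_{t-1}
df["adv_lag2"] = df["adv"].shift(2)   # x_{t-2}
model_data = df.dropna()              # the first two observations are lost to lagging
print(model_data)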
Lagged y Values
� Sometimes a past value of y is used as a predictor as well. A relationship of this type might be:
yt = β0 + β1 yt-1 + β2 xt + β3 xt-1 + et
� This implies that this month's sales yt are related to two months of advertising expense, xt and xt-1, plus last month's sales yt-1.
Multiple Regression I
Example 4.6 Unemployment Rate
� The file UNEMP4 contains the national unemployment rates (seasonally-adjusted) from January 1983 through December 2002.
� On the next few slides are a time series plot of the data and regression models employing first and second lags of the rates.
Time Series Plot
[Time series plot of the unemployment rate (UNEMP) by month. Autocorrelation is .97 at lag 1 and .94 at lag 2.]
Regression With First Lag
The regression equation is
UNEMP = 0.153 + 0.971 Unemp1
239 cases used 1 cases contain missing values
Predictor Coef SE Coef T P
Constant 0.15319 0.04460 3.44 0.001
Unemp1 0.971495 0.007227 134.43 0.000
S = 0.1515 R-Sq = 98.7% R-Sq(adj) = 98.7%
Analysis of Variance
Source DF SS MS F P
Regression 1 414.92 414.92 18070.47 0.000
Residual Error 237 5.44 0.02
Total 238 420.36
High R2 because of autocorrelation
Multiple Regression I
Regression With Two Lags
The regression equation is
UNEMP = 0.168 + 0.890 Unemp1 + 0.0784 Unemp2
238 cases used 2 cases contain missing values
Predictor Coef SE Coef T P VIF
Constant 0.16764 0.04565 3.67 0.000
Unemp1 0.89032 0.06497 13.70 0.000 77.4
Unemp2 0.07842 0.06353 1.23 0.218 77.4
S = 0.1514 R-Sq = 98.7% R-Sq(adj) = 98.6%
Analysis of Variance
Source DF SS MS F P
Regression 2 395.55 197.77 8630.30 0.000
Residual Error 235 5.39 0.02
Total 237 400.93
Comments
� It does not appear that the second lag term is needed. Its t statistic is 1.23.
� Because we got R2 = 98.7% from the model with just one term, there was not much variation left for the second lag term to explain.
� Note that the second model also had a lot of multicollinearity.
Fitting Curves to Data ١٧١
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Chapter 5: Fitting Curves to Data
Terry Dielman
Applied Regression Analysis: A Second Course in Business and Economic Statistics, fourth edition
Multiple Regression I
Fitting Curves to Data ١٧٢
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
5.1 Introduction
In Chapter 4, the model was presented as:
yi = β0 + β1 x1i + β2 x2i + ... + βk xki + ei
where we assumed linear relationships between y and the x variables.
In this chapter we find that this may not be true and consider curvilinear relationships between the variables.
Fitting Curves to Data ١٧٣
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Modeling
� In general, we regress Y on some function of X which is not linear.
� Common functions are X², 1/X or log(X)
� In economics, we sometimes regress log(y) on log(x)
Fitting Curves to Data ١٧٤
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
5.2 Fitting Curvilinear Relationships
� Polynomial Regression – a common correction for nonlinearity is to add powers of the explanatory variable
� In practice a second-order model is often sufficient to describe the relationship
yi = β0 + β1 xi + β2 xi² + ... + βk xi^k + ei
Multiple Regression I
Fitting Curves to Data ١٧٥
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Example 5.1: Telemarketing
n = 20 telemarketing employees
Y = average calls per day over 20 workdays
X = Months on the job
Data set TELEMARKET5
Fitting Curves to Data ١٧٦
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Plot of Calls versus Months
[Scatterplot of CALLS versus MONTHS.]
There is an increase in calls with experience, but the rate of increase slows over time.
Fitting Curves to Data ١٧٧
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Fit of a First-Order Model
� For comparison purposes, we first fit the linear equation and obtained:
CALLS = 13.6708 + .7435 MONTHS
� This equation, which has an R2 of 87.4%, implies that each month of experience leads to .7435 more calls per day.
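For readers working outside Minitab, a hedged sketch of fitting a second-order model in Python follows; the (months, calls) pairs are made-up stand-ins, since TELEMARKET5 is not reproduced here:

import numpy as np

months = np.array([10, 12, 15, 18, 20, 22, 25, 28, 30])   # illustrative values only
calls  = np.array([19, 21, 24, 26, 28, 29, 30, 31, 31])

# np.polyfit returns the highest power first: [b2, b1, b0] for b0 + b1*x + b2*x^2
b2, b1, b0 = np.polyfit(months, calls, deg=2)
print(f"CALLS = {b0:.2f} + {b1:.3f}*MONTHS + ({b2:.5f})*MONTHS**2")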
Multiple Regression I
Fitting Curves to Data ١٧٨
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Fitting a Second-Order Model
Regression Plot
[Fitted quadratic curve over the CALLS versus MONTHS scatterplot.]
CALLS = -0.140471 + 2.31020 MONTHS - 0.0401182 MONTHS**2
S = 1.00325   R-Sq = 96.2 %   R-Sq(adj) = 95.8 %
Fitting Curves to Data ١٧٩
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Regression Output
Regression Analysis: CALLS versus MONTHS, MonthSQ
The regression equation is
CALLS = - 0.14 + 2.31 MONTHS - 0.0401 MonthSQ
Predictor Coef SE Coef T P
Constant -0.140 2.323 -0.06 0.952
MONTHS 2.3102 0.2501 9.24 0.000
MonthSQ -0.040118 0.006333 -6.33 0.000
S = 1.003 R-Sq = 96.2% R-Sq(adj) = 95.8%
Analysis of Variance
Source DF SS MS F P
Regression 2 437.84 218.92 217.50 0.000
Residual Error 17 17.11 1.01
Total 19 454.95
Fitting Curves to Data ١٨٠
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Hypothesis Test on β2
H0: β2 = 0 (Use the linear equation)
Ha: β2 ≠ 0 (Quadratic has improved fit)
Test as usual with t = b2/SE(b2)
Here t = -.0401/.00633 = -6.33 is significant with p-value = .000
Not surprising since R2 increased 9%
Multiple Regression I
Fitting Curves to Data ١٨١
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Hypothesis Tests "Top Down"
� The usual practice is to keep lower-order terms when a high-order term is significant.
� In b0 + b1 x + b2 x2 we would retain the b1 term even if it had an insignificant t-ratio, if the b2 term was significant.
Fitting Curves to Data ١٨٢
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Higher and higher?
� To see if the polynomial should have an even higher order, we fit a cubic equation.
� The table below shows the second-order model was sufficient.
Model       p for highest order term   R2      Adj R2   Se
Linear 0.000 87.4% 86.7% 1.787
Quadratic 0.000 96.2% 95.8% 1.003
Cubic 0.509 96.3% 95.7% 1.020
Fitting Curves to Data ١٨٣
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Centering the X
� When polynomial regression is used, multicollinearity often results because x and x2 are correlated.
� This can be eliminated by subtracting x-bar (the mean) from each x
Use (x – x-bar) and (x – x-bar)²
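A small numeric illustration (synthetic x values, not the textbook data) of how centering removes most of the correlation between x and its square:

import numpy as np

x = np.linspace(1, 30, 20)              # positive x values, as in the example
print(np.corrcoef(x, x**2)[0, 1])       # close to 1: strong collinearity
xc = x - x.mean()
print(np.corrcoef(xc, xc**2)[0, 1])     # near 0 after centering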
Multiple Regression I
Fitting Curves to Data ١٨٤
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
5.2.2 Reciprocal Transformation of the x Variable
� Another curvilinear relationship that is in common use is:
� Here y and x are inversely related but the relationship is not linear.
yi = β0 + β1 (1/xi) + ei
Fitting Curves to Data ١٨٥
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Example 5.2
� We are interested in the relationship between gas mileage and a car's horsepower.
� On the next page is a plot of the highway mpg (HWYMPG) and horsepower (HP) for 147 cars listed in the October 2002 Road and Track.
Fitting Curves to Data ١٨٦
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Highway MPG versus Horsepower
[Scatterplot of HWYMPG versus HP.]
Multiple Regression I
Fitting Curves to Data ١٨٧
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Modeling the Relationship
� A regression of HWYMPG on HP yields HWYMPG = 38.73 - .0477 HP with R2 = 59.4%
� This does not fit too well because as horsepower increases, mileage decreases, but the rate of decrease is slower for more-powerful cars.
� Although other models, including a quadratic, might work, we regressed HWYMPG on 1/HP.
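A sketch of the reciprocal-transformation fit in Python; the (HP, MPG) pairs are illustrative only, since the Road and Track data are not included here:

import numpy as np

hp  = np.array([100, 130, 160, 200, 250, 300, 400, 500], dtype=float)   # illustrative
mpg = np.array([40, 35, 31, 27, 24, 22, 19, 18], dtype=float)

hpinv = 1.0 / hp                        # the transformed predictor
b1, b0 = np.polyfit(hpinv, mpg, deg=1)  # fit HWYMPG = b0 + b1*(1/HP)
print(f"HWYMPG = {b0:.2f} + {b1:.1f} HPINV")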
Fitting Curves to Data ١٨٨
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Regression Results
The regression equation is
HWYMPG = 13.6 + 2962 HPINV
Predictor Coef SE Coef T P
Constant 13.6310 0.6493 20.99 0.000
HPINV 2962.4675 111.7526 24.09 0.000
S = 2.93107 R-Sq = 80.0% R-Sq(adj) = 79.9%
Analysis of Variance
Source DF SS MS F P
Regression 1 4987.0 4987.0 580.48 0.000
Residual Error 145 1245.1 8.6
Total 146 6232.7
Fitting Curves to Data ١٨٩
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Data and Reciprocal Fit
[Scatterplot of HWYMPG versus HP with the fitted reciprocal curve.]
Multiple Regression I
Fitting Curves to Data ١٩٠
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
5.2.3 Log Transformation of the x Variable
� Yet another curvilinear equation is:
yi = β0 + β1 ln(xi) + ei
where ln(x) is the natural logarithm of x.
� It is assumed that the x values are positive because ln(0) is undefined.
Fitting Curves to Data ١٩١
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Example 5.4 Fuel Consumption
n = 51 (50 states plus Washington, D.C.)
FUELCON = fuel consumption per capita
POP = state population
AREA = area of state in square miles
POPDENS = population density
Data Set FUELCON5
Fitting Curves to Data ١٩٢
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Plot of Fuelcon versus Density
[Scatterplot of FUELCON versus DENSITY.]
r = -.454
Multiple Regression I
Fitting Curves to Data ١٩٣
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Effect of the Transformation
� The graph has one point (D.C.) on the right with all others clumped to the left.
� It is hard to see what type of relationship there is until some adjustments are made.
� Here we take the natural log of density to "pull" the extreme point back in.
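In code terms the adjustment is just a log transform of the predictor before fitting; a sketch with assumed values (FUELCON5 is not reproduced):

import numpy as np

density = np.array([5, 20, 60, 150, 400, 1000, 9000], dtype=float)   # illustrative
fuelcon = np.array([650, 600, 560, 520, 480, 440, 330], dtype=float)

logdens = np.log(density)                     # natural log pulls the extreme point back in
b1, b0 = np.polyfit(logdens, fuelcon, deg=1)
print(f"FUELCON = {b0:.1f} + ({b1:.1f}) LOGDENS")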
Fitting Curves to Data ١٩٤
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Consumption versus Logdensity
[Scatterplot of FUELCON versus LogDensity.]
r = -.527
Fitting Curves to Data ١٩٥
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Linear and Log Regressions
The regression equation is
FUELCON = 466 - 0.025 DENSITY
Predictor Coef SE Coef T P
Constant 465.628 9.481 52.28 0.000
DENSITY -0.025 0.007 -3.56 0.001
S = 65.1675 R-Sq = 20.6% R-Sq(adj) = 19.0%
The regression equation is
FUELCON = 597 – 24.5 LOGDENS
Predictor Coef SE Coef T P
Constant 597.19 29.96 22.15 0.000
LOGDENS -24.53 5.65 -4.34 0.000
S = 62.1561 R-Sq = 27.8% R-Sq(adj) = 26.3%
Multiple Regression I
Fitting Curves to Data ١٩٦
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
5.2.4 Log Transformations of Both the y and x Variables
� Here the natural log of y is the dependent variable and the natural log of x is the independent variable:
ln(yi) = β0 + β1 ln(xi) + ei
� Comparing results with other models may be difficult since we are not modeling y itself.
� Economists sometimes use this to estimate price elasticity (y is demand and x is price; b1 is the estimated elasticity).
Fitting Curves to Data ١٩٧
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Example 5.4 Imports and GDP
The gross domestic product (GDP) and dollar amount of total imports (IMPORTS) for 25 countries were obtained from the World Fact Book.
For both variables, low values clump together and higher values spread out, suggesting log transformations for both.
Fitting Curves to Data ١٩٨
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Scatterplot of Imports vs GDP
[Scatterplot of IMPORTS versus GDP.]
Multiple Regression I
Fitting Curves to Data ١٩٩
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Scatterplot of LogImp vs LogGDP
[Scatterplot of LogImp versus LogGDP.]
Fitting Curves to Data ٢٠٠
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Two Regression Models
Regression Analysis: IMPORTS versus GDP
Predictor Coef SE Coef T P
Constant 22.32 19.24 1.16 0.258
GDP 0.105671 0.008452 12.50 0.000
S = 87.00 R-Sq = 87.2% R-Sq(adj) = 86.6%
Regression Analysis: LogImp versus LogGDP
Predictor Coef SE Coef T P
Constant -1.1275 0.4346 -2.59 0.016
LogGDP 0.86703 0.07877 11.01 0.000
S = 0.9142 R-Sq = 84.0% R-Sq(adj) = 83.4%
Not directly comparable
Fitting Curves to Data ٢٠١
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
The R2 Compare Different Things
� The 87.2 % R2 for the no-log model is the percentage of variation in Imports explained.
� The 84.0% for the second model is the percentage of variation in ln(Imports) explained.
� If you converted the fitted values of the second model back to Imports you might find the log model better.
Multiple Regression I
Fitting Curves to Data ٢٠٢
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
What Transformation to Use
� It is probably best to try several.
� A quadratic is most flexible because it uses two parameters to fit the relationship between y and x.
� Some further analysis is in Chapter 6 where tests for nonlinearity are discussed.
Fitting Curves to Data ٢٠٣
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
5.2.5 Fitting Curved Trends
If the data is collected over time, we may want to consider variations on the linear trend model of Chapter 3.
Quadratic trend: yt = β0 + β1 t + β2 t² + et
Another is the S-Curve trend: yt = exp(β0 + β1 (1/t) + et)
Fitting Curves to Data ٢٠٤
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
S Curve Model
Many products have a demand curve like this.
1. Initial demand increases slowly
2. As product matures, demand picks up and steadily grows.
3. At some saturation point demand levels off.
Multiple Regression I
Fitting Curves to Data ٢٠٥
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Exponential Growth Model
Another alternative is an exponential trend:
This can be fit by least squares if you model ln(y).
yt = exp(β0 + β1 t + et)
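A minimal sketch of fitting the exponential trend by least squares on ln(y), using a synthetic series purely for illustration:

import numpy as np

t = np.arange(1, 21)
y = 50 * np.exp(0.08 * t) * np.exp(np.random.default_rng(1).normal(0, 0.05, 20))

b1, b0 = np.polyfit(t, np.log(y), deg=1)   # fit ln(y_t) = b0 + b1*t
print(f"estimated growth rate per period: {b1:.3f}, level: {np.exp(b0):.1f}")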
Checking Assumptions ٢٠٦
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Chapter 6: Assessing the Assumptions of the Regression Model
Terry Dielman
Applied Regression Analysis for Business and Economics
Checking Assumptions ٢٠٧
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
6.1 Introduction
In Chapter 4 the multiple linear regression model was presented as
yi = β0 + β1 x1i + β2 x2i + ... + βk xki + ei
Certain assumptions were made about how the errors ei behaved. In this chapter we will check to see if those assumptions appear reasonable.
Multiple Regression I
Checking Assumptions ٢٠٨
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
6.2 Assumptions of the Multiple Linear Regression Model
a. We expect the average disturbance ei to be zero so the regression line passes through the average value of Y.
b. The disturbances have constant variance σe².
c. The disturbances are normally distributed.
d. The disturbances are independent.
Checking Assumptions ٢٠٩
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
6.3 The Regression Residuals
� We cannot check to see if the disturbances ei behave correctly because they are unknown.
� Instead, we work with their sample counterpart, the residuals
êi = yi − ŷi
which represent the unexplained variation in the y values.
Checking Assumptions ٢١٠
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Properties
Property 1: They will always average 0
because the least squares estimation procedure makes that happen.
Property 2: If assumptions a, b and d of Section 6.2 are true then the residuals should be randomly distributed around their mean of 0. There should be no systematic pattern in a residual plot.
Property 3: If assumptions a through d hold, the residuals should look like a random sample from a normal distribution.
Multiple Regression I
Checking Assumptions ٢١١
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Suggested Residual Plots
1. Plot the residuals versus each explanatory variable.
2. Plot the residuals versus the predicted values.
3. For data collected over time or in any other sequence, plot the residuals in that sequence.
In addition, a histogram and box plot are useful for assessing normality.
Checking Assumptions ٢١٢
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Standardized residuals
� The residuals can be standardized by dividing by their standard error.
� This will not change the pattern in a plot but will affect the vertical scale.
� Standardized residuals are always scaled so that most are between -2 and +2 as in a standard normal distribution.
Checking Assumptions ٢١٣
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
A plot meeting property 2
[Plot of residuals versus X: a. mean of 0, b. same scatter, d. no pattern with X.]
Multiple Regression I
Checking Assumptions ٢١٤
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
A plot showing a violation
[Plot of standardized residuals versus MONTHS (response is CALLS).]
Checking Assumptions ٢١٥
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
6.4 Checking Linearity
� Although sometimes we can see evidence of nonlinearity in an X-Y scatterplot, in other cases we can only see it in a plot of the residuals versus X.
� If the plot of the residuals versus an X shows any kind of pattern, it both shows a violation and a way to improve the model.
Checking Assumptions ٢١٦
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Example 6.1: Telemarketing
n = 20 telemarketing employees
Y = average calls per day over 20 workdays
X = Months on the job
Data set TELEMARKET6
Multiple Regression I
Checking Assumptions ٢١٧
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Plot of Calls versus Months
[Scatterplot of CALLS versus MONTHS.]
There is some curvature, but it is masked by the more obvious linearity.
Checking Assumptions ٢١٨
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
If you are not sure, fit the linear model and save the residuals
The regression equation is
CALLS = 13.7 + 0.744 MONTHS
Predictor Coef SE Coef T P
Constant 13.671 1.427 9.58 0.000
MONTHS 0.74351 0.06666 11.15 0.000
S = 1.787 R-Sq = 87.4% R-Sq(adj) = 86.7%
Analysis of Variance
Source DF SS MS F P
Regression 1 397.45 397.45 124.41 0.000
Residual Error 18 57.50 3.19
Total 19 454.95
Checking Assumptions ٢١٩
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Residuals from model
With the linearity "taken out" the curvature is more obvious
Multiple Regression I
Checking Assumptions ٢٢٠
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
6.4.2 Tests for lack of fit
� The residuals contain the variation in the sample of Y values that is not explained by the Yhat equation.
� This variation can be attributed to many things, including:
• natural variation (random error)
• omitted explanatory variables
• incorrect form of model
Checking Assumptions ٢٢١
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Lack of fit
� If nonlinearity is suspected, there are tests available for lack of fit.
� Minitab has two versions of this test, one requiring there to be repeated observations at the same X values.
� These are on the Options submenu off the Regression menu
Checking Assumptions ٢٢٢
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
The pure error lack of fit test
� In the 20 observations for the telemarketing data, there are two at 10, 20 and 22 months, and four at 25 months.
� These replicates allow the SSE to be decomposed into two portions, "pure error" and "lack of fit".
Multiple Regression I
Checking Assumptions ٢٢٣
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
The test
H0: The relationship is linear
Ha: The relationship is not linear
The test statistic follows an F distribution with c – k – 1 numerator df and n – c denominator df
c = number of distinct levels of X
n = 20 and there were 6 replicates so c = 14
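The F statistic can be reproduced by hand from the ANOVA quantities reported in the Minitab output on the next slide; a short sketch of that arithmetic:

# Pure-error lack-of-fit F statistic, using the sums of squares from the
# Minitab output that follows (n = 20, c = 14, k = 1).
n, c, k = 20, 14, 1
ss_lack_of_fit = 52.50          # lack-of-fit portion of SSE
ss_pure_error  = 5.00           # pure-error portion (from the replicates)
f_stat = (ss_lack_of_fit / (c - k - 1)) / (ss_pure_error / (n - c))
print(round(f_stat, 2))         # 5.25, matching the Minitab output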
Checking Assumptions ٢٢٤
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Minitab's output
The regression equation is
CALLS = 13.7 + 0.744 MONTHS
Predictor Coef SE Coef T P
Constant 13.671 1.427 9.58 0.000
MONTHS 0.74351 0.06666 11.15 0.000
S = 1.787 R-Sq = 87.4% R-Sq(adj) = 86.7%
Analysis of Variance
Source DF SS MS F P
Regression 1 397.45 397.45 124.41 0.000
Residual Error 18 57.50 3.19
Lack of Fit 12 52.50 4.38 5.25 0.026
Pure Error 6 5.00 0.83
Total 19 454.95
Checking Assumptions ٢٢٥
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Test results
At a 5% level of significance, the critical value (from F12, 6 distribution) is 4.00.
The computed F of 5.25 is significant (p value of .026) so we conclude the relationship is not linear.
Multiple Regression I
Checking Assumptions ٢٢٦
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Tests without replication
� Minitab also has a series of lack of fit tests that can be applied when there is no replication.
� When they are applied here, these messages appear:
� The small p values suggest lack of fit.
Lack of fit test
Possible curvature in variable MONTHS (P-Value = 0.000)
Possible lack of fit at outer X-values (P-Value = 0.097)
Overall lack of fit test is significant at P = 0.000
Checking Assumptions ٢٢٧
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
6.4.3 Corrections for nonlinearity
� If the linearity assumption is violated, the appropriate correction is not always obvious.
� Several alternative models were presented in Chapter 5.
� In this case, it is not too hard to see that adding an X2 term works well.
Checking Assumptions ٢٢٨
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Quadratic model
The regression equation is
CALLS = - 0.14 + 2.31 MONTHS - 0.0401 MonthSQ
Predictor Coef SE Coef T P
Constant -0.140 2.323 -0.06 0.952
MONTHS 2.3102 0.2501 9.24 0.000
MonthSQ -0.040118 0.006333 -6.33 0.000
S = 1.003 R-Sq = 96.2% R-Sq(adj) = 95.8%
Analysis of Variance
Source DF SS MS F P
Regression 2 437.84 218.92 217.50 0.000
Residual Error 17 17.11 1.01
Total 19 454.95
No evidence of lack of fit (P > 0.1)
Multiple Regression I
Checking Assumptions ٢٢٩
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Residuals from quadratic model
[Plot of residuals from the quadratic model versus MONTHS.]
No violations evident
Checking Assumptions ٢٣٠
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
6.5 Check for constant variance
� Assumption b states that the errors ei should have the same variance everywhere.
� This implies that if residuals are plotted against an explanatory variable, the scatter should be the same at each value of the X variable.
� In economic data, however, it is fairly common to see that a variable that increases in value often will also increase in scatter.
Checking Assumptions ٢٣١
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Example 6.3 FOC Sales
n = 265 months of sales data for a fibre-optic company
Y = Sales
X = Mon (1 through 265)
Data set FOCSALES6
Multiple Regression I
Checking Assumptions ٢٣٢
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Data over time
[Time series plot of SALES by month.]
Note: This uses Minitab’s Time Series Plot
Checking Assumptions ٢٣٣
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Residual plot
[Plot of residuals versus Mon (response is SALES).]
Checking Assumptions ٢٣٤
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Implications
� When the errors ei do not have a constant variance, the usual statistical properties of the least squares estimates may not hold.
� In particular, the hypothesis tests on the model may provide misleading results.
Multiple Regression I
Checking Assumptions ٢٣٥
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
6.5.2 A Test for Nonconstant Variance
� Szroeter developed a test that can be applied if the observations appear to increase in variance according to some sequence (often, over time).
� To perform it, save the residuals, square them, then multiply by i (the observation number).
� Details are in the text.
Checking Assumptions ٢٣٦
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
6.5.3 Corrections for Nonconstant Variance
Several common approaches for correcting nonconstant variance are:
1. Use ln(y) instead of y
2. Use √y instead of y
3. Use some other power of y, yp, where the Box-Cox method is used to determine the value for p.
4. Regress (y/x) on (1/x)
Checking Assumptions ٢٣٧
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
LogSales over time
[Time series plot of LogSales by month.]
Multiple Regression I
Checking Assumptions ٢٣٨
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Residuals from Regression
[Plot of residuals versus Mon (response is LogSales).]
This looks real good after I put this text box on top of those six large outliers.
Checking Assumptions ٢٣٩
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
6.6 Assessing the Assumption That the Disturbances are Normally Distributed
� There are many tools available to check the assumption that the disturbances are normally distributed.
� If the assumption holds, the standardized residuals should behave like they came from a standard normal distribution.
– about 68% between -1 and +1
– about 95% between -2 and +2
– about 99% between -3 and +3
Checking Assumptions ٢٤٠
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
6.6.1 Using Plots to Assess Normality
� You can plot the standardized residuals versus fitted values and count how many are beyond -2 and +2; about 1 in 20 would be the usual case.
� Minitab will do this for you if you ask it to check for unusual observations (those flagged by an R have a standardized residual beyond ±2).
Multiple Regression I
Checking Assumptions ٢٤١
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Other tools
� Use a Normal Probability plot to test for normality.
� Use a histogram (perhaps with a superimposed normal curve) to look at shape.
� Use a Boxplot for outlier detection. It will show all outliers with an *.
Checking Assumptions ٢٤٢
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Example 6.5 Communication Nodes
Data in COMNODE6
n = 14 communication networks
Y = Cost
X1 = Number of ports
X2 = Bandwidth
Checking Assumptions ٢٤٣
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Regression with unusuals flagged
The regression equation is
COST = 17086 + 469 NUMPORTS + 81.1 BANDWIDTH
Predictor Coef SE Coef T P
Constant 17086 1865 9.16 0.000
NUMPORTS 469.03 66.98 7.00 0.000
BANDWIDT 81.07 21.65 3.74 0.003
S = 2983 R-Sq = 95.0% R-Sq(adj) = 94.1%
Analysis of Variance
(deleted)
Unusual Observations
Obs NUMPORTS COST Fit SE Fit Residual St Resid
1 68.0 52388 53682 2532 -1294 -0.82 X
10 24.0 23444 29153 1273 -5709 -2.12R
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
Multiple Regression I
Checking Assumptions ٢٤٤
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Residuals versus fits (from regression graphs)
[Plot of standardized residuals versus fitted values (response is COST).]
Checking Assumptions ٢٤٥
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
6.6.2 Tests for normality
� There are several formal tests for the hypothesis that the disturbances ei are normal versus nonnormal.
� These are often accompanied by graphs* which are scaled so that data which are normally-distributed appear in a straight line.
* Your Minitab output may appear a little different depending on whether you have the student or professional version, and which release you have.
Checking Assumptions ٢٤٦
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Normal plot (from regression graphs)
[Normal probability plot of the standardized residuals (response is COST).]
If normal, should follow straight line
Multiple Regression I
Checking Assumptions ٢٤٧
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Normal probability plot (graph menu)
[Normal probability plot for SRES1, ML estimates with 95% CI: Mean = -0.0547797, StDev = 1.02044, AD* = 1.187.]
Checking Assumptions ٢٤٨
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Test for Normality (Basic Statistics Menu)
Anderson-Darling Normality Test: A-Squared = 0.463, P-Value = 0.216 (N = 14, Average = -0.0547797, StDev = 1.05896)
[Normal probability plot of SRES1.]
Accepts H0: Normality
Checking Assumptions ٢٤٩
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Example 6.7 S&L Rate of Return
Data set SL6
n = 35 Savings and Loan stocks
Y = rate of return for 5 years ending 1982
X1 = the "Beta" of the stock
X2 = the "Sigma" of the stock
Beta is a measure of nondiversifiable risk and Sigma a measure of total risk
Multiple Regression I
Checking Assumptions ٢٥٠
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Basic exploration
[Scatterplots of RETURN versus BETA and RETURN versus SIGMA.]
Correlations: RETURN, BETA, SIGMA
RETURN BETA
BETA 0.180
SIGMA 0.351 0.406
Checking Assumptions ٢٥١
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Not much explanatory power
The regression equation is
RETURN = - 1.33 + 0.30 BETA + 0.231 SIGMA
Predictor Coef SE Coef T P
Constant -1.330 2.012 -0.66 0.513
BETA 0.300 1.198 0.25 0.804
SIGMA 0.2307 0.1255 1.84 0.075
S = 2.377 R-Sq = 12.5% R-Sq(adj) = 7.0%
Analysis of Variance
(deleted)
Unusual Observations
Obs BETA RETURN Fit SE Fit Residual St Resid
19 2.22 0.300 -0.231 2.078 0.531 0.46 X
29 1.30 13.050 2.130 0.474 10.920 4.69R
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
Checking Assumptions ٢٥٢
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
One in every crowd?
[Plot of standardized residuals versus fitted values (response is RETURN).]
Multiple Regression I
Checking Assumptions ٢٥٣
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Normality Test
Anderson-Darling Normality Test: A-Squared = 2.235, P-Value = 0.000 (N = 35, Average = 0.0000000, StDev = 2.30610)
[Normal probability plot of RESI1.]
Reject H0: Normality
Checking Assumptions ٢٥٤
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
6.6.3 Corrections for Nonnormality
� Normality is not necessary for making inference with large samples.
� It is required for inference with small samples.
� The remedies are similar to those used to correct for nonconstant variance.
Checking Assumptions ٢٥٥
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
6.7 Influential Observations
� In minimizing SSE, the least squares procedure tries to avoid large residuals.
� It thus "pays a lot of attention" to y values that don't fit the usual pattern in the data. Refer to the example in Figures 6.42(a) and 6.42(b).
� That probably also happened in the S&L data where the one very high return masked the relationship between rate of return, beta and sigma for the other 34 stocks.
Multiple Regression I
Checking Assumptions ٢٥٦
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
6.7.2 Identifying outliers
� Minitab flags any residual bigger than 2 in absolute value as a potential outlier.
� A boxplot of the residuals uses a slightly different rule, but should give similar results.
� There is also a third type of residual that is often used for this purpose.
Checking Assumptions ٢٥٧
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Deleted residuals
� If you (temporarily) eliminate the ith
observation from the data set, it cannot influence the estimation process.
� You can then compute a "deleted" residual to see if this point fits the pattern in the other observations.
Checking Assumptions ٢٥٨
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Deleted Residual Illustration
The regression equation is
ReturnWO29 = - 2.51 + 0.846 BETA + 0.232 SIGMA
34 cases used 1 cases contain missing values
Predictor Coef SE Coef T P
Constant -2.510 1.153 -2.18 0.037
BETA 0.8463 0.6843 1.24 0.225
SIGMA 0.23220 0.07135 3.25 0.003
S = 1.352 R-Sq = 37.2% R-Sq(adj) = 33.1%
Without observation 29, we get a much better fit.
Predicted Y29 = -2.51 + .846(1.2973) + .232(13.3110) = 1.678
Prediction SE is 1.379
Deleted residual29 = (13.05 – 1.678)/1.379 = 8.24
Multiple Regression I
Checking Assumptions ٢٥٩
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
The influence of observation 29
� When it was temporarily removed, the R2 went from 12.5% to 37.2% and we got a very different equation
� The deleted residual for this observation was a whopping 8.24, which shows it had a lot of weight in determining the original equation.
Checking Assumptions ٢٦٠
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
6.7.3 Identifying Leverage Points
� Outliers have unusual y values; data points with unusual X values are said to have leverage. Minitab flags these with an X.
� These points can have a lot of influence in determining the Yhatequation, particularly if they don't fit well. Minitab would flag these with both an R and an X.
Checking Assumptions ٢٦١
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Leverage
� The leverage of the ith observation is hi (it is hard to show where this comes from without matrix algebra).
� If h > 2(K+1)/n it has high leverage.
� For the S&L returns, k = 2 and n = 35 so the benchmark is 2(3)/35 = .171
� Observation 19 has a very small value for Sigma, which is why it has h19 = .764
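A hedged sketch of how the leverages hi could be computed directly as the diagonal of the hat matrix X(X'X)^(-1)X'; the design matrix here is made up, not the S&L data:

import numpy as np

# Illustrative design matrix: intercept plus two predictors (Beta and Sigma stand-ins)
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(35), rng.normal(1.2, 0.4, 35), rng.normal(12, 4, 35)])

H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix
h = np.diag(H)                           # leverage of each observation
k = X.shape[1] - 1
cutoff = 2 * (k + 1) / X.shape[0]        # 2(K+1)/n benchmark
print(np.where(h > cutoff)[0])           # indices of high-leverage observations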
Multiple Regression I
Checking Assumptions ٢٦٢
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
6.7.4 Combined Measures
� The effect of an observation on the regression line is a function of both the y and X values.
� Several statistics have been developed that attempt to measure combined influence.
� The DFIT statistic and Cook's D are two of the more popular measures.
Checking Assumptions ٢٦٣
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
The DFIT statistic
� The DFIT statistic is a function of both the residual and the leverage.
� Minitab can compute and save these under "Storage".
� Sometimes a cutoff is used, but it is perhaps best just to look for values that are high.
Checking Assumptions ٢٦٤
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
DFIT Graphed
[Plot of DFIT values by observation number; observations 19 and 29 stand out.]
Multiple Regression I
Checking Assumptions ٢٦٥
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Cook's D
� Often called Cook's Distance
� Minitab also will compute these and store them.
� Again, it might be best just to look for high values rather than use a cutoff.
Checking Assumptions ٢٦٦
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Cook's D Graphed
[Plot of Cook's D values by observation number; observations 19 and 29 stand out.]
Checking Assumptions ٢٦٧
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
6.7.5 What to do with Unusual Observations
� Observation 19 (First Lincoln Financial Bank) has high influence because of its very low Sigma.
� Observation 29 (Mercury Saving) had a very high return of 13.05 but its Beta and Sigma were not unusual.
� Since both values are out of line with the other S&L banks, they may represent data recording errors.
Multiple Regression I
Checking Assumptions ٢٦٨
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Eliminate? Adjust?
� If you can do further research you might find out the true story.
� You should eliminate an outlier data point only when you are convinced it does not belong with the others (for example, if Mercury was speculating wildly).
� An alternative is to keep the data point but add an indicator variable to the model that signals there is something unusual about this observation.
Checking Assumptions ٢٦٩
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
6.8 Assessing the Assumption That the Disturbances are Independent
� If the disturbances are independent, the residuals should not display any patterns.
� One such pattern was the curvature in the residuals from the linear model in the telemarketing example.
� Another pattern occurs frequently in data collected over time.
Checking Assumptions ٢٧٠
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
6.8.1 Autocorrelation
� In time series data we often find that the disturbances tend to stay at the same level over consecutive observations.
� If this feature, called autocorrelation, is present, all our model inferences may be misleading.
Multiple Regression I
Checking Assumptions ٢٧١
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
First-order autocorrelation
If the disturbances have first-order autocorrelation, they behave as:
ei = ρ ei-1 + µi
where µi is a disturbance with expected value 0 and independent over time.
Checking Assumptions ٢٧٢
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
The effect of autocorrelation
If you knew that e56 was 10 and ρ was .7, you would expect e57 to be 7 instead of zero.
This dependence can lead to high standard errors for the bj coefficients and wider confidence intervals.
Checking Assumptions ٢٧٣
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
6.8.2 A Test for First-Order Autocorrelation
Durbin and Watson developed a test for positive autocorrelation of the form:
H0: ρ = 0Ha: ρ > 0
Their test statistic d is scaled so that it is 2 if no autocorrelation is present and near 0 if it is very strong.
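The d statistic is easy to compute from saved residuals; a sketch (the residual series here is simulated, purely for illustration):

import numpy as np

def durbin_watson(e):
    # d = sum of squared successive differences divided by the residual sum of squares
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# near 2 for independent residuals, near 0 under strong positive autocorrelation
print(durbin_watson(np.random.default_rng(3).normal(size=100)))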
Multiple Regression I
Checking Assumptions ٢٧٤
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
A Three-Part Decision Rule
The Durbin-Watson test distribution depends on n and K. The tables (Table B.7) list two decision points dL and dU.
If d < dL reject H0 and conclude there is positive autocorrelation.
If d > dU accept H0 and conclude there is no autocorrelation.
If dL ≤ d ≤ dU the test is inconclusive.
Checking Assumptions ٢٧٥
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Example 6.10 Sales and Advertising
n = 36 years of annual data
Y = Sales (in million $)
X = Advertising expenditures ($1000s)
Data in Table 6.6
Checking Assumptions ٢٧٦
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
The Test
n = 36 and K = 1 X-variable
At a 5% level of significance, Table B.7 gives dL = 1.41 and dU = 1.52
Decision Rule:
Reject H0 if d < 1.41
Accept H0 if d > 1.52
Inconclusive if 1.41 ≤ d ≤ 1.52
Multiple Regression I
Checking Assumptions ٢٧٧
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Regression With DW Statistic
The regression equation is
Sales = - 633 + 0.177 Adv
Predictor Coef SE Coef T P
Constant -632.69 47.28 -13.38 0.000
Adv 0.177233 0.007045 25.16 0.000
S = 36.49 R-Sq = 94.9% R-Sq(adj) = 94.8%
Analysis of Variance
Source DF SS MS F P
Regression 1 842685 842685 632.81 0.000
Residual Error 34 45277 1332
Total 35 887961
Unusual Observations
Obs Adv Sales Fit SE Fit Residual St Resid
1 5317 381.00 309.62 11.22 71.38 2.06R
15 6272 376.10 478.86 6.65 -102.76 -2.86R
R denotes an observation with a large standardized residual
Durbin-Watson statistic = 0.47 Significant autocorrelation
Checking Assumptions ٢٧٨
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Plot of Residuals over Time
[Time sequence plot of the standardized residuals.]
Shows first-order autocorrelation with r = .71
Checking Assumptions ٢٧٩
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
6.8.3 Correction for First-Order Autocorrelation
One popular approach creates a new y and x variable.
First, obtain an estimate of ρ. Here we use r = .71 from Minitab's Autocorrelation analysis.
Then compute yi* = yi – r yi-1
and xi* = xi – r xi-1
Multiple Regression I
Checking Assumptions ٢٨٠
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
First Observation Missing
Because the transformation depends on lagged y and x values, the first observation requires special handling.
The text suggests y1* = √(1 – r²) y1
and a similar computation for x1*
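A sketch of the transformation with r = .71, handling the first observation as the text suggests; the sales and advertising values shown are illustrative, not the Table 6.6 data:

import numpy as np

def ar1_transform(z, r):
    # z_i* = z_i - r*z_{i-1}, with the first observation set to sqrt(1 - r^2) * z_1
    z = np.asarray(z, dtype=float)
    zstar = z.copy()
    zstar[0] = np.sqrt(1 - r ** 2) * z[0]
    zstar[1:] = z[1:] - r * z[:-1]
    return zstar

sales = np.array([381.0, 390.2, 402.5, 415.8, 408.1])     # illustrative values only
adv   = np.array([5317.0, 5400.0, 5550.0, 5620.0, 5580.0])
r = 0.71
print(ar1_transform(sales, r), ar1_transform(adv, r))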
Checking Assumptions ٢٨١
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Other Approaches
� An alternative is to use an estimation technique (such as SAS's Autoreg procedure) that automatically adjusts for autocorrelation.
� A third option is to include a lagged value of y as an explanatory variable. In this model, the DW test is no longer appropriate.
Checking Assumptions ٢٨٢
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Regression With Lagged Sales as a Predictor
The regression equation is
Sales = - 234 + 0.0631 Adv + 0.675 LagSales
35 cases used 1 cases contain missing values
Predictor Coef SE Coef T P
Constant -234.48 78.07 -3.00 0.005
Adv 0.06307 0.02023 3.12 0.004
LagSales 0.6751 0.1123 6.01 0.000
S = 24.12 R-Sq = 97.8% R-Sq(adj) = 97.7%
Analysis of Variance
(deleted)
Unusual Observations
Obs Adv Sales Fit SE Fit Residual St Resid
15 6272 376.10 456.24 5.54 -80.14 -3.41R
16 6383 454.60 422.02 12.95 32.58 1.60 X
21 6794 512.00 559.41 4.46 -47.41 -2.00R
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
Multiple Regression I
Checking Assumptions ٢٨٣
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Residuals From Model With Lagged Sales
[Time sequence plot of the standardized residuals from the model with lagged sales.]
Now r = -.23 is not significant
Indicator Variables ٢٨٤
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Chapter 7: Using Indicator and Interaction Variables
Terry Dielman
Applied Regression Analysis: A Second Course in Business and Economic Statistics, fourth edition
Indicator Variables ٢٨٥
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
7.1 Using and Interpreting Indicator Variables
�Suppose some observations have a particular characteristic or attribute, while others do not.
�We can include this information in the regression model by using dummy or indicator variables.
Multiple Regression I
Indicator Variables ٢٨٦
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Add the info thru a coding scheme
Use a binary (dummy) variable to “indicate”when the characteristic is present
Di = 1 if observation i has the attribute
Di = 0 if observation i does not have it
Indicator Variables ٢٨٧
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
An Example
Di = 1 if individual i is employed
Di = 0 if individual i is not employed
We could do it the other way and use the "1" to indicate an unemployed individual.
Indicator Variables ٢٨٨
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Multiple Categories
� For multiple categories, use multiple indicators.
� For example, to indicate where a firm's stock is listed, we could define 3 indicator variables; one each for the NYSE, AMEX and NASDAQ.
� For computational reasons, we would include only two of these in the regression.
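A sketch of the coding step in pandas for the stock-exchange example just mentioned (the firm list is hypothetical); drop_first keeps indicators for only two of the three exchanges, as the slide recommends:

import pandas as pd

firms = pd.DataFrame({"exchange": ["NYSE", "AMEX", "NASDAQ", "NYSE", "NASDAQ"]})
# drop_first=True treats the first category as the base and keeps the other two indicators
dummies = pd.get_dummies(firms["exchange"], drop_first=True)
print(dummies)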
Multiple Regression I
Indicator Variables ٢٨٩
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Example 7.1 Employment Discrimination
If two groups have apparently different salary structures, you first need to account for differences in education, training and experience before any claim of discrimination can be made.
Regression analysis with an indicator variable for the group is a way to investigate this.
Indicator Variables ٢٩٠
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Treasury Versus Harris
The data set HARRIS7 contains information on the salaries of 93 employees of the Harris Trust and Savings Bank. They were sued by the US Department of Treasury in 1981.
Here we examine how salary depends on education, also accounting for gender.
Indicator Variables ٢٩١
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Salary Versus Years of Education
[Scatterplot of SALARY versus EDUCAT, with points coded by the gender indicator (0/1).]
At all levels of education, the male salaries appear higher.
Multiple Regression I
Indicator Variables ٢٩٢
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Regression Analysis
The regression equation is
SALARY = 4173 + 80.7 EDUCAT + 692 MALES
Predictor Coef SE Coef T P
Constant 4173.1 339.2 12.30 0.000
EDUCAT 80.70 27.67 2.92 0.004
MALES 691.8 132.2 5.23 0.000
S = 572.4 R-Sq = 36.3% R-Sq(adj) = 34.9%
How do we interpret this equation?
Indicator Variables ٢٩٣
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
An Intercept Adjuster
For an indicator variable, the bj is not really a slope.
To see this, evaluate the equation for the two groups.
FEMALES (MALES = 0):
SALARY = 4173 + 80.7 EDUCAT + 692 (0)
       = 4173 + 80.7 EDUCAT
MALES (MALES = 1):
SALARY = 4173 + 80.7 EDUCAT + 692 (1)
       = 4865 + 80.7 EDUCAT
Indicator Variables ٢٩٤
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Parallel Salary Equations
[Plot of the two fitted salary lines versus EDUCAT: parallel lines with different intercepts.]
Multiple Regression I
Indicator Variables ٢٩٥
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Is The Difference Significant?
H0: βMALES = 0 (After accounting for years of education, there is no salary difference)
Ha: βMALES ≠ 0 (After accounting for education, there IS a salary difference)
Use t = b/SEb as usual
t = 5.23 is significant
Indicator Variables ٢٩٦
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
What if the Coding Was Different?
� If we had an indicator for females and used it, the equation would be:
SALARY = 4865 + 80.7 EDUCAT - 692 FEMALES
� The difference between the groups is the same. For females, the intercept in the equation is 4865 – 692 = 4173
Indicator Variables ٢٩٧
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Multiple Categories
� Pick one category as the "base category".
� Create one indicator variable for each other category.
� In general, if there are m categories, use m – 1 indicator variables.
Multiple Regression I
Indicator Variables ٢٩٨
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Example 7.3 Meddicorp Sales
Y = Sales in one of 25 territories
X1 = advertising in territory
X2 = bonuses paid in territory
Also Region: 1 = South
2 = West
3 = Midwest
Indicator Variables ٢٩٩
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
How do you use region?
What happens if you just put it in the model?
Sales = -84 + 1.55 ADV + 1.11 BONUS + 119 Region
R2 = 92.0% and Se = 68.89
SE(Region) = 28.69 so tstat = 4.14 is significant
Indicator Variables ٣٠٠
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Region as an X
This implies the difference between Region 3 (MW) and Region 2 (W) = b3 = 119
And the difference between Region 2 (W) and Region 1 (S) is also 119
The sales differences may not be equal but this forces them to be estimated that way
Multiple Regression I
Indicator Variables ٣٠١
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
A more flexible approach
� Use two indicator variables to tell the three regions apart
� Can use any one of the three as the “base” category.
� Here is what it looks like if Midwest is selected as the base.
Indicator Variables ٣٠٢
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Coding scheme
Region      D1 (South)   D2 (West)
SOUTH       1            0
WEST        0            1
MIDWEST     0            0
Indicator Variables ٣٠٣
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Results
SALES = 435 + 1.37ADV + .975 BONUS
- 258 South – 210 West
R2 = 94.7 and Se = 57.63
Both indicators are significant
Multiple Regression I
Indicator Variables ٣٠٤
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
This Defines Three Equations
SALES = 435 + 1.37ADV + .975 BONUS
- 258 South – 210 West
S: SALES = 177 + 1.37ADV + .975 BONUS
W: SALES = 225 + 1.37ADV + .975 BONUS
MW: SALES = 435 + 1.37ADV + .975 BONUS
Indicator Variables ٣٠٥
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Is Location Significant?
� Because location is measured by two variables in a group, we need to do a partial F test.
� The full Model has ADV, BONUS, SOUTH and WEST and has R2 = 94.7
� The reduced model has only ADV and BONUS, with R2 = 85.5
Indicator Variables ٣٠٦
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Output For F-Test
FULL MODEL
S = 57.63 R-Sq = 94.7% R-Sq(adj) = 93.6%
Analysis of Variance
Source DF SS MS F P
Regression 4 1182560 295640 89.03 0.000
Residual Error 20 66414 3321
Total 24 1248974
REDUCED MODEL
S = 90.75 R-Sq = 85.5% R-Sq(adj) = 84.2%
Analysis of Variance
Source DF SS MS F P
Regression 2 1067797 533899 64.83 0.000
Residual Error 22 181176 8235
Total 24 1248974
Multiple Regression I
Indicator Variables ٣٠٧
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Partial F Computations
F = [(SSER – SSEF) / (K – L)] / MSEF
  = [(181176 - 66414) / (4 - 2)] / 3321 = 17.3
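The same computation as a short sketch, using the SSE and MSE values from the full and reduced models shown above:

# Partial F test for the two region indicators (values from the Minitab output above)
sse_reduced, sse_full = 181176, 66414
K, L, mse_full = 4, 2, 3321
partial_f = ((sse_reduced - sse_full) / (K - L)) / mse_full
print(round(partial_f, 1))   # about 17.3; compare with the F critical value for (2, 20) df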
Indicator Variables ٣٠٨
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
7.2 Interaction Variables
� Another type of variable used in regression models is an interaction variable.
� This is usually formulated as the product of two variables; for example, x3 = x1x2
� With this variable in the model, it means the level of x2 changes how x1 affects Y
Indicator Variables ٣٠٩
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Interaction Model
With two x variables the model is:
y = β0 + β1 x1 + β2 x2 + β3 x1 x2 + e
If we factor out x1 we get:
y = β0 + (β1 + β3 x2) x1 + β2 x2 + e
so each value of x2 yields a different slope in the relationship between y and x1
Multiple Regression I
Indicator Variables ٣١٠
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Interaction Involving an Indicator
If one of the two variables is binary, the interaction produces a model with two different slopes.
When x2 = 0:  y = β0 + β1 x1 + e
When x2 = 1:  y = (β0 + β2) + (β1 + β3) x1 + e
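A sketch of how the interaction column is built and how the two implied equations fall out of one fit; the numbers are hypothetical, since HARRIS7 is not reproduced here:

import numpy as np

educat = np.array([8, 10, 12, 12, 14, 16], dtype=float)   # illustrative values only
males  = np.array([0, 1, 0, 1, 0, 1], dtype=float)
salary = np.array([4600, 5400, 5100, 5900, 5500, 6800], dtype=float)

mslope = educat * males                       # the interaction (slope-adjuster) column
X = np.column_stack([np.ones(len(educat)), educat, males, mslope])
b0, b1, b2, b3 = np.linalg.lstsq(X, salary, rcond=None)[0]
print(f"females: SALARY = {b0:.0f} + {b1:.1f} EDUCAT")
print(f"males:   SALARY = {b0 + b2:.0f} + {b1 + b3:.1f} EDUCAT")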
Indicator Variables ٣١١
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Example 7.4 Discrimination (again)
� In the Harris Bank case, suppose we suspected that the salary difference by gender changed with different levels of education.
� To investigate this, we created a new variable MSLOPE = EDUCAT*MALES and added it to the model.
Indicator Variables ٣١٢
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Regression Output
The regression equation is
SALARY = 4395 + 62.1 EDUCAT - 275 MALES + 73.6 MSLOPE
Predictor Coef SE Coef T P
Constant 4395.3 389.2 11.29 0.000
EDUCAT 62.13 31.94 1.95 0.055
MALES -274.9 845.7 -0.32 0.746
MSLOPE 73.59 63.59 1.16 0.250
S = 571.4 R-Sq = 37.3% R-Sq(adj) = 35.2%
How do we interpret the equation this time?
Multiple Regression I
Indicator Variables ٣١٣
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
A Slope Adjuster
To see the interaction effect, once again evaluate the equation for the two groups.
FEMALES (MALES = 0):
SALARY = 4395 + 62.1 EDUCAT - 275 (0) + 73.6 (EDUCAT*0)
       = 4395 + 62.1 EDUCAT
MALES (MALES = 1):
SALARY = 4395 + 62.1 EDUCAT - 275 (1) + 73.6 (EDUCAT*1)
       = 4120 + 135.7 EDUCAT
Indicator Variables ٣١٤
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Lines With Two Different Slopes
[Plot of the two fitted salary lines versus EDUCAT, with different slopes for the two groups.]
A bigger gap occurs at higher education levels
Indicator Variables ٣١٥
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Tests in This Model
� Although the slope adjuster implies the salary gap increases with education, this effect is not really significant (tMSLOPE = 1.16).
� The overall effect of gender is now contained in two variables, so a partial F test would be needed to test for differences between male and female salaries.
Multiple Regression I
Indicator Variables ٣١٦
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
7.3 Seasonal Effects in Time Series Regression
� Data collected over time (say quarterly)
� If we think the Y variable depends on the calendar, we can do a kind of “seasonal adjustment” by adding quarter dummies
� Q1 = 1 if this was first quarter, Q2 = 1 if a second quarter, Q3 = 1 if third
� Don’t use Q4 since that is the “base”
Indicator Variables ٣١٧
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Example 7.5 ABX Company Sales
� We fit a trend to these sales in Example 3.11 by regressing sales on a time index variable.
� Because this company sells winter sports merchandise, including seasonal effects should markedly improve the fit.
Indicator Variables ٣١٨
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
ABX Company Sales
[Time series plot of SALES versus TIME; the peaks are the 4th-quarter observations.]
Multiple Regression I
Indicator Variables ٣١٩
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Two Regressions
The regression equation is
SALES = 199 + 2.56 TIME
Predictor Coef SE Coef T P
Constant 199.017 5.128 38.81 0.000
TIME 2.5559 0.2180 11.73 0.000
S = 15.91 R-Sq = 78.3% R-Sq(adj) = 77.8%
The regression equation is
SALES = 211 + 2.57 TIME + 3.75 Q1 - 26.1 Q2 - 25.8 Q3
Predictor Coef SE Coef T P
Constant 210.846 3.148 66.98 0.000
TIME 2.56610 0.09895 25.93 0.000
Q1 3.748 3.229 1.16 0.254
Q2 -26.118 3.222 -8.11 0.000
Q3 -25.784 3.217 -8.01 0.000
S = 7.190 R-Sq = 95.9% R-Sq(adj) = 95.5%
Indicator Variables ٣٢٠
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Are the Seasonal Effects Significant?
� The strong t-ratios for Q2 and Q3 say "yes" and the model R2 increased by 17.6% when we added the seasonal indicators.
� With evidence this strong we probably don't need to test further.
� In general, however, we would need another partial F test to see if the overall seasonal effect is significant.
Indicator Variables ٣٢١
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Partial F Computations
F = [(SSER – SSEF) / (K – L)] / MSEF
  = [(9622 - 1810) / (4 - 1)] / (1810/35 ≈ 52) ≈ 50.4
F(0.05, 3, 35) = 2.92, so the seasonal effect is significant.
Multiple Regression I
Variable Selection ٣٢٢
Chapter 8: Variable Selection
Terry Dielman
Applied Regression Analysis: A Second Course in Business and Economic Statistics, fourth edition
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection
٣٢٣
8.1 Introduction
� Previously we discussed some tests (t-test and partial F) that helped us determine whether certain variables should be in the regression.
� Here we will look at several variable selection strategies that expand on this idea.
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection
٣٢٤
Why is This Important?
� If an important variable is omitted, the estimated regression coefficients can become biased (systematically too high or low).
� Their standard errors can become inflated, leading to imprecise intervals and poor power in hypothesis tests.
Multiple Regression I
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection
٣٢٥
Strategies
� All possible regressions: computer procedures that briefly examine every possible combination of Xs and report summaries of fit ability.
� Selection algorithms: rules for deciding when to drop or add variables
1. Backwards Elimination
2. Forward Selection
3. Stepwise Regression
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection
٣٢٦
Words of Caution
� None guarantee you get the right model because they do not check assumptions or search for omitted factors like curvature.
� None have the ability to use a researcher's knowledge about the business or economic situation being analyzed.
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection
٣٢٧
8.2 All Possible Regressions
� If there are k x variables to consider using, there are 2^k possible subsets. For example, with only k = 5, there are 32 regression equations.
� Obtaining these sounds like a ton of work but programs like SAS or Minitab have algorithms that can measure fit ability without really producing the equation.
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection
٣٢٨
Typical Output
� The program will usually give you a summary table.
� Each line on the table will tell you which variables were in the model, plus measures of fit ability.
� These measures include R², adjusted R², Se, and a new one, Cp.
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection
٣٢٩
The Cp Statistic
p = k + 1 is the number of terms in the model, including the intercept.
SSEp is the SSE of this model.
MSEF is the MSE of the "full model" (with all the variables).
Cp = SSEp / MSEF - (n - 2p)
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection
٣٣٠
Using The Cp Statistic
Theory says that in a model with bias, Cp will be large.
It also says that in a model with no bias, Cp should be approximately equal to p.
It is thus recommended that we consider models with a small Cp and those with Cp near p = k + 1.
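A minimal sketch, not from the text, of how the all-possible-regressions summary (including Cp) could be built. The data frame here is made up; only the column names follow the Meddicorp example.

```python
from itertools import combinations
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Placeholder data standing in for the Meddicorp file.
rng = np.random.default_rng(1)
n = 25
data = pd.DataFrame(rng.normal(size=(n, 4)),
                    columns=["ADV", "BONUS", "MKTSHR", "COMPET"])
data["SALES"] = 2.5 * data["ADV"] + 1.9 * data["BONUS"] + rng.normal(0, 1, n)

predictors = ["ADV", "BONUS", "MKTSHR", "COMPET"]
y = data["SALES"]
full = sm.OLS(y, sm.add_constant(data[predictors])).fit()
mse_full = full.ssr / full.df_resid              # MSE of the full model

rows = []
for k in range(1, len(predictors) + 1):
    for subset in combinations(predictors, k):
        fit = sm.OLS(y, sm.add_constant(data[list(subset)])).fit()
        p = k + 1                                # terms in the model, counting the intercept
        cp = fit.ssr / mse_full - (n - 2 * p)
        rows.append({"vars": ", ".join(subset),
                     "R2": fit.rsquared, "R2_adj": fit.rsquared_adj,
                     "Cp": cp, "S": np.sqrt(fit.ssr / fit.df_resid)})

print(pd.DataFrame(rows).sort_values("Cp").to_string(index=False))
```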
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection
٣٣١
Example 8.1 Meddicorp Revisited
n = 25 sales territories
y = Sales (in $1,000s) in each territory
x1 = Advertising ($100s) in the territory
x2 = Bonuses paid (in $100s) in the territory
x3 = Market share in the territory
x4 = Largest competitor's sales ($1,000s)
x5 = Region code (1 = S, 2 = W, 3 = MW)
We are not using region here because it should be converted to indicator variables which should be examined as a group.
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection
٣٣٢
Summary Results For All Possible Regressions
Variables in the Regression              R²     R²(adj)    Cp       Se
ADV 81.1 80.2 5.90 101.42
BONUS 32.3 29.3 75.19 191.76
COMPET 14.2 10.5 100.85 215.83
MKTSHR 0.1 0.0 120.97 232.97
ADV, BONUS 85.5 84.2 1.61 90.75
ADV, MKTSHR 81.2 79.5 7.66 103.23
ADV, COMPET 81.2 79.5 7.74 103.38
BONUS, COMPET 38.7 33.2 68.03 186.51
BONUS, MKTSHR 32.8 26.7 76.46 195.33
COMPET, MKTSHR 16.1 8.5 100.18 218.20
ADV, BONUS, MKTSHR 85.8 83.8 3.10 91.75
ADV, BONUS, COMPET 85.7 83.6 3.30 92.26
ADV, MKTSHR, COMPET 81.3 78.6 9.60 105.52
BONUS, MKTSHR, COMPET 40.9 32.5 66.90 187.48
ADV, BONUS, MKTSHR, COMPET 85.9 83.1 5.00 93.77
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection
٣٣٣
The Best Model?
� The two-variable model with ADV and BONUS has the smallest Cp and the highest adjusted R².
� The three-variable models adding either MKTSHR or COMPET also have small Cp values but only modest increases in R².
� The two-variable model is probably the best.
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection
٣٣٤
Vars   R-Sq   R-Sq(adj)    C-p        S    ADV  BONUS  MKTSHR  COMPET
   1   81.1       80.2     5.9   101.42     X
   1   32.3       29.3    75.2   191.76           X
   2   85.5       84.2     1.6    90.749    X     X
   2   81.2       79.5     7.7   103.23     X            X
   3   85.8       83.8     3.1    91.751    X     X      X
   3   85.7       83.6     3.3    92.255    X     X              X
   4   85.9       83.1     5.0    93.770    X     X      X       X
Minitab Results
By default, the Best Subsets procedure prints two models for each number of X variables. This can be increased up to 5.
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection
٣٣٥
Limitations
� With a large number of potential x variables, the all possible approach becomes unwieldy.
� Minitab can use up to 31 predictors, but warns that computational time can be long when as few as 15 are used.
� "Obviously good" predictors can be forced into the model, thus reducing search time, but this is not always what you want.
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection
٣٣٦
8.3 Other Variable Selection Techniques
� With a large number of potential x variables, it may be best to use one of the iterative selection methods.
� These look only at the set of models that their rules lead them to, so they may not yield a model as good as the one returned by the all possible regressions approach.
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection
٣٣٧
8.3.1 Backwards Elimination
1. Start with all variables in the equation.
2. Examine the variables in the model for significance and identify the least significant one.
3. Remove this variable if it does not meet some minimum significance level.
4. Run a new regression and repeat until all remaining variables are significant.
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection
٣٣٨
No Search Routine Needed?
� Although most software packages have automatic procedures for backwards elimination, it is fairly easy to do interactively.
� Run a model, check its t-tests for significance, and identify the variable to drop.
� Run again with one less variable and repeat the steps (a sketch of this loop follows below).
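A sketch of that interactive loop in Python (statsmodels assumed). The function name and the 0.10 removal threshold are illustrative choices, not from the text.

```python
import numpy as np
import statsmodels.api as sm

def backward_eliminate(y, X, alpha_remove=0.10):
    """Refit, drop the least significant predictor, and repeat until every
    remaining p-value is at or below alpha_remove. X is a DataFrame of predictors."""
    cols = list(X.columns)
    while cols:
        fit = sm.OLS(y, sm.add_constant(X[cols])).fit()
        pvals = fit.pvalues.drop("const")       # the intercept is never a removal candidate
        worst = pvals.idxmax()
        if pvals[worst] <= alpha_remove:
            return fit                          # all remaining variables are significant
        cols.remove(worst)                      # eliminate the least significant variable
    return sm.OLS(y, np.ones(len(y))).fit()     # intercept-only model if everything drops out

# e.g., with the placeholder frame from the earlier sketch:
# backward_eliminate(data["SALES"], data[["ADV", "BONUS", "MKTSHR", "COMPET"]])
```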
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection
٣٣٩
Step 1 – All Variables
Regression Analysis: SALES versus ADV, BONUS, MKTSHR, COMPET
The regression equation is
SALES = - 594 + 2.51 ADV + 1.91 BONUS + 2.65 MKTSHR - 0.121 COMPET
Predictor Coef SE Coef T P
Constant -593.5 259.2 -2.29 0.033
ADV 2.5131 0.3143 8.00 0.000
BONUS 1.9059 0.7424 2.57 0.018
MKTSHR 2.651 4.636 0.57 0.574
COMPET -0.1207 0.3718 -0.32 0.749
S = 93.77 R-Sq = 85.9% R-Sq(adj) = 83.1%
Least significant variable: COMPET (p = 0.749)
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection
٣٤٠
Step 2 – COMPET Eliminated
Regression Analysis: SALES versus ADV, BONUS, MKTSHR
The regression equation is
SALES = - 621 + 2.47 ADV + 1.90 BONUS + 3.12 MKTSHR
Predictor Coef SE Coef T P
Constant -620.6 240.1 -2.58 0.017
ADV 2.4698 0.2784 8.87 0.000
BONUS 1.9003 0.7262 2.62 0.016
MKTSHR 3.116 4.314 0.72 0.478
S = 91.75 R-Sq = 85.8% R-Sq(adj) = 83.8%
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection
٣٤١
Step 3 – MKTSHR Eliminated
Regression Analysis: SALES versus ADV, BONUS
The regression equation is
SALES = - 516 + 2.47 ADV + 1.86 BONUS
Predictor Coef SE Coef T P
Constant -516.4 189.9 -2.72 0.013
ADV 2.4732 0.2753 8.98 0.000
BONUS 1.8562 0.7157 2.59 0.017
S = 90.75 R-Sq = 85.5% R-Sq(adj) = 84.2%
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection
٣٤٢
8.3.2 Forward Selection
� At each stage, it looks at the x variables not in the current equation and tests to see whether they would be significant if added.
� In the first stage, the x with the highest correlation with y is added.
� At later stages, the variable added is the one with the highest partial correlation with y, given the variables already in the equation, which is harder to see by inspection (a sketch follows below).
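A rough sketch of such a forward pass, not from the text. The entry rule used here is the smallest t-test p-value (equivalent, for a single variable, to the largest partial F), and the 0.25 default mirrors the Alpha-to-Enter in the Minitab output that follows.

```python
import statsmodels.api as sm

def forward_select(y, X, alpha_enter=0.25):
    """At each step, add the candidate whose p-value (given the variables
    already chosen) is smallest, provided it does not exceed alpha_enter."""
    chosen, remaining = [], list(X.columns)
    while remaining:
        best_var, best_p = None, 1.0
        for var in remaining:
            fit = sm.OLS(y, sm.add_constant(X[chosen + [var]])).fit()
            if fit.pvalues[var] < best_p:
                best_var, best_p = var, fit.pvalues[var]
        if best_p > alpha_enter:
            break                               # nothing left qualifies for entry
        chosen.append(best_var)
        remaining.remove(best_var)
    return sm.OLS(y, sm.add_constant(X[chosen])).fit() if chosen else None
```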
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection
٣٤٣
Minitab Output for Forward Selection
(an option in the Stepwise procedure)
Forward selection. Alpha-to-Enter: 0.25
Response is SALES on 4 predictors, with N = 25
Step 1 2
Constant -157.3 -516.4
ADV 2.77 2.47
T-Value 9.92 8.98
P-Value 0.000 0.000
BONUS 1.86
T-Value 2.59
P-Value 0.017
S 101 90.7
R-Sq 81.06 85.49
R-Sq(adj) 80.24 84.18
C-p 5.9 1.6
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection
٣٤٤
Same Model as Backwards
� This data set is not too complex, so both procedures returned the same model.
� With larger data sets, particularly when the x variables are correlated among themselves, results can be different.
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection
٣٤٥
8.3.3 Stepwise Regression
� A limitation with the backwards procedure is that a variable that gets eliminated is never considered again.
� With forward selection, variables entering stay in, even if they lose significance.
� Stepwise regression corrects these flaws: a variable that has entered can later leave, and a variable that was eliminated can later go back in (a sketch follows below).
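A rough sketch, not Minitab's exact algorithm: extend the forward pass above with a backward check after each entry. The function name and defaults are illustrative.

```python
import statsmodels.api as sm

def stepwise_select(y, X, alpha_enter=0.15, alpha_remove=0.15):
    """Alternate a forward step (add the most significant candidate) with a
    backward check (drop anything whose p-value has risen above alpha_remove)."""
    chosen, remaining = [], list(X.columns)
    while True:
        # Forward step: find the most significant candidate not yet in the model.
        entered = False
        best_var, best_p = None, 1.0
        for var in remaining:
            p = sm.OLS(y, sm.add_constant(X[chosen + [var]])).fit().pvalues[var]
            if p < best_p:
                best_var, best_p = var, p
        if best_var is not None and best_p <= alpha_enter:
            chosen.append(best_var)
            remaining.remove(best_var)
            entered = True
        # Backward check on the variables already in the model.
        if chosen:
            fit = sm.OLS(y, sm.add_constant(X[chosen])).fit()
            pvals = fit.pvalues.drop("const")
            worst = pvals.idxmax()
            if pvals[worst] > alpha_remove:
                chosen.remove(worst)
                remaining.append(worst)
                continue
        if not entered:
            return sm.OLS(y, sm.add_constant(X[chosen])).fit() if chosen else None
```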
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection
٣٤٦
Minitab Output for Stepwise Regression
Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15
Response is SALES on 4 predictors, with N = 25
Step 1 2
Constant -157.3 -516.4
ADV 2.77 2.47
T-Value 9.92 8.98
P-Value 0.000 0.000
BONUS 1.86
T-Value 2.59
P-Value 0.017
S 101 90.7
R-Sq 81.06 85.49
R-Sq(adj) 80.24 84.18
C-p 5.9 1.6
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection
٣٤٧
Selection Parameters
� For backwards elimination, the user specifies "Alpha to Remove", the largest p-value a variable can have and still stay in the equation.
� For forward selection, the user specifies "Alpha to Enter", the largest p-value a variable can have and still enter the equation.
� Stepwise regression uses both.
� Often we use values like .15 or .20, because this encourages the procedures to look at models with more variables (illustrated below).
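For instance, with the hypothetical stepwise_select sketch shown earlier and the placeholder Meddicorp-style frame, looser thresholds would be passed like this:

```python
# Looser thresholds let the search consider models with more variables.
fit = stepwise_select(data["SALES"],
                      data[["ADV", "BONUS", "MKTSHR", "COMPET"]],
                      alpha_enter=0.20, alpha_remove=0.20)
```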
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection
٣٤٨
8.4 Which Procedure is Best?
� Unless there are too many x variables, the all possible models approach is favored because it looks at all combinations of variables.
� Of the other strategies, stepwise regression is probably best.
� If no search programs are available, backwards elimination can still provide a useful sifting of the data.
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection
٣٤٩
No Guarantees
� Because they do not check assumptions or examine the model residuals, there is no guarantee of returning the right model.
� Nonetheless, these can be effective tools for filtering the data and identifying which variables deserve more attention.