Multiple regression models
Multiple Regression Analysis

Definition: The multiple regression model equation is

Y = β0 + β1x1 + β2x2 + ... + βkxk + ε

where E(ε) = 0 and V(ε) = σ². In addition, for purposes of testing hypotheses and calculating CIs or PIs, it is assumed that ε is normally distributed.

This is no longer a regression line, but a regression surface.
Multiple Regression Analysis

Let x1*, ..., xk* be particular values of x1, ..., xk. Then

E(Y | x1*, ..., xk*) = β0 + β1x1* + ... + βkxk*

Thus, just as β0 + β1x describes the mean Y value as a function of x in simple linear regression, the true (or population) regression function β0 + β1x1 + ... + βkxk gives the expected value of Y as a function of x1, ..., xk.

The βi's are the true (or population) regression coefficients.
Multiple Regression Analysis

The regression coefficient β1 is interpreted as the expected change in Y associated with a 1-unit increase in x1 while x2, ..., xk are held fixed. Analogous interpretations hold for β2, ..., βk.

These coefficients are called partial or adjusted regression coefficients.
Estimating Parameters
Estimating Parameters

The data in simple linear regression consist of n pairs (x1, y1), ..., (xn, yn). Suppose that a multiple regression model contains two predictor variables, x1 and x2. Then the data set will consist of n triples (x11, x21, y1), (x12, x22, y2), ..., (x1n, x2n, yn). Here the first subscript on x refers to the predictor and the second to the observation number.

More generally, with k predictors, the data consist of n (k + 1)-tuples (x11, x21, ..., xk1, y1), (x12, x22, ..., xk2, y2), ..., (x1n, x2n, ..., xkn, yn), where xij is the value of the ith predictor xi associated with the observed value yj.
Estimating Parameters

The observations are assumed to have been obtained independently of one another according to the model

Yj = β0 + β1x1j + β2x2j + ... + βkxkj + εj,  j = 1, ..., n

To estimate the parameters β0, β1, ..., βk using the principle of least squares, form the sum of squared deviations:

f(β0, β1, ..., βk) = Σj [yj – (β0 + β1x1j + β2x2j + ... + βkxkj)]²

The least squares estimates are those values of the βi's that minimize f(β0, ..., βk).
Estimating Parameters
Taking the partial derivative of f with respect to each βi (i = 0, 1, ..., k) and equating all partials to zero yields the following system of normal equations, the first of which is

β0n + β1Σx1j + β2Σx2j + ... + βkΣxkj = Σyj

and, for each i = 1, ..., k,

β0Σxij + β1Σx1jxij + ... + βkΣxkjxij = Σxijyj
Estimating Parameters

These equations are linear in the unknowns β0, β1, ..., βk. Solving them yields the least squares estimates β̂0, β̂1, ..., β̂k. This is best done by utilizing a statistical software package.
Example – shear strength

An experiment was carried out to assess the impact of the variables

• x1 = force (gm),
• x2 = power (mW),
• x3 = temperature (°C), and
• x4 = time (msec)

on y = bond shear strength (gm).
Example – shear strength

A computer package gave the following least squares estimates:

β̂0 = –37.48, β̂1 = .2117, β̂2 = .4983, β̂3 = .1297, β̂4 = .2583

Thus .1297 gm is the estimated average change in strength associated with a 1-degree increase in temperature when the other three predictors are held fixed. (The other estimated coefficients are interpreted in a similar manner.)
Example – shear strength

How do you write the estimated regression equation?

What is the point prediction of strength resulting from a force of 35 gm, power of 75 mW, temperature of 200 °C, and time of 20 msec?
Fitted values

The first predicted value ŷ1 results from substituting the values of the various predictors from the first observation into the estimated regression function. The remaining predicted values ŷ2, ..., ŷn come from substituting values of the predictors from the 2nd, 3rd, ..., and finally nth observations into the estimated function.
For example, the values of the 4 predictors for the last observation in the previous example are x1,30 = 35, x2,30 = 75, x3,30 = 200, x4,30 = 20, so

ŷ30 = –37.48 + .2117(35) + .4983(75) + .1297(200) + .2583(20) = 38.41

The residuals y1 – ŷ1, ..., yn – ŷn are the differences between the observed and predicted values.
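The arithmetic for this prediction is easy to check directly; the coefficient values are those reported for the shear strength example, and the function name is ours.

```python
# Least squares estimates from the shear strength example
b0, b1, b2, b3, b4 = -37.48, 0.2117, 0.4983, 0.1297, 0.2583

def predict_strength(force, power, temp, time):
    """Estimated regression function for bond shear strength (gm)."""
    return b0 + b1 * force + b2 * power + b3 * temp + b4 * time

# Predictor values for observation 30: force 35 gm, power 75 mW,
# temperature 200 degrees, time 20 msec
y30_hat = predict_strength(35, 75, 200, 20)  # ≈ 38.41, as on the slide
```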
Models with Interaction and Quadratic Predictors

Example: an investigator has obtained observations on y, x1, and x2. One possible model is

Y = β0 + β1x1 + β2x2 + ε

However, other models can be constructed by forming predictors that are mathematical functions of x1 and/or x2. For example, with x3 = x1² and x4 = x1x2, the model

Y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + ε

also has the general MR form.
Models with Interaction and Quadratic Predictors

Sometimes, mathematical functions of other predictors may result in a model that is more successful in explaining variation in y. This is still "linear regression," even though the relationship between the outcome and the predictors may not be linear. For example, the model

Y = β0 + β1x + β2x² + ε

is still MLR with k = 2, x1 = x, and x2 = x².
Example – travel time

The article "Estimating Urban Travel Times: A Comparative Study" (Trans. Res., 1980: 173–175) described a study relating the dependent variable y = travel time between locations in a certain city and the independent variable x2 = distance between locations. Two types of vehicles, passenger cars and trucks, were used in the study.

Let x1 = 1 if the vehicle is a truck and x1 = 0 if it is a passenger car.
Example – travel time

One possible multiple regression model is

Y = β0 + β1x1 + β2x2 + ε

The mean value of travel time depends on whether a vehicle is a car or a truck:

mean time = β0 + β2x2 when x1 = 0 (cars)
mean time = β0 + β1 + β2x2 when x1 = 1 (trucks)
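With hypothetical coefficient values (not estimated from the article's data), the dummy variable's role is easy to see in code: the truck and car lines share the slope β2 and differ by β1 in intercept, whatever the distance.

```python
# Hypothetical coefficients, for illustration only
B0, B1, B2 = 1.5, 2.0, 0.8

def mean_travel_time(distance, truck):
    """Mean travel time under Y = B0 + B1*x1 + B2*x2, x1 a truck dummy."""
    x1 = 1 if truck else 0  # dummy variable: 1 = truck, 0 = car
    return B0 + B1 * x1 + B2 * distance

# The truck-minus-car gap equals B1 at every distance
gap_10 = mean_travel_time(10, truck=True) - mean_travel_time(10, truck=False)
gap_25 = mean_travel_time(25, truck=True) - mean_travel_time(25, truck=False)
```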
Example – travel time

The coefficient β1 is the difference in mean times between trucks and cars with distance held fixed; if β1 > 0, on average it will take trucks β1 longer to traverse any particular distance than it will take cars.

A second possibility is a model with an interaction predictor:

Y = β0 + β1x1 + β2x2 + β3x1x2 + ε

Now what are the mean times for the two types of vehicles? Does it make sense to have different intercepts for cars and trucks?
Example – travel time

For each model, the graph of the mean time versus distance is a straight line for either type of vehicle, as illustrated in the figure.

[Figure: regression functions for models with one dummy variable (x1) and one quantitative variable (x2): (a) no interaction, (b) interaction]
Models with Predictors for Categorical Variables

You might think that the way to handle a three-category situation is to define a single numerical variable with coded values such as 0, 1, and 2 corresponding to the three categories. This is generally incorrect, because it imposes an ordering on the categories that may not exist in reality. Such coding can be acceptable for ordered categories such as education (HS = 1, BS = 2, Grad = 3), but not for unordered ones such as ethnicity.

The correct approach to incorporating three unordered categories is to define two different indicator variables.
Models with Predictors for Categorical Variables
Suppose, for example, that y is the lifetime of a certain tool, and that there are 3 brands of tool being investigated. Let

x1 = 1 if a brand A tool is used, and 0 otherwise,
x2 = 1 if a brand B tool is used, and 0 otherwise,
x3 = 1 if a brand C tool is used, and 0 otherwise.

Then, for an observation on a
brand A tool: x1 = 1, x2 = 0, x3 = 0,
brand B tool: x1 = 0, x2 = 1, x3 = 0,
brand C tool: x1 = 0, x2 = 0, x3 = 1.
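This indicator coding can be sketched in a few lines (the helper name is ours):

```python
BRANDS = ("A", "B", "C")

def indicators(brand):
    """One indicator per brand: (x1, x2, x3) for brands A, B, C."""
    return tuple(1 if brand == b else 0 for b in BRANDS)

x_a = indicators("A")  # (1, 0, 0)
x_b = indicators("B")  # (0, 1, 0)
x_c = indicators("C")  # (0, 0, 1)
```

Exactly one indicator is 1 for each observation, which is why, once an intercept is in the model, one of the three indicators is redundant.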
Using regression to do mean testing

If we wish to model average lifetimes of the tool as a function of brand (A, B, and C), the model to use is

Y = β0 + β2x2 + β3x3 + ε

We only need 2 out of the 3 indicator variables - WHY?

How are the mean lifetimes for brands A, B, and C written?

Question: how do you test whether the average lifetime is the same for brand A and brand B?
R² and σ̂²

The closer the residuals are to 0, the better the job our estimated regression function is doing in making predictions corresponding to observations in the sample.

The error or residual sum of squares is SSE = Σ(yi – ŷi)². It is again interpreted as a measure of how much variation in the observed y values is not explained by (not attributed to) the model relationship.

The number of df associated with SSE is n – (k + 1), because k + 1 df are lost in estimating the k + 1 β coefficients.
R² and σ̂²

The total sum of squares, a measure of total variation in the observed y values, is SST = Σ(yi – ȳ)². The regression sum of squares SSR = Σ(ŷi – ȳ)² = SST – SSE is a measure of explained variation. Then the coefficient of multiple determination R² is

R² = 1 – SSE/SST = SSR/SST

The positive square root of R² is called the multiple correlation coefficient and is denoted by R.
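These sums of squares are straightforward to compute from observed and fitted values; the tiny data set here is made up for illustration.

```python
def sums_of_squares(y, yhat):
    """Return (SSE, SST) for observed values y and fitted values yhat."""
    ybar = sum(y) / len(y)
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return sse, sst

y = [1.0, 2.0, 3.0, 4.0]
yhat = [1.1, 1.9, 3.2, 3.8]   # fitted values from some hypothetical model
sse, sst = sums_of_squares(y, yhat)
r2 = 1 - sse / sst            # coefficient of multiple determination
```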
R² and σ̂²

Because there is no preliminary picture of multiple regression data analogous to a scatterplot for bivariate data, the coefficient of multiple determination is our first indication of whether the chosen model is successful in explaining y variation.

Unfortunately, there is a problem with R²: its value can be inflated by adding lots of predictors into the model, even if most of these predictors are rather frivolous. There is no such thing as "negative information" from the predictors – they can either help or do nothing with respect to prediction.
R² and σ̂²

Example: suppose y is the sale price of a house. Then sensible predictors include
x1 = the interior size of the house,
x2 = the size of the lot on which the house sits,
x3 = the number of bedrooms,
x4 = the number of bathrooms, and
x5 = the house's age.

Now suppose we add in
x6 = the diameter of the doorknob on the coat closet,
x7 = the thickness of the cutting board in the kitchen, and
x8 = the thickness of the patio slab.
R² and σ̂²

Unless we are very unlucky in our choice of predictors, using n – 1 predictors (one fewer than the sample size) will yield an R² of almost 1. The objective in multiple regression is not simply to explain most of the observed y variation, but to do so using a model with relatively few predictors that are easily interpreted. It is thus desirable to adjust R² to take account of the size of the model:

adjusted R² = 1 – [(n – 1)/(n – (k + 1))] · SSE/SST

Because the ratio in front of SSE/SST exceeds 1, adjusted R² is smaller than R². Furthermore, the larger the number of predictors k relative to the sample size n, the smaller adjusted R² will be relative to R².

Adjusted R² can even be negative, whereas R² itself must be between 0 and 1. A value of adjusted R² that is substantially smaller than R² itself is a warning that the model may contain too many predictors.
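Since SSE/SST = 1 – R², the adjustment can be computed as adjusted R² = 1 – (1 – R²)(n – 1)/(n – (k + 1)). The check below uses the concrete-strength example that appears later in these slides (first-order model: k = 2 predictors, SSE df = 6 so n = 9, reported R² = .741 and adjusted R² = .654).

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - (k + 1))."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Concrete-strength example from a later slide
adj = adjusted_r2(0.741, n=9, k=2)      # ≈ .654 (small rounding differences)

# Adding a predictor that leaves R^2 unchanged can only lower adjusted R^2
adj_extra = adjusted_r2(0.741, n=9, k=3)
```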
R² and σ̂²

It can be shown that R is the sample correlation coefficient calculated from the (ŷi, yi) pairs.

SSE is also the basis for estimating the remaining model parameter:

σ̂² = SSE / (n – (k + 1))
Example

Investigators carried out a study to see how various characteristics of concrete are influenced by x1 = % limestone powder and x2 = water-cement ratio, resulting in the accompanying data ("Durability of Concrete with Addition of Limestone Powder," Magazine of Concrete Research, 1996: 131–137).
Example

Consider compressive strength as the dependent variable y. Using computer software, fitting the first-order model results in

ŷ = 84.82 + .1643x1 – 79.67x2,  SSE = 72.52 (df = 6), R² = .741, adjusted R² = .654

If we include an interaction predictor, we get

ŷ = 6.22 + 5.779x1 + 51.33x2 – 9.357x1x2,  SSE = 29.35 (df = 5), R² = .895, adjusted R² = .831
Model selection

Three questions:

• Model utility: are any predictors significantly related to our y? (Is our model worthless?)
• Does any particular predictor matter?
• Which predictor subset matters the most? (Which among all possible models is the "best"?)
SSE

SSE is the left-over variation in the observations after fitting the regression model. The better the model fit, the lower SSE will be. The SSE is the basis for estimating the error variance:

σ̂² = SSE / (n – (k + 1))

Adjusted R² is one way of looking at how well the model fits the data:

adjusted R² = 1 – [(n – 1)/(n – (k + 1))] · SSE/SST
A Model Utility Test

The model utility test in simple linear regression involves the null hypothesis H0: β1 = 0, according to which there is no useful linear relation between y and the predictor x.

In MLR we test the hypothesis H0: β1 = 0, β2 = 0, ..., βk = 0, which says that there is no useful linear relationship between y and any of the k predictors. If at least one of these β's is not 0, the model is deemed useful.

We could test each β separately, but that would take time and be very conservative (if a Bonferroni correction is used). A better test is a joint test, based on a statistic that has an F distribution when H0 is true.
A Model Utility Test

Null hypothesis: H0: β1 = β2 = ... = βk = 0
Alternative hypothesis: Ha: at least one βi ≠ 0 (i = 1, ..., k)

Test statistic value: f = [SSR/k] / [SSE/(n – (k + 1))]

where SSR = regression sum of squares = SST – SSE.

Rejection region for a level α test: f ≥ Fα,k,n – (k + 1)
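Dividing numerator and denominator of the F statistic by SST gives an equivalent form in terms of R² alone: f = (R²/k) / ((1 – R²)/(n – (k + 1))). A quick sketch, checked against the fuel consumption regression output shown later in these slides (R² = .6685, k = 4, n = 48, F = 21.67):

```python
def model_utility_f(r2, n, k):
    """F statistic for H0: beta_1 = ... = beta_k = 0, computed from R^2."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# Fuel consumption fit from a later slide: R^2 = .6685, n = 48, k = 4
f = model_utility_f(0.6685, n=48, k=4)   # ≈ 21.67, matching the output
```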
Example – bond shear strength

Returning to the bond shear strength data, a model with k = 4 predictors was fit, so the relevant hypotheses are

H0: β1 = β2 = β3 = β4 = 0 versus Ha: at least one of these four β's is not 0

Usually, output from a statistical package will look something like this:
Example – bond shear strength

The null hypothesis should be rejected at any reasonable significance level. We conclude that there is a useful linear relationship between y and at least one of the four predictors in the model. This does not mean that all four predictors are useful; we will focus on that next.
Inference in Multiple Regression

All standard statistical software packages compute and show the standard deviations s_β̂i of the regression coefficients. Inference concerning a single βi is based on the standardized variable

T = (β̂i – βi) / S_β̂i

which has a t distribution with n – (k + 1) df. A 100(1 – α)% CI for βi is

β̂i ± t_{α/2, n – (k + 1)} · s_β̂i
Inference in Multiple Regression

A single-predictor hypothesis has the form H0: βi = 0 for a particular i. For example, after fitting the four-predictor model in the bond shear strength example, the investigator might wish to test H0: β4 = 0. According to H0, as long as the predictors x1, x2, and x3 remain in the model, x4 contains no useful information about y.

The test statistic value is the t ratio t = β̂4 / s_β̂4.
Inference in Multiple Regression

We also see that as long as power, temperature, and time are retained in the model, the predictor x1 = force can be deleted.
Inference in Multiple Regression

An F Test for a Group of Predictors. The model utility F test was appropriate for testing whether there is useful information about the dependent variable in any of the k predictors (i.e., whether β1 = ... = βk = 0). In many situations, one first builds a model containing k predictors and then wishes to know whether any of the predictors in a particular subset provide useful information about Y.

Note that with k predictors there are 2^k possible models. When k is moderate, we can check all possible subsets. However, for large k (such as 100), that is a gigantic number of possible models.
Inference in Multiple Regression

The relevant hypotheses are as follows:

H0: βl+1 = βl+2 = ... = βk = 0
(so the "reduced" model with only l predictors, Y = β0 + β1x1 + ... + βlxl + ε, is the correct model)

versus

Ha: at least one among βl+1, ..., βk is not 0
(so the "full" model Y = β0 + β1x1 + ... + βkxk + ε, with k predictors, is the correct model)
Inference in Multiple Regression

The test is carried out by fitting both the full and reduced models. Because the full model contains not only the predictors of the reduced model but also some extra predictors, it should fit the data at least as well as the reduced model. That is, if we let SSEk be the sum of squared residuals for the full model and SSEl be the corresponding sum for the reduced model, then SSEk ≤ SSEl.
Inference in Multiple Regression

Intuitively, if SSEk is a great deal smaller than SSEl, the full model provides a much better fit than the reduced model; the appropriate test statistic should then depend on the reduction SSEl – SSEk in unexplained variation.

SSEk = unexplained variation for the full model
SSEl = unexplained variation for the reduced model

Test statistic value: f = [(SSEl – SSEk)/(k – l)] / [SSEk/(n – (k + 1))]

Rejection region: f ≥ Fα,k – l,n – (k + 1)
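A sketch of this computation, using the two fuel consumption fits shown later in these slides (full model including tax: SSEk = 70317692.3 with k = 4; reduced model without tax: SSEl = 70462530.7 with l = 3; n = 48). When a single predictor is dropped, the statistic equals the square of that predictor's t ratio (t = 0.30 for tax in the output).

```python
def partial_f(sse_reduced, sse_full, n, k, l):
    """F statistic for H0: the k - l extra predictors contribute nothing."""
    return ((sse_reduced - sse_full) / (k - l)) / (sse_full / (n - k - 1))

# Fuel example from later slides: dropping tax from the 4-predictor model
f = partial_f(70462530.7, 70317692.3, n=48, k=4, l=3)
# f ≈ 0.089 ≈ 0.30**2, the squared t ratio for tax
```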
Example – fuel

The fuel consumption dataset was collected by C. Bingham from the American Almanac for 1974 to study fuel consumption across different states. The variables are:

name   variable label
-----  ---------------------------------------------------
state  State
tax    1972 motor fuel tax rate, cents/gal
inc    1972 per capita income, thousands of dollars
road   1971 federal-aid highways, thousands of miles
dlic   Percent of population with driver's license
fuel   Motor fuel consumption, gal/person
Example – fuel

We want to look at the relationship between fuel consumption and fuel tax (which is measured in cents per gallon). To ensure that we work with comparable quantities, we only look at per-capita versions of the other variables that affect fuel consumption.

For the initial analysis, we will look at the graphical relationship of the following variables, which we think might be important in explaining fuel consumption: tax, inc, road, dlic.
Example – fuel

[Scatterplot matrix of fuel (motor fuel consumption, gal/person), tax (1972 motor fuel tax rate, cents/gal), road (1971 federal-aid highways, thousands of miles), inc (1972 per capita income, thousands of dollars), and dlic (percent of population with driver's license)]
Example – fuel

From the scatterplot matrix we can examine the relationship between per-capita fuel consumption and the other variables, as well as the relationships among the variables themselves. This inter-predictor relationship is known as collinearity; there is always some of it in the model.

From the scatterplots in the top row (those describing relationships between fuel and taxes, road length, per-capita income, and the proportion of driver's licenses), we can see that per-capita fuel consumption seems to be linearly related only to two variables, tax and dlic, while the relationships with the other variables (road, inc) seem to be rather weak.
Example – fuel

Hence for now we choose to examine only tax and dlic and their effect on fuel. But before we go for this big model, let us consider what adding a single new predictor to a simple linear regression model would do. Specifically, let us consider what would happen if we added tax to the SLR model relating fuel to dlic.

The main idea in adding the tax variable to the model is to explain the part of fuel that hasn't already been explained by the dlic variable. This fitting of one predictor after another is the central point of multiple regression.
Example – fuel

We need to combine these two regressions and end up with a model that contains both tax and dlic. What can we say about the MLR model based on these two simple regressions?

The R² for the combined model will have to be bigger than 48.9%, the larger of the two individual R²'s. The total R² will be additive, i.e. 48.9% + 20.4% = 69.3%, only if the two variables tax and dlic are uncorrelated and tell us completely separate information about fuel consumption. This almost never happens in the real world.

In order to understand what one variable contributes on top of the other already in the model, we need to understand that overlap. We do that by running a regression of one variable on the other.
Example – fuel

Let's say dlic is already in the model. What information can tax contribute that is not already captured by dlic? That would be the residuals from the regression of tax on dlic.
Example – fuel

What the residuals from that regression can explain in the variation of the residuals from the regression of fuel on dlic is the contribution of tax to the model.
Example – fuel

Proceeding in the same way, we obtain the full model:

. reg fuelc dlic tax inc road

      Source |       SS       df       MS              Number of obs =      48
-------------+------------------------------           F(  4,    43) =   21.67
       Model |   141774183     4  35443545.8           Prob > F      =  0.0000
    Residual |  70317692.3    43  1635295.17           R-squared     =  0.6685
-------------+------------------------------           Adj R-squared =  0.6376
       Total |   212091875    47  4512593.09           Root MSE      =  1278.8

       fuelc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        dlic |   -109.906   37.08678    -2.96   0.005    -184.6986   -35.11339
         tax |   74.44481   250.1444     0.30   0.767    -430.0194     578.909
         inc |   1270.123   332.1401     3.82   0.000     600.2992    1939.948
        road |   416.0281   65.36389     6.36   0.000     284.2092    547.8469
       _cons |   247.5998   3578.363     0.07   0.945    -6968.856    7464.056
Example – fuel

Tax does not seem significant. Re-running the model without it gives:

. reg fuelc dlic inc road

      Source |       SS       df       MS              Number of obs =      48
-------------+------------------------------           F(  3,    44) =   29.48
       Model |   141629345     3  47209781.6           Prob > F      =  0.0000
    Residual |  70462530.7    44  1601421.15           R-squared     =  0.6678
-------------+------------------------------           Adj R-squared =  0.6451
       Total |   212091875    47  4512593.09           Root MSE      =  1265.5

       fuelc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        dlic |   -114.217   33.78536    -3.38   0.002     -182.307   -46.12713
         inc |   1281.629   326.4479     3.93   0.000     623.7166    1939.542
        road |   404.9094   53.07603     7.63   0.000     297.9417    511.8771
       _cons |   1077.415   2219.442     0.49   0.630    -3395.576    5550.406
Standardizing Variables

For multiple regression, especially when values of variables are large in magnitude, it is advantageous to carry this coding one step further. Let x̄i and si be the sample average and sample standard deviation of the xij's (j = 1, ..., n). Now code each variable xi by x′i = (xi – x̄i)/si. The coded variable x′i simply re-expresses any xi value in units of standard deviations above or below the mean.
Standardizing Variables

Thus if x̄i = 100 and si = 20, xi = 130 becomes x′i = 1.5, because 130 is 1.5 standard deviations above the mean of the values of xi.

For example, the coded full second-order model with two independent variables has regression function

β0 + β1x′1 + β2x′2 + β3(x′1)² + β4(x′2)² + β5x′1x′2
Standardizing Variables

The benefits of coding are:
(1) increased numerical accuracy in all computations, and
(2) more accurate estimation, because the individual parameters of the coded model characterize the behavior of the regression function near the center of the data rather than near the origin.
Example 20

The article "The Value and the Limitations of High-Speed Turbo-Exhausters for the Removal of Tar-Fog from Carburetted Water-Gas" (Soc. Chemical Industry J., 1946: 166–168) presents data on y = tar content (grains/100 ft³) of a gas stream as a function of x1 = rotor speed (rpm) and x2 = gas inlet temperature (°F).

The data is also considered in the article "Some Aspects of Nonorthogonal Data Analysis" (J. of Quality Tech., 1973: 67–79), which suggests using the coded model described previously. The means and standard deviations are x̄1 = 2991.13, s1 = 387.81, x̄2 = 58.468, and s2 = 6.944, so x′1 = (x1 – 2991.13)/387.81 and x′2 = (x2 – 58.468)/6.944.

With x′3 = (x′1)², x′4 = (x′2)², and x′5 = x′1x′2, fitting the full second-order model yielded β̂0 = 40.2660, β̂1 = –13.4041, β̂2 = 10.2553, β̂3 = 2.3313, β̂4 = –2.3405, and β̂5 = 2.5978.
Example 20

The estimated regression equation is then

ŷ = 40.27 – 13.40x′1 + 10.26x′2 + 2.33x′3 – 2.34x′4 + 2.60x′5

Thus if x1 = 3200 and x2 = 57.0, then x′1 = .539 and x′2 = –.211. What is the fitted value?
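The coded values quoted here follow directly from the means and standard deviations given on the previous slide:

```python
def standardize(x, mean, sd):
    """Code a raw value as standard deviations above or below the mean."""
    return (x - mean) / sd

x1p = standardize(3200, 2991.13, 387.81)   # rotor speed, coded
x2p = standardize(57.0, 58.468, 6.944)     # inlet temperature, coded
```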
Variable Selection
Variable Selection

Suppose an experimenter has obtained data on a response variable y as well as on p candidate predictors x1, ..., xp. How can a best (in some sense) model involving a subset of these predictors be selected?

We know that as predictors are added one by one into a model, SSE cannot increase (a larger model cannot explain less variation than a smaller one) and will usually decrease, albeit perhaps by a small amount.
Variable Selection

There is no mystery as to which model gives the largest R² value: it is the one containing all p predictors. What we'd really like is a model involving relatively few predictors that is easy to interpret and use, yet explains a relatively large amount of the observed y variation.

For any fixed number of predictors (e.g., 5), it is reasonable to identify the best model of that size as the one with the largest R² value, or equivalently the smallest value of SSE. The more difficult issue concerns selection of a criterion that will allow for comparison of models of different sizes.
Variable Selection

Use the subscript k to denote a quantity computed from a model containing k predictors (e.g., SSEk). Three different criteria, each a simple function of SSEk, are widely used.

1. R²k, the coefficient of multiple determination for a k-predictor model. Because R²k will virtually always increase as k does (and can never decrease), we are not interested in the k that maximizes R²k. Instead, we wish to identify a small k for which R²k is nearly as large as the R² for the model with all predictors.
Variable Selection

2. MSEk = SSEk/(n – k – 1), the mean squared error for a k-predictor model. This is often used in place of R²k because, although SSEk never increases with increasing k, a small decrease in SSEk obtained with one extra predictor can be more than offset by a decrease of 1 in the denominator of MSEk. The objective is then to find the model having minimum MSEk. Since adjusted R²k = 1 – MSEk/MST, where MST = SST/(n – 1) is constant in k, examination of adjusted R²k is equivalent to consideration of MSEk.
Variable Selection

3. The rationale for the third criterion, Ck, is more difficult to understand, but the criterion is widely used by data analysts. Suppose the true regression model is specified by m predictors, that is,

Y = β0 + β1x1 + ... + βmxm + ε,  V(ε) = σ²

so that

E(Y) = β0 + β1x1 + ... + βmxm
Variable Selection

Consider fitting a model using a subset of k of these m predictors; for simplicity, suppose we use x1, x2, ..., xk. Then, by solving the system of normal equations, estimates β̂0, β̂1, ..., β̂k are obtained (but not, of course, estimates of any βi corresponding to predictors not in the fitted model). The true expected value E(Y) can then be estimated by Ŷ = β̂0 + β̂1x1 + ... + β̂kxk.
Variable Selection

Now consider the normalized expected total error of estimation

Γk = E[Σj (Ŷj – E(Yj))²] / σ² = E(SSEk)/σ² + 2(k + 1) – n

The second equality must be taken on faith because it requires a tricky expected-value argument.

A particular subset is then appealing if its Γk value is small. Unfortunately, though, E(SSEk) and σ² are not known. To remedy this, let s² denote the estimate of σ² based on the model that includes all predictors for which data is available, and define

Ck = SSEk/s² + 2(k + 1) – n
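A sketch of the Ck computation, with a useful sanity check: for the model containing all predictors, SSEk/s² = n – (k + 1) by the definition of s², so Ck = k + 1 exactly. The numbers below use the gas-mileage figures from Example 21 later in these slides (s² = MSE10 = 6.24, n = 32, k = 10, so SSE10 = 6.24 × 21).

```python
def mallows_ck(sse_k, s2, n, k):
    """Mallows' C_k = SSE_k / s^2 + 2(k + 1) - n."""
    return sse_k / s2 + 2 * (k + 1) - n

# Full 10-predictor gas-mileage model: SSE10 = MSE10 * (n - k - 1)
s2 = 6.24
n, k = 32, 10
sse_full = s2 * (n - k - 1)
ck_full = mallows_ck(sse_full, s2, n, k)   # equals k + 1 = 11
```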
Variable Selection

A desirable model is then specified by a subset of predictors for which Ck is small.

The total number of models that can be created from predictors in the candidate pool is 2^p (because each of the p candidate predictors can be included in or left out of any particular model; one of these is the model that contains no predictors).
Variable Selection

If p ≤ 5, it would not be too tedious to examine all possible regression models involving these predictors using any good statistical software package. But the computational effort required to fit all possible models becomes prohibitive as the size of the candidate pool increases.

Several software packages have incorporated algorithms that sift through models of various sizes in order to identify the best one or more models of each particular size.
Variable Selection

Minitab, for example, will do this for up to 31 candidate predictors and allows the user to specify the number of models of each size (1, 2, 3, 4, or 5) that will be identified as having the best criterion values.

Why would we want to go beyond the best single model of each size?
Example 21

The review article by Ron Hocking listed in the chapter bibliography reports on an analysis of data taken from the 1974 issues of Motor Trend magazine. The dependent variable y was gas mileage, there were n = 32 observations, and the predictors for which data was obtained were:

x1 = engine shape (1 = straight and 0 = V),
x2 = number of cylinders,
x3 = transmission type (1 = manual and 0 = auto),
x4 = number of transmission speeds,
x5 = engine size,
x6 = horsepower,
x7 = number of carburetor barrels,
x8 = final drive ratio,
x9 = weight, and
x10 = quarter-mile time.
In the table, we present summary information from the analysis.

Table 13.10: Best Subsets for Gas Mileage Data of Example 21
Example 21

The table describes, for each k, the subset having minimum SSEk; reading down the variables column indicates which variable is added in going from k to k + 1 (going from k = 2 to k = 3, x3 and x10 are added, and x2 is deleted).

The figure contains plots of R²k, adjusted R²k, and Ck against k; these plots are an important visual aid in selecting a subset.

[Figure: R²k, adjusted R²k, and Ck plots for the gas mileage data]
Example 21

The estimate of σ² is s² = 6.24, which is MSE10. A simple model that rates highly according to all criteria is the one containing predictors x3, x9, and x10.
Variable Selection

Generally speaking, when a subset of k predictors (k < m) is used to fit a model, the estimators β̂0, β̂1, ..., β̂k will be biased for β0, β1, ..., βk, and Ŷ will also be a biased estimator for the true E(Y) (all this because m – k predictors are missing from the fitted model).

However, as measured by the total normalized expected error Γk, estimates based on a subset can provide more precision than would be obtained using all possible predictors; essentially, this greater precision is obtained at the price of introducing bias in the estimators.
Variable Selection

A value of k for which Ck ≈ k + 1 indicates that the bias associated with this k-predictor model would be small.

An alternative to this procedure is forward selection (FS). FS starts with no predictors in the model and considers fitting in turn the model with only x1, only x2, ..., and finally only xm. The variable that, when fit, yields the largest absolute t ratio enters the model, provided that the ratio exceeds the specified constant tin.
Variable Selection

Suppose x1 enters the model. Then models with (x1, x2), (x1, x3), ..., (x1, xm) are considered in turn. The largest absolute t ratio then specifies the entering predictor, provided that this maximum also exceeds tin.

This continues until at some step no absolute t ratios exceed tin; the entered predictors then specify the model. The value tin = 2 is often used for the same reason that tout = 2 is used in backward elimination (BE).
Variable Selection

For the tar content data, FS resulted in the sequence of models given in Steps 5, 4, ..., 1 in the table. This will not always be the case.

Table 13.11: Backward Elimination Results for the Data of Example 20
Variable Selection

The stepwise procedure most widely used is a combination of FS and BE, denoted by FB. This procedure starts as does forward selection, by adding variables to the model, but after each addition it examines those variables previously entered to see whether any is a candidate for elimination.

For example, if there are eight predictors under consideration and the current set consists of x2, x3, x5, and x6, with x5 having just been added, the t ratios for x2, x3, and x6 are examined.
Variable Selection

If the smallest absolute ratio is less than tout, then the corresponding variable is eliminated from the model (some software packages base decisions on f = t²).

The idea behind FB is that, with forward selection, a single variable may be more strongly related to y than either of two or more other variables is individually, but the combination of those variables may make the single variable subsequently redundant.
Variable Selection

This actually happened with the gas-mileage data discussed previously, with x2 entering and subsequently leaving the model.

Although in most situations these automatic selection procedures will identify a good model, there is no guarantee that the best or even a nearly best model will result. Close scrutiny should be given to data sets for which there appear to be strong relationships among some of the potential predictors; we will say more about this shortly.
Identification of Influential Observations

In simple linear regression, it is easy to spot an observation whose x value is much larger or much smaller than the other x values in the sample. Such an observation may have a great impact on the estimated regression equation (whether it actually does depends on how far the point (x, y) falls from the line determined by the other points in the scatter plot).
Identification of Influential Observations

In multiple regression, it is also desirable to know whether the values of the predictors for a particular observation are such that it has the potential for exerting great influence on the estimated equation.
Identification of Influential Observations

One method for identifying potentially influential observations relies on the fact that, because each β̂i is a linear function of y1, ..., yn, each predicted y value of the form ŷ = β̂0 + β̂1x1 + ... + β̂kxk is also a linear function of the yj's. In particular, the predicted values corresponding to sample observations can be written as follows:

ŷj = h1j y1 + h2j y2 + ... + hnj yn,  j = 1, ..., n
Identification of Influential Observations

Each coefficient hij is a function only of the xij's in the sample and not of the yj's. It can be shown that hij = hji and that 0 ≤ hjj ≤ 1.

Focus on the "diagonal" coefficients h11, h22, ..., hnn. The coefficient hjj is the weight given to yj in computing the corresponding predicted value ŷj. This quantity can also be expressed as a measure of the distance between the point (x1j, ..., xkj) in k-dimensional space and the center of the data (x̄1, ..., x̄k).
Identification of Influential Observations

It is therefore natural to characterize an observation whose hjj is relatively large as one that has potentially large influence. Unless there is a perfect linear relationship among the k predictors, Σj hjj = k + 1, so the average of the hjj's is (k + 1)/n.

Some statisticians suggest that if hjj > 2(k + 1)/n, the jth observation should be cited as being potentially influential; others use 3(k + 1)/n as the dividing line.
Example 24

The accompanying data appeared in the article "Testing for the Inclusion of Variables in Linear Regression by a Randomization Technique" (Technometrics, 1966: 695–699) and was reanalyzed in Hoaglin and Welsch, "The Hat Matrix in Regression and ANOVA" (Amer. Statistician, 1978: 17–23).

The hij's (with elements below the diagonal omitted by symmetry) follow the data. Here k = 2, so (k + 1)/n = 3/10 = .3; since h44 = .604 > 2(.3), the fourth data point is identified as potentially influential.
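The rule of thumb used in this example is easy to encode (the function name is ours):

```python
def potentially_influential(h_jj, k, n, factor=2):
    """Flag observation j if its leverage h_jj exceeds factor*(k + 1)/n."""
    return h_jj > factor * (k + 1) / n

# Example 24: k = 2 predictors, n = 10 observations, threshold 2(.3) = .6
flag_obs4 = potentially_influential(0.604, k=2, n=10)      # h44 = .604 is flagged
flag_typical = potentially_influential(0.30, k=2, n=10)    # average leverage is not
```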
Identification of Influential Observations

Another technique for assessing the influence of the jth observation, one that takes into account yj as well as the predictor values, involves deleting the jth observation from the data set and performing a regression based on the remaining observations. If the estimated coefficients from the "deleted observation" regression differ greatly from the estimates based on the full data, the jth observation has clearly had a substantial impact on the fit.
Identification of Influential Observations

One way to judge whether estimated coefficients change greatly is to express each change relative to the estimated standard deviation of the coefficient:

(β̂i – β̂i(j)) / s_β̂i

where β̂i(j) denotes the estimate of βi computed with the jth observation deleted. There exist efficient computational formulas that allow all this information to be obtained from the "no-deletion" regression, so that the additional n regressions are unnecessary.