Subset Selection in Regression Analysis


Subset Selection in Regression Analysis

M.Sc. Project Report (Final Stage)

submitted in partial fulfilment of the requirements for the degree of

Master of Science

by

Ariful Islam Mondal
(02528019)

under the guidance of

Prof. Alladi Subramanyam
Department of Mathematics
IIT Bombay

Department of Mathematics
INDIAN INSTITUTE OF TECHNOLOGY, BOMBAY

April, 2004


Acknowledgment

I am extremely grateful to my guide, Prof. Alladi Subramanyam, for giving me such a nice project on applied statistics. Without his support and suggestions I could not have finished my project. Indeed, he devoted a lot of time to me. Moreover, this project helped me a lot in getting a good job. Lastly, I would like to thank my guide for everything he did for me, and also my friends and batchmates who spent time with me while I was typing the report.

April, 2004 Ariful Islam Mondal

Contents

1 Introduction
  1.0.1 Why Subset Selection?
  1.0.2 Uses of regression
  1.0.3 Organization of report and Future Plan

2 Multiple Linear Regression
  2.1 Multiple Regression Models
  2.2 Estimation of the Model Parameters by Least Squares
    2.2.1 Least Squares Estimation Method
    2.2.2 Geometrical Interpretation of Least Squares
    2.2.3 Properties of the Least Squares Estimators
    2.2.4 Estimation of σ2
  2.3 Confidence Interval in Multiple Regression
    2.3.1 Confidence Interval on the Regression Coefficients
    2.3.2 Confidence Interval Estimation of the Mean Response
  2.4 Gauss-Markov Conditions
    2.4.1 Mean and Variance of Estimates Under G-M Conditions
    2.4.2 The Gauss-Markov Theorem
  2.5 Maximum Likelihood Estimator (MLE)
  2.6 Explanatory Power − Goodness of Fit
  2.7 Testing of Hypothesis in Multiple Linear Regression
    2.7.1 Test for Significance of Regression
    2.7.2 Tests on Individual Regression Coefficients
    2.7.3 Special Cases of Orthogonal Columns in X
    2.7.4 Likelihood Ratio Test

3 Regression Diagnosis and Measures of Model Adequacy
  3.1 Residual Analysis
    3.1.1 Definition of Residuals
    3.1.2 Estimates
    3.1.3 Estimates of σ2
    3.1.4 Coefficient of Multiple Determination
    3.1.5 Methods of Scaling Residuals
    3.1.6 Residual Plots

4 Subset Selection and Model Building
  4.1 Model Building Or Formulation Of The Problem
  4.2 Consequences Of Variable Deletion
    4.2.1 Properties of β̄p
  4.3 Criteria for Evaluating Subset Regression Models
    4.3.1 Coefficient of Multiple Determination
    4.3.2 Adjusted R2
    4.3.3 Residual Mean Square
    4.3.4 Mallows' Cp-Statistics
  4.4 Computational Techniques For Variable Selection
    4.4.1 All Possible Regression
    4.4.2 Directed Search on t
    4.4.3 Stepwise Variable Selection

5 Dealing with Multicollinearity
  5.1 Sources of Multicollinearity
    5.1.1 Effects Of Multicollinearity
  5.2 Multicollinearity Diagnosis
    5.2.1 Estimation of the Correlation Matrix
    5.2.2 Variance Inflation Factors
    5.2.3 Eigensystem Analysis of X′X
    5.2.4 Other Diagnostics
  5.3 Methods for Dealing with Multicollinearity
    5.3.1 Collecting Additional Data
    5.3.2 Model Respecification
    5.3.3 Ridge Regression
    5.3.4 Principal Components Regression

6 Ridge Regression
  6.1 Ridge Estimation
  6.2 Methods for Choosing k
  6.3 Ridge Regression and Variable Selection
  6.4 Ridge Regression: Some Remarks

7 Better Subset Regression Using the Nonnegative Garrote
  7.1 Nonnegative Garrote
  7.2 Model Selection
    7.2.1 Prediction and Model Error
  7.3 Estimation of Error
    7.3.1 X-Controlled Estimates
    7.3.2 X-Random Estimates
  7.4 X-Orthonormal
  7.5 Conclusions and Remarks

8 Subset Auto Regression
  8.1 Method of Estimation
  8.2 Pagano's Algorithm
  8.3 Computation of the Increase in the Residual Variance
  8.4 Conclusion

A Data Analysis
  A.1 Correlation Coefficients
  A.2 Forward Selection for Fitness Data
  A.3 Backward Elimination Procedure for Fitness Data
  A.4 Stepwise Selection for Fitness Data
  A.5 Nonnegative Garrote Estimation of Fitness Data
  A.6 Ridge Estimation for Fitness Data
  A.7 Conclusion


Chapter 1

Introduction

The study of the analysis of data aimed at discovering how one or more variables (called independent variables, predictor variables or regressors) affect other variables (called dependent variables or response variables) is known as "regression analysis". Consider the model:

y = β0 + β1x + ε (1.1)

This is a simple linear regression model, where x is an independent variable and y is a dependent variable or response variable. In general, the response may be related to p regressors x1, x2, · · · , xp, so that

y = β0 + β1x1 + β2x2 + · · ·+ βpxp (1.2)

This is called a multiple linear regression model. The adjective linear is employed to indicate that the model is linear in the parameters β0, β1, · · · , βp, and not because y is a linear function of the x's. Many models in which y is related to the x's in a non-linear fashion can still be treated as linear regression models as long as the equation is linear in the β's.

An important objective of regression analysis is to estimate the unknown parameters in the regression model. This process is also called fitting the model to the data. The next phase of regression analysis is called model adequacy checking, in which the appropriateness of the model is studied and the quality of the fit ascertained. Through such analysis the usefulness of the regression model may be determined. The outcome of the adequacy checking may indicate either that the model is reasonable or that the original fit must be modified.

1.0.1 Why Subset Selection?

In most practical problems the analyst has a pool of candidate regressors that should include all the influential factors, but the actual subset of regressors to be used in the model needs to be determined. Finding an appropriate subset of regressors for the model is called the variable selection problem. Building a regression model that includes only a subset of the available regressors involves two conflicting objectives. On the one hand, we would like the model to include as many regressors as possible so that the "information content" in these factors can influence the predicted value of y. On the other hand, we want the model to include as few regressors as possible because the variance of the prediction increases as the number of regressors increases. Moreover, the more regressors there are in a model, the greater the cost of data collection and model maintenance. The process of finding a model that compromises between these two objectives is called selecting the "best" regression equation.

1.0.2 Uses of regression

Regression models are used for several purposes, such as

1. data description,

2. parameter estimation and prediction of events,

3. control.

Engineers and scientists frequently use equations to summarize or describe a set of data, and regression analysis is helpful in developing such equations. Sometimes parameter estimation problems can also be solved by regression techniques. Many applications of regression involve prediction of the response variable. Regression models may also be used for control purposes, for example in chemical engineering processes.

1.0.3 Organization of report and Future Plan

This report is divided into 8 chapters. In the third chapter, Regression Diagnosis and Measures of Model Adequacy from the first-stage report are included. Chapter 4 relates to subset selection using ordinary subset regression (forward, backward and stepwise methods). Multicollinearity is discussed in Chapter 5, ridge regression in Chapter 6, and in Chapter 7 the idea of an intermediate selection procedure (Leo Breiman's nonnegative garrote for subset selection) is given. In Chapter 8 we have tried to use subset selection criteria for order selection of an autoregressive scheme. Lastly, the data analysis is provided in the Appendix.

∗ ∗ ∗ ∗ ∗


Chapter 2

Multiple Linear Regression

A regression model that involves more than one regressor variable is called a multiple regression model. Fitting and analyzing these models is discussed in this chapter. We shall also discuss measures of model adequacy that are useful in multiple regression.

2.1 Multiple Regression Models

Suppose that n > p observations are available, and let yi denote the ith observed response and xij denote the ith observation (or level) of regressor xj, i = 1, 2, · · · , n and j = 1, 2, · · · , p. Then the general multiple linear regression model can be written as

y = Xβ + ε (2.1)

where

y = (y1, y2, · · · , yn)′, β = (β0, β1, · · · , βp)′, ε = (ε1, ε2, · · · , εn)′,

and X is the matrix whose ith row is (1, xi1, xi2, · · · , xip), i = 1, 2, · · · , n. In general, y is an n × 1 vector of the observations, X is an n × (p + 1) matrix of the levels of the regressor variables, β is a (p + 1) × 1 vector of the regression coefficients, and ε is an n × 1 vector of random errors.

2.2 Estimation of the Model Parameters by Least Squares

2.2.1 Least Squares Estimation Method

Model:

y = Xβ + ε

We assume that the error term ε in the model has E(ε) = 0 and V(ε) = σ2I, and that the errors are uncorrelated.

We wish to find the vector of least squares (LS) estimators, β̂, that minimizes

S(β) = ∑_{i=1}^{n} εi² = ε′ε = (y − Xβ)′(y − Xβ). (2.2)

The LS estimator β̂ must satisfy

∂S/∂β = −2X′y + 2X′Xβ̂ = 0, (2.3)

which simplifies to

X′Xβ̂ = X′y. (2.4)

These equations are called the least squares normal equations.

To find the solution of the normal equations we shall consider the following cases:

Case I: (X′X) is non-singular.

If X′X is non-singular, i.e. if no column of the X matrix is a linear combination of the other columns, i.e. if the regressors are linearly independent, then (X′X)−1 exists and the solution of the equations (2.4) is unique. Thus the least squares estimator of β is given by

β̂ = (X′X)−1X′y, (2.5)

provided that (X′X)−1 exists.
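As a small added illustration (not part of the original report), equation (2.5) can be evaluated numerically with Python/NumPy on hypothetical data; the names X, y and beta_hat below are illustrative only, and numpy.linalg.lstsq is preferred in practice over forming (X′X)−1 explicitly.

import numpy as np

# Hypothetical data: n = 6 observations, p = 2 regressors plus an intercept column.
rng = np.random.default_rng(0)
n, p = 6, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # n x (p+1) design matrix
y = rng.normal(size=n)                                      # response vector

# Normal-equations solution beta_hat = (X'X)^{-1} X'y, as in equation (2.5)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically safer equivalent: least squares via lstsq
beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat, beta_hat_lstsq)  # the two solutions agree when X'X is non-singular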

Case II: (X′X) is singular.

If (X′X) is singular then we can solve the normal equations X′Xβ = X′y by using a generalized inverse (g-inverse)¹ as follows:

β̂ = (X′X)−X′y, (2.6)

although the estimates are then not unique: (X′X)− denotes a g-inverse of (X′X), which is itself not unique.

The vector of predicted values ŷ, corresponding to the observed values y, is

ŷ = Xβ̂ (2.7)
  = X(X′X)−1X′y (2.8)
  = Hy. (2.9)

The n × n matrix H = X(X′X)−1X′ is usually called the "hat" matrix, because it maps the vector of observed values into the vector of fitted values, and its properties play a central role in regression analysis. Let M = I − H. Then

MX = (I − H)X = X − X(X′X)−1X′X = X − X = 0,

and the residuals are

e = y − ŷ, (2.10)

where e = (e1, e2, · · · , en)′ and

ŷ = (ŷ1, ŷ2, · · · , ŷn)′ = Xβ̂ = X(X′X)−1X′y. (2.11)

Therefore we can express the residuals e in terms of y or β as follows:

e = y − Hy = My = MXβ + Mε = Mε. (2.12)
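The hat matrix H, the matrix M = I − H and the residuals of equation (2.12) can be checked numerically; the sketch below is an added Python/NumPy illustration on hypothetical X and y, not part of the original derivation.

import numpy as np

def hat_matrix(X):
    """Return H = X (X'X)^{-1} X' for a full column rank design matrix X."""
    return X @ np.linalg.solve(X.T @ X, X.T)

def residuals(X, y):
    """Return e = (I - H) y = y - y_hat."""
    H = hat_matrix(X)
    M = np.eye(X.shape[0]) - H
    return M @ y

# Example with hypothetical data
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(8), rng.normal(size=(8, 2))])
y = rng.normal(size=8)
H = hat_matrix(X)
e = residuals(X, y)

# Checks mirroring the algebra above: MX = 0 and e = Me (M is idempotent)
M = np.eye(len(y)) - H
assert np.allclose(M @ X, 0)
assert np.allclose(e, M @ e)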

2.2.2 Geometrical Interpretation of Least Squares

An intuitive geometrical interpretation of least squares is sometimes helpful. We may think of the vector of observations y′ = [y1, y2, · · · , yn] as defining a vector from the origin to the point A in Figure 2.1. Note that y1, y2, · · · , yn form the coordinates of an n-dimensional sample space. The sample space in Figure 2.1 is three-dimensional.

¹ If A− is a g-inverse of A, then AA−A = A.


The matrix X consists of p + 1 column vectors, each n × 1, for example 1 (a column vector of 1's), x1, x2, · · · , xp. Each of these columns defines a vector from the origin in the sample space. These p + 1 vectors form a (p + 1)-dimensional subspace called the estimation space. The figure shows a two-dimensional estimation space. We may represent any point in this subspace by a linear combination of the vectors 1, x1, x2, · · · , xp. Thus any point in the estimation space is of the form Xβ. Let the vector Xβ determine the point B in Figure 2.1. The squared distance from B to A is just

S(β) = (y − Xβ)′(y − Xβ).

Therefore minimizing the squared distance from the point A, defined by the observation vector y, to the estimation space requires finding the point in the estimation space that is closest to A. The squared distance will be a minimum when that point is the foot of the perpendicular from A to the estimation space. This is point C in Figure 2.1, defined by the fitted vector Xβ̂.

Figure 2.1: A geometrical representation of least squares

Since the residual vector y − ŷ = y − Xβ̂ is perpendicular to the estimation space, we may write

X′(y − Xβ̂) = 0, or X′Xβ̂ = X′y,


which we recognize as the least squares normal equations.

2.2.3 Properties of the Least Squares Estimators

The statistical properties of the least squares estimator β̂ may be easily demonstrated. Consider first bias:

E(β̂) = E[(X′X)−1X′y] = E[(X′X)−1X′(Xβ + ε)] = E[(X′X)−1X′Xβ + (X′X)−1X′ε] = β,

since E(ε) = 0. Thus β̂ is an unbiased estimator of β.

The covariance matrix of β̂ is given by

Cov(β̂) = σ2(X′X)−1.

Therefore if we let C = (X′X)−1, the variance of β̂j is σ2Cjj and the covariance between β̂i and β̂j is σ2Cij.

The least squares estimator β̂ is the best linear unbiased estimator of β (the Gauss-Markov theorem). If we further assume that the errors εi are normally distributed, then β̂ is also the maximum likelihood estimator (MLE) of β. The maximum likelihood estimator is the minimum variance unbiased estimator of β.

2.2.4 Estimation of σ2

As in simple linear regression, we may develop an estimator of σ2 from the residual sum of squares

SSE = ∑_{i=1}^{n} (yi − ŷi)² = ∑_{i=1}^{n} ei² = e′e.

Substituting e = y − Xβ̂, we have

SSE = (y − Xβ̂)′(y − Xβ̂) = y′y − 2β̂′X′y + β̂′X′Xβ̂.

Since X′Xβ̂ = X′y, this last equation becomes

SSE = y′y − β̂′X′y. (2.13)

The residual sum of squares has n − p − 1 degrees of freedom associated with it, since p + 1 parameters are estimated in the regression model. The residual mean square is

MSE = SSE/(n − p − 1). (2.14)

We can show that the expected value of MSE is σ2, so an unbiased estimator of σ2 is given by

σ̂2 = MSE. (2.15)

This estimator of σ2 is model dependent.
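Continuing the added Python illustration (hypothetical X, y, not from the report), SSE and MSE of equations (2.13)-(2.15) can be computed as follows.

import numpy as np

def mse(X, y):
    """Unbiased estimate of sigma^2: MSE = SSE / (n - p - 1), equations (2.14)-(2.15)."""
    n, k = X.shape                      # k = p + 1 columns (intercept + p regressors)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    sse = y @ y - beta_hat @ (X.T @ y)  # SSE = y'y - beta_hat' X'y, equation (2.13)
    return sse / (n - k)                # n - p - 1 degrees of freedom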

2.3 Confidence Interval in Multiple Regression

Confidence intervals on individual regression coefficients and confidence intervals on the mean response at specified levels of the regressors play the same important role in multiple regression that they do in simple linear regression. This section develops one-at-a-time confidence intervals for these cases.

2.3.1 Confidence Interval on the Regression Coefficients

To construct confidence interval estimates for the regression coefficients βj, we must assume that the errors εi are normally and independently distributed with mean zero and variance σ2. Therefore the observations yi are normally and independently distributed with mean β0 + ∑_{j=1}^{p} βj xij and variance σ2. Since the least squares estimator β̂ is a linear combination of the observations, it follows that β̂ is normally distributed with mean vector β and covariance matrix σ2(X′X)−1. This implies that the marginal distribution of any coefficient estimate β̂j is normal with mean βj and variance σ2Cjj, where Cjj is the jth diagonal element of the (X′X)−1 matrix. Consequently each of the statistics

(β̂j − βj)/√(σ̂2 Cjj), j = 0, 1, . . . , p, (2.16)

is distributed as t with n − p − 1 degrees of freedom, where σ̂2 is the estimate of the error variance obtained from (2.15). Therefore a 100(1 − α)% confidence interval for the regression coefficient βj, j = 0, 1, . . . , p, is

β̂j − t_{α/2, n−p−1} √(σ̂2 Cjj) ≤ βj ≤ β̂j + t_{α/2, n−p−1} √(σ̂2 Cjj). (2.17)

We usually call the quantity

se(β̂j) = √(σ̂2 Cjj) (2.18)

the standard error of the regression coefficient β̂j.
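Equation (2.17) can be evaluated directly; the sketch below is an added Python/SciPy illustration on hypothetical X and y, returning the one-at-a-time 100(1 − α)% intervals for all coefficients.

import numpy as np
from scipy import stats

def coef_confidence_intervals(X, y, alpha=0.05):
    """One-at-a-time CIs: beta_hat_j -/+ t_{alpha/2, n-p-1} * sqrt(sigma2_hat * C_jj)."""
    n, k = X.shape                                   # k = p + 1
    C = np.linalg.inv(X.T @ X)
    beta_hat = C @ X.T @ y
    sigma2_hat = (y @ y - beta_hat @ (X.T @ y)) / (n - k)
    se = np.sqrt(sigma2_hat * np.diag(C))            # standard errors, equation (2.18)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - k)
    return np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])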


2.3.2 Confidence Interval Estimation of the Mean Response

We may construct a confidence interval on the mean response at a particular point, such as x01, x02, · · · , x0p. Define the vector x0 as

x0 = (1, x01, x02, · · · , x0p)′.

The fitted value at this point is

ŷ0 = x0′β̂. (2.19)

This is an unbiased estimator of the mean response at x0, since E(ŷ0) = x0′β, and the variance of ŷ0 is

V(ŷ0) = σ2 x0′(X′X)−1x0. (2.20)

Therefore a 100(1 − α)% confidence interval on the mean response at the point x01, x02, · · · , x0p is

ŷ0 − t_{α/2, n−p−1} √(σ̂2 x0′(X′X)−1x0) ≤ E(y0) ≤ ŷ0 + t_{α/2, n−p−1} √(σ̂2 x0′(X′X)−1x0). (2.21)

2.4 Gauss−Markov Conditions

In order for estimates of β to have some desirable statistical properties, we need the following assumptions, called the Gauss-Markov (G-M) conditions, which have already been introduced:

E(ε) = 0, (2.22)

E(εε′) = σ2I. (2.23)

We shall use these conditions repeatedly in the sequel.

Note that the G-M conditions imply that

E(y) = Xβ (2.24)

and

Cov(y) = E[(y − Xβ)(y − Xβ)′] = E(εε′) = σ2I. (2.25)

It also follows that (see (2.12))

E(ee′) = M E[εε′] M = σ2M, (2.26)

since M = (I − H) is symmetric and idempotent. Therefore,

Var(ei) = σ2 mii = σ2[1 − hii], (2.27)

where mij and hij are the (i, j)th elements of M and H respectively. Because a variance is non-negative and a covariance matrix is at least positive semi-definite, it follows that hii ≤ 1 and M is at least positive semi-definite.


2.4.1 Mean and Variance of Estimates Under G-M Conditions

From equation (2.22),

E(β̂) = β. (2.28)

Now we know that if, for any parameter θ, its estimate T has the property that E(T) = θ, then T is an unbiased estimator of θ. Thus under the G-M conditions, β̂ is an unbiased estimator of β. Note that we only use the first G-M condition (2.22) to prove this; therefore violation of condition (2.23) will not lead to bias. Further, under the G-M condition (2.23),

Cov(β̂) = σ2(X′X)−1. (2.29)

2.4.2 The Gauss-Markov Theorem

In most applications of regression we are interested in estimates of some linear function Lβ or l′β of β, where l is a vector and L is a matrix. Estimates of this type include the predicted values ŷi, the estimate ŷ0 of a future observation, and even β̂ itself. We consider here l′β.

Although there may be several possible estimators, we shall confine ourselves to linear estimators, i.e. estimators that are linear functions of y1, y2, · · · , yn, say c′y. We also require that these linear functions be unbiased estimators of l′β and assume that such linear unbiased estimators for l′β exist; l′β is then called estimable.

In the following theorem we show that among all linear unbiased estimators, the least squares estimator l′β̂ = l′(X′X)−1X′y, which is also a linear function of y1, y2, · · · , yn and which is unbiased for l′β, has the smallest variance. That is, Var(l′β̂) ≤ Var(c′y) for all c such that E(c′y) = l′β. Such an estimator is called a best linear unbiased estimator (BLUE).

Theorem 2.4.1 (Gauss-Markov) Let β̂ = (X′X)−1X′y and y = Xβ + ε. Then under the G-M conditions, the estimator l′β̂ of the estimable function l′β is BLUE.

Before proving the G-M theorem, let X1, X2, · · · , Xn be a random sample of size n from the distribution f(x, θ), where θ ∈ Rk. Now let

T = (T1, T2, · · · , Tn)′ = (T1(X1, X2, · · · , Xn), T2(X1, X2, · · · , Xn), · · · , Tn(X1, X2, · · · , Xn))′.

If E(T) = θ, then we say that T is an unbiased estimator for θ. Again, let T and S be two unbiased estimators for θ. Then we say that T is better than S if Cov(S) − Cov(T) is non-negative definite for all θ ∈ Ω and all such unbiased S, where Cov(T) denotes the variance-covariance matrix of the estimator T.


Proof of G-M Theorem:

Model:

y = Xβ + ε, (2.30)

with E(ε) = 0 and V(ε) = σ2I. We shall concentrate only on linear combinations of y. Then

(i) a′y is an unbiased estimator of a′Xβ, and
(ii) X′y is an unbiased estimator of X′Xβ.

Now let M′y be any other unbiased estimator of X′Xβ. Then

M′Xβ = X′Xβ for all β  ⇒  M′X = X′X.

M′y can be written as

M′y = (M′y − X′y) + X′y.

Let u = M′y − X′y. Since Cov(u, X′y) = σ2(M′X − X′X) = 0, the variance-covariance matrices satisfy

Cov(M′y) = Cov(u) + Cov(X′y)  ⇒  Cov(M′y) − Cov(X′y) = Cov(u),

which is non-negative definite. Hence X′y is the optimal unbiased estimator of X′Xβ. If X′X is non-singular, then

β̂ = (X′X)−1X′y.

Now suppose u′X′y and v′X′y are two BLUEs of λ′β; we shall show that the BLUE is unique. We have

E(u′X′y) = λ′β = E(v′X′y)  ⇒  u′X′Xβ = λ′β = v′X′Xβ for all β  ⇒  u′X′X = λ′ = v′X′X,

i.e. E((u′ − v′)X′y) = 0. Now since u′X′y is BLUE, V(u′X′y) ≤ V(v′X′y) for all v ≠ u, and since v′X′y is also a BLUE, V(v′X′y) ≤ V(u′X′y) for all u ≠ v. Therefore V(u′X′y) = V(v′X′y), and hence the BLUE is unique.

Alternative proof of the G-M theorem:
Let c′y be another linear unbiased estimator of the (estimable) function l′β. Since c′y is an unbiased estimator of l′β, l′β = E(c′y) = c′Xβ for all β, and hence we have

c′X = l′. (2.31)

Now,

Var(c′y) = c′Cov(y)c = c′(σ2I)c = σ2c′c,

and

Var(l′β̂) = l′Cov(β̂)l = σ2 l′(X′X)−1l = σ2 c′X(X′X)−1X′c,

from (2.28), (2.29) and (2.31). Therefore

Var(c′y) − Var(l′β̂) = σ2[c′c − c′X(X′X)−1X′c] = σ2 c′[I − X(X′X)−1X′]c ≥ 0,

since I − X(X′X)−1X′ = M is positive semi-definite (see Section 2.4). This proves the theorem.

A slight generalization of the Gauss-Markov theorem is the following:

Theorem 2.4.2 Under the G-M conditions, the estimator Lβ̂ of the estimable function Lβ is BLUE in the sense that

Cov(Cy) − Cov(Lβ̂)

is positive semi-definite, where L is an arbitrary matrix and Cy is any other unbiased linear estimator of Lβ.

This theorem implies that if we wish to estimate several (possibly related) linear functions of the βj's, we cannot do better (in a BLUE sense) than use the least squares estimates.

Proof: As in the proof of the Gauss-Markov theorem, the unbiasedness of Cy yields Lβ = C E(y) = CXβ for all β, whence L = CX, and since Cov(Cy) = σ2CC′ and

Cov(Lβ̂) = σ2L(X′X)−1L′ = σ2CX(X′X)−1X′C′,

it follows that

Cov(Cy) − Cov(Lβ̂) = σ2C[I − X(X′X)−1X′]C′,

which is positive semi-definite since, as shown before, the matrix [I − X(X′X)−1X′] = M is positive semi-definite.

2.5 Maximum Likelihood Estimator (MLE)

Model:

y = Xβ + ε.

We assume that the Gauss-Markov conditions hold and that the yi's are normally distributed; that is, the error terms εi are independent and identically distributed as normal with mean zero and variance σ2, i.e. εi ~ iid N(0, σ2), i.e. ε ~ N(0, σ2I). Then the probability density function of y1, y2, · · · , yn is given by

(2πσ2)^{−n/2} exp[−(y − Xβ)′(y − Xβ)/(2σ2)]. (2.32)

The same probability density function, when considered as a function of β and σ2 given the observations y1, y2, · · · , yn, i.e. f(β, σ2 | y), is called the likelihood function and is denoted by L(β, σ2 | y). The maximum likelihood estimates of β and σ2 are obtained by maximizing L(β, σ2 | y) with respect to β and σ2. Since log[z] is an increasing function of z, the same maximum likelihood estimates can be found by maximizing the logarithm of L.

Since maximizing (2.32) with respect to β is equivalent to minimizing (y − Xβ)′(y − Xβ), the maximum likelihood estimate of β is the same as the least squares estimate; i.e., it is β̂ = (X′X)−1X′y. The maximum likelihood estimate of σ2, obtained by equating to zero the derivative of the log-likelihood with respect to σ2 after substituting β by β̂, is

σ̂2 = (1/n)(y − Xβ̂)′(y − Xβ̂) = (1/n) e′e. (2.33)

To obtain the maximum likelihood estimate of β under the constraints Cβ = γ we need to minimize (2.33) subject to Cβ = γ. This is equivalent to minimizing (y − Xβ)′(y − Xβ) subject to Cβ − γ = 0.

2.6 Explanatory Power − Goodness of Fit

In this section we discuss a measure of how well our model explains the data, that is, some measure of goodness of fit. One way to approach this is to ask what a good model would look like. A good model should make almost no mistakes, so

y ≈ Xβ̂.

Therefore our estimated errors, or residuals, provide a useful measure of how well our model approximates the data. It would also be useful if we could scale this quantity so that the value associated with the goodness of fit has some meaning.

To see this, let us examine our estimated equation and break it into two parts: the explained portion Xβ̂ and the unexplained portion e. Ideally, we would like the unexplained portion to be negligible. Hence our estimated model is

y = Xβ̂ (explained portion) + e (unexplained portion). (2.34)

To gauge how well our model fits, we look at the sum of squares of our dependent variable y after the least squares fit, so as to get a sense of the variation of y; this naturally leads to the variation of the explained and unexplained portions of our regression. So,

y′y = (Xβ̂ + e)′(Xβ̂ + e).

This gives

y′y = β̂′X′Xβ̂ + e′Xβ̂ + β̂′X′e + e′e. (2.35)

But we know from the normal equations that X′e = 0, so its transpose e′X must also be zero. Therefore

y′y = β̂′X′Xβ̂ + e′e, (2.36)

which has a good intuitive meaning:

SST = SSR + SSE, where SST = y′y, SSR = β̂′X′Xβ̂ and SSE = e′e, (2.37)

where SST stands for the total sum of squares, which gives us an idea of the total variation; SSR stands for the regression sum of squares, which tells us how much of the variation comes from the explained portion; and SSE stands for the error sum of squares, which tells us how much of the variation comes from the unexplained portion. If we divide through by SST, we get

1 = β̂′X′Xβ̂/(y′y) + e′e/(y′y), (2.38)

so that SSR/SST (the percentage of the variation of y explained by X) and SSE/SST (the percentage of the variation of y that is unexplained) must sum to one. Hence, if we define a statistic R2 to measure how well our model fits, i.e. the percentage of the variation of y explained by X, we get

R2 = 1 − e′e/(y′y) = β̂′X′Xβ̂/(y′y). (2.39)

So a high R2, say 0.95, says that just about all the variation in y is being explained by X, whereas a low R2, say 0.05, says that we cannot explain much of the variation of y.
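Equation (2.39) uses the uncentered sums of squares y′y, β̂′X′Xβ̂ and e′e. A short added Python sketch of this decomposition (hypothetical X and y, not part of the report) is:

import numpy as np

def r_squared_uncentered(X, y):
    """R^2 = 1 - e'e / y'y = beta_hat' X'X beta_hat / y'y, as in equation (2.39)."""
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ beta_hat
    sst = y @ y              # total (uncentered) sum of squares
    sse = e @ e              # unexplained portion
    return 1 - sse / sst

# Note: most software reports a centered R^2 (deviations from the mean of y);
# the uncentered version above follows the decomposition used in this chapter.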

2.7 Testing of Hypothesis in Multiple Linear Regression

In multiple regression problems certain tests of hypotheses about the model parameters are useful in measuring model adequacy. In this section we describe several important hypothesis-testing procedures. We continue to require the normality assumption on the errors introduced in the previous sections.


2.7.1 Test for Significance of Regression

The test for significance of regression is a test to determine whether there is a linear relationship between the response y and any of the regressor variables x1, x2, · · · , xp. The appropriate hypotheses are

H0 : β0 = β1 = · · · = βp = 0 (2.40)

against

H1 : βj ≠ 0 for at least one j.

Rejection of H0 implies that at least one of the regressors x1, x2, · · · , xp contributes significantly to the model.

Test procedure:
The total sum of squares SST is partitioned into a sum of squares due to regression and a residual sum of squares:

SST = SSR + SSE, (2.41)

and if H0 is true, then SSR/σ2 ~ χ2 with p + 1 degrees of freedom, where the degrees of freedom (d.f.) for χ2 equal the number of regressor variables including the constant, in other words the number of parameters to be estimated. We can also show that SSE/σ2 ~ χ2 with n − p − 1 d.f., and that SSE and SSR are independent. The test procedure for H0 is to compute

F0 = [SSR/(p + 1)] / [SSE/(n − p − 1)] = MSR/MSE (2.42)

and reject H0 if F0 > F_{α, (p+1), (n−p−1)}. The procedure is usually summarized in an analysis-of-variance table such as Table 2.1.

Table 2.1 - Analysis of Variance for Significance of Regression in Multiple Regression

Source of Variation   d.f.        Sum of Squares   Mean Squares   F0
Regression            p + 1       SSR              MSR            MSR/MSE
Residual              n − p − 1   SSE              MSE
Total                 n           SST

2.7.2 Tests on Individual Regression Coefficients

The testing of hypotheses on the individual regression coefficients is helpful in determining the value of each of the regressors in the model. For example, the model might be more effective with the inclusion of additional regressors or perhaps with the deletion of one or more regressors presently in the model.

Adding a variable to a regression model always causes the sum of squares for regression to increase and the residual sum of squares to decrease. We must decide whether the increase in the regression sum of squares is sufficient to warrant using the additional regressor in the model. The addition of a regressor also increases the variance of the fitted value ŷ, so we must be careful to include only regressors that are of real value in explaining the response. Furthermore, adding an unimportant regressor may increase the residual mean square, which may decrease the usefulness of the model.

The hypotheses for testing the significance of any individual regression coefficient, such as βj, are

H0 : βj = 0 against H1 : βj ≠ 0. (2.43)

If H0 : βj = 0 is not rejected, this indicates that the regressor xj can be deleted from the model. The test statistic for this hypothesis is

t0 = β̂j/√(σ̂2 Cjj) = β̂j/se(β̂j), (2.44)

where Cjj is the diagonal element of (X′X)−1 corresponding to β̂j. The null hypothesis H0 : βj = 0 is rejected if |t0| > t_{α/2, n−p−1}. Note that this is really a partial or marginal test because the regression coefficient β̂j depends on all the other regressor variables xi (i ≠ j) that are in the model. Thus this is a test of the contribution of xj given the other regressors in the model.

We may also directly determine the contribution to the regression sum of squares of a regressor, for example xj, given that the other regressors xi (i ≠ j) are included in the model, by using the "extra sum of squares" method. This procedure can also be used to investigate the contribution of a subset of the regressor variables to the model. Consider the regression model with p regressors

y = Xβ + ε, (2.45)

where y is an n × 1 vector of the observations, X is an n × (p + 1) matrix of the levels of the regressor variables, β is a (p + 1) × 1 vector of the regression coefficients, and ε is an n × 1 vector of random errors. We would like to determine whether some subset of r < p regressors contributes significantly to the regression model. Let the vector of regression coefficients be partitioned as follows:

β = (β1′, β2′)′,

where β1 is (p − r + 1) × 1 and β2 is r × 1. We want to test the hypotheses

H0 : β2 = 0 against H1 : β2 ≠ 0. (2.46)

The model can be written as

y = Xβ + ε = X1β1 + X2β2 + ε, (2.47)

where the n × (p − r + 1) matrix X1 represents the columns of X associated with β1 and the n × r matrix X2 represents the columns of X associated with β2. This is called the full model.

For the full model, we know that β̂ = (X′X)−1X′y. The regression sum of squares for this model is

SSR(β) = β̂′X′y (p + 1 d.f.)

and

MSE = (y′y − β̂′X′y)/(n − p − 1).

To find the contribution of the terms in β2 to the regression, fit the model assuming that the null hypothesis H0 : β2 = 0 is true. This reduced model is

y = X1β1 + ε. (2.48)

The least squares estimator of β1 in the reduced model is β̂1 = (X1′X1)−1X1′y, and the corresponding regression sum of squares is

SSR(β1) = β̂1′X1′y (p − r + 1 d.f.). (2.49)

The regression sum of squares due to β2 given that β1 is already in the model is

SSR(β2 | β1) = SSR(β) − SSR(β1), (2.50)

with (p + 1) − (p − r + 1) = r degrees of freedom. This sum of squares is called the extra sum of squares due to β2 because it measures the increase in the regression sum of squares that results from adding the regressors xp−r+1, xp−r+2, · · · , xp to a model that already contains x1, x2, · · · , xp−r. Now SSR(β2 | β1) is independent of MSE, and the null hypothesis H0 : β2 = 0 may be tested by the statistic

F0 = [SSR(β2 | β1)/r] / MSE. (2.51)

If F0 > F_{α, r, n−p−1}, we reject H0, concluding that at least one of the parameters in β2 is not zero and consequently that at least one of the regressors xp−r+1, xp−r+2, · · · , xp in X2 contributes significantly to the regression model.

Some authors call the above test a partial F-test because it measures the contribution of the regressors in X2 given that the other regressors in X1 are in the model. To illustrate the usefulness of this procedure, consider the model

y = β0 + β1x1 + β2x2 + β3x3 + ε. (2.52)

The sums of squares

SSR(β1 | β0, β2, β3), SSR(β2 | β0, β1, β3) and SSR(β3 | β0, β1, β2)

are single-degree-of-freedom sums of squares that measure the contribution of each regressor xj, j = 1, 2, 3, to the model given that all the other regressors were already in the model. That is, we are assessing the value of adding xj to a model that did not include this regressor. In general, we could find

SSR(βj | β0, β1, · · · , βj−1, βj+1, · · · , βp), 1 ≤ j ≤ p,

which is the increase in the regression sum of squares due to adding xj to a model that already contains x1, · · · , xj−1, xj+1, · · · , xp. Some find it helpful to think of this as measuring the contribution of xj as if it were the last variable added to the model.
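The extra-sum-of-squares (partial F) test of equations (2.50)-(2.51) can be coded directly: fit the full and the reduced model, difference their regression sums of squares, and compare with F_{α, r, n−p−1}. The sketch below is an added Python/SciPy illustration; X_full and X_reduced are hypothetical design matrices sharing the columns of X1.

import numpy as np
from scipy import stats

def regression_ss(X, y):
    """SSR(beta) = beta_hat' X'y for the model with design matrix X."""
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    return beta_hat @ (X.T @ y)

def partial_f_test(X_full, X_reduced, y):
    """Test H0: the regressors in X_full but not in X_reduced have zero coefficients."""
    n, k_full = X_full.shape            # k_full = p + 1
    r = k_full - X_reduced.shape[1]     # number of regressors being tested
    sse_full = y @ y - regression_ss(X_full, y)
    mse_full = sse_full / (n - k_full)
    extra_ss = regression_ss(X_full, y) - regression_ss(X_reduced, y)  # SSR(beta2 | beta1)
    f0 = (extra_ss / r) / mse_full
    return f0, stats.f.sf(f0, r, n - k_full)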

2.7.3 Special Cases of Orthogonal Columns in X

Consider the model (2.47),

y = Xβ + ε = X1β1 + X2β2 + ε.

The extra sum of squares method allows us to measure the effect of the regressors in X2 conditional on those in X1 by computing SSR(β2 | β1). In general, we cannot talk about finding the sum of squares due to β2, SSR(β2), without accounting for the dependence of this quantity on the regressors in X1. However, if the columns of X1 are orthogonal to the columns of X2, we can determine a sum of squares due to β2 that is free of any dependence on the regressors in X1.

To demonstrate this, form the normal equations X′Xβ̂ = X′y for the above model. The normal equations are

[ X1′X1   X1′X2 ] [ β̂1 ]   [ X1′y ]
[ X2′X1   X2′X2 ] [ β̂2 ] = [ X2′y ].

Now if the columns of X1 are orthogonal to the columns of X2, then X1′X2 = 0 and X2′X1 = 0, and the normal equations become

X1′X1β̂1 = X1′y,   X2′X2β̂2 = X2′y,

with solutions

β̂1 = (X1′X1)−1X1′y,   β̂2 = (X2′X2)−1X2′y.

Note that the least squares estimator of β1 is β̂1 regardless of whether or not X2 is in the model, and the least squares estimator of β2 is β̂2 regardless of whether or not X1 is in the model.

The regression sum of squares for the full model is

SSR(β) = β̂′X′y = [β̂1′, β̂2′] [X1′y; X2′y] = β̂1′X1′y + β̂2′X2′y
       = y′X1(X1′X1)−1X1′y + y′X2(X2′X2)−1X2′y. (2.53)

However, the normal equations form two separate sets, and for each set we note that

SSR(β1) = β̂1′X1′y,   SSR(β2) = β̂2′X2′y, (2.54)

which implies

SSR(β) = SSR(β1) + SSR(β2). (2.55)

Therefore

SSR(β1 | β2) = SSR(β) − SSR(β2) = SSR(β1)

and

SSR(β2 | β1) = SSR(β) − SSR(β1) = SSR(β2).

Consequently, SSR(β1) measures the contribution of the regressors in X1 to the model unconditionally, and SSR(β2) measures the contribution of the regressors in X2 to the model unconditionally. Because we can unambiguously determine the effect of each regressor when the regressors are orthogonal, data collection experiments are often designed to have orthogonal variables.

2.7.4 Likelihood Ratio Test

Model:

y = Xβ + ε. (2.56)

Assumptions: the εi are iid, so that ε ~ N(0, σ2I).

The MLEs of β and σ2 are

β̂ = (X′X)−1X′y and σ̂2 = (1/n)(y − Xβ̂)′(y − Xβ̂) = (1/n) e′e.

Let us partition β as follows:

β = (β(1)′, β(2)′)′,

where β(1) is a vector of order q + 1 and β(2) is a vector of order p − q. Our aim is to test

H0 : β(2) = 0 against H1 : β(2) ≠ 0.

Now write X = [X(1) : X(2)]. Then the model becomes

y = X(1)β(1) + X(2)β(2) + ε, (2.57)

and under H0 the model becomes

y = X(1)β(1) + ε. (2.58)

Let

R02(X) = (y − Xβ̂)′(y − Xβ̂)

and

R02(X(1)) = (y − X(1)β̂(1))′(y − X(1)β̂(1)),

where β̂(1) = (X(1)′X(1))−1X(1)′y. Also let D = R02(X(1)) − R02(X) be the extra sum of squares.

Result 2.7.1 Let y = Xβ + ε, ε ~ Nn(0, σ2In) and β′ = (β(1)′ : β(2)′). Then the likelihood ratio (LR) test for testing H0 : β(2) = 0 is to reject H0 if

F = D/[(p − q)S2] = [R02(X(1)) − R02(X)]/[(p − q)S2] > F_{p−q, n−p−1}(α),

where S2 = R02(X)/(n − p − 1) and F_{m,n}(α) is the upper α point of the F distribution with (m, n) degrees of freedom.

Proof:
The likelihood function is

L(β, σ2) = (2πσ2)^{−n/2} exp[−(y − Xβ)′(y − Xβ)/(2σ2)]. (2.59)

Therefore

max_{β, σ2} L(β, σ2) = L(β̂, σ̂2) = (2πσ̂2)^{−n/2} e^{−n/2}. (2.60)

Now under H0,

β̂(1) = (X(1)′X(1))−1X(1)′y and σ̂2(1) = (1/n)(y − X(1)β̂(1))′(y − X(1)β̂(1)).

Therefore

max_{H0} L(β, σ2) = L(β̂(1), σ̂2(1)) = (2πσ̂2(1))^{−n/2} e^{−n/2}. (2.61)

Hence the LR test statistic is given by

λ = max_{H0} L(β, σ2) / max_{β, σ2} L(β, σ2) = (σ̂2(1)/σ̂2)^{−n/2} = (1 + (σ̂2(1) − σ̂2)/σ̂2)^{−n/2}.

Therefore λ is small when (σ̂2(1) − σ̂2)/σ̂2 is large, or equivalently when

(nσ̂2(1) − nσ̂2)/(nσ̂2) = [R02(X(1)) − R02(X)]/R02(X) ≥ c1, say,

or

[R02(X(1)) − R02(X)]/(p − q) divided by R02(X)/(n − p − 1), i.e. D/[(p − q)S2] > c2,

where c1 and c2 are constants. Therefore we reject H0 if T = D/[(p − q)S2] > c2. Under H0, T ~ F_{p−q, n−p−1}, and hence we reject H0 if T > F_{p−q, n−p−1}(α).


Chapter 3

Regression Diagnosis and Measures of Model Adequacy

Evaluating model adequacy is an important part of a multiple regression problem. This chapter presents several methods for measuring model adequacy. Many of these techniques are extensions of those used in simple linear regression.

The major assumptions that we have made so far in our study of regression analysis are as follows:

1. The relationship between y and X is linear, or at least it is well approximated by a straight line.

2. The error term ε has zero mean.

3. The error term εi has constant variance σ2 for all i.

4. The errors are uncorrelated.

5. The errors are normally distributed.

We should always consider the validity of these assumptions to be doubtful and conduct analyses to examine the adequacy of the model we have tentatively entertained. Gross violation of the assumptions may yield an unstable model, in the sense that a different sample could lead to a totally different model with opposite conclusions. We usually cannot detect departures from the underlying assumptions by examination of the standard summary statistics, such as the t- or F-statistics or R2. These are "global" model properties, and as such they do not ensure model adequacy.

In this chapter we present several methods useful for diagnosing and treating violations of the basic regression assumptions.


3.1 Residual Analysis

3.1.1 Definition of Residuals

We have defined the residuals as

e = y − ŷ = y − Xβ̂, (3.1)

viewed as the deviation between the data and the fit. The residuals measure the variability not explained by the regression model, and departures from the underlying assumptions on the errors should show up in the residuals. Analysis of residuals is an effective method for discovering several types of model deficiencies.

Properties

1. E(ei) = 0,∀i.

2. The approximate average variance is given by

e′e/(n − p − 1) = ∑_{i=1}^{n} ei²/(n − p − 1) = SSE/(n − p − 1) = MSE,

where the ei's are not independent (as we shall see later in this section).

3. Sometimes it is useful to work with the standardized residuals

di = ei/√MSE, ∀i, (3.2)

where E(di) = 0 and Var(di) ≈ 1.

The above equation scales the residuals by dividing them by their average standard deviation. In some regression data sets the residuals may have standard deviations that differ greatly. In simple linear regression,

V(ei) = V(yi − ŷi) = V(yi) + V(ŷi) − 2Cov(yi, ŷi) = σ2 + σ2[1/n + (xi − x̄)²/Sxx] − 2Cov(yi, ŷi).

Now we can show that

Cov(yi, ŷi) = Cov(yi, ȳ + (Sxy/Sxx)(xi − x̄)) = σ2[1/n + (xi − x̄)²/Sxx] ≠ 0.


Therefore the ei's are not independent.

The studentized residuals are defined as

ri = ei/√(MSE[1 − (1/n + (xi − x̄)²/Sxx)]), i = 1, 2, · · · , n. (3.3)

Notice that in equation (3.3) the ordinary least squares residual ei has been divided by its exact standard error, rather than by the average value as in the standardized residuals (see (3.2)). Studentized residuals are extremely useful in diagnostics. In small data sets the studentized residuals are often more appropriate than the standardized residuals because the differences in residual variances will be more dramatic.
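The standardized residuals (3.2) and, for the multiple regression case treated in Section 3.1.5, the studentized residuals based on the hat diagonal hii (equation (3.11) further below) can be computed together; the sketch below is an added Python/NumPy illustration on hypothetical X and y.

import numpy as np

def scaled_residuals(X, y):
    """Return (standardized, studentized) residuals d_i and r_i."""
    n, k = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)
    e = y - H @ y
    mse = (e @ e) / (n - k)
    d = e / np.sqrt(mse)                          # standardized residuals, equation (3.2)
    r = e / np.sqrt(mse * (1 - np.diag(H)))       # studentized residuals, cf. equation (3.11)
    return d, r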

3.1.2 Estimates

ŷ = Xβ̂ = X(X′X)−1X′y = Hy,

where β̂ = (X′X)−1X′y and H = X(X′X)−1X′ is a symmetric and idempotent matrix. We have seen that E(β̂) = β and Cov(β̂) = σ2(X′X)−1.

3.1.3 Estimates of σ2

The residual sum of squares is

SSE = e′e = (y − Xβ̂)′(y − Xβ̂) = y′y − β̂′X′y, with n − p − 1 d.f.,

and the residual mean square is

MSE = SSE/(n − p − 1). (3.4)

Then E(MSE) = σ2, i.e. σ̂2 = MSE. Therefore MSE is an unbiased estimator of σ2.

3.1.4 Coefficient of Multiple Determination

The coefficient of multiple determination R2 is defined as

R2 = SSR/SST = 1 − SSE/SST. (3.5)

It is customary to think of R2 as a measure of the reduction in the variability of y obtained by using x1, x2, · · · , xp. As in the simple linear regression case, we must have 0 ≤ R2 ≤ 1. However, a large value of R2 does not necessarily imply that the regression model is a good one. Adding a regressor to the model will always increase R2, regardless of whether or not the additional regressor contributes to the model. Thus it is possible for models that have large values of R2 to perform poorly in prediction or estimation.

The positive square root of R2 is the multiple correlation coefficient between y and the set of regressor variables x1, x2, · · · , xp. That is, R is a measure of the linear association between y and x1, x2, · · · , xp. We may also show that R2 is the square of the correlation between y and the vector of fitted values ŷ.

and the vector of fitted values y¯.

Adjusted R2

Some analysts prefer to use an adjusted R2 statistic because the ordinary R2 defined above will always increase (at least not decrease) when a new term is added to the regression model. We shall see that in variable selection and model building procedures it is helpful to have a criterion that can guard against overfitting the model, that is, adding terms that are unnecessary. The adjusted R2 penalizes the analyst who includes unnecessary variables in the model.

We define the adjusted R2, Ra2, by replacing SSE and SST in equation (3.5) by the corresponding mean squares; that is,

Ra2 = 1 − [SSE/(n − p − 1)]/[SST/n] = 1 − [n/(n − p − 1)](1 − R2). (3.6)
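Following equation (3.6) (note that this report divides SST by n, consistent with the uncentered total sum of squares of Table 2.1), an added Python sketch on hypothetical X and y is:

import numpy as np

def adjusted_r2(X, y):
    """R_a^2 = 1 - [SSE/(n - p - 1)] / [SST/n], equation (3.6)."""
    n, k = X.shape                           # k = p + 1
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    sse = y @ y - beta_hat @ (X.T @ y)
    sst = y @ y                              # uncentered total sum of squares
    r2 = 1 - sse / sst
    return 1 - (n / (n - k)) * (1 - r2)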

3.1.5 Methods of Scaling Residuals

I. Standardized and Studentized Residuals:

We have already introduced two types of scaled residuals, the standardized residuals

di = ei/√MSE, i = 1, 2, · · · , n,

and the studentized residuals. We now give a general development of the studentized residual scaling. Recall that

e = (I − H)y. (3.7)

H is symmetric (H′ = H) and idempotent (HH = H), and similarly the matrix (I − H) is symmetric and idempotent. Substituting y = Xβ + ε into the above equation yields

e = (I − H)(Xβ + ε) = Xβ − HXβ + (I − H)ε = (I − H)ε. (3.8)

Thus the residuals are the same linear transformation of the observations y and of the errors ε.

The covariance matrix of the residuals is

V(e) = V[(I − H)ε] = (I − H)V(ε)(I − H)′ = σ2(I − H), (3.9)

since V(ε) = σ2I and (I − H) is symmetric and idempotent. The matrix (I − H) is generally not diagonal, so the residuals have different variances and they are correlated. The variance of the ith residual is

V(ei) = σ2(1 − hii), (3.10)

where hii is the ith diagonal element of H. Since 0 ≤ hii ≤ 1, using the residual mean square MSE to estimate the variance of the residuals actually overestimates V(ei). Furthermore, since hii is a measure of the location of the ith point in x-space, the variance of ei depends upon where the point xi lies. Generally points near the center of the x-space have larger residual variance (poorer least squares fit) than points at more remote locations. Violations of the model assumptions are more likely at remote points, and these violations may be hard to detect from inspection of ei (or di) because their residuals will usually be smaller.

Several authors (Behnken and Draper [1972], Davies and Hutton [1975], and Huber [1975]) suggest taking this inequality of variance into account when scaling the residuals. They recommend plotting the "studentized" residuals

ri = ei/√(MSE(1 − hii)), i = 1, 2, . . . , n, (3.11)

instead of ei (or di). The studentized residuals have constant variance V(ri) = 1 regardless of the location of xi when the form of the model is correct. In many situations the variance of the residuals stabilizes, particularly for large data sets. In these cases there may be little difference between the standardized and studentized residuals, and they often convey equivalent information. However, since any point with a large residual and a large hii is potentially highly influential on the least squares fit, examination of the studentized residuals is generally recommended.

the variance of the residuals stabilizes, particularly for large data sets. In these casesthere may be little difference between the standardized and studentized residuals. Thusstandardized and studentized residuals often convey equivalent information. However,since any point with a large residual and a large hii potentially highly influential on theleast squares fit, examination of the studentized residuals is generally recommended.

The covariance between ei and ej is

Cov(ei, ej) = −σ2hij (3.12)

so another approach to scaling the residuals is to transform the n dependent residuals into n − p orthogonal functions of the errors ε. These transformed residuals are normally and independently distributed with constant variance σ2. Several procedures have been proposed to investigate departures from the underlying assumptions using transformed residuals. These procedures are not widely used in practice because it is difficult to make specific inferences about the transformed residuals, such as the interpretation of outliers. Furthermore, dependence between the residuals does not affect the interpretation of the usual residual plots unless p is large relative to n.

and independently distributed with constant variance σ2. Several procedures have beenproposed to investigate departures from the underlying assumptions using transformedresiduals. These procedures are not widely used in practice because it is difficult tomake specific inferences about the transformed residuals, such as the interpretation ofoutliers. Further more dependence between the residuals does not affect interpretationof the usual residual plots unless p is large relative to n.


II. Prediction Error Sum of Squares Residuals:

The prediction error sum of squares (PRESS), proposed by Allen [1971b, 1974], provides a useful residual scaling. To calculate PRESS, select an observation, say i. Fit the regression model to the remaining n − 1 observations and use this equation to predict the withheld observation yi. Denoting this predicted value by ŷ(i), we may find the prediction error for point i as e(i) = yi − ŷ(i). The prediction error is often called the ith PRESS residual. This procedure is repeated for each observation i = 1, 2, . . . , n, producing a set of n PRESS residuals e(1), e(2), · · · , e(n). Then the PRESS statistic is defined as the sum of squares of the n PRESS residuals, as in

PRESS = ∑_{i=1}^{n} e(i)² = ∑_{i=1}^{n} [yi − ŷ(i)]². (3.13)

Thus PRESS uses each possible subset of n − 1 observations as the estimation data set, and every observation in turn is used to form the prediction data set.

It would initially seem that calculating PRESS requires fitting n different regressions. However, it is possible to calculate PRESS from the results of a single least squares fit to all n observations. It turns out that the ith PRESS residual is

e(i) = ei/(1 − hii). (3.14)

Thus, since PRESS is just the sum of squares of the PRESS residuals, a simple computing formula is

PRESS = ∑_{i=1}^{n} [ei/(1 − hii)]². (3.15)

From (3.14) it is easy to see that the PRESS residual is just the ordinary residual weighted according to the diagonal element hii of the hat matrix. Residuals associated with points for which hii is large will have large PRESS residuals; such points will generally be high-influence points. Generally, a large difference between the ordinary residual and the PRESS residual indicates a point where the model fits the data well but where a model built without that point predicts poorly.
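The single-fit computing formula (3.15) means PRESS needs only the ordinary residuals and the hat diagonal; a short added Python sketch (hypothetical X, y) is:

import numpy as np

def press_statistic(X, y):
    """PRESS = sum_i (e_i / (1 - h_ii))^2, equation (3.15), from one least squares fit."""
    H = X @ np.linalg.solve(X.T @ X, X.T)
    e = y - H @ y
    h = np.diag(H)
    return np.sum((e / (1 - h)) ** 2)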

Finally note that the variance of the i-th PRESS residual is

V[e(i)] = V[ei/(1 − hii)] = [1/(1 − hii)²][σ2(1 − hii)] = σ2/(1 − hii),


so that a standardized PRESS residual is

e(i)/√V[e(i)] = [ei/(1 − hii)]/√[σ2/(1 − hii)] = ei/√[σ2(1 − hii)],

which, if we use MSE to estimate σ2, is just the studentized residual discussed previously.

III. R-Student:

The studentized residual ri discussed above is often considered an outlier diagnostic. It is customary to use MSE as an estimate of σ2 in computing ri. This is referred to as internal scaling of the residual, because MSE is an internally generated estimate of σ2 obtained from fitting the model to all n observations. Another approach would be to use an estimate of σ2 based on the data set with the ith observation removed. Denote the estimate of σ2 so obtained by S(i)2. We can show that

S(i)2 = [(n − p)MSE − ei²/(1 − hii)]/(n − p − 1). (3.16)

The estimate of σ2 in (3.16) is used instead of MSE to produce an externally studentized residual, usually called R-student, given by

ti = ei/√(S(i)2(1 − hii)), i = 1, 2, · · · , n. (3.17)

In many situations ti will differ little from the studentized residual ri. However, if the ith observation is influential, then S(i)2 can differ significantly from MSE, and thus the R-student statistic will be more sensitive to this point. Furthermore, under the standard assumptions ti follows the tn−p−1 distribution. Thus R-student offers a more formal procedure for outlier detection via hypothesis testing. Detection of outliers needs to be considered simultaneously with the detection of influential observations.
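R-student can likewise be obtained from a single fit. The sketch below is an added Python illustration that implements equations (3.16)-(3.17) as printed above (where p counts only the regressors); texts that let p count all estimated parameters write the same deletion estimate with (n − p) and (n − p − 1) interpreted accordingly.

import numpy as np

def r_student(X, y):
    """Externally studentized residuals t_i, equations (3.16)-(3.17) as printed."""
    n, k = X.shape                       # k = p + 1 estimated parameters, p regressors
    p = k - 1
    H = X @ np.linalg.solve(X.T @ X, X.T)
    e = y - H @ y
    h = np.diag(H)
    mse = (e @ e) / (n - k)
    # S_(i)^2 as written in (3.16); conventions differ on whether p counts the intercept.
    s2_i = ((n - p) * mse - e ** 2 / (1 - h)) / (n - p - 1)
    return e / np.sqrt(s2_i * (1 - h))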

IV. Estimation of Pure Error:

This procedure involves partitioning the error (or residual) sum of squares into a sum of squares due to "pure error" and a sum of squares due to "lack of fit",

SSE = SSPE + SSLOF,

where SSPE is computed using responses at repeated observations at the same level of x. This is a model-independent estimate of σ2. The calculation of SSPE requires repeated observations on y at the same set of levels of the regressor variables x1, x2, · · · , xp, i.e., some of the rows of the X matrix must be the same. However, repeated observations do not often occur in multiple regression.


3.1.6 Residual Plots

The residuals ei from the multiple linear regression model play an important role in judging model adequacy, just as they do in simple linear regression. Specifically, we often find it instructive to plot the following:

1. Residuals on Normal Probability paper.

2. Residuals versus each regressor xj, j = 1,2,...,p

3. Residuals versus the fitted values ŷi, i = 1, 2, · · · , n.

4. Residuals in time sequence (if known).

These plots are used to detect departures from normality, outliers, inequality of variance and a wrong functional specification for a regressor. There are several other residual plots useful in multiple regression analysis, some of which are as follows:

1. Plot of residuals against regressors omitted from the model.

2. Partial residual plots (the ith partial residual for regressor xj is e*ij = yi − β̂1xi1 − · · · − β̂j−1xi,j−1 − β̂j+1xi,j+1 − · · · − β̂pxi,p).

¯pxi,p).

3. Partial regression plots: plots of the residuals of y, from which the linear dependency of y on all regressors other than xj has been removed, against regressor xj with its linear dependency on the other regressors removed.

4. Plots of regressor xj against xi (checking for multicollinearity): if two or more regressors are highly correlated, we say that multicollinearity is present in the data. Multicollinearity can seriously disturb the least squares fit and in some situations render the regression model almost useless. We shall discuss some of these plots below.

Normal Probability Plot

A very simple method for checking the normality assumption is to plot the residuals on normal probability paper. This graph paper is designed so that the cumulative normal distribution plots as a straight line.

Let e[1] < e[2] < · · · < e[n] be the residuals ranked in increasing order. Plot e[i]

against the cumulative probabilities, Pi = (i − 12)/n, i = 1, 2, · · · , n. resulting should

be approximately on a straight line. The straight line is usually determined visually,with emphasis on the central values(e.g., the .33 and .67 cumulative probability points)rather than the extremes. Substantially departures from a straight line indicate thatthe distribution is not normal.
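A rough sketch of this construction is given below; it simply plots the ordered residuals against the normal quantiles corresponding to the plotting positions Pi = (i − 1/2)/n. The residual vector e is a hypothetical input, and essentially the same plot is available from routines such as scipy.stats.probplot.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def normal_probability_plot(e):
    """Plot ordered residuals against normal quantiles of P_i = (i - 1/2)/n."""
    n = len(e)
    e_sorted = np.sort(e)                        # e_[1] < ... < e_[n]
    P = (np.arange(1, n + 1) - 0.5) / n          # cumulative probabilities
    q = norm.ppf(P)                              # corresponding normal quantiles
    plt.scatter(q, e_sorted)
    plt.xlabel("Normal quantile")
    plt.ylabel("Ordered residual")
    plt.title("Normal probability plot of residuals")
    plt.show()
```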

Figure 3.1: Normal Probability plots

Figure 3.1(a) displays an "idealized" normal probability plot. Notice that the points lie approximately along a straight line. Figures 3.1(b)-(e) represent other typical problems. Figure 3.1(b) shows a sharp upward and downward curve at both extremes, indicating that the tails of the distribution are too heavy for it to be considered normal. Conversely, Figure 3.1(c) shows flattening at the extremes, which is a pattern typical of samples from a distribution with thinner tails than the normal. Figures 3.1(d)-(e) exhibit patterns associated with positive and negative skew, respectively. Andrews [1970] and Gnanadesikan [1977] note that normal probability plots often exhibit no unusual behavior even if the errors εi are not normally distributed. This problem occurs because the residuals are not a simple random sample; they are the remnants of a parameter estimation process. The residuals can be shown to be linear combinations of the model errors (the εi). Thus fitting the parameters tends to destroy the evidence of non-normality in the residuals, and consequently we cannot always rely on the normal probability plot to detect departures from normality.

A common defect that shows up on the normal probability plot is the occurrence of one or two large residuals. Sometimes this is an indication that the corresponding observations are outliers.

Residual Plot against ŷi

Figure 3.2: Patterns for residual plots: (a) satisfactory; (b) funnel; (c) double bow; (d) nonlinear.

A plot of the residuals ei (or the scaled residuals di or ri) versus the corresponding fitted values ŷi is useful for detecting several common types of model inadequacies.¹ If this plot resembles Figure 3.2(a), which indicates that the residuals can be contained in a horizontal band, then there are no obvious model defects. Plots of ei against ŷi that resemble any of the patterns in Figures 3.2(b)-(d) are symptomatic of model deficiencies.

∗ ∗ ∗ ∗ ∗

¹The residuals should be plotted against the fitted values ŷi and not the observed values yi, because the ei and the ŷi are uncorrelated while the ei and the yi are usually correlated.


Chapter 4

Subset Selection and Model Building

So far we have assumed that the variables that go into the regression equation were chosen in advance. Our analysis involved examining the equation to see whether the functional specification was correct and whether the underlying assumptions about the error term were valid. The analysis presupposed that the regressor variables included in the model are known to be influential. However, in most practical applications the analyst has a pool of candidate regressors that should include all the influential factors, but the actual subset of regressors to be used in the model needs to be determined. Finding an appropriate subset of regressors for the model is called the variable selection problem.

Building a regression model that includes only a subset of the available regressors involves two conflicting objectives.

1. We would like the model to include as many regressors as possible so that the "information content" in these factors can influence the predicted value of y.

2. We want the model to include as few regressors as possible because the variance of the prediction ŷ increases as the number of regressors increases. Also, the more regressors there are in a model, the greater the costs of data collection and model maintenance.

The process of finding a model that is a compromise between these two objectives is called selecting the "best" regression equation. Unfortunately there is no unique definition of "best". Furthermore, there are several algorithms that can be used for variable selection, and these procedures frequently specify different subsets of the candidate regressors as best.

The variable selection problem is often discussed in an idealized setting. It is usually assumed that the correct functional specification of the regressors is known, and that no outliers or influential observations are present. In practice, these assumptions are rarely met. Investigation of model adequacy is linked to the variable selection problem. Although ideally these problems should be solved simultaneously, an iterative approach is often employed, in which


1. a particular variable selection strategy is employed and then

2. the resulting subset model is checked for correct functional specification, outliers, and influential observations.

Several iterations may be required to produce an adequate model.

4.1 Model Building Or Formulation Of The Problem

Suppose we have a response variable Y and q predictor variables X1, X2, · · · , Xq. A linear model that represents Y in terms of the q variables is

\[ y_i = \beta_0 + \sum_{j=1}^{q} \beta_j x_{ij} + \varepsilon_i, \tag{4.1} \]

where the βj are parameters and εi represents the random disturbance. Instead of dealing with the full set of variables (particularly when q is large), we might delete a number of variables and construct an equation with a subset of variables. This chapter is concerned with determining which variables are to be retained in the equation. Let us denote the set of variables retained by X1, X2, · · · , Xp and those deleted by Xp+1, Xp+2, · · · , Xq. Let us examine the effect of variable deletion under two general conditions:

1. The model that connects Y to the X’s has all β’s (β0, β1, · · · , βq) nonzero.

2. The model has β0, β1, · · · , βp nonzero, but βp+1, βp+2, · · · , βq zero.

Suppose that instead of fitting (4.1) we fit the subset model

\[ y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i \tag{4.2} \]

We will examine the effect of the deletion of variables on the estimates of the parameters and on the predicted values of Y. The solution to the problem of variable selection becomes a little clearer once the effects of retaining unessential variables or deleting essential variables in an equation are known.

4.2 Consequences Of Variable Deletion

To provide motivation for variable selection, we will briefly review the consequences of incorrect model specification. Assume that there are K candidate regressors x1, x2, · · · , xK and n ≥ K + 1 observations on these regressors and the response y. The full model, containing all K regressors, is

\[ y_i = \beta_0 + \sum_{j=1}^{K} \beta_j x_{ij} + \varepsilon_i, \qquad i = 1, 2, \cdots, n \tag{4.3} \]

or equivalently

\[ y = \mathbf{X}\beta + \varepsilon \tag{4.4} \]

We assume that the list of candidate regressors contains all the influential variables and that all equations include an intercept term. Let r be the number of regressors that are deleted from (4.4). Then the number of variables that are retained is p = K + 1 − r. Since the intercept is included, the subset model contains p − 1 = K − r of the original regressors.

The model (4.4) can be written as

\[ y = \mathbf{X}_p\beta_p + \mathbf{X}_r\beta_r + \varepsilon \tag{4.5} \]

where the X matrix has been partitioned into Xp, an n × p matrix whose columns represent the intercept and the p − 1 regressors to be retained in the subset model, and Xr, an n × r matrix whose columns represent the regressors to be deleted from the full model. Let β be partitioned conformably into βp and βr. For the full model the least squares estimate of β is

\[ \hat{\beta}^* = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'y \tag{4.6} \]

and an estimate of the residual variance σ² is

\[ \hat{\sigma}^{*2} = \frac{y'y - \hat{\beta}^{*\prime}\mathbf{X}'y}{n-K-1} = \frac{y'\left[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\right]y}{n-K-1} \tag{4.7} \]

The components of β̂* are denoted by β̂*p and β̂*r, and the ŷ*i are the corresponding fitted values. For the subset model

\[ y = \mathbf{X}_p\beta_p + \varepsilon \tag{4.8} \]

the least squares estimate of βp is

\[ \hat{\beta}_p = (\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{X}_p'y \tag{4.9} \]

the estimate of the residual variance is

\[ \hat{\sigma}^2 = \frac{y'y - \hat{\beta}_p'\mathbf{X}_p'y}{n-p} = \frac{y'\left[\mathbf{I} - \mathbf{X}_p(\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{X}_p'\right]y}{n-p} \tag{4.10} \]

and the fitted values are ŷi.

4.2.1 Properties of β̂p

The properties of the estimates β̂p and σ̂² from the subset model have been investigated by several authors. The results can be summarized as follows:

1. Bias in β̂p

\[ E(\hat{\beta}_p) = \beta_p + (\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{X}_p'\mathbf{X}_r\beta_r = \beta_p + \mathbf{A}\beta_r \]

where A = (X′pXp)⁻¹X′pXr. Thus β̂p is a biased estimate of βp unless the regression coefficients corresponding to the deleted variables (βr) are zero or the retained variables are orthogonal to the deleted variables (X′pXr = 0).

2. Variance of β̂p

The variances of β̂p and β̂* are V(β̂p) = σ²(X′pXp)⁻¹ and V(β̂*) = σ²(X′X)⁻¹, respectively. Also the matrix V(β̂*p) − V(β̂p) is positive semidefinite; that is, the variances of the least squares estimates of the parameters in the full model are greater than or equal to the variances of the corresponding parameters in the subset model. Consequently deleting variables never increases the variances of the estimates of the remaining parameters.

3. Precision of the Parameter Estimates

Since β̂p is a biased estimate of βp and β̂*p is not, it is more reasonable to compare the precision of the parameter estimates from the full and subset models in terms of mean square error. The mean square error of β̂p is

\[ MSE(\hat{\beta}_p) = \sigma^2(\mathbf{X}_p'\mathbf{X}_p)^{-1} + \mathbf{A}\beta_r\beta_r'\mathbf{A}' \]

If the matrix V(β̂*r) − βrβ′r is positive semidefinite, then the matrix V(β̂*p) − MSE(β̂p) is positive semidefinite. This means that the least squares estimates of the parameters in the subset model have smaller mean square error than the corresponding parameter estimates from the full model when the deleted variables have regression coefficients that are smaller than the standard errors of their estimates in the full model.

4. Precision in Prediction

Suppose we wish to predict the response at the point x′ = [x′p, x′r]. If we use the full model, the predicted value is ŷ* = x′β̂*, with mean x′β and prediction variance

\[ V(\hat{y}^*) = \sigma^2\left[1 + x'(\mathbf{X}'\mathbf{X})^{-1}x\right] \]

However, if the subset model is used, ŷ = x′pβ̂p, with mean

\[ E(\hat{y}) = x_p'\beta_p + x_p'\mathbf{A}\beta_r \]

and prediction mean square error

\[ MSE(\hat{y}) = \sigma^2\left[1 + x_p'(\mathbf{X}_p'\mathbf{X}_p)^{-1}x_p\right] + (x_p'\mathbf{A}\beta_r - x_r'\beta_r)^2 \]

Note that ŷ is a biased estimate of y unless x′pAβr = x′rβr, which is only true in general if X′pXrβr = 0. Furthermore the variance of ŷ* from the full model is not less than the variance of ŷ from the subset model. In terms of mean square error

\[ V(\hat{y}^*) \ge MSE(\hat{y}) \]

provided that the matrix V(β̂*r) − βrβ′r is positive semidefinite.

4.3 Criteria for Evaluating Subset Regression Models

Two key aspects of the variable selection problem are generating the subset models and deciding whether one subset is better than another. In this section we discuss criteria for evaluating and comparing subset regression models.

4.3.1 Coefficient of Multiple Determination

A measure of the adequacy of a regression model that has been widely used is the coefficient of multiple determination, R². Let R²p denote the coefficient of multiple determination for a subset regression model with p terms, that is, p − 1 regressors and an intercept term β0. Computationally,

\[ R_p^2 = \frac{SS_R(p)}{S_{yy}} = 1 - \frac{SS_E(p)}{S_{yy}} \tag{4.11} \]

where SSR(p) and SSE(p) denote the regression sum of squares and the residual sum of squares, respectively, for a p-term subset model. There are $\binom{K}{p-1}$ values of R²p for each value of p, one for each possible subset model of size p. Now R²p increases as p increases and is a maximum when p = K + 1. Therefore the analyst uses this criterion by adding regressors to the model up to the point where an additional variable is not useful, in that it provides only a small increase in R²p. The general approach is illustrated in Figure 4.1, which represents a hypothetical plot of the maximum value of R²p for each subset of size p against p. Typically one examines a display such as this and then specifies the number of regressors for the final model as the point at which the "knee" in the curve becomes apparent.

Since we cannot find an "optimum" value of R² for a subset regression model, we must look for a "satisfactory" value. Aitkin [1974] has proposed one solution to this problem by providing a test by which all subset regression models that have an R² not significantly different from the R² for the full model can be identified. Let

\[ R_0^2 = 1 - (1 - R_{K+1}^2)(1 + d_{\alpha,n,K}) \tag{4.12} \]

where

\[ d_{\alpha,n,K} = \frac{K\,F_{\alpha,K,n-K-1}}{n-K-1} \]

and R²(K+1) is the value of R² for the full model. Aitkin calls any subset of regressor variables producing an R² greater than R²0 an R²-adequate (α) subset.

Figure 4.1: Plot of R²p against p

Generally it is not straightforward to use R² as a criterion for choosing the number of regressors to include in the model. However, for a fixed number of variables p, R²p can be used to compare the $\binom{K}{p-1}$ subset models so generated; models having large values of R²p are preferred.

4.3.2 Adjusted R²

To avoid the difficulties of interpreting R², some analysts prefer to use the adjusted R² statistic, defined for a p-term equation as

\[ \bar{R}_p^2 = 1 - \left(\frac{n-1}{n-p}\right)(1 - R_p^2) \tag{4.13} \]

The adjusted R²p statistic does not necessarily increase as additional regressors are introduced into the model. In fact Edwards [1969], Haitovsky [1969], and Seber [1977] showed that if s regressors are added to the model, the adjusted R² for the (p+s)-term model will exceed that of the p-term model if and only if the partial F-statistic for testing the significance of the s additional regressors exceeds 1. Therefore the optimum subset model can be chosen as the one with maximum adjusted R²p.


4.3.3 Residual Mean Square

The residual mean square for a subset regression model with p variables,

\[ MS_E(p) = \frac{SS_E(p)}{n-p} \tag{4.14} \]

can be used as a model evaluation criterion. The general behavior of MSE(p) as p increases is shown in Figure 4.2. Because SSE(p) always decreases as p increases, MSE(p) initially decreases, then stabilizes, and eventually may increase. The eventual increase in MSE(p) occurs when the reduction in SSE(p) from adding a regressor to the model is not sufficient to compensate for the loss of one degree of freedom in the denominator of (4.14); that is, adding a regressor to such a p-term model will cause MSE(p + 1) to be greater than MSE(p). Advocates of the MSE(p) criterion will plot MSE(p) against p and base the choice of p on

1. the minimum MSE(p),

2. the value of p such that MSE(p) is approximately equal to MSE for the full model, or

3. a value of p near the point where the smallest MSE(p) turns upward.

Figure 4.2: Plot of MSE(p) against p

The subset regression model that minimizes MSE(p) will also maximize the adjusted R²p. To see this, note that

\[ \bar{R}_p^2 = 1 - \frac{n-1}{n-p}\,(1 - R_p^2) = 1 - \frac{n-1}{n-p}\,\frac{SS_E(p)}{S_{yy}} = 1 - \frac{n-1}{S_{yy}}\,\frac{SS_E(p)}{n-p} = 1 - \frac{n-1}{S_{yy}}\,MS_E(p) \]

Thus the criteria minimum MSE(p) and maximum adjusted R²p are equivalent.

4.3.4 Mallows' Cp Statistic

Mallows [1964, 1966, 1973] has proposed a criterion that is related to the mean square error of a fitted value, that is,

\[ E\left[\hat{y}_i - E(y_i)\right]^2 = \left[E(y_i) - E(\hat{y}_i)\right]^2 + V(\hat{y}_i) \tag{4.15} \]

where E(yi) and E(ŷi) are the expected responses from the true regression model and the p-term subset model, respectively. Thus E(yi) − E(ŷi) is the bias at the i-th data point. Consequently the two terms on the right-hand side of (4.15) are the squared bias and variance components, respectively, of the mean square error. Let the total squared bias for a p-term equation be

\[ SS_B(p) = \sum_{i=1}^{n} \left[E(y_i) - E(\hat{y}_i)\right]^2 \]

and define the standardized total mean square error as

\[ \Gamma_p = \frac{1}{\sigma^2}\left\{ \sum_{i=1}^{n}\left[E(y_i) - E(\hat{y}_i)\right]^2 + \sum_{i=1}^{n} V(\hat{y}_i) \right\} = \frac{SS_B(p)}{\sigma^2} + \frac{1}{\sigma^2}\sum_{i=1}^{n} V(\hat{y}_i) \tag{4.16} \]

It can be shown that

\[ \sum_{i=1}^{n} V(\hat{y}_i) = p\sigma^2 \]

and that the expected value of the residual sum of squares from a p-term equation is

\[ E\left[SS_E(p)\right] = SS_B(p) + (n-p)\sigma^2 \]

Substituting for Σ V(ŷi) and SSB(p) in (4.16) gives

\[ \Gamma_p = \frac{1}{\sigma^2}\left\{ E\left[SS_E(p)\right] - (n-p)\sigma^2 + p\sigma^2 \right\} = \frac{E\left[SS_E(p)\right]}{\sigma^2} - n + 2p \tag{4.17} \]

Suppose that σ̂² is a good estimate of σ². Then replacing E[SSE(p)] by the observed value SSE(p) produces an estimate of Γp, say

\[ C_p = \frac{SS_E(p)}{\hat{\sigma}^2} - n + 2p \tag{4.18} \]

If the p-term model has negligible bias, then SSB(p) = 0. Consequently E[SSE(p)] = (n − p)σ², and

\[ E\left[C_p \mid \text{Bias} = 0\right] = \frac{(n-p)\sigma^2}{\sigma^2} - n + 2p = p \]

Figure 4.3: Plot of Cp against p

When using the Cp criterion, it is helpful to construct a plot of Cp as a function of p for each regression equation, such as shown in Figure 4.3. Regression equations with little bias will have values of Cp that fall near the line Cp = p (point A in Figure 4.3), while those equations with substantial bias will fall above this line (point B in Figure 4.3). Generally small values of Cp are desirable. For example, although point C in Figure 4.3 is above the line Cp = p, it is below point A and thus represents a model with lower total error. It may be preferable to accept some bias in the equation to reduce the average prediction error.

To calculate Cp, we need an unbiased estimate of σ². Generally we use the residual mean square of the full model, which gives Cp = p = K + 1 for the full model. Using MSE(K + 1) from the full model as an estimate of σ² assumes that the full model has negligible bias. If the full model has several regressors that do not contribute significantly to the model (zero regression coefficients), then MSE(K + 1) will often overestimate σ², and consequently the values of Cp will be small. If the Cp statistic is to work properly, a good estimate of σ² must be used.
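As a small illustrative sketch (not the report's own computation), the function below evaluates Cp for a given subset of regressors, estimating σ² by MSE from the full model as suggested above; the column-index argument `subset` is a hypothetical convention.

```python
import numpy as np

def _sse(X, y):
    """Residual sum of squares from the least squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def mallows_cp(X_full, y, subset):
    """Cp for the model with the intercept plus the columns in `subset`.

    X_full is the n x K matrix of candidate regressors (no intercept column);
    sigma^2 is estimated by MSE from the full model, following eq. (4.18).
    """
    n, K = X_full.shape
    ones = np.ones((n, 1))
    sigma2_hat = _sse(np.hstack([ones, X_full]), y) / (n - K - 1)   # MSE(K+1)
    Xp = np.hstack([ones, X_full[:, list(subset)]])                 # p-term subset model
    p = Xp.shape[1]
    return _sse(Xp, y) / sigma2_hat - n + 2 * p                     # equation (4.18)
```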

4.4 Computational Techniques For Variable Selection

In this section we will discuss several computational techniques for generating subset regression models.

4.4.1 All Possible Regressions

This procedure requires that the analyst fit all the regression equations involving one candidate regressor, two candidate regressors, and so on. These equations are evaluated according to some suitable criterion and the "best" regression model selected. If we assume that the intercept term β0 is included in all equations, then if there are K candidate regressors, there are 2^K total equations to be estimated and examined. For example, if K = 5, there are 32 possible equations, while if K = 10, there are 1024 possible regression equations. Clearly the number of equations to be examined increases rapidly as the number of candidate regressors increases, so all possible regressions is generally impractical for problems involving more than a few regressors, and the method is highly computer oriented. There are presently several algorithms available for efficiently generating all possible regressions. The basic idea underlying all of them is to perform the calculations for the 2^K possible subset models in such a way that sequential subset models differ by only one variable.
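A brute-force sketch of this idea is shown below: it enumerates every subset of the K candidate regressors and records R²p, adjusted R²p, and Cp for each. This is a self-contained illustration intended only for small K, exactly as the text cautions, and is not one of the specialized algorithms mentioned above.

```python
from itertools import combinations
import numpy as np

def _sse(X, y):
    """Residual sum of squares from the least squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def all_possible_regressions(X_full, y):
    """Evaluate R^2_p, adjusted R^2_p and C_p for every subset model.

    X_full is the n x K matrix of candidate regressors (no intercept column);
    the intercept is always included, so a size-k subset gives p = k + 1 terms.
    """
    n, K = X_full.shape
    ones = np.ones((n, 1))
    Syy = np.sum((y - y.mean()) ** 2)
    sigma2_hat = _sse(np.hstack([ones, X_full]), y) / (n - K - 1)   # MSE of the full model
    results = []
    for k in range(K + 1):
        for subset in combinations(range(K), k):
            Xp = np.hstack([ones, X_full[:, list(subset)]])
            p = Xp.shape[1]
            SSE_p = _sse(Xp, y)
            R2 = 1 - SSE_p / Syy
            adjR2 = 1 - (n - 1) / (n - p) * (1 - R2)
            Cp = SSE_p / sigma2_hat - n + 2 * p
            results.append({"subset": subset, "R2": R2, "adjR2": adjR2, "Cp": Cp})
    return results
```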

4.4.2 Directed Search on t

The test statistic for testing H0 : βj = 0 in the full model (p = K + 1 terms) is

\[ t_{K,j} = \frac{\hat{\beta}_j}{se(\hat{\beta}_j)} \]

Regressors that contribute significantly to the full model will have a large |tK,j| and will tend to be included in the best p-regressor subset, where best implies minimum residual sum of squares or Cp. Consequently, ranking the regressors according to decreasing order of magnitude of the |tK,j|, j = 1, 2, · · · , K, and then introducing the regressors into the model one at a time in this order should lead to the best (or one of the best) subset models for each p. Daniel and Wood [1980] call this procedure the directed search on t. It is often a very effective variable selection strategy when the number of candidate regressors is relatively large, for example, K > 20 or 30.


4.4.3 Stepwise Variable Selection

For cases where there are a large number of potential predictor variables, a set of procedures has been proposed that does not require computing all possible equations. These procedures have the feature that variables are introduced into or deleted from the equation one at a time, so that only a subset of all possible equations is examined. With p candidate variables these procedures involve the evaluation of at most (p + 1) equations, as contrasted with the 2^p equations necessary for examining all possible equations. The procedures can be classified into two broad categories: (1) the forward selection procedure (FS), and (2) the backward elimination procedure (BE). There is also a very popular modification of the FS procedure called the stepwise method.

Forward Selection Procedure

This procedure begins with the assumption that there are no regressors in the model other than the intercept. The first variable included in the equation is the one which has the highest simple correlation with the response Y. Suppose that this regressor is x1. This is also the regressor that will produce the largest value of the F-statistic for testing significance of regression. This regressor is entered if the F-statistic exceeds a preselected F value, say FIN (or F-to-enter). The second regressor chosen for entry is the one that now has the largest correlation with Y after adjusting for the effect of the first regressor entered (x1) on Y. These correlations are called partial correlations. They are the simple correlations between the residuals from the regression ŷ = β̂0 + β̂1x1 and the residuals from the regressions of the other candidate regressors on x1, say x̂j = α̂0j + α̂1jx1, j = 2, 3, · · · , K.

Suppose that at step 2 the regressor with the highest partial correlation with y is x2. This implies that the largest partial F-statistic is

\[ F = \frac{SS_R(x_2 \mid x_1)}{MS_E(x_1, x_2)} \]

If this F value exceeds FIN, then x2 is added to the model. In general, at each step the regressor having the highest partial correlation with Y is added to the model if its partial F-statistic exceeds the preselected entry level FIN. The procedure terminates either when the partial F-statistic at a particular step does not exceed FIN or when the last candidate regressor is added to the model.
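A compact sketch of forward selection with an F-to-enter threshold is given below; it is one plausible implementation of the procedure just described, not code from the report, and the default threshold `F_in` is an arbitrary assumption.

```python
import numpy as np

def _sse(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def forward_selection(X_full, y, F_in=4.0):
    """Forward selection using the partial F-statistic and an F-to-enter threshold."""
    n, K = X_full.shape
    ones = np.ones((n, 1))
    selected, remaining = [], list(range(K))
    sse_current = _sse(ones, y)                     # intercept-only model
    while remaining:
        best_j, best_F, best_sse = None, -np.inf, None
        for j in remaining:
            Xp = np.hstack([ones, X_full[:, selected + [j]]])
            sse_new = _sse(Xp, y)
            mse_new = sse_new / (n - Xp.shape[1])
            F = (sse_current - sse_new) / mse_new   # partial F for adding x_j
            if F > best_F:
                best_j, best_F, best_sse = j, F, sse_new
        if best_F <= F_in:                          # no candidate meets F-to-enter
            break
        selected.append(best_j)
        remaining.remove(best_j)
        sse_current = best_sse
    return selected
```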

Backward Elimination

Forward selection begins with no regressors in the model and attempts to insert variables until a suitable model is obtained. Backward elimination attempts to find a good model by working in the opposite direction: here we eliminate, one at a time, variables with little influence. We first calculate the linear regression for the full model and then eliminate the variable xk with one of the following (equivalent) properties: (1) xk has the smallest simple partial correlation among all remaining variables; (2) removing xk causes the smallest change in R²; (3) of the remaining variables, xk has the smallest t- or F-value. We repeat the whole procedure until one of the following stopping rules is satisfied (a code sketch follows the stopping rules):

1. The order p of the model has reached a predetermined value p*.

2. The smallest partial F-value is larger than a predetermined value Fout.

3. Removing xk would significantly change the model fit.
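The sketch below mirrors the forward-selection sketch for backward elimination: starting from the full model, it repeatedly drops the regressor with the smallest partial F-statistic until that smallest value exceeds the threshold `F_out` (an assumed, user-chosen constant).

```python
import numpy as np

def _sse(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def backward_elimination(X_full, y, F_out=4.0):
    """Backward elimination based on the smallest partial F-statistic."""
    n, K = X_full.shape
    ones = np.ones((n, 1))
    selected = list(range(K))
    while selected:
        X_cur = np.hstack([ones, X_full[:, selected]])
        sse_cur = _sse(X_cur, y)
        mse_cur = sse_cur / (n - X_cur.shape[1])
        worst_j, worst_F = None, np.inf
        for j in selected:
            reduced = [k for k in selected if k != j]
            X_red = np.hstack([ones, X_full[:, reduced]])
            F = (_sse(X_red, y) - sse_cur) / mse_cur   # partial F for dropping x_j
            if F < worst_F:
                worst_j, worst_F = j, F
        if worst_F > F_out:                            # every remaining regressor is significant
            break
        selected.remove(worst_j)
    return selected
```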

Stepwise Selection

A kind of compromise between forward selection and backward elimination is given by the stepwise selection method. Beginning with one variable, just as in forward selection, at each step we have to choose one of four alternatives:

1. Add a variable.

2. Remove a variable.

3. Exchange two variables.

4. Stop the selection.

This can be done with the following rules:

1. Add the variable xk if one of the forward selection criteria is satisfied.

2. Remove the variable xk with the smallest F-value if there are (possibly more than one) variables with an F-value smaller than Fout.

3. Remove the variable xk with the smallest F-value if this removal results in a larger R² value than was obtained with the same number of variables before.

4. Exchange the variable xk in the model with a variable xl not in the model if this will increase the R² value.

5. Stop the selection if none of the above criteria is satisfied.

Rules 1, 2, and 3 only make sense if there are two or more variables in the model; that is why they are admitted only in this case. Considering rule 3, we see the possibility that the same variable may be added and removed in several steps of the procedure.

∗ ∗ ∗ ∗ ∗


Chapter 5

Dealing with Multicollinearity

The use and interpretation of a multiple regression model often depend explicitly or implicitly on the estimates of the individual regression coefficients. Some examples of inferences that are frequently made include:

1. Identifying the relative effects of the regressor variables,

2. Prediction and/or estimation, and

3. Selection of an appropriate set of variables for the model.

If there is no linear relationship between the regressors, they are said to be orthogonal. When the regressors are orthogonal, inferences such as those illustrated above can be made easily. Unfortunately, in most applications of regression the regressors are not orthogonal. Sometimes the lack of orthogonality is not serious. However, in some situations the regressors are nearly perfectly linearly related, and in such cases the inferences based on the regression model can be misleading or erroneous. When there are near linear dependencies between the regressors, the problem of multicollinearity is said to exist.

5.1 Sources of Multicollinearity

There are four primary sources of multicollinearity:

1. The data collection method can lead to multicollinearity problems when the analyst samples only a subspace of the region of the regressors, i.e., when there is a set of constants t1, t2, ..., tp, not all zero, such that Σ_{j=1}^p tj Xj ≈ 0.

2. Constraints on the model or in the population being sampled can cause multicollinearity problems. For example, suppose that an electric utility is investigating the effect of family income (x1) and house size (x2) on residential electricity consumption. The levels of the two regressors in the sample data will show a potential multicollinearity problem; a physical constraint in the population caused this problem, namely, families with higher incomes generally have larger houses than families with lower incomes. When physical constraints such as this are present, multicollinearity will exist regardless of the sampling method employed. Constraints often occur in problems involving production or chemical processes, where the regressors are the components of a product and these components add to a constant.

3. Multicollinearity may also be induced by the choice of the model. For example, adding polynomial terms to a regression model causes ill-conditioning in X′X. Furthermore, if the range of x is small, adding an x² term can result in significant multicollinearity. We often encounter situations such as these where two or more regressors are nearly dependent, and retaining all these regressors may contribute to multicollinearity. In these cases some subset of the regressors is usually preferable from the standpoint of multicollinearity.

4. An overdefined model has more regressor variables than observations. These models are sometimes encountered in medical and behavioral research, where there may be only a small number of subjects (sample units) available and information is collected on a large number of regressors for each subject. The usual approach to dealing with multicollinearity in this context is to eliminate some of the regressors from consideration.

5.1.1 Effects Of Multicollinearity

The presence of multicollinearity has a number of potentially serious effects on the least squares estimates of the regression coefficients. Suppose that there are only two regressor variables, x1 and x2. The model, assuming that x1, x2, and y are scaled to unit length, is

\[ y = \beta_1 x_1 + \beta_2 x_2 + \varepsilon \tag{5.1} \]

and the least squares normal equations are

\[ \mathbf{X}'\mathbf{X}\hat{\beta} = \mathbf{X}'y, \qquad
\begin{pmatrix} 1 & r_{12} \\ r_{12} & 1 \end{pmatrix}
\begin{pmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{pmatrix} =
\begin{pmatrix} r_{1y} \\ r_{2y} \end{pmatrix} \]

where r12 is the simple correlation between x1 and x2 and rjy is the simple correlation between xj and y, j = 1, 2. Now the inverse of (X′X) is

\[ \mathbf{C} = (\mathbf{X}'\mathbf{X})^{-1} =
\begin{pmatrix} \dfrac{1}{1-r_{12}^2} & \dfrac{-r_{12}}{1-r_{12}^2} \\[2mm]
\dfrac{-r_{12}}{1-r_{12}^2} & \dfrac{1}{1-r_{12}^2} \end{pmatrix} \tag{5.2} \]

and the estimates of the regression coefficients are

\[ \hat{\beta}_1 = \frac{r_{1y} - r_{12}r_{2y}}{1-r_{12}^2}, \qquad
\hat{\beta}_2 = \frac{r_{2y} - r_{12}r_{1y}}{1-r_{12}^2} \tag{5.3} \]

If there is strong multicollinearity between x1 and x2, then the correlation coefficient r12 will be large. From (5.2) we see that as |r12| → 1, V(β̂j) = Cjjσ² → ∞ and Cov(β̂1, β̂2) = C12σ² → ±∞ depending on whether r12 → +1 or r12 → −1. Therefore strong multicollinearity between x1 and x2 results in large variances and covariances for the least squares estimators of the regression coefficients. This implies that different samples taken at the same x levels could lead to widely different estimates of the model parameters.

When there are more than two regressor variables, multicollinearity produces similar effects. It can be shown that the diagonal elements of the C = (X′X)⁻¹ matrix are

\[ C_{jj} = \frac{1}{1-R_j^2}, \qquad j = 1, 2, ..., p \tag{5.4} \]

where R²j is the coefficient of multiple determination from the regression of xj on the remaining p − 1 regressor variables. If there is strong multicollinearity between xj and any subset of the other p − 1 regressors, then the value of R²j will be close to unity. Since the variance of β̂j is V(β̂j) = Cjjσ² = (1 − R²j)⁻¹σ², strong multicollinearity implies that the variance of the least squares estimate of the regression coefficient βj is very large. Generally the covariance of β̂i and β̂j will also be large if the regressors xi and xj are involved in a multicollinear relationship.

Multicollinearity also tends to produce least squares estimates β̂j that are too large in absolute value. To see this, consider the squared distance from β̂ to the true parameter β, for example,

\[ L_1^2 = (\hat{\beta} - \beta)'(\hat{\beta} - \beta) \tag{5.5} \]

The expected squared distance, E(L²₁), is

\[ E(L_1^2) = E(\hat{\beta} - \beta)'(\hat{\beta} - \beta) = \sum_{j=1}^{p} E(\hat{\beta}_j - \beta_j)^2 = \sum_{j=1}^{p} V(\hat{\beta}_j) = \sigma^2\,\mathrm{Tr}(\mathbf{X}'\mathbf{X})^{-1} \tag{5.6} \]

where the trace of a matrix (denoted by Tr) is just the sum of its main diagonal elements. When multicollinearity is present, some of the eigenvalues of X′X will be small. Since the trace of a matrix is also equal to the sum of its eigenvalues, (5.6) becomes

\[ E(L_1^2) = \sigma^2 \sum_{j=1}^{p} \frac{1}{\lambda_j} \tag{5.7} \]

where λj > 0, j = 1, 2, ..., p, are the eigenvalues of X′X. Thus if the X′X matrix is ill-conditioned because of multicollinearity, at least one of the λj will be small, and (5.7)


implies that the distance from the least squares estimate β̂ to the true parameter β may be large. Equivalently, we can show that

\[ E(L_1^2) = E(\hat{\beta} - \beta)'(\hat{\beta} - \beta) = E(\hat{\beta}'\hat{\beta} - 2\hat{\beta}'\beta + \beta'\beta) \]

or

\[ E(\hat{\beta}'\hat{\beta}) = \beta'\beta + \sigma^2\,\mathrm{Tr}(\mathbf{X}'\mathbf{X})^{-1} \tag{5.8} \]

That is, the vector β̂ is generally longer than the vector β. This implies that the method of least squares produces estimated regression coefficients that are too large in absolute value.

While the method of least squares will generally produce poor estimates of the individual model parameters when strong multicollinearity is present, this does not necessarily imply that the fitted model is a poor predictor. If predictions are confined to regions of the x-space where the multicollinearity holds approximately, the fitted model often produces satisfactory predictions. This can occur because the linear combination Σ_{j=1}^p βj xij may be estimated quite well, even though the individual parameters βj are estimated poorly. That is, the mean response may often be precisely predicted despite the inadequate estimates of the individual model parameters.

In general, if a model is to extrapolate well, good estimates of the individual coefficients are required. When multicollinearity is suspected, the least squares estimates of the regression coefficients may be very poor. This may seriously limit the usefulness of the regression model for inference and prediction.

5.2 Multicollinearity Diagnosis

Several techniques have been proposed for detecting multicollinearity. We will now discuss some of these diagnostic measures. Desirable characteristics of a diagnostic procedure are that it directly reflects the degree of the multicollinearity problem and provides information helpful in determining which regressors are involved.

5.2.1 Examination of the Correlation Matrix

A very simple measure of multicollinearity is inspection of the off-diagonal elements rij of X′X. If regressors xi and xj are nearly linearly dependent, then |rij| will be near unity. Examining the simple correlations rij between the regressors is helpful in detecting near linear dependencies between pairs of regressors only. Unfortunately, when more than two regressors are involved in a near linear dependency, there is no assurance that any of the pairwise correlations rij will be large. Generally inspection of the rij is not sufficient for detecting anything more complex than pairwise multicollinearity.


5.2.2 Variance Inflation Factors

The diagonal elements of the C = (X′X)⁻¹ matrix are very useful in detecting multicollinearity. The jth diagonal element of C can be written as Cjj = (1 − R²j)⁻¹, where R²j is the coefficient of determination obtained when xj is regressed on the remaining p − 1 regressors. If xj is nearly orthogonal to the remaining p − 1 regressors, R²j is small and Cjj is close to unity, while if xj is nearly linearly dependent on some subset of the remaining regressors, R²j is near unity and Cjj is large. Since the variance of the jth regression coefficient is Cjjσ², we can view Cjj as the factor by which the variance of β̂j is increased due to near linear dependencies among the regressors. We call

\[ VIF_j = C_{jj} = (1 - R_j^2)^{-1} \]

the variance inflation factor (Marquardt [1970]). The VIF for each term in the model measures the combined effect of the dependencies among the regressors on the variance of that term. One or more large VIFs indicate multicollinearity. Practical experience indicates that if any of the VIFs exceeds 5 or 10, it is an indication that the associated regression coefficients are poorly estimated because of multicollinearity.
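A direct way to compute VIFs follows the definition above: regress each regressor on the others and form (1 − R²j)⁻¹. The sketch below is one plausible NumPy rendering; equivalently, the VIFs are the diagonal elements of the inverse of the correlation matrix of the regressors.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = (1 - R_j^2)^{-1}, where R_j^2 comes from regressing x_j on the others.

    X is the n x p matrix of regressors (no intercept column).
    """
    n, p = X.shape
    vifs = np.empty(p)
    ones = np.ones((n, 1))
    for j in range(p):
        xj = X[:, j]
        others = np.hstack([ones, np.delete(X, j, axis=1)])   # intercept + remaining regressors
        beta, *_ = np.linalg.lstsq(others, xj, rcond=None)
        resid = xj - others @ beta
        R2_j = 1 - resid @ resid / np.sum((xj - xj.mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - R2_j)
    return vifs
```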

The VIFs have another interesting interpretation. The length of the normal-theory confidence interval on the jth regression coefficient may be written as

\[ L_j = 2\,(C_{jj}\hat{\sigma}^2)^{1/2}\, t_{\alpha/2,\,n-p-1} \]

and the length of the corresponding interval based on an orthogonal reference design with the same sample size and root-mean-square values [i.e., rms = Σ_{i=1}^n (xij − x̄j)², a measure of the spread of the regressor xj] as the original design is

\[ L^* = 2\,\hat{\sigma}\, t_{\alpha/2,\,n-p-1} \]

The ratio of these two confidence intervals is Lj/L* = Cjj^{1/2}. Thus the square root of the jth VIF indicates how much longer the confidence interval for the jth regression coefficient is because of multicollinearity. The VIFs can also help identify which regressors are involved in the multicollinearity.

5.2.3 Eigensystem Analysis of X′X

The characteristic roots or eigenvalues of X′X, say λ1, λ2, ..., λp, can be used to measure the extent of multicollinearity in the data. If there are one or more near linear dependencies in the data, then one or more of the characteristic roots will be small. One or more small eigenvalues imply that there are near linear dependencies among the columns of X. Some analysts prefer to examine the condition number of X′X, defined as

\[ \kappa = \frac{\lambda_{max}}{\lambda_{min}} \tag{5.9} \]

This is just a measure of the spread in the eigenvalue spectrum of X′X. Generally, if the condition number is less than 100, there is no serious problem with multicollinearity. Condition numbers between 100 and 1000 imply moderate to strong multicollinearity, and if κ exceeds 1000, severe multicollinearity is indicated.

The condition indices of the X′X matrix are

\[ \kappa_j = \frac{\lambda_{max}}{\lambda_j}, \qquad j = 1, 2, ..., p \]

Clearly the largest condition index is the condition number defined in (5.9). The number of condition indices that are large (say ≥ 1000) is a useful measure of the number of near linear dependencies in X′X.
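The sketch below computes the eigenvalues, condition number, and condition indices of X′X. Scaling the columns to unit length first is an assumption consistent with the correlation-form X′X used throughout this chapter.

```python
import numpy as np

def condition_diagnostics(X):
    """Eigenvalues, condition number, and condition indices of X'X in correlation form."""
    Xc = X - X.mean(axis=0)
    Xs = Xc / np.sqrt(np.sum(Xc**2, axis=0))       # unit-length scaling of each column
    XtX = Xs.T @ Xs                                # correlation-form X'X
    eigvals = np.linalg.eigvalsh(XtX)
    kappa = eigvals.max() / eigvals.min()          # condition number, eq. (5.9)
    indices = eigvals.max() / eigvals              # condition indices
    return eigvals, kappa, indices
```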

5.2.4 Other Diagnostics

There are several other techniques that are occasionally useful in diagnosing multicollinearity. The determinant of X′X can be used as an index of multicollinearity. Since the X′X matrix is in correlation form, the possible range of values of the determinant is 0 ≤ |X′X| ≤ 1. If |X′X| = 1, the regressors are orthogonal, while if |X′X| = 0, there is an exact linear dependency among the regressors. The degree of multicollinearity becomes more severe as |X′X| approaches zero. While this measure of multicollinearity is easy to apply, it does not provide any information on the source of the multicollinearity. Other diagnostic methods could also be considered.

5.3 Methods for Dealing with Multicollinearity

Several techniques have been proposed for dealing with the problems caused by multicollinearity. The general approaches include collecting additional data, model respecification, and the use of estimation methods other than least squares that are specially designed to combat the problems induced by multicollinearity.

5.3.1 Collecting Additional Data

Collecting additional data has been suggested as the best method of combating multicollinearity. The additional data should be collected in a manner designed to break up the multicollinearity in the existing data. Unfortunately, collecting additional data is not always possible because of economic constraints or because the process being studied is no longer available for sampling. Even when additional data are available, it may be inappropriate to use them if the new data extend the range of the regressor variables far beyond the analyst's region of interest. Furthermore, if the new data points are unusual or atypical of the process being studied, their presence in the sample could be highly influential on the fitted model. Finally, note that collecting additional data is not a viable solution to the multicollinearity problem when the multicollinearity is due to constraints on the model or in the population.


5.3.2 Model Respecification

Multicollinearity is often caused by the choice of model, such as when two highly correlated regressors are used in the regression equation. In these situations some respecification of the regression equation may lessen the impact of multicollinearity. One approach to model respecification is to redefine the regressors. For example, if x1, x2, and x3 are nearly linearly dependent, it may be possible to find some function such as x = (x1 + x2)/x3 or x = x1x2x3 that preserves the information content in the original regressors but reduces the ill-conditioning.

Another widely used approach to model respecification is variable elimination. That is, if x1, x2, and x3 are nearly linearly dependent, eliminating one regressor (say x3) may be helpful in combating multicollinearity. Variable elimination is often a highly effective technique. However, it may not provide a satisfactory solution if the regressors dropped from the model have significant explanatory power relative to the response y. That is, eliminating regressors to reduce multicollinearity may damage the predictive power of the model. Care must be exercised in variable selection because many of the selection procedures are seriously distorted by multicollinearity, and there is no assurance that the final model will exhibit any lesser degree of multicollinearity than was present in the original data.

5.3.3 Ridge Regression

When the method of least squares is applied to nonorthogonal data, very poor estimates of the regression coefficients are usually obtained. In such cases the variance of the least squares estimates of the regression coefficients may be considerably inflated, and the length of the vector of least squares parameter estimates is too long on the average. In such cases we generally estimate the parameters using ridge regression, which we shall discuss in the next chapter.

5.3.4 Principal Components Regression

Biased estimators of the regression coefficients can also be obtained by using a procedure known as principal components regression. Consider the canonical form of the model,

\[ y = \mathbf{Z}\alpha + \varepsilon \tag{5.10} \]

where

\[ \mathbf{Z} = \mathbf{XT}, \qquad \alpha = \mathbf{T}'\beta, \qquad \mathbf{T}'\mathbf{X}'\mathbf{X}\mathbf{T} = \mathbf{Z}'\mathbf{Z} = \mathbf{\Lambda} \]

Recall that Λ = diag(λ1, λ2, ..., λp) is a p × p diagonal matrix of the eigenvalues of X′X and T is a p × p orthogonal matrix whose columns are the eigenvectors associated with λ1, λ2, ..., λp. The columns of Z, which define a new set of orthogonal regressors,

\[ \mathbf{Z} = \left[Z_1, Z_2, ..., Z_p\right] \]

are referred to as principal components.

The least squares estimator of α is

\[ \hat{\alpha} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'y = \mathbf{\Lambda}^{-1}\mathbf{Z}'y \tag{5.11} \]

and the covariance matrix of α̂ is

\[ V(\hat{\alpha}) = \sigma^2(\mathbf{Z}'\mathbf{Z})^{-1} = \sigma^2\mathbf{\Lambda}^{-1} \tag{5.12} \]

Thus a small eigenvalue of X′X means that the variance of the corresponding orthogonal regression coefficient will be large. Since Z′Z = Λ, we often refer to the eigenvalue λj as the variance of the jth principal component. If all the λj are equal to unity, the original regressors are orthogonal, while if a λj is exactly equal to zero, a perfect linear relationship exists between the original regressors. One or more λj near zero implies that multicollinearity is present. Note also that the covariance matrix of the standardized regression coefficients β̂ is

\[ V(\hat{\beta}) = V(\mathbf{T}\hat{\alpha}) = \mathbf{T}\mathbf{\Lambda}^{-1}\mathbf{T}'\sigma^2 \]

This implies that the variance of β̂j is σ²(Σ_{i=1}^p t²ji/λi). Therefore the variance of β̂j is a linear combination of the reciprocals of the eigenvalues. This demonstrates how one or more small eigenvalues can destroy the precision of the least squares estimate β̂j.

We have observed previously how the eigenvalues and eigenvectors of X′X provide specific information on the nature of the multicollinearity. Since Z = XT, we have

\[ Z_j = \sum_{i=1}^{p} t_{ij} X_i \tag{5.13} \]

where Xi is the ith column of the X matrix and the tij are the elements of the jth column of T (the jth eigenvector of X′X). If the variance of the jth principal component (λj) is small, this implies that Zj is nearly constant, and (5.13) indicates that there is a linear combination of the original regressors that is nearly constant. This is the definition of multicollinearity. Therefore (5.13) explains why the elements of the eigenvector associated with a small eigenvalue of X′X identify the regressors involved in the multicollinearity.

The principal components regression approach combats multicollinearity by using less than the full set of principal components in the model. To obtain the principal components estimator, assume that the regressors are arranged in order of decreasing eigenvalues, λ1 ≥ λ2 ≥ ... ≥ λp > 0. Suppose that the last s of these eigenvalues are approximately equal to zero. In principal components regression the principal components corresponding to the near-zero eigenvalues are removed from the analysis and least squares is applied to the remaining components. That is,

\[ \hat{\alpha}_{PC} = \mathbf{B}\hat{\alpha} \tag{5.14} \]

where B = diag(b1, b2, ..., bp) with b1 = b2 = ... = b(p−s) = 1 and b(p−s+1) = b(p−s+2) = ... = bp = 0. Thus the principal components estimator is

\[ \hat{\alpha}_{PC} = \left(\hat{\alpha}_1, \hat{\alpha}_2, \cdots, \hat{\alpha}_{p-s}, 0, 0, \cdots, 0\right)' \]

or, in terms of the standardized regressors,

\[ \hat{\beta}_{PC} = \mathbf{T}\hat{\alpha}_{PC} = \sum_{j=1}^{p-s} \lambda_j^{-1}\, t_j'\mathbf{X}'y\; t_j \tag{5.15} \]

A simulation study by Gunst and Mason [1977] showed that principal components regression offers considerable improvement over least squares when the data are ill-conditioned. They also point out that another advantage of principal components regression is that exact distribution theory and variable selection procedures are available.
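A minimal sketch of principal components regression along the lines of (5.10)-(5.15) is shown below; it standardizes the regressors, drops the components whose eigenvalues fall below a hypothetical tolerance `tol`, and maps the retained-component estimates back to the standardized coefficients.

```python
import numpy as np

def principal_components_regression(X, y, tol=1e-3):
    """Principal components estimate of the standardized regression coefficients.

    Components of X'X with eigenvalue below `tol` (a user-chosen cutoff) are dropped,
    following equations (5.14)-(5.15).
    """
    # Standardize to correlation form: unit-length columns, centered response.
    Xc = X - X.mean(axis=0)
    Xs = Xc / np.sqrt(np.sum(Xc**2, axis=0))
    yc = y - y.mean()

    lam, T = np.linalg.eigh(Xs.T @ Xs)             # eigenvalues/eigenvectors of X'X
    order = np.argsort(lam)[::-1]                  # arrange in decreasing order
    lam, T = lam[order], T[:, order]

    Z = Xs @ T                                     # principal components
    keep = lam > tol                               # delete near-zero-eigenvalue components
    zy = Z.T @ yc
    alpha_pc = np.where(keep, zy / np.where(keep, lam, 1.0), 0.0)   # eq. (5.11) and (5.14)
    beta_pc = T @ alpha_pc                         # eq. (5.15), standardized coefficients
    return beta_pc
```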

∗ ∗ ∗ ∗ ∗


Chapter 6

Ridge Regression

Ridge regression provides another alternative estimation method that may be used to advantage when the predictor variables are highly collinear. There are a number of alternative ways to define and compute ridge estimates; we have chosen to present the method associated with the ridge trace. It is a graphical approach and may be viewed as an exploratory technique. Ridge analysis using the ridge trace represents a unified approach to problems of detection and estimation when multicollinearity is suspected. The estimators produced are biased but tend to have a smaller mean squared error than the OLS estimator (Hoerl and Kennard, 1970). The prime competitor of subset regression in terms of variance reduction is ridge regression. Here the coefficients are estimated by (X′X + λI)⁻¹X′y, where λ is a shrinkage parameter. Increasing λ shrinks the coefficient estimates, but none are set equal to zero. Gruber (1990) gave a recent overview of ridge methods. Some studies (e.g., Frank and Friedman 1993; Hoerl and Schuenemeyer 1986) have shown that ridge regression gives more accurate predictions than subset regression unless, assuming that y is of the form

\[ y = \sum_{k} \beta_k x_k + \varepsilon, \]

all but a few of the βk are nearly zero and the rest are large. Thus, although subset regression can improve accuracy if p (the number of regressors) is large, it is usually not as accurate as ridge.

Ridge has its own drawbacks. It gives a regression equation no simpler than the original OLS equation. Furthermore, it is not scale invariant: if the scales used to express the individual predictor variables are changed, the ridge coefficients do not change inversely proportionally to the changes in the variable scales. The usual recipe is to standardize the xm to mean 0, variance 1 and then apply ridge. But the recipe is arbitrary; that is, interquartile ranges could be used to normalize instead, giving a different regression predictor.

6.1 Ridge Estimation

As we have already noted in the last chapter, when the method of least squares is applied to nonorthogonal data, very poor estimates of the regression coefficients are usually obtained. In such cases the variance of the least squares estimates of the regression coefficients may be considerably inflated, and the length of the vector of least squares parameter estimates is too long on the average. The problem with the method of least squares is the requirement that β̂ be an unbiased estimator of β. The Gauss-Markov property assures us that the least squares estimator has minimum variance in the class of unbiased linear estimators, but there is no guarantee that this variance will be small. The variance of β̂ is large, implying that confidence intervals on β would be wide and that the point estimate β̂ is very unstable.

One way to alleviate this problem is to drop the requirement that the estimator of β be unbiased. Suppose that we can find a biased estimator of β, say β̂*, that has a smaller variance than the unbiased estimator β̂. The mean square error of the estimator β̂* is defined as

\[ MSE(\hat{\beta}^*) = E(\hat{\beta}^* - \beta)^2 = V(\hat{\beta}^*) + \left[E(\hat{\beta}^*) - \beta\right]^2 \tag{6.1} \]

or

\[ MSE(\hat{\beta}^*) = V(\hat{\beta}^*) + (\text{bias in } \hat{\beta}^*)^2 \]

Note that the MSE is just the expected squared distance from β̂* to β. By allowing a small amount of bias in β̂*, the variance of β̂* can be made small such that the MSE of β̂* is less than the variance of the unbiased estimator β̂. Figure 6.1(b) illustrates a situation where the variance of the biased estimator is considerably smaller than the variance of the unbiased estimator (Figure 6.1(a)). Consequently confidence intervals on β would be much narrower using the biased estimator. The small variance for the biased estimator also implies that β̂* is a more stable estimator of β than is the unbiased estimator β̂.

Figure 6.1: Sampling distribution of (a) unbiased and (b) biased estimators of β

A number of procedures have been developed for obtaining biased estimators of regression coefficients. One of these procedures is ridge regression. The ridge estimator is found by solving a slightly modified version of the normal equations. Specifically, we define the ridge estimator β̂R as the solution to

\[ (\mathbf{X}'\mathbf{X} + k\mathbf{I})\hat{\beta}_R = \mathbf{X}'y \tag{6.2} \]

or

\[ \hat{\beta}_R = (\mathbf{X}'\mathbf{X} + k\mathbf{I})^{-1}\mathbf{X}'y \tag{6.3} \]

where k ≥ 0 is a constant selected by the analyst. The procedure is called ridge regression. Note that when k = 0, the ridge estimator is the least squares estimator.

The ridge estimator is a linear transformation of the least squares estimator since

\[ \hat{\beta}_R = (\mathbf{X}'\mathbf{X} + k\mathbf{I})^{-1}\mathbf{X}'y = (\mathbf{X}'\mathbf{X} + k\mathbf{I})^{-1}(\mathbf{X}'\mathbf{X})\hat{\beta} = \mathbf{Z}_k\hat{\beta} \]

Therefore, since E(β̂R) = E(Zkβ̂) = Zkβ, β̂R is a biased estimator of β. We usually refer to the constant k as the biasing parameter. The covariance matrix of β̂R is

\[ V(\hat{\beta}_R) = \sigma^2(\mathbf{X}'\mathbf{X} + k\mathbf{I})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X} + k\mathbf{I})^{-1} \tag{6.4} \]

The mean square error of the ridge estimator is

\[ MSE(\hat{\beta}_R) = V(\hat{\beta}_R) + (\text{bias in } \hat{\beta}_R)^2
= \sigma^2\,\mathrm{Tr}\left[(\mathbf{X}'\mathbf{X} + k\mathbf{I})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X} + k\mathbf{I})^{-1}\right] + k^2\beta'(\mathbf{X}'\mathbf{X} + k\mathbf{I})^{-2}\beta
= \sigma^2\sum_{j=1}^{p}\frac{\lambda_j}{(\lambda_j + k)^2} + k^2\beta'(\mathbf{X}'\mathbf{X} + k\mathbf{I})^{-2}\beta \tag{6.5} \]

where λ1, λ2, · · · , λp are the eigenvalues of X′X. Note that if k > 0, the bias in β̂R increases with k. However, the variance decreases as k increases.

In using ridge regression we would like to choose a value of k such that the reduction in the variance term is greater than the increase in the squared bias. If this can be done, the mean square error of the ridge estimator β̂R will be less than the variance of the least squares estimator β̂, provided that β′β is bounded. The residual sum of squares is

\[ SS_E = (y - \mathbf{X}\hat{\beta}_R)'(y - \mathbf{X}\hat{\beta}_R) \tag{6.6} \]
\[ \phantom{SS_E} = (y - \mathbf{X}\hat{\beta})'(y - \mathbf{X}\hat{\beta}) + (\hat{\beta}_R - \hat{\beta})'\mathbf{X}'\mathbf{X}(\hat{\beta}_R - \hat{\beta}) \tag{6.7} \]

Since the first term on the right-hand side of (6.7) is the residual sum of squares for the least squares estimate β̂, we see that as k increases, the residual sum of squares increases. Consequently, because the total sum of squares is fixed, R² decreases as k increases. Therefore the ridge estimate will not necessarily provide the best "fit" to the data, but this should not overly concern us, since we are more interested in obtaining a stable set of parameter estimates. The ridge estimates may result in an equation that does a better job of predicting future observations than would least squares.

Ridge Trace

Hoerl and Kennard have suggested that an appropriate value of k may be determined by inspection of the ridge trace. The ridge trace is a plot of the elements of β̂R versus k, for values of k usually in the interval 0-1. Marquardt and Snee [1975] suggest using up to about 25 values of k, spaced approximately logarithmically over the interval [0, 1]. If multicollinearity is severe, the instability in the regression coefficients will be obvious from the ridge trace. As k is increased, some of the ridge estimates will vary dramatically. At some value of k, the ridge estimates β̂R stabilize. Hopefully this will produce a set of estimates with smaller MSE than the least squares estimates.

The ridge regression estimates may be computed by using an ordinary least squares computer program and augmenting the standardized data as follows:

\[ \mathbf{X}_A = \begin{bmatrix} \mathbf{X} \\ \sqrt{k}\,\mathbf{I}_p \end{bmatrix}, \qquad y_A = \begin{bmatrix} y \\ 0_p \end{bmatrix} \]

where √k Ip is a p × p diagonal matrix with diagonal elements equal to the square root of the biasing parameter and 0p is a p × 1 vector of zeros. The ridge estimates are then computed from

\[ \hat{\beta}_R = (\mathbf{X}_A'\mathbf{X}_A)^{-1}\mathbf{X}_A'y_A = (\mathbf{X}'\mathbf{X} + k\mathbf{I}_p)^{-1}\mathbf{X}'y \]
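The augmented-data trick above translates directly into code: append √k·Ip rows to the standardized X and p zeros to y, then run ordinary least squares. The sketch below also loops over a grid of roughly 25 logarithmically spaced k values so the coefficient paths can serve as a ridge trace; the particular grid is an illustrative assumption, and X is assumed already standardized with y centered.

```python
import numpy as np

def ridge_via_augmentation(X, y, k):
    """Ridge estimate (X'X + kI)^{-1} X'y computed by OLS on augmented data."""
    n, p = X.shape
    X_A = np.vstack([X, np.sqrt(k) * np.eye(p)])   # append sqrt(k) * I_p
    y_A = np.concatenate([y, np.zeros(p)])         # append p fictitious zero responses
    beta_R, *_ = np.linalg.lstsq(X_A, y_A, rcond=None)
    return beta_R

def ridge_trace(X, y, ks=np.logspace(-4, 0, 25)):
    """Coefficient paths over a grid of k values (one row per k)."""
    return np.array([ridge_via_augmentation(X, y, k) for k in ks])
```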

Some Other Properties of Ridge Regression

Figure 6.2 illustrates the geometry of ridge regression for a two-regressor problem. The point β̂ at the center of the ellipses corresponds to the least squares solution, where the residual sum of squares takes on its minimum value. The small ellipse represents the locus of points in the β1, β2 plane where the residual sum of squares is constant at some value greater than the minimum. The ridge estimate β̂R is the shortest vector from the origin that produces a residual sum of squares equal to the value represented by the small ellipse. That is, the ridge estimate β̂R produces the vector of regression coefficients with the smallest norm consistent with a specified increase in the residual sum of squares. We note that the ridge estimator shrinks the least squares estimator toward the origin. Consequently ridge estimators (and other biased estimators generally) are sometimes called shrinkage estimators. Hocking [1976] has observed that the ridge estimator shrinks the least squares estimator with respect to the contours of X′X. That is, β̂R is the solution to

\[ \min_{\beta}\; (\beta - \hat{\beta})'\mathbf{X}'\mathbf{X}(\beta - \hat{\beta}) \quad \text{subject to} \quad \beta'\beta \le d^2 \tag{6.8} \]

where the radius d depends on k.

Figure 6.2: A geometrical interpretation of ridge regression

Many of the properties of the ridge estimator assume that the value of k is fixed. In practice, since k is estimated from the data by inspection of the ridge trace, k is stochastic. It is of interest to ask whether the optimality properties cited by Hoerl and Kennard hold if k is stochastic. Several authors have shown through simulations that ridge regression generally offers improvement in mean square error over least squares when k is estimated from the data. Theobald [1974] has generalized the conditions under which ridge regression leads to smaller MSE than least squares. The expected improvement depends on the orientation of the β vector relative to the eigenvectors of X′X.

Obenchain [1977] has shown that non-stochastically shrunken ridge estimators yield the same t- and F-statistics for testing hypotheses as does least squares. Thus although ridge regression leads to biased point estimates, it does not generally require a new distribution theory. However, distributional properties are still unknown for the stochastic choice of k.

6.2 Methods for Choosing k

Much of the controversy concerning ridge regression centers around the choice of the biasing parameter k. Choosing k by inspection of the ridge trace is a subjective procedure requiring judgment on the part of the analyst. Several authors have proposed procedures for choosing k that are more analytical. Hoerl, Kennard, and Baldwin [1975] have suggested that an appropriate choice for k is

\[ k = \frac{p\hat{\sigma}^2}{\hat{\beta}'\hat{\beta}} \tag{6.9} \]

where β̂ and σ̂² are found from the least squares solution. They showed via simulation that the resulting ridge estimator had significant improvement in MSE over least squares. In a subsequent paper, Hoerl and Kennard [1976] proposed an iterative estimation procedure based on (6.9). Specifically they suggested the following sequence of estimates of β and k:

\[ k_0 = \frac{p\hat{\sigma}^2}{\hat{\beta}'\hat{\beta}}, \qquad
k_1 = \frac{p\hat{\sigma}^2}{\hat{\beta}_R(k_0)'\hat{\beta}_R(k_0)}, \qquad
k_2 = \frac{p\hat{\sigma}^2}{\hat{\beta}_R(k_1)'\hat{\beta}_R(k_1)}, \qquad \cdots \]

The relative change in kj is used to terminate the procedure. If

\[ \frac{k_{j+1} - k_j}{k_j} > 20\,T^{-1.3} \]

the algorithm should continue; otherwise terminate and use β̂R(kj), where T = Tr(X′X)⁻¹/p. This choice of termination criterion has been selected because T increases with the spread in the eigenvalues of X′X, allowing further shrinkage as the degree of ill-conditioning in the data increases.
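A sketch of this iterative Hoerl-Kennard scheme is given below; it assumes X is the standardized regressor matrix and y the centered response, and it uses the relative-change stopping rule with the 20·T^(−1.3) threshold quoted above.

```python
import numpy as np

def ridge_solve(X, y, k):
    """Ridge estimate (X'X + kI)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

def hoerl_kennard_k(X, y, max_iter=50):
    """Iterative Hoerl-Kennard choice of the biasing parameter k based on eq. (6.9)."""
    n, p = X.shape
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = np.sum((y - X @ beta_ols) ** 2) / (n - p)   # residual mean square
    T = np.trace(np.linalg.inv(X.T @ X)) / p
    threshold = 20 * T ** (-1.3)                         # termination threshold
    k = p * sigma2 / (beta_ols @ beta_ols)               # k_0
    for _ in range(max_iter):
        beta_R = ridge_solve(X, y, k)
        k_next = p * sigma2 / (beta_R @ beta_R)
        if (k_next - k) / k <= threshold:                # relative change small enough
            return k                                     # terminate and use k_j
        k = k_next
    return k
```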

(kj), where T = Tr(X′X)−1/p.This choice of termination criterion has been selected because T is increases with thespread in the eigenvalues of X′X, allowing further shrinkage as the degree of ill-conditioningin the data increases

McDonald and Galarneau [1975] suggest choosing k so that

\[ \hat{\beta}_R'\hat{\beta}_R = \hat{\beta}'\hat{\beta} - \hat{\sigma}^2\sum_{j=1}^{p}\frac{1}{\lambda_j} \tag{6.10} \]

For cases where the right-hand side of (6.10) is negative, they investigated letting either k = 0 (least squares) or k = ∞ (β̂R = 0).

The methods for choosing k that we have described so far focus on improving the estimates of the regression coefficients. If the model is to be used for prediction purposes, then it may be more appropriate to consider prediction-oriented criteria for choosing k. Mallows modified the Cp statistic to a Ck statistic that can be used to determine k. He proposed plotting Ck against Vk, where

\[ C_k = \frac{SS_E(k)}{\hat{\sigma}^2} - n + 2 + 2\,\mathrm{Tr}(\mathbf{XL}), \qquad
V_k = 1 + \mathrm{Tr}(\mathbf{X}'\mathbf{X}\mathbf{LL}'), \qquad
\mathbf{L} = (\mathbf{X}'\mathbf{X} + k\mathbf{I})^{-1}\mathbf{X}' \]

and SSE(k) is the residual sum of squares as a function of k. The suggestion is to choose k to minimize Ck. Note that

\[ \mathbf{XL} = \mathbf{X}(\mathbf{X}'\mathbf{X} + k\mathbf{I})^{-1}\mathbf{X}' \equiv \mathbf{H}_k \]

and Hk plays the role of the hat matrix in ordinary least squares. Another possibility is to use a PRESS-type procedure involving

\[ PRESS_{ridge} = \sum_{i=1}^{n}\left(\frac{e_{i,k}}{1 - h_{ii,k}}\right)^2 \]

where e_{i,k} is the ith residual for a specific value of k and h_{ii,k} is the ith diagonal element of Hk. The value of k is chosen so as to minimize PRESS_ridge.
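A sketch of the PRESS_ridge computation over a grid of candidate k values is given below; it forms Hk = X(X′X + kI)⁻¹X′ directly, which is fine for illustration on small problems, and the grid of k values is an arbitrary assumption.

```python
import numpy as np

def press_ridge(X, y, k):
    """PRESS_ridge for a single value of the biasing parameter k."""
    n, p = X.shape
    Hk = X @ np.linalg.solve(X.T @ X + k * np.eye(p), X.T)   # ridge 'hat' matrix H_k
    resid = y - Hk @ y                                       # ridge residuals e_{i,k}
    h = np.diag(Hk)                                          # leverages h_{ii,k}
    return np.sum((resid / (1 - h)) ** 2)

def choose_k_by_press(X, y, ks=np.logspace(-4, 0, 25)):
    """Pick the k in the grid that minimizes PRESS_ridge."""
    scores = [press_ridge(X, y, k) for k in ks]
    return ks[int(np.argmin(scores))]
```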

6.3 Ridge regression and Variable Selection

Standard variable selection algorithms often do not perform well when the data are highly multicollinear. However, if the regressors can be made more nearly orthogonal by the use of a biased estimator, then variable selection usually works quite well and may be a good strategy. Hoerl and Kennard [1970] suggest that the ridge trace can be used as a guide for variable selection. They propose the following rules for removing regressors from the full model:

1. Remove regressors that are stable but have small prediction power, that is, regressors with small standardized coefficients.

2. Remove regressors with unstable coefficients that do not hold their prediction power, that is, unstable coefficients that are driven to zero.

3. Remove one or more of the remaining regressors that have unstable coefficients.

The subset of remaining regressors, say p in number, is used in the "final" model. We may examine these regressors to see if they form a nearly orthogonal subset. This may be done by plotting β̂_R(k)′β̂_R(k), the squared length of the ridge coefficient vector, as a function of k. If the regressors are orthogonal, the squared length of the ridge estimates should be β̂′β̂/(1 + k)², where β̂ is the ordinary least squares estimate of β. Therefore, if the subset model contains nearly orthogonal regressors, the functions β̂_R(k)′β̂_R(k) and β̂′β̂/(1 + k)², plotted against k, should be very similar.
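A minimal numerical sketch of this diagnostic is given below (numpy, hypothetical helper name squared_length_trace, centred data assumed): it evaluates β̂_R(k)′β̂_R(k) and β̂′β̂/(1 + k)² over a grid of k so the two curves can be compared.

import numpy as np

def squared_length_trace(X, y, ks):
    """Return the squared length of the ridge coefficient vector and the value
    expected under orthogonal regressors, for each k in ks."""
    p = X.shape[1]
    XtX, Xty = X.T @ X, X.T @ y
    beta_ols = np.linalg.solve(XtX, Xty)
    ridge_len = np.array([np.sum(np.linalg.solve(XtX + k * np.eye(p), Xty) ** 2)
                          for k in ks])
    ortho_len = (beta_ols @ beta_ols) / (1.0 + np.asarray(ks)) ** 2
    return ridge_len, ortho_len

If the retained subset of regressors is nearly orthogonal, the two returned arrays should nearly coincide when plotted against k.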

6.4 Ridge Regression: Some Remarks

Ridge regression provides a tool for judging the stability of a given body of data. [. . . remainder of this paragraph to be updated from the hard copy . . .]

Another practical problem with ridge regression is that it has not been implemented in some statistical packages. If a statistical package does not have a routine for ridge regression, ridge regression estimates can be obtained from a standard least squares package by using a slightly altered data set. Specifically, the ridge estimates of the regression coefficients can be obtained from the regression of Y* on X*_1, X*_2, ..., X*_p. The new response variable Y* is obtained by augmenting Y with p new fictitious observations, each of which is equal to zero. Similarly, the new predictor variable X*_j is obtained by augmenting X_j with p new fictitious observations, each of which is equal to zero except the one in the jth position, which is equal to √k, where k is the chosen value of the ridge parameter. It can be shown that the ridge estimates θ̂_1(k), ..., θ̂_p(k) are obtained by the least squares regression of Y* on X*_1, X*_2, ..., X*_p without a constant term in the model.

∗ ∗ ∗ ∗ ∗


Chapter 7

Better Subset Regression Using theNonnegative Garrote

A new method, called the nonnegative (nn) garrote, is proposed by Leo Breiman for doing subset regression. It both shrinks and zeroes coefficients. In tests on real and simulated data, it produces lower prediction error than ordinary subset selection. It is also compared to ridge regression. If the regression equations generated by a procedure do not change drastically with small changes in the data, the procedure is called stable. Subset selection is unstable, ridge is very stable, and the nonnegative garrote is intermediate.

7.1 Nonnegative Garrote

Subset regression is unstable with respect to small perturbations in the data. Say that n = 100, p = 40, and that, using stepwise deletion of variables, a sequence of subsets of variables {x_m; m ∈ ζ_k} of dimension k (|ζ_k| = k), k = 1, ..., p, has been selected. Now remove a single data case (y_i, x_i) and use the same selection procedure, getting a sequence of subsets {x_m; m ∈ ζ′_k}. Usually the ζ′_k and ζ_k are different, so that for some k a slight data perturbation leads to a drastic change in the prediction equation. On the other hand, if one uses ridge estimates and deletes a single data case, the new ridge estimates, for some λ, will be close to the old.

Much work and research have gone into subset-selection regression, but the basic method remains flawed by its relative lack of accuracy and instability. Subset regression either zeroes a coefficient (if it is not in the selected subset) or inflates it. Ridge regression gains its accuracy by selective shrinking. Methods that select subsets, are stable, and shrink are needed. Here is one: let β̂_k be the original OLS estimates, and take {c_k} to minimize

Σ_i ( y_i − Σ_k c_k β̂_k x_{ki} )²

under the constraints

c_k ≥ 0,    Σ_k c_k ≤ s.


The β̃_k(s) = c_k β̂_k are the new predictor coefficients. As the garrote is drawn tighter by decreasing s, more of the c_k become zero and the remaining nonzero β̃_k(s) are shrunken.

This procedure is called the nonnegative (nn) garrote. The garrote eliminates some variables, shrinks others, and is relatively stable. It is also scale invariant. Breiman showed that it is almost always more accurate than subset selection and that its accuracy is competitive with ridge. In general, the nn-garrote produces regression equations having more nonzero coefficients than subset regression, but the loss in simplicity is offset by substantial gains in accuracy.
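A minimal sketch of the garrote computation, assuming scipy is available and using the hypothetical helper name nn_garrote, solves the constrained least squares problem above with a general-purpose optimizer; Breiman's own implementation is not reproduced here.

import numpy as np
from scipy.optimize import minimize

def nn_garrote(X, y, s):
    """Nonnegative garrote: find c_k >= 0 with sum(c_k) <= s minimizing the RSS of
    the predictor sum_k c_k * beta_hat_k * x_k, where beta_hat are the OLS estimates."""
    p = X.shape[1]
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    Z = X * beta_ols                                     # column k is beta_hat_k * x_k
    rss = lambda c: np.sum((y - Z @ c) ** 2)
    res = minimize(rss, x0=np.full(p, s / p),
                   bounds=[(0.0, None)] * p,
                   constraints=[{"type": "ineq", "fun": lambda c: s - np.sum(c)}],
                   method="SLSQP")
    c = res.x
    return c, c * beta_ols                               # garrote coefficients beta_k(s)

Decreasing s drives more of the returned c_k to zero while shrinking the remaining coefficients, which is the behaviour described above.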

7.2 Model Selection

7.2.1 Prediction and Model Error

The prediction error is defined as the average error in predicting y from x for future cases not used in the construction of the prediction equation. There are two regression situations, X-controlled and X-random. In the controlled situation, the x_i are selected by the experimenter and only y is random. In the X-random situation, both y and x are randomly selected.

Case I: When X-controlled

In the controlled situation, future data are assumed gathered using the same x_i as in the present data and thus have the form (y_i^new, x_i), i = 1, ..., n. If μ̂(x) is the prediction equation derived from the present data, then define the prediction error as

PE(μ̂) = E Σ_i ( y_i^new − μ̂(x_i) )²,

where the expectation is over the y_i^new.

If the data are generated by the mechanism y_i = μ(x_i) + ε_i, where the ε_i are mean zero and uncorrelated with average variance σ², then

PE(μ̂) = nσ² + Σ_i ( μ(x_i) − μ̂(x_i) )².

The first component is the inherent prediction error due to the noise. The second component is the prediction error due to lack of fit to the underlying model. This component is called the model error and denoted by ME(μ̂). The size of the model error reflects different methods of model estimation. If μ = Σ β_m x_m and μ̂ = Σ β̂_m x_m, then

ME(μ̂) = (β̂ − β)′(X′X)(β̂ − β).


Case II: When X-random

If the x_i are random, it is assumed that the (y_i, x_i) are iid selections from the parent distribution (Y, X). Then if μ̂(x) is the prediction equation constructed using the present data, PE(μ̂) = E(Y − μ̂(X))². Assuming that Y = μ(X) + ε, where E(ε|X) = 0, then

PE(μ̂) = E(ε²) + E( μ(X) − μ̂(X) )².

Again, the relevant error is the second component. If μ = Σ β_m x_m and μ̂ = Σ β̂_m x_m, then the model error is given by

ME(μ̂) = (β̂ − β)′(nΓ)(β̂ − β),

where Γ_ij = E(X_i X_j).

7.3 Estimation of Error

In each regression procedure we generally get a sequence of models μ̂_k(x). Variable selection gives a sequence of subsets of variables {x_m, m ∈ ζ_k}, |ζ_k| = k, k = 1, ..., p, and μ̂_k(x) is the OLS linear regression based on {x_m, m ∈ ζ_k}. In the nn-garrote, a sequence of s-parameter values s_1, s_2, ..., s_K is selected and μ̂_k(x), k = 1, ..., K, is the prediction equation using parameter s_k. In ridge, a sequence of λ-parameter values λ_1, λ_2, ..., λ_K is selected, and μ̂_k(x) is the ridge regression based on λ_k.

If we knew the true values of PE(μ̂_k), the model selected would be the minimizer of PE(μ̂_k). We refer to these selections as the crystal-ball models. Otherwise, the selection process constructs an estimate P̂E(μ̂_k) and selects the μ̂_k that minimizes P̂E. The estimation methods differ between the X-controlled and X-random cases.

7.3.1 X-Controlled Estimates

The most widely used estimate in subset selection is Mallows' C_p. If k is the number of variables in the subset, RSS(k) is the residual sum of squares using μ̂_k, and σ̂² is the noise variance estimate derived from the full model, then the C_p estimate is

P̂E(μ̂_k) = RSS(k) + 2kσ̂².

But Breiman [1992] showed that this estimate is heavily biased and does poorly in model selection. He also showed that a better estimate for PE(μ̂_k) is

P̂E(μ̂_k) = RSS(k) + 2B_t(k),

where B_t(k) is defined as follows:


Let σ² be the noise variance, and add iid N(0, t²σ²), 0 < t ≤ 1, noise ε̃_i to the y_i, getting ỹ_i. Now, using the data (ỹ_i, x_i), repeat the subset-selection process, getting a new sequence of OLS predictors μ̃_k, k = 1, ..., p. Then

B_t(k) = (1/t²) E( Σ_i ε̃_i μ̃_k(x_i) ),

where the expectation is over the ε̃_i only. This is made computable by replacing σ² by the noise variance estimate σ̂² and the expectation over the ε̃_i by the average over many repetitions. This procedure is called the little bootstrap.
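The little bootstrap is simple to code once a subset-selection routine is fixed. The sketch below (numpy) uses a deliberately simple toy selector that keeps the k variables with the largest absolute OLS coefficients; this selector is an assumption made here for brevity and is not Breiman's procedure, and the names select_k_largest and little_bootstrap_pe are hypothetical.

import numpy as np

def select_k_largest(X, y, k):
    """Toy subset selector: keep the k variables with largest |OLS coefficient|,
    then refit OLS on that subset (a full-length coefficient vector is returned)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    idx = np.argsort(-np.abs(beta))[:k]
    coef = np.zeros(X.shape[1])
    coef[idx] = np.linalg.lstsq(X[:, idx], y, rcond=None)[0]
    return coef

def little_bootstrap_pe(X, y, k, sigma2_hat, t=0.7, reps=25, seed=0):
    """PE estimate RSS(k) + 2*B_t(k), with B_t(k) averaged over `reps` perturbations."""
    rng = np.random.default_rng(seed)
    n = len(y)
    coef = select_k_largest(X, y, k)
    rss = np.sum((y - X @ coef) ** 2)
    bt = 0.0
    for _ in range(reps):
        eps = rng.normal(scale=t * np.sqrt(sigma2_hat), size=n)   # tilde-epsilon ~ N(0, t^2 sigma^2)
        coef_tilde = select_k_largest(X, y + eps, k)              # rerun selection on perturbed y
        bt += eps @ (X @ coef_tilde) / t ** 2
    return rss + 2.0 * bt / reps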

The little bootstrap can also be applied to the nn-garrote and ridge. Suppose that the nn-garrote predictor μ̂_k has been computed for parameter values s_k with resulting residual sums of squares RSS(s_k), k = 1, ..., K. Now add ε̃_i to the y_i, getting ỹ_i, where the ε̃_i are iid N(0, t²σ̂²). Using the (ỹ_i, x_i) data, derive the nn-garrote predictor μ̃_k for the parameter value s_k, and compute the quantity (1/t²) Σ_i ε̃_i μ̃_k(x_i). Repeat several times, average these quantities, and denote the result by B_t(s_k). The estimate of PE is

P̂E(μ̂_k) = RSS(s_k) + 2B_t(s_k).

In ridge regression, let μ̂_k be the predictor using parameter λ_k. The little bootstrap estimate is RSS(λ_k) + 2B_t(λ_k), where B_t(λ_k) is computed just as in subset selection and the nn-garrote. It was shown by Breiman [1992] that, for subset selection, the bias of the little bootstrap estimate is small for small t. The same proof holds, almost word for word, for the nn-garrote and ridge. But what happens in subset selection is that as t ↓ 0 the variance of B_t increases rapidly, and B_t has no sensible limiting value. Experiments by Breiman [1992] indicated that the best range for t is [0.6, 0.8] and that averaging over 25 repetitions to form B_t is usually sufficient.

On the other hand, in ridge regression the variance of B_t does not increase appreciably as t ↓ 0, and taking this limit results in the nearly unbiased estimate

P̂E(μ̂_k) = RSS(λ_k) + 2σ̂² tr( X′X(X′X + λ_k I)^{−1} )        (7.1)

This turns out to be an excellent PE estimate that selects regression equations μ̂_k with PE(μ̂_k) close to min_{k′} PE(μ̂_{k′}).

The situation for the nn-garrote is intermediate between subset selection and ridge. The variance of B_t increases as t gets small, but a finite variance limit exists. It does not perform as well as using t in the range [0.6, 0.8], however. Therefore, our preferred PE estimates for subset selection and the nn-garrote use t ∈ [0.6, 0.8], and (7.1) is used for the ridge PE estimate. The behavior of B_t for small t is a reflection of the stability of the regression procedure used.


7.3.2 X-Random Estimates

For subset regressions μ̂_k in the X-random situation, the most frequently encountered PE estimate is

P̂E(μ̂_k) = RSS(k) / (1 − k/n)².

This estimate can be strongly biased and does poorly in selecting accurate models. In such cases, cross-validation (CV) is useful.

V-fold CV is used to estimate PE(μ̂_k) for subset selection and the nn-garrote. The data L = {(y_i, x_i), i = 1, ..., n} are split into V subsets L_1, ..., L_V. Let L^(v) = L − L_v. Using subset selection (nn-garrote) and the data in L^(v), form the predictors μ̂_k^(v)(x). Then the CV estimate is

P̂E(μ̂_k) = Σ_v Σ_{(y_i, x_i) ∈ L_v} ( y_i − μ̂_k^(v)(x_i) )²        (7.2)

and

M̂E(μ̂_k) = P̂E(μ̂_k) − nσ̂².        (7.3)

To get an accurate PE estimate for the ridge regression μ̂_λ(x), remove the ith case (y_i, x_i) from the data and recompute μ̂_λ(x), getting μ̂_λ^(−i)(x). Then the estimate is

P̂E(λ) = Σ_i ( y_i − μ̂_λ^(−i)(x_i) )².        (7.4)

This is the leave-one-out CV estimate. If r_i(λ) = y_i − μ̂_λ(x_i) and h_i(λ) = x_i′(X′X + λI)^{−1}x_i, then

P̂E(λ) = Σ_i ( r_i(λ)/(1 − h_i(λ)) )².        (7.5)

Usually, h_i(λ) ≃ h(λ) is a good approximation, where h(λ) = tr( X′X(X′X + λI)^{−1} )/n. With this approximation,

P̂E(λ) = RSS(λ)/(1 − h(λ))².        (7.6)

This estimate was first derived by Golub, Heath, and Wahba [1979] and is called the GCV (generalized cross-validation) estimate of PE.

Breiman and Spector [1992] found that the "infinitesimal" version of CV, that is, leave-one-out, gave poorer results in subset selection than five- or tenfold CV. Leave-one-out works well in ridge regression, however.
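For ridge, (7.5) and (7.6) are cheap to evaluate directly. The sketch below (numpy, hypothetical name ridge_loo_and_gcv) computes the exact leave-one-out estimate from the diagonal of the ridge hat matrix and the GCV approximation; the average leverage is taken as the trace divided by the number of observations, which is an assumption about the intended normalization.

import numpy as np

def ridge_loo_and_gcv(X, y, lam):
    """Leave-one-out PE (7.5) and its GCV approximation (7.6) for ridge regression."""
    n, p = X.shape
    A = np.linalg.inv(X.T @ X + lam * np.eye(p))
    H = X @ A @ X.T                      # ridge hat matrix
    r = y - H @ y                        # residuals r_i(lambda)
    h = np.diag(H)                       # leverages h_i(lambda)
    pe_loo = np.sum((r / (1.0 - h)) ** 2)
    h_bar = np.trace(X.T @ X @ A) / n    # average leverage h(lambda)
    pe_gcv = np.sum(r ** 2) / (1.0 - h_bar) ** 2
    return pe_loo, pe_gcv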


7.4 X-Orthonormal

In the X-controlled case, assume that X′X = I and that y is generated as

y_i = Σ_m β_m x_{mi} + ε_i,        (7.7)

where the ε_i are iid N(0, 1). Then the OLS estimates are β̂_m = β_m + Z_m, where the Z_m are iid N(0, 1). Although this is a highly simplified situation, it can give interesting insights into the comparative behavior of subset selection, ridge regression, and the nn-garrote. The best subset of k variables consists of those x_m corresponding to the k largest |β̂_m|, so that the coefficients of best subset regression are

β̃_m^(S) = I(|β̂_m| ≥ λ) β̂_m,   m = 1, ..., p,        (7.8)

for some λ ≥ 0, where I(·) is the indicator function. In the nn-garrote, the expression

Σ_i ( y_i − Σ_m c_m β̂_m x_{mi} )²

is minimized under the constraints c_m ≥ 0 for all m and Σ_m c_m = s. The solution is of the form

c_m = ( 1 − λ²/β̂_m² )_+        (7.9)

where λ is determined from s by the condition Σ_m c_m = s and the subscript + indicates the positive part of the expression. The nn-garrote coefficients are

β̃_m^(G) = ( 1 − λ²/β̂_m² )_+ β̂_m.        (7.10)

The ridge coefficients are

β̃_m^(R) = ( 1/(1 + λ) ) β̂_m.        (7.11)

All three of these estimates are of the form β̃ = θ(β̂, λ)β̂, where θ is a shrinkage factor. OLS estimates correspond to θ ≡ 1. Ridge regression gives a constant shrinkage, θ = 1/(1 + λ). The subset-selection shrinkage is 0 for |β̂| ≤ λ and 1 otherwise. The nn-garrote shrinkage is continuous: 0 if |β̂| ≤ λ and then increasing to 1. If the β̃_m are any estimates of the β_m, then the model error is

ME(β̃) = Σ_m ( β_m − β̃_m )².
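The three shrinkage factors are easy to compare directly in this orthonormal setting. The short sketch below (numpy, hypothetical name shrinkage_factors) evaluates the hard-thresholding, garrote, and ridge factors at a common λ and applies them to a grid of β̂ values.

import numpy as np

def shrinkage_factors(beta_hat, lam):
    """theta(beta_hat, lambda) for subset selection, nn-garrote and ridge (X'X = I case)."""
    subset = (np.abs(beta_hat) >= lam).astype(float)              # 0/1 hard threshold
    garrote = np.clip(1.0 - lam ** 2 / beta_hat ** 2, 0.0, None)  # (1 - lambda^2/beta^2)_+
    ridge = np.full_like(beta_hat, 1.0 / (1.0 + lam))             # constant shrinkage
    return subset, garrote, ridge

if __name__ == "__main__":
    b = np.array([-3.0, -1.5, -0.5, 0.5, 1.5, 3.0])    # OLS estimates (zero omitted)
    for name, theta in zip(["subset", "garrote", "ridge"], shrinkage_factors(b, lam=1.0)):
        print(name, np.round(theta * b, 3))             # shrunken coefficients theta*beta_hat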


7.5 Conclusions And Remarks

Breiman has given evidence that the nn-garrote is a worthy competitor to subset selection methods. It provides simple regression equations with better predictive accuracy. Unless a large proportion of the "true" coefficients are non-negligible, it gives accuracy better than or comparable to ridge methods. Data-reuse methods such as the little bootstrap or V-fold CV do well in estimating good values of the garrote parameter.

Some simulation results can be viewed as intriguing aspects of stability. Each regression procedure chooses from a collection of regression equations. Instability is intuitively defined as meaning that small changes in the data can bring large changes in the equations selected. If, by use of a crystal ball, we could choose the lowest-PE equations among the subset-selection, nn-garrote, and ridge collections, the differences in accuracy between the three procedures would be sharply reduced. But the more unstable a procedure is, the more difficult it is to accurately estimate PE or ME. Thus, subset-selection accuracy is severely affected by the relatively poor performance of the PE estimates in picking a low-PE subset.

On the other hand, ridge regression, which offers only a small diversity of models but is very stable, sometimes wins because the PE estimates are able to accurately locate low-PE ridge predictors. The nn-garrote is intermediate. Its crystal-ball selection is usually somewhat better than the crystal-ball subset selection, and its increased stability allows a better location of low-PE nn-garrote predictors, which increases its edge.

The ideas used in the nn-garrote can be applied to get other regression shrinkage schemes. For instance, let β̂_k be the original OLS estimates and take {c_k} to minimize

Σ_i ( y_i − Σ_k c_k β̂_k x_{ki} )²

under the constraint Σ_k c_k² ≤ s. This leads to a procedure intermediate between the nn-garrote and ridge regression. In the X′X = I case, its shrinkage factor is

θ(β̂, λ) = β̂²/(β̂² + λ²).

Unlike ridge, it is scale invariant. Our expectation is that it will be uniformly more accurate than ridge regression while being almost as stable. Like ridge regression, however, it does not zero coefficients and so does not produce simplified predictors.

∗ ∗ ∗ ∗ ∗


Chapter 8

Subset Auto Regression

A stationary model which is often fitted to detrended (mean estimated and removed) time series data is the autoregressive (AR) model. If T is the set of all integers, then {X_t; t ∈ T} is a mean-zero AR scheme of order p if

X_t + α_1 X_{t−1} + ··· + α_p X_{t−p} = ε_t        (8.1)

where ε_t is white noise (WN), i.e.,

E(ε_t) = 0,    Cov(ε_t, ε_s) = δ_{t,s} σ².        (8.2)

Fitting an AR model is a two-step procedure:

1. Choose the order, p, of the model.

2. Estimate the model parameters, α1, · · · , αp.

We shall primarily consider step (1). The optimal choice of order has been considered by a number of statisticians (Quenouille, Walker, Whittle, Akaike, and Hannan). Sequential hypothesis-testing procedures have been adopted. The application of these stepwise procedures is generally restricted to testing the sequence H_0: p = k, k = 1, 2, ..., until the first acceptance, or the sequence H_0: p = k, k = K, K − 1, ..., until the first acceptance, for some arbitrarily chosen K. After the order is chosen, one may wish to determine whether certain intermediate lags have non-zero coefficients.

Cleveland has proposed the use of inverse autocorrelations for determining the order of an AR model, analogous to the use of the autocorrelations for moving average models. Using a sequence of t-tests for determining the importance of individual terms in a regression model, we can decide whether to retain a particular lag in the final model.

Akaike has developed a decision-theoretic approach for determining the order. The decision function is the estimated one-step-ahead prediction error for each fitted AR model, called by Akaike the final prediction error (FPE). The model for which the FPE is minimized is chosen as the best model. This technique is most readily adapted to subset regression methodology, since the decision function can be calculated for any AR model.


8.1 Method of Estimation

Suppose that the observed (detrended) time series is X_1, X_2, ..., X_N. Define the sample autocovariance and autocorrelation functions as

c(h) = (1/N) Σ_{t=1}^{N−h} X_t X_{t+h},        (8.3)

r(h) = c(h)/c(0),        (8.4)

respectively, for h = −N + 1, −N + 2, ..., N − 1, where c(−h) = c(h). These estimate the true autocovariance and autocorrelation functions, γ(h) = E(X_t X_{t+h}) and ρ(h) = [γ(0)]^{−1} γ(h), respectively.

Let

R_K = Toepl{1, r(1), ..., r(K − 1)}        (8.5)

and

r_K^T = −{r(1), r(2), ..., r(K)},        (8.6)

where Toepl{b_1, b_2, ..., b_n} is the n × n Toeplitz matrix with (i, j)th element b_{|i−j|+1}, and A^T is the transpose of the matrix A. Then, assuming the true model is a pth-order AR with p ≤ K, the Yule-Walker system of equations for calculating the estimate â_K = {â_K(1), â_K(2), ..., â_K(K)}^T of a_K = {α_1, ..., α_p, 0, ..., 0}^T is

R_K â_K = r_K.        (8.7)

The asymptotic variance-covariance matrix, Σ_K, of â_K is

Σ_K = (σ²/N) Γ_K^{−1},        (8.8)

where Γ_K = Toepl{γ(0), γ(1), ..., γ(K − 1)}. Pagano's algorithm, which is both fast and numerically stable, is based upon the Cholesky decomposition of R_{K+1}. This is

R_{K+1} = L_{K+1} D_{K+1} L_{K+1}^T,        (8.9)

where L_{K+1} is unit lower triangular and D_{K+1} is diagonal with positive elements {d_i}_{i=1}^{K+1}. From this decomposition one obtains â_1, â_2, ..., â_K and

σ̂²_k = c(0)[1 − r_k^T R_k^{−1} r_k] = c(0) d_{k+1},   k = 1, 2, ..., K,        (8.10)

the sequence of estimated AR coefficient vectors and residual variances, respectively, for each of the K AR models with lags (1), (1, 2), ..., (1, 2, ..., K). One criterion for choosing the order of the scheme is to observe the non-increasing sequence σ̂²_1, σ̂²_2, ... until it has "leveled out," choosing the order to be that point at which no further significant decrease in σ̂²_k seems likely.
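A minimal sketch of this computation (Python with numpy and scipy; the helper names sample_acf and yule_walker_sequence are hypothetical, and a plain Yule-Walker solve is used in place of the Cholesky recursion of Pagano's algorithm) produces the residual-variance sequence σ̂²_1, σ̂²_2, ... for a simulated series.

import numpy as np
from scipy.linalg import toeplitz, solve

def sample_acf(x, K):
    """Sample autocovariances c(0..K) and autocorrelations r(0..K), as in (8.3)-(8.4)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    c = np.array([np.sum(x[:N - h] * x[h:]) / N for h in range(K + 1)])
    return c, c / c[0]

def yule_walker_sequence(x, K):
    """Solve R_k a_k = r_k for k = 1..K and return the residual variances of (8.10)."""
    c, r = sample_acf(x, K)
    sigma2 = []
    for k in range(1, K + 1):
        Rk = toeplitz(r[:k])             # Toepl{1, r(1), ..., r(k-1)}
        rk = -r[1:k + 1]                 # right-hand side, following (8.6)
        ak = solve(Rk, rk)
        sigma2.append(c[0] * (1.0 - rk @ ak))   # c(0)[1 - r_k' R_k^{-1} r_k]
    return np.array(sigma2)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    x = np.zeros(500)
    e = rng.normal(size=500)
    for t in range(2, 500):              # simulate a simple AR(2) for illustration
        x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + e[t]
    print(np.round(yule_walker_sequence(x - x.mean(), 6), 4))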


Akaike's decision function is of the form

FPE_δ(k) = σ̂²_k ( 1 + δ k/N ),        (8.11)

where k is the order being considered and δ a positive constant. If p ≤ k, the value of δ determines the amount by which the use of â_k for α_k increases the FPE. Suppose that in estimating the prediction error we use an unbiased estimator of σ²,

σ̃²_k = ( N/(N − k) ) σ̂²_k;        (8.12)

this makes the choice δ = 2. The additional prediction error attributable to estimation of α_k is approximately (k/N)σ². Thus the one-step-ahead prediction error is

σ̃²_k ( 1 + k/N ) = ( 1 − k/N )^{−1} ( 1 + k/N ) σ̂²_k = FPE_2(k) + O(N^{−1}).        (8.13)

The sequential application of FPE_δ(k), k = 1, 2, ..., K, will yield an estimated order p̂, where p̂ minimizes {FPE_δ(k); k = 1, 2, ..., K}. However, the resulting AR model of order p̂ will not necessarily be the model which employs p̂ of the first K lags and minimizes FPE_δ(p̂). That is, the estimated residual variance may be smaller for another p̂-subset of the K lags in which some intermediate lags are constrained to have zero coefficients. The first problem is to find the kth-order model with minimum residual variance, preferably without checking all (K choose k) subsets. Then we can apply the FPE criterion to the subsequence of best k-subsets to determine p̂.
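Given a residual-variance sequence, the FPE rule itself is short to code. The sketch below (numpy, hypothetical name fpe_order) evaluates FPE_δ(k) as in (8.11), allowing m extra trend parameters to be charged as in (8.26), and returns the minimizing order.

import numpy as np

def fpe_order(sigma2_seq, N, delta=2.0, m=0):
    """Select the AR order minimizing FPE_delta(k) = sigma2_k * (1 + delta*(k + m)/N)."""
    ks = np.arange(1, len(sigma2_seq) + 1)
    fpe = np.asarray(sigma2_seq) * (1.0 + delta * (ks + m) / N)
    return int(ks[np.argmin(fpe)]), fpe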

8.2 Pagano’s Algorithm

Hocking and Leslie have described an algorithm which almost always avoids checking every k-subset of a K-variate regression model before finding the one with minimum residual variance. For the AR model, define θ_i as the increase in the residual variance, σ̂²_K, when lag i is removed from the complete Kth-order model. Let θ_{i_1} ≤ θ_{i_2} ≤ ··· ≤ θ_{i_K}. Suppose we wish to determine the k-lag model with minimum residual variance among those whose maximum lag does not exceed K. Let q = K − k, and consider any subset of lags {j_1, j_2, ..., j_q}, ordered according to the system {i_1, i_2, ..., i_K}. The procedure compares the increase in residual variance obtained by removing the lags {i_1, i_2, ..., i_q} with θ_{i_{q+1}}. If the increase does not exceed θ_{i_{q+1}}, the best k-subset consists of the lags complementing {i_1, i_2, ..., i_q} in {1, 2, ..., K}. If the increase exceeds θ_{i_{q+1}}, we find the q-subset of lags in {i_1, i_2, ..., i_{q+1}} which, upon removal from the model, results in the minimum increase in residual variance. If this minimum increase does not exceed θ_{i_{q+2}}, the set of complementary lags corresponding to the minimum increase is the k-subset having minimum residual variance among all (K choose k) subsets. If the minimum increase exceeds θ_{i_{q+2}}, we proceed with the (q+2 choose q) q-subsets of {i_1, i_2, ..., i_{q+2}}.


Note that

θ_i = σ̂²_K â²_K(i) / σ̂²_{â_K(i)},        (8.14)

where

σ̂²_{â_K(i)} = (Σ̂_K)_{ii} = ( σ̂²_K / (N c(0)) ) {R_K^{−1}}_{ii},        (8.15)

the estimated variance of â_K(i). The form of R_K^{−1} is such that

{R_K^{−1}}_{ii} = ( N c(0) / σ̂²_K ) Σ_{j=1}^{i} [ â²_K(j − 1) − â²_K(K + 1 − j) ],        (8.16)

where â_K(0) = 1, so that

θ_i = σ̂²_K â²_K(i) / Σ_{j=1}^{i} [ â²_K(j − 1) − â²_K(K + 1 − j) ],        (8.17)

which is easily computed from Pagano's algorithm.
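As a direct transcription of (8.17) under the indexing stated above (with â_K(0) = 1 prepended to the Yule-Walker estimates), the following sketch (numpy, hypothetical name theta_increases) computes the increases θ_i used to rank the lags.

import numpy as np

def theta_increases(a_K, sigma2_K):
    """theta_i of (8.17): increase in residual variance when lag i is dropped from
    the full K-lag model; a_K holds the Yule-Walker estimates a_K(1..K)."""
    K = len(a_K)
    a = np.concatenate([[1.0], np.asarray(a_K, dtype=float)])   # a[0] = a_K(0) = 1
    theta = np.empty(K)
    for i in range(1, K + 1):
        num = sigma2_K * a[i] ** 2
        denom = np.sum(a[0:i] ** 2 - a[K + 1 - np.arange(1, i + 1)] ** 2)
        theta[i - 1] = num / denom
    return theta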

8.3 Computation of The Increase in The Residual Variance

Define the K × K matrix P_K = {p_{jk}} by

p_{jk} = 1 if k = K − j + 1, j = 1, 2, ..., K;  0 otherwise,        (8.18)

which implies that R_K = P_K R_K P_K. Now define a modification of R_K, denoted R_K(i_1, i_2, ..., i_q), whose (m, n)th element is

r_{mn} = 0  if n = i_j, m ≠ i_j, or m = i_j, n ≠ i_j, for some j = 1, ..., q;
r_{mn} = 1  if n = m = i_j for some j = 1, ..., q;
r_{mn} = r(n − m)  otherwise.        (8.19)

Thus R_K(i_1, i_2, ..., i_q) is formed by placing ones on the diagonal and zeros in the off-diagonal positions of the i_1, i_2, ..., i_q rows and columns of R_K. If the true underlying model is an AR of maximum order k = K − q with α_{i_1} = ··· = α_{i_q} = 0, the estimator

â_K(i_1, ..., i_q) = R_K^{−1}(i_1, ..., i_q) r_K(i_1, ..., i_q)        (8.20)

has the same properties as does â_K when the model is no more than Kth order with no constrained coefficients. The residual variance for the reduced model is estimated by

σ̂²_K(i_1, ..., i_q) = c(0)[ 1 − r_K^T(i_1, ..., i_q) R_K^{−1}(i_1, ..., i_q) r_K(i_1, ..., i_q) ].        (8.21)


Define the Cholesky decomposition

R_K = L_K D_K L_K^T.        (8.22)

Lemma

σ̂²_k(i_1, ..., i_q) = c(0) d_{K+1}(j_1, ..., j_q),        (8.23)

where j_m = K + 1 − i_m, m = 1, 2, ..., q.

Proof

Note that

R_{K+1}(j_1, ..., j_q) = [ R_K(j_1, ..., j_q)          P_K r_K(i_1, ..., i_q) ]
                         [ r_K^T(i_1, ..., i_q) P_K     1                    ]        (8.24)

implies

det[ R_{K+1}(j_1, ..., j_q) ] = ...        (8.25)

[remainder of the proof to be updated from notes]

8.4 Conclusion

The main steps of the algorithm are:

1. Choose K, the maximum lag to be considered in the AR model (Akaike used K ≈ N/10), and perform the Cholesky decomposition of R_{K+1}.

2. Calculate â_K and σ̂²_K using Pagano's algorithm, and order the θ_i, i = 1, ..., K, computed by (8.17).

3. For k = 1, 2, ..., K, find the best k-lag AR model (with maximum lag K) using the procedure of Sections 8.2 and 8.3.

4. Calculate FPE_2(k + m), k = 1, ..., K, where m is the number of estimated trend parameters, for the sequence of best k-lag models. Choose

p̂ = the value of k minimizing FPE_2(k + m).        (8.26)

This p̂-lag model has minimum estimated one-step-ahead prediction error among all AR models with maximum lag K.

The subset autoregression algorithm is useful for determining model seasonality, for improving autoregressive spectral estimation, for determining the order of an autoregressive model, and even for making inferences about the suitability of an autoregressive model for a particular time series. Its speed makes it possible to find the best subset autoregression of each order.

∗ ∗ ∗ ∗ ∗


Appendix A

Data Analysis

Data on the consumption of oxygen have been considered for 31 American subjects aged between 35 and 60 years. The variables recorded in the experiment were Age, Weight, Oxygen, RunTime, RestPulse, RunPulse, and MaxPulse. In this project we have applied subset selection procedures, ridge estimation, and nonnegative garrote estimation to these data.

Data for Fitness of American People in a Certain Experiment

--------------------------------------------------------------------

Age Weight Oxygen RunTime RestPulse RunPulse MaxPulse   (two observations per row)

--------------------------------------------------------------------

44 89.47 44.609 11.37 62 178 182 40 75.07 45.313 10.07 62 185 185

44 85.84 54.297 8.65 45 156 168 42 68.15 59.571 8.17 40 166 172

38 89.02 49.874 9.22 55 178 180 47 77.45 44.811 11.63 58 176 176

40 75.98 45.681 11.95 70 176 180 43 81.19 49.091 10.85 64 162 170

44 81.42 39.442 13.08 63 174 176 38 81.87 60.055 8.63 48 170 186

44 73.03 50.541 10.13 45 168 168 45 87.66 37.388 14.03 56 186 192

45 66.45 44.754 11.12 51 176 176 47 79.15 47.273 10.60 47 162 164

54 83.12 51.855 10.33 50 166 170 49 81.42 49.156 8.95 44 180 185

51 69.63 40.836 10.95 57 168 172 51 77.91 46.672 10.00 48 162 168

48 91.63 46.774 10.25 48 162 164 49 73.37 50.388 10.08 67 168 168

57 73.37 39.407 12.63 58 174 176 54 79.38 46.080 11.17 62 156 165

52 76.32 45.441 9.63 48 164 166 50 70.87 54.625 8.92 48 146 155

51 67.25 45.118 11.08 48 172 172 54 91.63 39.203 12.88 44 168 172

51 73.71 45.790 10.47 59 186 188 57 59.08 50.545 9.93 49 148 155

49 76.32 48.673 9.40 56 186 188 48 61.24 47.920 11.50 52 170 176

52 82.78 47.467 10.50 53 170 172

---------------------------------------------------------------------

A.1 Correlation Coefficients

The CORR Procedure

7 Variables: Oxygen RunTime Age Weight RunPulse MaxPulse RestPulse


Simple Statistics

Variable N Mean Std Dev Sum Minimum Maximum

Oxygen 31 47.37581 5.32723 1469 37.38800 60.05500

RunTime 31 10.58613 1.38741 328.17000 8.17000 14.03000

Age 31 47.67742 5.21144 1478 38.00000 57.00000

Weight 31 77.44452 8.32857 2401 59.08000 91.63000

RunPulse 31 169.64516 10.25199 5259 146.00000 186.00000

MaxPulse 31 173.77419 9.16410 5387 155.00000 192.00000

RestPulse 31 53.45161 7.61944 1657 40.00000 70.00000

Pearson Correlation Coefficients, N = 31

Prob > |r| under H0: Rho=0

Rest

Oxygen RunTime Age Weight RunPulse MaxPulse Pulse

Oxygen 1.00000 -0.86219 -0.30459 -0.16275 -0.39797 -0.23674 -0.39936

<.0001 0.0957 0.3817 0.0266 0.1997 0.0260

RunTime -0.86219 1.00000 0.18875 0.14351 0.31365 0.22610 0.45038

<.0001 0.3092 0.4412 0.0858 0.2213 0.0110

Age -0.30459 0.18875 1.00000 -0.23354 -0.33787 -0.43292 -0.16410

0.0957 0.3092 0.2061 0.0630 0.0150 0.3777

Weight -0.16275 0.14351 -0.23354 1.00000 0.18152 0.24938 0.04397

0.3817 0.4412 0.2061 0.3284 0.1761 0.8143

RunPulse -0.39797 0.31365 -0.33787 0.18152 1.00000 0.92975 0.35246

0.0266 0.0858 0.0630 0.3284 <.0001 0.0518

MaxPulse -0.23674 0.22610 -0.43292 0.24938 0.92975 1.00000 0.30512

0.1997 0.2213 0.0150 0.1761 <.0001 0.0951

RestPulse -0.39936 0.45038 -0.16410 0.04397 0.35246 0.30512 1.00000

0.0260 0.0110 0.3777 0.8143 0.0518 0.0951

Oxygen consumption is highly negatively correlated with RunTime, and RunPulse is strongly correlated with MaxPulse.


A.2 Forward Selection for Fitness Data

The REG Procedure

Model: MODEL1

Dependent Variable: Oxygen

Forward Selection: Step 1

Variable RunTime Entered: R-Square = 0.7434 and C(p) = 13.6988

Analysis of Variance

Sum of Mean

Source DF Squares Square F Value Pr > F

Model 1 632.90010 632.90010 84.01 <.0001

Error 29 218.48144 7.53384

Corrected Total 30 851.38154

Parameter Standard

Variable Estimate Error Type II SS F Value Pr > F

Intercept 82.42177 3.85530 3443.36654 457.05 <.0001

RunTime -3.31056 0.36119 632.90010 84.01 <.0001

Bounds on condition number: 1, 1

---------------------------------------------------------------------

Forward Selection: Step 2

Variable Age Entered: R-Square = 0.7642 and C(p) = 12.3894

The REG Procedure

Model: MODEL1

Dependent Variable: Oxygen

Forward Selection: Step 2

Analysis of Variance

Sum of Mean

Source DF Squares Square F Value Pr > F

Model 2 650.66573 325.33287 45.38 <.0001


Error 28 200.71581 7.16842

Corrected Total 30 851.38154

Parameter Standard

Variable Estimate Error Type II SS F Value Pr > F

Intercept 88.46229 5.37264 1943.41071 271.11 <.0001

RunTime -3.20395 0.35877 571.67751 79.75 <.0001

Age -0.15037 0.09551 17.76563 2.48 0.1267

Bounds on condition number: 1.0369, 4.1478

----------------------------------------------------------------------

Forward Selection: Step 3

Variable RunPulse Entered: R-Square = 0.8111 and C(p) = 6.9596

The REG Procedure

Model: MODEL1

Dependent Variable: Oxygen

Forward Selection: Step 3

Analysis of Variance

Sum of Mean

Source DF Squares Square F Value Pr > F

Model 3 690.55086 230.18362 38.64 <.0001

Error 27 160.83069 5.95669

Corrected Total 30 851.38154

Parameter Standard

Variable Estimate Error Type II SS F Value Pr > F

Intercept 111.71806 10.23509 709.69014 119.14 <.0001

RunTime -2.82538 0.35828 370.43529 62.19 <.0001

Age -0.25640 0.09623 42.28867 7.10 0.0129

RunPulse -0.13091 0.05059 39.88512 6.70 0.0154

Bounds on condition number: 1.3548, 11.597

----------------------------------------------------------------------

Forward Selection: Step 4

Variable MaxPulse Entered: R-Square = 0.8368 and C(p) = 4.8800

The REG Procedure

Model: MODEL1

Dependent Variable: Oxygen


Forward Selection: Step 4

Analysis of Variance

Sum of Mean

Source DF Squares Square F Value Pr > F

Model 4 712.45153 178.11288 33.33 <.0001

Error 26 138.93002 5.34346

Corrected Total 30 851.38154

Parameter Standard

Variable Estimate Error Type II SS F Value Pr > F

Intercept 98.14789 11.78569 370.57373 69.35 <.0001

RunTime -2.76758 0.34054 352.93570 66.05 <.0001

Age -0.19773 0.09564 22.84231 4.27 0.0488

RunPulse -0.34811 0.11750 46.90089 8.78 0.0064

MaxPulse 0.27051 0.13362 21.90067 4.10 0.0533

Bounds on condition number: 8.4182, 76.851

---------------------------------------------------------------------

Forward Selection: Step 5

Variable Weight Entered: R-Square = 0.8480 and C(p) = 5.1063

The REG Procedure

Model: MODEL1

Dependent Variable: Oxygen

Forward Selection: Step 5

Analysis of Variance

Sum of Mean

Source DF Squares Square F Value Pr > F

Model 5 721.97309 144.39462 27.90 <.0001

Error 25 129.40845 5.17634

Corrected Total 30 851.38154

Parameter Standard

Variable Estimate Error Type II SS F Value Pr > F


Intercept 102.20428 11.97929 376.78935 72.79 <.0001

RunTime -2.68252 0.34099 320.35968 61.89 <.0001

Age -0.21962 0.09550 27.37429 5.29 0.0301

Weight -0.07230 0.05331 9.52157 1.84 0.1871

RunPulse -0.37340 0.11714 52.59624 10.16 0.0038

MaxPulse 0.30491 0.13394 26.82640 5.18 0.0316

Bounds on condition number: 8.7312, 104.83

---------------------------------------------------------------------

No other variable met the 0.5000 significance level for

entry into the model.

The REG Procedure

Model: MODEL1

Dependent Variable: Oxygen

Summary of Forward Selection

Variable Number Partial Model

Step Entered Vars In R-Square R-Square C(p) F Value Pr > F

1 RunTime 1 0.7434 0.7434 13.6988 84.01 <.0001

2 Age 2 0.0209 0.7642 12.3894 2.48 0.1267

3 RunPulse 3 0.0468 0.8111 6.9596 6.70 0.0154

4 MaxPulse 4 0.0257 0.8368 4.8800 4.10 0.0533

5 Weight 5 0.0112 0.8480 5.1063 1.84 0.1871

Fitted model is:

Oxygen = 102.20428 − 2.68252*RunTime − 0.21962*Age − 0.07230*Weight
         − 0.37340*RunPulse + 0.30491*MaxPulse.

A.3 Backward Elimination Procedure for Fitness Data

The REG Procedure

Model: MODEL1

Dependent Variable: Oxygen

Backward Elimination: Step 0

All Variables Entered: R-Square = 0.8487 and C(p) = 7.0000

Analysis of Variance

Sum of Mean

Source DF Squares Square F Value Pr > F

Model 6 722.54361 120.42393 22.43 <.0001

Error 24 128.83794 5.36825


Corrected Total 30 851.38154

Parameter Standard

Variable Estimate Error Type II SS F Value Pr > F

Intercept 102.93448 12.40326 369.72831 68.87 <.0001

RunTime -2.62865 0.38456 250.82210 46.72 <.0001

Age -0.22697 0.09984 27.74577 5.17 0.0322

Weight -0.07418 0.05459 9.91059 1.85 0.1869

RunPulse -0.36963 0.11985 51.05806 9.51 0.0051

MaxPulse 0.30322 0.13650 26.49142 4.93 0.0360

RestPulse -0.02153 0.06605 0.57051 0.11 0.7473

Bounds on condition number: 8.7438, 137.13

--------------------------------------------------------------------

Backward Elimination: Step 1

Variable RestPulse Removed: R-Square = 0.8480 and C(p) = 5.1063

The REG Procedure

Model: MODEL1

Dependent Variable: Oxygen

Backward Elimination: Step 1

Analysis of Variance

Sum of Mean

Source DF Squares Square F Value Pr > F

Model 5 721.97309 144.39462 27.90 <.0001

Error 25 129.40845 5.17634

Corrected Total 30 851.38154

Parameter Standard

Variable Estimate Error Type II SS F Value Pr > F

Intercept 102.20428 11.97929 376.78935 72.79 <.0001

RunTime -2.68252 0.34099 320.35968 61.89 <.0001

Age -0.21962 0.09550 27.37429 5.29 0.0301

Weight -0.07230 0.05331 9.52157 1.84 0.1871

RunPulse -0.37340 0.11714 52.59624 10.16 0.0038

MaxPulse 0.30491 0.13394 26.82640 5.18 0.0316

Bounds on condition number: 8.7312, 104.83

-------------------------------------------------------------------


Backward Elimination: Step 2

Variable Weight Removed: R-Square = 0.8368 and C(p) = 4.8800

The REG Procedure

Model: MODEL1

Dependent Variable: Oxygen

Backward Elimination: Step 2

Analysis of Variance

Sum of Mean

Source DF Squares Square F Value Pr > F

Model 4 712.45153 178.11288 33.33 <.0001

Error 26 138.93002 5.34346

Corrected Total 30 851.38154

Parameter Standard

Variable Estimate Error Type II SS F Value Pr > F

Intercept 98.14789 11.78569 370.57373 69.35 <.0001

RunTime -2.76758 0.34054 352.93570 66.05 <.0001

Age -0.19773 0.09564 22.84231 4.27 0.0488

RunPulse -0.34811 0.11750 46.90089 8.78 0.0064

MaxPulse 0.27051 0.13362 21.90067 4.10 0.0533

Bounds on condition number: 8.4182, 76.851

----------------------------------------------------------------------

All variables left in the model are significant at the 0.1000 level.

The REG Procedure

Model: MODEL1

Dependent Variable: Oxygen

Summary of Backward Elimination

Variable Number Partial Model

Step Removed Vars In R-Square R-Square C(p) F Value Pr > F

1 RestPulse 5 0.0007 0.8480 5.1063 0.11 0.7473

2 Weight 4 0.0112 0.8368 4.8800 1.84 0.1871

Fitted Model is:

Oxygen = 98.14789 − 2.76758*RunTime − 0.19773*Age
         − 0.34811*RunPulse + 0.27051*MaxPulse.


A.4 Stepwise Selection For Fitness Data

The REG Procedure

Model: MODEL1

Dependent Variable: Oxygen

Stepwise Selection: Step 1

Variable RunTime Entered: R-Square = 0.7434 and C(p) = 13.6988

Analysis of Variance

Sum of Mean

Source DF Squares Square F Value Pr > F

Model 1 632.90010 632.90010 84.01 <.0001

Error 29 218.48144 7.53384

Corrected Total 30 851.38154

Parameter Standard

Variable Estimate Error Type II SS F Value Pr > F

Intercept 82.42177 3.85530 3443.36654 457.05 <.0001

RunTime -3.31056 0.36119 632.90010 84.01 <.0001

Bounds on condition number: 1, 1

-------------------------------------------------------------------

Stepwise Selection: Step 2

Variable Age Entered: R-Square = 0.7642 and C(p) = 12.3894

The REG Procedure

Model: MODEL1

Dependent Variable: Oxygen

Stepwise Selection: Step 2

Analysis of Variance

Sum of Mean

Source DF Squares Square F Value Pr > F

Model 2 650.66573 325.33287 45.38 <.0001

Error 28 200.71581 7.16842

Corrected Total 30 851.38154


Parameter Standard

Variable Estimate Error Type II SS F Value Pr > F

Intercept 88.46229 5.37264 1943.41071 271.11 <.0001

RunTime -3.20395 0.35877 571.67751 79.75 <.0001

Age -0.15037 0.09551 17.76563 2.48 0.1267

Bounds on condition number: 1.0369, 4.1478

---------------------------------------------------------------------

Stepwise Selection: Step 3

Variable RunPulse Entered: R-Square = 0.8111 and C(p) = 6.9596

The REG Procedure

Model: MODEL1

Dependent Variable: Oxygen

Stepwise Selection: Step 3

Analysis of Variance

Sum of Mean

Source DF Squares Square F Value Pr > F

Model 3 690.55086 230.18362 38.64 <.0001

Error 27 160.83069 5.95669

Corrected Total 30 851.38154

Parameter Standard

Variable Estimate Error Type II SS F Value Pr > F

Intercept 111.71806 10.23509 709.69014 119.14 <.0001

RunTime -2.82538 0.35828 370.43529 62.19 <.0001

Age -0.25640 0.09623 42.28867 7.10 0.0129

RunPulse -0.13091 0.05059 39.88512 6.70 0.0154

Bounds on condition number: 1.3548, 11.597

-----------------------------------------------------------------------

Stepwise Selection: Step 4

Variable MaxPulse Entered: R-Square = 0.8368 and C(p) = 4.8800

The REG Procedure

Model: MODEL1

Dependent Variable: Oxygen


Stepwise Selection: Step 4

Analysis of Variance

Sum of Mean

Source DF Squares Square F Value Pr > F

Model 4 712.45153 178.11288 33.33 <.0001

Error 26 138.93002 5.34346

Corrected Total 30 851.38154

Parameter Standard

Variable Estimate Error Type II SS F Value Pr > F

Intercept 98.14789 11.78569 370.57373 69.35 <.0001

RunTime -2.76758 0.34054 352.93570 66.05 <.0001

Age -0.19773 0.09564 22.84231 4.27 0.0488

RunPulse -0.34811 0.11750 46.90089 8.78 0.0064

MaxPulse 0.27051 0.13362 21.90067 4.10 0.0533

Bounds on condition number: 8.4182, 76.851

--------------------------------------------------------------------

All variables left in the model are significant at the 0.1500 level.

No other variable met the 0.1500 significance level for entry into

the model.

The REG Procedure

Model: MODEL1

Dependent Variable: Oxygen

Summary of Stepwise Selection

Variable Variable Number Partial Model

Step Entered Removed Vars In R-Square R-Square C(p) F Value Pr > F

1 RunTime 1 0.7434 0.7434 13.6988 84.01 <.0001

2 Age 2 0.0209 0.7642 12.3894 2.48 0.1267

3 RunPulse 3 0.0468 0.8111 6.9596 6.70 0.0154

4 MaxPulse 4 0.0257 0.8368 4.8800 4.10 0.0533

The fitted model is:

Oxygen = 98.14789 − 2.76758*RunTime − 0.19773*Age
         − 0.34811*RunPulse + 0.27051*MaxPulse.


A.5 Nonnegative Garrote Estimation of Fitness data

Garrote Estimates

Parameters

c1 0.967596

c2 0.837323

c3 0.509481

c4 1.045688

c5 -1.02999E-16

c6 0.845317

c7 0.794596

Variable Beta GARROTE YHATGARROTE YHATGARROTE

Intercept 99.598973 44.835833 46.892219

Age -0.19005 55.950485 46.124924

Weight -0.037792 51.421083 47.735012

RunTime -2.748749 44.654608 39.314405

RestPulse 2.218E-18 40.243897 49.114595

RunPulse -0.312453 48.617018 44.60772

MaxPulse 0.2409351 45.382236 45.520945

Analysis Of variance table

---------------------------------------------------------

Sources DF Sum of Mean F

Squares Squares

---------------------------------------------------------

Model 5 685.72177 114.28696 19.776931

Error 25 132.91244 5.7788016

-----------------------------------------------------------

Corrected 30 818.6342

Total

------------------------------------------------------------

Fitted model is:

Oxygen = 99.598973 − 0.19005*Age − 0.037792*Weight − 2.748749*RunTime
         − 0.312453*RunPulse + 0.2409351*MaxPulse.

A.6 Ridge Estimation For Fitness Data

Parameter Estimate YHATRIDGE YHATRIDGE

B1 0.606797 47.309277 48.158289


B2 0.207098 51.122425 46.588362

B3 0.017955 46.806003 44.790649

B4 -0.652314 44.363172 46.800427

B5 -0.119944 43.359426 46.286207

B6 -0.478536 44.184568 45.626178

B7 0.747344 45.058532 50.078686

Analysis Of variance table

------------------------------------------------------------

Sources DF Sum of Mean F

Squares Squares

------------------------------------------------------------

Model 6 752.57459 125.4291 3.8333333

Error 24 752.57459 32.720634

------------------------------------------------------------

Corrected 30 1505.1492

Total

------------------------------------------------------------

Fitted Model is:

Oxygen = 0.606797 + 0.207098*Age + 0.017955*Weight − 0.652314*RunTime
         − 0.119944*RestPulse − 0.478536*RunPulse + 0.747344*MaxPulse.

A.7 Conclusion

In the subset selection procedures, the number of variables selected and the variables themselves differ from procedure to procedure: in Forward Selection the variables included in the model are Age, Weight, RunPulse, RunTime, and MaxPulse (5 variables), while in the Backward Elimination procedure Age, RunTime, RunPulse, and MaxPulse (4 variables), and in the Stepwise method Age, RunTime, RunPulse, and MaxPulse (4 variables) are included in the model. Since the selected model changes from procedure to procedure, a small change in the data could change the model; the procedures are not very stable. In the Nonnegative Garrote estimation we have 5 variables in the model, all of which are quite influential; the coefficient corresponding to RestPulse is set to zero, as it is of less importance for Oxygen consumption. In the Ridge estimation all the variables are included in the model.

∗ ∗ ∗ ∗ ∗


References

[1]. Alan J. Miller (1990). Subset Selection in Regression. Monographs on Statistics and Applied Probability 40, Chapman and Hall, London.

[2]. Albert Madansky (1988). Prescriptions for Working Statisticians. Springer-Verlag, New York.

[3]. Asish Sen and Muni Srivastava. Regression Analysis: Theory, Methods, and Applications.

[4]. Douglas C. Montgomery and Elizabeth A. Peck (1992). Introduction to Linear Regression Analysis, Second Edition. John Wiley & Sons, New York.

[5]. James McClave (1975). Subset Autoregression. Technometrics, Vol. 17, No. 2, p. 213.

[6]. Leo Breiman (1995). Better Subset Regression Using the Nonnegative Garrote. Technometrics, Vol. 37, pp. 373-384.

[7]. Norman R. Draper and Harry Smith (1981). Applied Regression Analysis, Second Edition. John Wiley & Sons, New York.

[8]. Samprit Chatterjee, Ali S. Hadi and Bertram Price (2000). Regression Analysis by Example, Third Edition. John Wiley & Sons, New York.

[9]. Michael S. Lewis-Beck (ed.) (1993). Regression Analysis. Sage Publications, New Delhi.

87