Linear Regression: Estimation and Distribution Theory

61
3 Linear Regression: Estimation and Distribution Theory 3.1 LEAST SQUARES ESTIMATION Let y be a random variable that fluctuates about an unknown parameter 77; that is, Y = 77 + e, where e is the fluctuation or error. For example, e may be a "natural" fluctuation inherent in the experiment which gives rise to 77, or it may represent the error in measuring 77, so that 77 is the true response and Y is the observed response. As noted in Chapter 1, our focus is on linear models, so we assume that 77 can be expressed in the form 77 = Q + iXi + + p -iXp-i, where the explanatory variables xi,X2,- ,x p -\ are known constants (e.g., experimental variables that are controlled by the experimenter and are mea- sured with negligible error), and the j (j = 0,1,...,p - 1) are unknown parameters to be estimated. If the Xj are varied and n values, Yi,Y%,. ..,Y n , of Y are observed, then Yi = o+ \Xn + + p-lX^p-! +£i (i= l,2,...,7l), (3.1) where Xij is the ith value of Xj. Writing these n equations in matrix form, we have / Yi \ Y 2 \Yn J £20 111 X21 X\2 Z22 Zl.p-1 ^ X2,p-1 l o l \ or \ ]XnO X n i X„2 • • In,p-1 / Y = X/3 + e, V P -i J ( £l \ £2 (3.2) 35 Linear Regression Analysis, Second Edition by George A. F. Seber and Alan J. Lee Copyright © 2003 John Wiley & Sons, Inc.

Transcript of Linear Regression: Estimation and Distribution Theory

Page 1: Linear Regression: Estimation and Distribution Theory

3 Linear Regression: Estimation and

Distribution Theory

3.1 LEAST SQUARES ESTIMATION

Let y be a random variable that fluctuates about an unknown parameter 77; that is, Y = 77 + e, where e is the fluctuation or error. For example, e may be a "natural" fluctuation inherent in the experiment which gives rise to 77, or it may represent the error in measuring 77, so that 77 is the true response and Y is the observed response. As noted in Chapter 1, our focus is on linear models, so we assume that 77 can be expressed in the form

77 = ßQ + ßiXi +■■■+ ßp-iXp-i,

where the explanatory variables xi,X2,- ■ ■ ,xp-\ are known constants (e.g., experimental variables that are controlled by the experimenter and are mea-sured with negligible error), and the ßj (j = 0 ,1 , . . . , p - 1) are unknown parameters to be estimated. If the Xj are varied and n values, Yi,Y%,. ..,Yn, of Y are observed, then

Yi = ßo+ ß\Xn + ■ ■ • + ßp-lX^p-! +£ i (i= l ,2 , . . . ,7 l ) , (3.1)

where Xij is the ith value of Xj. Writing these n equations in matrix form, we have

/ Yi \ Y2

\Yn J

£20 111

X21

X\2

Z22 Z l . p - 1 ^ X2,p-1

l ßo ßl

\

or

\ ]XnO Xni X„2 • • ■ I n , p - 1 /

Y = X/3 + e,

V ßP-i J

(£l \ £2

(3.2)

35

Linear Regression Analysis, Second Edition by George A. F. Seber and Alan J. Lee

Copyright © 2003 John Wiley & Sons, Inc.

Page 2: Linear Regression: Estimation and Distribution Theory

36 LINEAR REGRESSION: ESTIMATION AND DISTRIBUTION THEORY

where X\o = £20 = • • • = xno = 1. The n x p matrix X will be called the regression matrix, and the ZJJ'S are generally chosen so that the columns of X are linearly independent; that is, X has rank p, and we say that X has full rank. However, in some experimental design situations, the elements of X are chosen to be 0 or 1, and the columns of X may be linearly dependent. In this case X is commonly called the design matrix, and we say that X has less than full rank.

It has been the custom in the past to call the Xj 's the independent variables and Y the dependent variable. However, this terminology is confusing, so we follow the more contemporary usage as in Chapter 1 and refer to Xj as a explanatory variable or regressor and Y as the response variable.

As we mentioned in Chapter 1, (3.1) is a very general model. For example, setting Xij — x\ and k = p - 1, we have the polynomial model

Yi = ßo + ßiXi + ß2x\ + • • • + ßkxk{ + £i.

Again, Yi = ß0 + ßieWil + ß-iWixwn + ß3 sinwi3 + e{

is also a special case. The essential aspect of (3.1) is that it is linear in the unknown parameters ßy, for this reason it is called a linear model. In contrast,

Yi = ßo + ß1e-0^+ei

is a nonlinear model, being nonlinear in ßi. Before considering the problem of estimating ß, we note that all the theory

in this and subsequent chapters is developed for the model (3.2), where XJO is not necessarily constrained to be unity. In the case where i<o ^ 1. the reader may question the use of a notation in which i runs from 0 to p — 1 rather than 1 to p. However, since the major application of the theory is to the case ijo = 1, it is convenient to "separate" ßo from the other /?;'s right from the outset. We shall assume the latter case until stated otherwise.

One method of obtaining an estimate of ß is the method of least squares. This method consists of minimizing J2ie? with respect to ß; that is, setting 0 = X/3, we minimize e'e = ||Y - 0||2 subject to 0 € C(X) = fi, where Q is the column space of X (= {y : y = Xx for any x}). If we let 0 vary in ÎÎ, ||Y - 8\\2 (the square of the length of Y - 0) will be a minimum for 6 = 6 when (Y — 0) ± H (cf. Figure 3.1). This is obvious geometrically, and it is readily proved algebraically as follows.

We first note that 0 can be obtained via a symmetric idempotent (projec-tion) matrix P, namely 0 = PY, where P represents the orthogonal projection onto fl (see Appendix B). Then

Y - 0 = (Y - 0) + (0 - 0),

Page 3: Linear Regression: Estimation and Distribution Theory

LEAST SQUARES ESTIMATION 37

/ fi

Fig. 3.1 The method of least squares consists of finding A such that AB is a minimum.

where from P0 = 0, P ' = P and P 2 = P, we have

( Y - 0 ) ' ( 0 - 0 ) = (Y - PY)'P(Y - 0) = Y'(I„ - P)P(Y - 0) = 0.

Hence

| | Y - 0 | | 2 = | | Y - 0 | | 2 + | | 0 - 0 | | 2

> IIY-0II2 ,

with equality if and only if 0 = 0. Since Y - 0 is perpendicular to il,

X'(Y - 0) = 0

or X'0 = X'Y. (3-3)

Here 0 is uniquely determined, being the unique orthogonal projection of Y onto fî (see Appendix B).

We now assume that the columns of X are linearly independent so that there exists a unique vector ß such that 0 = Xß. Then substituting in (3.3), we have

X'Xp- = X'Y, (3.4) the normal equations. As X has rank p, X'X is positive-definite (A.4.6) and therefore nonsingular. Hence (3.4) has a unique solution, namely,

ß = ( X ' X ^ X ' Y . (3-5)

Here ß is called the (ordinary) least squares estimate of ß, and computational methods for actually calculating the estimate are given in Chapter 11.

Page 4: Linear Regression: Estimation and Distribution Theory

38 LINEAR REGRESSION: ESTIMATION AND DISTRIBUTION THEORY

We note that ß can also be obtained by writing

e'e = (Y - Xß)'(Y - Xß)

= Y ' Y - 2/3'X'Y + ß'X'Xß

[using the fact that ß'X'Y = (ß'X'Y)1 = Y'Xß] and differentiating e'e with respect to ß. Thus from de'e/dß = 0 we have (A.8)

- 2 X ' Y + 2X'Xß = 0 (3.6)

or X'Xß = X'Y.

This solution for ß gives us a stationary value of e'e, and a simple algebraic identity (see Exercises 3a, No. 1) confirms that ß is a minimum.

In addition to the method of least squares, several other methods are used for estimating ß. These are described in Section 3.13.

Suppose now that the columns of X are not linearly independent. For a particular 8 there is no longer a unique ß such that 8 = Xß, and (3.4) does not have a unique solution. However, a solution is given by

ß = (X 'X) -X 'Y ,

where (X'X)~ is any generalized inverse of (X'X) (see A.10). Then

8 = Xß = X ( X ' X ) _ X ' Y = P Y ,

and since P is unique, it follows that P does not depend on which generalized inverse is used.

We denote the fitted values Xß by Y = (Yi,..., Yn)'. The elements of the vector

Y - Y = Y-Xß

= ( I „ - P ) Y , say, (3.7)

are called the residuals and are denoted by e. The minimum value of e'e, namely

e'e = (Y - X/3)'(Y - Xß)

= Y 'Y - 2/3'X'Y + ß'X'Xß

= Y ' Y - ß'X'Y + ß'[X'Xß - X'Y]

= Y ' Y - j S ' X ' Y [by (3.4)], (3.8) = Y'Y-ß'X'Xß, (3.9)

is called the residual sum of squares (RSS). As 8 = Xß is unique, we note that Y, e, and RSS are unique, irrespective of the rank of X.

Page 5: Linear Regression: Estimation and Distribution Theory

LEAST SQUARES ESTIMATION 39

EXAMPLE 3.1 Let Y\ and Y2 be independent random variables with means a and 2a, respectively. We will now find the least squares estimate of a and the residual sum of squares using both (3.5) and direct differentiation as in (3.6). Writing

(5)-(i)"(:)-we have Y = Xß + e, where X = I „ I and ß = a. Hence, by the theory above,

and

â = (X'X) -1X'Y

= {(1,2) ( * ) } ' ( 1 , 2 ^

= k(Yx+2Y2)

= Y'Y - ß'X'Y

= Y'Y - â(Yi + 2Y2) = Y1

2 + Y22-±(Y1+2Y2)2.

We note that

p-U)M;)r«M>-K;o-The problem can also be solved by first principles as follows: e'e = (Yi —

a)2 + (Y2 - 2a)2 and de'e/da = 0 implies that â = i(Yi + 2Y2). Further,

e'e = (Yi - d)2 + (F2 - 2â)2

= Yj2 + Y22 - â(2Yi + 4Y2) + 5a2

= Y12 + Y2

2-i(Y1+2Y2)2 .

In practice, both approaches are used. D

EXAMPLE 3.2 Suppose that Yi, Y2)...,Y„ all have mean ß. Then the least squares estimate of ß is found by minimizing ^ (Y< - ß)2 with respect to ß. This leads readily to ß = Y. Alternatively, we can express the observations in terms of the regression model

Y = ln/? + e,

Page 6: Linear Regression: Estimation and Distribution Theory

40 LINEAR REGRESSION: ESTIMATION AND DISTRIBUTION THEORY

where l n is an n-dimensional column of 1'

4 = ( i ; i n ) - 1 i ; Y =

Also,

/ 1 p = i „ ( i ; i n ) - 1 i ; = i 1

s. Then

^ » Y =

1 ••• 1 ••■

Y.

1 \ 1

\ 1 1

= - J „ D n

1 )

We have emphasized that P is the linear transformation representing the orthogonal projection of n-dimensional Euclidean space, 8?n, onto fi, the space spanned by the columns of X. Similarly, I„ — P represents the orthogonal projection of 9fcn onto the orthogonal complement, fix, of ft. Thus Y = PY 4- (In — P)Y represents a unique orthogonal decomposition of Y into two components, one in fi and the other in n x . Some basic properties of P and (I„—P) are proved in Theorem 3.1 and its corollary, although these properties follow directly from the more general results concerning orthogonal projections stated in Appendix B. For a more abstract setting, see Seber [1980].

THEOREM 3.1 Suppose thatXisnxp ofrankp, so thatP = X(X'X)_1X'. Then the following hold.

(i) P and I„ - P are symmetric and idempotent.

(ii) rank(In - P) = tr(In - P) = n - p.

(Hi) PX = X.

Proof, (i) P is obviously symmetric and (I„ — P) ' = I„ — P ' = I„ — P. Also,

P2 = X(X'X)-1X'X(X'X)-1X' = XIP(X'X)- 1X'= P,

and (In - P)2 = I„ - 2P + P 2 = In - P. (ii) Since In - P is symmetric and idempotent, we have, by A.6.2,

r ank( I n -P ) = tr(In - P) = n - tr(P),

where

tr(P) = trpCfX'X) - 1^] tr[X'X(X'X)-x] tr(Ip) P-

(by A.1.2)

Page 7: Linear Regression: Estimation and Distribution Theory

LEAST SQUARES ESTIMATION 41

(iii) P X = X ( X ' X ) - 1 X ' X = X. D

COROLLARY If X has rank r (r <p), then Theorem 3.1 still holds, but with p replaced by r. Proof. Let Xi be an n x r matrix with r linearly independent columns and hav-ing the same column space as X [i.e., C(Xi) = fi]. Then P = Xi(X' 1 Xi) - 1 X' 1 , and (i) and (ii) follow immediately. We can find a matrix L such that X = XjL, which implies that (cf. Exercises 3j, No. 2)

P X = X i f X i X i ^ X i X i L = XXL = X,

which is (iii). D

EXERCISES 3a

1. Show that if X has full rank,

(Y - Xß)'(Y - Xß) = (Y - Xj8)'(Y - Xß) + (ß - ß)'X'X(ß - ß),

and hence deduce that the left side is minimized uniquely when ß — ß.

2. If X has full rank, prove that £" = 1 (Fi - %) = 0. Hint: Consider the first column of X.

3. Let

Yi = 0 + ei Y2 = 29-<j> + e2

Y3 = e + 2(j) + e3,

where E[si] = 0 (i — 1,2,3). Find the least squares estimates of 6 and

4. Consider the regression model

E[Yi] = ß0 + ßm + ßi{1x] - 2) (t = 1,2,3),

where x\ = — 1, x2 = 0, and x3 = + 1 . Find the least squares estimates of ß0, ßi, and ß2. Show that the least squares estimates of /?o and ßi are unchanged if ß2 = 0.

5. The tension T observed in a nonextensible string required to maintain a body of unknown weight w in equilibrium on a smooth inclined plane of angle 9 (0 < 6 < n/2) is a random variable with mean E[T] = wsin9. If for 9 = 9i (i = 1,2, . . . , n ) the corresponding values of T are T» (i = 1,2,. . . , n), find the least squares estimate of w.

6. If X has full rank, so that P = X(X 'X) - 1 X' , prove that C(P) = C(X).

Page 8: Linear Regression: Estimation and Distribution Theory

42 LINEAR REGRESSION: ESTIMATION AND DISTRIBUTION THEORY

7. For a general regression model in which X may or may not have full rank,show that

n

t = i

8. Suppose that we scale the explanatory variables so that Xij = kjWij for all i,j. By expressing X in terms of a new matrix W, prove that Y remains unchanged under this change of scale.

3.2 PROPERTIES OF LEAST SQUARES ESTIMATES

If we assume that the errors are unbiased (i.e., E[e] = 0), and the columns of X are linearly independent, then

E[ß] = (X'X) -1X'£[Y] = (X'X)_1X'X/9 = ß, (3.10)

and ß is an unbiased estimate of ß. If we assume further that the a are uncorrelated and have the same variance, that is, cov^e , ] = ôij<T2, then Var[e] = tr2I„ and

Var[Y] = Var[Y - Xß] = Var[e].

Hence, by (1.7),

Var[/3] = Var[(X'X)-1X'Y] = ( X ' X ) - 1 ^ Var[Y]X(X'X)-1

= a2(X'X)"1(X'X)(X'X)-1

= ^ ( X ' X ) - 1 . (3.11)

The question now arises as to why we chose ß as our estimate of ß and not some other estimate. We show below that for a reasonable class of estimates, ßj is the estimate of ßj with the smallest variance. Here ßj can be extracted from ß = (ßo,ßi,- ■ ■ ,ßP-i)' simply by premultiplying by the row vector c', which contains unity in the (j+l)th position and zeros elsewhere. It transpires that this special property of ßj can be generalized to the case of any linear combination &'ß using the following theorem.

THEOREM 3.2 Let 0 be the least squares estimate of 0 = Xß, where 6 G fi = C(X) and X may not have full rank. Then among the class of linear unbiased estimates ofc'9, c'0 is the unique estimate with minimum variance. [We say that c'0 is the best linear unbiased estimate (BLUE) of c'0.]

Page 9: Linear Regression: Estimation and Distribution Theory

PROPERTIES OF LEAST SQUARES ESTIMATES 43

Proof. Prom Section 3.1, 0 = P Y , where P 0 = PXß = Xß = 0 (Theorem 3.1, Corollary). Hence E[c'0] = c 'P0 = c'0 for all 0 € fi, so that c'0 [= (Pc)'Y] is a linear unbiased estimate of c'0. Let d 'Y be any other linear unbiased estimate of c'0. Then c'0 = E[d'Y] = d '0 or (c - d) '0 = 0, so that (c - d) ± ft. Therefore, P(c - d) = 0 and Pc = Pd .

Now

var[c'0] = var[(Pc)'Y] = var[(Pd)'Y] = a 2 d ' P ' P d = <r2d'P2d = ff2d'Pd (Theorem 3.1)

so that

var[d'Y] - var[c'0] = var[d'Y] - var[(Pd)'Y] = < r 2 ( d ' d - d ' P d ) = < r 2 d ' ( I „ - P ) d = a 2 d ' ( I n - P ) ' ( I n - P ) d = <r2didi, say, > 0,

with equality only if (In - P )d = 0 or d = P d = Pc . Hence c'0 has minimum variance and is unique. D

COROLLARY If X has full rank, then a'/3 is the BLUE of a'/3 for every vector a. Proof. Now 0 = Xß implies that ß = (X'X)-lX'0 and ß = (X'X^X'O. Hence setting c' = a ' ( X ' X ) - 1 X ' we have that a!ß (= c'0) is the BLUE of a'y9 (= c'0) for every vector a. D

Thus far we have not made any assumptions about the distribution of the £{. However, when the e< are independently and identically distributed as N(0,cr2), that is, e ~ JV(0,a2I„) or, equivalently, Y ~ JV„(X/3,<72I„), then a!ß has minimum variance for the entire class of unbiased estimates, not just for linear estimates (cf. Rao [1973: p. 319] for a proof). In particular, ft, which is also the maximum likelihood estimate of ft (Section 3.5), is the most efficient estimate of ft.

When the common underlying distribution of the e< is not normal, then the least squares estimate of ft is not the same as the asymptotically most efficient maximum likelihood estimate. The asymptotic efficiency of the least squares estimate is, for this case, derived by Cox and Hinkley [1968].

Eicker [1963] has discussed the question of the consistency and asymptotic normality of ß as n —> oo. Under weak restrictions he shows that ß is a

Page 10: Linear Regression: Estimation and Distribution Theory

44 LINEAR REGRESSION: ESTIMATION AND DISTRIBUTION THEORY

consistent estimate of ß if and only if the smallest eigenvalue of X 'X tends to infinity. This condition on the smallest eigenvalue is a mild one, so that the result has wide applicability. Eicker also proves a theorem giving necessary and sufficient conditions for the asymptotic normality of each ßj (see Anderson [1971: pp. 23-27]).

EXERCISES 3b

1. Let Yi = ß0 + ßiXi + ei (i = 1,2,.. . , n), where E[e] = 0 and Var[e] = CT2In. Find the least squares estimates of ßQ and ß\. Prove that they are uncorrelated if and only if x — 0.

2. In order to estimate two parameters 9 and </> it is possible to make observations of three types: (a) the first type have expectation 9, (b) the second type have expectation 9 + 4>, and (c) the third type have expectation 9 — 2<j>. All observations are subject to uncorrelated errors of mean zero and constant variance. If m observations of type (a), m observations of (b), and n observations of type (c) are made, find the least squares estimates 9 and 4>. Prove that these estimates are uncorrelated if m = 2n.

3. Let Y\,Y2,... ,Yn be a random sample from N(9,a2). Find the linear unbiased estimate of 9 with minimum variance.

4. Let

Yi = ßo+ßi(xn - x i ) + ß2(xi2 -x2)+£i (i = l , 2 , . . . , n ) ,

where Xj = ]C"=i xijln, E[e] = 0, and Var[e] = a 2 I n . If ß\ is the least squares estimate of ß\, show that

2

var[Â1 = E«(*«-ïOa(i-H)'

where r is the correlation coefficient of the n pairs ( X Ü , ^ ) -

3.3 UNBIASED ESTIMATION OF a2

We now focus our attention on a2 (= var[£j]). An unbiased estimate is de-scribed in the following theorem.

THEOREM 3.3 / / E[Y) = X/3, where X is an n x p matrix of rank r (r < P)i and Var[Y] = «r2In, then

S2 = (Y ~ 0)'(Y - g) = RSS_ n — r n — r

Page 11: Linear Regression: Estimation and Distribution Theory

UNBIASED ESTIMATION OF a2 45

is an unbiased estimate of a2.

Proof. Consider the full-rank representation 8 = X i a , where Xi is n x r of rank r. Then

Y - 0 = (In - P )Y ,

where P = X i ^ X i ^ X i . Prom Theorem 3.1 we have

(n-r)S2 = Y ' ( I „ - P ) ' ( I „ - P ) Y = Y ' ( I n - P ) 2 Y = Y ' ( I n - P ) Y . (3.12)

Since PÖ = 0, it follows from Theorems 1.5 and 3.1(iii) applied to Xi that

E[Y'(ln - P)Y] = a 2 t r ( I n - P ) + Ô ' ( I n - P ) o = (T 2 (n - r ) ,

and hence E[S2} = <r2. D When X has full rank, S2 = (Y - Xß)'(Y - Xß)/(n - p). In this case it

transpires that S2, like ß, has certain minimum properties which are partly summarized in the following theorem.

T H E O R E M 3.4 (Atiqullah [1962]) LetYuY2,. ..,Yn ben independent ran-dom variables with common variance a2 and common third and fourth mo-ments, fj,3 and Hi, respectively, about their means. If E\Y] = Xß, where X is n x p of rank p, then (n — p)S2 is the unique nonnegative quadratic unbiased estimate of (n — p)a2 with minimum variance when /X4 = 3<r4 or when the diagonal elements of P are all equal.

Proof. Since a2 > 0 it is not unreasonable to follow Rao [1952] and consider estimates that are nonnegative. Let Y'AY be a member of the class C of nonnegative quadratic unbiased estimates of (n - p)o~2. Then, by Theorem 1.5,

(n - p)a2 = fi[Y'AY] = a2 tr(A) + ß'X'AXß

for all ß, so that tr(A) = n-p (setting ß = 0) and ß'X'AXß = 0 for all ß. Thus X 'AX = 0 (A.11.2) and, since A is positive semidefinite, AX = 0 (A.3.5) and X'A = 0. Hence if a is a vector of diagonal elements of A, and 72 = (/*4 - 3<r4)/a4, it follows from Theorem 1.6 that

var[Y'AY] = a ^ a ' a - t - 2<r4tr(A2) + 4a2ß'X'A2Xß + 4^'X'ABL

= a 47 2a 'a + 2<74tr(A2). (3.13)

Now by Theorem 3.3, (n-p)S2 [= Y'(I„ - P ) Y = Y 'RY, say] is a member of the class C. Also, by Theorem 3.1,

tr(R2) = t r ( R ) = n - p,

Page 12: Linear Regression: Estimation and Distribution Theory

46 ■ LINEAR REGRESSION: ESTIMATION AND DISTRIBUTION THEORY

so that if we substitute in (3.13), var[Y'RY] = a472r'r + 2<r4(n - p). (3.14)

To find sufficient conditions for Y'RY to have minimum variance for class C, let A = R + D. Then D is symmetric, and tr(A) = tr(R) + tr(D); thus tr(D) = 0. Since AX = 0, we have AP = AX(X'X)_1X' = 0, and combining this equation with P 2 = P, that is, RP = 0, leads to

0 = AP = RP + DP = DP

and

Hence DR = D (= D' = RD).

A2 = R2 + DR + RD + D2

= R + 2D + D2

and tr(A2) = tr(R) + 2tr(D) + tr(D2)

= ( n - p ) + t r ( D 2 ) . Substituting in (3.13), setting a = r + d, and using (3.14), we have

var[Y'AY] = a472a'a + 2<r4[(n - p ) + tr(D2)] = <r472(r'r + 2r'd + d'd) + 2<r4[(n - p) + tr(D2)] = o S r ' r + 2<T4(n - p) + 2a4[72(r'd + §d'd) + tr(D2)] = var[Y'RY]

+ 2<74 72 E ^ + i E * + E E 4

To find the estimate with minimum variance, we must minimize var[Y'AY] subject to tr(D) = 0 and DR = D. The minimization in general is difficult (cf. Hsu [1938]) but can be done readily in two important special cases. First, if 72 = 0, then

var[Y'AY] = var[Y'RY] + 2<r4 £ £ d%, » 3

which is minimized when dkj = 0 for all i,j, that is, when D = 0 and A = R. Second, if the diagonal elements of P are all equal, then they are equal to p/n [since, by Theorem 3.1(a), tr(P) = p]. Hence ru = (n —p)/n for each i and

var[Y'AY] = var[Y'RY] + 2<r4

= var[Y'RY] + 2<r4

72(o+iEd?i)+EE4

(b2 + i)E4 + E E 4

Page 13: Linear Regression: Estimation and Distribution Theory

DISTRIBUTION THEORY 47

as Y,iriidii = [(n - p)/n]tr(D) = 0. Now 72 > - 2 (A.13.1), so that var[Y'AY] is minimized when dy = 0 for all i,j. Thus in both cases we have minimum variance if and only if A = R. D

This theorem highlights the fact that a uniformly minimum variance quad-ratic unbiased estimate of a2 exists only under certain restrictive conditions like those stated in the enunciation of the theorem. If normality can be assumed (72 = 0), then it transpires that (Rao [1973: p. 319]) S2 is the minimum variance unbiased estimate of a2 in the entire class of unbiased estimates (not just the class of quadratic estimates).

Rao [1970, 1972] has also introduced another criterion for choosing the estimate of a2: minimum norm quadratic unbiased estimation (MINQUE). Irrespective of whether or not we assume normality, this criterion also leads to S2 (cf. Rao [1970, 1974: p. 448]).

EXERCISES 3c

1. Suppose that Y ~ Nn(Xß, <r2In), where X is n x p of rank p.

(a) Find var[S2].

(b) Evaluate E[(Y'Ai Y - a2)2} for

n — p + I

(c) Prove that Y'AiY is an estimate of a2 with a smaller mean-squared error than S2.

(Theil and Schweitzer [1961])

2. Let Yi,Y2,.. .,Yn be independently and identically distributed with mean 9 and variance a2. Find the nonnegative quadratic unbiased esti-mate of a2 with the minimum variance.

3.4 DISTRIBUTION THEORY

Until now the only assumptions we have made about the e< are that E[e] = 0 and Var[e] = a2ln. If we assume that the et are also normally distributed, then e ~ Nn(0, er2In) and hence Y ~ Nn(Xß,a2In). A number of distribu-tional results then follow.

THEOREM 3.5 / / Y ~ Nn(Xß, a2ln), where X is n x p of rank p, then:

(i) ß~Np(ß,a2(X'X)-1).

(ii) (ß - ß)'X'X(ß - ß)/a2 ~ xl-

Page 14: Linear Regression: Estimation and Distribution Theory

48 LINEAR REGRESSION: ESTIMATION AND DISTRIBUTION THEORY

(iii) ß is independent of S2.

(iv)RSS/o-* = (n-P)S*/a*~x2n-P-

Proof, (i) Since ß = (X 'X) _ 1 X'Y = CY, say, where C is a p x n matrix such that rankC = rankX' = rankX = p (by A.2.4), ß has a multivariate normal distribution (Theorem 2.2 in Section 2.2). In particular, from equations (3.10) and (3.11), we have ß ~ Np{ß,a2{X'X)-1).

(ii) 0 - ß)'X'X0 - ß)/a2 = 0 - ß)'{VB,x\ß})-l{ß - ß), which, by (i) and Theorem 2.9, is distributed as \P-

(iii)

Cov[0 ,Y-X/3] = C o v K X ' X ^ X ' Y . f l n - P J Y ]

= ( X ' X ^ X ' Cov[Y] (I» - P ) ' = a2(X'X)-lX'(ln-P)

= 0 [by Theorem 3.1(iii)].

If U = ß and V = Y - X,3 in Theorem 2.5 (Section 2.3), ß is independent of ||(Y - Xß)\\2 and therefore of S2 .

(iv) This result can be proved in various ways, depending on which theo-rems relating to quadratic forms we are prepared to invoke. It is instructive to examine two methods of proof, although the first method is the more standard one. Method 1: Using Theorem 3.3, we have

RSS = Y'(I„ - P ) Y = (Y - Xj8)'(I„ - P)(Y - Xß) [by Theorem 3.1 (iii)] = e ' ( I „ - P ) e , (3.15)

where I „ - P is symmetric and idempotent of rank n-p. Since e ~ N„(0, cr2ln), RSS/<r2 ~ xl-p (Theorem 2.7 in Section 2.4). Method 2:

Ql = (Y-Xß)'(Y-Xß)

= (Y-Xß + X(ß-ß))'(Y-Xß + X(ß-ß))

= (Y-Xß)'(Y-Xß)

+ 20 - ß)'X'(Y - Xß) + 0- ß)'X'X0 - ß) = (Y-Xßy(Y-Xß) + 0-ß)'X'X0-ß) = Q + Q2, say, (3.16)

since, from the normal equations,

0 - ß)'X'(Y - Xß) = 0 - ß)'(X'Y - X'Xß) = 0. (3.17)

Page 15: Linear Regression: Estimation and Distribution Theory

MAXIMUM LIKELIHOOD ESTIMATION 49

Now Qy/a2 (= Ei^i/f) is xl and Q2/a2 ~ x2, [by (ü)]. Also, Q2 is a

continuous function of ß, so that by Example 1.11 and (iii), Q is independent of Q2. Hence Q/o2 ~ x2_p (Example 1.10, Section 1.6). D

EXERCISES 3d

1. Given Yi, Y2,..., Y„ independently distributed as N(6, a2), use Theorem 3.5 to prove that:

(a) Y is statistically independent of Q = ^ ( Y i ~~ ^ ) 2 -(b) g / a 2 - ^ .

2. Use Theorem 2.5 to prove that for the full-rank regression model, RSS is independent of {ß - ß)'X'X(ß - ß).

3.5 MAXIMUM LIKELIHOOD ESTIMATION

Assuming normality, as in Section 3.4, the likelihood function, L(ß, er2) say, for the full-rank regression model is the probability density function of Y, namely,

L(ß,a2) = (27r<72)-"/2 exp { - ^ H y - Xp)| |2} .

Let l(ß,v) = logL(ß,a2), where v = a2. Then, ignoring constants, we have

l(p>) = ~ l o g t , - ^ | | y - X p - | | 2 ,

and from (3.6) it follows that

i=-^-2X>+2X 'x« and

91 n 1 M tran2

Setting dl/dß = 0, we get the least squares estimate of ß, which clearly maximizes l(ß,v) for any v > 0. Hence

L(ß, v) < L(ß, v) for all v > 0

with equality if and only if ß = /9. We now wish to maximize L(ß,v), or equivalently l(ß,v), with respect to

v. Setting dl/dv = 0, we get a stationary value of v = ||y - X/3)||2/n. Then

l(ß,v)-l(ß,v) = - 2

> 0,

log ( * ) + ! - *

Page 16: Linear Regression: Estimation and Distribution Theory

50 LINEAR REGRESSION: ESTIMATION AND DISTRIBUTION THEORY

since x < ex 1 and therefore logs; < x — 1 for x > 0 (with equality when x = 1). Hence

L(ß,v) <L(ß,v) f o r a l U X ) with equality if and only if ß = ß and v — v. Thus ß and v are the maximum likelihood estimates of ß and v. Also, for future use,

L(/9,<72) = (27râ 2)- n / 2e-"/ 2 . (3.18)

In determining the efficiency of the estimates above, we derive the (ex-pected) information matrix

I = -E[d2l/d0dß'}

= Var[dl/d6}, (3.19)

where 6 = (ß',v)'. As a first step we find that

a/33/3' ~ v*(**>'

_g_=;l(_2XV + X'X/3)

and

921 n 1 11 v a n 2

We note that \\Y-Xß\\2/v = e'e/v ~ x 2 , so that E[e'e] = nv (as E[x2n} = n).

Replacing y by Y and taking expected values in the equations above gives us

/ I ( X ' X ) 0 V o' A \ 2v2

This gives us the multivariate Cramer-Rao lower bound for unbiased estimates of 0, namely,

/ u (X 'X)- 1 0 r ' - ( o' a

\ n Since Var[^] = D ( X ' X ) - 1 , ß is the best unbiased estimate of ß in the sense that for any a, a'ß is the minimum variance unbiased estimate (MINVUE) of a'/3.

Since (n-p)S2/v ~ Xn-p [by Theorem 3.5(iv)] and var[x2_p] = 2(n-p), it follows that var[52] = 2v /(n - p), which tends to 2v2/n as n -> 00. This tells us that 5 2 is, asympotically, the MINVUE of v. However, the Cramer-Rao lower bound gives us just a lower bound on the minimum variance rather than the actual minimum. It transpires that 5 2 is exactly MINVUE, and a different approach is needed to prove this (e.g., Rao [1973: p. 319]).

Page 17: Linear Regression: Estimation and Distribution Theory

ORTHOGONAL COLUMNS IN THE REGRESSION MATRIX 51

3.6 ORTHOGONAL COLUMNS IN THE REGRESSION MATRIX

/ x<°>'x<°> 0

^ 0

0 X(1)'X(1) .

0

0 0 0

• x(p-1>'x(p_1)

Suppose that in the full-rank model E[Y) = Xß the matrix X has a column representation

X = (x(°))x(1),...)x(p-1)))

where the columns are all mutually orthogonal. Then

ß = ( X ' X ^ X ' Y v - l / x(o)/Y \ X ' x^ 'Y

V xfr-^'Y J ( (X (O)/X (O))-IX (O) (Y

(x( i) 'x( i))- ix(D'Y

\ (x<p-1)'x (p_1))_1x (p_1) 'Y )

Thus ßj = x ^ ' Y / x ^ ' x ^ ) turns out to be the least squares estimate of ßj for the model E[Y] = x^ßj, which means that the least squares estimate of ßj is unchanged if any of the other ßi (I ^ j) are put equal to zero. Also, from equations (3.8) and (3.9), the residual sum of squares takes the form

RSS = Y 'Y- /3 'X 'Y p - i

= Y ' Y - J ^ x ^ ' Y i=o p - i

= Y 'Y-53^(x ( J > 'xW) . (3.20) i=o

If we put ßj — 0 in the model, the only change in the residual sum of squares is the addition of the term ßjX^'Y, so that we now have

P-I

Y ' Y - Y, &-x(r),Y. r=0,r^j

(3.21)

Two applications of this model are discussed in Sections 7.1.2 and 7.3.1.

EXAMPLE 3.3 Consider the full-rank model

Y{ =ßo+ßiXn + • • • + ßp-ixp-i +£i (i = l ,2 , . . . ,n) ,

Page 18: Linear Regression: Estimation and Distribution Theory

52 LINEAR REGRESSION: ESTIMATION AND DISTRIBUTION THEORY

where the £j are i.i.d. N(0,a2) and the xy are standardized so that for j = 1,2,. ..,p — l, 2 i Xij = 0 and £)i xlj = c- We now show that

P-I

P

t P - i - £ v a r [ & ] (3.22)

is minimized when the columns of X are mutually orthogonal. From

say, we have *'Hs°c).

p - i

£ v a r [ & ] = tr(Var[/8]) 3=0

a2 t r (C- x ) + i n

= Ev1' (3-23) where Ao = n and Aj (j = 1,2,... ,p - 1) are the eigenvalues of C (A.1.6). Now the minimum of (3.23) subject to the condition tr(X'X) = n + c(p— 1), or tr(C) = c(p— 1), is given by Xj = constant, that is, Xj = c (j — 1,2,... , p - 1 ) . Hence there exists an orthogonal matrix T such that T ' C T = cl p_i , or C = clp_i , so that the columns of X must be mutually orthogonal. D

This example shows that using a particular optimality criterion, the "opti-mum" choice of X is the design matrix with mutually orthogonal columns. A related property, proved by Hotelling (see Exercises 3e, No. 3), is the following: Given any design matrix X such that x^'x^ — c?, then

2 v ar [^] > -j,

ci

and the minimum is attained when x ^ ' x ^ = 0 (all r, r ^ j) [i.e., when x^ is perpendicular to the other columns].

EXERCISES 3e

1. Prove the statement above that the minimum is given by Aj = c (j = l , 2 , . . . , p - l ) .

2. It is required to fit a regression model of the form

E\Yi] = ßo + ßixt + M(xi) (» = 1,2,3),

Page 19: Linear Regression: Estimation and Distribution Theory

ORTHOGONAL COLUMNS IN THE REGRESSION MATRIX 53

where <f>(x) is a second-degree polynomial. If xi = — 1, x2 = 0, and X3 = 1, find (j> such that the design matrix X has mutually orthogonal columns.

3. Suppose that X = (x< , ) 1 x l 1 l | . . . , r ( ' - 1 ) | xW) = (W,xW) has linearly independent columns.

(a) Using A.9.5, prove that

det(X'X) = det(W'W) (x ( p ) 'x ( p ) - x ( p ) ' W ( W ' W ) - 1 W ' x ( p ) ) .

(b) Deduce that det(W'W) 1 det(X'X) - x W ' x W

and hence show that var[/3p] > <72(x(p)'x(p))_1 with equality if and only if x<*>'xW = o (j = 0 , 1 , . . . ,p - 1).

(Rao [1973: p. 236])

4. What modifications in the statement of Example 3.3 proved above can be made if the term ß0 is omitted?

5. Suppose that we wish to find the weights ßi (i = 1,2,. . . , k) of k objects. One method is to weigh each object r times and take the average; this requires a total of kr weighings, and the variance of each average is cr2/r (<x2 being the variance of the weighing error). Another method is to weigh the objects in combinations; some of the objects are distributed between the two pans and weights are placed in one pan to achieve equilibrium. The regression model for such a scheme is

Y = ßixi + ß2x2 + ■■■ + ßkxk + e,

where Xj = 0, 1, or — 1 according as the ith object is not used, placed in the left pan or in the right pan, e is the weighing error (assumed to be the same for all weighings), and Y is the weight required for equilibrium (Y is regarded as negative if placed in the left pan). After n such weighing operations we can find the least squares estimates ßi of the weights.

(a) Show that the estimates of the weights have maximum precision (i.e., minimum variance) when each entry in the design matrix X is ±1 and the columns of X are mutually orthogonal.

(b) If the objects are weighed individually, show that kn weighings are required to achieve the same precision as that given by the optimal design with n weighings.

(Rao [1973: p. 309])

Page 20: Linear Regression: Estimation and Distribution Theory

54 LINEAR REGRESSION: ESTIMATION AND DISTRIBUTION THEORY

3.7 INTRODUCING FURTHER EXPLANATORY VARIABLES

3.7.1 General Theory

Suppose that after having fitted the regression model

E[Y] = X/3, Var[Y] = a2ln,

we decide to introduce additional Xj's into the model so that the model is now enlarged to

G:E[Y] = X 0 + Z 7

= WS, (3.24)

say, where X is n x p of rank p, Z is n x t of rank t, and the columns of Z are linearly independent of the columns of X; that is, W is n x (p + t) of rank p 4-1. Then to find the least squares estimate SQ of «5 there are two possible approaches. We can either compute 6G and its dispersion matrix directly from

{G = ( W W ) - ' W ' Y and Var[<5G] = ff2(W'W)-1,

or to reduce the amount of computation, we can utilize the calculations al-ready carried out in fitting the original model, as in Theorem 3.6 below. A geometrical proof of this theorem, which allows X to have less than full rank, is given in Section 3.9.3. But first a lemma.

LEMMA If R = In - P = I„ - X C X ' X ) - ^ ' , then Z'RZ is positive-definite. Proof. Let Z'RZa = 0; then, by Theorem 3.1(i),

a'Z'R'RZa = a'Z'RZa = 0,

or RZa = 0. Hence Za = X(X'X) _ 1X'Za = Xb, say, which implies that a = 0, as the columns of Z are linearly independent of the columns of X. Because Z'RZa = 0 implies that a = 0, Z'RZ has linearly independent columns and is therefore nonsingular. Also, a'Z'RZa = (RZa)'(RZa) > 0.

D

THEOREM 3.6 Let RG = I„ - W ( W ' W ) - 1 W , L = ( X ' X ^ X ' Z , M = (Z'RZ)"1, and

Then:

(i) 7G = (Z'RZ)-1Z'RY.

Page 21: Linear Regression: Estimation and Distribution Theory

INTRODUCING FURTHER EXPLANATORY VARIABLES 55

(ii) ßG = (X'X)-1X'(Y - Z7G) = ß - L 7 G .

(in) Y ' R G Y = (Y - Z7o)'R(Y - Z7G) = Y'RY - 7GZ'RY.

(iv)

. , r t i i( (X'X)-X-|-LML' - L M \ , o n B , Var[<$G] = <r2 ( ^ ;_ML, M J . (3.25)

Proo/. (i) We first "orthogonalize" the model. Since C(PZ) C C(X),

Xß + Z-r = Xß + PZ7 + (In - P )Z 7

= X a -I- RZ7

= ( X , R Z ) ( ^ )

= VA,

say, where a = ß + (X'X)_1X'Z7 = ß + L7 is unique. We note that C(X) 1 C(RZ). Also, by A.2.4 and the previous lemma,

rank(RZ) = rank(Z'R'RZ) = rank(Z'RZ) = t,

so that V has full rank p + t. Since XR = 0, the least squares estimate of A is

A = (V'V)_1V'Y - 1

/ x'x X'RZ y / x' \ ~ V Z'RX Z'R'RZ ) \ Z'R j

_ (x'x 0 y 1 / x' \ ~ \ 0 Z'RZ J \ Z'R J _ / ( X ' X ^ X ' Y \ _ ( & \ ~ \ (Z'RZ)-XZ'RY ) ~ \ 7 ) '

Now the relationship between (ß, 7) and (a, 7) is one-to-one, so that the same relationships exist between their least square estimates. Hence

7G = 7 = (Z'RZ)-1Z'RY. (3.26)

(ii) We also have

ßG = ä-Lj = ß-^lG = (X'X)-XX'(Y - Z70). (3.27)

Page 22: Linear Regression: Estimation and Distribution Theory

56 LINEAR REGRESSION: ESTIMATION AND DISTRIBUTION THEORY

(iii) Using (3.27) gives

RGY = Y-XßG-ZyG = Y - X ( X ' X ) - 1 X ' ( Y - Z 7 G ) - Z 7 G

= ( I n - X ( X ' X ) - 1 X ' ) ( Y - Z 7 c ) = R ( Y - Z 7 G ) (3.

= R Y - R Z ( Z ' R Z ) - 1 Z ' R Y , (3.

so that by (3.28),

Y ' R G Y = (Y - WtfG)'(Y - Vf 6a)

= (Y - XßG - Z 7 G ) ' ( Y - XßG - Z 7 G )

= (Y - Z 7 o ) ' R ' R ( Y - Z 7 G ) = (Y - Z 7 G ) ' R ( Y - Z 7 G ) , (3.

since R is symmetric and idempotent [Theorem 3.1(i)]. Expanding equation (3.30) gives us

Y ' R G Y = Y 'RY - 2 7 G Z ' R Y + 7 G Z ' R Z 7 G

= Y 'RY - 7 G Z ' R Y - 7 G ( Z ' R Y - Z ' R Z 7 G )

= Y 'RY - 7 G Z ' R Y [by (3.26)].

(iv)

Var[7G] = (Z 'RZ) - 1 Z 'RVar [Y]RZ(Z 'RZ) - 1

= <T2(Z'RZ)-1(Z'RZ)(Z'RZ)-1

= ^ ( Z ' R Z ) - 1 = a 2 M.

Now, by Theorem 1.3,

Cov[/3,7G] = Cov[(X'X)- 1 X'Y,(Z 'RZ)- 1 Z'RY] = (T 2 (X'X)- 1X'RZ(Z'RZ)- 1

= 0, (3

since X ' R = 0. Hence using (i) above, we have, from Theorem 1.3,

Cov[/3G, 7 G ] = Cov[/3 - L7 G , 7 G ]

= C o v [ ^ , 7 G ] - L V a r [ 7 G ]

= - <r2LM [by (3.31)]

and

V a r [ ^ G ] = V a r [ ^ - L 7 G ]

= Vax[ß] - Cov[/3,L7G] - C O V [ L 7 G , / 3 ] + Var[L7G] = Var[/3] + L Var[7G]L' [by (3.31)] =a2 [ (X'X)- 1 + LML'] .

Page 23: Linear Regression: Estimation and Distribution Theory

INTRODUCING FURTHER EXPLANATORY VARIABLES 57

a Prom Theorem 3.6 we see that once X'X has been inverted, we can find

SG and its variance-covariance matrix simply by inverting the t x t matrix Z'RZ; we need not invert the (p +1) x (p + t) matrix W'W. The case t = 1 is considered below.

3.7.2 One Extra Variable

Let the columns of X be denoted by x^ (j = 0,1,2,... ,p - 1), so that

E[Y] = (x<°\x<1\...,x<',-1>)0

Suppose now that we wish to introduce a further explanatory variable, xp, say, into the model so that in terms of the notation above we have Z7 = x^ßp. Then by Theorem 3.6, the least squares estimates for the enlarged model are readily calculated, since Z'RZ (= x'^'Rx''1 ') is only a 1 x 1 matrix, that is, a scalar. Hence

x<p)'RY ßP,a = 7o = (Z>RZ)- 1Z'RY= x ( p ) > R x ( | > ) , (3.32)

ßa = (ß0,G,...,ßP-i,G)' = ß-(X'XrlX'x{p)ßP,G,

Y ' R G Y = Y ' R Y - ^ O X W ' R Y , (3.33)

and the matrix Var[îo] is readily calculated from (X'X) - 1 . The ease with which "corrections" can be made to allow for a single additional x variable suggests that if more than one variable is to be added into the regression model, then the variables should be brought in one at a time. We return to this stepwise procedure in Chapter 11.

The technique above for introducing one extra variable was first discussed in detail by Cochran [1938] and generalized to the case of several variables by Quenouille [1950].

EXAMPLE 3.4 A recursive algorithm was given by given by Wilkinson [1970] (see also James and Wilkinson [1971], Rogers and Wilkinson [1974], and Pearce et al. [1974]) for fitting analysis-of-variance models by regression methods. This algorithm amounts to proving that the residuals for the aug-mented model are given by RSRY, where S = I„ - Z(Z'RZ)~1Z'. We now prove this result. By (3.28) the residuals required are

RGY =RY - RZ 7 G

=R(RY - Z7G) =R[I„-Z(Z'RZ)-1Z']RY (3.34) =RSRY. D

Page 24: Linear Regression: Estimation and Distribution Theory

58 LINEAR REGRESSION: ESTIMATION AND DISTRIBUTION THEORY

The basic steps of the Wilkinson algorithm are as follows: Algorithm 3.1

Step 1: Compute the residuals RY.

Step 2: Use the operator S, which Wilkinson calls a sweep (not to be con-fused with the sweep method of Section 11.2.2), to produce a vector of apparent residuals RY - Z-yo (= SRY).

Step 3: Applying the operator R once again, reanalyze the apparent residu-als to produce the correct residuals RSRY.

If the columns of Z are perpendicular to the columns of X, then RZ = Z and, by (3.34),

RSR = R ( I n - Z ( Z ' R Z ) 1 Z ' ) R = R-Z(Z 'Z ) _ 1 Z 'R = SR,

so that step 3 is unnecessary. We see later (Section 3.9.3) that the procedure above can still be used when the design matrix X does not have full rank.

By setting X equal to the first k columns of X, and Z equal to the (k + l)th column (k = 1,2,... ,p - 1), this algorithm can be used to fit the regression one column of X at a time. Such a stepwise procedure is appropriate in ex-perimental design situations because the columns of X then correspond to different components of the model, such as the grand mean, main effects, block effects, and interactions, and some of the columns are usually orthog-onal. Also, the elements of the design matrix X are 0 or 1, so that in many standard designs the sweep operator S amounts to a simple operation such as subtracting means, or a multiple of the means, from the residuals.

EXERCISES 3f

1. Prove that

Y'RY - Y'RGY = a2% ( Varfrc])-1 %.

2. Prove that ■ya c a n De obtained by replacing Y by Y — Z7 in Y'RY and minimizing with respect to 7. Show further that the minimum value thus obtained is Y ' R G Y .

3. If J3G = (PGJ) and ß = (ßj), use Theorem 3.6(iv) to prove that

vtutfaj] > var[/y.

Page 25: Linear Regression: Estimation and Distribution Theory

ESTIMATION WITH LINEAR RESTRICTIONS 59

4. Given that Y\, Y2,. ■., Yn are independently distributed as N(9, a2), find the least squares estimate of 6.

(a) Use Theorem 3.6 to find the least squares estimates and the residual sum of squares for the augmented model

Yi^e + yxi+ei (» = 1,2, . . . ,») ,

where the e» are independently distributed as N(0,a2).

(b) Verify the formulae for the least square estimates of 6 and 7 by differentiating the usual sum of squares.

3.8 ESTIMATION WITH LINEAR RESTRICTIONS

As a prelude to hypothesis testing in Chapter 4, we now examine what hap-pens to least squares estimation when there are some hypothesized constraints on the model. We lead into this by way of an example.

E X A M P L E 3.5 A surveyor measures each of the angles a, ß, and 7 and obtains unbiased measurements Y\, Y2, and Y3 in radians, respectively. If the angles form a triangle, then a + ß + 7 = 7r. We can now find the least squares estimates of the unknown angles in two ways. The first method uses the constraint to write 7 = 7r - a — ß and reduces the number of unknown parameters from three to two, giving the model

We then minimize (Yi - a)2 + (F2 - ß)2 + (Y3 - n + a + ß)2 with respect to a and ß, respectively. Unfortunately, this method is somewhat ad hoc and not easy to use with more complicated models.

An alternative and more general approach is to use the model

1 0 0

0 0 \ 1 0 0 1

a ß

W and minimize (yx - a)2 + (Y2 - ß)2 + (Y3 - 7)2 subject to the constraint a + ß + 7 = ■K using Lagrange multipliers. We consider this approach for a general model below. D

Page 26: Linear Regression: Estimation and Distribution Theory

60 LINEAR REGRESSION: ESTIMATION AND DISTRIBUTION THEORY

3.8.1 Method of Lagrange Multipliers

Let Y = Xß 4- e, where X is n x p of full rank p. Suppose that we wish to find the minimum of e'e subject to the linear restrictions Aß = c, where A is a known q x p matrix of rank q and c is a known q x 1 vector. One method of solving this problem is to use Lagrange multipliers, one for each linear constraint a.\ß = Cj (i = 1,2,...,q), where aj is the ith row of A. As a first step we note that

£ Mai/*-Ci) = A'(A/3-c) i= l

= (ß'A'-c')X

(since the transpose of a 1 x 1 matrix is itself). To apply the method of Lagrange multipliers, we consider the expression r = e'e 4- {ß'A' - c')A and solve the equations

Aß = c (3.35) and dr/dß = 0; that is (from A.8),

-2X'Y + 2X'X,9 + A'A = 0. (3.36)

For future reference we denote the solutions of these two equations by ßn and \H. Then, from (3.36),

ßa = ( X ' X ^ X ' Y - I fX'X^A'Âi* = ß-^X'Xy'A'Xn, (3.37)

and from (3.35),

c = AßH

= Aß-^AiX'X^A'Xjf.

Since (X'X) - 1 is positive-definite, being the inverse of a positive-definite ma-trix, A(X'X)_1A' is also positive-definite (A.4.5) and therefore nonsingular. Hence

-\\H = [A(X'X)- lA']_ 1 (c - Aß)

and substituting in (3.37), we have

ßH=ß + (X'X)"1 A' [ A t X ' X r ' A ' ] - 1 (c - A/3). (3.38)

To prove that ßn actually minimizes e'e subject to Aß = c, we note that \\X{ß-ß)f = (ß - ß)'X'X(ß - ß)

= (ß-ßH + ßH- ß)'X'X(ß -ßH + ßH-ß) = (0 - ßH)'X'X(ß - ßH) + (ßH - ß)'X'X(ßH - ß) (3.39) = 11X09-&,)||2 + ||XG9„-/9)||a (3.40)

Page 27: Linear Regression: Estimation and Distribution Theory

ESTIMATION WITH LINEAR RESTRICTIONS 61

since from (3.37),

20 - ßHyX'X(ßH -ß) = X'HA(ßH -ß) = \'H(c - c) = 0. (3.41) Hence from (3.16) in Section 3.4 and (3.40),

e'e = \\Y-Xß\\2 + \\X(ß-ß)f

= \\Y-Xß\\i + \\(X0-ßIi)\\'i + \\X0H-ß)\\i (3.42)

is a minimum when ||X(j9/f - ß)\\2 = 0, that is, when X0H — ß) = 0, or ß = ßH (since the columns of X are linearly independent).

Setting ß = ßn, we obtain the useful identity

||Y - Xß„\\2 = ||Y - Xßtf + \\X(ß - ßH)f (3.43)

or, writing Y = Xß and Y H = Xßa,

||Y - Y„| |2 - ||Y - Y||2 = ||Y - Y H | | 2 . (3.44)

This identity can also be derived directly (see Exercises 3g, No. 2, at the end of Section 3.8.2).

3.8.2 Method of Orthogonal Projections

It is instructive to derive (3.38) using the theory of B.3. In order to do this, we first "shift" c, in much the same way that we shifted IT across into the left-hand side of Example 3.5.

Suppose that ßo is any solution of Aß = c. Then

Y - Xß0 = X(ß -ß0)+e (3.45) or Y = X7+e, and A7 = A/3-A/3o = 0- Thus we have the model Y = 9+e, where 9 e Q = C(X), and since X has full rank, A(X'X)~1X'Ö = A7 = 0. Setting Ai = A(X'X) - 1X' and w = A/"(Ai) D Q, it follows from B.3.3 that wx n Q, = C(PnAi), where

PnAi = X(X'X)"1X'X(X'X)-1A' = X(X ,X)~1A'

is n x q of rank q (by Exercises 3g, No. 5, below). Therefore, by B.3.2,

Pn - Pw = Pwxnn = (PnADlAxPäAil^fPnAi)'

= X(X'X)-1A' [ACX'XJ-'AT1 A(X'X)-1X'.

Hence

XßH-Xß0 = X7 W = PWY = P n Y - P w x n n Y = P n Y - Xß0 - X(X'X)-1 A' [AfX 'X) - 1 ^ ] _ 1 (Aß - c),

(3.46)

Page 28: Linear Regression: Estimation and Distribution Theory

62 LINEAR REGRESSION: ESTIMATION AND DISTRIBUTION THEORY

since PnX/30 = X/?o and Aßo = c. Therefore, canceling Xßo and multiplying both sides by (X 'X) _ 1 X' leads to ßn of (3.38). Clearly, this gives a minimum a s | | Y - X ^ | | 2 = | | Y - X 7 / / | | 2 .

EXERCISES 3g

1. (a) Find the least squares estimates of a and ß in Example 3.5 using the two approaches described there. What is the least squares estimate of 7?

(b) Suppose that a further constraint is introduced: namely, at = ß. Find the least squares estimates for this new situation using both methods.

2. By considering the identity Y - Y// = Y - Y + Y — Y H, prove that

||Y - Y „ | | 2 = ||Y - Yj |2 + ||Y - Y „ | | 2 .

3. Prove that

Vax[ßH] = a2 { (X 'X) - 1 - ( X ' X ^ A ' [ A C X ' X ^ A ' ] - 1 A ( X ' X ) " 1 } .

Hence deduce that

var[/3Hj] < var[/?j],

where ßm and ßj are the j'th elements of ßu and ß, respectively.

4. Show that

||Y - Y „ | | 2 - ||Y - Y||2 = a2X'H (Vat[\H]) " ' „.

5. If X is n x p of rank p and B is p x q of rank q, show that rank(XB) = q.

3.9 DESIGN MATRIX OF LESS THAN FULL RANK

3.9.1 Least Squares Estimation

When the techniques of regression analysis are used for analyzing data from experimental designs, we find that the elements of X are 0 or 1 (Chapter 8), and the columns of X are usually linearly dependent. We now give such an example.

EXAMPLE 3.6 Consider the randomized block design with two treatments and two blocks: namely,

Yij - n + ai+Tj+ £ij (i = 1,2; j = 1,2),

Page 29: Linear Regression: Estimation and Distribution Theory

DESIGN MATRIX OF LESS THAN FULL RANK 63

where V^ is the response from the ith treatment in the jth block. Then

/ F i l \

\Y22 )

( 1 1

1 0 1 0

1 0\ 0 1

1 V 1

0 1 0 1

1 0 0 \ )

( » \ «1

a2

\r2 J

+

(en \ £l2

£21

V £22 J

(3.47)

or Y = Xß+e, where, for example, the first column of X is linearly dependent on the other columns. D

In Section 3.1 we developed a least squares theory which applies whether or not X has full rank. If X is n x p of rank r, where r < p, we saw in Section 3.1 that ß is no longer unique. In fact, ß should be regarded as simply o solution of the normal equations [e.g., (X'X)~X'Y] which then enables us to find Y = Xß, ê = Y - Xß and RSS = e'e, all of which are unique. We note that the normal equations X'Xß — X'Y always have a solution for ß as C(X') = C(X'X) (by A.2.5). Our focus now is to consider methods for finding ß-

So far in this chapter our approach has been to replace X by an n x r matrix Xi which has the same column space as X. Very often the simplest way of doing this is to select r appropriate columns of X, which amounts to setting some of the ßi in Xß equal to zero. Algorithms for carrying this out are described in Section 11.9.

In the past, two other methods have been used. The first consists of impos-ing identifiability constraints, H/3 = 0 say, which take up the "slack" in ß so that there is now a unique ß satisfying 0 = Xß and H/3 = 0. This approach is described by Scheffé [1959: p. 17]. The second method involves computing a generalized inverse. In Section 3.1 we saw that a ß is given by (X'X)~X'Y, where (X'X) - is a suitable generalized inverse of X'X. One commonly used such inverse of a matrix A is the Moore-Penrose inverse A + , which is unique (see A. 10).

EXAMPLE 3.7 In Example 3.6 we see that the first column of X in (3.47) is the sum of columns 2 and 3, and the sum of columns 4 and 5. Although X is 4 x 5, it has only three linearly independent columns, so it is of rank 3. To reduce the model to one of full rank, we can set a2 = 0 and T2 = 0, thus effectively removing the third and fifth columns. Our model is now

f en \ £12

£21

\ £22 J

Alternatively, we can use two identifiability constraints, the most common being £V «j = 0 and ]£ r, = 0. If we add these two constraints below X, we

/ Yn \ Yl2

Y21

V 22 J

/ 1 1 1 \ 1 1 0 1 0 1

{1 0 0)

Page 30: Linear Regression: Estimation and Distribution Theory

64 LINEAR REGRESSION: ESTIMATION AND DISTRIBUTION THEORY

get

/ 1 1 1 1

1 0 1 0 0 1 0 1

1 0 \ 0 1 1 0 0 1

0 1 1 0 0

0 0

1 1 /

/ M \

«2

n \T2 /

where the augmented matrix now has five linearly independent columns. Thus given 0, ß is now unique. D

EXERCISES 3h

1. Suppose that X does not have full rank, and let ßi (i = 1,2) be any two solutions of the normal equations. Show directly that

| | Y - X ^ | | 2 = | | Y - X ^ 2 | | 2

2. If the columns of X are linearly dependent, prove that there is no matrix C such that CY is an unbiased estimate of ß.

3.9.2 Estimable Functions

Since ß is not unique, ß is not estimable. The question then arises: What can we estimate? Since each element 9, of 0 (= Xß) is estimated by the tth element of 0 = PY, then every linear combination of the 0<, say b'0, is also estimable. This means that the 0j form a linear subspace of estimable functions, where 6i = x'ß, xj being the ith row of X. Usually, we define estimable functions formally as follows.

Definition 3.1 The parametric function a'ß is said to be estimable if it has a linear unbiased estimate, b'Y, say.

We note that if a'ß is estimable, then a'ß = E[b'Y] = b'0 = b'Xß identically in ß, so that a' = b'X or a = X'b (A.11.1). Hence a'ß is estimable if and only if aeC(X') .

EXAMPLE 3.8 If a'ß is estimable, and ß is any solution of the normal equations, then a'ß is unique. To show this we first note that a = X'b for some b, so that a'ß = b'Xß = b'0. Similarly, a'ß = b'Xß = b'0, which is unique. Furthermore, by Theorem 3.2, b'0 is the BLUE of b'0, so that a'ß is the BLUE of a'ß. D

Page 31: Linear Regression: Estimation and Distribution Theory

DESIGN MATRIX OF LESS THAN FULL RANK 65

In conclusion, the simplest approach to estimable functions is to avoid them altogether by transforming the model into a full-rank model! EXERCISES 3i

1. Prove that a'E[ß] is an estimable function of ß.

2. If a\ß,a'2ß,...,a'kß are estimable, prove that any linear combination of these is also estimable.

3. If a'ß is invariant with respect to ß, prove that a'ß is estimable.

4. Prove that a'ß is estimable if and only if

a'(X'X)-X'X = a'.

(Note that AA~A = A.)

5. If a'ß is an estimable function, prove that

Var[a'£] = <72a'(X'X)-a.

6. Prove that all linear functions a'ß are estimable if and only if the columns of X are linearly independent.

3.9.3 Introducing Further Explanatory Variables

If we wish to introduce further explanatory variables into a less-than-fuU-rank model, we can, once again, reduce the model to one of full rank. As in Section 3.7, we see what happens when we add Z7 to our model X/3. It makes sense to assume that Z has full column rank and that the columns of Z are linearly independent of the columns of X. Using the full-rank model

Y = Xia + Z7 + e,

where Xi is n x r of rank r, we find that Theorem 3.6(h), (iii), and (iv) of Section 3.7.1 still hold. To see this, one simply works through the same steps of the theorem, but replacing X by Xi, ß by a, and R by In — P, where P = Xi(X'1Xi) -1Xi is the unique projection matrix projecting onto C(X).

3.9.4 Introducing Linear Restrictions

Referring to Section 3.8, suppose that we have a set of linear restrictions a'ß = 0 (i = 1,2..., q), or in matrix form, Aß = 0. Then a realistic assumption is that these constraints are all estimable. This implies that a[ = mJX for some rrij, or A = MX, where M is q x n of rank q [as q = rank(A) < rank(M)

Page 32: Linear Regression: Estimation and Distribution Theory

66 LINEAR REGRESSION: ESTIMATION AND DISTRIBUTION THEORY

by A.2.1]. Since Aß = MX/3 = M0, we therefore find the restricted least squares estimate of 0 by minimizing ||Y - 0||2 subject to 6 G C(X) = ft and M0 = 0, that is, subject to

0 € A/"(M)nfi (=w, say).

If P R and Pu are the projection matrices projecting onto ft and w, respec-tively, then we want to find 0U = PWY. Now, from B.3.2 and B.3.3,

Pn - Pw = Puj-nn)

where wx n ft = C(B) and B = P n M' . Thus

Bu = POJY

= PfiY - P w i n n Y = 0 n - B ( B ' B ) - B ' Y .

EXERCISES 3j

1. If P projects onto C(X), show that Z'(In - P)Z is nonsingular.

2. Prove that if Xi i s n x r of rank r and consists of a set of r linearly independent columns of X, then X = XiL, where L is r x p of rank r.

3. Prove that B has full column rank [ i.e., (B'B)~ = (B'B) -1].

4. If X has full rank and 6U = Xy3#, show that

ßH = ß~ (X'X)-1A'(A(X'X)-1A')-1Ay9.

[This is a special case of (3.38).]

5. Show how to modify the theory above to take care of the case when the restrictions are A/3 = c (c ^ 0).

3.10 GENERALIZED LEAST SQUARES

Having developed a least squares theory for the full-rank model Y = X/3 + e, where E[e] = 0 and Var[e] = cr2In, we now consider what modifications are necessary if we allow the €i to be correlated. In particular, we assume that Var[e] = o-2V, where V is a known n x n positive-definite matrix.

Since V is positive-definite, there exists an n x n nonsingular matrix K such that V = KK' (A.4.2). Therefore, setting Z = K ^ Y , B = K_ 1X, and

Page 33: Linear Regression: Estimation and Distribution Theory

GENERALIZED LEAST SQUARES 67

rj = K 1e, we have the model Z = Tiß + r), where B is n xp of rankp (A.2.2). Also, E[r}] = 0 and

Varfa] = Var[K_1e] = K"1 Var[e]K-1' = a ' K ^ K K ' K ' - 1 = <r2In.

Minimizing rj'r) with respect to ß, and using the theory of Section 3.1, the least squares estimate of ß for this transformed model is

/T = (B'B)- lB'Z = (X ' lKK ' j - 'XJ^X ' iKKO^Y = ( X ' V ^ X r ' X ' V ^ Y ,

with expected value

JE?|j9*] = ( X ' V ^ X ^ X ' V ^ X j S = ß,

dispersion matrix

Var[/T] = CT^B'B)-1

= CT^X'V^X)-1, (3.48)

and residual sum of squares

f'f = (Z - Bj9')'(Z - Bß*) = (Y-X/3*) ' (KK ' ) _ 1 (Y-X/ r ) = (Y-Xß*)'V-l(Y-Xß*).

Alternatively, we can obtain ß* simply by differentiating

T)'r) = e'y~le = (Y-Xß)'(Y-Xß) = Y'V _ 1Y - 2/9'X'V-1 Y + # 'X 'V - 1 XY

with respect to ß. Thus, by A.8,

^ = -2X'V-XY + 2X'V"1X0, (3.49) dß

and setting this equal to zero leads once again to ß*. Using this approach instead of the general theory above, we see that X'V_ 1X has an inverse, as it is positive-definite (by A.4.5). We note that the coefficient of 2/3 in (3.49) gives us the inverse of Var[/3*]/cr2.

There is some variation in terminology among books dealing with the model above: Some texts call ß* the weighted least squares estimate. However, we call ß" the generalized least squares estimate and reserve the expression weighted least squares for the case when V is a diagonal matrix: The diagonal

Page 34: Linear Regression: Estimation and Distribution Theory

68 LINEAR REGRESSION: ESTIMATION AND DISTRIBUTION THEORY

case is discussed in various places throughout this book (see, e.g., Section 10.4). EXAMPLE 3.9 Let Y = xß + e, where Y = (F*) and x = (ZJ) are n x 1 vectors, E[e] = 0 and Var[e] = a2V. If V = diag(iuf 1,ui^"1,... ,w~l) (ti>i > 0), we now find the weighted least squares estimate of ß and its variance. Here it is simpler to differentiate Tf'r] directly rather than use the general matrix theory. Thus, since V - 1 = diag(iui,tu2,..., wn),

i

and ^ = -2y£xi(Yi-xiß)wi. (3.50)

i

Setting the right side of (3.50) equal to zero leads to

<-%Sr and from the coefficient of 2/3,

v a r r ] = a 2 f c W i i A - l

- l D

We can also find the variance directly from

(X 'V^X) - 1 = (x 'V^x) - 1 = ( £ > * ? )

Since the generalized least squares estimate is simply the ordinary least squares estimate (OLSE) for a transformed model, we would expect ß* to have the same optimal properties, namely, that a'/3* is the best linear unbiased estimate (BLUE) of a!ß. To see this, we note that

a'β* = a'(X'V⁻¹X)⁻¹X'V⁻¹Y = b'Y,

say, is linear and unbiased. Let b₁'Y be any other linear unbiased estimate of a'β. Then, using the transformed model, a'β* = a'(B'B)⁻¹B'Z and b₁'Y = b₁'KK⁻¹Y = (K'b₁)'Z. By Theorem 3.2 (Section 3.2) and the ensuing argument,

var[a'β*] ≤ var[(K'b₁)'Z] = var[b₁'Y].

Equality occurs if and only if (K'b₁)' = a'(B'B)⁻¹B', or

b₁' = a'(B'B)⁻¹B'K⁻¹ = a'(X'V⁻¹X)⁻¹X'V⁻¹ = b'.

Thus a'β* is the unique BLUE of a'β. Note that the ordinary least squares estimate a'β̂ will still be an unbiased estimate of a'β, but var[a'β̂] ≥ var[a'β*].


EXERCISES 3k

1. Let Yᵢ = βxᵢ + εᵢ (i = 1, 2), where ε₁ ~ N(0, σ²), ε₂ ~ N(0, 2σ²), and ε₁ and ε₂ are statistically independent. If x₁ = +1 and x₂ = -1, obtain the weighted least squares estimate of β and find the variance of your estimate.

2. Let Yᵢ (i = 1, 2, ..., n) be independent random variables with a common mean θ and variances σ²/wᵢ (i = 1, 2, ..., n). Find the linear unbiased estimate of θ with minimum variance, and find this minimum variance.

3. Let Y₁, Y₂, ..., Yₙ be independent random variables, and let Yᵢ have a N(iθ, i²σ²) distribution for i = 1, 2, ..., n. Find the weighted least squares estimate of θ and prove that its variance is σ²/n.

4. Let Y₁, Y₂, ..., Yₙ be random variables with common mean θ and with dispersion matrix σ²V, where vᵢᵢ = 1 (i = 1, 2, ..., n) and vᵢⱼ = ρ (0 < ρ < 1; i, j = 1, 2, ..., n; i ≠ j). Find the generalized least squares estimate of θ and show that it is the same as the ordinary least squares estimate. Hint: V⁻¹ takes the same form as V.

(McElroy [1967])

5. Let Y ~ Nₙ(Xβ, σ²V), where X is n × p of rank p and V is a known positive-definite n × n matrix. If β* is the generalized least squares estimate of β, prove that

(a) Q = (Y - Xβ*)'V⁻¹(Y - Xβ*)/σ² ~ χ²_{n-p}.

(b) σ²Q is the nonnegative quadratic unbiased estimate of (n - p)σ² with minimum variance.

(c) If Ŷ* = Xβ* = P*Y, then P* is idempotent but not, in general, symmetric.

6. Suppose that E[Y] = θ, Aθ = 0, and Var[Y] = σ²V, where A is a q × n matrix of rank q and V is a known n × n positive-definite matrix. Let θ* be the generalized least squares estimate of θ; that is, θ* minimizes (Y - θ)'V⁻¹(Y - θ) subject to Aθ = 0. Show that

Y - θ* = VA'γ*,

where γ* is the generalized least squares estimate of γ for the model E[Y] = VA'γ, Var[Y] = σ²V.

(Wedderburn [1974])

3.11 CENTERING AND SCALING THE EXPLANATORY VARIABLES

It is instructive to consider the effect of centering and scaling the x-variables on the regression model. We shall use this theory later in the book.


3.11.1 Centering

Up until now, we have used the model Y = Xβ + ε. Suppose, however, that we center the x-data and use the reparameterized model

Yᵢ = α₀ + β₁(xᵢ₁ - x̄₁) + ··· + β_{p-1}(x_{i,p-1} - x̄_{p-1}) + εᵢ,

where

α₀ = β₀ + β₁x̄₁ + ··· + β_{p-1}x̄_{p-1}

and x̄ⱼ = Σᵢ xᵢⱼ/n. Our model is now Y = X_c α + ε, where

α' = (α₀, β₁, ..., β_{p-1}) = (α₀, β_c'),

X_c = (1ₙ, X̃), and X̃ has typical element x̃ᵢⱼ = xᵢⱼ - x̄ⱼ. Because the transformation between α and β is one-to-one, the least squares estimate of β_c remains the same. Then, since X̃'1ₙ = 0,

α̂ = (X_c'X_c)⁻¹X_c'Y = ( n   0' ; 0   X̃'X̃ )⁻¹( 1ₙ'Y ; X̃'Y )
  = ( n⁻¹   0' ; 0   (X̃'X̃)⁻¹ )( 1ₙ'Y ; X̃'Y )
  = ( Ȳ ; (X̃'X̃)⁻¹X̃'Y ),    (3.51)

so that α̂₀ = Ȳ and β̂_c = (X̃'X̃)⁻¹X̃'Y. Now C(X_c) = C(X), which can be proved by subtracting x̄ⱼ times column (1) from column (j + 1) of X for each j = 1, ..., p - 1. Hence X_c and X have the same projection matrix, so that

P = X_c(X_c'X_c)⁻¹X_c'
  = (1ₙ, X̃)( n⁻¹   0' ; 0   (X̃'X̃)⁻¹ )(1ₙ, X̃)'
  = n⁻¹1ₙ1ₙ' + X̃(X̃'X̃)⁻¹X̃'.    (3.52)

Let xᵢ now represent the ith row of X, but reduced in the sense that the initial unit element (corresponding to α₀) is omitted. Picking out the ith diagonal element of (3.52), we get

pᵢᵢ = n⁻¹ + (n - 1)⁻¹(xᵢ - x̄)'S_xx⁻¹(xᵢ - x̄) = n⁻¹ + (n - 1)⁻¹MDᵢ,    (3.53)

where x̄ = n⁻¹Σᵢ₌₁ⁿ xᵢ, S_xx is the sample covariance matrix X̃'X̃/(n - 1), and MDᵢ is the Mahalanobis distance between the ith reduced row of X and the


average reduced row (cf. Seber [1984: p. 10]). Thus pᵢᵢ is a measure of how far away xᵢ is from the center of the x-data.

We note that the centering and subsequent reparameterization of the model do not affect the fitted values Ŷ, so that the residuals for both the centered and uncentered models are the same. Hence, from (3.52), the residual sum of squares for both models is given by

RSS = Y'(Iₙ - P)Y
    = Y'[Iₙ - n⁻¹1ₙ1ₙ' - X̃(X̃'X̃)⁻¹X̃']Y
    = Y'(Iₙ - n⁻¹1ₙ1ₙ')Y - Y'P̃Y
    = Σᵢ(Yᵢ - Ȳ)² - Y'P̃Y,    (3.54)

where P̃ = X̃(X̃'X̃)⁻¹X̃'. We will use this result later.
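The following sketch (made-up data, not from the book) confirms numerically that centering leaves the slope estimates unchanged and that the decomposition (3.54) reproduces the residual sum of squares:

    import numpy as np

    rng = np.random.default_rng(2)
    n, p = 30, 4
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
    Y = X @ np.array([2.0, 1.0, -1.0, 0.5]) + rng.normal(size=n)

    beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]       # uncentered least squares fit

    Xtil = X[:, 1:] - X[:, 1:].mean(axis=0)               # centered x-data (X-tilde)
    beta_c = np.linalg.lstsq(Xtil, Y, rcond=None)[0]      # slope estimates from the centered model
    print(np.allclose(beta_c, beta_hat[1:]))              # True: slopes unchanged, alpha0-hat = Ybar

    # RSS identity (3.54): RSS = sum_i (Y_i - Ybar)^2 - Y' Ptilde Y
    P_til = Xtil @ np.linalg.solve(Xtil.T @ Xtil, Xtil.T)
    rss_direct = np.sum((Y - X @ beta_hat) ** 2)
    rss_354 = np.sum((Y - Y.mean()) ** 2) - Y @ P_til @ Y
    print(np.isclose(rss_direct, rss_354))                # True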

3.11.2 Scaling

Suppose that we now also scale the columns of X̃ so that they have unit length. Let sⱼ² = Σᵢ x̃ᵢⱼ² and consider the new variables x*ᵢⱼ = x̃ᵢⱼ/sⱼ. Then our model becomes

Yᵢ = α₀ + γ₁x*ᵢ₁ + ··· + γ_{p-1}x*_{i,p-1} + εᵢ,

where γⱼ = βⱼsⱼ. Because the transformation is still one-to-one, γ̂ⱼ = β̂ⱼsⱼ and α̂₀ = Ȳ. If X* = (x*ᵢⱼ) and γ = (γ₁, ..., γ_{p-1})', then replacing X̃ by X* in (3.51) gives us

γ̂ = (X*'X*)⁻¹X*'Y = R_xx⁻¹X*'Y,    (3.55)

where R_xx is now the (symmetric) correlation matrix

R_xx = ( 1   r₁₂   ⋯   r_{1,p-1} ; r₂₁   1   ⋯   r_{2,p-1} ; ⋯ ; r_{p-1,1}   r_{p-1,2}   ⋯   1 )

and

r_jk = Σᵢ x̃ᵢⱼx̃ᵢₖ/(sⱼsₖ)

is the (sample) correlation coefficient of the jth and kth explanatory variables. If we introduce the notation X* = (x*⁽¹⁾, ..., x*⁽ᵖ⁻¹⁾) for the columns of X*, we see that r_jk = x*⁽ʲ⁾'x*⁽ᵏ⁾.
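A brief sketch (hypothetical data) of the scaled form (3.55): the slopes are recovered from the correlation matrix R_xx and the scale factors sⱼ, and agree with the ordinary least squares fit.

    import numpy as np

    rng = np.random.default_rng(3)
    n, p = 40, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
    Y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(size=n)

    Xtil = X[:, 1:] - X[:, 1:].mean(axis=0)        # centered columns
    s = np.sqrt((Xtil ** 2).sum(axis=0))           # s_j^2 = sum_i xtilde_{ij}^2
    Xstar = Xtil / s                               # unit-length columns
    Rxx = Xstar.T @ Xstar                          # correlation matrix of the x's

    gamma_hat = np.linalg.solve(Rxx, Xstar.T @ Y)  # equation (3.55)
    beta_slopes = gamma_hat / s                    # beta_j-hat = gamma_j-hat / s_j

    # agrees with the slopes of the ordinary least squares fit
    print(np.allclose(beta_slopes, np.linalg.lstsq(X, Y, rcond=None)[0][1:]))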


EXAMPLE 3.10 For later reference we consider the special case of p = 3. Then

R_xx = ( 1   r ; r   1 ),   where r = x*⁽¹⁾'x*⁽²⁾.

Also, from (3.55), we have

( γ̂₁ ; γ̂₂ ) = ( 1   r ; r   1 )⁻¹( x*⁽¹⁾'Y ; x*⁽²⁾'Y )
            = (1 - r²)⁻¹( 1   -r ; -r   1 )( x*⁽¹⁾'Y ; x*⁽²⁾'Y ),

so that

γ̂₁ = (1 - r²)⁻¹(x*⁽¹⁾'Y - r x*⁽²⁾'Y)   and   β̂₁ = γ̂₁/s₁.    (3.56)

By interchanging the superscripts (1) and (2), we get

γ̂₂ = (1 - r²)⁻¹(x*⁽²⁾'Y - r x*⁽¹⁾'Y)   and   β̂₂ = γ̂₂/s₂.

Since

X̃ = (s₁x*⁽¹⁾, s₂x*⁽²⁾) = X*S_d,

say, it follows that

P = n⁻¹1ₙ1ₙ' + X*S_d(S_d X*'X* S_d)⁻¹S_d X*'
  = n⁻¹1ₙ1ₙ' + X*(X*'X*)⁻¹X*'    (3.57)
  = n⁻¹1ₙ1ₙ' + (1 - r²)⁻¹(x*⁽¹⁾, x*⁽²⁾)( 1   -r ; -r   1 )(x*⁽¹⁾, x*⁽²⁾)'
  = n⁻¹1ₙ1ₙ' + x*⁽¹⁾x*⁽¹⁾' + (1 - r²)⁻¹(x*⁽²⁾ - r x*⁽¹⁾)(x*⁽²⁾ - r x*⁽¹⁾)'.    (3.58)
□

EXERCISES 3l

1. If Ỹᵢ = Yᵢ - Ȳ and Ỹ = (Ỹ₁, ..., Ỹₙ)', prove from (3.54) that RSS = Ỹ'(Iₙ - P̃)Ỹ.

2. Suppose that we consider fitting a model in which the Y-data are centered and scaled as well as the x-data. This means that we use Y*ᵢ = (Yᵢ - Ȳ)/s_Y instead of Yᵢ, where s_Y² = Σᵢ(Yᵢ - Ȳ)². Using (3.54), obtain an expression for RSS from this model.


3.12 BAYESIAN ESTIMATION

This method of estimation utilizes any prior information that we have about the parameter vector θ = (β', σ)'. We begin with the probability density function of Y, f(y; θ), say, which we have assumed to be multivariate normal in this chapter, and we now wish to incorporate prior knowledge about θ, which is expressed in terms of some density function f(θ). Our aim is to make inferences on the basis of the density function of θ given Y = y, the posterior density function of θ. To do this, we use Bayes' formula,

f(θ|y) = f(θ, y)/f(y)
       = f(y|θ)f(θ)/f(y)
       = c f(y|θ)f(θ),    (3.59)

where c does not involve θ. It is usual to assume that β and σ have independent prior distributions, so that

f(θ) = f₁(β)f₂(σ).

Frequently, one uses the noninformative prior (see Box and Tiao [1973: Section 1.3] for a justification) in which β and log σ are assumed to be locally uniform and σ > 0. This translates into f₁(β) = constant and f₂(σ) ∝ 1/σ. These priors are described as improper, as their integrals are technically infinite (although we can get around this by making the intervals of the uniform distributions sufficiently large). Using these along with the independence assumption, we obtain from (3.59)

f(β, σ|y) = c f(y|θ)σ⁻¹
          = c(2π)^{-n/2}σ^{-(n+1)} exp( -||y - Xβ||²/(2σ²) ).

Using the result

∫₀^∞ x^{-(b+1)} exp(-a/x²) dx = ½a^{-b/2}Γ(b/2)    (3.60)

derived from the gamma distribution, we find that

f(β|y) ∝ ∫₀^∞ f(β, σ|y) dσ
       ∝ ||y - Xβ||^{-n}.    (3.61)

Now, from Exercises 3a, No. 1,

||y - Xβ||² = ||y - Xβ̂||² + ||Xβ̂ - Xβ||²
            = (n - p)s² + ||Xβ̂ - Xβ||²,    (3.62)


so that

f(β|y) ∝ [(n - p)s² + ||Xβ̂ - Xβ||²]^{-n/2}
       = [(n - p)s² + (β - β̂)'X'X(β - β̂)]^{-n/2}.

This is a special case of the p-dimensional multivariate t-distribution with ν = n - p, Σ = s²(X'X)⁻¹, and μ = β̂.

What estimate of β do we use? If we use the mean or the mode of the posterior distribution (which are the same in this case, as the distribution is symmetric), we get β̂, the least squares estimate. For interval inferences, the marginal posterior distribution of βᵣ is a t-distribution given by

(βᵣ - β̂ᵣ)/(s√c^{r+1,r+1}) ~ t_{n-p},

where (c^{ij}) = (X'X)⁻¹.

If some information is available on θ, it is convenient, computationally,

to use a conjugate prior, one that combines with f(y|θ) to give a posterior distribution which has the same form as the prior. For example, suppose that f(β|σ²) is the density function for the N_p(m, σ²V) distribution and that σ² has an inverted gamma distribution with density function

f(σ²) ∝ (σ²)^{-(d+2)/2} exp( -a/(2σ²) ).    (3.63)

Then

f(β, σ²) = f(β|σ²)f(σ²)
         ∝ (σ²)^{-(d+p+2)/2} exp{ -(1/(2σ²))[(β - m)'V⁻¹(β - m) + a] }.

Combining this prior with the normal likelihood function, we obtain

f(β, σ²|y) ∝ f(y|β, σ²)f(β, σ²)
           ∝ (σ²)^{-(d+p+n+2)/2} exp[ -(Q + a)/(2σ²) ],

where

Q = (y - Xβ)'(y - Xβ) + (β - m)'V⁻¹(β - m).

We can now integrate out σ² to get the posterior density of β. Thus

f(β|y) = ∫₀^∞ f(β, σ²|y) dσ²
       ∝ ∫₀^∞ (σ²)^{-(d+n+p+2)/2} exp[ -(Q + a)/(2σ²) ] dσ².


Using the standard integral formula

∫₀^∞ x^{-(ν+1)} exp(-k/x) dx = Γ(ν)k^{-ν},

we see that the posterior density is proportional to

(Q + a)^{-(d+n+p)/2}.    (3.64)

To make further progress, we need the following result.

THEOREM 3.7 Define V* = (X'X + V⁻¹)⁻¹ and let m* be given by m* = V*(X'y + V⁻¹m). Then

Q = (β - m*)'V*⁻¹(β - m*) + (y - Xm)'(Iₙ + XVX')⁻¹(y - Xm).    (3.65)

Proof.

Q = β'(X'X + V⁻¹)β - 2β'(X'y + V⁻¹m) + y'y + m'V⁻¹m
  = β'V*⁻¹β - 2β'V*⁻¹m* + y'y + m'V⁻¹m
  = (β - m*)'V*⁻¹(β - m*) + y'y + m'V⁻¹m - m*'V*⁻¹m*.

Thus, it is enough to show that

y'y + m'V⁻¹m - m*'V*⁻¹m* = (y - Xm)'(Iₙ + XVX')⁻¹(y - Xm).    (3.66)

Consider

y'y + m'V⁻¹m - m*'V*⁻¹m*
  = y'y + m'V⁻¹m - (X'y + V⁻¹m)'V*(X'y + V⁻¹m)
  = y'(Iₙ - XV*X')y - 2y'XV*V⁻¹m + m'(V⁻¹ - V⁻¹V*V⁻¹)m.    (3.67)

By the definition of V*, we have V*(X'X + V⁻¹) = I_p, so that

V*V⁻¹ = I_p - V*X'X

and

XV*V⁻¹ = X - XV*X'X
        = (Iₙ - XV*X')X.    (3.68)

Also, by A.9.3,

V⁻¹ - V⁻¹V*V⁻¹ = V⁻¹ - V⁻¹(X'X + V⁻¹)⁻¹V⁻¹
              = [V + (X'X)⁻¹]⁻¹
              = X'X - X'X(X'X + V⁻¹)⁻¹X'X
              = X'(Iₙ - XV*X')X.    (3.69)


Substituting (3.68) and (3.69) into (3.67), we get

y'(Iₙ - XV*X')y - 2y'(Iₙ - XV*X')Xm + m'X'(Iₙ - XV*X')Xm
  = (y - Xm)'(Iₙ - XV*X')(y - Xm).    (3.70)

Finally, again using A.9.3,

(Iₙ + XVX')⁻¹ = Iₙ - X(X'X + V⁻¹)⁻¹X' = Iₙ - XV*X',

proving the theorem.    □

Using Theorem 3.7, we see from (3.64) that the posterior density of β is proportional to

[a* + (β - m*)'V*⁻¹(β - m*)]^{-(n+d+p)/2},

where

a* = a + (y - Xm)'(Iₙ + XVX')⁻¹(y - Xm).

This is proportional to

[1 + (n + d)⁻¹(β - m*)'W*⁻¹(β - m*)]^{-(d+n+p)/2},

where W* = a*V*/(n + d), so from A.13.5, the posterior distribution of β is a multivariate t_p(n + d, m*, W*). In particular, the posterior mean (and mode) is m*, which we can take as our Bayes estimate of β.

These arguments give the flavor of the algebra involved in Bayesian regression. Further related distributions are derived by O'Hagan [1994: Chapter 9] and in Section 12.6.2. Clearly, the choice of prior is critical, and a necessary requirement in the conjugate prior approach is the choice of the values of m and V. These might come from a previous experiment, for example. Distributions other than the normal can also be used for the likelihood, and numerical methods are available for computing posterior densities when analytical solutions are not possible. Numerical methods are surveyed in Evans and Swartz [1995]. For further practical details, the reader is referred to Gelman et al. [1995], for example.

EXERCISES 3m

1. Derive equations (3.60) and (3.61).

2. Using the noninformative prior for θ, show that the conditional posterior density f(β|y, σ) is multivariate normal. Hence deduce that the posterior mean of β is β̂.

3. Suppose that we use the noninformative prior for θ.

(a) If v = σ², show that f(v) ∝ 1/v.


(b) Obtain an expression proportional to f(β, v|y).

(c) Using (3.62), integrate out β to obtain

f(v|y) ∝ v^{-(ν/2+1)} exp(-a/v),

where ν = n - p and a = ||y - Xβ̂||²/2.

(d) Find the posterior mean of v.

3.13 ROBUST REGRESSION

Least squares estimates are the most efficient unbiased estimates of the regression coefficients when the errors are normally distributed. However, they are not very efficient when the distribution of the errors is long-tailed. Under these circumstances, we can expect outliers in the data: namely, observations whose errors εᵢ are extreme. We will see in Section 9.5 that least squares fits are unsatisfactory when outliers are present in the data, and for this reason alternative methods of fitting have been developed that are not as sensitive to outliers.

When fitting a regression, we minimize some average measure of the size of the residuals. We can think of least squares as "least mean of squares," which fits a regression by minimizing the mean of the squared residuals (or, equivalently, the sum of the squared residuals). Thus, least squares solves the minimization problem

min_b (1/n) Σᵢ₌₁ⁿ eᵢ(b)²,

where eᵢ(b) = Yᵢ - xᵢ'b. Here, average is interpreted as the mean and size as the square. The sensitivity of least squares to outliers is due to two factors. First, if we measure size using the squared residual, any residual with a large magnitude will have a very large size relative to the others. Second, by using a measure of location such as a mean that is not robust, any large square will have a very strong impact on the criterion, resulting in the extreme data point having a disproportionate influence on the fit.

Two remedies for this problem have become popular. First, we can measure size in some other way, by replacing the square e² by some other function ρ(e) which reflects the size of the residual in a less extreme way. To be a sensible measure of size, the function ρ should be symmetric [i.e., ρ(e) = ρ(-e)], positive [ρ(e) ≥ 0], and monotone [ρ(|e₁|) ≥ ρ(|e₂|) if |e₁| > |e₂|]. This idea leads to the notion of M-estimation, discussed by, for example, Huber [1981: Chapter 7], Hampel et al. [1986: Chapter 6], and Birkes and Dodge [1993: Chapter 5].

Second, we can replace the sum (or, equivalently, the mean) by a more robust measure of location such as the median or a trimmed mean. Regression


methods based on this idea include least median of squares and least trimmed squares, described in Rousseeuw [1984] and Rousseeuw and Leroy [1987]. A related idea is to minimize some robust measure of the scale of the residuals (Rousseeuw and Yohai [1984]).

3.13.1 M-Estimates

Suppose that the observed responses Yᵢ are independent and have density functions

fᵢ(y; β, σ) = (1/σ) f( (y - xᵢ'β)/σ ),    (3.71)

where σ is a scale parameter. For example, if f is the standard normal density, then the model described by (3.71) is just the standard regression model and σ is the standard deviation of the responses.

The log likelihood corresponding to this density function is

l(β, σ) = -n log σ + Σᵢ₌₁ⁿ log f[(Yᵢ - xᵢ'β)/σ],

which, putting ρ = -log f, we can write as

-{ n log σ + Σᵢ₌₁ⁿ ρ[(Yᵢ - xᵢ'β)/σ] }.

Thus, to estimate β and σ using maximum likelihood, we must minimize

n log s + Σᵢ₌₁ⁿ ρ[eᵢ(b)/s]    (3.72)

as a function of b and s. Differentiating leads to the estimating equations

Σᵢ₌₁ⁿ ψ[eᵢ(b)/s]xᵢ = 0,    (3.73)

Σᵢ₌₁ⁿ ψ[eᵢ(b)/s]eᵢ(b) = ns,    (3.74)

where ψ = ρ'.

EXAMPLE 3.11 Let ρ(x) = ½x², so that ψ(x) = x. Then (3.73) reduces to the normal equations (3.4), with solution the least squares estimate (LSE) β̂, and (3.74) gives the standard maximum likelihood estimate

σ̂² = (1/n) Σᵢ₌₁ⁿ eᵢ(β̂)².    □


EXAMPLE 3.12 Let ρ(x) = |x|. The corresponding estimates are values of s and b that minimize

n log s + (1/s) Σᵢ₌₁ⁿ |eᵢ(b)|.    (3.75)

Clearly, a value of b minimizing (3.75) is also a value that minimizes

Σᵢ₌₁ⁿ |eᵢ(b)|,

and is called the L₁ estimate. Note that there may be more than one value of b that minimizes (3.75). There is a large literature devoted to L₁ estimation; see, for example, Bloomfield and Steiger [1983] and Dodge [1987]. Note that the L₁ estimate is the maximum likelihood estimator if f in (3.71) is the double exponential density proportional to exp(-|y|). An alternative term for the L₁ estimate is the LAD (least absolute deviations) estimate.    □

If we have no particular density function f in mind, we can choose ρ to make the estimate robust by choosing a ρ for which ψ = ρ' is bounded. We can generalize (3.73) and (3.74) to the estimating equations

Σᵢ₌₁ⁿ ψ[eᵢ(b)/s]xᵢ = 0,    (3.76)

Σᵢ₌₁ⁿ χ[eᵢ(b)/s] = 0,    (3.77)

where χ is also chosen to make the scale estimate robust. The resulting estimates are called M-estimates, since their definition is motivated by the maximum likelihood estimating equations (3.73) and (3.74). However, there is no requirement that ψ and χ be related to the density function f in (3.71).

EXAMPLE 3.13 (Huber "Proposal 2," Huber [1981: p. 137]) Let

ψ(x) = { -k,  x < -k;   x,  -k ≤ x ≤ k;   k,  x > k },    (3.78)

where k is a constant to be chosen. The function (3.78) was derived by Huber using minimax asymptotic variance arguments and truncates the large residuals. The value of k is usually chosen to be 1.5, which gives a reasonable compromise between least squares (which is the choice giving greatest efficiency at the normal model) and L₁ estimation, which will give more protection from outliers.    □

An estimate θ̂ of a parameter θ is consistent if θ̂ → θ as the sample size increases. (Roughly speaking, consistency means that θ is the parameter


actually being estimated by θ̂.) It can be shown that a necessary condition for consistency when the parameters are estimated using (3.76) and (3.77) is

E[ψ(Z)] = 0,    (3.79)

and

E[χ(Z)] = 0,    (3.80)

where Z has density function f. Equation (3.79) will be satisfied if f is symmetric about zero and if ψ is antisymmetric [i.e., ψ(-z) = -ψ(z)]. This will be the case if ρ is symmetric about zero. We note that the conditions (3.79) and (3.80) are only necessary conditions, so the estimates may be biased even if they are satisfied. However, Huber [1981: p. 171] observes that in practice the bias will be small, even if the conditions are not satisfied.

EXAMPLE 3.14 In Huber's Proposal 2, the function ψ is antisymmetric, so condition (3.79) is satisfied. The scale parameter is estimated by taking χ(x) = ψ²(x) - c for some constant c, which is chosen to make the estimate consistent when f is the normal density function. From (3.80), we require that c = E[ψ(Z)²], where Z is standard normal.    □

EXAMPLE 3.15 Another popular choice is to use χ(x) = sign(|x| - 1/c) for some constant c. Then (3.77) becomes

Σᵢ₌₁ⁿ sign(|eᵢ(b)| - s/c) = 0,

which has solution (see Exercises 3n, No. 1, at the end of this chapter)

s = c medianᵢ |eᵢ(b)|.

This estimate is called the median absolute deviation (MAD); to make it consistent for the normal distribution, we require that c⁻¹ = Φ⁻¹(3/4) = 0.6745 (i.e., c = 1.4826).    □

Regression coefficients estimated using M-estimators are almost as efficient as least squares if the errors are normal, but are much more robust if the error distribution is long-tailed. Unfortunately, as we will see in Example 3.23 below, M-estimates of regression coefficients are just as vulnerable as least squares estimates to outliers in the explanatory variables.

3.13.2 Estimates Based on Robust Location and Scale Measures

As an alternative to M-estimation, we can replace the mean by a robust measure of location but retain the squared residual as a measure of size. This leads to the least median of squares estimate (LMS estimate), which minimizes

medianᵢ eᵢ(b)².


The LMS estimator was popularized by Rousseeuw [1984] and is also discussed by Rousseeuw and Leroy [1987]. An alternative is to use the trimmed mean rather than the median, which results in the least trimmed squares estimate (LTS estimate), which minimizes

Σᵢ₌₁ʰ e₍ᵢ₎(b)²,    (3.81)

where h is chosen to achieve a robust estimator and e₍₁₎(b)² ≤ ··· ≤ e₍ₙ₎(b)² are the ordered squared residuals. The amount of trimming has to be quite severe to make the estimate robust. The choice h = [n/2] + 1 (where [x] is the greatest integer ≤ x) is a popular choice, which amounts to trimming 50% of the residuals. The choice of h is discussed further in Section 3.13.3.
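Neither LMS nor LTS has a closed form; in practice they are approximated by searching over many elemental subsets. The sketch below is a crude random-subset search on made-up data (the function name and tuning choices are hypothetical, and production implementations are considerably more refined):

    import numpy as np

    def lms_lts_search(X, y, h, n_trials=2000, seed=0):
        """Crude random-subset search for the LMS and LTS criteria."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        best = {"lms": (np.inf, None), "lts": (np.inf, None)}
        for _ in range(n_trials):
            idx = rng.choice(n, size=p, replace=False)      # elemental subset of p points
            try:
                b = np.linalg.solve(X[idx], y[idx])          # exact fit to the subset
            except np.linalg.LinAlgError:
                continue
            e2 = np.sort((y - X @ b) ** 2)
            for name, crit in (("lms", np.median(e2)), ("lts", e2[:h].sum())):
                if crit < best[name][0]:
                    best[name] = (crit, b)
        return best

    rng = np.random.default_rng(6)
    n = 50
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = X @ np.array([1.0, 2.0]) + 0.3 * rng.normal(size=n)
    y[:10] += 20.0                                           # 20% gross outliers
    best = lms_lts_search(X, y, h=n // 2 + 1)
    print(best["lms"][1], best["lts"][1])                    # both should stay near (1, 2)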

These estimates are very robust to outliers in both the errors and the explanatory variables but can be unstable in a different way. In certain circumstances, small changes in nonextreme points can make a very large change in the fitted regression. In Figure 3.2(a), the eight points lie on one of two lines, with the point marked A lying on both. If a line is fitted through the five collinear points, all five residuals corresponding to those points are zero. Since a majority of the residuals are zero, the median squared residual is also zero, so a line through these points minimizes the LMS criterion.

Now move the point marked B to be collinear with the remaining three points, resulting in Figure 3.2(b). This results in a new set of five collinear points. Using the same argument, this small change has resulted in the fitted LMS line now passing through the new set of collinear points. A small change in point B has resulted in a big change in the fit.

Fig. 3.2 Instability of the LMS estimator. [Panels (a) and (b): plots of y against x.]


In addition, these estimates are very inefficient compared to least squares if the data are actually normally distributed. In this case, the asymptotic relative efficiency of LMS relative to the LSE is zero. (That is, the ratio of the variance of the LSE to that of the LMS estimate approaches zero as the sample size increases.) The equivalent for the LTS is 8% (Stromberg et al. [2000]). These poor efficiencies have motivated a search for methods that are at the same time robust and efficient. Before describing these, we need to discuss ways of quantifying robustness more precisely.

3.13.3 Measuring Robustness

We will discuss two measures of robustness. The first is the notion of breakdown point, which measures how well an estimate can resist gross corruption of a fraction of the data. The second is the influence curve, which gives information on how a single outlier affects the estimate.

Breakdown Point of an Estimate

Suppose that we select a fraction of the data. Can we cause an arbitrarily large change in the estimate by making a suitably large change in the selected data points?

Clearly, for some estimates the answer is yes; in the case of the sample mean we can make an arbitrarily large change in the mean by making a sufficiently large change in a single data point. On the other hand, for the sample median we can make large changes to almost 50% of the data without changing the median to the same extent.

Definition 3.2 The breakdown point of an estimate is the smallest fraction of the data that, when changed by an arbitrarily large amount, can cause an arbitrarily large change in the estimate.

Thus, the sample mean has a breakdown point of 1/n and the sample median a breakdown point of almost 1/2. We note that a breakdown point of 1/2 is the best possible, for if more than 50% of the sample is contaminated, it is impossible to distinguish between the "good" and "bad" observations, since the outliers are now typical of the sample.

Since the least squares estimate of a regression coefficient is a linear combination of the responses, it follows that an arbitrarily large change in a single response will cause an arbitrarily large change in at least one regression coefficient. Thus, the breakdown point of the least squares estimate is 1/n.

Since the median has a very high breakdown point, and the median of the data Y₁, ..., Yₙ minimizes the least absolute deviation Σᵢ |Yᵢ - θ| as a function of θ, it might be thought that the L₁ estimator of the regression coefficients would also have a high breakdown point. Unfortunately, this is not the case; in fact, the breakdown point of L₁ is the same as that of least squares. It can be shown, for example in Bloomfield and Steiger [1983: p. 7], that when the regression matrix X is of full rank, there is a value minimizing


Σᵢ₌₁ⁿ |eᵢ(b)| for which at least p residuals are zero. Further, Bloomfield and Steiger [1983: p. 55] also prove that if one data point is arbitrarily far from the others, this data point must have a zero residual. It follows that by moving the data point an arbitrary amount, we must also be moving the fitted plane by an arbitrary amount, since the fitted plane passes through the extreme data point. Thus replacing a single point can cause an arbitrarily large change in the regression plane, and the breakdown point of the L₁ estimate is 1/n. The same is true of M-estimates (Rousseeuw and Leroy [1987: p. 149]).

We saw above that the LMS and LTS estimates were inefficient compared to M-estimates. They compensate for this by having breakdown points of almost 1/2, the best possible. If we make a small change in the definition of the LMS, its breakdown point can be slightly improved. Let

h = [n/2] + [(p + 1)/2],    (3.82)

where [x] denotes the largest integer ≤ x. If we redefine the LMS estimator as the value of b that minimizes e₍ₕ₎(b)², rather than the median squared residual, the LMS breakdown point becomes ([(n - p)/2] + 1)/n. If h is given by (3.82), then the LTS estimate which minimizes

Σᵢ₌₁ʰ e₍ᵢ₎(b)²

also has breakdown point ([(n - p)/2] + 1)/n, slightly higher than with the choice h = [n/2] + 1. These results are discussed in Rousseeuw and Leroy [1987: pp. 124, 132].

Influence Curves

Suppose that F is a k-dimensional distribution function (d.f.), and θ is a population parameter that depends on F, so that we may write θ = T(F). We call T a statistical functional, since it is a function of a function.

EXAMPLE 3.16 Perhaps the simplest example of a statistical functional is the mean E_F[X] of a random variable X, where the subscript F denotes expectation with respect to the d.f. F. In terms of integrals,

T(F) = E_F[X]    (3.83)
     = ∫ x dF(x).    □

EXAMPLE 3.17 If Z is a random k-vector with distribution function F, then the matrix E_F[ZZ'] is a statistical functional, also given by the k-dimensional integral

T(F) = ∫ zz' dF(z).    (3.84)
□


Definition 3.3 If Z₁, ..., Zₙ are independent and identically distributed random vectors each with distribution function F, the empirical distribution function (e.d.f.) Fₙ is the d.f. which places mass n⁻¹ at each of the n points Zᵢ, i = 1, ..., n.

Integration with respect to the e.d.f. is just averaging; if h is a function, then

∫ h(z) dFₙ(z) = n⁻¹ Σᵢ₌₁ⁿ h(Zᵢ).

Many statistics used to estimate parameters T(F) are plug-in estimates of the form T(Fₙ), where Fₙ is the e.d.f. based on a random sample from F.

EXAMPLE 3.18 (Vector sample mean) Let Z₁, ..., Zₙ be a random sample from some multivariate distribution having d.f. F. The plug-in estimate of

T(F) = ∫ z dF(z)

is

T(Fₙ) = ∫ z dFₙ(z) = n⁻¹ Σᵢ₌₁ⁿ Zᵢ = Z̄,

the sample mean.    □

EXAMPLE 3.19 The plug-in estimate of (3.84) is

T(Fₙ) = ∫ zz' dFₙ(z)    (3.85)
      = n⁻¹ Σᵢ₌₁ⁿ ZᵢZᵢ'.    □

Consider a regression with a response variable Y and explanatory variables x₁, ..., x_{p-1}. When studying the statistical functionals that arise in regression, it is usual to assume that the explanatory variables are random. We regard the regression data (xᵢ, Yᵢ), i = 1, ..., n, as n identically and independently distributed random (p + 1)-vectors, distributed as (x, Y), having a joint distribution function F, say. Thus, in contrast with earlier sections, we think of the vectors xᵢ as being random and having initial element 1 if the regression contains a constant term. As before, we write X for the (random) matrix with ith row xᵢ'.

We shall assume that the conditional distribution of Y given x has density function g[(y - β'x)/σ], where g is a known density, for example the standard


normal. For simplicity, we will sometimes assume that the scale parameter σ is known. In this case, we can absorb σ into g and write the conditional density as g(y - β'x).

EXAMPLE 3.20 (Least squares) Consider the functional

T(F) = {E_F[xx']}⁻¹E_F[xY].

The plug-in estimator of T is

T(Fₙ) = ( n⁻¹ Σᵢ xᵢxᵢ' )⁻¹ ( n⁻¹ Σᵢ xᵢYᵢ )    (3.86)
      = (X'X)⁻¹X'Y.    □

To assess the robustness of a plug-in estimator T(Fₙ), we could study how it responds to a small change in a single data point. An alternative, which we adopt below, is to examine the population version: We look at the effect of small changes in F on the functional T(F). This allows us to examine the sensitivity of T more generally, without reference to a particular set of data.

Suppose that F is a distribution function. We can model a small change in F at a fixed (i.e., nonrandom) value z₀ = (x₀', y₀)' by considering the mixture of distributions F_t = (1 - t)F + tδ_{z₀}, where δ_{z₀} is the distribution function of the constant z₀, and t > 0 is close to zero. The sensitivity of T can be measured by the rate at which T(F_t) changes for small values of t.

Definition 3.4 The influence curve (IC) of a statistical functional T is the derivative with respect to t of T(F_t) evaluated at t = 0, and is a measure of the rate at which T responds to a small amount of contamination at z₀.

We note that the influence curve depends on both F and z₀, and we use the notation

IC(z₀, F) = dT(F_t)/dt |_{t=0}

to emphasize this. Cook and Weisberg [1982: Chapter 3], Hampel et al. [1986], and Davison and Hinkley [1997] all have more information on influence curves.

EXAMPLE 3.21 (IC of the mean) Let T be the mean functional defined in Example 3.16. Then

T(F_t) = ∫ x dF_t(x)
       = (1 - t) ∫ x dF(x) + t ∫ x dδ_{x₀}(x)
       = (1 - t)T(F) + t x₀,

so that

dT(F_t)/dt = x₀ - T(F)

and

IC(x₀, F) = x₀ - T(F).

This is unbounded in x₀, suggesting that a small amount of contamination can cause an arbitrarily large change. In other words, the mean is highly nonrobust.    □

EXAMPLE 3.22 (IC for the LSE) Let T be the LSE functional defined in Example 3.20. Write Σ_F = E_F[xx'] and γ_F = E_F[xY]. Then

T(F_t) = Σ_{F_t}⁻¹ γ_{F_t}.    (3.87)

We have

Σ_{F_t} = E_{F_t}[xx']
        = (1 - t)E_F[xx'] + t x₀x₀'
        = (1 - t)Σ_F + t x₀x₀'
        = (1 - t){Σ_F + t' x₀x₀'},

where t' = t/(1 - t). By A.9.1 we get

Σ_{F_t}⁻¹ = (1 - t)⁻¹[ Σ_F⁻¹ - t'Σ_F⁻¹x₀x₀'Σ_F⁻¹/(1 + t'x₀'Σ_F⁻¹x₀) ].    (3.88)

Similarly,

γ_{F_t} = (1 - t)γ_F + t x₀y₀.    (3.89)

Substituting (3.88) and (3.89) in (3.87) yields

T(F_t) = T(F) + t'Σ_F⁻¹x₀y₀ - t'Σ_F⁻¹x₀x₀'T(F) + o(t),

so that

[T(F_t) - T(F)]/t = Σ_F⁻¹x₀y₀ - Σ_F⁻¹x₀x₀'T(F) + o(1).

Letting t → 0, we get

IC(z₀, F) = Σ_F⁻¹x₀[y₀ - x₀'T(F)].

We see that this is unbounded in both x₀ and y₀, indicating that the LSE is not robust.    □
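The calculation above can be mimicked empirically: treat a large sample as F, mix in a small mass t at z₀ = (x₀', y₀)', and compare the finite-difference change in the LSE functional with the formula Σ_F⁻¹x₀[y₀ - x₀'T(F)]. A sketch with made-up data:

    import numpy as np

    rng = np.random.default_rng(7)
    n = 5000
    x = np.column_stack([np.ones(n), rng.normal(size=n)])     # random explanatory vectors
    y = x @ np.array([1.0, 2.0]) + rng.normal(size=n)

    Sigma_F = x.T @ x / n                     # estimate of E_F[x x']
    gamma_F = x.T @ y / n                     # estimate of E_F[x Y]
    T_F = np.linalg.solve(Sigma_F, gamma_F)   # LSE functional at (the empirical) F

    x0 = np.array([1.0, 10.0])                # a high-leverage point
    y0 = 0.0
    t = 1e-4                                  # small contamination fraction

    Sigma_t = (1 - t) * Sigma_F + t * np.outer(x0, x0)
    gamma_t = (1 - t) * gamma_F + t * x0 * y0
    T_Ft = np.linalg.solve(Sigma_t, gamma_t)

    numeric_ic = (T_Ft - T_F) / t
    formula_ic = np.linalg.solve(Sigma_F, x0) * (y0 - x0 @ T_F)
    print(numeric_ic, formula_ic)             # nearly equal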

The situation is somewhat better for M-estimates.


EXAMPLE 3.23 (IC for M-estimates) For simplicity we will assume that the scale parameter σ is known. Consider the functional T defined implicitly by the equation

E_F( ψ{[Y - x'T(F)]/σ} x ) = 0.    (3.90)

The plug-in version T(Fₙ) is the solution of

n⁻¹ Σᵢ₌₁ⁿ ψ{[Yᵢ - xᵢ'T(Fₙ)]/σ} xᵢ = 0,

which is of the form (3.76). Thus, the functional T defined by (3.90) is the M-estimation functional.

To derive its influence curve, we substitute F_t for F in (3.90). This yields

(1 - t)E_F[ ψ{[Y - x'T(F_t)]/σ} x ] + t ψ{[y₀ - x₀'T(F_t)]/σ} x₀ = 0.    (3.91)

Let Ṫ_t = dT(F_t)/dt, e_t = [Y - x'T(F_t)]/σ and η_t = [y₀ - x₀'T(F_t)]/σ, and note the derivatives

dψ(e_t)/dt = -ψ'(e_t) x'Ṫ_t/σ

and

dE_F[ψ(e_t)x]/dt = E_F[ x dψ(e_t)/dt ] = -E_F[ψ'(e_t)xx'] Ṫ_t/σ.

Now differentiate both sides of (3.91). We obtain

(1 - t) dE_F[ψ(e_t)x]/dt - E_F[ψ(e_t)x] + t dψ(η_t)/dt x₀ + ψ(η_t)x₀ = 0,

which, using the derivatives above, gives

-(1 - t)E_F[ψ'(e_t)xx'] Ṫ_t/σ - E_F[ψ(e_t)x] - t ψ'(η_t)x₀x₀'Ṫ_t/σ + ψ(η_t)x₀ = 0.

Now set t = 0. Noting that F_t = F when t = 0, and using (3.90), we get

E_F[ψ(e₀)x] = E_F[ ψ{[Y - x'T(F)]/σ} x ] = 0,

and from the definition of the IC, Ṫ₀ = IC(z₀, F). Thus,

-E_F[ ψ'{[Y - x'T(F)]/σ} xx' ] IC(z₀, F)/σ + ψ{[y₀ - x₀'T(F)]/σ} x₀ = 0,

so finally,

IC(z₀, F) = σ ψ{[y₀ - x₀'T(F)]/σ} M⁻¹x₀,    (3.92)

where M = E_F[ ψ'{(Y - x'T(F))/σ} xx' ]. Thus, assuming that ψ is bounded, the influence curve is bounded in y₀, suggesting that M-estimates are robust


with respect to outliers in the errors. However, the IC is not bounded in x₀, so M-estimates are not robust with respect to high-leverage points (i.e., points with outliers in the explanatory variables; see Section 9.4).    □

The robust estimates discussed so far are not entirely satisfactory, since the high breakdown estimators LMS and LTS have poor efficiency, and the efficient M-estimators are not robust against outliers in the explanatory variables and have breakdown points of zero. Next, we describe some other robust estimates that have high breakdown points but much greater efficiency than LMS or LTS.

3.13.4 Other Robust Estimates

Bounded Influence Estimators

As we saw above, M-estimators have influence curves that are unbounded in x₀ and so are not robust with respect to high-leverage points. However, it is possible to modify the estimating equation (3.73) so that the resulting IC is bounded in x₀. Consider an estimating equation of the form

Σᵢ₌₁ⁿ w(xᵢ) ψ{ eᵢ(b)/[σw(xᵢ)] } xᵢ = 0,    (3.93)

where, for simplicity, we will assume that the scale parameter σ is known. This modified estimating equation was first suggested by Handschin et al. [1975], and the weights are known as Schweppe weights. It can be shown (see Hampel et al. [1986: p. 316]) that the IC for this estimate is

IC(z₀, F) = σ w(x₀) ψ{ (y₀ - x₀'T(F))/(σw(x₀)) } M⁻¹x₀,

where M is a matrix [different from the M appearing in (3.92)] not depending on z₀. The weight function w is chosen to make the IC bounded, and the resulting estimates are called bounded influence estimates or generalized M-estimates (GM-estimates).

To make the IC bounded, the weights are chosen to downweight cases that are high-leverage points. However, including a high-leverage point that is not an outlier (in the sense of not having an extreme error) increases the efficiency of the estimate. This is the reason for including the weight function w in the denominator in the expression eᵢ(b)/[σw(xᵢ)], so that the effect of a small residual at a high-leverage point will be magnified. An earlier version of (3.93), due to Mallows [1975], does not include the weight w(xᵢ) in the denominator and seems to be less efficient (Hill [1977]).

The weights can be chosen to minimize the asymptotic variance of the estimates, subject to the influence curve being bounded by some fixed amount. This leads to weights of the form w(x) = ||Ax||⁻¹ for some matrix A. More details may be found in Ronchetti [1987] and Hampel et al. [1986: p. 316].


Krasker and Welsch [1982] give additional references and discuss some other proposals for choosing the weights.

The breakdown point of these estimators is better than for M-estimators, but cannot exceed 1/p (Hampel et al. [1986: p. 328]). This can be low for problems with more than a few explanatory variables. To improve the breakdown point of GM-estimators, we could combine them in some way with high breakdown estimators, in the hope that the combined estimate will inherit the desirable properties of both.

The estimating equation (3.93) that defines the GM-estimate is usually solved iteratively by either the Newton-Raphson method or Fisher scoring (A.14), using some other estimate as a starting value. (This procedure is discussed in more detail in Section 11.12.2.)

A simple way of combining a high breakdown estimate with a GM-estimate is to use the high breakdown estimate as a starting value and then perform a single Newton-Raphson or Fisher scoring iteration using the GM iteration scheme discussed in Section 11.12.2; the resulting estimate is called a one-step GM-estimate. This idea has been suggested informally by several authors: for example, Hampel et al. [1986: p. 328] and Ronchetti [1987].

Simpson et al. [1992] have carried out a formal investigation of the properties of the one-step GM-estimate. They used the Mallows form of the estimating equation (3.93), with weights w(xᵢ) based on a robust Mahalanobis distance. The Mallows weights are given by

w(xᵢ) = min{ 1, [ b / ( (xᵢ - m)'C⁻¹(xᵢ - m) ) ]^{a/2} },    (3.94)

where b and a are tuning constants, m and C are robust measures of the location and dispersion of the explanatory variables, and the xᵢ's are to be interpreted in the "reduced" sense, without the initial 1. Thus, the denominator in the weight function is a robust Mahalanobis distance, measuring the distance of xᵢ from a typical x. Suitable estimates m and C are furnished by the minimum volume ellipsoid described in Section 10.6.2 and in Rousseeuw and Leroy [1987: p. 258].

If the robust distance used to define the weights and the initial estimate of the regression coefficients both have a breakdown point of almost 50%, then the one-step estimator will also inherit this breakdown point. Thus, if LMS is used as the initial estimator, and the minimum volume ellipsoid (see Section 10.6.2) is used to calculate the weights, the breakdown point of the one-step estimator will be almost 50%. The one-step estimator also inherits the bounded-influence property of the GM-estimator. Coakley and Hettmansperger [1993] suggest that efficiency can be improved by using the Schweppe form of the estimating equation and starting with the LTS estimate rather than the LMS.


S-Estimators

We can think of the "average size" of the residuals as a measure of their dispersion, so we can consider more general regression estimators based on some dispersion or scale estimator s(e₁, ..., eₙ). This leads to minimizing

D(b) = s[e₁(b), ..., eₙ(b)],    (3.95)

where s is an estimator of scale. The scale parameter σ is estimated by the minimum value of (3.95).

EXAMPLE 3.24 If we use the standard deviation as an estimate of scale, (3.95) reduces to

Σᵢ₌₁ⁿ [eᵢ(b) - ē(b)]² = Σᵢ₌₁ⁿ [Yᵢ - Ȳ - b₁(xᵢ₁ - x̄₁) - ··· - b_{p-1}(x_{i,p-1} - x̄_{p-1})]²,

which is the residual sum of squares. The estimates minimizing this are the least squares estimates. Thus, in the case of a regression with a constant term, taking the scale estimate s to be the standard deviation is equivalent to estimating the regression coefficients by least squares.    □

EXAMPLE 3.25 Using the MAD as an estimate of scale leads to minimizing medianᵢ |eᵢ(b)|, which is equivalent to minimizing medianᵢ |eᵢ(b)|². Thus, using the estimate based on the MAD is equivalent to LMS.    □

Rousseeuw and Yohai [1984] considered using robust scale estimators s = s(e₁, ..., eₙ) defined by the equation

n⁻¹ Σᵢ₌₁ⁿ ρ(eᵢ/s) = K,

where K = E[ρ(Z)] for a standard normal Z, and the function ρ is symmetric and positive. They also assume that ρ is strictly increasing on [0, c] for some value c and is constant on (c, ∞). Estimators defined in this way are called S-estimators.

Rousseeuw and Yohai show that the breakdown point of such an estimator can be made close to 50% by a suitable choice of the function ρ. The biweight function, defined by

ρ(x) = { x²/2 - x⁴/(2c²) + x⁶/(6c⁴),  |x| ≤ c;   c²/6,  |x| > c },

is a popular choice. If the constant c satisfies ρ(c) = 2E[ρ(Z)], where Z is standard normal, then Rousseeuw and Yohai prove that the breakdown point of the estimator is ([n/2] - p + 2)/n, or close to 50%. For the biweight estimator, this implies that c = 1.547. The efficiency at the normal distribution is about 29%.


R-Estimators

Another class of estimators based on a measure of dispersion are the R-estimators, where the dispersion measure is defined using ranks. Let aₙ(i), i = 1, ..., n, be a set of scores, given by

aₙ(i) = h[i/(n + 1)],    (3.96)

where h is a function defined on [0, 1]. Examples from nonparametric statistics include the Wilcoxon scores [h(u) = u - 0.5], the van der Waerden scores [h(u) = Φ⁻¹(u)], and median scores [h(u) = sign(u - 0.5)]. All these scores satisfy Σᵢ₌₁ⁿ aₙ(i) = 0.

Jaeckel [1972] defined a dispersion measure by

s(e₁, ..., eₙ) = Σᵢ₌₁ⁿ aₙ(Rᵢ)eᵢ,    (3.97)

where Rᵢ is the rank of eᵢ (i.e., its position when e₁, ..., eₙ are arranged in increasing order). Since the scores sum to zero, the dispersion measure will be close to zero if the eᵢ's are similar. For a fixed vector b, let Rᵢ be the rank of eᵢ(b). Jaeckel proposed as a robust estimator the vector that minimizes s[e₁(b), ..., eₙ(b)].

Note that since the scores satisfy Σᵢ₌₁ⁿ aₙ(i) = 0, the measure s has the property

s(e₁ + c, ..., eₙ + c) = s(e₁, ..., eₙ).

Thus, for any vector b, if the regression contains a constant term, the quantity s[e₁(b), ..., eₙ(b)] does not depend on the initial element b₀ of b. If we write b = (b₀, b₁')', then s[e₁(b), ..., eₙ(b)] is a function of b₁ alone, which we can denote by D(b₁). It follows that we cannot obtain an estimate of β₀ by minimizing D(b₁); this must be obtained separately, by using a robust location measure such as the median applied to the residuals eᵢ(b̃), where b̃ = (0, b̂₁')', b̂₁ being the minimizer of D(b₁).

The estimate defined in this way has properties similar to those of an M-estimator: For the Wilcoxon scores it has an influence function that is bounded in y₀ but not in x₀, has a breakdown point of 1/n, and has high efficiency at the normal distribution. These facts are proved in Jaeckel [1972], Jureckova [1971], and Naranjo and Hettmansperger [1994].

The estimate can be modified to have a better breakdown point by modifying the scores and basing the ranks on the absolute values of the residuals, or equivalently, on the ordered absolute residuals, which satisfy

|e₍₁₎(b)| ≤ ··· ≤ |e₍ₙ₎(b)|.

Consider an estimate based on minimizing

D(b) = Σᵢ₌₁ⁿ aₙ(i)|e₍ᵢ₎(b)|,    (3.98)


where the scores are now of the form aₙ(i) = h⁺(i/(n + 1)), where h⁺ is a nonnegative function defined on [0, 1] and is zero on [α, 1] for 0 < α < 1. Then Hössjer [1994] shows that for a suitable choice of h⁺, the breakdown point of the estimate approaches min(α, 1 - α) as the sample size increases. The efficiency decreases as the breakdown point increases. For a breakdown point of almost 50%, the efficiency is about 7% at the normal model, similar to LTS.

The efficiency can be improved while retaining the high breakdown property by considering estimates based on differences of residuals. Sievers [1983], Naranjo and Hettmansperger [1994], and Chang et al. [1999] considered estimates of the regression coefficients (excluding the constant term) based on minimizing the criterion

D(b₁) = Σ_{i<j} wᵢⱼ |eᵢ(b) - eⱼ(b)|,    (3.99)

which, like Jaeckel's estimator, does not depend on b₀. The weights wᵢⱼ can be chosen to achieve a high breakdown point, bounded influence, and high efficiency. Suppose that b̂ and σ̂ are preliminary 50% breakdown estimates of β and σ. For example, we could use LMS to estimate β, and estimate σ using the MAD. Chang et al. [1999] show that if the weights are defined by

wᵢⱼ = min{ 1, c w(xᵢ)w(xⱼ) / ( [eᵢ(b̂)/σ̂][eⱼ(b̂)/σ̂] ) },    (3.100)

then the efficiency can be raised to about 67% while retaining a 50% breakdown point. In (3.100), the weights w(xᵢ) are the Mallows weights defined in (3.94), and c is a tuning constant. If wᵢⱼ = 1 for all i < j, then the estimate reduces to Jaeckel's estimate with Wilcoxon scores (see Exercises 3n, No. 2).

Similar efficiencies can be achieved using a modified form of S-estimate which is also based on differences of residuals. Croux et al. [1994] define a scale estimate s = s(e₁, ..., eₙ) as the solution to the equation

Σ_{i<j} ρ( (eᵢ - eⱼ)/s ) = (ⁿ₂) - (ʰ₂),    (3.101)

where h = [(n + p + 1)/2]. Then the estimate based on minimizing s(b₁) = s[e₁(b), ..., eₙ(b)] is called a generalized S-estimate. Note that again this criterion does not depend on b₀.

Defining

ρ(u) = { 0,  |u| ≤ 1;   1,  |u| > 1 }    (3.102)

gives an estimate called the least quartile difference estimate (LQD estimate), since (see Exercises 3n, No. 3) the resulting s is approximately the lower quartile of all the (ⁿ₂) differences |eᵢ(b) - eⱼ(b)|. Croux et al. [1994] show


that the LQD estimator has a breakdown point of almost 50% and roughly 67% efficiency. It does not have a bounded influence function.

A similar estimate, based on a trimmed mean of squared differences, is the least trimmed difference estimate (LTD estimate), which minimizes the sum of the smallest (ʰ₂) ordered squared differences. This estimate, introduced in Stromberg et al. [2000], has properties similar to those of the LQD.

EXERCISES 3n

1. Let χ(x) = sign(|x| - 1/c) for some constant c. Show that the solution of (3.77) is the MAD estimate

s = c medianᵢ |eᵢ(b)|.

2. Show that if we put wᵢⱼ = 1 in (3.99), we get Jaeckel's estimator defined by (3.97) with Wilcoxon scores.

3. Show that if s is the solution of (3.101) with ρ given by (3.102), then the resulting s is approximately the lower quartile of the (ⁿ₂) differences |eᵢ(b) - eⱼ(b)|.

MISCELLANEOUS EXERCISES 3

1. Let Yᵢ = aᵢβ₁ + bᵢβ₂ + εᵢ (i = 1, 2, ..., n), where the aᵢ, bᵢ are known and the εᵢ are independently and identically distributed as N(0, σ²). Find a necessary and sufficient condition for the least squares estimates of β₁ and β₂ to be independent.

2. Let Y = θ + ε, where E[ε] = 0. Prove that the value of θ that minimizes ||Y - θ||² subject to Aθ = 0, where A is a known q × n matrix of rank q, is

θ̂ = [Iₙ - A'(AA')⁻¹A]Y.

3. Let Y = Xβ + ε, where E[ε] = 0, Var[ε] = σ²Iₙ, and X is n × p of rank p. If X and β are partitioned in the form

Xβ = (X₁, X₂)( β₁ ; β₂ ),

prove that the least squares estimate β̂₂ of β₂ is given by

β̂₂ = [X₂'X₂ - X₂'X₁(X₁'X₁)⁻¹X₁'X₂]⁻¹ [X₂'Y - X₂'X₁(X₁'X₁)⁻¹X₁'Y].


Find Var[β̂₂].

4. Suppose that E[Y] = Xβ and Var[Y] = σ²Iₙ. Prove that a'Y is the linear unbiased estimate of E[a'Y] with minimum variance if and only if cov[a'Y, b'Y] = 0 for all b such that E[b'Y] = 0 (i.e., b'X = 0').

(Rao [1973])

5. If X has full rank and Ŷ = Xβ̂, prove that

Σᵢ var[Ŷᵢ] = σ²p.

6. Estimate the weights βᵢ (i = 1, 2, 3, 4) of four objects from the following weighing data (see Exercises 3e, No. 5, at the end of Section 3.6 for notation):

   x₁    x₂    x₃    x₄    Weight (Y)
    1     1     1     1        20.2
    1    -1     1    -1         8.0
    1     1    -1    -1         9.7
    1    -1    -1     1         1.9

7. Three parcels are weighed at a post office singly, in pairs, and all together, giving weights Y_{ijk} (i, j, k = 0, 1), the suffix 1 denoting the presence of a particular parcel and the suffix 0 denoting its absence. Find the least squares estimates of the weights.

(Rahman [1967])

8. An experimenter wishes to estimate the density d of a liquid by weighing known volumes of the liquid. Let Yᵢ be the weight for volume xᵢ (i = 1, 2, ..., n), and let E[Yᵢ] = dxᵢ and var[Yᵢ] = σ²f(xᵢ). Find the least squares estimate of d for the following cases:

(a) f(xᵢ) = 1.    (b) f(xᵢ) = xᵢ.    (c) f(xᵢ) = xᵢ².

9. Let Yᵢ = β₀ + β₁xᵢ + εᵢ (i = 1, 2, 3), where E[ε] = 0, Var[ε] = σ²V with

V = ( 1   ρa   ρ ; ρa   a²   ρa ; ρ   ρa   1 )    (σ² unknown, 0 < ρ < 1),

and x₁ = -1, x₂ = 0, and x₃ = 1. Show that the generalized least squares estimates of β₀ and β₁ are

β₀* = r⁻¹{ (a² - aρ)Y₁ + (1 - 2aρ + ρ)Y₂ + (a² - aρ)Y₃ }   and   β₁* = ½(Y₃ - Y₁),


where r = 1 + ρ + 2a² - 4aρ. Also prove the following:

(a) If a = 1, then the fitted regression line Ŷᵢ* = β₀* + β₁*xᵢ cannot lie wholly above or below the values of Yᵢ (i.e., the Yᵢ - Ŷᵢ* cannot all have the same sign).

(b) If 0 < a < ρ < 1, then the fitted regression line can lie wholly above or below the observations.

(Canner [1969])

10. If X is not of full rank, show that any solution β of X'V⁻¹Xβ = X'V⁻¹Y minimizes (Y - Xβ)'V⁻¹(Y - Xβ).

11. Let

Y₁ = θ₁ + θ₂ + ε₁,
Y₂ = θ₁ - 2θ₂ + ε₂,
Y₃ = 2θ₁ - θ₂ + ε₃,

where E[εᵢ] = 0 (i = 1, 2, 3). Find the least squares estimates of θ₁ and θ₂. If the equations above are augmented to

Y₁ = θ₁ + θ₂ + θ₃ + ε₁,
Y₂ = θ₁ - 2θ₂ + θ₃ + ε₂,
Y₃ = 2θ₁ - θ₂ + θ₃ + ε₃,

find the least squares estimate of θ₃.

12. Given the usual full-rank regression model, prove that the random variables Ȳ and Σᵢ(Ŷᵢ - Ȳ)² are statistically independent.

13. Let Yᵢ = βxᵢ + uᵢ, xᵢ > 0 (i = 1, 2, ..., n), where uᵢ = ρu_{i-1} + εᵢ and the εᵢ are independently distributed as N(0, σ²). If β̂ is the ordinary least squares estimate of β, prove that var[β̂] is inflated when ρ > 0.

14. Suppose that E[Y_t] = β₀ + β₁ cos(2πk₁t/n) + β₂ sin(2πk₂t/n), where t = 1, 2, ..., n, and k₁ and k₂ are positive integers. Find the least squares estimates of β₀, β₁, and β₂.

15. Suppose that E[Yᵢ] = α₀ + β₁(xᵢ₁ - x̄₁) + β₂(xᵢ₂ - x̄₂), i = 1, 2, ..., n. Show that the least squares estimates of α₀, β₁, and β₂ can be obtained by the following two-stage procedure:

(i) Fit the model E[Yᵢ] = α₀ + β₁(xᵢ₁ - x̄₁).

(ii) Regress the residuals from (i) on (xᵢ₂ - x̄₂).