
ON THE ESTIMATION OF REGRESSION COEFFICIENTS WITH A
COORDINATEWISE MEAN SQUARE ERROR CRITERION OF GOODNESS

Burt S. Holland

Institute of Statistics Mimeograph Series No. 693

July, 1970


TABLE OF CONTENTS

LIST OF TABLES

1. INTRODUCTION
   1.1 The Model
   1.2 Multicollinearity
   1.3 Estimation with a Mean Square Error Criterion of Goodness
   1.4 Multivariate Generalizations of Mean Square Error

2. REVIEW OF LITERATURE
   2.1 The Stein-James Estimator
   2.2 Test-Estimation Procedures
   2.3 Ridge Regression

3. ALTERNATIVE ESTIMATORS OF β
   3.1 Construction of the Estimators
       3.1.1 b2
       3.1.2 b3
       3.1.3 b4
       3.1.4 b5
       3.1.5 b6
   3.2 Asymptotic Distribution Theory of the Estimators
   3.3 Relative Mean Square Efficiency of the Estimators: A Simulation Experiment
   3.4 Discussion of the Estimators

4. SUMMARY, CONCLUSIONS AND RECOMMENDATIONS
   4.1 Summary
   4.2 Conclusions and Recommendations

5. LIST OF REFERENCES

6. APPENDIX: THE SIMULATION DESIGN AND PROGRAM

LIST OF TABLES

3.1 Estimated relative efficiencies E2, E3, and E5, based on N = 500 iterations

3.2 Relative efficiency E6 for various values of q


1. INTRODUCTION

1.1 The Model

Consider the linear model

    Y = Xβ + e ,   (1.1)

where X = [x_tj] is an n × p matrix of known fixed quantities with rank p ≤ n, β = (β_1, ..., β_p)' is a p × 1 vector of unknown parameters to be estimated, and e is an n × 1 vector of random variables distributed with mean vector 0 and dispersion matrix σ²I, σ² unknown. The Gauss-Markov Theorem (Graybill, 1961, pp. 115-116) states that for this model the minimum variance linear unbiased estimator of β is given by

    b1 = (X'X)^-1 X'Y ,   (1.2)

with dispersion matrix σ²(X'X)^-1. By "minimum variance" of the vector estimator b1 we mean here that var(b_1j) ≤ var(b*_1j) for each j = 1, 2, ..., p, where b*_1 = (b*_11, b*_12, ..., b*_1p)' is any other linear unbiased estimator of β.

The minimum variance quadratic unbiased estimator of σ² is given by

    σ̂² = (Y - Xb1)'(Y - Xb1) / (n - p) .   (1.3)
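As a concrete illustration (not part of the original thesis), the quantities in (1.2) and (1.3) are a few lines of numpy; the helper name `gauss_markov_fit` is my own.

```python
import numpy as np

def gauss_markov_fit(X, Y):
    """Sketch of (1.2)-(1.3): b1 = (X'X)^-1 X'Y and the residual
    mean square sigma2_hat, for an n x p design X of full column rank."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b1 = XtX_inv @ X.T @ Y
    resid = Y - X @ b1
    sigma2_hat = float(resid @ resid) / (n - p)
    # dispersion matrix of b1 is sigma^2 * XtX_inv
    return b1, sigma2_hat, XtX_inv
```

For an exact-fit toy problem (Y = Xβ with no error) this returns b1 = β and σ̂² = 0.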


When interest lies in constructing confidence regions or tests of hypotheses concerning functions of β and σ², further specification of the distribution of e is necessary. If, as is often the case, we can assume that e follows a multivariate normal probability law, b1 and σ̂² are the jointly sufficient maximum likelihood estimators of their expectations. Also, under this assumption, (n - p)σ̂²/σ² ~ χ²(n - p) and var(σ̂²) = 2σ⁴/(n - p).

In this thesis we are concerned with the estimation of β where, in order to elucidate insofar as possible the mechanism underlying the model, primary interest lies in efficient structural estimation. We are less interested here in improved prediction of the response corresponding to a given p-tuple (x_t1, x_t2, ..., x_tp) or in judiciously choosing such a p-tuple to optimize the response y.

1.2 Multicollinearity

Despite its desirable properties, the estimator b1 occasionally leaves a lot to be desired. If X (and hence X'X) is nearly singular so that |X'X| is "small," the variances of the {b_1j} get very large and the estimators themselves become quite sensitive to changes in specification of the model. This ill-conditioning of the model, often referred to as multicollinearity, has long been a prickly problem for investigators in all fields of application.

There are several avenues of approach to the multicollinearity problem. The most satisfactory one is to insist upon additional information (e.g., more sample data or elaboration of the model). If, as is often the case, such information is unobtainable, some investigators would consider the possibility of reducing the dimensionality of the problem, i.e., discarding some of the regressors.¹ However, if the theoretical considerations underlying the construction of the model are not to be neglected, this approach is inappropriate when the statistician's aim is structural estimation. The incorrect omission of an important though multicollinear variable from the list of independent variables introduces a perceptible bias in the estimation of the remaining coefficients.²

This thesis follows a third route in considering some slightly biased estimators of β. In particular, we consider alternative estimators obtained by modifying the criterion of goodness from linear minimum variance unbiasedness to smallness of mean square error (m.s.e.). It is felt that in practice, few people would seriously object to this minor change in loss structure.

1.3 Estimation with a Mean Square Error Criterion of Goodness

For estimating the scalar parameter θ, the m.s.e. of the estimator θ̂ (having finite variance) is given by

    m.s.e.(θ̂; θ) = E(θ̂ - θ)² = E[θ̂ - E(θ̂)]² + [E(θ̂) - θ]² = var(θ̂) + bias²(θ̂) .

Implicit in the adoption of this risk function is the acceptance of

¹See Draper and Smith, 1969.

²See Farrar and Glauber (1967) for an excellent account of the multicollinearity problem and some proposed remedies.


bias in the estimation if the reduction in variance surpasses the newly introduced squared bias term. When there is substantial multicollinearity in the model (1.1), the large variance of the estimators means that even a small percentage reduction in variance can be appreciable in absolute terms.

Whereas minimum variance unbiasedness stipulates that θ̂ be made close to E(θ̂) subject to the condition E(θ̂) = θ, the mean square error criterion requires that θ̂ be close to θ itself. Both [θ̂ - E(θ̂)]² and (θ - θ̂)² are attractive loss functions in that they possess the property of convexity. But in practice, minimum variance unbiased (m.v.u.) estimators are usually easy to construct while minimum m.s.e. estimators are not. For unbiased estimation with a strictly convex loss function, the Rao-Blackwell Theorem (Fraser, 1966, p. 57) gives an explicit procedure for construction of the unique m.v.u. estimator when there exists a complete sufficient statistic for the family of densities {P_θ : θ ∈ Ω}. There is no analogous result for minimum m.s.e. estimation.

As an illustration of the difficulty in obtaining minimum m.s.e. estimators, suppose μ̂ is the m.v.u. estimator of a scalar parameter μ and one asks what constant c will minimize

    m.s.e.(cμ̂; μ) = c² var(μ̂) + (c - 1)² μ² .   (1.4)

The optimum c is clearly μ² / [μ² + var(μ̂)], so that

    μ̃ = μ̂ μ² / [μ² + var(μ̂)]   (1.5)


has smaller m.s.e. than μ̂. However, μ̃ cannot be computed without prior knowledge of the value of the ratio μ² / var(μ̂). This dilemma occurs when we wish to use a modification of the arithmetic mean of a random sample of size n to estimate the unknown mean of a population having unknown variance. Such interference of "nuisance parameters" has led Kendall and Stuart (1967, p. 22) to state,

    Minimum m.s.e. estimators are not much used, but it is as well to
    recognize that the objection to them is a practical, rather than
    theoretical one.

If it should happen that var(μ̂) = λμ², λ known, we can get a minimum m.s.e. estimator for μ. As an example, consider the estimation of σ² in the model (1.1) with normal errors:

    σ̃² = σ̂² (σ²)² / [(σ²)² + 2σ⁴/(n - p)] = (n - p) σ̂² / (n - p + 2)

and

    m.s.e.(σ̃²; σ²) = 2σ⁴ / (n - p + 2) < 2σ⁴ / (n - p) = m.s.e.(σ̂²; σ²) .
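A quick Monte Carlo check of this comparison (my own illustration, not from the thesis; the values n = 20, p = 3, σ² = 4 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma2 = 20, 3, 4.0     # arbitrary toy values
nu = n - p
reps = 20000

# (n - p) * sigma2_hat / sigma2 is chi-square(n - p) under normal errors,
# so sigma2_hat can be simulated directly.
s2_hat = sigma2 * rng.chisquare(nu, size=reps) / nu
s2_tilde = nu * s2_hat / (nu + 2)   # the minimum-m.s.e. multiple above

mse_hat = np.mean((s2_hat - sigma2) ** 2)      # theory: 2*sigma2^2/(n-p)
mse_tilde = np.mean((s2_tilde - sigma2) ** 2)  # theory: 2*sigma2^2/(n-p+2)
```

Both sample m.s.e. values land close to their theoretical counterparts, with the shrunken estimator the smaller of the two.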

Where minimum m.s.e. estimators do not exist, it may still be possible to construct estimators that have smaller m.s.e. than traditional estimators over a wide range of values for various unknown parameters. We shall see that this is the case when estimating β in (1.1).


1.4 Multivariate Generalizations of Mean Square Error

Thus far the discussion has been confined to the estimation of a single parameter. Upon moving to the multivariate problem, we must adopt a suitable generalization of m.s.e. to the case of a vector estimator θ̂ = (θ̂_1, θ̂_2, ..., θ̂_p)' of a vector parameter θ = (θ_1, θ_2, ..., θ_p)'.

One attractive generalization that has been proposed (Bhattacharya, 1966) is

    m.s.e.(θ̂; θ) = E(θ̂ - θ)' W (θ̂ - θ) ,   (1.6)

where W is a p × p symmetric positive definite matrix of known weights. (This guarantees that the corresponding loss function is nonnegative and convex.) W is often taken to be D, a diagonal matrix of positive elements, or more particularly, the identity matrix:

    m.s.e.(θ̂; θ) = E(θ̂ - θ)'(θ̂ - θ) .   (1.7)

Geometrically inclined readers will refer to (1.6) as the expected (squared) distance between θ̂ and θ in the metric of W.

Instead of employing a single criterion such as (1.6) we shall, in this thesis, construct vector estimators by simultaneously attending to the p univariate problems of rendering {m.s.e.(θ̂_j; θ_j), j = 1, 2, ..., p} as small as possible. Thus we would call θ̂ "better" than θ̃ if

    E(θ̂_j - θ_j)² ≤ E(θ̃_j - θ_j)²   (1.8)

for every j = 1, 2, ..., p, and at least one of the inequalities is strict. If θ̂ is better than θ̃ according to (1.8), it is better also according to (1.6) if W = D (but not necessarily if W ≠ D). However, the converse of this statement is clearly false.
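The comparison rule (1.8) is easy to state in code; this small helper (mine, not the thesis's) takes two vectors of per-coordinate m.s.e. values:

```python
def better_coordinatewise(mse_a, mse_b):
    """Criterion (1.8): estimator A is 'better' than B when its m.s.e. is
    no larger in every coordinate and strictly smaller in at least one."""
    pairs = list(zip(mse_a, mse_b))
    return all(a <= b for a, b in pairs) and any(a < b for a, b in pairs)
```

E.g. [1, 2] beats [1, 3], while [1, 4] and [2, 3] are not comparable under (1.8), mirroring the partial-order nature of the criterion.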

The criterion (1.8) lacks the geometric appeal of (1.6). It fails to take account of the cross-product terms {E(θ̂_j - θ_j)(θ̂_ℓ - θ_ℓ), j ≠ ℓ}, an omission that may not be warranted.

The difficulty with the employment of (1.6) lies in the dilemma of choosing a satisfactory W. Unless W is judiciously chosen, (1.8) may be preferable to (1.6) in that if the latter is used:

(i) Some of the {θ_j} may be poorly estimated (although others are not).

(ii) Unjustified emphasis may be placed on the efficient estimation of some subset of {θ_j} relative to another subset.

(iii) The criterion is not scale invariant.

In view of our declared aim of estimation rather than efficient prediction or optimization, it seems unwise, in the absence of further information, to consider a means of estimation that may perform well for one part of the model to the detriment of another.³ Our criterion attends to the estimation of each β_j apart from that for the remaining elements of β. Recall too that the Gauss-Markov Theorem chooses β̂ = b1 to minimize each E[β̂_j - E(β̂_j)]² rather than E[β̂ - E(β̂)]'W[β̂ - E(β̂)] for some W.

³The m.s.e. of prediction of a "future" response Y_t corresponding to a stochastic p-tuple (x_t1, x_t2, ..., x_tp) with dispersion matrix W is given by σ² + E(β̂ - β)'W(β̂ - β).


An even more restrictive criterion than (1.8) is to call θ̂ "better in m.s.e." than θ̃ iff m.s.e.(h'θ̂) ≤ m.s.e.(h'θ̃) for every h (p × 1), or equivalently, iff [E(θ̃ - θ)(θ̃ - θ)' - E(θ̂ - θ)(θ̂ - θ)'] is a positive semi-definite matrix.⁴

Fraser (1966, p. 60) states a multivariate generalization of the Rao-Blackwell Theorem which employs the notion of "ellipsoid of concentration" as the multivariate generalization of variance.

⁴See Toro-Vizcarrondo and Wallace (1968, p. 560).


2. REVIEW OF LITERATURE

2.1 The Stein-James Estimator

Like most investigators, James and Stein (1961) have preferred to use criterion (1.7). For the special case where X'X = I (e.g., orthogonal polynomial regression), e multivariate normal, and p ≥ 3, they have shown that

    β̂(γ) = [1 - γσ̂² / Y'XX'Y] X'Y   (2.1)

is uniformly (in β and σ²) better than b1 = X'Y, for γ any positive number less than 2(p - 2)(n - p)/(n - p + 2), and that m.s.e.[β̂(γ); β] is minimized by taking γ = (p - 2)(n - p)/(n - p + 2). The coefficient [1 - γσ̂²/Y'XX'Y] will be between 0 and 1 for all admissible γ whenever

    Y'XX'Y > 2(p - 2)(n - p) σ̂² / (n - p + 2) .

Thus the Stein-James estimator (2.1) is simply a (scalar) constant multiple of b1. They merely prove that β̂(γ) is better than b1, omitting the motivation behind its construction. Normality is not necessary to render b1 inadmissible, but in its absence no alternative estimator has been proposed.

Baranchik (1970) has generalized (2.1) to allow γ to be a certain bounded function of F_{p,n-p}.
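A minimal sketch of (2.1) for the orthonormal case (my own code; `stein_james` is a hypothetical name, with γ set at its m.s.e.-minimizing value):

```python
import numpy as np

def stein_james(X, Y):
    """Estimator (2.1) for X'X = I and p >= 3: shrink b1 = X'Y toward 0.
    gamma = (p-2)(n-p)/(n-p+2) is the m.s.e.-minimizing choice."""
    n, p = X.shape
    b1 = X.T @ Y                      # equals (X'X)^-1 X'Y when X'X = I
    resid = Y - X @ b1
    sigma2_hat = float(resid @ resid) / (n - p)
    gamma = (p - 2) * (n - p) / (n - p + 2)
    factor = 1.0 - gamma * sigma2_hat / float(b1 @ b1)  # Y'XX'Y = b1'b1 here
    return factor * b1
```

The whole vector b1 is multiplied by one scalar factor, exactly as the text notes.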


Stein (1956) has also shown that b1 is admissible when p ≤ 2 [the risk function still being (1.7)]. This means that in the two-regressors case with criterion (1.8), we cannot simultaneously improve upon b_11 and b_12 for all possible values of β and σ².

Sclove (1966, 1968) has pointed out that β̂(γ) generalizes to the case X'X ≠ I provided that we take W = X'X. Thus (1.6) now appears as

    m.s.e.(β̂; β) = E(β̂ - β)' X'X (β̂ - β) .   (2.2)

Since X'X is proportional to the inverse of the variance-covariance matrix of b1, this choice of W removes objection (iii) in Section 1.4, and goes a long way toward the withdrawal of (i). Bhattacharya (1966) has indicated that for W ≠ I and arbitrary X'X, we can at least transform the problem to the W = D case.

Sclove's paper (1968) surveys all the literature discussed in this section in order to interpret some highly mathematical results for the benefit of applied statisticians.

2.2 Test-Estimation Procedures

Consider the model

    Y_t - Ȳ = β_1(x_t1 - x̄_1) + β_2(x_t2 - x̄_2) + e_t ,   (2.3)

which differs from (1.1) with p = 2 only in that the variables are now corrected for their means.⁵ Bancroft (1944) has discussed the estimator β̃_1 of β_1 specified as follows. Perform the level α Student's t test of the hypothesis H_0: β_2 = 0 vs H_a: β_2 ≠ 0.

⁵The distinction between (2.3) and (1.1) will be further discussed in Section 3.3 below.


Then let

    β̃_1 = b_11   if H_0 is rejected,

    β̃_1 = Σ_t (x_t1 - x̄_1)(Y_t - Ȳ) / Σ_t (x_t1 - x̄_1)²   if H_0 is accepted.

Toro-Vizcarrondo (1968) has investigated the same estimator where a m.s.e. test is used in place of Student's t.⁶

Baranchik (1964) considered a modification of the Stein-James estimator where β̂(γ) is taken to be the null vector when a preliminary F test of H_0: β = 0 vs H_a: β ≠ 0 is accepted.

These so-called "test-estimation" procedures are actually more akin to the "discarding regressors" approach to multicollinearity than they are to the "new estimator" procedure to be investigated in the next chapter.
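Bancroft's rule can be sketched as follows (my own illustration; the function name and the caller-supplied critical value `t_crit` are assumptions, and the d.f. n - 3 reflects the fitted mean plus two slopes):

```python
import numpy as np

def bancroft_pretest_beta1(x1, x2, y, t_crit):
    """Test-estimator for beta1 in the centered two-regressor model (2.3):
    keep the full least-squares estimate b11 if the t test rejects
    H0: beta2 = 0 at the supplied critical value; otherwise drop x2 and
    use the simple regression of y on x1."""
    x1 = x1 - x1.mean()
    x2 = x2 - x2.mean()
    y = y - y.mean()
    n = len(y)
    X = np.column_stack([x1, x2])
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    s2 = float(resid @ resid) / (n - 3)
    t2 = b[1] / np.sqrt(s2 * XtX_inv[1, 1])
    if abs(t2) > t_crit:
        return float(b[0])                      # H0 rejected
    return float(x1 @ y) / float(x1 @ x1)       # H0 accepted: ignore x2
```

When x2 genuinely matters, the pretest almost always keeps the full fit; when β_2 is near zero, it behaves like the reduced model.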

2.3 Ridge Regression

Hoerl and Kennard (1970a) present the estimator

    β̂(k) = (X'X + kI)^-1 X'Y ,   (2.4)

⁶See Toro-Vizcarrondo and Wallace (1968).


where the scalar k is chosen so as to stabilize the system by making the estimator less sensitive than b1 to small changes in model specification. It is demonstrated that with risk as in (1.7) there are admissible values of k such that β̂(k) is better than b1. However, no explicit expression for k is presented. The authors suggest that its choice be based on a graph of the {β̂_j(k)} vs k (called a "ridge trace"); that k be the smallest value such that for k* > k, the {β̂_j(k*)} are nearly independent of k*. They claim that the k that one will employ in practice will only slightly increase the error sum of squares [Y - Xβ̂(k)]'[Y - Xβ̂(k)]. It is also noted that (2.4) is the Bayes estimator of β when the parameter vector has a prior distribution with mean 0 (p × 1) and dispersion (σ²/k)I (p × p).

A second paper by the same authors (Hoerl and Kennard, 1970b) contains illustrations of the performance of the ridge-trace procedure.
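The ridge trace they describe amounts to evaluating (2.4) over a grid of k values and plotting each coefficient's path; a compact numpy sketch (mine; `ridge_path` is a hypothetical name):

```python
import numpy as np

def ridge_path(X, Y, ks):
    """Ridge estimates (2.4), one row per value of k; plotting the
    columns against k gives the Hoerl-Kennard 'ridge trace'."""
    p = X.shape[1]
    XtX, XtY = X.T @ X, X.T @ Y
    return np.array([np.linalg.solve(XtX + k * np.eye(p), XtY) for k in ks])
```

At k = 0 this reproduces b1; as k grows, the coefficients shrink toward 0, which is the stabilizing effect described above.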


3. ALTERNATIVE ESTIMATORS OF β

3.1 Construction of the Estimators

For lack of any other means of approach, all estimators considered here are essentially modifications of b1.

Analogous to the situation for scalar estimation discussed in Section 1.3, we find that our modified estimators contain the parameters β and σ². That is,

    b° = b°(β, σ²) .   (3.1)

In the form (3.1), b° is clearly of no use. Two procedures were considered for the construction of employable estimators:

(i) In (3.1) set σ² = σ̂² and let our estimator of β be the solution b of the equation

    b = b°(b, σ̂²) ,   (3.2)

to be determined by iteration or otherwise;

(ii) In (3.1) set σ² = σ̂² and β = b1 to obtain

    b = b°(b1, σ̂²) .   (3.3)

Procedure (i) was quickly dismissed. For all of the new estimators, rules for choosing good starting values to prime the iteration were hard to come by. Often the iteration diverged, or converged too slowly to be of practical use.

Procedure (ii) was the one chosen to be employed here. This has the advantage of rendering an estimator that is a function of the sufficient statistics b1 and σ̂². It seemed intuitively desirable not to depart from the use of sufficient statistics.


The "raw form" estimators [(3.1) as opposed to (3.3)] to be computed are minimum m.s.e. estimators, but the employable estimators of form (3.3) are not. The question to be answered was, "Does there exist a b that is 'appreciably' better than b1 over a 'wide' range of possible values of β and σ²?"

We shall denote the five new estimators by b2, b3, b4, b5, and b6. When in the "raw form," prior to making the substitutions β = b1 and σ² = σ̂², we shall write the estimators as b°_i. The j-th component of any vector h will be written h_j or (h)_j. In this section, an asterisk superscript denotes the optimal value of any variable.

The reader is reminded that the objective used to compute the "raw form" estimators is separate minimization of E(β̂_j - β_j)² for each j = 1, 2, ..., p.

3.1.1 b2

To construct b°_2, we attempt to find that p × p matrix K which optimizes (in the indicated sense) the estimator Kb1. Let k_j denote the j-th row of K. Then b°_2j = k*_j b1, where k*_j is chosen to minimize E(k_j b1 - β_j)². We compute:

    bias(k_j b1; β_j) = k_j β - β_j ;

    var(k_j b1) = σ² k_j (X'X)^-1 k'_j ;


    m.s.e.(k_j b1; β_j) = σ² k_j (X'X)^-1 k'_j + (k_j β - β_j)² .   (3.4)

Differentiating with respect to the vector k_j and setting the result equal to the null vector, we obtain

    2σ²(X'X)^-1 k'_j + 2(k_j β - β_j)β = 0 .   (3.5)

Notice that the second derivative with respect to k_j is a positive definite matrix.

Continuing from (3.5),

    [σ²(X'X)^-1 + ββ'] k'_j = β_j β ,

whence

    k*'_j = β_j [σ²(X'X)^-1 + ββ']^-1 β

and

    b°_2j = k*_j b1 = β_j β' [σ²(X'X)^-1 + ββ']^-1 b1 .   (3.6)

Employing the identity

    (A + uv')^-1 = A^-1 - A^-1 u v' A^-1 / (1 + v' A^-1 u) ,   (3.7)

where A is a square nonsingular matrix, u and v are column vectors, and (A + uv') is nonsingular, we can write

    [σ²(X'X)^-1 + ββ']^-1 = (1/σ²) [X'X - X'X ββ' X'X / (σ² + β'X'Xβ)] .

Then

    b°_2 = (1/σ²) ββ' [I - X'X ββ' / (σ² + β'X'Xβ)] X'Y

and

    b2 = (1/σ̂²) b1 b1' [I - X'X b1 b1' / (σ̂² + b1'X'X b1)] X'Y
       = (1/σ̂²) [b1'X'Y - (b1'X'X b1)(b1'X'Y) / (σ̂² + b1'X'X b1)] b1
       = [b1'X'Y / (σ̂² + b1'X'Y)] b1 ,   (3.8)

since X'X b1 = X'Y implies b1'X'X b1 = b1'X'Y.


Recall that b1'X'Y is the regression sum of squares and σ̂² is the error mean square in the standard Analysis of Variance table.

At first glance it seems as though we have found K* to be a constant [b1'X'Y / (σ̂² + b1'X'Y)] multiple of the identity matrix. This is not the case, however, because the end result (3.8) is a consequence of having substituted b1 for β.
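In employable form, (3.8) is a one-line shrinkage of b1; a numpy sketch (my own; `b2_estimator` is a hypothetical name):

```python
import numpy as np

def b2_estimator(X, Y):
    """b2 of (3.8): b1 scaled by R/(sigma2_hat + R), where R = b1'X'Y is
    the regression sum of squares and sigma2_hat the error mean square."""
    n, p = X.shape
    b1 = np.linalg.solve(X.T @ X, X.T @ Y)
    R = float(b1 @ (X.T @ Y))
    resid = Y - X @ b1
    sigma2_hat = float(resid @ resid) / (n - p)
    return (R / (sigma2_hat + R)) * b1
```

The shrinkage factor R/(σ̂² + R) lies in (0, 1), approaching 1 as the regression sum of squares dominates the error mean square.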

3.1.2 b3

This estimator constrains the K in Section 3.1.1 to be a diagonal matrix with j-th entry m_j. Thus b°_3j = m*_j b_1j. From (1.5) we find that

    m*_j = β_j² / (β_j² + σ² s^jj) ,

where s^jj denotes the j-th diagonal element of (X'X)^-1.

3.1.3 b4

The estimators b2 and b3 are multiplicative modifications of b1. We now consider an additive modification, b4, whose raw form is written

    b°_4 = b1 + D X'Y .


Like the Hoerl and Kennard procedure discussed in Section 2.3, this estimator attempts to reduce the p variances by altering the diagonal elements of (X'X)^-1.

The estimator b°_4j is found by computing the minimum with respect to the diagonal matrix D of the j-th diagonal element of the "m.s.e. matrix"

    E(b1 + DX'Y - β)(b1 + DX'Y - β)'
      = E[(b1 - β) + DX'Xβ + DX'e][(b1 - β) + DX'Xβ + DX'e]'
      = σ²(X'X)^-1 + 2σ²D + σ²DX'XD + DX'X ββ' X'XD .

Apart from the σ²(X'X)^-1 term, which is free of D, the j-th diagonal element is found to be

    2σ²d_j + d_j² (σ² s_jj + z_j ββ' z'_j) ,

where z_j is the j-th row of X'X, d_j the j-th diagonal element of D, and [s_jt] = X'X.

Thus the j-th diagonal element of D* is

    d*_j = -σ² / (σ² s_jj + z_j ββ' z'_j) .


Therefore,

    b°_4j = b_1j - σ²(X'Y)_j / (σ² s_jj + z_j ββ' z'_j)

and

    b_4j = b_1j - σ̂² Σ_t x_tj Y_t / [σ̂² s_jj + (Σ_t x_tj Y_t)²] .   (3.9)

Attempts to combine the p equations (3.9) into a compact expression for the b4 vector were unsuccessful.
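Componentwise, though, (3.9) is direct to evaluate; a numpy sketch (mine; `b4_estimator` is a hypothetical name):

```python
import numpy as np

def b4_estimator(X, Y):
    """b4 of (3.9), computed one coordinate at a time:
    b4j = b1j - s2h*(X'Y)_j / (s2h*s_jj + ((X'Y)_j)^2),
    where s_jj is the j-th diagonal element of X'X and s2h = sigma2_hat."""
    n, p = X.shape
    XtY = X.T @ Y
    s_diag = np.diag(X.T @ X)
    b1 = np.linalg.solve(X.T @ X, XtY)
    resid = Y - X @ b1
    s2h = float(resid @ resid) / (n - p)
    return b1 - s2h * XtY / (s2h * s_diag + XtY ** 2)
```

Coordinates with (X'Y)_j = 0 are left at b_1j, and the correction fades as the regression signal (Σ_t x_tj Y_t)² swamps σ̂².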

3.1.4 b5

The fourth proposed estimator is radically different from the first three in both appearance and conception. We consider first the case p = 2:

    b°_5j = {[I - S (I - SX'XS)^(2w*_j + 1) S^-1] b1}_j ,   w*_j ≥ 0 ,   (3.10)

where S is the diagonal matrix with j-th entry s_jj^(-1/2). Abbreviating r_12 as r, we see that SX'XS is simply the correlation matrix

    | 1  r |
    | r  1 | ,

while


    (I - SX'XS)^(2w+1) = |     0         -r^(2w+1) |
                         | -r^(2w+1)         0     | .   (3.11)

Since rank(X) = 2, we have |r| < 1. Thus (3.11) tends to 0 (2 × 2) as w → ∞, and b°_5j tends to b_1j as w*_j → ∞. For w*_j = 0, b°_5j = (S²X'Y)_j, the Gauss-Markov estimator of β_j when s_12 = 0. It was hoped that with a judicious choice of w*_j between these extremes, b°_5j would be an effective compromise between b_1j and the estimator when the (3 - j)-th column of X is ignored. To find w*_j, the value of w_j that makes the j-th diagonal element of the m.s.e. matrix E(b°_5 - β)(b°_5 - β)' as small as possible, we begin by computing the dispersion matrix of b°_5:

    σ² [I - S (I - SX'XS)^(2w*_j + 1) S^-1] (X'X)^-1 [I - S^-1 (I - SX'XS)^(2w*_j + 1) S] .

One finds that

    var(b°_5j) = σ² [1 - 2r^(2w*_j + 2) + r^(4w*_j + 2)] / [s_jj (1 - r²)] ,


where ℓ = 3 - j, j = 1, 2. The "bias-cobias" matrix is

    {E[(b1 - β) - S (I - SX'XS)^(2w*_j + 1) S^-1 b1]}{E[(b1 - β) - S (I - SX'XS)^(2w*_j + 1) S^-1 b1]}' ,

the j-th diagonal element of which is found to be

    r^(4w*_j + 2) (s_ℓℓ / s_jj) β_ℓ² .

The squared bias of b°_5j is clearly at a maximum when w*_j = 0 and decreases monotonically to zero as w*_j → ∞. Likewise, it can be shown that var(b°_5j) is at a minimum (σ²/s_jj) when w*_j = 0 and increases monotonically to σ²/[(1 - r²)s_jj] as w*_j → ∞. Thus it is not surprising that m.s.e.(b°_5j; β_j) attains a minimum for some positive and finite w*_j. Further calculation reveals this optimal w*_j to be

    w*_j = ln[σ² / (σ² + (1 - r²) s_ℓℓ β_ℓ²)] / (2 ln |r|) .   (3.12)

In practice, w*_j is rounded off to the nearest integer. We then obtain b_5j from b°_5j by replacing β_ℓ and σ² in (3.12) with b_1ℓ and σ̂², respectively.


Unfortunately, it was found that except in a very special case, b5 does not generalize to p > 2 because

    (I - SX'XS)^w → 0 (p × p) as w → ∞   (3.13)

fails to hold in general. Furthermore, (3.13) is valid with decreasing frequency as p increases. Consider for example

            | 1  a  ...  a |
    SX'XS = | a  1  ...  a |   (p × p) ,
            | :  :       : |
            | a  a  ...  1 |

which is an admissible correlation matrix for -1/(p - 1) < a ≤ 1. It can be shown that (3.13) holds here iff |a| < 1/(p - 1). A general necessary and sufficient condition for the validity of (3.13) is that the largest eigenvalue of SX'XS be less than 2.

Also, the computation of w*_j involves the solution of a (p - 1)-st order difference equation. Its solution is intractable for p > 2.

b5 does generalize when p > 2 if X'X is "block diagonal" with all blocks 2 × 2 or scalars. Then all β_j corresponding to a column of X that is one of a correlated pair are estimated just as in the p = 2 case by ignoring the remaining p - 2 columns of X. The remaining β_j are estimated as in (3.10) for the p = 1 case, b_5j = (X'Y)_j/s_jj, for here I - SX'XS = 1 - 1 = 0.
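The eigenvalue condition for (3.13) is easy to check numerically; in this sketch (mine, not the thesis's) the equicorrelation example above is tested at a = 0.4 and a = 0.6 with p = 3:

```python
import numpy as np

def powers_vanish(R):
    """Condition (3.13): (I - R)^w -> 0 as w -> infinity for a symmetric
    correlation matrix R = SX'XS iff every eigenvalue of R lies in (0, 2),
    i.e. iff the spectral radius of I - R is below 1."""
    eig = np.linalg.eigvalsh(R)
    return bool(eig.min() > 0.0 and eig.max() < 2.0)

def equicorrelation(p, a):
    """The p x p matrix with 1 on the diagonal and a off the diagonal."""
    R = np.full((p, p), a)
    np.fill_diagonal(R, 1.0)
    return R
```

For p = 3 the cutoff |a| < 1/(p - 1) = 0.5 from the text is reproduced: a = 0.4 passes, while a = 0.6 fails because its largest eigenvalue 1 + 2a = 2.2 exceeds 2.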


It was found that the normality assumption was necessary in order to obtain any results concerning b2, b3, and b4. We shall assume its presence in what follows. Also, let 0 < λ_1n ≤ λ_2n ≤ ... ≤ λ_pn be the ordered eigenvalues of X'X.

Definition. The sequence of p × p matrices {A_n} = {[a_ij^(n)]} is said to converge to the matrix A = [a_ij] if a_ij^(n) → a_ij for each i, j = 1, 2, ..., p. (Notation: A_n → A, or lim_{n→∞} A_n = A.)

Lemma 1. If

    (X'X)^-1 → 0 (p × p) as n → ∞ ,   (3.15)

then

    β'X'Xβ → ∞ as n → ∞   (3.16)

for all non-null β.

Proof. Note that the eigenvalues of the p.d. matrix (X'X)^-1 are the reciprocals of the {λ_jn}, and that (3.15) says s^ij_(n) → 0 as n → ∞ for all i, j = 1, 2, ..., p. Using a theorem from Bodewig (1956, p. 66), we have for every n:

    0 < λ_1n^-1 ≤ [Σ_{i,j} (s^ij_(n))²]^(1/2) → 0 as n → ∞ ,

hence λ_1n^-1 → 0 as n → ∞. It follows from a result stated by Rao [1965, p. 50, (1f.2.1)] that


    β'X'Xβ / β'β ≥ λ_1n → ∞ as n → ∞   (3.17)

for all non-null β. Then (3.16) must hold, for otherwise we reach a contradiction of (3.17).

Lemma 2. If

    X'X / n → A as n → ∞ ,   (3.18)

where A is a finite p × p positive definite matrix, then (3.15) holds.

Proof. Let X'X/n = A_n. The determinant of a square matrix is a continuous function of its elements; hence A_n → A implies det(A_n) → det(A) > 0. It follows immediately from this and from the well-known formula for the inverse of a nonsingular matrix A_n in terms of its cofactors and determinant (i.e., the "method of adjoints") that

    (X'X)^-1 = (1/n) A_n^-1 → 0 (p × p) .


(3.18) is the usual regularity condition assumed in discussions of large sample properties of estimators in linear models.⁷

Lemma 3. The sequence of random variables U_n converges in probability to a random variable U iff for some g > 0,

    E[ |U_n - U|^g / (1 + |U_n - U|^g) ] → 0 as n → ∞ .

For a proof of this Lemma, see Loève (1963, p. 158).

Lemma 4. If any of the three conditions (3.15), (3.16), or (3.18) holds, then plim (σ̂² / b1'X'Y) = 0.

Proof. Since b1'X'Y / (p σ̂²) ~ F'(p, n-p; λ'_n), where λ'_n = β'X'Xβ/σ², we have, for arbitrary δ > 0,

    P(σ̂² / b1'X'Y > δ) = P[F'(p, n-p; λ'_n) < 1/(pδ)] .   (3.19)

From Lemmas 1 and 2 we see that any of the three conditions mentioned above implies that λ'_n → ∞ as n → ∞. It is a well-known property of the noncentral F distribution that this enables us to conclude that the right-hand side of (3.19) converges to zero, and hence that plim (σ̂² / b1'X'Y) = 0. The fact that the denominator d.f. of F' is an increasing function of n considerably speeds this convergence.

⁷See for example Malinvaud (1966, p. 174).
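As a sanity check on Lemma 4 (my own simulation, not the thesis's), the ratio σ̂²/b1'X'Y can be averaged over replicated fits for growing n; with a random design whose X'X grows like n, it heads toward zero:

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_ratio(n, beta=np.array([1.0, -0.5]), sigma=1.0, reps=200):
    """Average of sigma2_hat / b1'X'Y over replicated fits; since
    beta'X'Xbeta grows like n here, lambda'_n -> infinity and the
    ratio should shrink as n increases."""
    vals = []
    for _ in range(reps):
        X = rng.normal(size=(n, 2))
        Y = X @ beta + sigma * rng.normal(size=n)
        b1 = np.linalg.solve(X.T @ X, X.T @ Y)
        resid = Y - X @ b1
        s2 = float(resid @ resid) / (n - 2)
        vals.append(s2 / float(b1 @ (X.T @ Y)))
    return float(np.mean(vals))
```

Roughly, σ̂² stays near σ² = 1 while b1'X'Y grows like n β'β, so the average ratio falls off like 1/n.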


Corollary 1. If any of the three conditions (3.15), (3.16), or (3.18) holds, then

    E[σ̂² / (σ̂² + b1'X'Y)] → 0 as n → ∞ .

Proof. Applying Lemmas 3 and 4 with U_n = σ̂²/b1'X'Y, U = 0, and g = 1, we have that

    E[ (σ̂²/b1'X'Y) / (1 + σ̂²/b1'X'Y) ] → 0 as n → ∞ .

The result follows immediately, since the quantity inside this expectation equals σ̂²/(σ̂² + b1'X'Y).

Lemma 5. If U_n is a sequence of random variables and W a constant, then a sufficient condition for the equalities

    plim U_n = lim_{n→∞} E(U_n) = W

is that E(U_n - W)² → 0 as n → ∞.

Proof. The consistency of U_n as an estimator of W is a consequence of Tchebycheff's Inequality, while the asymptotic unbiasedness follows from the inequalities

    [E(U_n) - W]² = [E(U_n - W)]² ≤ E(U_n - W)² → 0 .


Theorem 1. If either of the conditions (3.15) or (3.18) holds, then for each j = 1, 2, ..., p, b₂ⱼ is a consistent and asymptotically unbiased estimator of βⱼ.

Proof. First suppose β ≠ 0 (p × 1). Then

0 ≤ E(b₂ⱼ − βⱼ)² ≤ E(b₁ⱼ − βⱼ)² + 2E{ |βⱼ| [b₁'X'Y/(σ̂² + b₁'X'Y)] [σ̂²/(σ̂² + b₁'X'Y)] |b₁ⱼ − βⱼ| } + βⱼ² E{ [σ̂²/(σ̂² + b₁'X'Y)]² } .    (3.20)

Since E(b₁ⱼ − βⱼ)² = σ²sʲʲ, (3.15) (or, by Lemma 2, (3.18)) implies that the first two terms of (3.20) tend to zero as n → ∞. Corollary 1 establishes the convergence to zero of the third term of (3.20). Hence, applying Lemma 5, we have the desired result.

If, on the other hand, β = 0 (p × 1), then the third term of (3.20) is identically zero, and Lemma 4 and Corollary 1 are no longer required for the proof.

Theorem 2. If limₙ→∞ sʲʲ = 0, then b₃ⱼ is a consistent and asymptotically unbiased estimator of βⱼ.

The proof of this theorem is almost identical to that of Theorem 1; the relevant statistic here is b₁ⱼ²/(σ̂²sʲʲ), which is distributed as F″(1, n−p; λ″ₙ), where λ″ₙ = βⱼ²/(σ²sʲʲ).

Turning now to b₄ⱼ, it is seen that to establish its asymptotic unbiasedness and consistency as an estimator of βⱼ, we merely need to show that the second term on the right-hand side of (3.9) has expectation zero and probability limit zero, respectively. We can rewrite this term as

(−b₁ⱼ*) / (1 + F*) ,    (3.21)

where b₁ⱼ* is the simple linear regression coefficient obtained when all elements of β other than βⱼ are ignored, and F* = F*(1, n−p; λₙ*) is a noncentral F random variable with noncentrality parameter

λₙ* = β'(s₁ⱼ, s₂ⱼ, ..., sₚⱼ)'(s₁ⱼ, s₂ⱼ, ..., sₚⱼ)β / (σ² sⱼⱼ) .    (3.22)

The author was unable to show that the expectation of (3.21) tended to zero under certain conditions; however, he is fairly certain that this is the case. The difficulty arises from an inability to separate this random variable into two components whose expectations can be taken separately. That is, we cannot (for example) claim that

|E[ (−b₁ⱼ*)(1 + F*)⁻¹ ]| ≤ E|b₁ⱼ*| · E[(1 + F*)⁻¹] .


Although it seems plausible that |b₁ⱼ*| and (1 + F*)⁻¹ should be negatively correlated (since the size of |b₁ⱼ*| and that of (1 + F*) are both directly related to |βⱼ|), attempts to formally establish this result were unsuccessful. However, we can show that under a certain regularity condition the probability limit of (3.21) is zero.

Lemma 6. If λₙ* → ∞ as n → ∞, then plim (1 + F*)⁻¹ = 0.

Proof. For arbitrary δ > 0, Pr[ (1 + F*)⁻¹ > δ ] → 0 as n → ∞, in the same fashion as the right-hand side of (3.19) discussed in Lemma 4.

Theorem 3. If sⱼⱼ → ∞ and λₙ* → ∞ as n → ∞, then plim b₄ⱼ = βⱼ.

Proof. As noted above, we need only show that the probability limit of (3.21) is zero. Since Var(−b₁ⱼ*) = σ²sⱼⱼ⁻¹, sⱼⱼ → ∞ as n → ∞ implies, via Tchebycheff's Inequality, that plim (−b₁ⱼ*) = −β̃ⱼ, say, a finite constant. From Lemma 6, we have that plim (1 + F*)⁻¹ = 0. Then, using a result in Cramér (1963, p. 255), it follows that

plim [ (−b₁ⱼ*)(1 + F*)⁻¹ ] = 0 .    (3.23)

As for b₆ⱼ(qⱼ) = qⱼb₁ⱼ, it clearly has none of the desirable asymptotic properties except in the trivial case βⱼ = 0.


3.3 Relative Mean Square Efficiency of the Estimators: A Simulation Experiment

Since b₂, b₃, b₄ and b₅ are nonlinear in ε, obtaining their p.d.f.'s or even first and second moments seemed a near impossible task. Thus, in order to assess the quality of these estimators, it was necessary to resort to a simulation experiment.

A detailed account of the simulation is deferred until the Appendix. The presentation in this section will encompass the salient results of the study.

Practical limitations on the size of the experiment made it necessary to confine attention almost exclusively to the two-regressors case. Conjectures concerning the nature of generalizations to p > 2 will be made in the following Section and in Chapter 4.

We wish to ascertain how the m.s.e. of bᵢⱼ (i > 1) compares with that of b₁ⱼ. Therefore we are interested in the relative mean square error efficiencies of b₁ⱼ to bᵢⱼ, i > 1:

Eᵢ = m.s.e.(bᵢⱼ; βⱼ) / m.s.e.(b₁ⱼ; βⱼ) ,    i > 1 .

From considerations of symmetry, it is clear that Eᵢ is independent of the signs of the βⱼ.

The simulation computes estimates {Êᵢ} for a range of values of the quantities upon which the {Eᵢ} depend. It uses n = 25 throughout, but from ancillary experiments it was determined that the values of the Êᵢ are virtually stable in the range 10 ≤ n ≤ 50. For j = 1, 2, let λⱼ = βⱼ²/(σ²sʲʲ) denote the noncentrality parameter of the Student's t test of H₀: βⱼ = 0 vs Hₐ: βⱼ ≠ 0. Then it was determined that, for


estimating βⱼ, the Êᵢ depend only on r, λⱼ, and λₜ as follows, where, as before, t = 3 − j, j = 1, 2:

Ê₂ = Ê₂(λⱼ, λₜ, r)
Ê₃ = Ê₃(λⱼ)    (3.24)

It was discovered that Ê₂ depends more heavily on λⱼ than on λₜ.

The estimator b₄ was discarded at an early stage of the investigation because it was found that Ê₂ is less than Ê₄ for virtually all λⱼ, λₜ, r; whenever the contrary is true, it is by a very small margin, and moreover Ê₄ then exceeds Ê₃.

The experiment assumed that ε is multivariate normal. It remains to be seen whether departures from this assumption seriously affect the conclusions we shall draw.

The Êᵢ are presented in Table 3.1 for 4 values of r; 7 or 8 values of λ (= λⱼ or λₜ, whichever is the more important determinant of the Êᵢ in question); and, for Ê₂, 2 or 3 values of λₜ. No formal investigation was made of the reliability of the entries; to have done so would have entailed a large simulation experiment in itself. The author is confident, however, that all entries are accurate to within ±.02, and that the accuracy increases to as fine as ±.0005 as the entries get very close to 1.⁸

⁸A discussion of this statement is given in the Appendix.


In addition to the Êᵢ, the actual estimators themselves were computed. b₂ⱼ and b₃ⱼ were always slightly smaller than b₁ⱼ in absolute value, with b₂ⱼ usually being very close to b₁ⱼ; these observations are actually deducible from inspection of the estimators themselves. b₅ⱼ is in a sense the most ambitious estimator, because it was occasionally greater than b₁ⱼ in absolute value. Estimates were made of the proportion of m.s.e. attributable to squared bias, and that for b₅ was often as large as .25. The proportion for b₂ and b₃ rarely exceeded .15.

Some people would criticize the present model,

Yₜ = β₁xₜ₁ + β₂xₜ₂ + εₜ ,    (3.25)

on the grounds that the fitted regression plane is constrained to pass through the origin rather than the point (ȳ, x̄₁, x̄₂), and that we should consider instead the model

Yₜ = β₀ + β₁xₜ₁ + β₂xₜ₂ + εₜ .    (3.26)

The simulation program was incapable of handling this setup, but the similar model

Yₜ = β₁ + β₂xₜ₂ + β₃xₜ₃ + εₜ    (3.27)

was subjected to examination. This is (1.1) for p = 3 with s₁₂ = s₁₃ = 0, and differs from (3.26) in that b₂₁ and b₃₁ are both unequal to ȳ.

The result of the simulation of (3.27) was that the performance of the b₂ⱼ and b₃ⱼ was almost indistinguishable from that of the b₂ and b₃ estimators in (3.25).⁹

Finally, the rounding of wⱼ* to the nearest integer was found to have a negligible effect on the relative efficiency of b₅ⱼ as compared with its employment exactly as in (3.12).

3.4 Discussion of the Estimators

Prior to the commencement of the simulation, it was conjectured that b₂ would be the "worst" of the proposed estimators because of the basic simplicity of its modification of b₁. The modifying factor is a function only of b₁'X'Y/σ̂², a summary statistic that indicates the departure of the entire vector β from the null vector. For our coordinatewise criterion of goodness, an estimator that separately modified each b₁ⱼ in a way peculiar to that coordinate was thought to have a greater chance of success. Moreover, it seemed reasonable to guess that the quality of b₂ would decrease as p increased, because b₁'X'Y/σ̂² contains a relatively decreasing amount of information about each individual b₁ⱼ as the number of coordinates grows.

The results of the simulation show that at least the first of these conjectures was far from the truth. Ê₂ is less than either Ê₃ or Ê₅ far more often than the contrary is true. For r > 0, Ê₂ is only rarely greater than 1 (i.e., b₂ⱼ less efficient than b₁ⱼ). As r approaches +1, one is increasingly unlikely to encounter a (λ₁, λ₂) for which Ê₂ > 1, but on the other hand, the increases in efficiency over b₁ⱼ become increasingly negligible.

⁹This is a rather loose statement, because for p > 2 it is not quite certain which r's and λ's are the relevant ones for E₂ and E₃.


The asymmetry of Ê₂ with respect to changes in the sign of r is a rather puzzling finding. In Table 3.1, the discrepancy between Ê₂ in the two cases r = .7 and r = −.7 is far greater than can be accounted for by the stated error inherent in the simulation. The Gauss-Markov estimator is symmetric in r, and intuitively the estimation problem seems to be of equal difficulty when the sign of r is changed. No adequate analytical explanation has been conceived for this phenomenon.

As previously mentioned, the values b₂ⱼ were very close to the corresponding b₁ⱼ. The modifying factor b₁'X'Y/(b₁'X'Y + σ̂²) can be written as pF/(pF + 1), where F follows the variance-ratio distribution with p and n − p degrees of freedom and noncentrality parameter λ′ₙ. Since F is likely to vary directly with λⱼ, we see that the modification can be expected to be considerable for βⱼ near zero, but only slight for βⱼ distant from the origin. For small |βⱼ|, the modification is in the "correct direction"; thus Ê₂ is small for λⱼ near zero.

Let us compare b₁ⱼ with b₆ⱼ, the family of estimators consisting of constant fractions of b₁ⱼ, j = 1, 2. From (3.14) we have

E₆ = qⱼ² + (1 − qⱼ)²λⱼ ;

i.e., the relative m.s.e. efficiency of b₁ⱼ to b₆ⱼ is actually linear in λⱼ. For purposes of comparison, we present in Table 3.2 a listing

of E₆ for several values of qⱼ and the same set of λⱼ values appearing in Table 3.1.

Table 3.2 shows that b₆ⱼ is an extremely attractive alternative to b₁ⱼ for some range of qⱼ, as long as one is quite confident that λⱼ does not exceed a certain value. As qⱼ increases to 1, the relative efficiency gets very close to 1 even for small values of λⱼ, but at the same time the value of λⱼ below which E₆ is less than 1, namely (1 + qⱼ)/(1 − qⱼ), increases markedly. E₆ is minimized at λⱼ = λⱼ₀ by choosing qⱼ = λⱼ₀/(1 + λⱼ₀). Therefore, unless one is an intransigent minimaxer or knows that λⱼ is quite likely to be large, there is probably some qⱼ for which b₆ⱼ is preferable to both b₁ⱼ and b₂ⱼ. If the contrary is true, the choice of some other estimator is indicated. The evaluation of b₆ can obviously be carried much further if one is willing to attribute a prior probability distribution to λⱼ.

Table 3.2  Relative efficiency E₆ for various values of qⱼ

  qⱼ \ λⱼ     .5     1.0     1.5      2       5      10     100

  .30       .335   .580    .825    1.07    2.54    4.99    49.1
  .50       .375   .500    .625    .750    1.50    2.75    25.3
  .70       .535   .580    .625    .670    .940    1.39    9.49
  .90       .815   .820    .825    .830    .860    .910    1.81
  .95       .904   .905    .906    .908    .915    .928    1.15
  .99       .980   .980    .980    .980    .981    .981    .990
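Since E₆ has the closed form qⱼ² + (1 − qⱼ)²λⱼ, the entries of Table 3.2, the break-even value (1 + qⱼ)/(1 − qⱼ), and the minimizing choice qⱼ = λⱼ₀/(1 + λⱼ₀) can all be checked directly. A short sketch (illustrative, not part of the thesis):

```python
def E6(q, lam):
    """Relative m.s.e. efficiency of b6j = q * b1j to b1j: q^2 + (1 - q)^2 * lam."""
    return q * q + (1.0 - q) ** 2 * lam

def breakeven(q):
    """Value of lambda_j below which E6 < 1, i.e. below which b6j beats b1j."""
    return (1.0 + q) / (1.0 - q)

def optimal_q(lam0):
    """Choice of q minimizing E6 at lambda_j = lam0."""
    return lam0 / (1.0 + lam0)

row = [round(E6(0.30, lam), 3) for lam in (0.5, 1.0, 1.5, 2.0, 5.0, 10.0, 100.0)]
print(row)             # first row of Table 3.2 (49.09 is rounded to 49.1 there)
print(breakeven(0.5))  # with q = .5, b6j beats b1j for all lambda_j below this
```

The break-even and minimizing formulas follow from setting E₆ = 1 and dE₆/dq = 0, respectively.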


b₂ is similar in form to the Stein-James estimator, which is applicable when X'X = I and p > 2. Using the "optimal" γ = (p − 2)(n − p)/(n − p + 2), the modifying constant is seen from (2.1) to be

1 − (p − 2)(n − p)σ̂² / [(n − p + 2)Y'XX'Y] ,

while that for b₂ is Y'XX'Y/(Y'XX'Y + σ̂²). The two are quite similar, and one is led to speculate that the decrease in relative efficiency due to the employment of (2.1) rather than b₁ is often negligible.
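To make the comparison concrete, the two multipliers can be evaluated side by side. The sketch below uses illustrative values only (S stands for Y'XX'Y), with the Stein-James form quoted above:

```python
def stein_james_factor(S, sigma2_hat, p, n):
    """Stein-James multiplier from (2.1) with gamma = (p-2)(n-p)/(n-p+2)."""
    return 1.0 - (p - 2) * (n - p) * sigma2_hat / ((n - p + 2) * S)

def b2_factor(S, sigma2_hat):
    """Multiplier defining b2: S / (S + sigma_hat^2)."""
    return S / (S + sigma2_hat)

p, n, sigma2_hat = 4, 25, 1.0          # hypothetical values
for S in (5.0, 20.0, 100.0):
    print(S, stein_james_factor(S, sigma2_hat, p, n), b2_factor(S, sigma2_hat))
# both multipliers increase toward 1 as S = Y'XX'Y grows relative to sigma_hat^2
```

The numerical closeness of the two columns for moderate-to-large S is what motivates the speculation above.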

It is fairly safe to conclude from Table 3.1 that Ê₂ decreases to 1 as λⱼ → ∞. This observation is easy to explain. Unless βⱼ = 0, the increase in λⱼ corresponds to σ² → 0; i.e., b₁'X'Y/(b₁'X'Y + σ̂²) → 1, or b₂ → b₁. Similar explanations can be given for Ê₃ and Ê₅. (Recall that E₆ increases without bound as λⱼ → ∞.)

Theorem 1 concerning the asymptotic behavior of b₂ embodies a regularity condition, (3.15) or (3.18). These conditions require that the sequences {xₜⱼ}, t ≥ 1, do not dwell near the origin. The hypothesis of Theorem 2 includes the weaker statement "limₙ→∞ sʲʲ = 0." It seems improbable that any of these requirements will often fail to be met in practice (especially when one is working with time series data).

Since for p = 2 the relative efficiency of b₃ⱼ depends only on λⱼ, the quality of this estimator is unlikely to undergo substantial change as p increases, so long as p doesn't get too close to n. But the increase in p should have some effect on the precision of the estimators b₁ⱼ and σ̂², and on the value of sʲʲ; hence on b₁ⱼ²/(b₁ⱼ² + σ̂²sʲʲ) also.


The estimator b₃ⱼ also was not a priori expected to be successful because, whereas b₂ contains too little information concerning the individual b₁ⱼ, b₃ⱼ ignores the behavior of all components in b₁ aside from b₁ⱼ. The fact that Ê₃ is unaltered by changes in r leads to the conclusion that m.s.e.(b₃ⱼ; βⱼ) depends on r only to the same extent as var(b₁ⱼ); i.e., through sʲʲ. Note also that Ê₅ is symmetric in r.

The restriction of b₅ to the two-regressors case severely limits its applicability. Certain facts that are always true when p = 2 are occasionally false for p ≥ 3 (two of these were mentioned in Section 3.1.4), and another is the nonexistence of partial correlations among the columns of X until p > 2. It is intriguing that the Stein-James estimator is valid only for p > 2; i.e., when and only when b₅ is (in general) not.

The estimators b₂, b₃, and b₆, as well as the Stein-James estimator, alter b₁ by shifting each of its components closer to the origin. This type of modification is the most obvious way to decrease the variance of an estimator:

var(cθ̂) = c² var(θ̂) < var(θ̂) ,    0 < c < 1 ,    (3.28)

where c is a constant; and one would expect (3.28) to hold even if c is a random variable, unless the choice of c and θ̂ is rather bizarre. This is one explanation for the relative success of b₂, b₃, and b₆ as contrasted with that of b₄ and b₅.


In addition to their relatively poor performance in "improving" upon b₁, the latter two estimators permit an occasional difference in arithmetic sign between b₁ⱼ and b₄ⱼ. In many contexts, incorrect estimation of the sign of βⱼ can be a serious error. The first three estimators listed above perform no worse on this score than b₁.¹⁰

At the outset of this investigation we noted that the need for alternatives to b₁ becomes especially acute as |r| ↑ 1. Since r is a known quantity, an alternative to b₁ that performs well only for |r| near 1 would have been just as welcome as one that is relatively efficient for all r. Unfortunately, none of the new estimators displays any dramatic overall improvement in relative efficiency as |r| ↑ 1.

¹⁰Of course, if the investigator incurs a special loss from incorrect estimation of sign, this information should be included in the specification of his risk function. In practice, however, this is infrequently done.


4. SUMMARY, CONCLUSIONS AND RECOMMENDATIONS

4.1 Summary

This thesis is an attempt to provide alternatives to best linear unbiased (Gauss-Markov) estimation in the general linear hypothesis model of full rank (Graybill, 1961). Alternatives are especially desirable in the presence of multicollinearity, because the variances of the Gauss-Markov estimators may then be excessively large. By changing the criterion of goodness to mean square error in each separate coordinate of the vector estimator, it is occasionally possible to construct slightly biased estimators having far smaller variances than those of the usual estimator. It is felt that when the statistician's aim is efficient structural estimation (rather than prediction), few people in practice would have serious reservations about this minor change in loss structure.

Five new estimators {bᵢ, i = 2, ..., 6} are constructed and presented as prospective applicable alternatives to the Gauss-Markov estimator (called b₁ herein). Each of the proposed estimators takes the form of a modification of b₁.

The direct determination of the quality of the new estimators was possible only for b₆. That of the remaining four estimators was disclosed in the two-regressors case by a computer simulation experiment. Statements are made concerning the prospects for generalization of the results of the simulation to situations where there are three or more independent variables. The results of the simulation are presented in a table of estimated relative mean square error

efficiencies of b₁ to the {bᵢ}. The entries in the table are found to be dependent upon various combinations of {r, λ₁, λ₂}, where r denotes the correlation between the two regressors and the {λⱼ} are noncentrality parameters of conventional statistical tests relating to the model. Results concerning the asymptotic properties of the new estimators are given for b₂, b₃, b₄ and b₆.

The three estimators b₂, b₃ and b₆ are all of the form

bᵢⱼ = gᵢ b₁ⱼ ,    i = 2, 3, 6 ,    (4.1)

where bᵢⱼ and b₁ⱼ refer to the j-th coordinate of the vector bᵢ or b₁, and gᵢ is a random variable bounded by 0 and 1. As a general rule, these three estimators are found to be preferable by far to either b₄ or b₅. Their relative efficiencies are less than 1 for a surprisingly wide range of {r, λ₁, λ₂}. The estimator b₆ is in many cases an attractive alternative to b₁, but at other times it has some extremely unfavorable properties that premonish against its use. The use of b₄ or b₅ is not advised under any circumstances.

An estimator that has frequently appeared in the statistical literature, originally conceived by James and Stein (1961), also happens to be of the form (4.1). The conclusions which emerge from the simulation herein lead to some tentative notions about the behavior of this estimator. The results in this thesis are appraised in the light of the findings in James and Stein (1961) and recent research along similar lines by other investigators.

A detailed account of the simulation study, with particular emphasis on its design aspects, is presented in the Appendix.


4.2 Conclusions and Recommendations

One of the vital (though too often underemphasized) properties of b₁ is its robustness to departures from the distributional assumptions made for ε. Before certifying any of the new estimators other than b₆ for actual use in practice, a study must be made of their robustness. Another aspect of the {bᵢ} that needs to be examined is their sensitivity to minor changes in X. When multicollinearity is present to a serious extent, b₁ is overly responsive to such changes. It seems unlikely, however, that the proposed estimators will be much less sensitive than b₁, because of their heavy functional dependence on it.

Among the many virtues of b₁ is the ease of obtaining a best quadratic unbiased estimate of its variance. No means of obtaining "good" estimates of measures of reliability of the new estimators has been presented herein. In view of the fact that the exact small-sample moments of b₂, b₃, b₄ and b₅ are unknown, the construction of such estimates is likely to be a difficult analytical problem.

Since the results of the simulation are almost exclusively limited to the p = 2 case, it is necessary to consider the probable effects of the relaxation of this assumption. My guess is that so long as p does not get "too close" to n, the relative efficiencies E₂ and E₃ will behave similarly to what has been discovered when there are only two regressors in the model.

The absence of any knowledge of the values of the {λⱼ} has been an explicit assumption throughout this thesis, because β and σ² are unknown in (1.1). (If there exists such prior knowledge and it is


formally considered to be an inherent part of the model, b₁ may lose many of its optimum properties.) Hence it is impossible to recommend any single bᵢ over all others, because none of them is uniformly best over all {λⱼ}.

More often, however, the investigator has some idea of the range of the {λⱼ}, though it may be difficult to incorporate such vague information into the estimation procedure. Depending upon his willingness to risk using an inefficient estimator in order to have the opportunity to use a possibly efficient one, he may wish to consider b₂ or b₆ as an alternative to b₁. b₂ⱼ might be considered if λⱼ is known to be rather small, and b₆ⱼ merits attention if he is sure that λⱼ is not very large. With the use of b₂ one has little to gain but little to lose, while the employment of b₆ can lead to appreciable gain or extreme regret. It is difficult to conceive of circumstances where one would wish to use b₃, b₄ or b₅.

A new line of research that this brings to mind is a two-stage estimation procedure wherein we first estimate the {λⱼ} (say with {λ̂ⱼ}), and based on these estimates choose some estimator (possibly b₁) that is relatively efficient for the {λⱼ} in some neighborhood of the {λ̂ⱼ}. A convenient estimator is λ̂ⱼ = b₁ⱼ²/(σ̂²sʲʲ), which is distributed as a noncentral F variate with 1 and n − p degrees of freedom and noncentrality parameter λⱼ.

Consider, for instance, the following outline of a two-stage procedure utilizing b₆, which supposes that our objective is to use b₆ⱼ rather than b₁ⱼ subject to the guarantee (with probability at least 1 − α) that E₆ⱼ ≤ 1, for some preassigned α ∈ (0, 1). First choose λⱼ₀ such that

Pr_{λ̂ⱼ} [ λⱼ ≤ λⱼ₀ ] ≥ 1 − α .¹¹    (4.2)

Then choose qⱼ just large enough so that E₆ⱼ(qⱼ) = 1 when λⱼ = λⱼ₀; viz., qⱼ = (λⱼ₀ − 1)/(λⱼ₀ + 1).
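The outline above can be turned into a small computational sketch. Here the first-stage upper confidence bound λⱼ₀ from (4.2) is taken as given, and the function names are illustrative rather than from the thesis:

```python
def E6(q, lam):
    """Relative m.s.e. efficiency of b6j = q * b1j: q^2 + (1 - q)^2 * lam."""
    return q * q + (1.0 - q) ** 2 * lam

def q_for_guarantee(lam0):
    """Smallest q with E6(q, lam) <= 1 for every lam <= lam0.
    Solving E6(q, lam0) = 1 gives q = (lam0 - 1)/(lam0 + 1)."""
    return max(0.0, (lam0 - 1.0) / (lam0 + 1.0))

def two_stage_b6(b1j, lam0):
    """Second stage: shrink the Gauss-Markov coordinate by the guaranteed q."""
    return q_for_guarantee(lam0) * b1j

lam0 = 3.0                 # hypothetical bound satisfying (4.2)
q = q_for_guarantee(lam0)  # (3 - 1)/(3 + 1) = 0.5
print(q, E6(q, lam0))      # E6 equals 1 exactly at lam = lam0
```

Because E₆ is increasing in λⱼ, equality at λⱼ₀ guarantees E₆ ≤ 1 over the whole confidence region, which is the sense of the guarantee above.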

The prior literature dealing with the subject of this thesis is hardly more encouraging than the results reported here. Of the references surveyed in Chapter 2, only two give one much over which to be optimistic concerning the likelihood of significant future progress in the study of biased estimation of regression coefficients. I am impressed with the finding by James and Stein (1961) that (with their loss function) b₁ is an inadmissible estimator for p > 2. But admissibility is not often a crucial property of estimators for the applied statistician, because it is so rare that he cannot (with the knowledge of theoretical considerations underlying the model) place some sort of bounds on the likely ranges of parameters to be estimated. Conversely, inadmissible estimators are not to be hastily abandoned. James and Stein have made no mention of the probable quality of their estimator. As indicated earlier, our results concerning the quality

¹¹There exists a uniformly most powerful test of H₀: λⱼ ≤ λⱼ₀ vs Hₐ: λⱼ > λⱼ₀ based on the noncentral beta distribution. See Toro-Vizcarrondo and Wallace (1968) and Toro-Vizcarrondo (1968) for a full discussion. It follows from Lehmann (1966, pp. 68, 80) that there is a uniformly most accurate confidence bound for λⱼ of the form indicated in (4.2).

Page 48: ON THE ESTIMATION OF REGRESSION COEFFICIENTS WITH A …boos/library/mimeo.archive/... · 2019-07-25 · • ON THE ESTIMATION OF REGRESSION COEFFICIENTS WITH A COORDINAIEWISE MEAN

46

of b₂ give rise to an educated conjecture that the improvement of (2.1) over b₁ will often be insignificant. Moreover, for the reasons given in Section 1.4, the applicability of a weighted-sum-of-mean-square-errors loss function is often highly doubtful. It is hoped that future work along these lines by mathematical statisticians will be somewhat more considerate of the needs of experimental researchers, not the least of which is a loss structure of form (1.8) rather than (1.6).

While employing the loss structure (1.7), Hoerl and Kennard (1970a, b) have taken a fresh, novel approach to the whole problem, which for several examples they present has been an unqualified success. The question of the stability of b₁ in the face of small changes in the data is in this context equivalent to the problem of large variances. It remains to be seen how much more stability can be achieved without adding large biases to the individual estimators.

The prospects for future major improvements upon Gauss-Markov estimation are not particularly promising. Aside from the ridge regression procedure, the few successes to date are of limited applicability, because they either presuppose much prior knowledge about the {λⱼ} or are improvements to only a negligible extent. I think there is some chance that two-stage estimation procedures of the sort discussed above may yield slightly better estimators than those examined herein; but it should be recognized that ease of computation is one of the virtues of b₁, and as we proceed to explore increasingly complex estimators, we must begin to consider whether the extra computational effort is justified by the prospective gain in efficiency.

Page 49: ON THE ESTIMATION OF REGRESSION COEFFICIENTS WITH A …boos/library/mimeo.archive/... · 2019-07-25 · • ON THE ESTIMATION OF REGRESSION COEFFICIENTS WITH A COORDINAIEWISE MEAN

47

The Gauss-Markov and Rao-Blackwell Theorems are results of remarkable conceptual simplicity. If one must rule out the possibility of bringing additional information to bear, I intuitively feel that the absence of similarly appealing theorems for estimation with a mean square error criterion of goodness signifies that a truly satisfying solution to the problem (as cast in this thesis) will never be attained.

5. LIST OF REFERENCES

Bancroft, T. A. 1944. On biases in estimation due to the use of preliminary tests of significance. Annals of Mathematical Statistics 15:190-204.

Baranchik, A. J. 1964. Multiple regression and estimation of the mean of a multivariate normal distribution. Technical Report No. 51, Department of Statistics, Stanford University, Stanford, California.

Baranchik, A. J. 1970. A family of minimax estimators of the mean of a multivariate normal distribution. Annals of Mathematical Statistics 41:642-645.

Bhattacharya, P. K. 1966. Estimating the mean of a multivariate normal population with general quadratic loss function. Annals of Mathematical Statistics 37:1819-1824.

Bodewig, E. 1956. Matrix Calculus. North Holland Publishing Co., Amsterdam.

Cramér, H. 1963. Mathematical Methods of Statistics. Princeton University Press, Princeton, New Jersey.

Farrar, D. E., and Glauber, R. R. 1967. Multicollinearity in regression analysis: the problem revisited. Review of Economics and Statistics 49:92-107.

Fraser, D. A. S. 1966. Nonparametric Methods in Statistics. John Wiley and Sons, Inc., New York, New York.

Graybill, F. A. 1961. An Introduction to Linear Statistical Models, Vol. 1. McGraw-Hill Book Co., Inc., New York, New York.

Hoerl, A. E., and Kennard, R. W. 1970a. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12:55-68.

Hoerl, A. E., and Kennard, R. W. 1970b. Ridge regression: applications to nonorthogonal problems. Technometrics 12:69-82.

James, W., and Stein, C. 1961. Estimation with quadratic loss. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability 1:361-379. University of California Press, Berkeley and Los Angeles.

Kendall, M. G., and Stuart, A. 1967. The Advanced Theory of Statistics, Vol. II. Hafner Publishing Co., New York, New York.


Lehmann, E. L. 1966. Testing Statistical Hypotheses. John Wiley and Sons, Inc., New York, New York.

Loève, M. 1963. Probability Theory. D. Van Nostrand Co., Inc., Princeton, New Jersey.

Malinvaud, E. 1966. Statistical Methods of Econometrics. Rand McNally and Co., Inc., Chicago, Illinois.

Rao, C. R. 1965. Linear Statistical Inference and Its Applications. John Wiley and Sons, Inc., New York, New York.

Sclove, S. L. 1966. Improved estimation of regression parameters. Technical Report No. 125, Department of Statistics, Stanford University, Stanford, California.

Sclove, S. L. 1968. Improved estimation for coefficients in linear regression. Journal of the American Statistical Association 63:596-606.

Stein, C. 1956. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability 1:197-206. University of California Press, Berkeley and Los Angeles.

Toro-Vizcarrondo, C. 1968. Multicollinearity and the mean square error criterion in multiple regression: a test and some sequential estimator comparisons. Unpublished Ph.D. thesis, Department of Experimental Statistics, North Carolina State University at Raleigh. University Microfilms, Ann Arbor, Michigan.

Toro-Vizcarrondo, C., and Wallace, T. D. 1968. A test of the mean square error criterion for restrictions in linear regression. Journal of the American Statistical Association 63:558-572.


6. APPENDIX: THE SIMULATION DESIGN AND PROGRAM

A simulation experiment was used to compute the estimated relative efficiencies appearing in Table 3.1.¹² As explained in Section 3.3, the computations in the table are based on the assumptions that the random errors are normally distributed and n = 25.

The input for the simulation consists of the full rank matrix X (n × p), the parameter vector β, and σ². The program generates the n random N₁(0, σ²) disturbances which comprise e, and computes the vector Y = Xβ + e. Then, pretending that we do not know β and σ², it calculates from X and Y the values of the estimators b₁, b₂, b₃, b₄ and b₅ (with the exception that the calculation of b₅ is omitted if p > 2). This operation is repeated with a new random e for a total of N iterations, and estimates

    m.s.e.(b_ij; β_j) = ave (b_ij − β_j)²                        (6.1)

are computed for i = 1, 2, 3, 4, 5 and j = 1, 2, ..., p. In (6.1), "ave" refers to the average value over iterations. Finally, the relative efficiencies are estimated according to

    Ê_i = m.s.e.(b_ij; β_j) / m.s.e.(b_1j; β_j),  j fixed.       (6.2)

While m.s.e.(b_1j) is known to be equal to σ²s^jj, the estimate rather than the population value was used in the denominator of (6.2), to check the effect of any systematic pattern that might have been present in the nN generated errors.

¹²I am grateful to Mr. James Goodnight, Department of Experimental Statistics, North Carolina State University at Raleigh, for writing the computer program used in this study.
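The iteration scheme just described can be sketched in modern terms. The following Python/NumPy fragment is not the original program (which was written specially for this study); the function names are hypothetical, and only the least squares estimator b₁ is implemented here, with the b₂ through b₅ of Chapter 3 to be supplied as additional functions. It computes (6.1) by averaging squared errors over iterations and (6.2) with the estimated m.s.e. of b₁ in the denominator, as described above.

```python
import numpy as np

def b1(X, Y):
    """Ordinary least squares: b1 = (X'X)^{-1} X'Y."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

def simulate_efficiencies(X, beta, sigma2, estimators, N=500, seed=1):
    """Monte Carlo estimates of m.s.e.(b_ij; beta_j) per (6.1) and of the
    relative efficiencies per (6.2), relative to the estimator keyed 'b1'."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    draws = {name: np.empty((N, p)) for name in estimators}
    for it in range(N):
        e = rng.normal(0.0, np.sqrt(sigma2), size=n)  # N_1(0, sigma^2) disturbances
        Y = X @ beta + e
        for name, est in estimators.items():
            draws[name][it] = est(X, Y)
    # (6.1): average squared error over iterations, coordinate by coordinate
    mse = {name: ((b - beta) ** 2).mean(axis=0) for name, b in draws.items()}
    # (6.2): estimated m.s.e. of b_1 in the denominator
    ls = mse["b1"]
    return {name: m / ls for name, m in mse.items()}
```

A run with `estimators={"b1": b1}` returns efficiencies identically 1 for b₁, since the same estimated m.s.e. appears in numerator and denominator; alternative estimators are compared by adding them to the dictionary.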

Clearly, the input quantities were not chosen haphazardly. The major task in the design of the simulation was to answer the question, "In what respects can X, β, and σ² be selected without loss of generality?" It was determined that, for p = 2, they can be taken arbitrarily subject to their leading to the desired values of the variables r, λ₁, and λ₂ defined in Section 3.3. To describe the behavior of an Ê_i, we estimate it for a number of configurations of the quantities upon which the estimate depends.

Equations (3.24) were arrived at through what was essentially a trial and error procedure. For example, Ê₃ was unaffected by a change in σ² while Ê₂ was not, but doubling each of β_j, σ², and s^jj (keeping r constant) left Ê₂ invariant.

Given a finite amount of available computer time, it was necessary to choose a rather limited number of the r, λ₁, and λ₂. Four r's were chosen: .3, .7, .98, and −.7. These are, roughly speaking, a low, average, high and average negative correlation, respectively. The 7 or 8 values of the "more important λ" were deemed sufficient to give a good indication of the functional relationship under consideration.

In practice, σ² was set equal to 1. Next, X was conveniently chosen subject to its yielding the desired r. The choice of X fixed the s^jj. Then the β_j were selected so as to give the desired

    λ_j = β_j² / (σ² s^jj),  j = 1, 2.
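The selection of the β_j just described amounts to solving the λ_j formula for β_j once X (and hence the s^jj, the diagonal elements of (X'X)⁻¹) is fixed. A minimal sketch, with a hypothetical function name:

```python
import numpy as np

def beta_for_lambda(X, lam, sigma2=1.0):
    """Return beta with beta_j = sqrt(lam_j * sigma^2 * s^jj), so that each
    coordinate attains the target lambda_j = beta_j^2 / (sigma^2 * s^jj),
    where s^jj is the j-th diagonal element of (X'X)^{-1}."""
    s_jj = np.diag(np.linalg.inv(X.T @ X))
    return np.sqrt(np.asarray(lam) * sigma2 * s_jj)
```

(The positive square root is taken; the sign of β_j does not affect λ_j.)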


Another major problem that had to be tackled was the method of choice of the number of iterations, N. A large number of iterations was needed to stabilize the sample estimates (6.2), but the computer time involved was roughly in proportion to N. To check on their stabilization, the cumulative efficiencies Ê_i were printed out at intervals of 100 iterations. It was found that by taking N = 500, Table 3.1 could be constructed to the degree of accuracy indicated in Section 3.3. This was thought to be adequate in view of the goal of the simulation, which was merely to make a comparison between estimators and not a formal tabulation of moments. The Ê_i in the table were informally (but not casually) obtained from a careful examination of the results at the end of 300, 400, and 500 iterations. As an illustration of the procedure employed, we consider two examples.

For computing Ê₃ with λ_j = 1 and r = .7, the estimates after 300, 400 and 500 iterations were .786, .768, and .773 respectively. Thus .77 was employed in Table 3.1. In Section 3.3, an accuracy of ±.02 was claimed for the Ê_i; i.e., that this Ê₃ lies between .75 and .79. This assumption seems fairly safe in view of the stepwise estimates obtained above. Next consider the computation of Ê₂ with λ₁ = λ₂ = 10 and r = .98. Here the values of Ê₂ after 300, 400 and 500 iterations were .9979, .9983, and .9982 respectively. Thus the value .998 was used for Table 3.1. It is even clearer in this case that we have ±.02 accuracy for our estimate; it is not unlikely that the true accuracy is as fine as ±.001.
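The stabilization check described above, cumulative efficiencies examined at intervals of 100 iterations, can be sketched as follows. This is an illustrative reconstruction (hypothetical name), not the original program; it assumes the per-iteration squared errors of an estimator b_i and of b₁ for one coordinate have been saved.

```python
import numpy as np

def cumulative_efficiency(sq_err_i, sq_err_1, every=100):
    """Cumulative relative efficiency after each block of `every` iterations:
    mean squared error of b_i over the first k iterations divided by that of
    b_1, for k = every, 2*every, ..., mirroring a printout every 100 runs."""
    out = []
    for k in range(every, len(sq_err_i) + 1, every):
        out.append(sq_err_i[:k].mean() / sq_err_1[:k].mean())
    return out
```

With N = 500 and `every=100` this yields five values; agreement among the last few (as in the .786, .768, .773 example above) signals stabilization.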

It would have been preferable to choose N according to some stopping procedure built into the program; this would have assured that the Ê_i are measured with approximately equal precision. But it was felt that such an inordinate complication of the program would not significantly enhance the quality of the study.

The same nN = 12,500 random N₁(0, 1) numbers were used (in the same order) for each entry in Table 3.1 in order to insure the ceteris paribus nature of the measurement of the effect of a change in estimator, r, or λ's.

For each entry in the table, the corresponding estimates were made of "the proportion of m.s.e. attributable to squared bias" by computing the ratios

    bias²(b_ij) / m.s.e.(b_ij; β_j),  i = 1, 2, 3, 4, 5,

where bias(b_ij) = ave(b_ij) − β_j.
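The ratio above rests on the decomposition m.s.e. = variance + bias², so it lies between 0 and 1. A minimal sketch for one coordinate (hypothetical name), using the same "ave" over iterations as in (6.1):

```python
import numpy as np

def bias_squared_proportion(b_draws, beta_j):
    """Proportion of estimated m.s.e. attributable to squared bias for one
    coordinate: bias^2 / m.s.e., with bias = ave(b_ij) - beta_j and
    m.s.e. = ave (b_ij - beta_j)^2 as in (6.1)."""
    bias = b_draws.mean() - beta_j
    mse = ((b_draws - beta_j) ** 2).mean()
    return bias ** 2 / mse
```

An unbiased estimator yields a proportion near 0; a degenerate (constant) estimator yields exactly 1.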

The construction of Table 3.1 utilized approximately 30 minutes of time on an I.B.M. 360-75 computer. This excludes all time consumed in the design of the simulation and ancillary experiments.

Generalization of these results to p > 2 is likely to present the investigator with formidable problems of simulation design, for it is conceivable that some Ê_i might depend upon as many as p(p − 1)/2 r's and 2^p − 2 λ's.


NORTH CAROLINA STATE UNIVERSITY
INSTITUTE OF STATISTICS
(Mimeo Series available for distribution at cost)

630. Sen, P. K. and M. L. Puri. On some selection procedures in two-way layouts.
631. Loynes, Robert M. An invariance principle for reversed martingales.
632. Simons, Gordon. A martingale decomposition theorem.
633. Sen, P. K. On fixed size confidence bands for the bundle strength of filaments.
634. Ghosh, Malay. Asymptotic optimal non-parametric tests for miscellaneous problems of linear regression.
635. Kelly, Douglas G. Concavity of magnetization as a function of external field strength for Ising ferromagnets.
636. Sproule, Raymond Nelson. A sequential fixed-width confidence interval for the mean of a U-statistic.
637. Loynes, R. M. Stopping times on Brownian motion: Some properties of Root's construction.
638. Wegman, Edward J. Non-parametric probability density estimation.
639. Michaels, Scott Edward. Optimization of testing and estimation procedures for a quadratic regression model.
640. Cole, J. W. L. Multivariate analysis of variance using patterned covariance matrices.
641. Leadbetter, M. R. On certain results for stationary point processes and their application.
642. Fretwell, S. D. On territorial behavior and other factors influencing habitat distribution in birds.
643. Loynes, R. M. Theorems of ergodic type for stationary sequences with missing observations.
644. Johnson, Mark Allyn. On the Kiefer-Wolfowitz process and some of its modifications. Ph.D. Thesis.
645. Sen, P. K. and S. K. Chatterjee. On the Kolmogorov-Smirnov-type test of symmetry.
646. Helms, Ronald William. A procedure for the selection of terms and estimation of coefficients in a response surface model with integration-orthogonal terms. Ph.D. Thesis.
647. Wegman, Edward. Maximum likelihood estimation of a unimodal density, II.
648. Sen, P. K. and Malay Ghosh. On bounded length sequential confidence intervals based on one-sample rank order statistics.
649. Seheult, Allan Henry. On unbiased estimation of density functions. Ph.D. Thesis.
650. Williamson, Norman. Some topics in system theory. Ph.D. Thesis.
651. Weber, Donald Chester. A stochastic model for automobile accident experience. Ph.D. Thesis.