Asymptotic Relative Efficiency

142
Asymptotic Relative Efficiency Pitman EJG (1979) Some Basic Theory for Statistical Inference. London: Chapman & Hall. Noether GE (1950) Asymptotic properties of the Wald-Wolfowitz test of randomness. Annals of Mathematical Statistics 21(2):231-246. Noether GE (1955) On a theorem of Pitman. Annals of Mathematical Statistics 26(1):64-68. Rothe G (1981) Some properties of the asymptotic relative Pitman efficiency. Annals of Statistics 9(3):663-669. Serfling RJ (1980) Chapter 10: Asymptotic relative efficiency. In: Approximation Theorems of Mathematical Statistics, John Wiley & Sons. DasGupta (2008) Chapter 22: Asymptotic efficiency in testing. In: Asymptotic Theory of Statistics and Probability, Springer. Nikitin Y(2011) Asymptotic relative efficiency of two tests. International Encyclopedia of Statistical Science. Serfling RJ (2011) Asymptotic relative efficiency in estimation. International Encyclopedia of Statistical Science.

description

This is a collection of classical and current work on the concept of asymptotic relative efficiency, which was originally due to E. J. G. Pitman. Unfortunately, I was not able to include his well-cited notes from 1949 for his graduate course at Columbia. His 1979 book, however, covers the idea, and the other articles and book chapters cite Pitman as the developer of this important concept.

Transcript of Asymptotic Relative Efficiency

Page 1: Asymptotic Relative Efficiency

Asymptotic Relative Efficiency

Pitman EJG (1979) Some Basic Theory for Statistical Inference. London: Chapman & Hall.

Noether GE (1950) Asymptotic properties of the Wald-Wolfowitz test of randomness. Annals of Mathematical Statistics 21(2):231-246.

Noether GE (1955) On a theorem of Pitman. Annals of Mathematical Statistics 26(1):64-68.

Rothe G (1981) Some properties of the asymptotic relative Pitman efficiency. Annals of Statistics 9(3):663-669.

Serfling RJ (1980) Chapter 10: Asymptotic relative efficiency. In: Approximation Theorems of Mathematical Statistics, John Wiley & Sons.

DasGupta (2008) Chapter 22: Asymptotic efficiency in testing. In: Asymptotic Theory of Statistics and Probability, Springer.

Nikitin Y(2011) Asymptotic relative efficiency of two tests. International Encyclopedia of Statistical Science.

Serfling RJ (2011) Asymptotic relative efficiency in estimation. International Encyclopedia of Statistical Science.

Page 2: Asymptotic Relative Efficiency

)

I'

ii ,I

r

~Some Basic Theory

for Statistical Inference

E.J.G.~ITMAN M.A. IY.Sc. F.A.A.

Emeritus Professor of Mathematics University of Tasmania

LONDON CHAPMAN AND HALL

A Halsted Press Book JOHN WILEY & SONS, NEW YORK

Page 3: Asymptotic Relative Efficiency

First published 1979 by Chapman and Hall Ltd 11 New Fetter Lane, London EC4P 4EE

© 1979 E.J.G. Pitman

Printed in Great Britain at the University Printing House, Cambridge

ISBN 0 412 21720 1

All rights reserved. No part of this book may be reprinted, or reproduced or utilized in any form or by any electronic, mechanical or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publisher.

Distributed in the U.S.A. by Halsted Press, a Division of John Wiley & Sons, New York

Q4 .~ 'l {r

-'-'5'::<7 .. .J !

Library of Congress Cataloging in Publication Data

Pitman, Edwin J. G. Some basic theory for statistical inference.

(Monographs on applied probability and statistics) Includes bibliographical references. 1. Mathematical statistics. I. Title.

QA276.P537 519.5 78-11921 ISBN 0-470-26554-X

Preface

Chapter one

CONTENTS

Basic Principles of the Theory of Inference, The Likelihood Principle, Sufficient Statistics

Chapter two Distance between Probability Measures

Chapter three Sensitivity of a Family of Probability Measures with respect to a Parameter

Chapter four Sensitivity Rating, Conditional Sensitivity, The Discrimination Rate Statistic

Chapter five Efficacy, Sensitivity, The Cramer-Rao Inequality

Chapter six Many Parameters, The Sensitivity Matrix

Chapter seven Asymptotic Power of a Test, Asymptotic Relative Efficiency

Chapter eight Maximum Likelihood Estimation

Chapter nine The Sample Distribution Function

Appendix: Mathematical Preliminaries

References

Index

v

page vii

1

6

11

24

29

50

56

63

79

98

107

109

Page 4: Asymptotic Relative Efficiency

PREFACE

This book is largely based on work done in 1973 while I was a Senior Visiting Research Fellow, supported by the Science Research Council, in the Mathematics Department of Dundee University, and later while a visitor in the Department of Statistics at the University of Melbourne. In both institutions, and also at the 1975 Summer Research Institute ofthe Australian Mathe­matical Society, I gave a series of talks with the general title 'A New Look at Some Old Statistical Theory'. That title indicates fairly well my intentions when I started writing this book.

I was encouraged in my project by some remarks of Professor D.V. Lindley (1972) in his review of The Theory of Statistical I riference by S. Zacks:

One point that does distress me aboUt this book-and let me hasten to say that this is not the fault of the author-is the ugliness of some of the material and the drabness of most of it ... The truth is that the mathematics of our subject has little beauty to it. Is it wrong to ask that a subject should be a delight for its own sake? I hope not. Is there no elegant proof of the consistency of maximum likelihood, or do we have to live with inelegant conditions?

I share Lindley's dissatisfaction with much statistical theory. This book is an attempt to present some of the basic mathematical results required for statistical inference with some elegance as well as precision, and at a level which will make it readable by most students of statistics. The topics treated are simply those that I have been able to do to my own satisfaction by this date.

I am grateful to those who, at Dundee, Melbourne, or Sydney, were presented with earlier versions, and Who helped with their questions and criticisms. I am specially grateful to Professor E.J. Williams, with whom I have had many discussions, and who arranged for and supervised the typing; to Judith Adams and Judith Benney, who did most ofthe typing, and to Betty Laby, who drew the diagrams.

E. J. G. P.

vii

Page 5: Asymptotic Relative Efficiency

CHAPTER ONE

BASIC PRINCIPLES OF THE THEORY OF INFERENCE THE LIKELIHOOD PRINCIPLE

SUFFICIENT STATISTICS

In developing the theory of statistical inference, I fmd it helpful to bear in mind two considerations. Firstly, I take the view that the aim of the theory of inference is to provide a set of principles, which help the statistician to assess the strength of the evidence supplied by a trial or experiment for or against a hypothesis, or to assess the reliability of an estimate derived from the result of such a trial or experiment. In making such an assessment we may look at the results to be assessed from various points of view, and express ourselves in various ways. For example, we may think and speak in terms of repeated trials as for confi­dence limits or for significance tests, or we may consider the effect of various loss functions. Standard errors do give us some comprehension of reliability; but we may sometimes prefer to think in terms of prior and posterior distributions. All of these may be helpful, and none should be interdicted. The theory of inference is persuasive rather than coercive.

Secondly, statistics being essentially a branch of applied mathematics, we should be guided in our choice of principles and methods by the practical applications. All actual sample spaces are discrete, and all observable random variables have discrete distributions. The continuous distribution is a mathe­matical construction, suitable for mathematical treatment, but not practically observable. We develop our fundamental concepts, principles and methods, in the study of discrete distributions. In the case of a discrete sample space, it is easy to understand and appreciate the practical or experimental significance and value of conditional distributions, the likelihood principle, the principles of sufficiency and conditionality, and the method of maximum likelihood. These are then extended to more general distributions by means of suitable definitions and mathematical theorems.

1

Page 6: Asymptotic Relative Efficiency

SOME BASIC THEORY FOR STATISTICAL INFERENCE

Let us consider the likelihood principle. If the sample space is discrete, its points may be enumerated,

Suppose that the probability of observing the point Xr is f(x r , e), and that e is unknown. An .experiment is performed, and the outcome is the point x of the sample space. All that the expen­ment tells us about e is that an event has occurred, the probability of which is f(x, e), which for a given x is a function of e, the likelihood function.

Consider fIrst the case where e takes only two values eo, e1. To decide between eo and ep all that the experiment gives us is the pair of likelihoods f(x, eo),f(x, ( 1). Suppose that e is a random variable, taking the values eo, e1 with probabilities PO,Pl' where Po + P1 = 1. Given the observed value x, the conditional probabilities of eo, e1 are proportional to Pof(x, eo), pd(x, ( 1). The conditional odds of e1 against eo are

(P1/PO) [j(x, ( 1)/f(x, eo)].

The prior odds are P1/PO' and all that comes from the experiment is the likelihood ratio f(x, ( 1)/f(x, eo)'

If f(x,e1)/f(x,eo) = 00, then e=e1, if f(x,e 1)/f(x,eo)=o, then e = eo. Let c be a positive number. Denote the set of x points for which f(x, ( 1)/f(x, eo) = c by A. If xrEA, the e1 conditional probability of xr given A is

f(x r , ( 1) cf(xr, eo) f(x r , eo) L f(x, ( 1) = L cf(x, eo) = L f(x, eo)'

xeA xeA xeA

which is the eo conditional probability of xr given A. Hence the conditional distribution of any statistic T will be the same for e = e1 as for e = eo. Thus when we know f(x, ( 1)/ f(x, eo), knowledge of the value of T gives no additional help in deciding between eo and e1. We express all this by saying that when e takes only two values eo and e1, f(x, ( 1)/ f(x, eo) is a sufficient statistic for the estimation of e. In the ordinary, non-technical sense of the word, all the information supplied by the experiment is contained in the likelihood ratio f(x, ( 1)/ f(x, eo}. When e can take many values, all the information about e given by the experiment is contained in the likelihood function, for from it

2

BASIC PRINCIPLES OF THE THEORY OF INFERENCE

we can recover any required likelihood ratio. This is the likeli­hood principle.

For a discrete sample space, where the observed sample has a non-zero probability, this seems sound; but when we come to a continuous distribution, we are dealing with an observation which has zero probability, and the principle seems not so intuitively appealing. However, the extension to the continuous distribution seems reasonable when we regard the continuous distribution as the limit of discrete distributions.

If the probability measure Pe, on the sample space has density f(x, e) with respect to a O'-fmite measure jl, it is still true that when e can take only two values eo, e l' all conditional probabili­ties given f(x, ( 1)/ f(x, eo) = c (positive and fmite) are the same for e = e1 as for e = eo' The likelihood ratio f(x, ( 1)/f(x, eo) is a sufficient statistic; but it should be remembered that when the probability of the conditioning event is zero, a conditional probability is not operationally verifIable. Conditional probabili­ties, in the general case, are defIned as integrands with certain properties.

This extension of the likelihood principle is supported by the Neyman-Pearson theorem, which says that, for testing the simple hypothesis H 0 that a probability distribution has a density of fo against the alternative hypothesis H 1 that the density function is f1' the most powerful test of given size (probability of rejecting Ho when true) is based on the likelihood ratiof1(x)/fo(x). There is a critical value c; sample points which reject H 0 have f1(X)/fo(x) 2:: c, and those which do not reject Ho have f1(X)/fo(x)::;; c.

Suppose that a statistician has to make n tests of this kind, each of a simple hypothesis H 0 against a simple alternative H 1 ,

the experiment and the hypotheses in one test having no connec­tion with those in another. Suppose also that for the average size of his tests (probability of rejecting H 0 when true), he wishes to maximize the average power (probability of rejecting H 0

when H 1 is true). It can be shown (Pitman, 1965) that to do this, he must use the same critical value c for the likelihood ratio in all the tests. This result suggests that, to some extent, a parti­cular numerical value of a likelihood ratio means the same thing, whatever the experiment. This is obvious when we have prior probabilities for H 0 and H 1 : for if the prior odds for H 1 against

3

Page 7: Asymptotic Relative Efficiency

SOME BASIC THEORY FOR STATISTICAL INFERENCE

Ho are PI/PO' the posterior odds are (Pl/PO)(fl/fo), and all that comes from the experiment is the ratio fl / fo. This is the argument with which we started for the discrete sample space. Its extension to the non-discrete sample space needs some sort of justification as given above, because conditional probabilities are not then directly verifiable experimentally.

Instead of the likelihood ratio f(x, ()1)/ f(x, eo) we may use its logarithm, logf(x, ( 1) -logf(x, eo), which we may call the discrimination of e 1 against eo for the point x. If e is a real parameter, taking values in a real interval, the average rate of discrimination between eo and e1 is

logf(x, ( 1) -logf(x, eo)

e 1 - eo

The limit of this when e 1 ~ eo, if it exists, is

d 10g~X' e) 19=90 = f'(x, eo)/f(x, eo),

where the prime denotes differentiation with respect to e. This is the discrimination rate at eo for the point x. It plays a central role in the theory of inference.

Suppose that for values of e in some set A, the density function factorizes as follows.

f(x, e) = g[T(x), e]h(x) , [Ll]

where T, h are functions of x only, and g, h are non-negative. For any e l' e z E A, the likelihood ratio

f(x,e z)/f(x,e1) = g[T(x),ez]/g[T(x),e1],

is a function of T(x), and so its value is determined by the value of T. This is true for every pair of values of e in A. Hence all the information about e provided by the experiment is the value of T(x). T is a sufficient statistic for the set A. Note that T may be real or vector-valued.

We may look at this in another way. Relative to the measure fl, f(x, e) is the density at x of the probability measure P 9 on fiE. If when [Ll] is true we use, instead of fl, the measure v defined by dv = h(x)dfl, the density of the probability measure becomes g[T(x), e] at x. Clearly all the information about e provided by the experiment is the value of T(x).

4

BASIC PRINCIPLES OF THE THEORY OF INFERENCE

Conversely, if the value of some statistic T is all the information about e provided by the observation x, for eEA, then for any fixed eoEA, the likelihood ratio f(x, e)/f(x, eo) must be a function of T and e only for eEA,

f(x, e)/f(x, eo) = g[T(x), e]

f(x, e) = g[T(x), e]f(x, eo)

= g[T(x),e]h(x).

If a sufficient statistic T exists for eEN, where N is a real open interval containing eo,

L = logf(x, e) = log g[T(x), e] + log hex)

L~ = g'[T(x),eo]/g[T(x),eo], whereg' = ag/ae.

The discrimination rate L~ is a function of the sufficient statis­tic T.

5

Page 8: Asymptotic Relative Efficiency

CHAPTER TWO

DISTANCE BETWEEN PROBABILITY MEASURES

1 From the practical or experimental point of view, the problem of estimation is the problem of picking the 'actual' or 'correct' probability measure from a family of possible probability measures by observing the results of trials. It is therefore advisable fIrst to study families of probability measures, and consider how much members of a family differ from one another.

Let P l' P 2 be probability measures on the same a-algebra of sets in a space Pl'. We want to defIne a measure of the difference between the two probability measures, a distance between P 1

and P 2. Suppose that they have densities fl' f2 relative to a dominating measure f.1" which is a-fInite. This is no restriction for we may always take P 1 + P 2 for f.1,. '

Consideration of continuous distributions in R 1 suggests

p*(P1 ,P2) = Jlfl - f2Idf.1"

the Ll norm of fl - f2 for the distance between the distributions. This is zero when the distributions coincide and has the maximum value 2 when they do not overlap; p* is the total variation of P 1 - P 2. For measurable sets A, P 1 (A) - PiA) is a maximum when A is the set {X;fl(X) > fix)}, and then

P1(A)-PiA) = P2(AC)-P1(AC) = tp*·

It has two main disadvantages. The modulus of a function is analytically awkward, and the L 1 norm gives the same weight to the same difference between fl and f2 whether the smaller of the two is large or small. The L2 norm of Jfl - Jf2 is much better ~ both respects, and so we derme p(P l' P 2) = P(fl ,f2)' the dIstance between P 1 and P 2' by

p2(Pl' P2) = J<Jfl - Jf2)2df.1, = 2 - 2S J(fJ2)df.1,. [2.1J

p can take values from 0 to J2. It is 0 if and only if P 1 = P 2. It has the maximum value J2 if and only if fl(X)f2(X) = 0 a.e.f.1"

6

DISTANCE BETWEEN PROBABILITY MEASURES

i.e. if there are disjoint sets A 1 , A2 such that P 1 (A1) = 1 = P 2(A2). The distance between any discrete probability measure and any continuous probability measure has the maximum value J2, however close these measures may be in another metric.

p* = S IJfl - Jf21 (Jfl + f2)df.1, ~ p2.

By Schwarz's inequality

p*2 ~ SCJfl - Jf2fdw S(Jfl + .Jf2)2df.1, = p2(4 - p2).

Therefore

p2 ~ p* ~ pJ(4- p2) ~ 2p.

The value of p is independent of the particular choice of the dominating measure f.1,. Let g l' g 2 be the densities of P l' P 2 relative to another dominating measure v. Let h, k be the densities of f.1" v relative to f.1, + v. The density of P 1 relative to f.1, + v is flh and also glk. Thus flh = glk. Similarly f2h = g2k. Hence J (fl f2)h = J (g 19 2)k, and

J J(glg2)dv = S J(glg2)kd(f.1, + v)

= S J (fl f2)hd(f.1, + v) = S.J (fl f2)df.1, .

Thus the distance is the same whether the densities relative to f.1, or to v are used.

For any f.1, measurable set A,

[J J(fd2)df.1,J 2 ~ S fl df.1,Sf2df.1" A A A

Hence, if (Ar; r = 1, ... ,n) is a partition of Pl' into a fInite number of disjoint, measurable sets,

r=l

n

SJ(flf2)df.1, ~ I J[P 1 (Ar}P iAr) J. r= 1

It can be shown that the left hand side of this inequality is the infImum of the right hand side for all such partitions. Thus

7

Page 9: Asymptotic Relative Efficiency

SOME BASIC THEORY FOR STATISTICAL INFERENCE

n

pZ(P1,PZ) = 2-2S.j(fJz)d)1 = 2-2inf I .j[P1(Ar)Pz(Ar)]

= 2 sup{ 1 - J1 .j[P1(Ar)Pz(Ar)J }

2 If T is a mapping of f![ into a space 1!lJ, and if Q1' Qz are the probability measures in I!lJ induced by P l' P z' then

[2.2J

with equality if and only if fdfz is a function of T, i.e. if T is a sufficient statistic.

Let gl'gz be the densities of Q1,QZ relative to a IT-finite measure v on I!lJ which dominates the measure induced by )1.

We prove [2.2J by showing that

S .j (fJz)d)1 :-:;; S.j (g 19 z)dv .

By the Schwarz inequality (iv) in Section 2 of the Appendix.

T* .j(fJz) :-:;; .j(T*f1· T*fz) = .j(g1gZ) a.e.v, [2.3J

S .j (f1 fz)d)1 = S T* .j (f1 fz)dv :-:;; S.j (g 19 z)dv ,

with equality if and only if the equality in [2.3J holds a.e.v, i.e. if .j f1 /.j fz is a function of T, and therefore fz/ f1 a function of T, so that T is a sufficient statistic. Note that a 1-1 mapping is a sufficient statistic. D D D

3 If cD In(X l' ... , Xn) = fl (Xl) fl (XZ)· . .fl (Xn),

cDZn(x 1, ... ,Xn) = fz(x 1)fz(xz)·· .fz(Xn},

1 - tPZ(cD In' cDZn) = S.j (cD In cD Zn))1(dx1)··· )1(dxz)

= H.j(fJz))1[d(x1)J}n = {1- tPZ(fl'/Z)}n,

and therefore ~ 0 as n ~ 00, if fl +- fz . Thus pZ(cDln,cDzn)~2, the maximum value, as n~oo. This means that if two probability distributions are different, no matter how close to one another they are, by taking a sufficiently large sample, we can obtain distributions with a pZ, and therefore a p*, as close to the maximum 2 as we please - a fundamental principle of applied statistics.

8

DIST ANCE BETWEEN PROBABILITY MEASURES

4 Some insight into the statistical significance of p may be gained by considering discrete distributions. Let P be the discrete probability distribution which assigns the probability Pr to the point X r ' r = 1,2, ... ,k. Let P' be the probability distribution determined by a random sample of size n from this distribution. p~ = nJn, r = 1,2, ... , k, where nr is the number of Xr points in the sample.

Consider

x Z = \,(nr - npY = n \,(p~ - Pr)Z L nPr L Pr

The limit distribution of this, when n ~ 00, is XZ with k - 1 degrees of freedom.

Z( ') _ "'( /' /)Z _ \' (p~ - Pr? p P,P - i..J V Pr - V Pr - L(.jp~ + .jPr?

Thus

where

= ~ \' (p~ - PJZ[1 _ (.jp~ - .j~r)(.jp~ ~ 3.jPr)J. 46 Pr (.jPr + .jPr)

I 1< 31.jp~ - .jPrl Gr - .jp~+.jPr '

and therefore ~ 0 when P~ ~ Pro The 8r all ~ 0 with probability one as n ~ 00, and

4npz = XZ(1 + '1n},

where '1n ~ 0 with probability one as n ~ 00. Hence 4npz ~ X Z

in probability as n ~ 00: its limit distribution is therefore XZ

with k - 1 degrees of freedom. There is no corresponding theorem for a continuous distri­

bution. The sample distribution is discrete, and so always has maximum distance .j2 from its continuous parent distribution.

Now consider two samples of sizes n1 , nz from the same discrete distribution P above. Let n1r , nZr be the numbers of

9

Page 10: Asymptotic Relative Efficiency

SOME BASIC THEORY FOR STATISTICAL INFERENCE

Xr points in the two samples.

Plr = nlr/nl , P2r = n2r/n2·

The usual measure of the discrepancy between the samples is

X 2 = n n I(nlr/nl -n2r/n2)2 - n I (Plr - P2Y I 2 - In2 .

nlr + n2r nlPlr + n2P2r

As nl , n2 --+ 00 its limit distribution is X2 with k - 1 degrees of freedom.

The distance between the two sample distributions is given by

p2 = I(JPlr - JP2rf = \' (Plr - P2Y 2 f..jJPlr + JP2r)

= nl + n2 \' (Plr - P2rf (I + Br), 4 LnlPlr + n2P2r

where

B = (jPlr - jP2r) [(3nl - n2)jPlr - (3n2 - nl )jP2rJ r (n l + n2)(.jPlr + .jP2r?

IBrl ::; 3(JPlr - JP2r), JPlr + JP2r

and therefore --+ 0 with probability one as nl , n2 --+ 00.

4nln2p2 = X2(1 + 11), nl +n2

where 11 --+ 0 with probability one as nl , n2 --+ 00. The limit distribution of 4nl n2p2 /(n l + n2) is X2 with k - 1 degrees of freedom. The result in the previous case can be deduced from this case by putting nl = n, n2 = 00.

When nl = n2 = n, the results simplify to

X2 = n \' (Plr - P2Y, L Plr+ P2r

2np2 = n \'(Plr - P2r)2[1 + (JPlr - JP2r)~J. L Plr + P2r (JPlr + JP2r)

In this case

10

CHAPTER THREE

SENSITIVITY OF A FAMILY OF PROBABILITY MEASURES

WITH RESPECT TO A PARAMETER

I We shall consider a set {Po ;OEE>} of probability measures on the space q; with densities f( . , 0) relative to a (J - ftnite measure jl, where E> is a real interval. To simplify notation we shall write f for f(-, 0), and f.. for f(', Or) where convenient. The derivatives with respect to 0, where they exist, will be denoted by!',f:. For the present we denote by p, = P(P6 ,P8 ), the distance of P6 from a fixed probability measure P 60 •

p2 = J(Jf - Jfo)2djl = 2 - 2 J.j (ffo)djl.

p2, and therefore p, is a continuous function of 0 iffis continuous in mean because J(ffo) ::;fo + J, and so J(ffo) is continuous in mean.

p takes the minimum value 0 at 0 = 00 , Hence if it has a derivative at 00 , this must be zero. In general, this is not so.

p2 = f(Jf - JfO)2 dll e =1= e . [3.1J (0 - ( 0 )2 0 - 00 r, 0

If a.e.jl,.jf has a e derivative at 00 , then a.e. the integrand --+ (dJf/dO)i=60 = f~2/4fo when e --+ eo' Under certain regularity conditions the limit of the integral will be equal to the integral of the limit and we shall have

lim p2 2 = ff~2 djl, 6--+80 (0 - eo) 4fo

and therefore [3.2J

where

1= I(e) = f f;2 djl = E6[(f'/f)2J,

11

Page 11: Asymptotic Relative Efficiency

SOME BASIC THEORY FOR STATISTICAL INFERENCE

and

10 = I(eo)'

At eo, p will have a right hand derivative lJlo, and a left hand derivative - i-Jlo' Sufficient regularity conditions are given below; but even without these, if at eo,fhas a e derivative a.e. then

and so

[3.3J

In particular, if 10 = 00, lim pile - eo 1= 00. The right hand o~oo

?erivative of p at e = eo is + 00, and the left hand derivative IS - 00.

A case where [3.3J is true and not [3.2J is the set of distributions on the real line with density functions

f(x, e) = eO-x, x ;:::: e,

= 0, x < e. Here

f'(x, e) = eO-X, x > e,

= 0, x < e, and does not exist at x = e. 10 = 1.

p2(f,fo) = 2 - 2e- I,0-001/2.

As e-+eo, p2/Ie-Bol-+l, and p2/(e-eo)2-+00, and thus pile - eol-+ 00. 100 does not tell the full story.

2 By the mean value theorem and the theorem of dominated convergence, it follows from [3.1 J that if the following conditions are satisfied, i.e.

(i) for almost all x, f is a continuous function of e in an open interval N containing eo ;

12

SENSITIVITY OF A FAMILY OF PROBABILITY MEASURES

(ii) f has a e derivative at all points of N - D, where D is a denumerable set of points in N, which does not contain eo, but otherwise may vary with x;

(iii) f'(x, e)21 f(x, e) 5 g(x),

where g is integrable over f!£ ;

then lois finite and

for eEN -D,

In general, this result is applicable to distributions likely to be met in practice, except when the points at which f(x, e) > 0 vary with e. Such a case is the family of distributions with density functions

f(x,e) = eO-X(x - er-1/f'(m), x> e, m> 0,

=0 ,x<e.

f'21f = eO-X(x - er- 3(x - e - m + W/f'(m), x> e,

= 0, x < e. I is finite if m> 2, infinite if 0 < m 52 and m =1= 1, and 1= 1 if m = 1. The conditions set out above are satisfied when m;:::: 3, but not when 2 < m < 3. Many such cases can be dealt with by using the following theorem.

Iflimp2/(e-eo)2 exists, is finite, and is equal to ilo, we shall say that the family of probability measures (or the family of densities) is smooth at eo. If the family is smooth at every point of an open interval N, we shall say that the family is smooth in N. Instead of saying that the family of densities (or the family of probability measures) is smooth, it is sometimes convenient to say, more briefly, thatfis smooth, at eo, or in N.

3 Theorem If at each e in an open interval N containing eo ,fhas a e derivative f' at almost all x, and

(i) Sf'dfl = 0, = :e Sf dfl,

(ii) :eSJ(ffo)dfl = f dJd~o) dfl = f f~1; dfl,

13

Page 12: Asymptotic Relative Efficiency

SOME BASIC THEORY FOR STATISTICAL INFERENCE

(iii) 1= Jf'21fdjl is a continuous function of e, I

then the family is smooth at 80· If alsof'(·, 8) ~ 1'(., (0)*' then at 80

d2p21d82 = Yo, where

Hence

[3.5J

v ~ ° as e ~ eo. By the extended form of l'Hopital's rule given in Section 3 of the Appendix.

V v, lim sup (8 _ 8 )2 ::;; lim sup 2(8 - 8 r 8-+80 0 8-+80 0

From [3.5J

1· 2V < 1· V 1 1lll sup (8 _ 8 )2 - 1lll sup (8 8)2 + "41 0'

8 ... 80 0 8 ... 80 - 0

and therefore

1· V < 11 1lll sup (ll e)2 - 4 o· 8 ... 80 0 - 0

From [3.3J

lim ·f V 1 8 ... 80 m (e _ eo)2 2:: 410'

I *See Section 1 of the Appendix for definition of .....

14

SENSITIVITY OF A FAMILY OF PROBABILITY MEASURES

and so

lim V 11 (8 8 )2 = 4 o·

8 ... 80 - 0

This proves the first part of the theorem. At 80' V' = 0, and so

V' - V~ _ f(~f - ~fo)f' e - 80 - (8 - eo)~f djl,

which~ J(f~2/2fo)djl=iIo' as 8~80' because the integrand I 2 ~ f~ 12fo, and

I (~f - ~fo)f' I < (~f - ~fO)2 +1'2 If (8 - eo)~f - (e - eo)2 '

which is convergent in mean at 80. Thus,

at eo, dV'ld8 = d2V Id82 = Yo. The distributions on R1 with densities

e8 - X (x - e)m-1 Ir(m), m > 0, x > e,

satisfy the conditions of the theorem for all m > 2.

DOD

4 Denote lim infp(P/J,P80)/le-801, which always exists, by 8 ... 80

so; it is a measure of the sensitivity of the family to small changes in 8 at eo. In all cases So 2::i~Io. If So is fmite and equal to i~Io, we shall say that the family is semi-smooth at 80. A smooth family is, of course semi-smooth. Semi-smoothness is a theoretical possibility rather than a contingency liable to be encountered in practice, but it must be mentioned if we are to give a complete account of 1.

If the density fhas a 8 derivative in mean at 80' i.e. iff~ exists such that

limflf-fo -f~ldjl,=O, 8 ... 80 8 - 80

we shall say that the family of probability measures (or the family of densities) is differentiable in mean at eo.

1~=~oJ = I~~=i:ol(~f+~fo),

15

Page 13: Asymptotic Relative Efficiency

SOME BASIC THEORY FOR STATISTICAL INFERENCE

therefore

I;=~tl ~ (J;=i:o y +2f+ 2fo·

If the family is smooth at () o,f -+ fo in mean, and (Jf - Jfof I«() - ()0)2 -+ f~2 14fo in mean as () -+ ()o. Therefore (f - fo)/(() - ()o) -+ f~ in mean. Thus smoothness at ()o implies differentiability in mean at ()o. For any measurable set B in fE,

Sf~dp, = :() SJdp,19-90 B B

which is thus a necessary condition for smoothness. In particular

SJ~dp, = o. This last statement is true for a semi-smooth family.

5 Put

L = logf, Lo = logfo,

L: = d logfl d() = f'1f, L:o = f~/fo.

When the family is smooth or semi-smooth at ()o,

E90(L:O) = S(f~1 fo)fodp, = Sf~dp, = o. 10 = E90(L:5) = V90(L:O) = 4s~.

If the family is smooth or semi-smooth in an open interval N, Sf'dp, = 0 for all ()EN. If this can be differentiated with respect to (),

Thus

Sf"dp, = 0,

E(82LI8()2) = S:e (f'lf)fdp, = (f"lf- f'2If2)fdp"

= Sf" dp, - S (f,2 If)dp, = - E(L:2).

6 Suppose we have a family of probability measures P 9 on a space fE with densities f(·, ()) relative to a O"-finite measure p"

16

SENSITIVITY OF A FAMILY OF PROBABILITY MEASURES

and a family of probability measures Qo on a space Ojj with densitites g(., ()) relative to a O"-finite measure v. Consider the product measure R9 = P9 X Q9 on the space fE x Ojj. Its density hex, y, ()) at (x, y) is f(x, ())g(y, ()) relative to the product measure p, x v. Put

then

1 -lp2 = SJ(fgfogo)·d(p, x v) = SJUJo)dp,- SJ(ggo)dvv

= (1 -lp~)(1-1p~).

Therefore 2 2+ 2 1 2 2 P = Pl P2 - 2:P1P2'

p2 p~ p~ «() - ()of = «() - ()0)2 + «() - ()0)2

If 2 2

2 1· Pl d 2 lim P2 S10 = 1m «() _ ())2 an S20 = «() _ () )2'

9-+90 0 9-+90 0

exist, so does 2

2 1· P So = lID «() _ () )2'

9-+90 0

and 2 2 2

So = SlO + S20·

This can be extended to any finite number of families.

L(x, y, ()) = log hex, y, ()) = logf(x, ()) + log g(y, ()).

L: =f'lf+g'lg.

If the P and Q families are both smooth at ()o,

E90(L:O) = E90(f~1 fo) + E90(g~1 go) = o. Sh~2Iho·d(p, x v) = V90(L:O) = V90(f~/fo)+ Voo(g~/go)

= 4sio + 4s~0 = 4s~,

17

Page 14: Asymptotic Relative Efficiency

SOME BASIC THEORY FOR STATISTICAL INFERENCE

and so the R family is smooth at eo' This can be extended to any finite number of smooth families. In particular, if X I'

X 2' ... ,X n are independent random variables, each with the same family of distributions, which is smooth at eo, then the family of (X l' X 2' ... ,X n) distributions is smooth at eo. In the same way it can be shown that if the Xr have the same semi­smooth family of distributions, the joint distribution is semi­smooth.

7 R.A. Fisher encountered I in his investigation (Fisher, 1925) of the variance of the maximum likelihood estimator in large samples. He called it the intrinsic accuracy of the distribution, and later, the amount of information in an observation. The former name has dropped out of use. I is now usually called the information; but this is not a good name, as the following. examples will show.

For one observation from the normal distribution with densitye-(X-O)2/2/J(2n),

L = - i(x - e)2 - i log 2n, 1: = x - e, I = 1.

For the densities [1/.J(2n)]e-(X-03 )2/2,

L = - i(x - ( 3 )2 - i log 2n, 1: = 3e2(x - lJ3), I = ge4 •

At e = 0, I = 0; but this does not mean that a single observation gives no information about e.1t merely means that dplde = 0 at e = 0, where p is the distance of Po from Po. As a consequence of the Cramer-Rao inequality, (see Chapter 5) the e derivative of the mean value of any statistic with a finite variance will also be Oat e = O.

For the exponential family t!-x, x ~ e, I has the value 1, as it does for the normal family of unit variance and mean e. For a sample of n, the value of I is n in both cases. In the normal case, the sample mean Xn is a sufficient statistic. It has mean e and variance lin. In the exponential case, the least observation X(li is a sufficient statistic. X(l) - lin has mean e and variance lin . The two cases are quite different, although I has the same value. As shown above, I has little significance in the exponential case.

For the distributions on RI with densities

18

SENSITIVITY OF A FAMILY OF PROBABILITY MEASURES

f(x, e) = (x - e)eo-X,

=0,

x ~ e, x < e;

1=00 for every e. One observation gives infinite 'information' aboute!

For the mixture of normal distributions with densities at x,

(1- e)e- x2/21.J(2n) + ee- x2/4/J(4n),0 ~ e ~ 1,

I = 00 at e = 0; but for the family

(1- ( 2)e-x2/21.J(2n) + e2e- x2/4/.J(4n),0 ~ e ~ 1,

1= 0 at e = O. This cannot be explained in terms of ' information'. Evidently 'information' is a misnomer. Denoting, as before,

lim infp(Po,Po)/le-eol, which always exists, by So' we shall 0-+00

call 4s~ the sensitivity of the family at eo. When the family is smooth or semi-smooth, or when lois infmite, 10 = 4s~, so that then the value of 10 is the sensitivity at eo. It would seem that when 10 < 4s~, the particular value of 10 has no statistical signi­ficance at all. As we have seen, when independent probability distributions are combined to give a product distribution, sensitivities of smooth families are additive.

8 A mapping T of a probability space f!{ into a space !T, as discussed in Chapter I, is called a statistic. The induced pro­bability measure in !T is called the distribution of T. For brevity we shall sometimes speak of the sensivity of a statistic T when we really mean the sensitivity of the family of distributions of T. Let the family of distributions of T be Wo}, with densities g(' , e) relative to a a-fmite measure v. In the notation of Section 2 of the Appendix, g = T*f

Theorem (i) If the Po family is differentiable in mean at eo, so is the Qo

family. (ii) If the Po family is smooth at eo, or semi-smooth and differen­

tiable in mean at eo, so is the Qo family, and the sensitivity Io(T) of T at eo is less than or equal to the sensitivity 10 of the Po family ai eo, with equality if and only if 1:0 , = f~/fo, is afunction ofT. In Particular, ifU is any statistic,

10(1:0, U) = 10 = 10(1:0),

19

Page 15: Asymptotic Relative Efficiency

SOME BASIC THEORY FOR STATISTICAL INFERENCE

(iii) Ifthe Pefamily is differentiable in mean at ()o, and ro is a function of T, then

g~(t)/go(t) = f~(x)/fo(x), a.e., where t = T(x).

Proof. If the Pe family is differentiable in mean at ()o, (f - fo)/(() - ()o) - f~inmeanas() - ()o·Hence(g - go)/(() - ()o) = T*(f - fo)/(() - ()o) - T*f~ in mean. See Section 2 (vi) of the Appendix. Thus T*f~ is the derivative in mean of g at ()o. We may denote it by g~. For our purposes, it has all the properties of a derivative, e.g.

d I

d()Sgdv1e=eo = Sgodv. B B

As () - () o,f - fo in mean. Therefore g - go in mean. If the P e family is smooth at ()o, then as () - ()o,

-...!'\/.~_'\/~.o __ 0_ in mean. ( fJ - 1;)2 f/2 () - ()o 4fo

Therefore

T*(.Jf - .JfO)2 _ T* f~2 in mean. () - ()o 4fo

But

T*(.Jf - .JfO)2 = T*f + fo - 2.J(ffo) () - ()o (() - ()0)2

> g + go - 2.J(ggo) - (() - ()0)2 ,

a.e.v.

because T* .J(ffo) s .J(ggo), as shown in Chapter II, Section 2. Thus

T*( .J~ = t:o y ~ (.J~ -= i:o Y = (: = ::Y (.Jg +1.Jgof

Now (g - go)/(() - ()o) - g~ in mean, and (.Jg + .JgO)2 - 4go in mean. Therefore

( .Jg - .JgO)2 ~ g~2 ~ in mean,

() - ()o 4go

20

SENSITIVITY OF A FAMILY OF PROBABILITY MEASURES

since it converges loosely to g'i /4go, and is dominated by

T*( .J~ = t:o y, which is convergent in mean. The proof is

the same when the P e family is semi-smooth and differentiable in mean at ()o, except that attention is restricted to a sequence (()J with limit ()o, such that

(.Jfn - .Jfof /(()n - ()0)2 - f~2 /4fo in mean.

The inequality 10(T) s 10 is obvious from the fact that p(Pe,Peo) ~ p(Qe, QeJ To determine the conditions under which the equality holds we must use Schwarz's inequality.

g'i = (T*f~)2 = [T*(.JfoI~/.Jfo)Ys T*fo·T*(f~2/fo) a.e.v.

Thus

and

Sf~2/fod/l ~ Sg~2/godv,

with equality if and only if(f~/J fo)/Jfo is a function of T a.e./l, i.e. ro = f~/fo is a function of T. In particular, if U is any statistic

lo(ro, U) = 10 = lo(ro)·

If T is a sufficient statistic, ro is a function of T, and therefore

10(T) = 10.

The statement that ro is a function of T means only that T(x 1) = T(x2) => r O(x 1) = r O(x2). If this condition is satisfied, let ro = k( T), i.e.

f~(x)/fo(x) = k[T(x)], f~ = k(T)Io·

If the Pe family is differentiable in mean at ()o,

g~ = T*f~ = T*[k(T)fo] = k·T*fo = kgo· a.e.

Thus k = g~/go' and

f~(x)/fo(x) = k[T(x)] = g~[T(x)]/go[T(x)]

= g~(t)/go(t), ift = T(x).

This proves (iii), and completes the proof of the theorem. DDD

21

Page 16: Asymptotic Relative Efficiency

SOME BASIC THEORY FOR STATISTICAL INFERENCE

9 The distance function p that we are using is based on ..)f; it is worth noting that smoothness of the family at 80 is exactly equivalent to the differentiability in mean square of ..)f at 80 .

If ..) fis differentiable in mean square at 80 , go (not a function of 8) exists such that

It follows that

..)f - ..)fo ~ g as 8 -t 8 8-80 0 0'

and, by the corollary to the Theorem in Section 1 of the Appendix.

( ..)~ = t:o Y -t g~ in mean.

Now

f - fo = (..)f - ..)fo)(..)f + ..)fo) I 2 /1" 8-8 8-8 -t gOVJO' o 0

I~=tl ~ (..)~=t:o Y + (..)f+..)fo)2 ~ (..)~=t:o Y+2f+2fo,

which is convergent in mean. Hence (f - fo)/(8 - 80 ) -t 2go..)fo in mean: that is at 80 ,fhas a derivative in meanf~ = 2go..)fo, so that go=f~/2..)fo.

( ..)~ = t:o Y -t f~2 /4fo in mean:

the family is smooth at 80 •

Conversely, if the family is smooth at 80 ,

lim f(..)f - ")fO)2 d/1 = ff~2 d/1, 8-->80 8 - 80 4fo

wheref~ is the derivative in mean offat 80 .

22

f~ . 2..)fo'

SENSITIVITY OF A FAMILY OF PROBABILITY MEASURES

therefore

lim f(..)f - ..)fo - JL)2 d/1 = o. 8-->80 8 - 80 2..)fo

Thus ..)f is differentiable in mean square at 80 . LeCam (1970) has shown the importance of this condition in maximum likeli­hood theory. D D D

23

Page 17: Asymptotic Relative Efficiency

CHAPTER FOUR

SENSITIVITY RATING CONDITIONAL SENSITIVITY

THE DISCRIMINATION RATE STATISTIC

1 We may define the relative sensitivity rating at 80 of two statistics as the ratio of their sensitivities at 80 , when this has a meaning. The sensitivity rating at 80 of a statistic Tmay be defined as the ratio of its sensitivity at 80 to the sensitivity 4s~ of the original family of probability measures. When the family is smooth, or semi-smooth and differentiable in mean at 80 , the sensitivity rating is 10(T)/10.

If the family of probability measures on f!£ is smooth at 80 ,

and if A is an event of positive probability at 80 , the family of conditional distributions given A is smooth at 80 . Put F = Po(A) = Sfdfl, a function of 8 only, for given A. Since the

A

family is smooth at 80 , it is differentiable in mean at 80 ,

and therefore

F~ = Sf~dfl. A

For the conditional distributions the space is the set A, and the density of the conditional distribution at 8 is k = f /F, which has a derivative in mean at 80 , k~ = f~/Fo - foF~/F~, because

f/F - fo/Fo ~ k' 8-8 0 o

and

If/F - fo/Fol ~fll/F -l/Fol + ~I f- fo I,

8-80 8-80 Fo 8-80

which is convergent in mean. We have to show that 1

(.Jk - Jko)2/(8 - 80 f, which ~ k~2/4ko as 8~ 80, is convergent

24

in mean.

(Jk - .Jko? (8 - 80)2

SENSITIVITY RATING

[J(f/F) - J(fO/FO)]2 (8 - 80)2

< 2[J(f/F) - .J(f/Fo)]2 + 2[.J(f/Fo) - J(fO/FO)]2 - (8-80)2 (8-80f

2f(l/JF - l/J F 0)2 2(J f -.J fo)2 = (8 - 80f + F 0(8 - 80)2

Both terms are convergent in mean in A, and so (Jk - Jko?/(8 - 80? is convergent in mean, and the family of conditional distributions is smooth at 80 •

In particular, if the f!£ space is discrete, all conditional distri­butions will be conditional on an event of positive probability, and so all conditional distributions will be smooth at 80 .

The corresponding results can be shown to hold for a family which is semi-smooth and differentiable in mean at 80 .

2 When the conditioning event is of zero probability, we are not able to prove so much, and we shall restrict the discussion to conditioning on the discrimination rate statistic L'o. Let U be a statistic such that the distribution of (U, ~o) has a density relative to the product measure v x T, where v, T are a-finite measures in the U space and in the ];0 space (Rl) respectively. We shall show that if the family of probability measures on f!£ is semi-smooth and differentiable in mean at 80 , then for almost all c the conditional distribution of U given ];0 = c, has zero sensiti­vity at 80 . We express this by saying that Eo is locally sufficient at 80 .

The (U, ~o) family of distributions is semi-smooth and differen­tiable in mean at 80 , and has sensitivity 10 there. Denote its density by g(.,., 8). The density of the ];0 distribution is h(·, 8), where

h(c, 8) = Sg(·, c, 8)dv.

From (iii) of the Theorem in Section 8 of Chapter 3.

h'(c, 80 )

h(c, 80 )

g'(u, c, 80)

g(u,c, 80 )

25

, ... ' .....

f'(x, 8~)

f(x, 80 ) a.e., [4.1]

Page 18: Asymptotic Relative Efficiency

SOME BASIC THEORY FOR STATISTICAL INFERENCE

where c = Lo(X), u = U(x). The density of the distribution of U, given [;0 = c, is k(· , c, e), where

k(u, c, e) = g(u, c, e)/h(c, e). [4.2]

lim inf ff(J g - J gg)2 dvdr = ff g'i dvdr. 0-+00 (e - eo) 4go

Therefore for some sequence (en) with limit eo

ff I (J gn - J gg)2 _ g~21 dvdr -+ 0. (en - eo) 4go

Therefore

and so

in mean on the [;0 space. From their convergences in mean, it follows that there exists a sequence (em) with limit eo such that for almost all c

f (Jgm - Jggf dv -+ f g~2 dv, (em-eo) 4go

g -g em _ eO -+ g~ for almost all u, m 0

hm - ho -+ hi e - e o·

m 0

26

SENSITIVITY RATING

Because of[ 4.1], the last term is equal to (l/Jhm - 1/Jho)g~/2J go' hence

f(Jkm - Jkg)2 dv :::; ~ f(Jgm - Jgo _ ~)2 dv

(em - eo) hm em - eo 2Jgo

3( l/Jhm - l/Jho ~)2 f d + e - e + 2h3 / 2 go v mOO

+ 3 (l/Jhm - l/Jhof f :;20 dv.

When m -+ 00, each of the three right-hand terms -+ 0. Therefore the left-hand integral-+ 0, and so

1· . ff(Jk-Jkof d ° o~o ill (e _ eo)2 v = .

The conditional distribution of U given Lo = c, has zero sensitivity for almost all c. D D D

3 The role of [;0 is best appreciated when the probability space is discrete. Suppose that f.1 is the counting measure, and that the points at which Lo = c are Xl' X 2 ' .•• If the Po family is differen­tiable in mean at eo,

f'(xr, eo) 'LJ'(xr, eo) "-------'-~ = c = ~--'--'~"---f(xr,eo) ~f(xr,eo)·

Put qs = Po {Xs\L'O = C} = f(xs,e)/~f(xr,e),

dqs f'(Xs' e) f(xs,e)~f'(xr,e) de = 1.J(Xr, e) [If(xr,e)]2

{ f'(xs,e) ~f'(Xr,e)} = qs f(xs,e) - ~f(xr,e) .

This is zero at eo, and thus at eo

dqs/ de = 0, s = 1,2, ...

All conditional probabilities given L~ = c, have zero e derivative at eo. All conditional distributions will be stationary at eo. If the Po family is smooth (or semi-smooth) at eo, all conditional distributions, given £0 = c, will be smooth (or semi-smooth), and will have zero sensitivity at eo.

27

Page 19: Asymptotic Relative Efficiency

SOME BASIC THEORY FOR STATISTICAL INFERENCE

4 If a statistic T has maximum sensitivity for all 0 in an open interval N, it must be sufficient. [; must be a function of T at each 0 in N, and so

[;(x,O) = k(T,O)

L(x,O) = kl (T, 0) + h(x), where 8kd80 = k.

Hence T is sufficient in N. High sensitivity is obviously a good quality in a statistic used

for estimation or testing. In the ordinary case, when the family is smooth, or semi-smooth and differentiable in mean, at 0o,Lo is a statistic of maximum sensitivity at 00 , and it is locally sufficient there. Unfortunately, in general, we cannot use LO because we do not know it; it depends on the true value 00 of O. It is an 'invisible' statistic. The state of affairs is different when we pass from estimation to testing. To test the hypothesis 0 = 0o,LO can be used, and as is well known, the locally most powerful test is based onLo.

28

CHAPTER FIVE

EFFICACY SENSITIVITY

THE CRAMER-RAO INEQUALITY

I With the same notation as in Chapter III, let S be a real valued statistic which has a finite mean value h(O) for all 0 in an open interval N.

h(O) = E(S) = ISfdfl.

We shall say that S is regular at 00 E N if h has a 0 derivative at 0 . given by 0

h'(Oo) = ISf~dfl.

When this is so,

h'(Oo) = I SLofodfl = E60(S£0)·

In all applications of the theory we shall have

E60(tO) = If~dfl = o. Thus when S is regular at 00, h'(Oo) is the covariance of Sand Lo at 00 , Hence

h'(00)2 S V60(Lo)V60(S) = 10 V60(S) ,

where V denotes variance, and

10 = V60([;0) = E80([;g) = If~2/fodfl.

Thus

[5.1]

with equality if and only if S is a linear function of [;0. This is the Cramer-Rao inequality, which is true if the statistic S is regular. Which statistics are regular? We do not want to apply the Cramer-Rao inequality to statistics we know, and so the regularity conditions should ask as little as possible of the statistic S which we may not know, and should be mainly concerned with

29

Page 20: Asymptotic Relative Efficiency

SOME BASIC THEORY FOR STATISTICAL INFERENCE

the family of probability measures Po, which we know completely. Before discussing regularity conditions on S, consider the

expression on the left of the inequality [5.1 J. Its value is a measure of the effectiveness of S in the estimation ofO. We shall call it the lifficacy of Sat 00 • The value of the numerator is an indication of the sensitivity of the distribution of S to small changes in 0 at 00 •

We want this to be large. The denominator indicates the liability of S to vary from observation to observation. We want this to be small. Thus high efficacy is a desirable characteristic of a statistic used for the estimation of O. It is unaltered by the addition of a constant to S, or by the multiplication by a non-zero constant. If 1:0 itself is regular, its efficacy is 10, the maximum possible. As will be seen later, when the family of probability measures on Pl' is differentiable in mean at 00 , the efficacy of any statistic with a finite variance at 00 can be dermed, even ifit is not regular. Even when 1:0 is not regular, its efficacy is always the maximum possible.

When h(O) = 0, so that S is an unbiased estimator of 0, the efficacy of S is simply the reciprocal of its variance,

Voo(S) ~ 1/10.

In the more general case, suppose we have a sequence (Sn) of statistics such that E(Sn) = h(O), and VOo(Sn) -+0 as n-+ 00, and that we estimate 0 by taking the observed value of Sn as an estimate of h(O), and then calculating the corresponding value of O. Our estimator is then en' where Sn = h(en).

We suppose that the Sn are all regular at 00 , that h(O) is strictly monotonic, and that h'(eo) =f= o. If 00 is the true value of 0, Sn -+ h(Oo) in probability as n -+ 00, i.e. h(en) -+ h(Oo), and therefore en -+ 00 in probability.

Sn - h(Oo) = h(en) - h(Oo) = (en - 00)h'(00)(1 + 8n),

where 8n -+ 0 as On -+ 00 • Hence

h'(Oo)(en - 00) Sn - h(Oo) v'[VOo(Sn)] = v'(VOo(SJ) + 11n,

where 11n-+O in probability as n-+ 00. [Sn - h(Oo)]/v'[Voo(Sn)] has zero mean and unit variance. If it has a limit distribution, h'(eo)(en - 0o)/v'[VOo(Sn)] has the same limit distribution. If

30

EFFICACY, SENSITIVITY, THE CRAMER-RAO INEQUALITY

this limit distribution has zero mean and unit variance, VOo(SJlh'(00)2 is the asymptotic variance of en' and is the reciprocal of the efficacy of Sn.

2 Regularity conditions. Let S be a real-valued statistic with mean value h(O). As shown above, to establish the Cramer-Rao inequality we require

(i) h'(Oo) = JSf~dJl,

(ii) Jf~dJl = O.

We say that S is regular at 00 if it satisfies (i). Equation (ii) is equivalent to the statement that the statistic with the constant value 1 is regular.

Theorem I Ifthe Po family is smooth at 00 , every real statistic with a second moment which is bounded in some neighbourhood of 00 is regular at 00 •

Proof. Let N be an open interval containing 00 , and suppose that S has a second moment which is bounded in N, say

EoCS2) ~ K, OEN.

Note that Sf~ is integrable because ISf~1 ~ S2fo + f~2Ifo. h(O) - h(Oo) = fS(f - fo) d

0- 0 0 - 0 Jl, o 0

= f S(f- fo) d + f S(f- fo) d" 0-0 Jl 0-0 r· o 0

ISI";< ISI><

The first term on the right -+ J sf~ dJl as 0-+ 00, because f is ISI,,;<

differentiable in mean at 00 • Denoting the second term by k(O,c), we have

v'f - v'fo k(O, c) = J S(v'f+v'fo) 0- 0 dJl.

ISI>< 0

Using Schwaris inequality, and the inequality

(v'f+v'fo)2 ~ 2f+ 2fo'

31

Page 21: Asymptotic Relative Efficiency

SOME BASIC THEORY FOR STATISTICAL INFERENCE

we obtain

kef} )2 < 4K f (../f - ../fO)2 d ,c - (f}-f}of J1.,

ISI>c

which ~ K J f~2 /fodJ1. as f) ~ f}o. ISI>c

When c~ 00, K J f~2/fodJ1.~O. Hence ISI>c

lim h(f}) - h(f}o) = lim J Sf~dJ1. = J Sf~dJ1., 9-+90 f}-f}o C-+""ISI';;c

and S is regular at f}o.

Theorem IT

DDD

Let N be a real open interval containing f}o. lffor almost all x (i) f is a continuous function of f} in N, and has a f} derivative l'

at all points of N-D, where D is a countable set of points in N, which does not contain ()o but otherwise may.vary with x,

(ii) 1'2/fo ~ G, an integrable function ofx only, then every statistic S with a finite second moment at f) 0 is regular at () o. The con~ elusion still holds if(i) is replaced by

(i/) f is continuous in mean in N, and has a f} derivative in mean l' at all points of N-D, where D is a countable set which does not contain f} 0 .

Proof. Suppose that S has a finite second moment at f}o.

I l' I ~ ../ (foG) a.e.

From either (i) or (i,), the appropriate mean value theorem gives

Therefore

I f - fo I < I( I' G) If}- f}ol- v Jo

a.e.

I~r--t~1 ~ ISI../(foG) ~ S2fo + G a.e.

The last function is integrable and independent of f}. Hence S(f - fo) / «() - () 0) converges in mean to Sf~ , and S is regular at () 0 .

DDD When we have repeated independent observations,

32

EFFICACY, SENSITIVITY, THE CRAMER-RAO INEQUALITY

n

f(x,f}) = ng(xr,f}), 1

where g is the density of the distribution of a single observation. It is easy to see that if g satisfies (i) or (i/), so does f When this is so, if g satisfies (ii), then

Ig-gol < l(g G) I()- f}ol- v 0

g~go+../(goG), iflf}-f}ol~ 1.

We may always take N oflength ~ 1, and then

g2 ~ 2g~ + 2goG,

g2/go ~ 2go + 2G = H,

integrable over the Xl space. Also

gl2/go ~ G ~ H.

1'(x, ()) = f(x, f})'f.gl(xr' f})/g(xr, f}).

Hence

1'(x, f}f ~ f(x, f})V(xr, ()f f(x, f}o) ~ n L f(x, f}o)g(xr, f})2 .

The term after the summation sign is

Hence

gl(Xr' ())2 n g(Xs ' ()f < Ii H( ) g(Xr , f}o) str g(xs ' ()o) - 1 xr ·

n

1'(x, f}f /f(x, f}o) ~ n2 n H(xr) a.e.

and condition (ii) is satisfied. Thus if g satisfies condition (ii), so doesf

The normal and the Poisson densities

1 -An -(x-a)2j2u2 del\,

(J../(21r/ an -xl both satisfy the conditions of Theorem II for all their parameters.

33

Page 22: Asymptotic Relative Efficiency

SOME BASIC THEORY FOR STATISTICAL INFERENCE

Hence for samples from a normal or a Poisson distribution, all statistics with finite second moments are regular with respect to all parameters. For the gamma distribution density,

x~a; [5.2J

this is so for m and for c, but not for a. To deal with this we need an obvious, slight extension of Theorem II.

Theorem II' If the condition (ii) of Theorem II is replaced by

(ii')

where fl = f(', e1), and e1 need not be in N, the conclUSion is: every statistic S with a finite second moment at e 1 is regular at eo'

DDD The gamma density [5.2J satisfies condition (ii') if m > 2 and

a1 < a. Hence for a sample from such a gamma distribution, any statistic S is regular at ao if it has a finite second moment for some a < ao'

3 The inequality without regularity conditions. If EoP~o) = 0, and S is a real statistic, Eoo(S1.:0)2 ::;:; Voo(S)Voo(1.:o) without any restrictions, with (finite) equality if and only if S is a linear function of 1.:0' The only point at issue is the interpretation of Eoo(S1.:o) other than as an integral. If S is regular, and E(S) = h(e),

Eoo(S1.:o) = dE(S)/delo=oo = h'(eo)'

The inequality has no interest unless Voo(S) and Voo(1.:o) are both finite, so we assume this. Eoo(S1.:o) will then exist and be finite.

Suppose the density f is differentiable in mean at eo' Eoo(1.:o) = 0, and any bounded statistic will be regular at eo. Defme Sc by

Sc = S if I S I ::;:; c,

= ° iflSI > c.

Sc has a bounded mean and variance for all e. Let

E(Sc) = hc(e).

S c will be regular at eo, and

h~(eo) = JSJ~dJl = J Sf~dJl. ISI~c

34

EFFICACY, SENSITIVITY, THE CRAMER-RAO INEQUALITY

Therefore

k(eo) = lim h~(eo) = JSf~dJl = Eoo(S1.:o), c-> 00

and

In this form, with h'(eo) replaced by k(eo) = lim h~(eo)' the c-> 00

inequality applies to every statistic which has a finite variance at eo, even if the statistic is not regular, provided that f is differen­tiable in mean at eo' We may then extend the defmition of efficacy to every statistic S with finite variance at eo, by defIDing the efficacy of S at eo as k(eo)2/voo(S)' which is equal to [Eoo(S1.:o)]2/Voo(S)' We shall denote the efficacy of S by J(S), and its value at eo by J 0(S). The statistic 1.:0 always has efficacy equal to its variance, the maximum possible, [Eoo(1.:o1.:o) ]2 /Voo(1.:o) = Vo/1.:o)· It should be noted that 1.:0 may not be regular. The r(3) distribution with end point at e has density ieO-X(x - e)2, e::;:; x < 00. Eo<1.:o) exists only for e ~ eo. Eo(1.:o) has no left hand e derivative at eo ; but it has a right hand derivative which is equal to Eoo(1.:g). 1.:0 is semi-regular. For the symmetrical distribution with density i-e-Ix-Ol(x - e)2, - 00 < x < 00, EO<1.:o) does not exist for e =F eo. 1.:0 is not regular.

4 The distance inequality. Let Po,P be probability measures on f1£ with densities fo ,f relative to a (1-finite measure Jl. Let S be a random variable with means ho' h and variances (1~, (12 under Po' P respectively.

h - ho = J S(f - fo)dJl = J<J f - J fo)(J f + J fo)SdJl.

(h - hof ::;:; J<Jf - JfO)2dJl JS2(Jf + JfofdJl

::;:; 2p2(p,po)JS2(f+ fo)dJl.

The value of h - ho is unaltered by replacing S by S - c, where c is a constant. J<S - cf(f + fo)dJl will be a minimum when c = !(h + ho). The inequality then becomes

(h - hO)2 ::;:; 2p2[ (1~ + (12 + !(h - hO)2J .

(h - ho)2(1 - p2) ::;:; 2«(1~ + (12)p2.

35

Page 23: Asymptotic Relative Efficiency

SOME BASIC THEORY FOR STATISTICAL INFERENCE

When p2 < 1, we can write this

2 < 2(0"~ + 0"2)p2 (h - ho) - 1 2 ' -p

p<l. [5.3J

This is the distance inequality relating the distance I h - ho I between the means of a random variable S, and the distance between the probability measures.

4p2 1 _ p2 --c-----c-;;: > --;-:-~-,,-(h - hO)2 - t(O"~ + 0"2)"

If we have a family of measures, and 0"""""* 0"0 when p""""* 0, then

1· . f 4p2 1 1m III (h _ ho? ~ O"~'

and, if the limit exists, . 4p2 >~

11m '(h _ h )2 - 2 . p->o 0 0"0

Introducing a parameter e, we have

2(1 - p2) (h - hO)2 < 4p2 O"~ + 0"2 (e - eo)2 - (e - eo?'

If, as e """"* eo, 0"2 """"* O"~ ,

h'(eo)2 < (dP)2 2 - 4 e '

0" 0 d ao

provided the limits exist. For a smooth family, this gives the Cramer-Rao inequality.

Location parameter. Let X be a random variable with finite mean and variance, and with density I(x, e) = g(x - e) relative to Lebesgue measure on the real line. We may take e to be the mean of the distribution. Let

0"2 = J(x - e)2g(x - e)dx = Jx2g(x)dx.

The inequality [5.3J becomes

40"2p2 (e- eo?:::; --2' p < 1.

l-p

4p2 1- p2 --'-~>--(e - eo)2 - 0"2

36

EFFICACY, SENSITIVITY, THE CRAMER-RAO INEQUALITY

Hence

and, if the limit exists,

. 4p2 1 11m (e e )2 ~ 2' a->80 - 0 0"

The sensitivity is not less than the reciprocal of the variance. If the distribution of X is normal, the sensitivity is equal to the reciprocal of the variance. Thus, of all distributions of given variance, the normal distribution has the least sensitivity with respect to its location parameter.

The mean value of X is θ.

θ = ∫ x g(x − θ) dx.

If X is regular,

1 = ∫ x (∂/∂θ)g(x − θ) dx = ∫ −x g'(x − θ) dx = −∫ (x + θ)g'(x) dx,

where here g'(x) denotes dg(x)/dx. If this is true for all θ,

∫ x g'(x) dx = −1, ∫ g'(x) dx = 0.

Hence

1 = [∫ x g'(x) dx]² ≤ ∫ x²g(x) dx ∫ g'(x)²/g(x) dx.

Thus

I = ∫ g'(x)²/g(x) dx ≥ 1/σ², [5.4]

the Cramer-Rao inequality in this case. We shall have equality in [5.4] if and only if

g'(x)/g(x)^{1/2} = c x g(x)^{1/2}, a.e., c constant.

g'(x)/g(x) = cx, a.e.

d log g(x)/dx = cx, a.e.

log g(x) = ½cx² + κ(x),

where κ has a zero derivative almost everywhere.

g(x) = e^{cx²/2 + κ(x)}.


We require ∫ x g'(x) dx = −1. Since x g'(x) = c x² g(x) a.e., c must be negative. Thus

g(x) = k(x) e^{−x²/2σ²}/(σ√(2π)),

where k(x) has a derivative almost everywhere which is 0. The function k must be non-negative, and must also be determined so that

∫ g(x) dx = 1, ∫ g'(x) dx = 0, ∫ x²g(x) dx = σ².

X will then be regular, and I(X) = 1/σ².

For example, given b > 0, we can determine a, c > 0, such that if

k(x) = 0, |x| < a,

= c, a ≤ |x| ≤ a + b,

= 0, |x| > a + b,

the conditions will be satisfied. For simplicity we take σ = 1. The equations for a, c are

(2c/√(2π)) ∫_a^{a+b} e^{−x²/2} dx = 1,

(2c/√(2π)) ∫_a^{a+b} x²e^{−x²/2} dx = 1. [5.5]

Hence

∫_a^{a+b} x²e^{−x²/2} dx = ∫_a^{a+b} e^{−x²/2} dx.

Integrating the left side by parts gives

a e^{−a²/2} = (a + b)e^{−(a+b)²/2}.

For any given b > 0 this has a unique solution for a. Equation [5.5] then determines c.

With this g, if X has probability density g(x − θ),

E(X) = θ, V(X) = 1, X is regular, I = 1.

The sensitivity is ∞. The family of densities g(x − θ) is not smooth. The Cramer-Rao inequality is

V(X) ≥ 1/I.


Here

V(X) = 1 = 1/I.

The Cramer-Rao lower bound for the variance of an unbiased, regular estimator of θ is attained.

If X₁, X₂, … are i.i.d.r.v. with this distribution,

I(X₁, X₂, …, X_n) = n,

X̄_n = (X₁ + … + X_n)/n is regular, E(X̄_n) = θ, V(X̄_n) = 1/n, the Cramer-Rao lower bound.

In such a case we would not be interested in regular estimators, because there are non-regular estimators which perform much better. For example, if X_{n1} is the least of a sample of n, E(X_{n1}) = θ + q_n, where q_n is a function of n only. Thus X_{n1} − q_n

is an unbiased estimator of θ. It is not regular; but its variance is asymptotically α/n², n → ∞, α constant, and so is less than the Cramer-Rao lower bound when n is sufficiently great. It should be noted that

I(X₁, X₂) = 2, I(X₁ + X₂) = ∞.

Evidently, here I cannot be information. It would seem that the Cramer-Rao inequality is of interest only when the family of probability measures is smooth. When this is so, every statistic with a variance which is bounded for θ in some neighbourhood of θ₀ is regular at θ₀. Further, it seems that I = ∫ f'²/f dμ is of importance only when it is the value of the sensitivity, lim_{θ→θ₀} 4ρ²(P_θ, P_{θ₀})/(θ − θ₀)².

5 Efficacy rating and asymptotic sensitivity rating. In order to avoid unnecessary complications, we shall suppose throughout the remainder of this chapter that the P_θ family is smooth in an open interval N, and we shall consider only values of θ in N. All induced families of distributions will be smooth in N. The statistic S discussed above will have a sensitivity I(S) at θ given by I(S) = E(g'²/g²), where g is the density of the S distribution and g' = ∂g/∂θ, so that

J(S) ≤ I(S) ≤ I.

The efficacy of S is not greater than its sensitivity. We define the efficacy rating of S at θ as J(S)/I, the ratio of its


efficacy to the maximum possible. The statistic L̇₀ has a sensitivity rating 1 at θ₀; its sensitivity at θ₀ is the same as the sensitivity at θ₀ of the original family of probability measures on 𝒳. Its efficacy at θ₀ is the maximum possible, and so its efficacy rating at θ₀ is also 1. For any other statistic a reasonable index of its performance in estimating θ at θ₀ is the square of its correlation coefficient with L̇₀ at θ₀; but this is exactly its efficacy rating at θ₀. Moreover, if S_n is a statistic which is based on n independent values of x, and which is asymptotically normal, then, under certain regularity conditions,

I(S_n)/n → J(S_n)/n as n → ∞,

and so

I(S_n)/(nI) → J(S_n)/(nI),

where I is the sensitivity of the family of probability measures on 𝒳, so that nI is the sensitivity of a sample of n. Thus the sensitivity rating of S_n → the efficacy rating of S_n. The practical value of this result comes from the fact that efficacy is often much easier to compute than sensitivity.

Consider a random variable X with a normal distribution of mean a and standard deviation c, both differentiable functions of θ. The probability density is e^{−(x−a)²/2c²}/[√(2π)c].

L = −(x − a)²/2c² − log c − ½ log 2π.

L̇ = (x − a)a'/c² + (x − a)²c'/c³ − c'/c.

The sensitivity of X is

I(X) = E(L̇²) = a'²/c² + 2c'²/c².

The efficacy of X is a'²/c². If X₁, X₂, … are independent random variables, each with this distribution, X̄_n = Σ_r X_r/n is normal with mean a and standard deviation c/n^{1/2}. The sensitivity of X̄_n is therefore na'²/c² + 2c'²/c², and its efficacy is na'²/c². Thus

I(X̄_n)/n → J(X̄_n)/n as n → ∞.
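As a numerical check of I(X) = a'²/c² + 2c'²/c² (my sketch, not from the text; it assumes SciPy, and the particular choice a(θ) = θ, c(θ) = 1 + θ² is purely illustrative), one can differentiate the log density numerically in θ and integrate L̇² against the density:

```python
import numpy as np
from scipy.integrate import quad

a = lambda t: t              # a(theta) = theta,     a' = 1
c = lambda t: 1.0 + t**2     # c(theta) = 1 + t^2,   c' = 2t
t0, h = 0.5, 1e-5

def logpdf(x, t):
    return -(x - a(t))**2 / (2 * c(t)**2) - np.log(c(t)) - 0.5 * np.log(2 * np.pi)

pdf = lambda x: np.exp(logpdf(x, t0))
score = lambda x: (logpdf(x, t0 + h) - logpdf(x, t0 - h)) / (2 * h)   # L-dot at t0

I_numeric, _ = quad(lambda x: score(x)**2 * pdf(x), -15, 15, limit=200)
I_formula = 1.0**2 / c(t0)**2 + 2 * (2 * t0)**2 / c(t0)**2
print(I_numeric, I_formula)   # both approximately 1.92
```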


Consider now the general case of a sequence X₁, X₂, … of independent, real random variables, each with the same continuous distribution with density f(·, θ) relative to Lebesgue measure. Let Y_n be a function of X₁, …, X_n which has mean a and standard deviation c/√n, a, c being differentiable functions of θ. J(Y_n) = na'²/c².

I(Y_n) ≤ I(X₁, …, X_n) = nI(X₁).

Hence a'²/c² = J(Y_n)/n ≤ I(Y_n)/n ≤ I(X₁).

Suppose that Z_n = n^{1/2}(Y_n − a)/c is asymptotically standard normal for all θ in N.

Y_n = a + cZ_n/n^{1/2}.

If Z_n were exactly standard normal, Y_n would be normal of mean a and standard deviation cn^{−1/2}, and I(Y_n) would be na'²/c² + 2c'²/c². Since Z_n is approximately standard normal for all θ in N, it is reasonable to expect that I(Y_n) = na'²/c² + 2c'²/c² + o(n). If this is so,

I(Y_n)/n → a'²/c² = J(Y_n)/n.

Let us investigate this more thoroughly. Let J,,( . ,e) be the probability density of Yn, and gn( . , e) that of

Zn . We assume that J,,(y, e) is a differentiable function of y, and therefore gn(z, e) is a differentiable function of z.

Put z = .In(y - a)/e.

where, as before, a prime denotes differentiation with respect to e.

-00 -00


Now

so that

-00

Using the easily proved result:

α_n, β_n, γ_n, δ_n all real, ∫β_n², ∫γ_n², ∫δ_n² all → 0 as n → ∞,

∫(α_n + β_n + γ_n + δ_n)² bounded ⇒ ∫(α_n + β_n + γ_n + δ_n)² → ∫α_n²,

we obtain

-00 -00

provided

-00

and

-00

Note that I(Yn)/n ~ I(X 1)' and so is bounded, and that

-00 -00

If further


-00

then

I(YJ J(YJ --~--,

n n the result we are interested in. DOD

First consider the case where the distribution of Y_n depends on θ only through a location parameter. The density function is then expressible in the form

f_n(y, θ) = h_n(y − a),

where the function h_n does not depend on θ; c' = 0.

g_n(z, θ) = n^{−1/2}c h_n(y − a) = n^{−1/2}c h_n(n^{−1/2}cz),

and is therefore the same for all θ. Hence g_n' = 0, and conditions (i) and (ii) are satisfied.

When the distribution of Y_n depends on θ only through a location and a scale parameter, the density f_n(y, θ) is expressible in the form

f_n(y, θ) = (1/c) h_n[(y − a)/c],

where h_n does not depend on θ. It then follows that g_n is the same for all θ, and therefore g_n' = 0, and (ii) is satisfied.

In all cases, Zn has mean ° and unit variance for all e. Its 00

sensitivity is I(ZJ = J g~2 /gndz. Its limit distribution has zero -00

sensitivity, for it is standard normal for all e in N. We may then expect that in general I(Zn) ~ ° as n ~ 00, or at least remains bounded. All that (ii) asks is that I(Zn) = o(n). Conditions (i) and (iii) are related, and are concerned only with the distribution of Yn at the particular value of e being considered. In many cases, not only is Zn asymptotically standard normal, but its density g (z,e) tends to the standard normal density [1/.J(2n)]e- z2 / 2, ~d 8gn/8z ~ - [1/.J(2n)]ze- z2/ 2,so that (8gn/8z)2/gn ~ [l/.J(2n)]z2e-z2 /2. Under certain regularity conditions, we


shall then have

00

f(OgnlOZf d --+_1_ <XlS 2 -z2/2d = 1 z .J(2n) z e z , gn -00

-00

and

-00

The first of these is (iii), and the second is stronger than (i), which requires only

-00

6 As an example, consider the gamma distribution, with density

e^{−x} x^{m−1}/Γ(m), x ≥ 0,

where m > 0 is a differentiable function of θ. X̄_n is the mean of a sample of n.

Thus

L = −x + (m − 1)log x − log Γ(m).

L̇ = m' log x − m'ψ(m), ψ(m) = Γ'(m)/Γ(m).

L̈ = −m'²ψ'(m) + m''[log x − ψ(m)].

I(X) = E(−L̈) = m'²ψ'(m).

I(X̄_n) = I(ΣX_r) = n²m'²ψ'(nm).

I(X̄_n)/n = nm'²ψ'(nm) = nm'²[1/(nm) + 1/(2n²m²) + …] = m'²/m + o(1).

J(X̄_n)/n = J(X₁) = m'²/m.

I(X̄_n)/n → J(X̄_n)/n = m'²/m.
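A quick numerical illustration of this convergence (my sketch; it assumes SciPy's polygamma, and the choice m(θ) = θ, evaluated at θ = 2.5, is hypothetical):

```python
from scipy.special import polygamma

m, mprime = 2.5, 1.0                      # m(theta) = theta, at theta = 2.5
J_over_n = mprime**2 / m                  # efficacy per observation = 0.4
for n in (1, 10, 100, 1000, 10000):
    I_over_n = float(n * mprime**2 * polygamma(1, n * m))   # n m'^2 psi'(n m)
    print(f"n = {n:6d}   I/n = {I_over_n:.6f}   J/n = {J_over_n:.6f}")
# I/n decreases to J/n = m'^2/m = 0.4 as n grows.
```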

If we introduce location and scale parameters α, γ > 0, which are differentiable functions of θ, and consider the distribution with density

e^{−(x−α)/γ}(x − α)^{m−1}/[γ^m Γ(m)], x ≥ α,

we shall still have

I(X̄_n)/n → J(X̄_n)/n = J(X₁),

as before. This is because the distribution of √n(X̄_n − a)/c is unaltered: only a and c are changed. In the previous case a = m, c = √m. Here a = α + mγ, c = √m γ. Thus the residual density g_n is unaltered. It must have satisfied conditions (i), (ii), (iii) before, and so must still satisfy them. Location and scale parameters can always be treated in this way whenever the statistic Y_n has the property

Y_n[(x₁ − α)/γ, (x₂ − α)/γ, …, (x_n − α)/γ] = [Y_n(x₁, x₂, …, x_n) − α]/γ.

X̄_n, Max(X₁, X₂, …, X_n), and Min(X₁, X₂, …, X_n) have this property. Note that in the example just considered

J(X₁) = (α + mγ)'²/(mγ²) = (α' + m'γ + mγ')²/(mγ²).

7 Median of a sample. As an example of a case where the statistic is not the sum of i.i.d. random variables, consider the median of a sample of 2n + 1. Let f be the probability density, and F the distribution function of a continuous probability distribution on the real line. Suppose that the median is at 0, F(0) = ½, and that f is bounded, and is continuous at 0. Consider the family of distributions with densities f(x − θ) at x.

If M_{2n+1} is the median of a sample of 2n + 1, its distribution has density

[Γ(2n + 2)/(Γ(n + 1)Γ(n + 1))] f(x − θ) F(x − θ)^n [1 − F(x − θ)]^n.

L = log f(x − θ) + n log F(x − θ) + n log[1 − F(x − θ)] − log B(n + 1, n + 1).

L̇ = −f'(x − θ)/f(x − θ) + n[2F(x − θ) − 1]f(x − θ)/{F(x − θ)[1 − F(x − θ)]},

where f'(x) = df(x)/dx.

n2n2n + 1)

nn + I)nn + 1) 00

x I [2F(x) - IY[j(x)]3[F(x)]n-2[I - F(x)]n- 2dx -00

+ 0(1), n ~ 00.

Putting y = F(x), dy = f(x)dx, we have

J(M2n+ 1) ~ 4n2n2n + 1) I1 f( )2( _ .1)2 n-2(I --.: )n-2d . 2n+I nn+I)nn+I)o x y 2 Y Y Y

Ifn>2,0<e<i, 1

K, = I (y - i)2yn-2(I_ yt-2dy, = K1 + K 2, o

where

K1 = I (y - ifyn-2(I- yt- 2dy, K2 = I ly-1/21:S' ,< ly-1/21:S 1/2

K1 > I (y - i)2yn-2(I - yt- 2dy ly-1/21 :S,/2

> (i-ie2t- 2 I (y-i)2dy = G-ie2)n-2e3jI2 ly-1/21:S,/2

1

K2 < (i- e2t- 2 I (y- i)2dy < (i- e2t- 2, o

Therefore

Thus K = Kl +K2,....,K1'n~00.

Hence, when n~ 00, the probability distribution Pn on (0,1) with density

(y _ ifyn-2(I _ y)n-2 1

I (y - i)2yn-2(I - yt- 2dy o

converges to the singular distribution with probability 1 at i.


When y ~ i, f(x) ~ f(O). Hence, as n ~ 00,

1

I f(xf(y - i)2yn-2(1 - y)n- 2dy

o 1 = EpJf(X)2] ~ f(0)2.

I (y- i)2yn-2(1- yt- 2dy o

Therefore

J(M2n+1) ~ 4n2n2n+ 1) f(0)2 fcy-i)2yn-2(I- yt-2dy 2n+l nn+ l)nn+ 1) 0

Thus

= 41(Ofn ~ 41(0)2. n-l

J(M 2n+ 1) ~ 4f(0)2 . 2n+ 1

E(M 2n+ 1) = 0 + K(2n + 1), where the function K depends d

on f but not on O. Therefore dO [E(M2n + 1)] = 1, and

J(M2n +1) = IjV(M2n+1)·

2 _ n2n + 2) I1 2,,n(1 _ )nd E(M2n+1 -O) -nn+l)r(n+l)ox y y y

1

_ r(2n + 2) f x 2 (y_.1)2,,n(1 _ )"d - nn + l)nn + 1) (y - if 2 y Y y.

o x

First suppose f(O) > o. y - i = I f(u)du, and so o 1

lim y -"2 = f(O). y->1/2 x

Hence

2 n2n+ 2) E(M2n+1 - 0) ,...., nn+ 1)r(n+ l)f(Of

1

I (y ~ i)2yn(1 - y)ndy, o

1 = 4(2n + 3)f(0)2'


(2n + 1)E(M_{2n+1} − θ)² → 1/[4f(0)²].

I(M_{2n+1})/(2n + 1) ≥ J(M_{2n+1})/(2n + 1) = 1/[(2n + 1)V(M_{2n+1})]

≥ 1/[(2n + 1)E(M_{2n+1} − θ)²].

The first and the last → 4f(0)². Therefore J(M_{2n+1})/(2n + 1) → 4f(0)². When f(0) = 0, I(M_{2n+1})/(2n + 1) → 0, and therefore J(M_{2n+1})/(2n + 1) → 0. Thus, in all cases

lim_{n→∞} I(M_{2n+1})/(2n + 1) = 4f(0)² = lim_{n→∞} J(M_{2n+1})/(2n + 1).

The sensitivity rating of M_{2n+1} is

I(M_{2n+1})/I(X₁, …, X_{2n+1}) = I(M_{2n+1})/[(2n + 1)I(X₁)] → 4f(0)²/I(X₁),

and the asymptotic sensitivity rating of M_{2n+1} is thus 4f(0)²/I(X₁). For the Laplace family with density ½e^{−|x−θ|}, −∞ < x < ∞, I(X₁) = 1, f(0) = ½, and so 4f(0)²/I(X₁) = 1; the asymptotic sensitivity rating is 1. For the Cauchy family 1/{π[1 + (x − θ)²]}, 4f(0)²/I(X₁) = 8/π², and for the normal family [1/(σ√(2π))]e^{−(x−θ)²/2σ²} (σ constant), the value is 2/π.
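These limiting ratings are easy to verify. The sketch below is mine (it assumes NumPy): it recomputes 4f(0)²/I(X₁) for the three families and adds a Monte Carlo check that (2n + 1)V(M_{2n+1}) approaches 1/[4f(0)²] = π/2 in the normal case.

```python
import numpy as np

ratings = {
    "Laplace": 4 * 0.5**2 / 1.0,                       # f(0) = 1/2,         I(X1) = 1
    "Cauchy":  4 * (1 / np.pi)**2 / 0.5,               # f(0) = 1/pi,        I(X1) = 1/2
    "normal":  4 * (1 / np.sqrt(2 * np.pi))**2 / 1.0,  # f(0) = 1/sqrt(2pi), I(X1) = 1
}
for name, value in ratings.items():
    print(f"{name:8s} 4 f(0)^2 / I(X_1) = {value:.4f}")   # 1, 8/pi^2, 2/pi

# Monte Carlo for the normal family: (2n+1) Var(M_{2n+1}) -> 1/(4 f(0)^2) = pi/2.
rng = np.random.default_rng(0)
n, reps = 200, 20000
medians = np.median(rng.standard_normal((reps, 2 * n + 1)), axis=1)
print("(2n+1) Var(median) =", (2 * n + 1) * medians.var(), " vs pi/2 =", np.pi / 2)
```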

8 Sensitivity and efficacy are both indicators of the rate of change of a probability measure. The former applies to distribu­tions on any space; but the latter applies only to distributions on the real line with finite variances. To compute the sensitivity of a statistic, we need to know the density function, but computa­tion of the efficacy requires only knowledge of the mean and variance. Also, the practical significance of efficacy is easier to grasp. For a normal distribution of fixed variance and varying mean, efficacy and sensitivity are equal. This suggests that a statistic of a sample of n, which is asymptotically normal, will have an asymptotic sensitivity rating equal to its asymptotic efficacy rating. Several examples of this are given above, but simple, sufficient conditions for this have yet to be discovered.

The efficacy of a random variable X is always less than or equal to its sensitivity. The two are equal if and only if the correlation coefficient of X with its discrimination rate r is ±1, i.e. if and only if r is a linear function of X. This is true for all θ in N if and only if, in N,

r(x, θ) = a(θ)x + b(θ),

and therefore

L(x, θ) = a₁(θ)x + b₁(θ) + h(x),

where da₁/dθ = a, db₁/dθ = b. The density function of X will then be

exp{a₁(θ)x + b₁(θ) + h(x)}. [5.6]

If X₁, …, X_n are i.i.d. random variables with this distribution, S_n = Σ_{r=1}^{n} X_r is a sufficient statistic for θ, and J(S_n) = I(S_n).

Condition [5.6] is satisfied by the normal, binomial, Poisson, and negative binomial families, shown below with their respective densities:

[1/(σ√(2π))] e^{−(x−θ)²/2σ²}, σ constant;

(n choose y) p^y(1 − p)^{n−y}, p = p(θ);

e^{−λ}λ^y/y!, λ = λ(θ);

(m + y − 1 choose y) p^y(1 − p)^m, p = p(θ), m constant.

In the case of the negative binomial family, if m also is a function of θ, the efficacy is less than the sensitivity; but it can be shown that I(S_n)/n → J(S_n)/n as n → ∞.


CHAPTER SIX

MANY PARAMETERS, THE SENSITIVITY MATRIX

1 We now consider a family of probability measures {P_θ} on 𝒳 with densities f(·, θ) relative to a σ-finite measure μ, where θ is a point in R^k; θ' = (θ₁, θ₂, …, θ_k). We shall write θ₀' = (θ₁₀, θ₂₀, …, θ_{k0}). As before, we shall write f for f(·, θ), and f₀ for f(·, θ₀) where convenient. The partial derivatives with respect to θ_r will be denoted by f_r' and f_{r0}'. L = log f, L̇_r = ∂L/∂θ_r, L̇_{r0} = L̇_r(·, θ₀). Let θ = θ₀ + vl, where l' is a unit row vector, l' = (l₁, l₂, …, l_k),

Σ l_r² = 1, and |v| = |θ − θ₀| is the distance of θ from θ₀, v² = Σ(θ_r − θ_{r0})². If for every fixed l, the one-parameter family f(·, θ₀ + vl) is smooth at v = 0, we shall say that the P_θ family (or f) is smooth at θ₀.

I_rs is the covariance at θ of L̇_r, L̇_s, and I = I(θ) = (I_rs) is the dispersion matrix of L̇₁, L̇₂, …, L̇_k.

Theorem Let N be an open interval in R^k, and l' = (l₁, l₂, …, l_k) a unit row vector. If

(i) at every point in N, f is smooth with respect to each θ_r,

(ii) each partial derivative f_r' is loosely continuous in N, (iii) each I_rr is a continuous function of θ in N, then f is smooth in N, and if θ = θ₀ + vl and θ₀ ∈ N, then the sensitivity with respect to v at θ₀ is l'I(θ₀)l,

lim_{v→0} (1/v²) ∫ (√f − √f₀)² dμ = ¼ ∫ (Σ_r l_r f_{r0}')²/f₀ dμ = ¼ E_{θ₀}(Σ_r l_r L̇_{r0})²

= ¼ l'I(θ₀)l = ½ (d²ρ²/dv²)_{v=0},

where ρ = ρ(f, f₀). Also

(∂²ρ²/∂θ_r∂θ_s)_{θ₀} = ½ E_{θ₀}(L̇_{r0}L̇_{s0}).

Proof. Since f is smooth with respect to θ_r, E_θ(L̇_r) = 0. Hence I_rs = ∫ f_r'f_s'/f dμ. The continuity of I_rs when r ≠ s follows from the


continuity of Irr and Iss be the use ofthe Theorem in Section 1 of the Appendix.

The density fis smooth with respect to er • Hence, as shown in Section 3 of Chapter 3,

8p2 = f~(~f -~;; ?dji = f(~f - ~fo)f;.' dji 8er 8er 0 ~f·

This is a continuous function of 0 in N, because fand f: 2/f are continuous in mean, and

i~f-~foiif:/~fi s 2f+ 2fo+/,,'2/f,

which is continuous in mean. Therefore

dp2/dv = "f.lr8p2/8er = J(~f-~fo)("f.IJ:/~f)dji.

Put V = p2 ; then

f ( ~f ~ ~fo "f.lr [;f Y dji = V/v2 - V'/v + iI'n.

This is similar to [3.4] in the proof of Theorem in Section 3 of Chapter 3, and the argument given there, with an obvious modification for the last one, proves all the results stated above.

□ □ □
The sensitivity of the P_θ family in the direction l at θ₀ is l'I(θ₀)l.

The matrix I = [I_rs] has been called the information matrix; but as shown in the discussion of the case k = 1, this is not a good name. A better name is the sensitivity matrix.

∂ log f/∂θ_r = (1/f) ∂f/∂θ_r.

If the second derivatives of f with respect to the θ_r exist, a.e. μ,

∂² log f/∂θ_r∂θ_s = (1/f) ∂²f/∂θ_r∂θ_s − (1/f²)(∂f/∂θ_r)(∂f/∂θ_s)

= (1/f) ∂²f/∂θ_r∂θ_s − (∂ log f/∂θ_r)(∂ log f/∂θ_s).

E(∂ log f/∂θ_r · ∂ log f/∂θ_s) = −E(∂² log f/∂θ_r∂θ_s) + ∫ (∂²f/∂θ_r∂θ_s) dμ.

The last term is zero if we can differentiate twice under the


integral sign in ∫ f dμ = 1; then

I_rs = −E(∂² log f/∂θ_r∂θ_s).

If (X₁, …, X_n) is a random sample from the distribution with density f(x, θ₁, …, θ_k),

I(X₁, …, X_n) = nI(X₁),

as in the case k = 1.

3 The Cramer-Rao inequality for many statistics and/or many parameters.

We shall assume that the conditions (i), (ii), (iii) of the above theorem are satisfied, so that f is smooth in N. Let S₁, …, S_j be j real statistics which are regular in N with respect to each θ_r; a sufficient condition for this is that their variances are bounded in N. Let M be the dispersion matrix of the j statistics. Put

S' = (S₁, …, S_j), D' = (∂ log f/∂θ₁, …, ∂ log f/∂θ_k),

K = E(DS') = (K_rs), where K_rs = ∂E(S_s)/∂θ_r = E(L̇_r S_s),

C' = (c₁, …, c_j), I = E(DD'), the sensitivity matrix.

Consider the statistic S = Σ_{s=1}^{j} c_s S_s = C'S. If, as before, θ = θ₀ + vl,

dE(S)/dv = Σ_r l_r ∂E(S)/∂θ_r = Σ_r l_r [Σ_s c_s ∂E(S_s)/∂θ_r] = l'KC,

V(S) = C'MC.

The sensitivity in the direction l is l'Il. Writing the Cramer-Rao inequality for S in the form

V(S) ≥ [dE(S)/dv]²/(l'Il),

we obtain

C'MC ≥ (l'KC)²/(l'Il) for all l, C. [6.1]


If we choose l so that Il = λKC, i.e. l = λI⁻¹KC, the right-hand side of [6.1] becomes C'K'I⁻¹KC. Therefore

C'MC ≥ C'K'I⁻¹KC, for all C.

Thus

M − K'I⁻¹K is non-negative. [6.2]

This is the extension of the Cramer-Rao inequality to the many statistics, many parameters case. The above argument shows that this extension simply states that the rate of change in any direction of the mean value of any linear function of the S_s obeys the simple Cramer-Rao inequality. Note that I⁻¹ is replaced by a g-inverse I⁻ if I is singular.

It is easy to show that M - K'I- 1K is the dispersion matrix of the statistics S - K'I-1D, and so is non-negative. This gives a shorter, but less obvious proof of [6.2]. It does not show the relation of the extended inequality to the simple inequality for one statistic and one parameter. It does, however, show what we shall require below, that

[6.3]

if and only if

S - K'I-1D = a, where a = E(S). [6.4]

If [6.4] is true, S is a linear function ofD,

S = GD+a. [6.5]

Conversely, if [6.5] is true,

K' = E(SD') = E(GDD' + aD') = GI.

Hence G=K'I-l, and so [6.4] is true, and therefore [6.3]. For a single statistic S1> M = V(Sl)'

K' = (8E(Sl)/801> ... ,8E(Sl)/80k),

and so

V(Sl) ;::: K'I- 1K.

If j = k, and K and I are both non-singular, M − K'I⁻¹K is non-negative, and so (see Section 4 of the Appendix)

|M| ≥ |K'I⁻¹K| = |K|²|I|⁻¹,

|K|²/|M| ≤ |I|,

with equality if and only if M = K'I⁻¹K, i.e., as shown above, if the S_s are linear functions of the L̇_r.

We define the efficacy of the set of statistics (S₁, …, S_k) as |K|²/|M|. The efficacy rating is defined as |K|²/(|M||I|), the ratio of the efficacy to the maximum possible. The set of statistics D has efficacy rating 1.

In order to get some idea of the meaning of efficacy in this multi-parameter case, we diagonalize the matrices M and K by suitable linear transformations on the statistics S and the parameters (). Let G be an orthogonal matrix such that GMG' is diagonal. Put

S* = (ST, ... ,S:) = S'G'

The dispersion matrix of the statistics S* is M* = GMG'. Put

4>' = (4) 1 , .•• , CPk) = (),B,

where B is determined later. The matrices corresponding to D,I,K will be denoted by D*,I*,K* respectively.

Hence

If we take

D = BD*,

I = E(DD') = E(BD*D*'B') = BI*B'.

K = E(DS') = E(BD*S*'G) = BK*G.

K* = B- 1KG'.

KG' B=-­

IKI1/k'


IBI = 1,K* = IKI1/k1k,IKI* = IKI,

oE(S:) = IKI1/k OCPr '

oE(S:) s =1= r,~= O.

The efficacy of S: with respect to CPr is IKI 2/k/V(S:), and

IKI2 IK*12 k •

IMI = IM*I = rG (efficacy of S: WIth respect to CPr)'

CHAPTER SEVEN

ASYMPTOTIC POWER OF A TEST
ASYMPTOTIC RELATIVE EFFICIENCY

1 Asymptotic power of a consistent test. Let θ be a real parameter of a probability distribution. Suppose that a test of the hypothesis H₀: θ = θ₀ against the hypothesis H₁: θ > θ₀ is that H₀ is rejected if T_n > K_n, where T_n is a statistic, and n is the sample number. Suppose that the size of the test is exactly or approximately α, and in the latter case → α as n → ∞, so that

P_{θ₀}(T_n > K_n) = α_n → α as n → ∞.

The power function is

β(n, θ) = P_θ(T_n > K_n).

The test is said to be consistent if for every θ > θ₀, β(n, θ) → 1 as n → ∞. Suppose that the test is consistent; let us investigate the behaviour of β(n, θ) for large values of n. Power functions are usually difficult to evaluate, and we mostly have to be content with approximations based on limit results. For fixed θ > θ₀, β(n, θ) → 1 as n → ∞. Hence to get a limit value less than 1, we must consider a sequence (θ_n) of θ values such that θ_n > θ₀ and → θ₀ as n → ∞. We shall try to determine this sequence so that, as n → ∞, β(n, θ_n) tends to a given limit between α and 1.

We shall assume that for some h > 0 and for θ₀ ≤ θ ≤ θ₀ + h, c > 0, a(θ), w(n) exist, such that as n → ∞, w(n) → 0, and the θ₀ distribution of

[T_n − a(θ₀)]/[c w(n)]

and the θ_n distribution of

[T_n − a(θ_n)]/[c w(n)]

both tend to a distribution with a continuous distribution function F.


We also assume that at θ₀, a(θ) has derivative a'(θ₀) > 0.

P_{θ₀}(T_n > K_n) = P_{θ₀}{[T_n − a(θ₀)]/[c w(n)] > k_n} → 1 − F(k),

where 1 − F(k) = α, and

k_n = [K_n − a(θ₀)]/[c w(n)] → k as n → ∞.

β(n, θ_n) = P_{θ_n}(T_n > K_n) = P_{θ_n}{[T_n − a(θ_n)]/[c w(n)] > k_n − λ_n} → 1 − F(k − λ_n),

where

λ_n = [a(θ_n) − a(θ₀)]/[c w(n)] ~ a'(θ₀)(θ_n − θ₀)/[c w(n)].

In order that β(n, θ_n) → a limit < 1, λ_n must → a finite limit λ. If we take θ_n − θ₀ ~ λ c w(n)/a'(θ₀), λ_n will → λ, and β(n, θ_n) → 1 − F(k − λ). Thus

lim_{n→∞} β[n, θ₀ + λ c w(n)/a'(θ₀)] = 1 − F(k − λ).
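As a concrete illustration of this limit (my sketch, assuming NumPy and SciPy; the one-sided test of a normal mean with T_n = X̄_n, a(θ) = θ, c = 1 and w(n) = n^{−1/2} is a hypothetical example), the simulated power at θ_n = θ₀ + λ/√n settles near 1 − Φ(z_α − λ):

```python
import numpy as np
from scipy.stats import norm

alpha, lam, theta0 = 0.05, 2.0, 0.0
z_alpha = norm.ppf(1 - alpha)
rng = np.random.default_rng(1)

for n in (20, 100, 500):
    theta_n = theta0 + lam / np.sqrt(n)
    x = rng.normal(theta_n, 1.0, size=(10000, n))
    reject = x.mean(axis=1) > theta0 + z_alpha / np.sqrt(n)   # size-alpha critical region
    print(f"n = {n:4d}   simulated power = {reject.mean():.3f}")
print("limit 1 - F(k - lambda) =", round(1 - norm.cdf(z_alpha - lam), 3))   # about 0.639
```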

Note that if a'(θ₀) = 0, and θ_n − θ₀ = O[w(n)], n → ∞, then λ_n → 0, and β(n, θ_n) → 1 − F(k) = α. If a'(θ₀) = 0, and the second derivative a''(θ₀) exists and is > 0,

λ_n ~ a''(θ₀)(θ_n − θ₀)²/[2c w(n)].

Hence if (θ_n − θ₀)² ~ 2λ c w(n)/a''(θ₀), λ_n → λ and β(n, θ_n) → 1 − F(k − λ).

In cases encountered in practical applications, w(n) is a decreasing function of n which is regularly varying at ∞, i.e. for every b > 0,

lim_{n→∞} w(bn)/w(n) = b^γ.

The constant γ is called the exponent of regular variation. Since here w(n) is a decreasing function of n, γ must be negative or zero. By far the most important case is w(n) = n^{−1/2}. Others of some importance are w(n) = n^{−1}, and w(n) = (n log n)^{−1/2}.


2 If we have two tests of the same hypothesis at the same level α, and for the same power with respect to the same alternative, the first test requires a sample of n₁, and the second a sample of n₂, we may define the relative efficiency of the second test with respect to the first as n₁/n₂.

We define the asymptotic relative efficiency, ARE, of the second with respect to the first as

lim_{n₂→∞} n₁/n₂ when β₁(n₁, θ_{n₁}) ~ β₂(n₂, θ_{n₂}), and θ_{n₁} − θ₀ ~ θ_{n₂} − θ₀,

where β₁, β₂ are the power functions. Suppose that the tests, both of asymptotic size α, are based on the statistics T_n, V_n, and that as n → ∞, the θ_n distributions of

[T_n − a₁(θ_n)]/[c₁ w(n)] and [V_n − a₂(θ_n)]/[c₂ w(n)]

have the same limit distribution with continuous distribution function F, where w(n) is regularly varying at ∞ with exponent −m. Then

lim_{n→∞} β₁[n, θ₀ + λ c₁ w(n)/a₁'(θ₀)] = 1 − F(k − λ) = lim_{n→∞} β₂[n, θ₀ + λ c₂ w(n)/a₂'(θ₀)].

For the same limit of power with θ_{n₁} − θ₀ ~ θ_{n₂} − θ₀, the sample sizes n₁, n₂ must be related by

c₁ w(n₁)/a₁'(θ₀) ~ c₂ w(n₂)/a₂'(θ₀).

Hence

[a₂'(θ₀)/c₂]/[a₁'(θ₀)/c₁] ~ w(n₂)/w(n₁) ~ (n₂/n₁)^{−m},

so that

n₁/n₂ → {[a₂'(θ₀)/c₂]/[a₁'(θ₀)/c₁]}^{1/m},

which is the ARE.

In the most important case, the limit distribution is normal, F = Φ, the standard normal distribution function, Φ(u) = [1/√(2π)] ∫_{−∞}^{u} e^{−x²/2} dx. a(θ) = E_θ(T_n). The θ_n variance of T_n is asymptotically equal to c²/n, w(n) = n^{−1/2}, m = ½. The asymptotic relative efficiency is

[a₂'(θ₀)²/c₂²]/[a₁'(θ₀)²/c₁²].
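For example (an illustrative sketch of mine, not from the text), the classical value ARE = 2/π for the sample median relative to the sample mean under a normal shift of location comes straight out of this formula, since both statistics are asymptotically normal with a'(θ₀) = 1 and c equal to the respective asymptotic standard-deviation scale:

```python
import numpy as np

sigma = 1.0
f0 = 1.0 / (sigma * np.sqrt(2 * np.pi))   # density of X - theta at its median

# Test 1: sample mean,   asymptotically N(theta, sigma^2/n):    a_1' = 1, c_1 = sigma
# Test 2: sample median, asymptotically N(theta, 1/(4 f0^2 n)): a_2' = 1, c_2 = 1/(2 f0)
a1p, c1 = 1.0, sigma
a2p, c2 = 1.0, 1.0 / (2 * f0)

ARE = (a2p / c2)**2 / (a1p / c1)**2
print(ARE, 2 / np.pi)   # both 0.6366...; the median needs about 1.57 times the sample
```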

The asymptotic normality of Tn is often easily established by using the following theorem, which, in fact, enables us to deal with a larger class of alternative hypotheses.

3 Let g(·, θ), θ ∈ N (an open interval containing θ₀), be a family of probability densities relative to a measure ν on the real line. Let Z_nr, r = 1, …, n, be n independent random variables with probability densities g(·, θ_nr) relative to ν, and let T_n = Σ_{r=1}^{n} Z_nr/n.

Denote by a(θ) and σ²(θ) the mean and variance of a random variable with probability density g(·, θ) relative to ν. Put

a₀ = a(θ₀), a_nr = a(θ_nr), b_n = max_r |a_nr|,

ā_n = Σ_{r=1}^{n} a_nr/n, θ̄_n = Σ_{r=1}^{n} θ_nr/n,

σ₀² = σ²(θ₀) > 0, σ_nr² = σ²(θ_nr), σ̄_n² = Σ_{r=1}^{n} σ_nr²/n.

Theorem If

(i) g → g₀ as θ → θ₀,

(ii) ∫ z²g(z, θ) dν is a continuous function of θ at θ₀,

(iii) max_r |θ_nr − θ₀| → 0 as n → ∞,

then

(T_n − ā_n)/(σ₀ n^{−1/2}) is asymptotically standard normal. [7.1]

If further

(iii') max_r |θ_nr − θ₀| = O(n^{−1/2}), n → ∞,

(iv) a(θ) has a finite derivative a'(θ₀) at θ₀,

then ā_n in [7.1] may be replaced by a₀ + (θ̄_n − θ₀)a'(θ₀), so that

(T_n − a₀)/(σ₀ n^{−1/2}) − (θ̄_n − θ₀)a'(θ₀)/(σ₀ n^{−1/2})

is asymptotically standard normal.

Proof. It follows from (i) and (ii) and the Theorem of Section 1 of the Appendix that g(z, θ), zg(z, θ), z²g(z, θ) are continuous in mean at θ₀. Therefore σ_nr² → σ₀² uniformly with respect to r as n → ∞. Hence σ̄_n² → σ₀². Also b_n → |a₀|.

Since σ̄_n/σ₀ → 1, we have to show that Σ_{r=1}^{n} (Z_nr − a_nr)/(n^{1/2}σ̄_n) is asymptotically standard normal. This will be so if the Lindeberg condition is satisfied, namely, for every ε > 0,

W_n = [1/(nσ̄_n²)] Σ_{r=1}^{n} ∫_{|z−a_nr| > εn^{1/2}σ̄_n} (z − a_nr)² g_nr dν → 0 as n → ∞.

Now |z − a_nr| ≤ |z| + b_n. Hence

W_n ≤ [1/(nσ̄_n²)] Σ_{r=1}^{n} ∫_{|z|+b_n > εn^{1/2}σ̄_n} (2z² + 2b_n²) g_nr dν

≤ (2/σ̄_n²) max_r ∫_{|z|+b_n > εn^{1/2}σ̄_n} (z² + b_n²) g₀ dν + (2/σ̄_n²) max_r ∫ (z² + b_n²)|g_nr − g₀| dν.

The first term → 0 as n → ∞. The second term also → 0 as n → ∞, because g(z, θ) and z²g(z, θ) are continuous in mean at θ₀. This proves the first part of the theorem. This result, of course, includes the result that (T_n − a₀)/(σ₀ n^{−1/2}) is asymptotically standard normal when θ_nr = θ₀ for all r, n.


When (iii') and (iv) are true,

ā_n − a₀ = Σ_{r=1}^{n} (a_nr − a₀)/n = Σ_{r=1}^{n} (θ_nr − θ₀)[a'(θ₀) + ε_nr]/n,

where max_r |ε_nr| → 0 as n → ∞. Thus

ā_n − a₀ = (θ̄_n − θ₀)a'(θ₀) + ε_n max_r |θ_nr − θ₀|, |ε_n| ≤ max_r |ε_nr|.

The last term is o(n^{−1/2}), and so → 0 after division by σ₀ n^{−1/2}.

□ □ □
In order to simplify the exposition, we have introduced the family g(·, θ) of probability density functions of the distributions on the real line; but we may not know g. Often we start with a space 𝒳 on which there is a family of probability measures with densities f(·, θ) relative to a measure μ, and Z is a known random variable on 𝒳. Denoting its probability density relative to the measure ν on the real line by g(·, θ), we can verify conditions (i) and (ii) without knowing g. Condition (i) is equivalent to g being convergent in mean at θ₀. This will be so if f is convergent in mean there. Condition (ii) simply states that Z has a second moment which is continuous at θ₀. This can be determined from a knowledge of f and Z only, without knowing g.

4 Let g be a probability density relative to Lebesgue measure on the real line, G the corresponding distribution function. Suppose that G(0) = 0, g(0) > 0, and that g has right-hand continuity at 0. Consider the family of probability measures with densities g(x − θ) at x.

Let T_n be the least member in a sample of n.

P_θ(T_n − θ > x) = [1 − G(x)]^n.

P_θ[(T_n − θ)ng(0) > x] = {1 − G[x/(ng(0))]}^n.

When n → ∞, x/(ng(0)) → 0, and G[x/(ng(0))] ~ [x/(ng(0))]g(0) = x/n. Therefore

{1 − G[x/(ng(0))]}^n ~ (1 − x/n)^n → e^{−x}.

Thus

P_θ[(T_n − θ)ng(0) ≤ x] = 1 − {1 − G[x/(ng(0))]}^n → 1 − e^{−x}, x ≥ 0.

Here a(θ) = θ, c = 1/g(0), w(n) = n^{−1}.

For testing θ = θ₀ against θ > θ₀, at level α = e^{−k},

β[n, θ₀ + λ/(ng(0))] → e^{λ−k}, λ < k; → 1, λ ≥ k.
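A small simulation (mine, assuming NumPy; the standard exponential shifted by θ, so that g(0) = 1, is just a convenient concrete case) reproduces the limiting power e^{λ−k} of this test based on the sample minimum:

```python
import numpy as np

k, lam, theta0 = 3.0, 1.5, 0.0            # level alpha = e^{-3} ~ 0.0498
rng = np.random.default_rng(2)

for n in (10, 100, 500):
    theta_n = theta0 + lam / n
    samples = theta_n + rng.exponential(1.0, size=(20000, n))
    power = (samples.min(axis=1) > theta0 + k / n).mean()   # reject when T_n > theta_0 + k/n
    print(f"n = {n:4d}   power = {power:.4f}")
print("limit e^(lambda - k) =", round(float(np.exp(lam - k)), 4))   # 0.2231
```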


CHAPTER EIGHT

MAXIMUM LIKELIHOOD ESTIMATION

1 General results. We consider a family of probability measures on 𝒳 dominated by a σ-finite measure μ. The corresponding density functions are f(·, θ), θ ∈ Θ, where Θ is a set in R^k. We shall denote the likelihood function for a set of n observations by f_n, so that

f_n(x₁, x₂, …, x_n, θ) = Π_{r=1}^{n} f(x_r, θ).

For any set A which intersects Θ, we write

f*(x, A) = sup{f(x, θ); θ ∈ AΘ},

f_n*(x₁, x₂, …, x_n, A) = sup{f_n(x₁, x₂, …, x_n, θ); θ ∈ AΘ}.

For the proof of the main theorem we need the following three lemmas.

Lemma 1 If f is a probability density relative to a measure μ, and if g is a density or a subdensity, i.e. g ≥ 0, ∫ g dμ ≤ 1, and if ρ(f, g) > 0, then E_f log(g/f) < 0.

Proof. If z > 0, log z ≤ z − 1, with equality if, and only if, z = 1. Therefore

E_f log(g/f) = ∫ log(g/f) f dμ < ∫ (g/f − 1) f dμ ≤ ∫ (g − f) dμ = 0. □ □ □

Lemma 2 For any set A which intersects Θ,

E_{θ₀} log [f_{r+1}(·, θ₀)/f_{r+1}*(·, A)] ≥ [(r + 1)/r] E_{θ₀} log [f_r(·, θ₀)/f_r*(·, A)],

when the right hand side exists, finite or infinite.


Proof.

f_{r+1}(x₁, x₂, …, x_{r+1}, θ)^r = Π f_r(x₂, x₃, …, x_{r+1}, θ),

where the factors in the product are the r + 1 likelihoods of r out of x₁, x₂, …, x_{r+1}. Hence

f_{r+1}*(x₁, x₂, …, x_{r+1}, A)^r ≤ Π f_r*(x₂, x₃, …, x_{r+1}, A).

Hence

E_{θ₀} log [f_{r+1}(·, θ₀)/f_{r+1}*(·, A)] ≥ [(r + 1)/r] E_{θ₀} log [f_r(·, θ₀)/f_r*(·, A)],

if the right-hand side exists. □ □ □

Corollary If for r = m,

E_{θ₀} log [f_r(·, θ₀)/f_r*(·, A)] > 0 (alternatively > −∞),

then this is true for all r ≥ m.

Lemma 3 If for a set A which intersects Θ, and for some value of r,

E_{θ₀} log [f_r(X₁, …, X_r, θ₀)/f_r*(X₁, …, X_r, A)] > 0, [8.1]

then, with probability one,

f_n(X₁, …, X_n, θ₀) > f_n*(X₁, …, X_n, A) [8.2]

when n is great.

Proof. To simplify the printing let us denote

log [f_{s−r+1}(X_r, …, X_s, θ₀)/f_{s−r+1}*(X_r, …, X_s, A)]

by W(r, s), r ≤ s.

Note that if r S s < t,

ft-r+ l(Xr"" ,XI' e) = fs-r+ 1 (Xr,··· 'Xs' e)ft-ixs+ l' ... ,XI' e),

and therefore

Hence W(r, t) ;:;::: W(r, s) + W(s + 1, t). Suppose [8.1 J is true when r = m, then from the corollary to

Lemma 2 it is true for r;:;::: m. When n;:;::: 2m, there are positive integers u, v such that

n = vm + u, m S u < 2m.

v-1

W(l,n) ;:;::: W(l, u) + L W(rm + u + 1,rm + u + m). r=O

Since u;:;::: m, for fixed u, the first term on the right has a mean value > 0, and therefore is > - 00 with probability one. The v random variables under the summation sign are independent, and each is distributed like W(1, m) with a positive mean value. It follows from the strong law of large numbers that, with pro­bability one, their sum ---+ 00 as v ---+ 00. Hence, for fixed u,

lim W(l, vm + u) = 00, a.s. v~oo

This is true for each of the m values of u, and therefore

lim W(l,n) = 00, a.s. n~oo

Hence

!,,(X 1 , ... ,Xn' eo)/fn*(X1 , ... ,Xn' A) ---+ 00, a.s.,

and so [8.2J follows. We shall say that a set A is inferior to eo if it satisfies [8.2]. The union of a finite number of sets inferior to eo is inferior to eo.

DDD Theorem Let X l' X 2' ... ,X n be i.i.d. random elements each with probability density f(x, eo) at x. We assume that


(i) if θ ≠ θ₀, ρ(f, f₀) > 0;

(ii) for each x the density f is an upper semi-continuous function of θ in Θ, i.e. if φ ∈ Θ,

lim_{h→0} sup{f(x, θ); |θ − φ| < h} = f(x, φ).

If H is a compact subset of Θ which contains θ₀, and if for some value of r

(iii) E_{θ₀} log [f_r(·, θ₀)/f_r*(·, H)] > −∞,

then θ̂_n ∈ H exists, such that

f_n(x₁, x₂, …, x_n, θ̂_n) = f_n*(x₁, x₂, …, x_n, H),

and with probability one, θ̂_n → θ₀ as n → ∞. If in addition,

(iv) E_{θ₀} log [f_r(·, θ₀)/f_r*(·, H^c)] > 0,

then with probability one, at θ̂_n the likelihood function has a global maximum, when n is great:

f_n(x₁, x₂, …, x_n, θ̂_n) = f_n*(x₁, x₂, …, x_n, Θ).

Some sort of continuity condition on f is necessary. While this is not the main reason for imposing it, condition (ii) does result in the supremum of f_r(x₁, …, x_r, θ) in any compact set H ⊂ Θ being attained for some θ ∈ H. In practical cases the satisfaction of this condition can usually be achieved by suitable definition of f(x, θ) at points of discontinuity with respect to θ.

Condition (iii) rules out densities like e^{θ−x}(x − θ)^{−1/2}/Γ(½), which have infinities whose position varies with θ. Such cases need special treatment, and are hardly likely to be met in practice.

A sufficient condition for (iii) is

(iii') h(x₁, …, x_r)f_r(x₁, …, x_r, θ) bounded for all θ in H and all x₁, …, x_r,

and

E_{θ₀} log [h(X₁, …, X_r) f_r(X₁, …, X_r, θ₀)] > −∞,

where h is a positive function of x₁, …, x_r only.


Often h == 1. Suppose

then h(x 1 , ... ,xr)J,(x 1 , ... ,xr,8):S:; C,

h(x1 , ••. ,xJf..*(Xl' ... ,xr,H):S:; C.

Eoo log f;~ ~ ,8;]) = Eoo log [h J(', 80)] - Eeo log [h 'J,*(', H)]

> - 00 - log C.

Sufficient conditions for (iii) and (iv) are (iii") and (iv').

(iii") For some r, E log J,(', 80) > - 00 eo f,.*(',8) .

(iv') There exists an expanding sequence (Hm) of compact sets in 8 such that for some r and for almost all x,

J,*(x1 , ... , x r ' H~) --> K(X 1 , ... , x r ), as m --> 00,

where

J Kdflr :s:; 1, and p(K,J,(', 80 )) > O. Often K == O.

The condition (iii") implies (iii), and when (iii") is true,

E 1 J,C,80 )

- 00 < 80 og J,*(' , H"m)"

log{J,C,80 )/J,*(-,H"m)} t log [J,(·,80 )/K] whenmtoo.

Hence

E log f,.C,8 0 ) t E logJ,(·,80 ) 80 F*(. He) eo ,

Jr , m K

which is > 0 by Lemma 1. Thus (iv) will be true for Hm when m is sufficiently great. Clearly 80 will be an element of such an Hm'

Proof. Let H be a compact subset of 8 which contains 80 and at least one other point of 8. For any h> 0, let N h(¢) denote the open ball in Rk with the centre ¢ and radius h. Take ho > 0 and sufficiently small so that K = H - N ho(80 ) is not empty. K is compact.

Suppose (iii) is true when r = m. If ¢EK,

E 1 fmC, 80 ) < E 1 fmC, 80 )

- 00 < eo ogfm*(.,H) - 80 ogf:[',Nh(¢)K]'


It follows from the upper semi-continuity ofj(x, .) at cp that

f![x w ·· ,xm,Nh(cp)K] ~ fm(x 1 , ••. ,xm' cp) as h ~ 0.

Therefore, when h -+ 0,

I fm( . , °0) I fmL °0) ° Eoo ogf:[.,Nh(cp)K] -+Eoo og fm(.'CP) > .

Hence when h > ° is sufficiently small,

Eoo 10g{fmL 0o)/f:[·, Nh(CP)K]} > 0,

and so there is an open ball S(cp) with centre cp such that

E I fmL °0) ° 00 ogf:[.,S(cp)K] > .

S( cp)K is inferior to °0. Every point of K is the centre of such a ball. The set of open balls covers the compact set K, and therefore

t

a finite subset, say (S l' S 2' ... ,St) covers K. K = U SrK is inferior 1

to 00. Thus with probability one, when n is great, fn(Xl' ... 'Xn ' 0)

will attain its maximum in H at a point (or points) l!t. in Nho(Oo). Since ho can be arbitrarily small, this means that On -+ 00 with probability one.

When condition (iv) is satisfied, the set He is inferior to 00' and so, with probability one, when n is great, the maximum in H will be a global maximum. The maximum likelihood estimator (MLE) is consistent with probability one.

DDD

2 Location and scale parameters. As an application of the theorem we shall consider location and scale parameters. Here θ = (a, c), c > 0, θ₀ = (a₀, c₀), and f(x, θ) = c⁻¹g[(x − a)/c], where g is a probability density relative to Lebesgue measure on R¹.

X₁, X₂, … are independent random variables each with probability density f(·, θ₀).

Theorem I Let H be the compact set

H = {(a, c); −A ≤ a ≤ A, c₁ ≤ c ≤ c₂}


where ° < A < 00, 0< c1 < c2 < 00 and 00 = (ao,co)EH. If g is bounded, and upper semi-continuous, and has the property that K > 0, A ~ 1 exist such that

g(Y):=;;Ag(X) if y:=;;x:=;;-K, or ifK:=;; x:=;; y,

then On' the local MLE in H,for a sample ofn, -+ 00' a.s. asn -+ 00.

Proof. When x > A + Kc2 , (x - a)/c ~ (x - A)/c2 > K,

f( 0) _ 1 (x - a) A (x - A) AC2 1 (x - A) x, --g -- :=;;-g -- :=;;--g -- , c c C C2 C1 c2 C2

and therefore f*(x,H):=;;kf(x,Ol)' where k=AC2 /C 1 > 1, 01 = (A,c2 )·

A+Kc2 A+Kc2

A+Kc2 A+Kc2

> - 00 -logk,

by Lemma 1 and the fact that k > 1. We can show similarly that

-A-Kc2 f f(x, 00) log f*(x, H/(x, 0o)dx > - 00 .

-00

Now consider A+Kc2

J= f f(x, °0) log f*(x, H) f(x, ° o)dx

-A-Kc2

A+Kc2 A+Kc2 f logf(x, 0o)f(x, °o)dx - f logf*(x, H) f(x, ° o)dx.

-A-Kc2 -A-Kc2

u log u has a minimum - e- 1 at u = e- 1. Hence the first


integrand;::: - e- 1. The function g is bounded, say g(x)::::; b.

f(x,e) = c- 1g[(x - a)/cJ ::::; b/c1; f*(x,H)::::; b/c1·

Hence

Thus

-00

Condition (iii) is satisfied, and the theorem is proved. DOD

Theorem II If for every α > 0, |x|^{1+α}g(x) → ∞ as x → ∞, or as x → −∞, no global MLE exists.

Proof. Suppose that, for any α > 0, x^{1+α}g(x) → ∞ as x → ∞. Let k ≠ 0 be such that g(k) > 0.

f_n(x₁, …, x_n, a, c) = (1/cⁿ) Π_{r=1}^{n} g[(x_r − a)/c].

When n > 1, the factor between the braces → ∞ as c → 0, whether x_r = x₁ or not. Hence f_n(x₁, …, x_n, a, c) → ∞, and so no global MLE exists. □ □ □

Consider

g(x) = 1/{2(1 + |x|)[1 + log(1 + |x|)]²}, −∞ < x < ∞.

For any α > 0, x^{1+α}g(x) → ∞ when x → ∞. Thus there is no global MLE for the parameters a, c. However, the conditions of Theorem I are satisfied, and the local MLE in a compact set H containing θ₀ will → θ₀ a.s. as n → ∞.
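The failure of the global MLE can be seen numerically. In the sketch below (my illustration, assuming NumPy; the three sample values and the path a = x₁ − c are hypothetical choices), the log likelihood along a = x₁ − c eventually increases without bound as c → 0, which is why only the local MLE over a compact set H is useful for this family.

```python
import numpy as np

def log_g(x):
    # g(x) = 1 / ( 2 (1+|x|) (1 + log(1+|x|))^2 )
    ax = np.abs(x)
    return -np.log(2.0) - np.log1p(ax) - 2.0 * np.log1p(np.log1p(ax))

def loglik(x, a, c):
    return float(np.sum(log_g((x - a) / c)) - len(x) * np.log(c))

x = np.array([0.3, -1.7, 2.2])            # a small illustrative sample
for c in (1.0, 1e-4, 1e-16, 1e-64, 1e-200):
    a = x[0] - c                          # keeps (x_1 - a)/c = 1, where g(1) > 0
    print(f"c = {c:8.0e}   log L(a = x_1 - c, c) = {loglik(x, a, c):10.2f}")
# The values keep growing (roughly like log(1/c) minus smaller log-log terms),
# so the likelihood has no global maximum over all (a, c).
```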

Theorem III If g is bounded and upper semi-continuous, and for some α > 0, |x|^{1+α}g(x) is bounded, then with probability one, a global MLE θ̂_n exists when n is great, and θ̂_n → θ₀ as n → ∞.

Proof. Suppose that g(x) and |x|^{1+α}g(x) are both < b. It will then be true that if 0 < m ≤ 1 + α, g(x) < b/|x|^m.

Let Xl be the Xy nearest to a, and let z = min I Xy - XS I. If r> 1, y<s

Hence I Xy - a I ;::: ~ z, r > 1.

(X - a) bcn/(n- 1)

g -y-c- < Ixy _ aln/(n 1) ifn/(n -) < 1 + a, i.e. n > 1 + l/a,

bcn/(n-1) ::::; (~zt/(n-1)' r> 1.

Therefore

f,,*(x 1, ... ,xn,0)::::; 2nbn/~.

Eoo logfn*(X1, ... ,Xn,0)::::; nlog(2b)-nEoologZ < 00, [8.3J

because Eoo log Z > - 00; see Lemma 4. It is easy to show by differentiation that if u > 0, m > 0, then

log u ;::: (1 - u-m)/m.

Therefore

log g(x) ;::: [1 - g(x)-mJ/m. Hence

00 00

S logg(x)'g(x)dx>-ooif S g(x)l-mdx<oo. -00 -00

b b1 - m

g(x) < Ixl1+a' g(x)l-m < Ixl(1+a)(l-m) ifm < 1.


Take m = ilX/(1 + IX), (1 + 1X)(1 - m) = 1 + ilX

b1 - m

( ) l-m < d I ()l-m b1 - m g X Ixl1+"'/2 ,an a so g x < .

Hence

00 00

J g(x)l- mdx < 00, and J logg(x)·g(x)dx > - 00, -00

00

f [1 (x -ao)] 1 (x -ao) log -g -- -g -- dx Co c 2 Co Co

-00

00

= J log [g(x)/co]g(x)dx > - 00. - 00

From [8.3] and [8.4] it follows that

[8.4]

E I fnC,8o)_ E I f(·8)- 1 *.Q 80 ogfn*(·, 8) - n 80 og '0 E80 ogfn ( ,\Y) > - 00.

Condition (iii") is satisfied. We now need to show that (iv') is satisfied.

ifr > 1.

I (Xr - a) b -g -- <-, c c c

bnc(n-1)",-1 fn(xl' ... ,xn,a,c) < C!z)(n-1)(1+",)·

When (n - 1)1X > 1, i.e. n> I + 1/1X, this will -+ 0 uniformly with respect to a as c -+ 0 if z =1= 0, i.e. except on a set of Lebesgue measure zero.

I (Xr - a) b -g -- <-, c c c


and so -+ 0 uniformly with respect to a as c -+ 00.

1 (Xr - a) bc b -;; g -c- < c I xr - a I = I xr - a I '

and therefore → 0 uniformly with respect to c as |a| → ∞. Hence f_n(x₁, …, x_n, a, c) → 0 uniformly with respect to a as c → 0 or ∞, and → 0 uniformly with respect to c as |a| → ∞. Condition (iv') will be satisfied. We have shown above that condition (iii'') is satisfied, and so the theorem is proved.

□ □ □
Lemma 4 If X₁, X₂, …, X_n are i.i.d.r.v. with a bounded density function relative to Lebesgue measure on the real line, and Z = min_{r<s} |X_r − X_s|,

then E log Z > −∞.

Proof. If a random variable U has a bounded density h relative to Lebesgue measure on the real line,

1

J log lulh(u)du is finite. -1

Therefore 00

EloglUI = J loglulh(u)du > - 00. -00

X,., X s' r < s, are independent random variables with a bounded density, and therefore IX r - Xsi has a bounded density. Hence EloglXr - Xsi > - 00.

F or a random variable W, define

Obviously

W- = W,ifW < 0

= 0, ifW;:::: o.

E(W) > - 00 <o>E(W-) > - 00.

Let W1 , •.. , Wk be random variables, each with a mean value > - 00.


k

E(min Wr) ~ LE(Wr-) > - 00. r 1

10gZ = minloglXr - X,I, r > s,

EloglXr - X,I > - 00, and so ElogZ > - 00.

DDD

3 Discrete probability space. Although the theorem in Section 1 of this chapter applies to all distributions, we can get a stronger and more useful result by starting afresh, and using the special properties of a discrete probability space. This investigation points to the great difference between continuous probability distributions and discrete probability distributions. From the point of view of experiment, the former are unreal. This section is based on Hannan (1960): the result obtained is a slight exten­sion of Hannan's result.

The probability space is countable, with points Xl'X2 ' ....

We take the counting measure as the dominating measure fl. Consider the family of probability measures {p} with densities {p}, and suppose that the actual probability measure is PoE{P}, with density Po. Denote the observations by Y1'Y2' ....

We shall consider estimating the probability measure Po itself rather than parameters which determine it. For a sample of n, the MLE would be the probability measure Pn which

n

maximizes L log p(yJ To avoid considering compactness, we r=l

shall consider a more general estimator. The probability measure Pn' with density Pn' is a practical likelihood estimator (PLE) if for some positive sequence (Gn) such that Gn/n -+ 0 as n -+ 00,

In particular, Gn might be constant. It follows from [8.5J that n n

I logpn(yr) ~ IlogPO(Yr)-Gn, r= 1 r= 1

1 ~ 1 PO(Yr) < / - L.. og-(y) - Gn n, n r=l Pn r


and therefore . 1 ~ Po(Yr) < 0 hm sup- L.. log-(y) - .

n .... 00 n r = 1 Pn r

[8.6J

A sequence (Pn) of probability measures with densities (Pn) for which [8.6J is tme will be called a regular likelihood estimator (RLE). Every PLE is an RLE.

The empirical probability distribution of the sample Y1'Y2' ... 'Yn will be denoted by P: with density P:· If h is a function on PI,

n

L h(Yr)/n = LP:(X)h(x), r=l x

which we shall write as LP:h. If EPoh is fInite, LP:h -+ EPah a.s. as n -+ 00, by the strong law of large numbers. Note that [8.6J may be written

lim sup Ip: 10gPo/Pn ~ o. n .... oo

Theorem Let Pn be an RLE. Iffor some value ofm

m p(Yr )

EPa sup log IT -(Y) < 00, P r=lPO r

[8.7J

then, with probability one, P n -+ PO' i.e. p*(P n' Po) -+ 0 as n -+ 00.

Note that the condition [8.7J is of the type (iii"); but in the case of a discrete probability space, no further condition like (iv') is required. In fact, we ignore the topology of e, and consider only that of {p}.

Proof. We shall fIrst prove the above theorem for m = 1. If y(x) = sup p(x), the condition [8.7J becomes

P [8.8J

Note that y ~ 1, and therefore EPa log y/po ~ - EPa log Po. Hence a sufficient condition for [8.8J is EPa log Po > - 00.

If k,k' > 0,

k log (k/k') = 2k log (.jk/-Jk') ~ 2k(1 - -Jk'/-Jk)

= 2-Jk(-Jk - -Jk') = (.Jk - -Jk'f + (k - k').

(-Jk - -Jk')2 ~ k' - k + klog(k/k').


Let J w denote the set (Xl' X 2 "" ,xw) of points in fE, and also its indicator function. Put k = J wP:, k' = J wPn' Summing over the points of fE, we have

LJw(Jp: - JpY ::; LJw(Pn - P:) + 2)wP: log (p:/pJ,

::; L(1- Jw)P: + LJwP: logp:/po + LP: logPo/Pn

+ L(1- Jw)P: logpJpo

::; LP:(1-Jw)[1 +logY/PoJ + LJwp:logp:/po

+ LP: log Po/Pn·

When n -+ 00, the first term on the right hand side -+ EPa(1- Jw)[1 + log Y/PoJ with probability one, because EPa log Y/Po is finite. The second term -+ 0 almost surely, and lim sup LP: log Pol Pn ::; O. Therefore n--+oo

~oth sides of this inequality are monotone in w, and the right sIde -+ 0 as W -+ 00, since EPa log Y/Po is finite. Hence

lim LJw(Jp: - JpJ2 = 0 for all w, a.s. n--+oo

and therefore

lim [p:(x) - Pn(x)} = 0 for all x, a.s. n--+ 00

Also

P:(X) -+ Po(x) for all X, a.s.,

Therefore

Pn{X) -+ Po(x) for all x, a.s.

Since

LPn(X) = LPo(X), this implies Pn(x) -+ Po(x) in mean a.s., and

I!pn(x) - Po(x)l-+ 0, a.s.

Now consider the case where [8.7J is not true for m = 1, but is true for some m > 1. First suppose n -+ 00 by integral multiples of m, n = vm, and v -+ 00. A sample of m observations from the


space fE may be regarded as a sample of one from the product space fZ = fErn, each point of which is an ordered set of m points from fE, not necessarily all different. If P is a probability measure on fE, we shall use the same symbol P for the product measure on fZ = fEm. Thus if

m

z,EfZ,= (Yl'Y2,.··,yJ, P(z) = TIP(Yr)' 1

where

Zs = {Y(S-l)m+ l' Y(s-l)m+ 2' ... 'Y(S-l)m+m}'

If P n is an RLE for the sample of n = vm from fE,

1· 11 TIn PO(Yr) < 0 lID sup- og -( -) - .

n r=lPnYr

Therefore

1· 11 TID P O(Zs) 0 lID sup - og -( -)::; ,

V s=lPnZs

and P n on fZ is an RLE for the sample (Zl' Z2' ... , zJ from fZ. The condition [8.7J may be written

P(Z) EPa suplog-Z) < 00, Z = (Yl , Y2, ... , Ym)'

p Po(

It follows from the part of the theorem already proved that with probability one,

In particular

Thus

Pn(X,X, .,. ,x) -+ P o(x,x, ... ,x), all x,

Pn(xr -+ po(xr·

P n(x) -+ P o(X) , for all x, a.s.


Now suppose n= vm+ u, 0 < U < m.

Zs = {Y(S-l)V+U+ l' ... 'Y(S-l)v+u+m}·

For fixed u, when n ~ C(), the first term on the right ~ 0, and we can show as before that if P n is an RLE for the sample of n = vm + u from f!£, then Pn on f!Z is an RLE for the sample (Zl' ... ,zv), and P n ~ Po' a.s. This is true for every u < m, and so the theorem is proved. DOD


CHAPTER NINE

THE SAMPLE DISTRIBUTION FUNCTION

1 The distribution function F of a real random variable X is defined by F(x) = P{X ≤ x}. For a sample of n values of X, the sample distribution function K_n is defined by K_n(x) = N(x)/n, where N(x) = number of sample values ≤ x. It is a step function. The Glivenko-Cantelli theorem (published in 1933) states that, with probability one, sup_x |K_n(x) − F(x)| → 0 as n → ∞. This is the existence theorem for Statistics as a branch of Applied Mathematics.

The Glivenko-Cantelli theorem is only the beginning. It being true, we immediately want to know how likely Kn is to differ much from F, i.e. we are interested in the distribution of the random function Kn. The investigation of its probability distribu­tion in finite samples, and of its limiting distribution, was started by Kolmogorov with a paper in the same journal and in the same year (1933). The former is a problem in combinatorial mathe­matics which has taken mathematicians about 40 years to solve completely.

The sort of thing we want to know about the distribution of Kn is the probability that Kn lies between two specified functions,

g(x) ~ Kn(x) ~ hex), all x.

The fact that Kn is a non-decreasing function makes the second inequality equivalent to

Kn(x) ~ ho(x),

where ho is the greatest non-decreasing function which is ~ h. Hence the effective upper barrier will be non-decreasing. So we may as well take h as non-decreasing. Similarly for g.

Again, if F(x) is constant over an interval, then Kn must be constant over that interval, and so the effective upper and lower barrier functions will be constant over that interval, and therefore they will be functions of F. Thus there is no loss in generality in


considering only inequalities of the form

g[F(x) J :s; Kn(x) :s; h[F(x) J, allx, [9.1 J

where g and h are non-decreasing, and g(0) = 0, h(1) = 1. Kolmogorov (1933) considered the probability of

max [F(x) - a,OJ :s; Kn(x) :s; min [F(x) + a, 1].

For any random variable X with distribution function F,

F(X):S; F(x)¢;>X:S; x or (X > x) [F(X) = F(x)].

The last event is of probability 0. Hence, with probability one,

X :s; x¢;> F(X) :s; F(x).

Therefore, if Xl' ... 'Xn is a sample of n values of X,

KJx) = (number of Xr :s; x)/n = [number of F(xr) :s; F(x) J/n, a.s. = K:[F(x)J,

where K: is the sample distribution function of

F(x l ), ... ,F(xn)·

The inequality [9.1 J has therefore the same probability as

g[F(x)J :s; K:[F(x)J :s; h[F(x)J, all x. [9.2J

If ° :s; k:S; 1, and F is continuous, then for some x o' F(xo) = k. Therefore

P[F(x) :s; kJ = P[F(x) :s; F(xo)] = P[X :s; xoJ = F(xo) = k.

Thus if X is a random variable with a continuous distribution function F, the random variable F(X) has a rectangular [0, 1 J distribution. The probability of [9.2J, and therefore of [9.1 J, will be the same as the probability of

g(u) :s; K:(u) :s; h(u), all u, [9.3J

where K: is the sample distribution function of a sample of n values of a rectangular [0, 1 J random variable. In other words, the probability of [9.1 J is the same for all continuous distribu­tions, and is equal to the probability of

g(x) :s; K:(x) :s; h(x), all x, [9.4J

where K:(x) is the sample distribution function for a sample of


n values of a rectangular [0, 1 J random variable. From now on we restrict attention to a random variable with a continuous distribution function.

If F is the distribution function of X, then the probability of [9.1J is the same as the probability of [9.4]. But what if the distribution function of X is not F but G? What is then the probability of [9.lJ? We are interested in this problem when we use an acceptance region of the form [9.1 J to test the hypothesis that the distribution function of X is F -a generalised Kolmogrov test. We want to know the power of the test at G.

Suppose that the distribution function of X is G, and that F = f( G), where f is a continuous distribution function on [0, 1].

P{g[F(x)J :s; Kn(x):S; h[F(x)J}

= P {g(J[ G(x) J) :s; Kn(x) :s; h(J[ G(x) J) } = P{g(J(x)J :s; K:(x) :s; h[j(x)J}.

The barrier functions 9 and h are replaced by g(f) and h(f). In all cases we reduce the problem to computing the probability of [9.4J for suitable 9 and h. We now study the distribution of K n ,

the sample distribution function of a sample of n values of a rectangular [0, 1 J variable-n random points in [0, 1]. We shall call the graph of Kn the sample graph, or the sample path.

2 If U 1 :s; U 2 :s; ... :s; Un are the order statistics from a sample of n values of a rectangular [0,1 J variable, the inequality [9.4J is equivalent to

where

Note that

Lemma Let

[9.5J

ui = inf{x; hex) ;:::: i/n}

Vi = sup{x; g(x) :s; (i - 1)/n}.

° :s; ul :s; u2 :s; ... :s; Un :s; 1,

° :s; Vi :s; V 2 :s; ... :s; Vn :s; 1.


denote events such that for any integer k, the sets

{Ar,Bs;r<k,s:::;; k}, {Ar,Bs;r>k,s>k}

are independent, then

P(Bi B2 ... BnAiA2 ... An-i) = d n = det(di), 1 :::;; i, j :::;; n

where

dij = 0

= 1

= PCB)

= P(BiBi+ 1 ... BjA~A~+ 1 ... Aj-i)

and A~ is the complement of Ar.

ifi > j + 1,

ifi=j+I,

if i = j,

if i < j,

Proof. Note that the conditions on the events make B i , B2, ... , Bn independent. The events Ai' A 2, ... are I-dependent. Put

Ar = AiA2 ... Ar, Br = BiB2 ... Br·

The lemma may be proved by use of the principle of inclusion and exclusion.

r<s

which can be shown directly to be the expansion of d n • A proof by induction is shorter, and easier to print.

Assume the lemma true for n = m.

dm = P(BmAm-i)'

Consider ~ + 1 • The elements of its last row are all zero except

d.n+i,m = 1, dm+i ,m+i = P(Bm+i )·

Therefore

~+ 1 = P(Bm+ i)dm - L\:., where ~ differs from d m only in having Bm in the last column of ~ replaced by BmBm + 1 A:'n, which satisfies the same conditions


relative to the other events appearing in d~ as does Bm' Therefore

d~ = P(BmBm+iA:'nAm-i) = P(Bm+iAm- iA:'n);

P(Bm+i)~ = P(Bm+i)P(Bm~-l) = P(Bm+iAm- i )·

Thus

~+i = P(Bm+i Am- i ) - P(Bm+i~-lA:'n)

= P(Bm+i A...-iAJ = P(Bm+iA m),

and so the lemma is true for n = m + 1. It is easy to show that it is true for n = 2, and so it is true for all n. 0 0 0

Corollary Taking the case where every P(B_i) = 1, we obtain the following result for the 1-dependent sequence of events A₁, A₂, …:

P(A₁A₂ … A_{n−1}) = det(d_ij),

where d_ij = 0 if i > j + 1,

= 1 if i = j or j + 1,

= P(A_i'A_{i+1}' … A_{j−1}') if i < j.

Theorem Let

0 ≤ u₁ ≤ u₂ ≤ … ≤ u_n ≤ 1,

0 ≤ v₁ ≤ v₂ ≤ … ≤ v_n ≤ 1,

be given constants such that

u_i < v_i, i = 1, 2, …, n.

If U₁, U₂, …, U_n are the order statistics (in ascending order) from a sample of n independent uniform random variables with range 0 to 1,

P(u_i ≤ U_i ≤ v_i, 1 ≤ i ≤ n) = n! det[(v_i − u_j)₊^{j−i+1}/(j − i + 1)!],

where (x)₊ = max(x, 0), and it is understood that determinant elements for which i > j + 1 are all zero.

Proof. Let Y i , Y2 , .•• 'Yn be independent random variables, each with a uniform distribution from 0 to 1. The required


probability is equal to

n!P(ui :::::; Yi :::::; Vi' 1 :::::; i:::::; n; Y1 :::::; Y2 :::::; ... :::::; Yn). [9.6J

Denote by Bi the event ui :::::; Yi :::::; Vi. Denote by Ai the event Yi :::::; Yi + l' and by A~ the complement of Ai' that is, the event Yi > Yi+ 1. The events Ai' Bi satisfy the conditions of the lemma. Hence

[9.7J

= P(B 1 B2 ... BnA 1A2 ... An- 1)

= det(diJ

dii = P(B;} = Vi - ui·

If i < j,

dij = P(BiBi+l ... BjA~A~+l ... Aj-l)·

The event BiBi+ 1.·· BjA~A~+ 1··· Aj_l is

This is equivalent to

ur :::::; Yr :::::; vr' i:::::; r :::::; j

Yi > Yi + 1 > ... > Yj •

Vi;::::: Yi > Yi+1 > ... > Yj ;::::: uj'

the probability of which is (Vi - u)~-i+ l/(j - i + I)!. The theorem then follows from [9.6J and [9.7]' DOD

This result is due to Steck (1971). It is an explicit solution of the problem for finite n; but it is not of great practical value except when n is small. The expansion of the determinant has 2n - 1 non-vanishing terms. Simpler and more useful expressions for crossing probabilities can be obtained when the barriers are straight lines.
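For small n the determinant is nevertheless easy to evaluate directly. The sketch below is mine (it assumes NumPy; the particular barriers u, v are arbitrary): it implements the formula and checks it against a simulation of uniform order statistics.

```python
import numpy as np
from math import factorial

def steck(u, v):
    """P(u_i <= U_i <= v_i for all i) for uniform order statistics U_1 <= ... <= U_n."""
    n = len(u)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            p = j - i + 1                 # entries with i > j + 1 stay zero
            if p >= 0:
                d[i, j] = max(v[i] - u[j], 0.0) ** p / factorial(p)
    return factorial(n) * np.linalg.det(d)

u = np.array([0.0, 0.1, 0.3, 0.45, 0.6])
v = np.array([0.3, 0.5, 0.7, 0.80, 0.9])

rng = np.random.default_rng(4)
U = np.sort(rng.random((400000, len(u))), axis=1)
mc = np.all((U >= u) & (U <= v), axis=1).mean()
print("Steck formula:", steck(u, v), "  Monte Carlo:", mc)
```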

3 Probability of crossing a straight line through the origin. Let OE be the line x = ky, 0 < k < 1, and let p be the probability that the sample path crosses OE.

1 − p = P{U_r > kr/n; 1 ≤ r ≤ n}.

Figure 1

Put λ = k/n. Then

1 − p = n! ∫_{nλ}^{1} du_n ⋯ ∫_{2λ}^{u_3} du_2 ∫_{λ}^{u_2} du_1.

Consider

Q_r = ∫_{rλ}^{u_{r+1}} du_r ⋯ ∫_{2λ}^{u_3} du_2 ∫_{λ}^{u_2} du_1.

This will be a polynomial in u_{r+1} which vanishes when u_{r+1} = rλ;

Q_1 = u_2 − λ,   Q_2 = u_3²/2 − λu_3,

and it is easy to show by induction that

Q_r = u_{r+1}^r/r! − λ u_{r+1}^{r−1}/(r − 1)!.

Hence

1 − p = n! Q_n with u_{n+1} = 1,
      = 1 − nλ = 1 − k.

Thus p = k. This is a very simple and elegant result. It is remarkable that the probability of crossing is independent of the sample number n. The following proof, suggested by Dr J.W. Pitman shows why this is so.

Extend the domain of definition of K_n, the sample distribution

Page 48: Asymptotic Relative Efficiency

function, by means of the equation

K_n(x + 1) = K_n(x) + 1.

For any x_0 ∈ [0, 1], consider the cyclical transformation of the interval [0, 1], x → x*, x* = x − x_0 if x ≥ x_0, x* = 1 + x − x_0 if x < x_0. The sample x_1, ..., x_n transforms to the sample x_1*, ..., x_n*, and the new sample distribution function K_n* will be given by

K_n*(x*) = K_n(x) − K_n(x_0),   if x ≥ x_0,
         = K_n(x + 1) − K_n(x_0),   if x < x_0.

The set of samples which can be transformed into one another by cyclic transformations will be called a configuration. Given a configuration, consider the graph of the extended sample distribution function of a particular member sample. We may look on it as an indefinitely extended flight of steps with horizontal treads and vertical risers. Imagine that there are rays of light shining down parallel to the line x = ky. Each riser will cast a shadow

Figure 2

on one or more of the treads below, and the total width of the shadow will be k times the height of the riser. The total width of the shadows on the treads from x = 0 to x = 1 is therefore kK_n(1) = k.

The conditional distribution of the members, given the configuration, may be specified by saying that the starting point x_0 (the point that goes to 0 in the transformation) has a uniform distribution over (0, 1). Hence the probability that it is in shadow is k. The sample path of a member of the configuration crosses the line x = ky if and only if its starting point is in shadow, and so the conditional probability of crossing, given the configuration, is k. This is true for all configurations, and therefore the probability of crossing is k. □ □ □

It should be noted that the proof has not assumed that the risers are all equal in height, and, of course, we may consider a basic interval of any length. The most general form of the result may be stated as follows:

Theorem. Let h_1, ..., h_n be positive constants with sum h, and let x_1, ..., x_n be independent rectangular [0, a] variables. Define the sample function K_n for 0 ≤ x ≤ a by

K_n(x) = Σ_{x_r ≤ x} h_r.

Call the graph of K_n the sample path. The probability that the sample path cuts the straight line joining the origin to the point (k, h), 0 < k < a, is k/a. □ □ □
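The distribution-free value k/a is easy to confirm by simulation. The sketch below is illustrative only (the function name crossing_prob, the parameter choices and the weight vector are mine, not the book's); it estimates the crossing probability for the weighted version of the theorem.

```python
import numpy as np

rng = np.random.default_rng(1)

def crossing_prob(n, k, a=1.0, weights=None, reps=100_000):
    """Monte Carlo estimate of the probability that the sample path of K_n
    crosses the line joining the origin to (k, h), where h = sum of the weights.
    With equal weights and a = 1 this is the line x = k*y of Section 3."""
    if weights is None:
        weights = np.full(n, 1.0 / n)
    h = weights.sum()
    hits = 0
    for _ in range(reps):
        x = rng.uniform(0.0, a, size=n)
        order = np.argsort(x)
        cum = np.cumsum(weights[order])          # K_n evaluated at the jump points
        # the path crosses iff some jump point lies on or to the left of the line
        if np.any(x[order] <= (k / h) * cum):
            hits += 1
    return hits / reps

print(crossing_prob(10, 0.3))                                    # ≈ 0.3  (= k/a, a = 1)
print(crossing_prob(50, 0.3, a=2.0,
                    weights=rng.uniform(0.5, 1.5, size=50)))     # ≈ 0.15 (= k/a)
```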

4 Probability of crossing a straight line not through the origin. Let A be the point (0, a), and A' the point (1, 1 + a'). AA' is the line y = a + (1 + a' − a)x. The probability that the sample path crosses AA' is Σ_{r=1}^{n} t_r, where t_r is the probability that the sample path crosses AA' last at y = r/n. Q is the point (p_r, r/n), where

r/n = a + (1 + a' − a) p_r,

so that

p_r = (r/n − a)/(1 + a' − a),   1 − p_r = (1 + a' − r/n)/(1 + a' − a),

c = BO' = a'/(1 + a' − a).

Page 49: Asymptotic Relative Efficiency

Figure 3

t_r = (probability of r sample points in OQ') × (probability that the sample path starting at Q does not cross QB)

    = C(n, r) p_r^r (1 − p_r)^{n−r} c/(1 − p_r),   if r/n > a,
    = 0,   if r/n ≤ a.

The required probability of crossing is

P(a, a') = [a'/(1 + a' − a)^n] Σ_{r>na} C(n, r) (r/n − a)^r (1 + a' − r/n)^{n−r−1}.

Abel's formula [A.4] in Section 5 of the Appendix is

(z + u)^n = Σ_{r=0}^{n} C(n, r) u (z − n + r)^r (u + n − r)^{n−r−1}.

Putting z = n(1 − a), u = na', we obtain

1 = [a'/(1 + a' − a)^n] Σ_{r=0}^{n} C(n, r) (r/n − a)^r (1 + a' − r/n)^{n−r−1}.

Thus

P(a, a') = 1 − [a'/(1 + a' − a)^n] Σ_{r<na} C(n, r) (r/n − a)^r (1 + a' − r/n)^{n−r−1},

t_r being 0 if r = na.

By considering the transformation of the unit square,

(x,y) --+ (1 - x, 1- y),

which rotates Fig. 3 through 180°, we can see that

P(a, a') = P( - a', - a),

the probability of crossing the straight line joining (0, - a') and (1, 1 - a).
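As a check on the finite-n formula, P(a, a') can be computed directly from the sum over r > na and compared with a simulation of the uniform order statistics. The Python sketch below is my own illustration (the helper names crossing_prob_line and crossing_prob_mc are hypothetical), assuming the crossing criterion that some order statistic U_(r) satisfies r/n ≥ a + (1 + a' − a)U_(r).

```python
import numpy as np
from math import comb

def crossing_prob_line(a, a_prime, n):
    """P(a, a'): probability that the path of K_n for n uniforms on [0,1]
    crosses y = a + (1 + a' - a)x, from the finite-n formula above."""
    D = 1.0 + a_prime - a
    total = 0.0
    for r in range(n + 1):
        if r / n > a:
            total += comb(n, r) * (r / n - a) ** r * (1 + a_prime - r / n) ** (n - r - 1)
    return a_prime / D ** n * total

def crossing_prob_mc(a, a_prime, n, reps=200_000, seed=2):
    """Monte Carlo estimate: the path crosses iff K_n(x) >= a + (1+a'-a)x
    at some jump point, i.e. iff r/n >= a + (1+a'-a)*U_(r) for some r."""
    rng = np.random.default_rng(seed)
    u = np.sort(rng.random((reps, n)), axis=1)
    r_over_n = np.arange(1, n + 1) / n
    return np.mean(np.any(r_over_n >= a + (1 + a_prime - a) * u, axis=1))

print(crossing_prob_line(0.2, 0.3, 20), crossing_prob_mc(0.2, 0.3, 20))
```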

Putting a = αn^{-1/2}, a' = βn^{-1/2}, we obtain

P(αn^{-1/2}, βn^{-1/2}) = Σ_{r > αn^{1/2}} t_r,

where

t_r = [βn^{-1/2}/(1 + (β − α)n^{-1/2})^n] C(n, r) (r/n − αn^{-1/2})^r (1 + βn^{-1/2} − r/n)^{n−r−1}

    = [n! r^r (n − r)^{n−r−1} / (n^{n−1/2} r! (n − r)!)] × [β(1 − αn^{1/2}/r)^r (1 + βn^{1/2}/(n − r))^{n−r−1} / (1 + (β − α)n^{-1/2})^n]

    = U_r × V_r.

By the use of Stirling's theorem, we obtain

log U_r = log [n/√(2πr(n − r)³)] + θ₁/12n − θ₂/12r − θ₃/12(n − r),

for 0 < r < n, where 0 < θ₁, θ₂, θ₃ < 1. Therefore

U_r ~ n/√[2πr(n − r)³] when r and n − r are great, and

U_r < Cn/√[2πr(n − r)³],   0 < r < n,

Page 50: Asymptotic Relative Efficiency

where C is a constant.

log V_r = log β − log[1 + βn^{1/2}/(n − r)]

    + r[−αn^{1/2}/r − α²n/(2r²) − α³n^{3/2}/(3r³) − ...]

    + (n − r)[βn^{1/2}/(n − r) − β²n/(2(n − r)²) + β³n^{3/2}/(3(n − r)³) − ...]

    − n[(β − α)/n^{1/2} − (β − α)²/(2n) + (β − α)³/(3n^{3/2}) − ...]

    = log β − ½[nα²/r + nβ²/(n − r) − (β − α)²] + η

    = log β − ½[(n − r)α + rβ]²/(r(n − r)) + η,

where η → 0 when r and n − r are great, and |η| < Kn^{-1/6} if n^{5/6} < r < n − n^{5/6}.

Hence if n^{5/6} < r < n − n^{5/6},

V_r ~ β exp{−½[(n − r)α + rβ]²/(r(n − r))},   n → ∞,

uniformly with respect to r, and

t_r ~ [nβ/√(2πr(n − r)³)] exp{−½[(n − r)α + rβ]²/(r(n − r))}

uniformly with respect to r. Thus if n^{-1/6} < r/n < 1 − n^{-1/6},

t_r ~ [β/(n√(2π(r/n)(1 − r/n)³))] exp{−½[(1 − r/n)α + (r/n)β]²/((r/n)(1 − r/n))}

uniformly with respect to r. Therefore

Σ_{n^{5/6} ≤ r ≤ n − n^{5/6}} t_r → ∫_{n^{-1/6}}^{1 − n^{-1/6}} [β/√(2π)] exp{−½[α(1 − x) + βx]²/(x(1 − x))} dx/√[x(1 − x)³]

    → ∫_0^1 [β/√(2π)] exp{−½[α(1 − x) + βx]²/(x(1 − x))} dx/√[x(1 − x)³]   [9.8]

    = e^{−2αβ};   [9.9]

see below for the proof of [9.9]. Using

log(1 + x) ≥ x − x²,   |x| sufficiently small,
log(1 + x) ≤ x − log(1 + x²/6),   x > −1,

we obtain

log V_r < log β + r(−αn^{1/2}/r) + (n − r){βn^{1/2}/(n − r) − log[1 + β²n/(6(n − r)²)]} − n[(β − α)/n^{1/2} − (β − α)²/n]

    = log β + (β − α)² − (n − r) log(1 + β²n/(6(n − r)²))

when n is great. Therefore

t_r = U_r V_r < Kn/{√[2πr(n − r)³] [1 + β²n/(6(n − r)²)]^{n−r}}

    < Kn/{√[2πr(n − r)³] β²n/(6(n − r))} = K'/√[r(n − r)],

where K, K' are constants. Hence

Σ_{αn^{1/2} < r ≤ n^{5/6}} t_r < K' Σ_{r=1}^{n^{5/6}} 1/{n√[(r/n)(1 − r/n)]} → ∫_0^{n^{-1/6}} dx/√[x(1 − x)] → 0.

Similarly

Σ_{n − n^{5/6} ≤ r ≤ n} t_r → 0,

Page 51: Asymptotic Relative Efficiency

and so

P(αn^{-1/2}, βn^{-1/2}) = Σ_{r > αn^{1/2}} t_r → e^{−2αβ}.

Proof of [9.9]. By the substitution

u = x^{1/2}(1 − x)^{-1/2},   du = ½ x^{-1/2}(1 − x)^{-3/2} dx,

the integral [9.8] becomes

[2β/√(2π)] ∫_0^∞ exp[−½(α/u + βu)²] du.   [9.10]

By the substitution

y = βu − α/u,   u = [y + √(y² + 4αβ)]/(2β),   du = [dy/(2β)][1 + y/√(y² + 4αβ)],

the integral [9.10] becomes

[1/√(2π)] ∫_{−∞}^{∞} exp[−½(y² + 4αβ)][1 + y/√(y² + 4αβ)] dy

    = [1/√(2π)] ∫_{−∞}^{∞} exp[−½(y² + 4αβ)] dy = e^{−2αβ}.

This completes the proof that

lim_{n→∞} P(αn^{-1/2}, βn^{-1/2}) = e^{−2αβ},   α, β > 0.

Since

P(−αn^{-1/2}, −βn^{-1/2}) = P(βn^{-1/2}, αn^{-1/2}),

we have also

lim_{n→∞} P(−αn^{-1/2}, −βn^{-1/2}) = e^{−2αβ}.   □ □ □

The proof for this elegant result is lengthy; but it seems the best that can be done by elementary methods. A much shorter, but more sophisticated proof depends upon the properties of the Brownian bridge. See Billingsley (1968).

When the line AA' is parallel to OO', β = α, and

P(αn^{-1/2}, αn^{-1/2}) → e^{−2α²}.

AA' is then the line y = αn^{-1/2} + x. The probability of not crossing, for α > 0, is P[K_n(x) ≤ αn^{-1/2} + x for all x], which → 1 − e^{−2α²}. For the rectangular [0, 1] distribution F(x) = x, 0 ≤ x ≤ 1. Hence, from the theory in Section 1 above, if K_n is now the sample distribution function of a sample of n values of a random variable with a continuous distribution function F,

P[K_n(x) − F(x) ≤ αn^{-1/2} for all x] = 1 − P(αn^{-1/2}, αn^{-1/2}) → 1 − e^{−2α²}.
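The one-sided limit 1 − e^{−2α²} is easy to verify empirically. The short Python sketch below is an illustration under the stated uniform-sample assumption (not part of the text); it estimates P[√n sup_x (K_n(x) − x) ≤ α] for a moderate n.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps, alpha = 200, 100_000, 1.0
u = np.sort(rng.random((reps, n)), axis=1)
# sup_x (K_n(x) - x) is attained at the jump points: max_r (r/n - U_(r))
d_plus = (np.arange(1, n + 1) / n - u).max(axis=1)
print(np.mean(np.sqrt(n) * d_plus <= alpha))   # empirical probability
print(1 - np.exp(-2 * alpha**2))               # limiting value, about 0.8647
```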

5 Boundary consisting of two parallel lines. We now consider a pair of parallel boundary lines

y = a + (1 + c)x, y = - b + (1 + c)x,

where a, b, a + c, b − c > 0, a, b − c < 1. Let h(γ) = probability of crossing the line y = γ + (1 + c)x; then

h(a) = probability of crossing y = a + (1 + c)x = P(a, a + c),
h(−b) = probability of crossing y = −b + (1 + c)x = P(−b, c − b) = P(b − c, b) = h(b − c).

Figure 4

Page 52: Asymptotic Relative Efficiency

While various values of γ are considered, c remains fixed throughout the discussion. Put

G(x) = N(x)/n − (1 + c)x,

where N(x) is the number of sample points not exceeding x; then

G(0) = 0,   G(1) = −c,

and

h(γ) = P[G(x) = γ for some x].

We require the probability that the sample path crosses the boundary consisting of the two lines, HH' and KK', that is the probability that G(x) takes at least one of the values a, −b. Denote the event, G(x) takes the value a at some point in [0, 1], by 𝒜, and the event, G(x) = −b at some point in [0, 1], by ℬ. The required probability is

P(𝒜 + ℬ) = P(𝒜) + P(ℬ) − P(𝒜ℬ)
          = P(𝒜) + P(ℬ) − P(𝒜 followed by ℬ) − P(ℬ followed by 𝒜).

As x increases from 0 to 1, G(x) steadily decreases except at sample points, at each of which it increases by a jump of 1/n. The replacement of the sample points in (t, 1), 0 < t < 1, by their

Figure 5

transforms under the transformation x → t + 1 − x, will be called reversal after t. Suppose G(t_1) = k_1, G(t_2) = k_2, where 0 < t_1 < t_2 < 1. As x increases from t_2 to 1, G(x) changes from k_2 to −c, a decrease (positive or negative) of k_2 + c. If we reverse the sample points after t_1, G(x) will decrease by k_2 + c as x increases from t_1 to t_1 + 1 − t_2, and so will take the value k_1 − k_2 − c at t_1 + 1 − t_2.

A sample in which G(x) takes the value γ at some x will be called a γ sample. When γ ≥ a, such a sample will be called a weak γ sample, W(γ), if G(x) takes the value −b before the value a; otherwise the sample is strong, S(γ). If γ ≤ −b, a γ sample is weak if G(x) takes the value a before the value −b; otherwise it is strong. The probability of crossing the boundary is

h(a) + h(b - c) - P[W(a)] - P[W( - b)].

Consider a sample which is W(A), A ≥ a, and let t be the point at which G(x) first takes the value −b. At some point in (t, 1), G(x) takes the value A. Hence by reversal after t, the sample will become a (−b − A − c) sample. Moreover, it will become a strong (−b − A − c) sample. By this process, every W(A) will become an S(−b − A − c). By the same process, of reversal after the point at which G(x) first takes the value −b, every S(−b − A − c) will become W(A), and the mapping is one-to-one. Hence

P[W(A)] = P[S(−A − b − c)]

        = h(−A − b − c) − P[W(−A − b − c)]

        = h(A + b) − P[W(−A − b − c)].   [9.11]

Now consider a W(−B), B ≥ b. Let t be the point where G(x) is first ≥ a; then G(t) = a + d, where 0 ≤ d < 1/n, ignoring throughout the whole discussion the possibility, of zero probability, that any two sample points coincide. By reversal after t, the sample will become S(a + d + B − c), which is included in the set of S(B + a − c). Therefore

P[W(−B)] ≤ P[S(B + a − c)] = h(B + a − c)

         − P[W(B + a − c)].   [9.12]

By the same process, an S(B + a − c) will become W(−B + d). Therefore an S(B + a − c + 1/n) will become W(−B + d − 1/n),

Page 53: Asymptotic Relative Efficiency

which is also W(−B), since d − 1/n ≤ 0. Thus

P[W(−B)] ≥ P[S(B + a − c + 1/n)]

         = h(B + a − c + 1/n) − P[W(B + a − c + 1/n)].   [9.13]

Combining inequality [9.11] with either [9.12] or [9.13], we obtain

P[W(A)] ≥ h(A + b) − h(A + b + a) + P[W(A + b + a)],   A ≥ a,

        ≤ h(A + b) − h(A + b + a + 1/n) + P[W(A + b + a + 1/n)],

P[W(−B)] ≥ h(B + a − c + 1/n) − h(B + a + b − c + 1/n) + P[W(−B − a − b − 1/n)],   B ≥ b,

         ≤ h(B + a − c) − h(B + a + b − c) + P[W(−B − a − b)].

Each of these can be extended indefinitely by repeatedly using the same inequality on the last term. From these we deduce

P[W(a)] ≥ h(a + b) − h(2a + b) + h(2a + 2b) − ...

        ≤ h(a + b) − h(2a + b + 1/n) + h(2a + 2b + 1/n) − h(3a + 2b + 2/n) + ...,

P[W(−b)] ≥ h(a + b − c + 1/n) − h(a + 2b − c + 1/n) + h(2a + 2b − c + 2/n) − ...

         ≤ h(a + b − c) − h(a + 2b − c) + h(2a + 2b − c) − ...

The terms on the right hand sides of the inequalities decrease in magnitude rapidly. Note that

P[W(a)] + P[W(−b)] ≤ h(a + b) + h(a + b − c)

                   ≥ h(a + b) + h(a + b − c + 1/n) − h(2a + b) − h(a + 2b − c),

which shows that h(a + b) + h(a + b − c) is a very good approximation to P[W(a)] + P[W(−b)]. Thus the probability that the

sample path crosses the boundary consisting of the two parallel lines y = a + (1 + c)x, y = −b + (1 + c)x is approximately

h(a) + h(−b) − h(a + b) − h(a + b − c)
    = P(a, a + c) + P(b − c, b) − P(a + b, a + b + c) − P(a + b − c, a + b).

The most important case is c = 0, the boundary lines parallel to the line y = x. Then h(γ) = P(γ, γ), and when n → ∞, h(γn^{-1/2}) and h(γn^{-1/2} + n^{-1}) both → e^{−2γ²}. Therefore the probability of crossing the boundary consisting of the lines y = ±γn^{-1/2} + x

    = 2h(γn^{-1/2}) − 2P[W(γn^{-1/2})]

    → 2 Σ_{r=1}^{∞} (−1)^{r−1} e^{−2r²γ²}.

Now let K_n be the sample distribution function of a sample of n values of a random variable with a continuous distribution function F, and let D_n = sup_x |K_n(x) − F(x)|; then

P[D_n > γn^{-1/2}] → 2[e^{−2γ²} − e^{−8γ²} + e^{−18γ²} − ...],   as n → ∞.

The second term in the series between the braces is the fourth power of the first term. It, and all subsequent terms, can usually be neglected in practical applications. Inequalities similar to those in this section are discussed in Durbin (1968).
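The limiting series is the complementary Kolmogorov distribution. A quick numerical check of the statement above, written as an illustrative Python sketch (the helper name limiting_tail and the sample sizes are my own), compares the truncated series with a Monte Carlo estimate of P(√n D_n > γ) for uniform samples.

```python
import numpy as np

def limiting_tail(gamma, terms=100):
    """2 * sum_{r>=1} (-1)^(r-1) exp(-2 r^2 gamma^2): limiting value of
    P(sup_x |K_n(x) - F(x)| > gamma * n^(-1/2))."""
    r = np.arange(1, terms + 1)
    return 2.0 * np.sum((-1.0) ** (r - 1) * np.exp(-2.0 * r**2 * gamma**2))

rng = np.random.default_rng(4)
n, reps, gamma = 500, 50_000, 0.9
u = np.sort(rng.random((reps, n)), axis=1)
grid = np.arange(1, n + 1) / n
d_n = np.maximum(grid - u, u - (grid - 1.0 / n)).max(axis=1)   # sup |K_n - x|
print(np.mean(np.sqrt(n) * d_n > gamma), limiting_tail(gamma))
```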

Page 54: Asymptotic Relative Efficiency

APPENDIX MATHEMATICAL PRELIMINARIES

1 Convergence in mean and convergence in measure. A sequence (f_n) of real-valued integrable functions on a space 𝒳 with a measure μ is said to converge in mean to f if lim ∫|f_n − f| = 0. It then follows that ∫f_n → ∫f, and that |f_n| converges to |f| in mean. The sequence (f_n) converges in measure to f if for every ε > 0,

μ{x; x ∈ 𝒳, |f_n(x) − f(x)| > ε} → 0 as n → ∞.

The sequence (f_n) may converge to f in mean, or in measure, without converging almost everywhere to f; but it is well known that in both cases every subsequence of (f_n) contains a subsequence which converges to f almost everywhere. It is convenient to have a name for this type of point convergence: we shall call it loose convergence. We shall say that g_n converges loosely to g, and write g_n →ˡ g, if every subsequence of (g_n) contains a subsequence which converges almost everywhere to g. It is easy to show that if μ(𝒳) < ∞, loose convergence implies convergence in measure.

Loose convergence obeys the usual manipulative rules of point convergence, such as

f_n →ˡ f, g_n →ˡ g ⟹ f_n + g_n →ˡ f + g, f_n g_n →ˡ fg.

We need a modification of Fatou's lemma.

Lemma. If g_n ≥ 0, and g_n →ˡ g, then lim inf_{n→∞} ∫g_n ≥ ∫g.

Proof. Put L = lim inf_{n→∞} ∫g_n. There is a sequence (n') of positive integers such that ∫g_{n'} → L. This sequence contains a subsequence (n'') such that g_{n''} → g almost everywhere and ∫g_{n''} → L. Hence, by Fatou's lemma, ∫g ≤ L. □ □ □


We shall make frequent use of the following extension of the dominated convergence theorem, which is not so widely known as it ought to be. It is essentially given in Pratt (1960), though not quite in the form given here. It does not appear in most textbooks on measure and integration. An exception is Royden (1968), but the full implications are not set out there.

Theorem I

(i) g_n →ˡ g, |g_n| ≤ |G_n| a.e., G_n integrable and → G in mean ⟹ g_n → g in mean.

(ii) H_n ≥ 0 and integrable, H_n →ˡ H integrable, ∫H_n → ∫H ⟹ H_n → H in mean.

Proof. We first prove

gn ~ g, Ignl :;;; Hn a.e., Hn integrable J Hn ~ J H => gn ~ g in mean.

I and ~ H integrable,

Hn + H -Ign - gl Z Hn + H -Ignl-Igl Z 0, a.e. I

Hn+H -lgn-gl~2H.

Therefore by the Lemma,

Hence

PH:;;; liminf J(Hn + H -Ign - gl) :;;; lim sup J(H,! + H -Ign - gl)

:;;; lim sup J(Hn + H) = PH.

lim J(Hn + H -Ign - gl) = PH = lim J(Hn + H).

Thus

lim Jign - gl = 0, gn ~ g in mean.

Since Gn ~ G in mean=> IGnl ~ IGI in mean, putting Hn = IGnl, we obtain (i). Putting gn = H n, we obtain (ii).

DDD Corollary

I Proof. Suppose the left-hand statement true. g; ~ g~ and so

I by (ii) g; ~ g~ in mean. (gn - g 0)2 ~ 0, and (gn - g 0)2 :;;; 2g; + 2g~,

Page 55: Asymptotic Relative Efficiency

which -t 4g~ in mean. Therefore (gn - gO)2 -t 0 in mean, J(gn - go)2 -t 0, gn -t go in quadratic mean.

If the right-hand statement is true, i.e. (gn - gof -t 0 in mean, I 2 I 2

then gn -t go' gn -t go·

g; = (gn - go + gO)2 ::::;; 2(gn - go)2 + 2g~, which is convergent in mean. Therefore g; -t g~ in mean. D D D

If f_n and f_0 are probability density functions, ∫f_n = ∫f_0 = 1, and so if f_n →ˡ f_0, then by (ii) f_n → f_0 in mean, and ∫|f_n − f_0| → 0. This is Scheffé's theorem.

A simple example of the application of this theorem (in many cases all that is required) is the following.

F_n ≥ 0, integrable, and → F in mean; G_n ≥ 0, integrable, and → G in mean ⟹ √(F_nG_n) → √(FG) in mean.

Proof. F_n →ˡ F, G_n →ˡ G; therefore √(F_nG_n) →ˡ √(FG). Also √(F_nG_n) ≤ F_n + G_n, which → F + G in mean.

2 Mappings of measure spaces. Let /1 be a a-finite measure on a a-algebra !IF of sets in a space f!l. T is a mapping from f!l into a space :!T. Vo is the measure induced in :!T on the a-algebra d; i.e. d is the a-algebra of sets A in :!T such that T- 1 A E!IF, and vo(A) = /1(T- 1 A). We shall assume that the single point sets of :!T are d measurable, so that the mapping partitions f!l into !IF measurable sets T - 1 { t}, each of which is the inverse image of a single point t in :!T.

Let v be a a-finite measure on d which dominates vO' There always is such a measure v, because /1, being a-finite, is dominated by some finite measure /11' say, and the measure induced in :!T from /11 is finite and dominates Vo .lfvo is a-finite, we may take v = Vo ; but this is sometimes not so. For example, if /1 is Legesgue measure in R2 and Tis the mapping into Rl defmed by (x,y) -t x, the induced measure Vo in Rl takes only the values 0 and 00,

and so is not a-fmite. However, Vo is dominated by the Lebesgue measure on R \ and we take this for v.

Let f be a real-valued measurable function on f!l which is integrable. Put

Q(A) = I fd/1.

ν(A) = 0 ⟹ μ(T⁻¹A) = 0 ⟹ Q(A) = 0.


Hence v» Q, and so by the Radon-Nikodym theorem there exists a function g on:!T, determined up to v equivalence, such that

I fd/1 = Q(A) = Igdv, T-1A A

for every AEd. We shall write g = T*f There is evidently some connection with conditional expectation, and, in fact, if /1 is a probability measure, and v the induced probability measure in :!T,

E{fIT} = g(T).

The mapping T* of integrable functions on f!l into integrable functions on :!T is linear.

(i) T*(cdl + cz/2) = C1 T*fl + c2T*f2' a.e.v, for constants c1 , c2 • It is also sign-preserving.

(ii) f ~ 0, a.e./1 => T*f ~ 0, a.e.v f ~ 0, a.e./1 and T*f = 0, a.e.v => f = 0, a.e./1. If h is a measurable function on:!T, and g = T*f,

I hgdv = I h(T)fd/1 A T-1A

and so (iii) T*[h(T)'f] = h'T*f a.e.v.

It follows (see below) from (i), (ii), (iii) that T* satisfies a Schwarz inequality

(iv) [T*(fd2)]2 ::::;; T*f12. T*f22 a.e.v with equality if and only iffl1f2 is a function of T a.e./1.

IT*fI2 = {T*(sgnf".,jlfl·.,jlfl)Y::::;; T*lfl'T*lfl,

Therefore (v) I T*fl ::::;; T*lfl a.e.v

with equality a.e.v if and only if sgnfis a function of T, i.e. T(x) constant => sgnf(x) constant.

IIIn - fld/1 = J T* lin - fldv ~ II T*In - T*fldv.

Hence (vi) f.. -t fin mean => T*fn -t T*fin mean.

Proof. of (iv). Denote by Ai' A2 real-valued measurable functions on:!T. A1(T), Az{T) are functions on f!l.

T*{ [Ai (T)ff + Az{T)f2]2} ~ 0 a.e.v,

A~T*f12 + 2Al A2 T*(fd2) + A~T*f22 ~ 0


a.e.v. [A. 1]

Page 56: Asymptotic Relative Efficiency

First take A1 )'2 as real constants. The set of P'>ints (exceptional points) at which [A.1] does not hold may vary with A1 ,A2 ; but for all rational values of A1 , A2 , the union E of exceptional points will have measure 0, Thus for all points in EC, [A. 1] is true for all rational A1 , A2 • Because of continuity, it is true in EC for all real A1 ,A2 , and so

a.e.v.

If

[T*(fd2)]2 = T*ff· T*f22 a.e.v

T*f[A1(T)f1 + A2(T)f2J2} = [A1.J(T*f12) + A2.J(T*f22)]2 a.e.v

for all functions A.u A2 on ::r . Take A1 = - .j(T*fi),A2 = .J(T*ff), then

T*{A.1(T)f1 + Az{T)f2}2 = 0 a.e.v.

Therefore

a.e.p.

f21f1 = - A1(T)1 A2(T) is a function of T a.e.p..

Conversely, iff21f1 = h(T) a.e.p., ..... ~ T*(fd2) = T*[h(T)fn = hT*ff a.e.v

a.e.v and

[T*(fd2)]2 = T*ff'T*ff a.e.v.

3 L'Hôpital's rule. In Chapter 3 we use the following extension of the usual form of this rule.

Theorem. Let f, g be real-valued functions which are continuous in the open interval (a, b), with derivatives f′(x), g′(x) at each point x of (a, b). Suppose further that g′(x) ≠ 0 in (a, b). If as x → a, f(x) and g(x) both → 0, or if g(x) → ±∞, then

lim inf_{x→a} f′(x)/g′(x) ≤ lim inf_{x→a} f(x)/g(x) ≤ lim sup_{x→a} f(x)/g(x) ≤ lim sup_{x→a} f′(x)/g′(x).

Similar results hold for left-hand, and for two:sided limits.


Proof. Denote the four limits by 1', I,L,V respectively. We have to show that I' :::;; I, L :::;; V. By Cauchy's formula, if a < x < y < b,

Put

f(y) - f(x) f'@ g(y) _ g(x) = g'@' where x < ~ < y.

m(y) = inf{f'(x)lg'(x); a < x < y},

M(y) = sup {f'(x)lg'(x); a < x < y}.

We then have

m(y) :::;; f(y) - f(x) :::;; M(y). g(y) - g(x)

Iff(x), g(x) both --+ 0 as x ~ a, this gives

m(y) :::;; ;~i :::;; M(y).

Hence

I' = lim m(y) :::;; I, L:::;; lim M(y) = V. J(~a y~a

We may rewrite [A.2]

m(y) < f(x)lg(x) - f(y)/g(x) < M(y) - 1 - g(Y)lg(x) - .

If g(x)--+ ± 00 as x ~ a,

m(y) :::;; lim inff(x)lg(x) - f(y)/g(x) = lim inff(x) x~a 1 - g(y)lg(x) x~a g(x)

m(y) :::;; l.

Hence I' :::;; I. Similarly L :::;; V. In both cases, when limf'(x)lg'(x) exists

x~a

limf(x)lg(x) = limf'(x)/g'(x). x~a x~a

4 In Section 3 of Chapter 6 we require the following.

Theorem

If A, B are symmetric matrices of the same order k, and B, A − B

Page 57: Asymptotic Relative Efficiency

are both non-negative, then |A| ≥ |B|, and if |B| > 0, |A| = |B| if and only if A = B.

Proof. The result is obvious when |B| = 0. Suppose |B| > 0. B is then positive definite. First consider the case B = I_k, the identity matrix of order k. A − I_k is then non-negative. Let λ be an eigenvalue of A, and V a corresponding eigenvector. Then

AV = λV,   (A − I_k)V = λV − V = (λ − 1)V.

Thus λ − 1 is an eigenvalue of A − I_k. Hence λ − 1 ≥ 0, and so λ ≥ 1. Therefore |A|, the product of the eigenvalues of A, is ≥ 1. If |A| = 1, then every eigenvalue is 1, and so A = I_k.

In the general case A − B is non-negative, and therefore B^{-1/2}(A − B)B^{-1/2} = B^{-1/2}AB^{-1/2} − I_k is non-negative. Hence |B^{-1/2}AB^{-1/2}| ≥ 1, i.e. |A||B|^{-1} ≥ 1, with equality if and only if B^{-1/2}AB^{-1/2} = I_k, i.e. A = B.
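A small numerical illustration of the theorem (mine, not the book's): generate a positive definite B and add a non-negative definite increment, then compare determinants.

```python
import numpy as np

rng = np.random.default_rng(5)
k = 4
M = rng.normal(size=(k, k))
B = M @ M.T + np.eye(k)            # positive definite
N = rng.normal(size=(k, k - 1))
C = N @ N.T                        # non-negative definite (rank-deficient)
A = B + C                          # so A - B = C >= 0
print(np.linalg.det(A) >= np.linalg.det(B))    # True, as the theorem asserts
print(np.isclose(np.linalg.det(B), np.linalg.det(B + 0 * C)))  # equality when A = B
```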

5 Abel's binomial formula. We require this in Section 4 of Chapter 9. Perhaps the simplest and most easily remembered form of this is

(z + u)^n = Σ_{r=0}^{n} C(n, r) u (u + r)^{r−1} (z − r)^{n−r}   [A.3]

for positive integral n. Denote the right side by f_n(z, u). Then

∂f_n(z, u)/∂z = Σ_{r=0}^{n} C(n, r) u (u + r)^{r−1} (n − r)(z − r)^{n−r−1}

             = n Σ_{r=0}^{n−1} C(n − 1, r) u (u + r)^{r−1} (z − r)^{n−1−r}

             = n f_{n−1}(z, u).

Hence, if f_{n−1}(z, u) = (z + u)^{n−1}, then f_n(z, u) = (z + u)^n + g(u). Putting z = −u, we have

g(u) = f_n(−u, u)

     = Σ_{r=0}^{n} (−1)^{n−r} C(n, r) u (u + r)^{n−1}

     = u Δ_y^n (u + y)^{n−1} at y = 0,

     = 0.

Thus

f_{n−1}(z, u) = (z + u)^{n−1} ⟹ f_n(z, u) = (z + u)^n.

The statement [A.3] is true for n = 1, and therefore for all n. Interchanging r and n − r in [A.3], we obtain the form required in Section 4 of Chapter 9:

(z + u)^n = Σ_{r=0}^{n} C(n, r) u (z − n + r)^r (u + n − r)^{n−r−1}.   [A.4]

□ □ □
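Abel's identity is easy to spot-check numerically. The following sketch (illustrative only; the helper name abel_rhs is hypothetical) evaluates the right side of [A.4] for a few values of (z, u, n) and compares it with (z + u)^n.

```python
from math import comb

def abel_rhs(z, u, n):
    """Right side of Abel's identity [A.4]:
    sum_{r=0}^{n} C(n, r) * u * (z - n + r)^r * (u + n - r)^(n - r - 1)."""
    return sum(comb(n, r) * u * (z - n + r) ** r * (u + n - r) ** (n - r - 1)
               for r in range(n + 1))

for z, u, n in [(2.5, 1.3, 6), (-1.0, 4.2, 9), (0.7, 0.1, 12)]:
    print((z + u) ** n, abel_rhs(z, u, n))   # the two columns should agree
```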

Page 58: Asymptotic Relative Efficiency

REFERENCES

BILLINGSLEY, P. (1968) Convergence of Probability Measures. New York: Wiley.

DURBIN, J. (1968) The probability that the sample distribution function lies between two parallel straight lines. Ann. Math. Statist., 39, 398.

FISHER, R.A. (1925) Theory of statistical estimation, Proc. Camb. Phil. Soc., 22,700.

HANNAN, J. (1960) Consistency of maximum likelihood estimation of discrete distributions. In Contributions to Probability and Statistics. Essays in Honor of Harold Hotelling, ed. Olkin, I. p. 249. Stanford: Stanford University Press.

KOLMOGOROV, A. (1933) Sulla determinazione empirica di una legge di distribuzione. Inst. Ital. Attauari. Giorn., 4, 1.

LE CAM, L. (1970) On the assumptions used to prove asymptotic normality of maximum likelihood estimates. Ann. Math. Statist., 41, 802.

LINDLEY, D.V. (1972) Review of The Theory of Statistical Inference by S. Zacks (Wiley), J. Roy. Statist. Soc., A 135,272.

PITMAN, E.J.G. (1965) Some remarks on statistical inference. In Bernoulli, Bayes, Laplace, ed. Neyman, J. & LeCam, L.M. p. 209. Berlin: Springer.

PRATT, J. (1960) On interchanging limits and integrals. Ann. Math. Statist., 31,74.

ROYDEN, H. (1968) Real Analysis, (2nd edn.) p. 71. New York: MacMillan.

STECK, G.P. (1971) Rectangle probabilities for uniform order statistics and the probability that the empirical distribution function lies between two distribution functions. Ann. Math. Statist., 42, 1.

Page 59: Asymptotic Relative Efficiency

INDEX

Abel's binomial formula, 104 Aim of the theory of inference, 1 Asymptotic normality theorem, 59 Asymptotic power

of a consistent test, 56, 57 of test based on the least member

ofa sample, 61 Asymptotic relative efficiency, 58 Asymptotic sensitivity rating, 39, 40

Basic principles of the theory of inference, I

Binomial distribution, 49

Cauchy distribution, 48 Conditional distributions, 24, 25, 27 Convergence

in mean, 98, 99, 101 in measure, 98

Cramer-Rao inequality, 29 for many statistics and/or many

parameters, 52, 53 regularity conditions for, 31-34 without regularity conditions, 34

Discrimination rate, 4, 25 Differentiability

in mean, 15, 19 in mean square, 22

Distance between probability measures, 6-10

Distance inequality, 35, 36 Discrete probability space, 1,9,25,

27,74-78 Durbin, J., 97

Efficacy, 30, 35,48,54 Efficacy rating, 39,40,54

Fisher, R.A., 18

Gamma distribution, 13, 15, 34, 44 Glivenko-Cantelli theorem, 79

Hannan, J., 74

Information, 18 Intrinsic accuracy, 18

Kolmogorov, A., 79, 80

Laplace distribution, 48 LeCam, L., 23 Likelihood

principle, 2 ratio, 2, 3

L'Hopital's rule, 102 L~, 5, 25, 27, 28 Locally sufficient, 25 Location parameter, 36,43,68-73 Loose convergence, 98

Mapping of measure spaces, 100 Mathematical preliminaries, 98-105 Maximum likelihood estimation,

63-78 for discrete probability space,

74-78 Median of a sample, 45

Negative binomial distribution, 49 Neyman-Pearson theorem, 3 Normal distribution, 33, 37, 40, 48, 49

Pitman, J.W., 85 Poisson distribution, 33, 49 Pratt, J., 99 Probability of sample path crossing

pair of parallel straight lines, 93-97 straight line not through the origin,

87-92

Page 60: Asymptotic Relative Efficiency

straight line through the origin, 84-87

References, 106 Regularity conditions for the Cramer-

Rao inequality, 31-34 Regular statistic, 29 Relative sensitivity rating, 24 Royden, H., 99

Sample distribution function, 79-97 Scale parameter, 43, 68-73

Scheffe's theorem, 100 Semi-smooth family, 15 Sensitivity, 19,48 Sensitivity matrix, 51 Sensitivity raJing, 24, 39, 40 Smooth family, 13, 50 Smoothness of conditional

distribution, 24-28 Statistic, 19 Steck, G.P., 84 Sufficient statistic, 2, 5 Symmetric matrix theorem, 103, 104

Page 61: Asymptotic Relative Efficiency

ASYMPTOTIC PROPERTIES OF THE WALD-WOLFOWITZ TEST OF RANDOMNESS

BY GOTTFRIED EMANUEL NOETHER

New York University

1. Summary. The paper investigates certain asymptotic properties of the test of randomness based on the statistic R_h = Σ_{i=1}^{n} X_iX_{i+h} proposed by Wald and Wolfowitz. It is shown that the conditions given in the original paper for asymptotic normality of R_h when the null hypothesis of randomness is true can be weakened considerably. Conditions are given for the consistency of the test when under the alternative hypothesis consecutive observations are drawn independently from changing populations with continuous cumulative distribution functions. In particular a downward (upward) trend and a regular cyclical movement are considered. For the special case of a regular cyclical movement of known length the asymptotic relative efficiency of the test based on ranks with respect to the test based on original observations is found. A simple condition for the asymptotic normality of R_h for ranks under the alternative hypothesis is given. This asymptotic normality is used to compare the asymptotic power of the R_h-test with that of the Mann T-test in the case of a downward trend.

2. Introduction. The hypothesis of randomness, i.e., the assumption that the chance variables X_1, ..., X_n have the joint cumulative distribution function (cdf) F(x_1, ..., x_n) = F(x_1) ⋯ F(x_n) where F(x) may be any cdf, is basic in many statistical problems. Several tests of randomness designed to detect changes in the underlying population have been suggested, however mostly on intuitive grounds. Very seldom has the actual performance of a test with respect to a given class of alternatives been investigated. It is the intention of this paper to carry out such an investigation for the particular test based on the statistic

R_h = Σ_{i=1}^{n} X_iX_{i+h},   X_{n+j} = X_j,

proposed by Wald and Wolfowitz [1]. It is suggested in [1] that this test is suitable if the alternative to randomness is the existence of a trend or a regular cyclical movement. Both these cases will be treated.

Let a_1, ..., a_n be observations on the chance variables X_1, ..., X_n and assume that the hypothesis of randomness is true. (Henceforth this hypothesis will be denoted by H_0 while the hypothesis that an alternative to randomness is true will be denoted by H_1.) Restricting then X_1, ..., X_n to the subpopulation of permutations of a_1, ..., a_n, any one of the n! possible permutations is equally likely, and the distribution of R_h in this subpopulation can be found. If


Annals of Mathematical Statistics 1950;21(2):231-246.

Page 62: Asymptotic Relative Efficiency


the level of significance α is chosen in such a way that α = m/n! where m is a positive integer, the test is performed by selecting m of the n! possible values of R_h and rejecting H_0 when the actually obtained value of R_h is one of these m values. The particular choice of the critical values should be such as to maximize the power of the test with respect to the class of alternatives under consideration.

Denote the expected value and variance of R_h in the subpopulation of equally likely permutations of n observations a_1, ..., a_n by E_0R_h and V_0R_h, respectively. Then it is shown in [1] that if h is prime to n

(2.1)   E_0R_h = (A_1² − A_2)/(n − 1)

and

(2.2)   V_0R_h = (A_2² − A_4)/(n − 1) + (A_1⁴ − 4A_1²A_2 + 4A_1A_3 + A_2² − 2A_4)/((n − 1)(n − 2)) − (A_1² − A_2)²/(n − 1)²,

where A_r = a_1^r + ⋯ + a_n^r, (r = 1, 2, 3, 4). Actually (2.1) and (2.2) are valid as soon as n > 2h.

Let R_h* = (R_h − E_0R_h)/√(V_0R_h). Then it is also shown in [1] that if h is prime to n, R_h* is asymptotically normally distributed with mean 0 and variance 1 provided the a_i, (i = 1, ..., n), satisfy condition W:

n^{-1} Σ_{i=1}^{n} (a_i − ā)^r / [n^{-1} Σ_{i=1}^{n} (a_i − ā)²]^{r/2} = O(1),   (r = 3, 4, ...),

where ā = n^{-1} Σ_{i=1}^{n} a_i.

It is easily seen that condition W is satisfied when the original observations are replaced by ranks. When the a_1, ..., a_n are independent observations on the same chance variable X, condition W is satisfied with probability 1 provided X has positive variance and finite moments of all orders. It is interesting to compare this condition for asymptotic normality of R_h in the population of permutations of observations on the chance variable X with the condition for asymptotic normality of R_h under random sampling. For this case Hoeffding and Robbins [3] have shown that it is sufficient to assume that X has a finite absolute moment of order 3. Thus it is desirable to weaken condition W. This will be done in Section 3.
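Formulas (2.1) and (2.2), and the permutation distribution they describe, can be checked directly for a small sample by enumerating all n! permutations. The Python sketch below is an illustration only (the helper names r_h and permutation_moments are mine, and the data vector is arbitrary); it assumes h prime to n so that (2.1)-(2.2) apply.

```python
import numpy as np
from itertools import permutations

def r_h(a, h):
    """Circular serial statistic R_h = sum_i a_i a_{i+h} (indices mod n)."""
    a = np.asarray(a, float)
    return float(np.sum(a * np.roll(a, -h)))

def permutation_moments(a, h):
    """Exact mean and variance of R_h over all permutations of a (small n only)."""
    vals = [r_h(p, h) for p in permutations(a)]
    return np.mean(vals), np.var(vals)

a = np.array([1.2, -0.3, 0.7, 2.1, -1.4, 0.5, 0.9])
n, h = len(a), 1
A1, A2, A3, A4 = (np.sum(a**r) for r in (1, 2, 3, 4))
e0 = (A1**2 - A2) / (n - 1)
v0 = ((A2**2 - A4) / (n - 1)
      + (A1**4 - 4 * A1**2 * A2 + 4 * A1 * A3 + A2**2 - 2 * A4) / ((n - 1) * (n - 2))
      - (A1**2 - A2)**2 / (n - 1)**2)
print((e0, v0))                      # formulas (2.1) and (2.2)
print(permutation_moments(a, h))     # exhaustive enumeration: should agree
```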

In further sections the consistency and efficiency of the test based on Rh will

¹ The symbol O, as well as the symbols o and ~ to be used later, have their usual meaning. See, for example, Cramér [2], p. 122.

Page 63: Asymptotic Relative Efficiency


be examined assuming that under the alternative hypothesis observations, though still independent, are drawn from changing populations. Throughout the paper the circularly defined statistic Rh is used. However, if with probability 1

X,-h4+jxl + * * * + XnXh = o(R4h),

it is seen that asymptotically the test based on the non-circular n-h

Rh = XiXi+h i=l

has the same properties as that based on Rh . We find

E Rh= (A2-A2), n(n - 1)

V0Rfth = n(- - I) (A2-A4)+ n(n-2h) (A) A2-A2-2A1A3 + 2A4)

+ (n - h - 1)(n - h - 2) + -2(h -1) (Al-6A2A2+ 8A1A3+ 3A2- 6A,) + n(n -1)(n -2)(n -3) 12

(n-h)2 (A 2-A2) 2 n2(n - 1)2

3. Asymptotic normality of R_h under randomization. Let the set of chance variables X_1, ..., X_n be defined on the n! equally likely permutations of n numbers 𝔄_n = (a_1, ..., a_n). Then we have

THEOREM 1: The distribution of R_h* tends to the normal distribution with mean 0 and variance 1 as n → ∞ provided

(3.1)   Σ_{i=1}^{n} (a_i − ā)^r / [Σ_{i=1}^{n} (a_i − ā)²]^{r/2} = o[n^{(2−r)/4}],   (r = 3, 4, ...),

where ā = n^{-1} Σ_{i=1}^{n} a_i.

REMARK: The set 𝔄_n need not be a subset of 𝔄_{n+1}. The proof of this theorem will be omitted, since it is very similar to the proof of another theorem by the author [4].

THEOREM 2: If a_1, a_2, ... are independent observations on a chance variable X having positive variance and a finite absolute moment of order 4 + δ, δ > 0, condition (3.1) is satisfied unless possibly an event of probability 0 has occurred.

The proof of this theorem will be based on Markoff's method for proving the central limit theorem in the Liapounoff form.2 Thus we shall show that there exists a sequence of sequences 8n = (bnl X * , bnn) such that unless possibly an event of probability 0 has occured, (i) there exists an index n' (depending

2 See, for example, Uspensky [51, pp. 388-95.

Page 64: Asymptotic Relative Efficiency


on the given sequence) such that for n > n', 21n = ,n e and (ii) the sequences en

satisfy condition (3.1) expressed in terms of the bni , (i = 1, - *, n). It is no restriction to assume that EX = 0, since the addition of one and the

same constant to every as does not change (3.1). Let

N = N(n) = nlI(4+8/2)

and define for i = 1,** n

bni = ai Cni = O, if ai < N(n),

- , = ai, if ai > N(n),

so that ai = bni + Cni. Then bni and cni can be considered as observations on chance variables Yn and Zn, respectively, where

Yn = X, Zn 0=, if X < N (n),

=0, =X, if X>N(n).

Further let pn = PIZn, = X} , a, (U) = EU', 3, (U) = E i U j' where U =X,

Yn , Zn and r is positive integral, if these moments exist, 14+? = E I X 14+6 and finally, let F(x) be the cdf of X.

In order to prove (i) consider the infinitely dimensional sample space Q with the generic pointw = w(al , a2, ** ) and let En = I an > N(n)}, (n = 1, 2,..). Then En has probability measure Pn . We shall show that n=1 pPn converges. Since

X 0 -N +X)

#4= f i x i4+6 dF(x) > N4+ [L dF(x) + / dF(x) ? N4+ p,n,

we find 1 1

Pn ? /4+6 4+6 0 134+ n(4+6)1(4+612)

Now (4 + 8)/(4 + 8/2) > 1 and the infinite sum converges. It follows that the set E of points which belong to infinitely many sets En has probability measure 0. Thus for every point w e Q except those in a set of measure 0 there exists an index n. (depending on w) such that for n > n,,

(3.2) a. < N(n).

Further, since nZ is finite and N(n) -* oo, it follows that for these points there exists a second index n' > n. such that in addition to (3.2) an < N(nJ), (n = 1, .. , nm). Thus except on a set of measure 0 tae sequences 0. are identical with the sequences 21n for n > n,' . This proves (i).

In proving (ii) let Bn, = jj brn, (n, r = 1, 2, ...). We first note that under the assumptions of the theorem n'A, -> a,(X) for r = 1, 2, 3, 4 except on a set of measure 0. Thus except on a set of measure 0

a = n-1A = o(1), A2 = Q(n), A3 = 0(n), A4 = O(n),

3 A function f(n) is said to be of order Q(nk), k real, if f(n) = O(nk) and lim inf If(n)/nk I > 0.

Page 65: Asymptotic Relative Efficiency


and therefore by the argument used in proving (i) again except on a set of measure 0

bn = n-'B., = o(1), B,a2 = Q(n), Bn3 = 0(n), Bn4 = O(n).

It follows that in order to prove (ii) it is sufficient to show that

(3.3) Bnr = o[n(r+2)14j, (r = 5, 6, ***),

except on a set of measure 0. Now for r > 5

ar(Yn) < 3r(Yn) < Ar4 4(Yn) < NAr ,4(X),

and therefore

ar(Yn) = O(NI4) = o[n(r4)1(4+5/2)I.

It follows that EBnT = na7(Yn) = OLn(r+62)/(4+5/2)i

and var Bnr = n var Yn= n[a22(Yn) - ar(Yn)] = 0

so that o7(Bnr)

= o[n(r+?54)I(4+8/2)1.

Assume now that for some r > 5 (3.3) is not satisfied on a set Fr having measure er > e > 0. We shall show that this assumption leads to a contradiction, and that therefore (3.3) is true.

Choose e such that

(3.4) 1/2 < e < (16 + r5)/(32 + 46).

Since r 2 5, (3.4) can always be satisfied. Then the infinite sum E - (l/n28) converges, and a positive constant d can be found in such a way that

P 2 nE n2e

If we then write the Tchebysheff inequality

P1l Bnr -EBnr > dea (Bnr)} < 1/d2n2,

it is seen that except on a set having at most measure p

B = Otmax[n(r+512)I(4+512) nen(r+514)I(4+512)I }

Now for r 2 5

(r + 8/2)/(4 + 6/2) < r/4 and by (3.4)

e + (r + 6/4)/(4 + 6/2) = e + r/4 + (a/4 - r6/8)/(4 + 6/2)

< r/4 + (16 + 26)/(32 + 46) (r + 2)/4,

Page 66: Asymptotic Relative Efficiency


so that the measure of the set F, is not even equal to e. This contradicts our assumption, thus proving Theorem 2.

4. Consistency. To prove consistency of tests based on permutations of observations a,, , - a,, the following procedure can be applied. Let the test statistic be S. = S(xi, , * - xn) and denote by E?n = E0(a , * * *, a,,) and V? =

VO (a, * * , a.) the expected value and variance of Sn under the assumption that the set of random variables X1, , X. is restricted to the subpopulation consisting of the n! equally likely permutations of the observations. Assume that for the alternatives under consideration large values of Sn are critical. Then we reject the null hypothesis whenever (S. - E4n)/N/VV? > k where k is some positive constant depending on the limiting distribution of S,, under the assumption of equally likely permutations and the level of significance. Thus in order to prove consistency we have to show that

(4.1) urn P }>kIHi 1.

(4.1) will be satisfied if for some e > 0

(S~ - EO, lim VnV = 1.

Thus we shall have proved consistency, if we can show that when H1 is truie, E? /VnV9, converges in probability to 0 and there exists some e > 0 such that limn .o P{S,/VNVT. > E I Hl= 1.

Applying this method to our problem and noting that a corresponding pro- cedure could have been used in the case when small values of Sn are critical, we obtain

THEOREM 3: The test based on R,h is consistent with respect to alternatives for which

(4.2) E Rh x/nVO Rh

and there exists some E > 0 such that

nc \ {j AV0 RA >

where E0Rh and V0Rh are given by (2.1) and (2.2), respectively. In what follows it will always be assumed that uinder the alternative hypothesis

observations are independent from chance variables Xn with continuous cdf's Fn(x), (n = 1, 2, -..). We shall often have the opportunity to make use of the fact that the test is not changed if one and the same constant is subtracted from every observation. This wvill be helpful in reducing our problem to one for which (4.2) is true.

Let a, be the rank of the observation xi on the chance variable Xi, (i =

Page 67: Asymptotic Relative Efficiency


1, * **, n). Then it is no restriction to assume that these ranks take the special form

- (n-1)/2 -(n - 3)/2, *, (n - 1)/2,

so that A1 = 0, A2 = (nn2 - 1)n = Q(n3) and

(4.4) VR A- n5) n

and therefore (4.2) is always satisfied. Before we can find conditions under which (4.3) is satisfied, we have to in-

vestigate the expected value and variance of Rh when H1 is true. For this purpose writeai = ;_1 yq,j (i = 1, ,n),

(415= -1/2 if x , > xi,

= 1/2 if xi<xj,

Then if P{X, < Xjl = p,j, (i,j = 1, ... , n), we find

Eyij pij - (1- pi) = pi - = eii (say).

Further, nn n

(4.6) Rh XE i E Yij Yi+h,k, Yn4+,k = YlIA i-1 i-1 k-I

Therefore

(4.7) E(Rh IHD H =) ji I , esje+h,l + 0(n) i j k

and

var Rh = E VI i ijYi+h,kYajYa+h,p - E I yiJyi+h,jCE EI YxpYa+h, (4.8) jik a''y ijk a 't

(8= a Z (Eyij yi+h,k yajl ya+h, - Eyij yi+h,k Eyap ya+h,y). ijk aft'r

In (4.8) the expression in parentheses is 0 unless one of the Greek indices (in- cluding a + h) equals one of the Roman indices. Therefore var (Rh H O) = 0(n5).

It then follows from (4.4) that

12 1 Rh/xVVRh n3Rh,

- 12 lim -nE(RhIHI), Pr n-.o

and we can state the following corollary to Theorem 3: COROLLARY: When using ranks, the test based on Rh is consistent, if under the

alternative hypothesis

1n n n (4.9) E Ci fijE+h,k = W(1)) (4.9) n3 i=1 j=1 k=1

where ci = P{Xi < Xi) 2-

Page 68: Asymptotic Relative Efficiency


Since i, = - Ej, we can write

E EI EI EjEi+h,.k "= E2 Z E eij(Ei+hk - Ej+h,k) = L, (say), i j k k i j>i

and the test is consistent if

(4.10) lim 1 L # 0.

4.1. Downward (upward) trend. Assume that for i < j and all k

(4.11) ei, < 0

and

(4.12) Esk < Ejk -

These requirements are equivalent to P{Xj < Xi} < 1/2 and P{Xi < Xk} <

P Xj < Xk} and are satisfied if the alternative to randomness is a downward trend in the sense that Fj(x) < Fj(x), (- oo < x < oo, i < j), with at least one interval of strict inequality.

(4.11) and (4.12) are not sufficient for (4.10) to be true. Thus assume in addition that there exist a positive integer n' and a number e < 0 such that l.u.b. ji> ne, ei e. Then

1 1 lim - L > lim - E X Eqj(Ei+h,k - Ej+h,k) n - on3 n boo n3 k=1 i<!gk-h- n

j>z::k-h+n'

n 2 > 2E2 Jim - 3E (k - h -n')(n - k + h - n' + 1) = 2e( - 3 > 0,

n boo n k=l

and the test is consistent. The case of an upward trend can be treated in exactly the same way. The

test is consistent with respect to alternatives for which for i < j and all k, Eii > 0, eik > eik , and g.l.b. j -i e,n n ?-e, where this time e > 0.

Another test of randomness, the so-called T-test, has been proposed by Mann [6] with exactly this alternative of a downward (upward) trend in mind. This T-test is also consistent provided certain general conditions are satisfied. Thus the question arises which of the two tests should be chosen if a downward (upward) trend is feared. This question will be considered in Section 7.

4.2. Cyclical movement. Let the class of alternatives be specified by

(4.13) E1g?a,mg?+f = Ea#, (a, y3 1, .** 9 > 1; 1, m = 0, 1, ...),

in other words, assume that the statistic Rh is used to test for randomness while under the alternative hypothesis there exists a regular cyclical movement with a period of length g. It is sufficient to consider the case h < 9.

If (4.13) is true, n n

(4.14) E fi ij Ei+h,k = n2 e i. ei+h,. + 0(n2) = n3t7 + 0(n2), i jk=l1il

Page 69: Asymptotic Relative Efficiency


where

(4.15) = - Z Eia g a=1

and

(4.16) 1 0Z E. E.+h,. g a=1

Thus in view of (4.9) the test is consistent if 7 $ 0. If h = g, q reduces to a sum of squares and is therefore > 0 if some (ca. # 0.

However it is possible that some or even all 'E a? 0, (a # f), and still 'Ea.- 0 If this happens, the test is iiiconsistent, otherwise it is consistent. If under H1 the populations from which consecutive observations are drawn differ only in location, the above mentioned exceptional case cannot happen, and the test is always consistent with respect to this class of alternatives.

If h < g, it is not difficult to construct an example where a=1 fa.Ea+h- $ 0

while Za=1 Era.Er.+h. = 0, where the ra are a permutation of the numbers 1,*** , g. Thus in this case it is not sufficient that some Ea. 0 0 for the test to be consistent. Consistency may also depend on the order of the elements of a period.

We may conclude that if g is known, we should always choose h g g. If g is not known, we may as well take h = 1.

4.3. Change in location. Turning now to the case when the test is performed on the basis of the original observations, it will often be appropriate to assume that under the alternative hypothesis the distribution remains the same except for a location parameter. We shall consider only the case of a cyclical movement.

Thus let

Fn(x) = F (x - m,) (n = ly 2, ...*)

where F(x) is the cdf of a chance variable U having mean 0, and m, is a location parameter. It will also be assumed that U has the positive variance a' and a finite fourth moment.

In the cyclical case with period g

(4.17) mzg+?=ma (c=1,*..,g>1;l=O,1,* )

We shall find conditions under which our test is consistent with respect to alternatives of this kind. Obviously we can assume that _1 m. = g - = 0, since otherwise we could have subtracted m from every observation. Writing then an Un + mn , (n = 1, 2, * * * ), where un can be considered as an observation on the previously defined chance variable U. we find

n n A1 = E ai = ui + 0(1),

i=1 i=1 n n n

A2 = Eu2 + 2 E uimi + E m i=l i=l i=l

n 9 na n _

=-, u, + 2 mL a Ulg+a + - ma + 0(l) = a=1 z=0 a=l

Page 70: Asymptotic Relative Efficiency


where na is the largest integer such that n,,,g + a < n and [n/g] the largest integer < n/g. A3 and A4 are given by similar expressions. Since we assumed that EU =0, EU2 = 0-2 > 0, and EU4 < oo , we have with probability 1

rL n ~~~~~~N. n

Zui o(n), Z = =2(n), 0(n), Zu = 0(n), i=1 i=1 i=1 i=1

so that with the same probability

A1 = o(n), A2 = S(n), As= 0 (n), A = 0(n)-

It follows that with probability 1

2 E0Rh = o(n), V0Rh ' A2 =

and condition (4.2) of Theorem 3 is satisfied. Since further

n nT

var Rh = Z var (,xixh) + 2 cov (xi Xi h, Xi+h Xi+2h) "4_1 vai=lX+h

n - E { (f2 + m2)(r2 + m2+h) - 2mm 2h}

(4.18) n

+ M2 mi+2h (U2 + m2+h) ms mi+h m^+2h -

n

- Z ?{u,4 + u2(m + WI + 2m.mi+2h) 0 (n)

and therefore except on a set of probability measure 0

Rh R, -~ Rh lim - E(Rh H 1i) Rh____ R _ n 00 n

nVRh 'A 1) pr 2 1 - A2 u+- mma n 9~~~ a=1

condition (4.3) is satisfied provided lim n-x n-'E(Rh I H1) 9 0. Now E(Rh HI) - [n/g] a1 mama+h + 0(1), so that the test is consistent with respect to the class of alternatives (4.17) for which

Z (ma - mi)(ma+h - ) i 0,

where im = 9g' Z=1 ma. Thus by the same argument as in the case of ranks, the test is consistent whenever h = g, while it may or may not be consistent if h < g.

5. Limiting distribution of Rh under H1 in case of ranks. For the remaining two sections, it is of importance to know conditions under which Rh based on ranks is asymptotically normal under the alternative hypothesis. Using the methods of moments, it can be shown that in this case the distribution of

Page 71: Asymptotic Relative Efficiency


(Rh- ERh)/a (Rh) tends to the normal distribution with mean 0 and variance 1 provided var Rh =-(n5).

Generalizing the method used in Section 4 in evaluating the variance of Rh , it is

not difficult to see that E(Rh - EPh)2s+l O(e +2), (S = 0, 1, *.. ). It follows that if var Rh =Q(n5), the odd moments are asymptotically zero. By means of a more careful analysis, it is also possible to show that E(Rh- ERh )2, -- (2s - 1) (2s - 3) - 3(var Rh)'. This proves our statement.

6. Ranks versus original observations. We have seen in Section 4 that if the alternative hypothesis is characterized by a regular cyclical movement the test based on Rh iS consistent both for original observations and for ranks, provided h = g, where g is the length of a cycle. The question arises which test is more efficient, the one based on original observations or the one based on ranks.

In trying to answer this question, we shall make use of a procedure due to Pitman4, which allows us to compare two consistent tests of the hypothesis that some population parameter 9 has the value 00 against the alternatives 0 > 00 using critical regions of size a, Si,, ? Si,(a), (i = 1, 2), where Si,, is a statistic having finite variance and Sin((a) is an appropriate constant. The relative efficiency of the second test with respect to the first test is defined as the ratio n1/n2 where n2 is the sample size of the second test required to achieve the same power for a given alternative as is achieved by the first test using a sample of size ni with respect to the same alternative.

Let E(Sin I 0) = 'in(()7 var(Sin I 0) = 0-*(9)l and Vf4n(00)/-in(O0) = Hi(n). Assuming that the alternative is of the form t- = 0 + k/Vn, where k is a positive constant, Pitman has shown that the asymptotic relative efficiency of the second test with respect to the first test is given by lim [H 2(n)/H2(n)], pro- vided there exists a number e>O such that for 0? < 9 < 00 e

(6.1) )t'/,n(0) exists;

as On 00with n -* oo

(6.2) i (? 1

and

(6.3) vin(On) 1

o~(00)

(6.4) lim Hi(n) - ci, where ci is some positive constant;

(6.5) the distribution of [Sin - n(O)]/0Jin(f) tends to the normal distribution with mean 0 and variance 1 uniformly in 9.

4 I should like to thank Professor Pitman for his kind permission to quote from his lectures on non-parametric statistical inference which he delivered at Columbia University during the spring semester 1948.

Page 72: Asymptotic Relative Efficiency


Condition(6.5) can be replaced by the weaker condition

(6.5') the distribution of [Si, - tends to the normal distribu- tion with mean 0 and variance 1 as n -- oo.

In our case, in order to insure consistency, it will be assumed that h = g. Consider the parameter

1 h2

(6.6) a=- (ma -m),( h a=1

where as before ma is the expected value of the (lh + a)th observation, (1 - 0, 1, *..). We want to find the asymptotic relative efficiency of the test per- formed on ranks with respect to the test performed on original observations as 0-*Owithn oo.

Again it is no restriction to assume that

1h (6.7) m- = hE ma = 0. h a=1

Assume further that the chance variable U defined in 4.3 has a finite absolute moment of order 4 + 8, a > 0. Then R? -\/' nRh/A2 with probability 1 and, if the null hypothesis is true, it follows from Theorem 2 that with the same probability the statistic

n

E _ XiXi+h

i=l

has in the population of permutations of the observed sample values an asymptot- ically normal distribution with mean 0 and variance 1. This, however, is also the limiting distribution of Qh under random sampling when the null hypothesis is true, as follows from the results of Hoeffding and Robbins [3]. Thus it will be sufficient to find the asymptotic relative efficiency of the Rh-test for ranks with respect to the QA-test. In doing this, it will also be assumed that U has a con- tinuous density function f(x) = F'(x), and, in order to simplify notation, that there are nh observations instead of n.

In finding HQ(nh), let xa, j = Xaj = X(jl)h+a and ua, j = Uaj = U(j-l)h+a

(a = 1, ,h;j 1, , n). Then

1 h n h n

= h aZ aj = Z(Uai + ma)2 nhnh a=1 p-i1 nha- a =

h n n

E g

Ua; + 2m,,E Uaj + nm a 2 + _

nh a=l = =l )_ prl

Further, h n n

M2 Rh = E Uaj Ua,j+1 + 2ma Uaj + a

a=l tj=1 j=l J

Page 73: Asymptotic Relative Efficiency


so that

N/-hRh /h EQh =E 1 + oa = Cn (0)

nhA2 Therefore

2

Also by (4.18) h

4 2 2 nha4 + 4no r, m 4 +4_2 0 varQh -a-1

_ _4_

var Qh nh(o-2 + 0)2 (a2 + 0)2

which converges to 1 as 0 -* 0. It follows that

(6.8) HQ(nh) - Qn (0) 2h

Conditions (6.1)-(6.5) are easily seen to be satisfied. Considering now the Rh-test for ranks, we know that (nh)F12Rh has finite

variance. From (4.7) and (4.14)-(4.16) it is found that

(6.9) E[(nh) 52R, I 1] - /'ihq = -V/h \- = /=Rn(0)

and after some computations

(6.10) 4Rn(O) V [h f2 (X) dX

From (4.4) and (6.10)

HR(nh) = 12V\I- [ ff2(x) dx].

Conditions (6.1)-(6.4) and (6.5') can be shown to be satisfied. Thus the asymptotic relative efficiency of the test based on ranks with respect

to the test based on original observations is

144nh L f2() dxl X 4 (6.11) HRQ nh/4 - = 144 Lf2() As is not difficult to see, this expression is independent of location and scale.

Let the chance variable U have density function

O, x < -1, x >1,

1(x ,+a -1 < x < a,"

- x < - < 1 < 1-a' a < x<

Page 74: Asymptotic Relative Efficiency


i.e., let the graph of f(x) be given by the two straight lines connecting the points (-1, 0) and (1, 0) with the point (a, 1). Then EU = a/3, var U = - (3 + a2),

lf2(x) dx = 2/3, and (6.11) becomes [8(3 + a2)/27]2. Thus HRQ increases

with j a 1. For a = 0, it is equal to 64/81; for j a = 1, it is equal to (32/27)2. It is equal to 1, for a = - /8.

This' example shows that the asymptotic relative efficiency of the rank test with respect to the test based on original observations may be < 1, = 1, or > 1, depending on the density function f(x). Unless f(x) is explicitly given, no state- ment can be made as to which of the two tests is to be preferred.

We are now in a position to give at least a partial answer to a question raised in [1]. In concluding their paper, Wald and Wolfowitz note that the problem dealt with in this section can be posed not only when transforming to ranks, but also for any transformation carried out by means of a continuous and strictly mono- tonic function h(x).

Let t = h(x) be such a transformation, satisfying in addition the condition that Pitman's procedure remains applicable for the transformed distribution. Corre- sponding to 2 and Q we shall use 9 and Qs . Let h(ma) = pa , h1' a Aa (# )2 = t. Then if EQ8 ' OQn(O) by (6.8), (6.9), and (6.10)

d4__Qn(_ - d#Qtn dt d,|

dO d4 d/ dO eo

(6.12) = x7h { _fd [ f2(X) dX] = HQ(nh), aJt fL 2[g(t)Ig,2 (t) dt co

where g(t) is the inverse of h(x). Therefore by (6.8) and (6.12)

=? {LC f2(X) dX} HQQ t QX {crt L f2 [g(t)Ig'2(t) dt}

and the asymptotic relative efficiency does not merely depend on h(x), the operator defining the transformation, but also very essentially on the underlying distribution f(x).

7. Comparison of the Rh- and T-tests. The T-test by Mann [6] designed to test for randomness against a downward trend is based on the statistic

n

T = Z X (yij + 2) = y E yij + ?n(n- 1), i=l j>i i ,>i

where yi, is defined by (4.5). Making the same assumptions as in 4.1, Mann shows that under the null hypothesis T has a limiting normal distribution with

Page 75: Asymptotic Relative Efficiency


mean in(n - 1) and variance (2n3 + 3n2 _ 5n), while under the alternative hypothesis

(7.1) ET =n(n - 1) (2n + 1),

where rn is defined by in(n - 1)r" = Ei Ej>i -ij < 0. Let

S= 6 [T - ln(n- 1)1.

When Ho is true, S. is asymptotically normal with mean 0 and variance 1. If

we then put 4(X) = f e'4 dx, a critical region for testing Ho is given by

Sn ? - , where X is determined in such a way that +(X) = a, the level of significance.

When H1 is true, we find from (7.1)

E(Sn It n) --, 3 \n- rn

By paralleling the proof of asymptotic normality of Rh under H1 given in Section 5, it can be shown that (Sn - ESn)/10(Sn) is asymptotically normal with mean 0 and variance 1 provided 0(Sn) = Q(1). This is essentially the result obtained already by Hoeffding [7]. Thus the asymptotic power of the test based on S, is given by

(7.2) Pis ? -} + (x +V 3 )

converging to 1, provided limn, Xn tV4 =-? . This is the condition for consist- ency given by Mann.

We may ask for the asymptotic power of the S,,-test as 0 -*0 with n -* m. More exactly, instead of considering a certain alternative ei, = kic, where the kic are given constants, consider the alternative (changing with n)

kii (7.3) E,3 = n

If then as n -X o

2 , ,kfii - k

n(n- 1) j>

and

o(S.) - 1,

it follows from (7.2) that the asymptotic power of the Sn-test, and therefore of the T-test, for alternatives (7.3) is equal to

q(X + 3k).

Page 76: Asymptotic Relative Efficiency

246 GOTTFRIED E. NOETHER

Now consider the same situation when the statistic Rh is used instead of T. We know that when Ho is true

R 12Rh R n -n/2 ;h

where Rh is given by (4.6), is asymptotically normal with mean 0 and variance 1. 1

Thus in this case the critical region is given by R′n ≥ λ. If we set r̄n = n⁻³ Σᵢⱼₖ εᵢⱼ εᵢ₊ₕ,ₖ,

we find

E(Rn i n) 12Vn\ X

and asymptotically the power of the Rn-test is

(7.4) PftR/ > XI} 12\/

<t( R) n

provided a(Rn) = Q(1). Thus the test is consistent if limn-- V, =\/n n . oHow- ever, for the alternative (7.3), (7.4) tends to +(X) = a, provided that as n - o

a(R'n) 1.-

Thus the Rh-test is ineffective with respect to the alternative (7.3) in contrast to the T-test. This means that for this alternative the asymptotic relative efficiency of the Rh-test with respect to the T-test is 0.

Acknowledgment. The author wishes to acknowledge the valuable help of Professor J. Wolfowitz, who suggested the topic and under whose direction the work was completed.

REFERENCES

[1] A. WALD AND J. WOLFOWITZ, "An exact test for randomness in the non-parametric case, based on serial correlation," Annals of Math. Stat., Vol. 14 (1943), pp. 378-388.

[2] H. CRAMÉR, Mathematical Methods of Statistics, Princeton Univ. Press, Princeton, 1946.

[3] W. HOEFFDING AND H. ROBBINS, "The central limit theorem for dependent random variables," Duke Math. J., Vol. 15 (1948), pp. 773-780.

[4] G. E. NOETHER, "On a theorem by Wald and Wolfowitz," Annals of Math. Stat., Vol. 20 (1949), pp. 455-458.

[5] J. V. USPENSKY, Introduction to Mathematical Probability, McGraw-Hill, New York, 1937.

[6] H. B. MANN, "Nonparametric tests against trend," Econometrica, Vol. 13 (1945), pp. 245-259.

[7] W. HOEFFDING, "A class of statistics with asymptotically normal distributions," Annals of Math. Stat., Vol. 19 (1948), pp. 293-325.

Page 77: Asymptotic Relative Efficiency

ON A THEOREM OF PITMAN

BY GOTTFRIED E. NOETHER

Boston University

Summary. A theorem by Pitman on the asymptotic relative efficiency of two tests is extended and some of its properties are discussed.

1. Introduction. The idea of the relative efficiency of one estimate with re- spect to another estimate of the same parameter is well established. This can- not be said, however, of the corresponding concept for two tests of the same statistical hypothesis. This paper is concerned with a definition of the relative efficiency of two tests which seems to be due to Pitman (see, e.g., [1] p. 241) and has been used in several recent papers.

DEFINITION. Given two tests of the same size of the same statistical hypothesis, the relative efficiency of the second test with respect to the first is given by the ratio ni/n2, where n2 is the sample size of the second test required to achieve the same power for a given alternative as is achieved by the first test with respect to the same alternative when using a sample of size n1 .

In general the ratio ni/n2 will depend on the particular alternative chosen (as well as on n1). However, in the asymptotic case, this somewhat undesirable fact can be avoided. It might be argued that restriction to the asymptotic case is even more undesirable in itself, but the unfortunate fact remains that for many test procedures in current use the asymptotic power function is the only one available.

Now, at least for consistent tests, the power with respect to a fixed alternative is practically 1 if the number of observations is sufficiently large. Therefore, the power no longer provides a worthwhile criterion for preferring one test over another. On the other hand, it is possible to define sequences of alternatives changing with n in such a way that as n -+ oo the power of the corresponding sequence of tests converges to some number less than 1. It seems then reasonable to define the asymptotic relative efficiency of the second test procedure with respect to the first test procedure as the limit of the corresponding ratios n1/n2 X

A theorem due to Pitman allows us to compute this limit if certain general conditions are satisfied [1]. The purpose of this paper is to give an extension of Pitman's theorem and to discuss some of its properties. The derivation of the present version of the theorem follows Pitman's original method of proof. Since this proof has not appeared in print, full details are given.

2. Asymptotic power. Assume that we want to test the null hypothesis H₀: θ = θ₀ against alternatives H₁: θ > θ₀. As mentioned in the Introduction, we shall assume actually that a particular alternative θ = θₙ changes with the sample size n in such a way that limₙ→∞ θₙ = θ₀.

Received April 7, 1954. 64

Annals of Mathematical Statistics 1955;26(1):64-68.

Page 78: Asymptotic Relative Efficiency

THEOREM OF PITMAN 65

To be more definite, let the test be based on the statistic Tₙ = T(x₁, ..., xₙ). Let¹ E Tₙ = ψₙ(θ) and var Tₙ = σₙ²(θ). Assume that

A. ψₙ′(θ₀) = ··· = ψₙ^(m−1)(θ₀) = 0, ψₙ^(m)(θ₀) > 0,

B. limₙ→∞ n^(−mδ) ψₙ^(m)(θ₀) / σₙ(θ₀) = c > 0 for some δ > 0.

The indicated derivatives are assumed to exist. We shall consider the power of the test based on Tₙ with respect to the alternative H₁: θₙ = θ₀ + k/n^δ, where k is an arbitrary positive constant. In addition to A and B we shall assume

C. limₙ→∞ ψₙ^(m)(θₙ) / ψₙ^(m)(θ₀) = 1, limₙ→∞ σₙ(θₙ) / σₙ(θ₀) = 1,

D. the distribution of [Tₙ − ψₙ(θ)] / σₙ(θ) tends to the normal distribution with mean 0 and variance 1, uniformly in θ, for θ₀ ≤ θ ≤ θ₀ + d for some d > 0.

Let φ(λ) = ∫_λ^∞ exp(−x²/2) dx / √(2π) and find λ_α such that φ(λ_α) = α.

For sufficiently large n, a critical region of approximate size α is given by

Tₙ ≥ Tₙ(α), where [Tₙ(α) − ψₙ(θ₀)] / σₙ(θ₀) = λ_α.

The power of this test with respect to the alternative H₁ is given by

Lₙ(θₙ) = P{Tₙ ≥ Tₙ(α) | θ = θₙ} ≈ φ(tₙ),

where tₙ = [σₙ(θ₀)λ_α + ψₙ(θ₀) − ψₙ(θₙ)] / σₙ(θₙ). Now

ψₙ(θₙ) = ψₙ(θ₀ + k/n^δ) = ψₙ(θ₀) + (1/m!)(k/n^δ)^m ψₙ^(m)(θₙ*), θ₀ < θₙ* < θₙ,

and

tₙ = [σₙ(θ₀)/σₙ(θₙ)] λ_α − (kᵐ/m!) [ψₙ^(m)(θₙ*)/ψₙ^(m)(θ₀)] · [n^(−mδ) ψₙ^(m)(θ₀)/σₙ(θ₀)] · [σₙ(θ₀)/σₙ(θₙ)] → λ_α − kᵐc/m!.

Thus asymptotically, Lₙ(θₙ) → φ(λ_α − kᵐc/m!).

It follows from the proof that condition D can be replaced by the somewhat weaker condition

D′. the distribution of [Tₙ − ψₙ(θₙ)] / σₙ(θₙ) tends to the normal distribution with mean 0 and variance 1, both under the alternative hypothesis H₁ and under the null hypothesis θₙ ≡ θ₀.

It is also clear that alternatives of the type θₙ < θ₀ or θₙ ≠ θ₀, or the case when ψₙ^(m)(θ₀) < 0, can be handled correspondingly.
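To illustrate the limiting power φ(λ_α − kᵐc/m!) in the common case m = 1, δ = ½, here is a small Monte Carlo sketch (not part of the paper; Python, assuming NumPy and SciPy) for the one-sample sign test with N(θ, 1) data, where c = 2f(0) = √(2/π), so the power at θₙ = k/√n should approach 1 − Φ(z_α − k·2f(0)).

```python
import numpy as np
from scipy.stats import norm, binom

rng = np.random.default_rng(0)
alpha, k, reps = 0.05, 2.0, 200_000
c = 2 * norm.pdf(0)                      # efficacy constant of the sign test: 2 f(0)
limit = 1 - norm.cdf(norm.ppf(1 - alpha) - k * c)

for n in [50, 200, 1000, 5000]:
    theta_n = k / np.sqrt(n)             # local alternative theta_n = k / n^(1/2)
    p_pos = norm.cdf(theta_n)            # P(X > 0) under the alternative
    crit = binom.ppf(1 - alpha, n, 0.5)  # size-alpha cutoff of the sign statistic under H0
    counts = rng.binomial(n, p_pos, size=reps)
    print(n, round(np.mean(counts > crit), 3))

print("limiting power:", round(limit, 3))
```

Because the binomial critical value is conservative at each finite n, the simulated power approaches the limit (about 0.48 here) from below.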

Let γ > δ and consider the alternative H₂: θₙ = θ₀ + k/n^γ, an alternative which converges to θ₀ faster than H₁. Now

limₙ→∞ (1/m!)(k/n^γ)^m ψₙ^(m)(θ₀)/σₙ(θ₀) = limₙ→∞ (kᵐ/m!) n^(−m(γ−δ)) · n^(−mδ) ψₙ^(m)(θ₀)/σₙ(θ₀) = 0,

¹ J. Putter [4] has pointed out that the functions ψₙ(θ) and σₙ²(θ) need not necessarily be the mean and variance of Tₙ, respectively, as long as conditions A, B, C, and D or D′ are satisfied.

Page 79: Asymptotic Relative Efficiency

66 GOTTFRIED E. NOETHER

and the power of our test with respect to this alternative H₂ is equal to the size of the critical region. The test cannot distinguish between H₀ and H₂. Similarly, if γ < δ, the power of our test converges to 1.

3. Asymptotic relative efficiency of two tests. Assume now that we have two tests based on the statistics T₁ₙ and T₂ₙ. Assume further that δ₁ > δ₂ and consider the alternative H₁: θₙ = θ₀ + k/n^δ₁. It follows from our previous results that the second test is ineffective with respect to this alternative, while the power of the first test can be made as large as we please by choosing k sufficiently large. Therefore, the asymptotic relative efficiency of the second test with respect to the first test is zero.

Thus, from now on, we assume that δ₁ = δ₂ = δ. According to our definition of relative efficiency, the two tests must have identical power with respect to identical alternatives. The two tests have identical power if

(1) k₁^m₁ c₁/m₁! = k₂^m₂ c₂/m₂!.

The alternatives are identical if

(2) k₁/n₁^δ = k₂/n₂^δ.

If now m₁ = m₂ = m, as it must be if mᵢδᵢ = ½ for i = 1 and 2, which is true in most cases, we can proceed as follows. From (1) and (2) we have

n₁/n₂ = (k₁/k₂)^(1/δ) = (c₂/c₁)^(1/(mδ)) = limₙ→∞ {(1/n)[ψ₂ₙ^(m)(θ₀)/σ₂ₙ(θ₀)]^(1/(mδ))} / {(1/n)[ψ₁ₙ^(m)(θ₀)/σ₁ₙ(θ₀)]^(1/(mδ))} = limₙ→∞ R₂ₙ^(1/(mδ))(θ₀) / R₁ₙ^(1/(mδ))(θ₀),

where

(3) Rᵢₙ(θ) = ψᵢₙ^(m)(θ) / σᵢₙ(θ), i = 1, 2.

Pitman has called the quantity Rᵢₙ^(1/(mδ))(θ₀) the efficacy of the ith test in testing the hypothesis H₀: θ = θ₀. Thus we get

PITMAN'S THEOREM. The asymptotic relative efficiency of two tests satisfying A, B, C, and D or D′, with δ₁ = δ₂ and m₁ = m₂, is given by the limit of the ratio of the efficacies of the two tests.

For m = 1 and δ = ½, this theorem reduces to the one quoted in [1]; in that case

(4) E₂₁ = limₙ→∞ R₂ₙ²(θ₀) / R₁ₙ²(θ₀).

If, in addition,

(5) limₙ→∞ ψ₂ₙ′(θ₀) / ψ₁ₙ′(θ₀) = 1,

then (4) reduces to E₂₁ = limₙ→∞ σ₁ₙ² / σ₂ₙ². This is the usual expression for measuring the asymptotic relative efficiency of

two estimates of the same parameter. Thus, only if (5) is satisfied can we use the ratio of the variances of the two test statistics as a measure of the asymptotic relative efficiency of the two tests. In particular, if T₁ₙ and T₂ₙ are unbiased

Page 80: Asymptotic Relative Efficiency

THEOREM OF PITMAN 67

estimates of the parameter θ, (5) is satisfied with m = 1, and E₂₁ is the same as the asymptotic relative efficiency of T₂ₙ and T₁ₙ used as estimators of θ.
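As a small illustration of this last point (our own, not Noether's; Python with NumPy assumed): for N(θ, 1) data the sample mean and the sample median are both asymptotically unbiased estimators of θ, and the ratio of their variances, which equals E₂₁ here, approaches 2/π.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 400, 20_000
x = rng.normal(0.0, 1.0, size=(reps, n))
var_mean = x.mean(axis=1).var()          # variance of the sample mean over replications
var_median = np.median(x, axis=1).var()  # variance of the sample median
print(var_mean / var_median, 2 / np.pi)  # both roughly 0.64
```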

4. Comparison with another definition of relative efficiency. Another definition of the relative efficiency of two tests in current use (see, e.g., [2] p. 597) is based on the ratio of the respective sample sizes under the assumption that the power functions of the two tests have equal slope at θ = θ₀.

We shall show that if m = 1 and δ = ½, the two definitions give the same value for the asymptotic relative efficiency, provided a very general condition is satisfied.

Under the conditions of Pitman's theorem we have, as before, Lₙ(θ) ≈ φ(tₙ), where

tₙ = [λ_α σₙ(θ₀) + ψₙ(θ₀) − ψₙ(θ)] / σₙ(θ).

If Lₙ′(θ) converges uniformly to some limit, this limit must be dφ(tₙ)/dθ. Actually, the exact form of Lₙ′(θ) will rarely be known, so that the uniform convergence cannot be investigated. However, even in this case it is customary to replace Lₙ′(θ) by dφ(tₙ)/dθ in computing the ratio of the slopes of the two power functions. Thus it seems reasonable to compare the asymptotic relative efficiency based on the slopes dφ(tₙ)/dθ with E₂₁. Now

dφ(tₙ)/dθ |_(θ=θ₀) = (1/√(2π)) exp(−λ_α²/2) [ψₙ′(θ₀)/σₙ(θ₀) + λ_α σₙ′(θ₀)/σₙ(θ₀)] ≈ (1/√(2π)) exp(−λ_α²/2) c √n,

provided σₙ′(θ₀)/σₙ(θ₀) = o(√n), which is very generally true. The requirement that the two power functions have equal slope at the point θ = θ₀ becomes c₁√n₁ = c₂√n₂, so that the asymptotic relative efficiency according to this definition is again given by

(n₁/n₂) = (c₂/c₁)² = E₂₁.

5. Efficacy of a test. Still assuming that m = 1 and δ = ½, it is interesting to investigate more closely the efficacy Rₙ², where Rₙ is given by (3). Consider the function τ = ψₙ(θ) determined by E Tₙ = ψₙ(θ). Unless ψₙ(θ) ≡ θ, Tₙ is not an unbiased estimate of θ, but may be considered an unbiased estimate of the fictitious parameter τ = ψₙ(θ).

Let u = ψₙ⁻¹(t) denote the inverse of t = ψₙ(u) and define the statistic Uₙ = ψₙ⁻¹(Tₙ). Then we may write

Uₙ − θ = ψₙ⁻¹(Tₙ) − ψₙ⁻¹(τ) = (Tₙ − τ)/ψₙ′(θ) + ··· .

If it is permissible to neglect terms of higher order in Tₙ − τ, we find E Uₙ ≈ θ and

var Uₙ ≈ σₙ²(θ) / [ψₙ′(θ)]² = Rₙ⁻²(θ).

Page 81: Asymptotic Relative Efficiency

68 GOTTFRIED E. NOETHER

Thus, if the above conditions are satisfied, asymptotically the efficacy of Tₙ is the reciprocal of the variance of an asymptotically unbiased estimate² of θ based on Tₙ. Now, under the regularity conditions for the Cramér-Rao inequality,

var Uₙ ≥ 1 / [n E(∂ log f/∂θ)²].

Therefore

Rₙ²(θ) ≤ n E(∂ log f/∂θ)².

Thus we may use the quantity Rₙ²(θ) / [n E(∂ log f/∂θ)²] as a measure of the asymptotic efficiency of the test based on the statistic Tₙ.

² Essentially this same result has also been obtained by Stuart [3]. However, Stuart uses it even in some cases for which m = 1 but δ ≠ ½. That it is then no longer correct can be seen easily from the generalized form of Pitman's theorem given in Section 3 of this paper.
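For instance, for the sign test of H₀: θ = 0 with N(θ, 1) data one has ψₙ(θ) = F(θ) and σₙ²(θ) = F(θ)(1 − F(θ))/n, so Rₙ²(0) = 4n f²(0) = 2n/π, while n E(∂ log f/∂θ)² = n; the efficiency measure is therefore 2/π. A short numerical sketch of this computation (our illustration; Python with SciPy assumed):

```python
import numpy as np
from scipy.stats import norm

n = 1000
f0 = norm.pdf(0.0)
# efficacy-type quantity R_n^2(0) = [psi_n'(0)]^2 / sigma_n^2(0) for the sign test
R2 = f0 ** 2 / (0.25 / n)          # equals 4 n f(0)^2
fisher = n * 1.0                   # n E(d log f / d theta)^2 = n for N(theta, 1)
print(R2 / fisher, 2 / np.pi)      # both equal 2/pi, about 0.6366
```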

REFERENCES

[1] G. E. NOETHER, "Asymptotic properties of the Wald-Wolfowitz test of randomness," Ann. Math. Stat., Vol. 21 (1950), pp. 231-246.

[2] N. BLOMQVIST, "On a measure of dependence between two random variables," Ann. Math. Stat., Vol. 21 (1950), pp. 593-600.

[3] A. STUART, "Asymptotic relative efficiencies of distribution-free tests of randomness against normal alternatives," J. Amer. Stat. Assn., Vol. 49 (1954), pp. 147-157.

[4] J. PUTTER, "The treatment of ties in some nonparametric tests." Ph.D. thesis.

Page 82: Asymptotic Relative Efficiency

The Annals of Statistics 1981, Vol. 9, No. 3 663-669

SOME PROPERTIES OF THE ASYMPTOTIC RELATIVE PITMAN EFFICIENCY

BY GUNTER ROTHE

University of Dortmund A general approach to Pitman efficiency as a limit of the ratio of sample

sizes is presented. The results can be used especially to derive the Pitman efficiency of tests based on asymptotically χ²-distributed statistics with different degrees of freedom.

1. Introduction. The concept of asymptotic relative Pitman efficiency (ARPE) is a useful tool for the comparison of test sequences. However, the available techniques generally allow only the investigation of ARPE of tests based on test statistics which under the hypothesis have the same asymptotic distribution, in particular a normal or a χ²-distribution with the same number of degrees of freedom. Using a very general definition of ARPE, in Sections 2 and 3 we give conditions which can be verified in many applications and under which the ARPE can be calculated.

Section 4 contains the case of asymptotically normal or χ²-distributed test statistics (including the case of different degrees of freedom) as well as some applications.

Throughout the paper, (P_θ, θ ∈ Θ) is a family of probability measures on a space (Ω, 𝒜), where Θ is a topological space. Furthermore, for θ₀ ∈ Θ, {φₙ} is a sequence of level-α tests (α > 0) for H: θ = θ₀ against K: θ ∈ Θ − {θ₀} = Θ′ (say). In order to avoid complications we also assume that for every θ ≠ θ₀

(1.1.a) E_θ(φₙ) ≥ α,

(1.1.b) limₙ→∞ E_θ(φₙ) = 1,

(1.2) C(θ₀) ≠ {θ₀}.

Here, C(θ₀) denotes the connected component of θ₀. Usually, φₙ is a test based on n observations. Now the question arises how many observations are necessary to achieve a given power β ∈ ]α, 1[. Thus for 0 < α < β < 1, we define

DEFINITION 1. A function N: Θ′ → ℕ is called a Pitman efficiency function for β (β-PEF) if

(1.3.a) E_θ(φ_N(θ)) ≥ β,

(1.3.b) E_θ(φ_(N(θ)−1)) < β,

where φ₀ ≡ α. Further, let

(1.4.a) N_β(θ) = inf {n ∈ ℕ : E_θ(φₙ) ≥ β},

(1.4.b) N̄_β(θ) = inf {n ∈ ℕ : E_θ(φ_m) ≥ β for all m ≥ n}.

REMARK. Clearly, N_β resp. N̄_β are the smallest resp. the largest β-PEF. The existence

Received October, 1978; revised January, 1980. A-MS 1970 subject classification. Primary 62F20 62G20; secondary 62E20. Key words and phrases. Asymptotic relative efficiency, Pitman efficiency, method of n rankings.

663

Page 83: Asymptotic Relative Efficiency

664 GUNTER ROTHE

of β-PEF's is guaranteed by (1.1). If {E_θ(φₙ)} is an increasing sequence, the β-PEF is uniquely determined, but this property is frequently difficult to verify.

For our definition of Pitman efficiency, we modify the concept of Wieand (1976), using the notation Π for the set of all sequences {θₙ} satisfying θₙ ∈ Θ′, θₙ → θ₀:

DEFINITION 2. Let {φₙ^(i)}, i = 1, 2, be two sequences of level-α tests with β-PEF's N^(1), N^(2), respectively. Then

(1.5.a) e₁₂⁻ = inf_Π lim infₙ→∞ N^(2)(θₙ) / N^(1)(θₙ),

resp.

(1.5.b) e₁₂⁺ = sup_Π lim supₙ→∞ N^(2)(θₙ) / N^(1)(θₙ),

are the lower (resp. upper) ARPE. If e₁₂⁻ = e₁₂⁺ = e₁₂ (say), then e₁₂ is the ARPE of {φₙ^(1)} w.r.t. {φₙ^(2)}. Simple calculations show that under the conditions A, B, and C given below our definition of ARPE coincides with several somewhat different versions (e.g., those of Noether (1955), Fraser (1957), Olshen (1967), and Wieand (1976)).

2. Limiting behavior of efficiency functions. In this section we assume that the following condition is satisfied:

CONDITION A. There are functions g: Θ → [0, ∞[ and H: [0, ∞[ → [α, 1[ such that

(2.1.a) g is continuous, and g(θ) = 0 iff θ = θ₀,

(2.1.b) H is strictly increasing and bijective,

(2.1.c) for sequences {θₙ} in Θ satisfying g(θₙ)·n → η ≥ 0 as n → ∞, we have limₙ→∞ E_θₙ(φₙ) = H(η).

REMARKS. 1. By (2.1.b), H is continuous, H(0) = α and lim_(t→∞) H(t) = 1.
2. By (2.1.a) and (1.2), there is a b > 0 such that [0, b] ⊂ {g(θ), θ ∈ Θ}.
3. Although Condition A is satisfied in many cases, its verification can become very tedious; generally uniformity or contiguity arguments are needed (cf. Section 4). An easy consequence is

LEMMA 1. For kₙ ∈ ℕ, kₙ → ∞, g(θₙ)·kₙ → η, we have E_θₙ(φ_kₙ) → H(η).

PROOF. (a) If {kₙ} is strictly increasing, there is a sequence {θ*_m} such that for m > η/b (by Remark 2 above)

(2.2.a) θ*_m = θₙ, if m = kₙ,

(2.2.b) m·g(θ*_m) = η, otherwise.

Then g(θ*_m)·m → η and E_θₙ(φ_kₙ) is a subsequence of E_θ*_m(φ_m), which tends to H(η) by Condition A.

(b) If {kₙ} is not strictly increasing, each subsequence contains an increasing subsequence; hence each subsequence of {E_θₙ(φ_kₙ)} contains a subsequence with limit H(η) by part (a) of the proof, and the result follows. □

The idea of the concept is to show that under simple conditions, for every sequence

The idea of the concept is to show that under simple conditions, for every sequence

Page 84: Asymptotic Relative Efficiency

ASYMPTOTIC RELATIVE PITMAN EFFICIENCY 665

{θₙ} ∈ Π, kₙ = N(θₙ) satisfies the conditions of the lemma with η = H⁻¹(β). The conditions on {θₙ} we require are given by

DEFINITION 3. {θₙ} ∈ Π is called an essential sequence (ES) for the β-PEF N if

(2.3.a) N(θₙ) → ∞,

(2.3.b) lim supₙ→∞ g(θₙ)·N(θₙ) < ∞.

Then we have

THEOREM 1. Let {θₙ} ∈ Π be an ES for the β-PEF N. Then

(2.4) limₙ→∞ g(θₙ)·N(θₙ) = H⁻¹(β).

PROOF. (a) Let {θₙ*} be a subsequence of {θₙ} such that g(θₙ*)N(θₙ*) → η* (say). Then, by (2.3.a), N(θₙ*) → ∞ and consequently, by Lemma 1, β ≤ E_θₙ*(φ_N(θₙ*)) → H(η*) as well as β > E_θₙ*(φ_(N(θₙ*)−1)) → H(η*), since g(θₙ*)(N(θₙ*) − 1) → η*. Hence β = H(η*) and η* = H⁻¹(β).

(b) By (2.1.b) each subsequence of {g(θₙ)N(θₙ)} contains a convergent subsequence that must have the limit H⁻¹(β) by part (a) of the proof. Thus the assertion follows. □

3. Essential sequences. Considering the definition of ARPE, it is useful to find conditions under which a sequence {θₙ} ∈ Π is essential for N_β and N̄_β.

The goal of this section is to show that for every β ∈ ]α, 1[ each sequence of Π is an ES for N_β as well as for N̄_β (and hence for all PEF's), if the following two conditions are satisfied:

CONDITION B. For every n ∈ ℕ, the function βₙ: θ ↦ E_θ(φₙ) is continuous at θ = θ₀.

CONDITION C. For every sequence {θₙ} ∈ Π such that g(θₙ)·n → ∞, we have E_θₙ(φₙ) → 1.

REMARK. Note that C is an extension of A to the case η = ∞. A generalization similar to Lemma 1 is possible and will be used in the proof of Theorem 2.

For the moment, however, we only assume A to be true. Then we have

LEMMA 2. For every β ∈ ]α, 1[ and every sequence {θₙ} ∈ Π, (a) (2.3.a) holds for N = N_β; (b) (2.3.b) holds for N = N̄_β.

PROOF. (a) For β ∈ ]α, 1[ and {θₙ} ∈ Π, define d = (β + α)/2, η = H⁻¹(d) and kₙ = [η/g(θₙ)] (where [x] = sup{z ∈ ℤ : z ≤ x}). Then kₙ → ∞, kₙ g(θₙ) → η, and thus E_θₙ(φ_kₙ) → H(η) = d < β. Hence kₙ < N_β(θₙ) for sufficiently large n and the assertion follows.

(b) Define d′ = (β + 1)/2, η′ = H⁻¹(d′), kₙ = [η′/g(θₙ)]. Then by similar arguments kₙ ≥ N̄_β(θₙ) and

lim sup g(θₙ) N̄_β(θₙ) ≤ lim sup g(θₙ) kₙ ≤ η′ < ∞. □

Now we can show

THEOREM 2. Assume Condition A is satisfied. Then
(a) Every sequence {θₙ} ∈ Π is an ES for N_β for all β ∈ ]α, 1[ if and only if Condition

Page 85: Asymptotic Relative Efficiency

666 GUNTER ROTHE

B is satisfied.
(b) Every sequence {θₙ} ∈ Π is an ES for N̄_β for all β ∈ ]α, 1[ if and only if Condition C is satisfied.

As a direct consequence of Theorems 1 and 2, we get

COROLLARY. Under Conditions A, B, and C, for every sequence {θₙ} ∈ Π, every β ∈ ]α, 1[ and every β-PEF N, (2.4) is satisfied.

PROOF OF THEOREM 2. (a) Assume 0 < α < β < 1. Under B, for every m ∈ ℕ there is a δ_m > 0 such that for |θ − θ₀| ≤ δ_m and for all n ≤ m we have E_θ(φₙ) < β. Thus N_β(θ) > m for |θ − θ₀| ≤ δ_m. Hence every sequence {θₙ} ∈ Π satisfies (2.3.a), as well as (2.3.b) by Lemma 2(a). If B does not hold, there exist k ∈ ℕ, β ∈ ]α, 1[ and a sequence {θₙ} ∈ Π such that E_θₙ(φ_k) ≥ β for all n ∈ ℕ. Hence N_β(θₙ) ≤ k and (2.3.a) is not satisfied.

(b) Assume there exist β ∈ ]α, 1[ and {θₙ} ∈ Π such that {θₙ} is not an ES for N̄_β. Then, by Lemma 2(b), w.l.o.g., g(θₙ)N̄_β(θₙ) → ∞ (and consequently g(θₙ)(N̄_β(θₙ) − 1) → ∞) can be assumed. Then, by C and the subsequent remark, E_θₙ(φ_(N̄_β(θₙ)−1)) → 1. But this is a contradiction to E_θₙ(φ_(N̄_β(θₙ)−1)) < β. On the other hand, if C does not hold, there exist a δ > 0 and a sequence {θₙ} ∈ Π such that n·g(θₙ) → ∞ as well as E_θₙ(φₙ) ≤ 1 − δ. Then, for β = 1 − δ/2, N̄_β(θₙ) > n and N̄_β(θₙ)g(θₙ) → ∞, which is a contradiction to (2.3.b). □

Hence we obtain as a general result of the preceding arguments,

THEOREM 3. Let {φₙ^(i)}, i = 1, 2, be level-α test sequences satisfying Conditions A, B, and C with functions gᵢ, Hᵢ, respectively. Further let

(3.1) g₁₂⁻ = inf_Π lim infₙ g₁(θₙ) / g₂(θₙ)

and define g₁₂⁺ similarly (cf. (1.5.b)). Then

(3.2.a) e₁₂⁻(β) = g₁₂⁻ · H₂⁻¹(β) / H₁⁻¹(β),

(3.2.b) e₁₂⁺(β) = g₁₂⁺ · H₂⁻¹(β) / H₁⁻¹(β).

PROOF. For {θₙ} ∈ Π and every β-PEF N, we have

(3.3) N^(2)(θₙ)/N^(1)(θₙ) = [N^(2)(θₙ)g₂(θₙ) / (N^(1)(θₙ)g₁(θₙ))] · [g₁(θₙ)/g₂(θₙ)],

and by Theorem 1 the first factor tends to H₂⁻¹(β)/H₁⁻¹(β). □

REMARK. Clearly, g₁₂⁺ = 1/g₂₁⁻. If g₁₂⁻ = g₁₂⁺, the ARPE exists by Theorem 3, but generally depends on β.

4. Verification of Condition A. By the arguments in the preceding section the calculation of ARPE mainly reduces to the problem of verifying Condition A and hence finding suitable functions g and H. In this section we assume that φₙ is an upper level-α test w.r.t. a test statistic Tₙ, i.e.,

(4.1) φₙ = 1 if Tₙ > tₙ, φₙ = γₙ if Tₙ = tₙ, φₙ = 0 if Tₙ < tₙ,

where tₙ and γₙ are constants such that E_θ₀(φₙ) = α. We shall consider the shape of H⁻¹ if the distribution of Tₙ has one of the following asymptotic properties:

A₀. There is a u > 0 such that g(θₙ)·n → η implies ℒ_θₙ(Tₙ) → N(η^u, 1) for every η ≥ 0; and, for K ∈ ℕ,

Page 86: Asymptotic Relative Efficiency

ASYMPTOTIC RELATIVE PITMAN EFFICIENCY 667

A_K. There is a u > 0 such that g(θₙ)·n → η implies ℒ_θₙ(Tₙ) → χ²(K, η^(2u)), where χ²(K, δ²) is a χ²-distribution with K degrees of freedom and noncentrality parameter δ².

Then we obtain

THEOREM 4. Assume {φₙ} is based on {Tₙ} by (4.1). Then, for 0 < α < β < 1, we have: if Tₙ satisfies A_K for K ≥ 0, Condition A holds with

(4.2) H⁻¹(β) = d^(1/u)(α, β, K).

Here d(α, β, 0) = Φ⁻¹(β) − Φ⁻¹(α), where Φ is the distribution function of the standard normal distribution, and, for K ≥ 1, d² = d²(α, β, K) is the (uniquely determined) noncentrality parameter such that the β-fractile of χ²(K, d²) and the α-fractile of χ²(K, 0) coincide.

PROOF. For K = 0, H(t) = 1 − Φ(Φ⁻¹(1 − α) − t^u). But H(t) = β iff t = (Φ⁻¹(β) − Φ⁻¹(α))^(1/u).

For K ≥ 1, the assertion follows similarly using H(t) = 1 − F_χ²(K,t^(2u))(F_χ²(K,0)⁻¹(1 − α)). Here F_μ denotes the distribution function of the distribution μ. □

REMARKS. 1. Assume that for i = 1, 2, {φₙ^(i)} are based on {Tₙ^(i)} by (4.1) and let Conditions A_Kᵢ, B, C be satisfied with functions gᵢ and constants Kᵢ, uᵢ, respectively. For g₁₂⁻ as in (3.1), we have:

(a) If K₁ = K₂ and u₁ = u₂, then e₁₂ = g₁₂⁻, independent of α and β.
(b) If K₁ ≠ K₂ or u₁ ≠ u₂, we have

(4.3) e₁₂(α, β) = g₁₂⁻ · d^(1/u₂)(α, β, K₂) / d^(1/u₁)(α, β, K₁),

which depends on α and β.

2. For K > 0, d²(α, β, K) has been tabulated by Haynam et al. (1962) (cf. also Harter and Owen (1970)).

3. Often lim_(θ→θ₀) g(θ)/c(θ) = 1, where c(θ) is the Bahadur slope of the test statistic {Tₙ} (cf. Bahadur (1960)). In these cases the Pitman efficiency factors into the product of the local Bahadur efficiency and a function of α and β which reflects only the analytic structure of the test statistic's limiting behavior.

4. Pitman's conditions in the modified version of Olshen (1967) imply our Condition A₀ with u = 1/2, g(θ) = c²(θ − θ₀)². An analogous modification of the extensions due to Noether (1955) resp. Hannan (1956) leads to Condition A₀, resp. A_K, with u = mδ, g(θ) = (j/m!)^(1/(mδ))(θ − θ₀)^(1/δ), where j = c or j = (c′A⁻¹(θ₀)c)^(1/2), in the notation of the respective authors.

5. Often contiguity arguments lead to Condition A. As an illustration, consider the rank statistic Q for the k-sample problem as defined by Hájek and Šidák (1967) in (VI.3.1.2). In Chapter VI they show that suitable assumptions on the underlying model lead to (VI.4.3.2), which in the case nⱼ/n → λⱼ > 0 for 1 ≤ j ≤ k is equivalent to our Condition A_(k−1) with Θ = ℝ^k, u = 1/2 and g(Δ) = ρ² = I(f) Σⱼ λⱼ(Δⱼ − Δ̄)².

We close with two numerical examples:

1. Assume Tₙ = (1/√n) Σᵢ₌₁ⁿ Xᵢ, where the Xᵢ ~ N(θ, 1) are independent. For H: θ = 0 against K: θ > 0 we use {φₙ^(1)} based on {Tₙ} by (4.1). How many observations are "lost" if we use the two-sided test although the problem is one-sided, i.e., if we use {φₙ^(2)} based on {Tₙ²}? {Tₙ} satisfies A₀, {Tₙ²} satisfies A₁, both with g(θ) = θ², u = 1/2; Conditions B and C can be verified easily. Hence the ARPE of the two-sided test with respect to the one-sided test is

d²(α, β, 0)/d²(α, β, 1) = (Φ⁻¹(β) − Φ⁻¹(α))²/d²(α, β, 1).

Some values of this function are given in Table 1.

Page 87: Asymptotic Relative Efficiency

668 GUNTER ROTHE

2. In a recent paper, Schach (1979), using the concept of Bahadur efficiency, compares a test proposed by Anderson (1959) with the method of n rankings using the optimal scores (cf., e.g., Puri and Sen (1971), Section 7). It can be shown that Conditions A_K, B and C, with different K but the same g and u, are satisfied for the two tests. Details are omitted and can be found in Rothe (1978). Hence the ARPE turns out to be

(4.4) e_(Anderson, opt. n-ranking)(α, β) = d²(α, β, p − 1) / d²(α, β, (p − 1)²),

where p is the number of treatments in each block. Some values are given in Table 2; the values of d²(α, β, K) have been taken from Harter and Owen (1970).

TABLE 1
ARPE of two-sided against one-sided Gauss test for one-sided alternatives (rows: β, columns: α).

β\α    0.1    0.05   0.01   0.005  0.001
0.2    0.332  0.519  0.732  0.778  0.842
0.4    0.548  0.665  0.795  0.826  0.871
0.6    0.655  0.736  0.831  0.855  0.890
0.7    0.693  0.762  0.845  0.866  0.897
0.8    0.736  0.788  0.859  0.878  0.906
0.9    0.768  0.815  0.873  0.890  0.914
0.99   0.825  0.858  0.901  0.912  0.930

TABLE 2
ARPE of Anderson test against method of n rankings with optimal scores (rows: α, columns: β).

p = 3
α\β    0.3    0.5    0.7    0.9
0.1    0.727  0.758  0.780  0.812
0.05   0.743  0.772  0.795  0.821
0.01   0.777  0.800  0.819  0.840
0.005  0.790  0.810  0.827  0.847
0.001  0.812  0.829  0.844  0.860

p = 5
α\β    0.3    0.5    0.7    0.9
0.1    0.523  0.560  0.591  0.627
0.05   0.541  0.575  0.604  0.639
0.01   0.576  0.606  0.631  0.661
0.005  0.589  0.617  0.640  0.669
0.001  0.614  0.639  0.660  0.685

p = 7
α\β    0.3    0.5    0.7    0.9
0.1    0.428  0.461  0.489  0.524
0.05   0.443  0.474  0.501  0.535
0.01   0.473  0.501  0.525  0.555
0.005  0.484  0.510  0.533  0.562
0.001  0.506  0.530  0.551  0.578

p = 11
α\β    0.3    0.5    0.7    0.9
0.1    0.330  0.356  0.379  0.408
0.05   0.341  0.366  0.388  0.417
0.01   0.363  0.386  0.406  0.433
0.005  0.371  0.393  0.413  0.439
0.001  0.388  0.408  0.427  0.451
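The noncentrality parameters d²(α, β, K) need not be read from tables; they are easy to compute numerically. The sketch below (our own code, not Rothe's; Python with SciPy assumed) solves for d²(α, β, K) and reproduces entries of Tables 1 and 2.

```python
import numpy as np
from scipy.stats import norm, chi2, ncx2
from scipy.optimize import brentq

def d2(alpha, beta, K):
    """Noncentrality d^2 with P(chi2_K(d^2) > upper-alpha point of chi2_K) = beta."""
    if K == 0:   # one-sided normal case: d = Phi^{-1}(beta) - Phi^{-1}(alpha)
        return (norm.ppf(beta) - norm.ppf(alpha)) ** 2
    crit = chi2.ppf(1 - alpha, K)
    return brentq(lambda nc: ncx2.sf(crit, K, nc) - beta, 1e-8, 1e4)

alpha, beta = 0.05, 0.9
# Table 1: ARPE of the two-sided against the one-sided Gauss test
print(d2(alpha, beta, 0) / d2(alpha, beta, 1))                   # about 0.815
# Table 2, p = 3: formula (4.4) with K = p - 1 and K = (p - 1)^2
p = 3
print(d2(alpha, beta, p - 1) / d2(alpha, beta, (p - 1) ** 2))    # about 0.821
```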

Page 88: Asymptotic Relative Efficiency

ASYMPTOTIC RELATIVE PITMAN EFFICIENCY 669

REFERENCES

[1] ANDERSON, R. L. (1959). Use of contingency tables in the analysis of consumer preference studies. Biometrics 15 582-590.

[2] BAHADUR, R. R. (1960). Stochastic comparison of tests. Ann. Math. Statist. 31 276-295.
[3] FRASER, D. A. S. (1957). Nonparametric Methods in Statistics. Wiley, New York.
[4] HAJEK, J. and SIDAK, Z. (1967). Theory of Rank Tests. Academic, New York.
[5] HANNAN, E. J. (1956). The asymptotic power of certain tests based on multiple correlation. J. Roy. Statist. Soc. B 18 227-233.
[6] HARTER, H. L. and OWEN, D. B. (1970). Selected Tables in Mathematical Statistics. Sponsored by the Institute of Mathematical Statistics. Markham, Chicago.
[7] HAYNAM, G. E., GOVINDARAJULU, Z. and LEONE, F. C. (1962). Unpublished report, Case Institute of Technology.
[8] NOETHER, G. E. (1955). On a theorem by Pitman. Ann. Math. Statist. 26 64-68.
[9] OLSHEN, R. A. (1967). Sign and Wilcoxon tests for linearity. Ann. Math. Statist. 38 1763-1769.

[10] PURI, M. L. and SEN, P. K. (1971). Nonparametric Methods in Multivariate Analysis. Wiley, New York.

[11] ROTHE, G. (1978). Effizienzvergleiche bei n-ranking-Tests. Report 78/9, Abt. Statistik der Universität Dortmund.

[12] SCHACH, S. (1979). An alternative to the Friedman test with certain optimality properties. Ann. Statist. 7 537-550.

[13] WIEAND, H. S. (1976). A condition under which the Pitman and Bahadur approaches to efficiency coincide. Ann. Statist. 4 1003-1011.

ABTEILUNG STATISTIK UNIVERSITAT DORTMUND POSTFACH 50 05 00 4600 DORTMUND 50 WEST GERMANY

Page 109: Asymptotic Relative Efficiency

Chapter 22

Asymptotic Efficiency in Testing

In estimation, an agreed-on basis for comparing two sequences of estimates whose mean squared error each converges to zero as n → ∞ is to compare the variances in their limit distributions. Thus, if √n(θ̂₁,ₙ − θ) →ᴸ N(0, σ₁²(θ)) and √n(θ̂₂,ₙ − θ) →ᴸ N(0, σ₂²(θ)), then the asymptotic relative efficiency (ARE) of θ̂₂,ₙ with respect to θ̂₁,ₙ is defined as σ₁²(θ)/σ₂²(θ).

One can similarly ask what should be a basis for comparison of two sequences of tests based on statistics T₁,ₙ and T₂,ₙ of a hypothesis H₀ : θ ∈ Θ₀. Suppose we use statistics such that large values of them correspond to rejection of H₀; i.e., H₀ is rejected if Tₙ > cₙ. Let α, β denote the type 1 error probability and the power of the test, and let θ denote a specific alternative. Suppose n(α, β, θ, T) is the smallest sample size such that

P_θ(Tₙ ≥ cₙ) ≥ β and P_H₀(Tₙ ≥ cₙ) ≤ α.

Two tests based on T₁,ₙ and T₂,ₙ can be compared through the ratio n(α, β, θ, T₁)/n(α, β, θ, T₂), and T₁,ₙ is preferred if this ratio is ≤ 1. The threshold sample size n(α, β, θ, T) is difficult or impossible to calculate even in the simplest examples. Furthermore, the ratio can depend on particular choices of α, β, and θ. Fortunately, if α → 0, β → 1, or θ → θ₀ (an element of the boundary ∂Θ₀), then the ratio (generally) converges to something that depends on θ alone or is just a constant.

The three respective measures of efficiency correspond to approaches by Bahadur, Hodges and Lehmann, and Pitman; see Pitman (1948), Hodges and Lehmann (1956), and Bahadur (1960). Other efficiency measures, due to Chernoff, Kallenberg, and others, are hybrids of these three approaches. Rubin and Sethuraman (1965) offer measures of asymptotic relative efficiency in testing through the introduction of loss functions and priors in a formal decision-theory setting. Chernoff (1952), Kallenberg (1983), Rubin and Sethuraman (1965), Serfling (1980), and van der Vaart (1998) are excellent

A. DasGupta, Asymptotic Theory of Statistics and Probability, © Springer Science+Business Media, LLC 2008. 347

Page 110: Asymptotic Relative Efficiency

348 22 Asymptotic Efficiency in Testing

references for the technical material in this chapter. For overall expositions

of asymptotic efficiency in testing, see DasGupta (1998), Basu (1956), and

Singh (1984).

Definition 22.1 Let X₁, …, Xₙ be iid observations from a distribution P_θ, θ ∈ Θ. Suppose we want to test H₀ : θ ∈ Θ₀ vs. H₁ : θ ∈ Θ − Θ₀. Let Tₙ = Tₙ(X₁, …, Xₙ) be a sequence of statistics such that we reject H₀ for large values of Tₙ. Precisely, fix 0 < α < 1, 0 < β < 1, θ ∈ Θ − Θ₀. Let cₙ = cₙ(θ, β) be defined by P_θ(Tₙ > cₙ) ≤ β ≤ P_θ(Tₙ ≥ cₙ). The size of the test is defined as αₙ(θ, β) = sup_(θ₀∈Θ₀) P_θ₀(Tₙ ≥ cₙ). Let N_T(α, β, θ) = inf{n : α_m(θ, β) ≤ α for all m ≥ n}.

Thus N_T(α, β, θ) is the smallest sample size beyond which the test based on the sequence Tₙ has power β at the specified alternative θ and size ≤ α. The quantity N_T(α, β, θ) is difficult (and mostly impossible) to calculate for given α, β, and θ. To calculate N_T(α, β, θ), the exact distribution of Tₙ under any fixed θ and for all given n has to be known. There are very few problems where this is the case.

For two given sequences of test statistics T₁ₙ and T₂ₙ, we define e_(T₂,T₁)(α, β, θ) = N_T₁(α, β, θ)/N_T₂(α, β, θ). Let

e_B(β, θ) = lim_(α→0) e_(T₂,T₁)(α, β, θ),

e_HL(α, θ) = lim_(β→1) e_(T₂,T₁)(α, β, θ),

e_P(α, β, θ₀) = lim_(θ→θ₀) e_(T₂,T₁)(α, β, θ), where θ₀ ∈ ∂Θ₀,

assuming the limits exist.

e_B, e_HL, and e_P respectively are called the Bahadur, Hodges-Lehmann, and Pitman efficiencies of the test based on T₂ relative to the test based on T₁.

Typically, e_B(β, θ) depends just on θ, e_HL(α, θ) also depends just on θ, and e_P(α, β, θ₀) depends on neither α nor β. Of these, e_P is the easiest to calculate in most applications, and e_B can be very hard to find. It is interesting that comparisons based on e_B, e_HL, and e_P can lead to different conclusions.

22.1 Pitman Efficiencies

The Pitman efficiency is easily calculated by a fixed recipe under frequently satisfied conditions that we present below. It is also important to note that the Pitman efficiency works out to just the asymptotic efficiency in the point estimation problem, with T₁ₙ and T₂ₙ being considered as the respective estimates. Testing and estimation come together in the Pitman approach. We state two theorems describing the calculation of the Pitman efficiency. The

Page 111: Asymptotic Relative Efficiency

22.1 Pitman Efficiencies 349

second of these is simpler in form and suffices for many applications, but the first one is worth knowing. It addresses more general situations. See Serfling (1980) for further details on both theorems.

Conditions A

(1) For some sequence of functions μₙ(θ), σₙ²(θ), and some δ > 0,

sup_(|θ−θ₀|≤δ) sup_z |P_θ((Tₙ − μₙ(θ))/σₙ(θ) ≤ z) − Φ(z)| → 0

as n → ∞. This is a locally uniform asymptotic normality condition. Usually, μₙ and σₙ can be taken to be the exact mean and standard deviation of Tₙ or the counterparts in the CLT for Tₙ.

(2) μₙ′(θ₀) > 0.

(3) √n σₙ(θ₀)/μₙ′(θ₀) = O(1).

(4) If |θₙ − θ₀| = O(1/√n), then μₙ′(θₙ)/μₙ′(θ₀) → 1.

(5) If |θₙ − θ₀| = O(1/√n), then σₙ(θₙ)/σₙ(θ₀) → 1.

Theorem 22.1 Suppose T₁ₙ and T₂ₙ each satisfy Conditions A. Then

e_P(T₂, T₁) = [ lim_(n→∞) √n σ₁ₙ(θ₀)/μ₁ₙ′(θ₀) / lim_(n→∞) √n σ₂ₙ(θ₀)/μ₂ₙ′(θ₀) ]².

Remark. In many applications, σ₁ₙ, σ₂ₙ are fixed functions σ₁, σ₂ and μ₁ₙ, μ₂ₙ are each the same fixed function μ. In such a case, e_P(T₂, T₁) works out to the ratio σ₁²(θ₀)/σ₂²(θ₀). If σ₁(θ), σ₂(θ) have the interpretation of being the asymptotic variance of T₁ₙ, T₂ₙ, then this will result in e_P(T₂, T₁) being the same as the asymptotic efficiency in the estimation problem.

Conditions B

Let θ₀ ∈ ∂Θ₀. Let −∞ < h < ∞ and θₙ = θ₀ + h/√n.

(1) There exist functions μ(θ), σ(θ) such that, for all h,

√n(Tₙ − μ(θₙ))/σ(θₙ) →ᴸ N(0, 1) under P_θₙ.

(2) μ′(θ₀) > 0.

(3) σ(θ₀) > 0 and σ(θ) is continuous at θ₀.

Page 112: Asymptotic Relative Efficiency

350 22 Asymptotic Efficiency in Testing

Remark. Condition (1) does not follow from pointwise asymptotic normality of Tₙ. Neither is it true that if (1) holds, then with the same choice of μ(θ) and σ(θ), √n(Tₙ − μ(θ))/σ(θ) →ᴸ N(0, 1).

Theorem 22.2 Suppose T₁ₙ and T₂ₙ each satisfy Conditions B. Then

e_P(T₂, T₁) = [σ₁²(θ₀)/σ₂²(θ₀)] · [μ₂′(θ₀)/μ₁′(θ₀)]².

Example 22.1 Suppose X₁, …, Xₙ are iid N(θ, σ²), where σ² > 0 is known. We want to test that the mean θ is zero. Choose the test statistic Tₙ = X̄/s. Let μ(θ) = θ/σ and σ(θ) = 1. Then

√n(Tₙ − μ(θₙ))/σ(θₙ) = √n(X̄/s − θₙ/σ) = √n((X̄ − θₙ + θₙ)/s − θₙ/σ) = √n(X̄ − θₙ)/s + √n θₙ(1/s − 1/σ).

Of these, the second term goes in probability to zero and the first term is asymptotically N(0, 1) under P_θₙ, so (1) is satisfied. But it is actually not true that √n(Tₙ − μ(θ))/σ(θ) = √n(X̄/s − θ/σ) is asymptotically N(0, 1).

We give a few examples illustrating the application of Theorems 22.1

and 22.2.

Example 22.2 Suppose X₁, X₂, …, Xₙ ~ iid F(x − θ), where F is absolutely continuous with density f(x). Suppose we want to test H₀ : θ = 0 against H₁ : θ > 0. We assume F(−x) = 1 − F(x) for any x, f(0) > 0, and f is continuous at 0. For a technical reason pertaining to an application of the Berry-Esseen theorem, we also make the assumption E_F|X|³ < ∞. This is stronger than what we need to assume.

A well-known test for H₀ is the so-called sign test, which uses the test statistic Tₙ = proportion of sample values > 0 and rejects H₀ for large values of Tₙ. We will denote the sign test statistic by S = Sₙ. Thus, if Zᵢ = I_(Xᵢ>0), then S = (1/n)ΣZᵢ = Z̄. We wish to calculate the Pitman efficiency of the sign test with respect to the test that uses X̄ as the test statistic and rejects H₀ for large values of X̄. For this, because Pitman efficiencies are dependent on central limit theorems for the test statistics, we will need to

Page 113: Asymptotic Relative Efficiency

22.1 Pitman Efficiencies 351

assume that σ_F² = Var_F(X) < ∞. We will denote the mean statistic simply as X̄. To calculate e_P(S, X̄), we verify Conditions A in Theorem 22.1.

For Tₙ = S, first notice that E_θ(Z₁) = P_θ(X₁ > 0) = F(θ). Also Var_θ(Z₁) = F(θ)(1 − F(θ)). We choose μₙ(θ) = F(θ) and σₙ²(θ) = F(θ)(1 − F(θ))/n. Therefore, μₙ′(θ) = f(θ) and μₙ′(θ₀) = μₙ′(0) = f(0) > 0. Next, √n σₙ(θ)/μₙ′(θ) = √(F(θ)(1 − F(θ)))/f(θ) implies that √n σₙ(θ₀)/μₙ′(θ₀) = 1/(2f(0)), and so obviously √n σₙ(θ₀)/μₙ′(θ₀) = O(1). If θₙ = θ₀ + hₙ/√n = hₙ/√n, where hₙ = O(1), then μₙ′(θₙ)/μₙ′(θ₀) = f(θₙ)/f(θ₀) = f(hₙ/√n)/f(0) → 1 as f is continuous. It only remains to verify that, for some δ > 0,

sup_(|θ−θ₀|≤δ) sup_(z∈ℝ) |P_θ((Tₙ − μₙ(θ))/σₙ(θ) ≤ z) − Φ(z)| → 0.

Notice now that (S − μₙ(θ))/σₙ(θ) = √n(Z̄ − E_θZ₁)/√(Var_θ(Z₁)), and so, by the Berry-Esseen theorem,

sup_(z∈ℝ) |P((S − μₙ(θ))/σₙ(θ) ≤ z) − Φ(z)| ≤ (c/√n) · E_θ|Z₁ − E_θZ₁|³/(Var_θ(Z₁))^(3/2)

for some absolute constant 0 < c < ∞.

Trivially, E_θ|Z₁ − E_θZ₁|³ = F(θ)(1 − F(θ))[1 − 2F(θ)(1 − F(θ))]. Thus

sup_(z∈ℝ) |P((S − μₙ(θ))/σₙ(θ) ≤ z) − Φ(z)| ≤ (c/√n) · [1 − 2F(θ)(1 − F(θ))]/√(F(θ)(1 − F(θ))).

Clearly, [1 − 2F(θ₀)(1 − F(θ₀))]/√(F(θ₀)(1 − F(θ₀))) = 1 and F is continuous. Thus, for sufficiently small δ > 0, [1 − 2F(θ)(1 − F(θ))]/√(F(θ)(1 − F(θ))) < 2 if |θ − θ₀| ≤ δ. This proves that, for Tₙ = S, Conditions A are satisfied.

For Tₙ = X̄, choose μₙ(θ) = θ and σₙ²(θ) = σ_F²/n. Conditions A are easily verified here, too, with these choices of μₙ(θ) and σₙ(θ).

Therefore, by Theorem 22.1,

e_P(S, X̄) = [ limₙ √n (σ_F/√n) / limₙ √n (√(F(θ₀)(1 − F(θ₀))/n) / f(θ₀)) ]² = 4σ_F² f²(0).

Page 114: Asymptotic Relative Efficiency

352 22 Asymptotic Efficiency in Testing

Notice that e_P(S, X̄) equals the asymptotic relative efficiency of the sample median with respect to X̄ in the estimation problem (see Chapter 7).
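The sample-size interpretation of e_P(S, X̄) can be checked by simulation. The sketch below (our own illustration, not from the text; Python with NumPy/SciPy assumed) finds, for standard double-exponential data where 4σ_F²f²(0) = 2, the approximate sample sizes at which the sign test and the mean test first reach power 0.8 against a small shift; their ratio should be roughly 2, i.e., the mean test needs about twice as many observations.

```python
import numpy as np
from scipy.stats import norm, binom

rng = np.random.default_rng(2)
alpha, theta, reps = 0.05, 0.25, 4000
sigma_F = np.sqrt(2.0)                   # std. deviation of the standard double exponential

def power_sign(n):
    p = 1 - 0.5 * np.exp(-theta)         # P(X > 0) when X ~ DoubleExp(theta)
    crit = binom.ppf(1 - alpha, n, 0.5)
    return np.mean(rng.binomial(n, p, reps) > crit)

def power_mean(n):
    x = rng.laplace(theta, 1.0, size=(reps, n))
    z = np.sqrt(n) * x.mean(axis=1) / sigma_F
    return np.mean(z > norm.ppf(1 - alpha))

def sample_size(power_fn, target=0.8):
    n = 10
    while power_fn(n) < target:
        n += 10
    return n

n_sign, n_mean = sample_size(power_sign), sample_size(power_mean)
print(n_sign, n_mean, round(n_mean / n_sign, 2))   # ratio roughly 2
```

The discrete, conservative sign-test cutoff and Monte Carlo noise make the observed ratio only approximately 2 at these sample sizes.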

Example 22.3 Again let X₁, X₂, …, Xₙ ~ iid F(x − θ), where F(x) = 1 − F(−x) for all x, has a density f, and f is positive and continuous at 0. We want to test H₀ : θ = θ₀ = 0 against H₁ : θ > 0. We assume σ_F² = Var_F(X) < ∞. We will compare the t-test with the test based on X̄ in this example by calculating e_P(t, X̄). Recall that the t-test rejects H₀ for large values of √n X̄/s.

We verify in this example Conditions B for both X̄/s and X̄. Let T₂ₙ = X̄/s. Choose μ(θ) = θ/σ_F and σ(θ) = 1. Therefore μ′(θ₀) = 1/σ_F > 0, and σ(θ) is obviously continuous at θ = θ₀. By the CLT and Slutsky's theorem, one can verify that

√n(T₂ₙ − μ(θₙ))/σ(θₙ) →ᴸ N(0, 1) under P_θₙ.

For T₁ₙ = X̄, it is easily proved that Conditions B hold with μ(θ) = θ and σ(θ) = σ_F. Therefore, by Theorem 22.2,

e_P(t, X̄) = [σ₁²(θ₀)/σ₂²(θ₀)] · [μ₂′(θ₀)/μ₁′(θ₀)]² = (σ_F²/1) · [(1/σ_F)/1]² = 1.

This says that in the Pitman approach there is asymptotically no loss in estimating σ_F by s even though σ_F is considered to be known; the t-test has efficiency 1 with respect to the test based on the mean, and this is true for all F as defined in this example. We shall later see that this is not true in the Bahadur approach.

Another reputable test statistic in the symmetric location-parameter problem is W = (1/(ⁿ₂)) ΣΣ_(i≠j) I_(Xᵢ+Xⱼ>0). The test that rejects H₀ for large values of W is called the Wilcoxon test. By verifying Conditions B, we can show that e_P(W, X̄) = 12σ_F²[∫f²(x)dx]². It turns out that W has remarkably good Pitman efficiencies with respect to X̄ and is generally preferred to the sign test (see Chapter 24).

The following bounds are worth mentioning.

The following bounds are worth mentioning.

Proposition 22.1 (1) infF eP(S, X ) = 13, where the infimum is over all F

that are symmetric, absolutely continuous, and unimodal, is symmetric,

absolutely continuous, and unimodal.

(2) inf{F :F is symmetric} eP(W, X ) = 108125

.

Page 115: Asymptotic Relative Efficiency

22.2 Bahadur Slopes and Bahadur Efficiency 353

Remark. Of course, as long as F is such that eP(t, X ) = 1, the results above

can be stated in terms of eP(S, t) and eP(W, t) as well.

Example 22.4 We provide a table of Pitman efficiencies e_P(S, X̄) and e_P(W, X̄) for some specific choices of F. The values are found from direct applications of the formulas given above.

f                          e_P(S, X̄)   e_P(W, X̄)
(1/√(2π)) e^(−x²/2)        2/π          3/π
(1/2) e^(−|x|)             2            3/2
(1/2) I_(−1≤x≤1)           1/3          1

Remark. The table reveals that as F gets thicker tailed, the test based on X̄ becomes less desirable.
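The entries of this table follow directly from e_P(S, X̄) = 4σ_F²f²(0) and e_P(W, X̄) = 12σ_F²[∫f²]². A short numerical sketch that reproduces them (our code; Python with NumPy/SciPy assumed):

```python
import numpy as np
from scipy.integrate import quad

densities = {
    "normal":  (lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi), 1.0),
    "laplace": (lambda x: 0.5 * np.exp(-abs(x)), 2.0),
    "uniform": (lambda x: 0.5 if -1 <= x <= 1 else 0.0, 1.0 / 3.0),
}
for name, (f, var) in densities.items():
    int_f2, _ = quad(lambda x: f(x) ** 2, -20, 20)
    eS = 4 * var * f(0.0) ** 2        # Pitman efficiency of the sign test vs the mean
    eW = 12 * var * int_f2 ** 2       # Pitman efficiency of the Wilcoxon test vs the mean
    print(f"{name:8s}  e_P(S, mean) = {eS:.3f}   e_P(W, mean) = {eW:.3f}")
# normal: 0.637 and 0.955;  laplace: 2 and 1.5;  uniform: 0.333 and 1
```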

22.2 Bahadur Slopes and Bahadur Efficiency

The results in the previous section give recipes for explicit calculation of the Pitman efficiency e_P(T₂, T₁) for two sequences of tests T₁ₙ and T₂ₙ. We now describe a general method for calculation of the Bahadur efficiency e_B(T₂, T₁). The recipe will take us into the probabilities of large deviation under the null hypothesis. Large-deviation probabilities under a distribution P₀ are probabilities of the form P_H₀(Tₙ ≥ t) when Tₙ converges in probability to, say, zero. For fixed t > 0, P_H₀(Tₙ ≥ t) typically converges to zero at an exponential rate. Determining this rate exactly is at the heart of calculating Bahadur efficiencies, and except for specified types of statistics Tₙ, calculation of the large-deviation rate is a very difficult mathematical problem. We will discuss more general large-deviation problems in the next chapter. For now, we discuss large-deviation rates for very special types of statistics Tₙ in just the real-valued case. First we describe some notation.

Consider first the case of a simple null hypothesis H₀ : θ = θ₀. Let {Tₙ} be a specified sequence of test statistics such that H₀ is rejected for large values of Tₙ.

Page 116: Asymptotic Relative Efficiency

354 22 Asymptotic Efficiency in Testing

Define

Iₙ(t) = −(2/n) log P_θ₀(Tₙ > t);

Lₙ = P_θ₀(Tₙ > tₙ), where tₙ is the observed value of Tₙ;

Kₙ = −(2/n) log Lₙ = −(2/n) log P_θ₀(Tₙ > tₙ) = Iₙ(tₙ).

Note that Lₙ is simply the p-value corresponding to the sequence {Tₙ}.

Definition 22.2 Suppose I(t) is a fixed continuous function such that Iₙ(t) → I(t) pointwise and that, for fixed θ, Tₙ →a.s. ψ(θ) for some function ψ(θ). The Bahadur slope of {Tₙ} at θ is defined to be I(ψ(θ)). I(t) is called the rate function of {Tₙ}.

Remark. We will work out the rate function I(t) in many examples. The link of Bahadur efficiencies to the p-value is described in the following elegant theorem.

Theorem 22.3 Let {T₁ₙ}, {T₂ₙ} be two sequences of test statistics for H₀ : θ = θ₀, and suppose H₀ is rejected for large values of Tᵢₙ, i = 1, 2. Suppose Tᵢₙ has Bahadur slope mᵢ(θ) at the alternative θ. Then the Bahadur efficiency is

e_B(T₂, T₁, β, θ) = m₂(θ)/m₁(θ).

Remark. See Serfling (1980) for this theorem. Notice that this theorem says that, provided each sequence {Tᵢₙ} admits a limit I(t) for the associated sequence of functions Iₙ(t) and admits a law of large numbers under θ, the Bahadur efficiency depends only on θ, although according to its definition it could depend on β.

The next questions concern what is special about the quantity I(ψ(θ)) and why it is called a slope. Note that as Iₙ(t) → I(t) pointwise, I(t) is continuous, Tₙ →a.s. ψ(θ), and Iₙ is a sequence of monotone functions, Kₙ = Iₙ(tₙ) →a.s. I(ψ(θ)) under θ. But Kₙ = −(2/n) log Lₙ, where Lₙ is the p-value, if we use {Tₙ} as the sequence of test statistics. If, for a range of successive values of n, the points (n, −2 log Lₙ) are plotted for a simulated sample from P_θ, then the plot would look approximately like a straight line with slope I(ψ(θ)). This is why I(ψ(θ)) is known as a slope, and this is also why I(ψ(θ)) is a special quantity of importance in the Bahadur efficiency theory.

Page 117: Asymptotic Relative Efficiency

22.2 Bahadur Slopes and Bahadur Efficiency 355

To summarize, in order to calculate e_B(T₂, T₁), we need to establish an SLLN for {Tᵢₙ}, i = 1, 2, and we need to identify the function I(t) for each of {Tᵢₙ}, i = 1, 2. Thus, what are involved are laws of large numbers and analysis of large-deviation probabilities under the null. The first task is usually simple. The second task is in general very difficult, although, for specialized types of statistics, methods for calculating I(t) have been obtained. We will come to this issue later.

Example 22.5 Suppose that X₁, X₂, …, Xₙ ~ iid N(θ, 1) and the null hypothesis is H₀ : θ = 0. Suppose the test statistic is Tₙ = X̄. Then the rate function is I(t) = −2 limₙ (1/n) log P_θ₀(Tₙ > t). Under θ₀ = 0, √n X̄ ~ N(0, 1), and so P₀(Tₙ > t) = P₀(√n X̄ > t√n) = 1 − Φ(t√n). For fixed t > 0,

1 − Φ(t√n) ∼ φ(t√n)/(t√n) = (1/√(2π)) e^(−nt²/2) / (t√n),

∴ (1/n) log P₀(Tₙ > t) = −t²/2 + o(1) ⟹ I(t) = t². Also, under a general θ, Tₙ →a.s. ψ(θ) = θ, and so the slope of X̄ at an alternative θ is I(ψ(θ)) = θ². In this case, therefore, we can compute the Bahadur slope directly.
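The convergence Iₙ(t) → t² can also be seen numerically from the exact normal tail (a small sketch of ours; Python with SciPy assumed):

```python
import numpy as np
from scipy.stats import norm

t = 0.5
for n in [10, 100, 1000, 10000]:
    log_tail = norm.logsf(t * np.sqrt(n))   # log P_0(Xbar > t) = log(1 - Phi(t sqrt(n)))
    print(n, -2 * log_tail / n)             # approaches t^2 = 0.25
```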

The following theorem describes how to find the rate function I(t) in general when Tₙ is a sample mean.

Theorem 22.4 (Cramér-Chernoff) Suppose Y₁, …, Yₙ are iid zero-mean random variables with an mgf (moment generating function) M(z) = E(e^(zY₁)) assumed to exist for all z. Let k(z) = log M(z) be the cumulant generating function of Y₁. Then, for fixed t > 0,

limₙ −(2/n) log P(Ȳ > t) = I(t) = −2 inf_(z>0)(k(z) − tz) = 2 sup_(z>0)(tz − k(z)).

Remark. See Serfling (1980) or van der Vaart (1998) for a proof. In a specific application, one has to find the cgf (cumulant generating function) of Y₁ to carry out this agenda. It can be shown by simple analysis from Theorem 22.4 that I(t) is increasing and convex for t ≥ E(Y₁). This is an important mathematical property of the rate function and is useful in proving various results on large deviations (see Chapter 23).
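The Legendre-type supremum 2 sup_(z>0)(tz − k(z)) is easy to evaluate numerically once the cgf is known. A sketch (our illustration; Python with SciPy assumed) for standard normal summands and for symmetric ±1 summands, compared against the closed forms derived in this chapter:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def rate(t, cgf):
    """I(t) = 2 * sup_{z>0} (t z - k(z)), computed by numerical maximization."""
    res = minimize_scalar(lambda z: -(t * z - cgf(z)), bounds=(1e-8, 50.0), method="bounded")
    return -2 * res.fun

def k_normal(z):                 # Y ~ N(0, 1)
    return z ** 2 / 2

def k_sign(z):                   # Y = sgn(X) with P(Y = 1) = P(Y = -1) = 1/2
    return np.log(np.cosh(z))

t = 0.3
print(rate(t, k_normal), t ** 2)                                           # both 0.09
print(rate(t, k_sign),
      -2 * np.log(np.cosh(np.arctanh(t))) + 2 * t * np.arctanh(t))         # both about 0.091
```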

Page 118: Asymptotic Relative Efficiency

356 22 Asymptotic Efficiency in Testing

Where is the function tz − k(z) coming from? Toward this end, note that

P(Ȳ > t) = P((1/n) Σᵢ₌₁ⁿ (Yᵢ − t) > 0) = P(Σᵢ₌₁ⁿ (Yᵢ − t) > 0) = P(e^(z Σᵢ₌₁ⁿ (Yᵢ−t)) > 1) for positive z
         ≤ E(e^(z Σᵢ₌₁ⁿ (Yᵢ−t))) = (e^(−tz) M(z))ⁿ.

This gives

(1/n) log P(Ȳ > t) ≤ log M(z) − tz = k(z) − tz
⟹ lim supₙ (1/n) log P(Ȳ > t) ≤ inf_(z>0)(k(z) − tz).

It takes some work to show that lim infₙ (1/n) log P(Ȳ > t) ≥ inf_(z>0)(k(z) − tz), which gives the Cramér-Chernoff theorem; see, e.g., Serfling (1980).

Let us now use this theorem to compute the Bahadur slope of some test statistics in some selected hypothesis-testing problems.

Example 22.6 Again let X₁, X₂, …, Xₙ ~ iid N(θ, 1), and suppose we test H₀ : θ = 0 using Tₙ = X̄. To find limₙ −(2/n) log P_H₀(Tₙ > t), we use the Cramér-Chernoff theorem by identifying Yᵢ ~ iid N(0, 1) (i.e., the distribution of Yᵢ is that of Xᵢ under H₀), so M(z) = E_H₀(e^(zX₁)) = e^(z²/2) ⟹ k(z) = z²/2, which gives tz − k(z) = tz − z²/2 ⟹ (d/dz)(tz − z²/2) = t − z = 0 at z = t. Therefore, for t > 0, sup_(z>0)(tz − k(z)) = t² − t²/2 = t²/2. This gives I(t) = lim −(2/n) log P_H₀(Tₙ > t) = t² by the Cramér-Chernoff theorem.

Example 22.7 Let X₁, X₂, …, Xₙ ~ iid N(θ, σ²), where σ² is known and assumed to be 1. For testing H₀ : θ = 0, we have seen that the Bahadur slope of X̄ is θ². Let T = Tₙ be the t-statistic X̄/s = X̄ / √((1/(n−1)) Σ(Xᵢ − X̄)²). Previously we saw that e_P(t, X̄) = 1. The basic reason that T and X̄ are equally asymptotically efficient in the Pitman approach is that √n X̄ and √n X̄/s have the same limiting N(0, 1) distribution under H₀. More precisely, for any fixed t,

Page 119: Asymptotic Relative Efficiency

22.2 Bahadur Slopes and Bahadur Efficiency 357

P_H₀(√n X̄ > t) = 1 − Φ(t); i.e., limₙ P_H₀(X̄ > t/√n) = limₙ P_H₀(X̄/s > t/√n) = 1 − Φ(t). But Bahadur slopes are determined by the rate of exponential convergence of P_H₀(X̄ > t) and P_H₀(X̄/s > t). The rates of exponential convergence are different. Thus the t-statistic has a different Bahadur slope from X̄. In fact, the Bahadur slope of the t-statistic in this problem is log(1 + θ²), so that e_B(T, X̄) = log(1 + θ²)/θ². In the rest of this example, we outline the derivation of the slope log(1 + θ²) for the t-statistic.

For simplicity of the requisite algebra, we will use the statistic T = X̄ / √((1/n) Σ(Xᵢ − X̄)²); this change does not affect the Bahadur slope. Now,

P_H₀(T > t) = (1/2) P_H₀(T² > t²) = (1/2) P( z₁² / Σᵢ₌₂ⁿ zᵢ² > t² ),

where z₁, …, zₙ are iid N(0, 1); such a representation is possible because n X̄² is χ₁², Σ(Xᵢ − X̄)² is χ²_(n−1), and the two are independent. Therefore,

P_H₀(T > t) = (1/2) P(z₁² − t² Σᵢ₌₂ⁿ zᵢ² > 0) = (1/2) P(e^(z(z₁² − t² Σᵢ₌₂ⁿ zᵢ²)) > 1)   (z > 0)
            ≤ (1/2) E(e^(z(z₁² − t² Σᵢ₌₂ⁿ zᵢ²)))

⟹ log P_H₀(T > t) ≤ inf_(z>0) {−log 2 + log E(e^(z[z₁² − t² Σᵢ₌₂ⁿ zᵢ²]))}.

By direct calculation, log E(e^(z[z₁² − t² Σᵢ₌₂ⁿ zᵢ²])) = −(1/2) log(1 − 2z) − ((n−1)/2) log(1 + 2t²z), and by elementary calculus, the minimum value of this over z > 0 is −(1/2) log((1 + t²)/(nt²)) − ((n−1)/2) log((n−1)(1 + t²)/n), which implies lim supₙ (1/n) log P_H₀(T > t) ≤ −(1/2) log(1 + t²).

In fact, it is also true that lim infₙ (1/n) log P_H₀(T > t) ≥ −(1/2) log(1 + t²). Together, these give I(t) = limₙ −(2/n) log P_H₀(T > t) = log(1 + t²). At a fixed θ, T →a.s. θ = ψ(θ) (as σ was assumed to be 1), implying that the Bahadur slope of T is I(ψ(θ)) = log(1 + θ²).
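Since both slopes are explicit, the Bahadur efficiency of the t-test relative to X̄ can be tabulated directly (a short Python sketch of ours); it is below 1 for every θ ≠ 0 and decreases as θ grows:

```python
import numpy as np

for theta in [0.1, 0.5, 1.0, 2.0, 4.0]:
    e_B = np.log1p(theta ** 2) / theta ** 2   # slope of the t-statistic divided by slope of Xbar
    print(theta, round(e_B, 4))
# 0.1 -> 0.995, 0.5 -> 0.893, 1.0 -> 0.693, 2.0 -> 0.402, 4.0 -> 0.177
```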

Page 120: Asymptotic Relative Efficiency

358 22 Asymptotic Efficiency in Testing

Example 22.8 This example is intriguing because it brings out unexpected phenomena as regards the relative comparison of tests by using the approach of Bahadur slopes.

Suppose X₁, X₂, …, Xₙ ~ iid F(x − θ), where F(·) is continuous and F(−x) = 1 − F(x) for all x. Suppose we want to test H₀ : θ = 0. A test statistic we have previously discussed is S = Sₙ = (1/n) Σᵢ₌₁ⁿ I_(Xᵢ>0). This is equivalent to the test statistic T = Tₙ = (1/n) Σᵢ₌₁ⁿ sgn(Xᵢ), where sgn(X) = ±1 according to whether X > 0 or X < 0. To apply the Cramér-Chernoff theorem, we need the cumulant generating function (cgf) of Y₁ = sgn(X₁) under H₀. This is k(z) = log E_H₀[e^(zY₁)] = log((e^z + e^(−z))/2) = log cosh(z). Therefore, k(z) − tz = log cosh(z) − tz and (d/dz)(k(z) − tz) = tanh(z) − t = 0 when z = arctanh(t). Therefore, the rate function I(t) by the Cramér-Chernoff theorem is I(t) = −2 log(cosh(arctanh(t))) + 2t·arctanh(t). Furthermore,

Ȳ = (1/n) Σᵢ₌₁ⁿ sgn(Xᵢ) →a.s. E_θ(sgn(X₁)) = P_θ(X₁ > 0) − P_θ(X₁ < 0) = 1 − 2F(−θ) = 2F(θ) − 1 = ψ(θ).

Thus, the slope of the sign test under a given F and a fixed alternative θ is

I(ψ(θ)) = −2 log(cosh(arctanh(2F(θ) − 1))) + 2(2F(θ) − 1) arctanh(2F(θ) − 1).

This general formula can be applied to any specific F. For the CDF of f(x) = (1/2)e^(−|x|), for θ > 0,

F(θ) = ∫_(−∞)^θ (1/2)e^(−|x|) dx = 1/2 + ∫₀^θ (1/2)e^(−x) dx = 1 − (1/2)e^(−θ).

Plugging this into the general formula above, the slope of the sign test for the double exponential case is −2 log(cosh(arctanh(1 − e^(−θ)))) + 2(1 − e^(−θ)) arctanh(1 − e^(−θ)).

We can compare this slope with the slope of competing test statistics. As competitors, we choose X̄ and the sample median Mₙ. We derive here the

Page 121: Asymptotic Relative Efficiency

22.2 Bahadur Slopes and Bahadur Efficiency 359

slope of X̄ for the double exponential case. To calculate the slope of X̄, note that

E_H₀(e^(zX₁)) = ∫_(−∞)^∞ e^(zx) (1/2)e^(−|x|) dx = 1/(1 − z²), |z| < 1.

So, k(z) − tz = log(1/(1 − z²)) − tz, which is minimized at z = (√(t² + 1) − 1)/t (for t > 0). Therefore, by the Cramér-Chernoff theorem, the rate function is I(t) = 2√(1 + t²) + 2 log(2/(1 + √(1 + t²))) − 2. Since X̄ →a.s. θ, the slope of X̄ is 2√(1 + θ²) + 2 log(2/(1 + √(1 + θ²))) − 2. As regards the slope of Mₙ, it cannot be calculated from the Cramér-Chernoff theorem. However, the rate function limₙ −(2/n) log P_H₀(Mₙ > t) can be calculated directly by analyzing binomial CDFs. It turns out that the slope of Mₙ is −log(4pq), where p = 1 − F(θ) and q = 1 − p = F(θ).

Taking the respective ratios, one can compute e_B(S, X̄, θ), e_B(Mₙ, X̄, θ), and e_B(S, Mₙ, θ). For θ close to θ₀ = 0, S and Mₙ are more efficient than X̄; however (and some think it is counterintuitive), X̄ becomes more efficient than S and Mₙ as θ drifts away from θ₀. For example, when θ = 1.5, e_B(M, X̄, θ) < 1. This is surprising because for the double exponential case, Mₙ is the MLE and X̄ is not even asymptotically efficient as a point estimate of θ. Actually, what is even more surprising is that the test based on X̄ is asymptotically optimal in the Bahadur sense as θ → ∞.
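The crossover described here can be seen by evaluating the three slopes on a grid of alternatives (our own Python/NumPy sketch, using the double-exponential formulas given above). For small θ the sign test and the median dominate X̄ in the Bahadur sense; for larger θ the ordering reverses.

```python
import numpy as np

def slope_sign(theta):
    psi = 1 - np.exp(-theta)                  # 2F(theta) - 1 for the double exponential
    return -2 * np.log(np.cosh(np.arctanh(psi))) + 2 * psi * np.arctanh(psi)

def slope_mean(theta):
    s = np.sqrt(1 + theta ** 2)
    return 2 * s + 2 * np.log(2 / (1 + s)) - 2

def slope_median(theta):
    p = 0.5 * np.exp(-theta)                  # p = 1 - F(theta)
    return -np.log(4 * p * (1 - p))

for theta in [0.25, 0.5, 1.0, 1.5, 2.5]:
    print(theta,
          round(slope_sign(theta) / slope_mean(theta), 3),    # e_B(S, Xbar, theta)
          round(slope_median(theta) / slope_mean(theta), 3))  # e_B(M, Xbar, theta)
```

At θ = 0.25 both ratios are well above 1; at θ = 1.5 the median's ratio has already dropped just below 1, and at θ = 2.5 both are clearly below 1.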

The next result states what exactly asymptotic optimality means. See Bahadur (1967), Brown (1971), and Kallenberg (1978) for various asymptotic optimality properties of the LRT. Apparently, Charles Stein never actually published his results on optimality of the LRT, although the results are widely known. Cohen, Kemperman, and Sackrowitz (2000) argue that LRTs have unintuitive properties in certain families of problems. A counter view is presented in Perlman and Wu (1999).

Theorem 22.5 (Stein-Bahadur-Brown) Suppose X₁, …, Xₙ ~ iid P_θ, θ ∈ Θ. Assume Θ is finite, and consider testing H₀ : θ ∈ Θ₀ vs. H₁ : θ ∈ Θ − Θ₀.

(a) For any sequence of test statistics Tₙ, the Bahadur slope m_T(θ) satisfies m_T(θ) ≤ 2 inf_(θ₀∈Θ₀) K(θ, θ₀), where K(θ, θ₀) is the Kullback-Leibler distance between P_θ and P_θ₀.

(b) The LRT (likelihood ratio test) statistic Λₙ satisfies m_Λ(θ) = 2 inf_(θ₀∈Θ₀) K(θ, θ₀).

Page 122: Asymptotic Relative Efficiency

360 22 Asymptotic Efficiency in Testing

Remark. This says that if � is finite, then the LRT is Bahadur optimal at

every fixed alternative θ .

Example 22.9 The Bahadur efficiency approach treats the type 1 and type 2 errors unevenly. In some problems, one may wish to treat the two errors evenly. Such an approach was taken by Chernoff. We illustrate the Chernoff approach by an example. The calculations are harder than in the Bahadur approach, the reason being that one now needs large deviation rates under both H0 and H1. Suppose X1, X2, . . . , Xn iid∼ Bin(1, p) and we wish to test H0 : p = p0 vs. H1 : p = p1, where p1 > p0. Suppose we reject H0 for large values of X̄. The mgf of Xi is q + pe^z. Therefore, the minimum of k(z) − tz = log(q + pe^z) − tz is attained at z satisfying pe^z(1 − t) = qt. Plugging into the Cramér-Chernoff theorem, we get, for t > p0, PH0(X̄ > t) ≈ e^{−nK(t,p0)}, where K(a, b) = a log(a/b) + (1 − a) log((1 − a)/(1 − b)). Analogously, PH1(X̄ < t) ≈ e^{−nK(t,p1)} for t < p1. Thus, for p0 < t < p1, αn(t) = PH0(X̄ > t) and γn(t) = PH1(X̄ < t) satisfy αn(t) ≈ e^{−nK(t,p0)} and γn(t) ≈ e^{−nK(t,p1)}. If we set K(t, p1) = K(t, p0), then αn(t) ≈ γn(t). This is an attempt to treat the two errors evenly. The unique t satisfying K(t, p0) = K(t, p1) is

t = t(p0, p1) = log(q0/q1) / [log(q0/q1) + log(p1/p0)],

where we write qi for 1 − pi. Plugging back, αn(t(p0, p1)) and γn(t(p0, p1)) are each ≈ e^{−nK(t(p0,p1),p0)}. This quantity K(t(p0, p1), p0) = K(t(p0, p1), p1) is called the Chernoff index of the test based on X̄. The function K(t(p0, p1), p0) is a complicated function of p0 and p1. It can be shown easily that it is approximately equal to (p1 − p0)²/(8p0q0) when p1 ≈ p0.
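A short Python sketch (not part of the text) of the computation in this example: it evaluates t(p0, p1), the resulting Chernoff index K(t(p0, p1), p0), and the local approximation (p1 − p0)²/(8p0q0). The pair (p0, p1) = (0.5, 0.6) is an arbitrary choice for the illustration.

```python
import numpy as np

def kl_bernoulli(a, b):
    # K(a, b) = a*log(a/b) + (1-a)*log((1-a)/(1-b))
    return a * np.log(a / b) + (1.0 - a) * np.log((1.0 - a) / (1.0 - b))

def chernoff_t(p0, p1):
    # Unique t in (p0, p1) with K(t, p0) = K(t, p1)
    q0, q1 = 1.0 - p0, 1.0 - p1
    return np.log(q0 / q1) / (np.log(q0 / q1) + np.log(p1 / p0))

if __name__ == "__main__":
    p0, p1 = 0.5, 0.6
    t = chernoff_t(p0, p1)
    index = kl_bernoulli(t, p0)          # equals kl_bernoulli(t, p1) by construction
    approx = (p1 - p0) ** 2 / (8.0 * p0 * (1.0 - p0))
    print(f"t = {t:.4f}, Chernoff index = {index:.5f}, local approximation = {approx:.5f}")
```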

We close with a general theorem on the rate of exponential decrease of a

convex combination of the error probabilities.

Theorem 22.6 (Chernoff) Consider testing a null hypothesis H0 against an alternative H1. Suppose we reject H0 for large values of some statistic Tn = ∑_{i=1}^{n} Yi, where Yi iid∼ P0 under H0 and Yi iid∼ P1 under H1. Let μ0 = EH0(Y1) and μ1 = EH1(Y1). Let 0 < λ < 1 and t > 0. Define αn(t) = PH0(Tn > t) and γn(t) = PH1(Tn ≤ t), θn = inf_{t>0}(λαn(t) + (1 − λ)γn(t)), and log ρ = inf_{μ0 ≤ t ≤ μ1}[max{inf_z(k0(z) − tz), inf_z(k1(z) − tz)}], where ki(z) = log EHi e^{zY_1}. Then

(log θn)/n → log ρ.

Remark. See Serfling (1980) for this theorem; log ρ is called the Chernoff

index of the statistic Tn .

Page 123: Asymptotic Relative Efficiency


22.3 Bahadur Slopes of U-statistics

There is some general theory about Bahadur slopes of U-statistics, based on large-deviation results for U-statistics. See Arcones (1992) for this entire section. This theory is, in principle, useful because a variety of statistics in everyday use are U-statistics. The general theory is, however, hard to implement except approximately, and it may even be more efficient to work out the slopes by direct means in specific cases instead of appealing to this general theory.

Here is a type of result that is known.

Theorem 22.7 Let Un = Un(X1, . . . , Xn) be a U-statistic with kernel h and order r. Let ψ(x) = E[h(X1, . . . , Xr) | X1 = x] − E[h(X1, . . . , Xr)], and assume Un is nondegenerate; i.e., τ² = E[ψ²(X1)] > 0. If |h(X1, . . . , Xr)| ≤ M < ∞, then, for any γn = o(1) and t > E[h(X1, . . . , Xr)], limn (1/n) log P(Un > t + γn) is an analytic function for |t − E(h)| ≤ B for some B > 0 and admits the expansion

limn (1/n) log P(Un > t + γn) = ∑_{j=2}^{∞} cj (t − Eh)^j,

where the cj are appropriate constants, of which c2 = −1/(2r²τ²).

Remark. The assumption of boundedness of the kernel h is somewhat re-

strictive. The difficulty with full implementation of the theorem is that there

is a simple formula for only c2. The coefficient c3 has a known complicated

expression, but none exists for cj for j ≥ 4. The practical use of the theorem is in approximating (1/n) log P(Un > t) by −(t − E(h))²/(2r²τ²) for t ≈ E(h).

Example 22.10 Recall that the sample variance (1/(n − 1)) ∑_{i=1}^{n} (Xi − X̄)² is a U-statistic with the kernel h(X1, X2) = (1/2)(X1 − X2)². Suppose X1, X2, . . . , Xn iid∼ U[0, 1]. This enables us to meet the assumption that the kernel h is uniformly bounded. Then, by trivial calculation, E(h) = 1/12, ψ(X) = X²/2 − X/2 + 1/12, and τ² = 1/720. Therefore, by a straight application of Theorem 22.7,

limn (1/n) log P[(1/(n − 1)) ∑(Xi − X̄)² − 1/12 > c] = −90c²(1 + o(1)) as c → 0.
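The constants in this example are easy to confirm by simulation. A small Python check (not part of the text) estimates E(h) and τ² by Monte Carlo and recovers c2 = −90 from c2 = −1/(2r²τ²):

```python
import numpy as np

# Monte Carlo check of the ingredients of Example 22.10 for X_i ~ U[0, 1]:
# kernel h(x1, x2) = (x1 - x2)^2 / 2 with E(h) = 1/12, projection
# psi(x) = x^2/2 - x/2 + 1/12 with tau^2 = E[psi^2] = 1/720, and
# c2 = -1/(2 * r^2 * tau^2) = -90 for r = 2.
rng = np.random.default_rng(0)
x1, x2 = rng.uniform(size=10**6), rng.uniform(size=10**6)

h = (x1 - x2) ** 2 / 2
psi = x1**2 / 2 - x1 / 2 + 1.0 / 12
print("E(h):   Monte Carlo", h.mean(), "  exact", 1 / 12)
print("tau^2:  Monte Carlo", (psi**2).mean(), "  exact", 1 / 720)
print("c2 = -1/(2*r^2*tau^2) =", -1.0 / (2 * 2**2 * (1 / 720)))   # -90
```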

Remark. An example of a common U-statistic for which Theorem 22.7 is hard to apply (even when the Xi are uniformly bounded) is the Wilcoxon statistic Tn = (1/(n(n − 1))) ∑∑_{i ≠ j} I{Xi + Xj > 0}. Note that Tn is a U-statistic and the

kernel is always bounded. Still, it is difficult to calculate the large-deviation

rate for Tn from Theorem 22.7. It turns out that for certain types of F , the

large-deviation rate for Tn can be worked out directly. This reinforces the

remark we made before that sometimes it is more efficient to attack the

problem directly than to use the general theorem.

Page 124: Asymptotic Relative Efficiency


Remark. The case where τ² = 0 also has some general results on the large-deviation rate of Un. The results say that the rate can be represented in the form ∑_{j=2}^{∞} cj (t − E(h))^{j/2}.

22.4 Exercises

Exercise 22.1 Consider iid observations X1, X2, . . . from N(μ, 1) and consider the test that rejects H0 : μ ≤ 0 for large values of √n X̄. Find an expression for NT(α, β, μ) as defined in the text.

Exercise 22.2 Consider again iid observations X1, X2, . . . from N (μ, 1)

and consider the test that rejects H0 : μ ≤ 0 for large values of the sample

median. For α = .05, β = .95, μ = 1, give a numerical idea of the value

of NT (α, β,μ) as defined in the text.

Exercise 22.3 In the previous exercise, take successively smaller values of

μ → 0 and numerically approximate the limit of the ratio e_{median,X̄}(α, β, μ),

still using α = .05, β = .95. Do you get a limit that is related to the

theoretical Pitman efficiency?

Exercise 22.4 Suppose X1, X2, . . . iid∼ U[0, θ], and consider the statistic Tn = X̄. Do conditions A hold? With what choice of μn, σn?

Exercise 22.5 * For the symmetric location-parameter problem, derive a formula for eP(S, X̄) and eP(W, X̄) when f(x) = c(α) exp[−|x|^α], α > 0, and c(α) is the normalizing constant. Then plot eP(W, X̄) vs. eP(S, X̄) and identify those densities in the family for which (i) W is more efficient than S and (ii) W is more efficient than both S and X̄.

Exercise 22.6 * Consider the family of densities as in the previous exercise,

but take 1 ≤ α ≤ 2. Find the average value of eP(W, X̄) and eP(S, X̄) in this

family by averaging over α.

Exercise 22.7 * Find distributions F for which the bounds of Proposi-

tion 22.1 are attained. Are these distributions unique?

Exercise 22.8 * Find the Bahadur slope of the sample median for iid obser-

vations from the location-parameter double exponential density.

Remark. Note that this has to be done directly, as the Cramér-Chernoff result cannot be used for the median.

Page 125: Asymptotic Relative Efficiency


Exercise 22.9 * For each of the following cases, find the Bahadur slope of the sample mean. Then, simulate a sample, increase n by steps of 10, draw a scatterplot of (n, Kn) values, and eyeball the slope to see if it roughly matches the Bahadur slope (a sketch of this simulation for case (a) follows the list):

(a) Exp(θ), H0 : θ = 1.

(b) Gamma(α, θ), H0 : θ = 1, α known.

(c) U [0, θ], H0 : θ = 1.

(d) N (0, σ 2), H0 : σ = 1.

(e) N (θ, θ); H0 : θ = 1.

(f) Poisson(λ); H0 : λ = 1.

(g) Double Exponential(μ, 1), H0 : μ = 0.

(h) Logistic(μ, 1), H0 : μ = 0.
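A sketch of the simulation recipe for case (a), assuming that Exp(θ) denotes the exponential distribution with mean θ, that H0 : θ = 1 is rejected for large values of the sample mean, and that Kn denotes −2 log(p-value), so that the slope of Kn against n estimates the exact slope; all of these choices are assumptions made for the illustration, not prescriptions from the text.

```python
import numpy as np
from scipy import stats

# Hypothetical illustration of Exercise 22.9(a), under the assumptions stated above.
rng = np.random.default_rng(1)
theta = 2.0                      # true (alternative) value of the mean
ns = np.arange(10, 1010, 10)
K = []
for n in ns:
    x = rng.exponential(scale=theta, size=n)
    # Under H0 the sum of the observations is Gamma(n, 1); exact p-value:
    pval = stats.gamma.sf(x.sum(), a=n)
    K.append(-2.0 * np.log(pval))

empirical_slope = np.polyfit(ns, K, 1)[0]
print("empirical slope of (n, K_n):", round(empirical_slope, 3))
print("Cramer-Chernoff slope 2*(theta - 1 - log(theta)):",
      round(2.0 * (theta - 1.0 - np.log(theta)), 3))
```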

Exercise 22.10 * Show directly that in the general continuous one-parameter

exponential family, the natural sufficient statistic attains the lower bound of

the Bahadur-Brown-Stein theorem stated in the text.

Exercise 22.11 * Compute the exact Chernoff index numerically for the binomial case when the two errors are treated evenly, and compare it with the approximation (p1 − p0)²/(8p0q0).

Exercise 22.12 * By using the general theorem on Bahadur slopes of U-statistics, derive a one-term expansion for limn (1/n) log PF(Un > σ²(F) + c) when Un is the sample variance and F is a general Beta(m, m) distribution with an integer-valued m.

References

Arcones, M. (1992). Large deviations for U-statistics, J. Multivar. Anal., 42(2), 299–301.

Bahadur, R.R. (1960). Stochastic comparison of tests, Ann. Math. Stat., 31, 276–295.

Bahadur, R.R. (1967). Rates of convergence of estimates and test statistics, Ann. Math.

Stat., 38, 303–324.

Basu, D. (1956). On the concept of asymptotic efficiency, Sankhya, 17, 193–196.

Brown, L. (1971). Non-local asymptotic optimality of appropriate likelihood ratio tests,

Ann. Math. Stat., 42, 1206–1240.

Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based

on the sum of observations, Ann. Math. Stat., 23, 493–507.

Cohen, A., Kemperman, J.H.B., and Sackrowitz, H. (2000). Properties of likelihood

inference for order restricted models, J. Multivar. Anal., 72(1), 50–77.

Page 126: Asymptotic Relative Efficiency


DasGupta, A. (1998). Asymptotic Relative Efficiency, in Encyclopedia of Biostatistics,

P. Armitage and T. Colton (eds.), Vol. I, John Wiley, New York.

Hodges, J.L. and Lehmann, E.L. (1956). The efficiency of some nonparametric

competitors of the t test, Ann. Math. Stat., 27, 324–335.

Kallenberg, W.C.M. (1983). Intermediate efficiency: theory and examples, Ann. Stat.,

11(1), 170–182.

Kallenberg, W.C.M. (1978). Asymptotic Optimality of Likelihood Ratio Tests,

Mathematical Centre Tracts, Vol. 77, Mathematisch Centrum, Amsterdam.

Perlman, M.D. and Wu, L. (1999). The emperor’s new tests, Stat. Sci., 14(4), 355–381.

Pitman, E.J.G. (1948). Lecture Notes on Nonparametric Statistical Inference, Columbia

University, New York.

Rubin, H. and Sethuraman, J. (1965). Bayes risk efficiency, Sankhya Ser.A, 27, 347–356.

Serfling, R. (1980). Approximation Theorems of Mathematical Statistics, John Wiley,

New York.

Singh, K. (1984). Asymptotic comparison of tests—a review, in Handbook of Statistics,

P.K. Sen and P.R. Krishnaiah (eds.), Vol. 4, North-Holland, Amsterdam, 173–184.

van der Vaart, A. (1998). Asymptotic Statistics, Cambridge University Press, Cambridge.

Page 127: Asymptotic Relative Efficiency

International Encyclopedia of Statistical Science

Springer-Verlag Berlin Heidelberg 2011

10.1007/978-3-642-04898-2_127

Asymptotic Relative Efficiency of Two Tests

Yakov Nikitin

(1) St. Petersburg University, St. Petersburg, Russia

Making a substantiated choice of the most efficient statistical test among several available to the statistician is regarded as one of the basic problems of statistics. This problem became especially important in the middle of the twentieth century, when computationally simple but "inefficient" rank tests appeared.

Asymptotic relative efficiency (ARE) is a notion that enables a quantitative large-sample comparison of two different tests of the same statistical hypothesis. The notion of the asymptotic efficiency of tests is more complicated than that of the asymptotic efficiency of estimates. Various approaches to this notion were identified only in the late forties and early fifties, hence 20–25 years later than in estimation theory. We now proceed to their description.

Let {Tn} and {Vn} be two sequences of statistics based on n observations and intended for testing the null hypothesis H against the alternative A. We assume that the alternative is characterized by a real parameter θ and for θ = θ0 turns into H. Denote by NT(α, β, θ) the sample size necessary for the sequence {Tn} to attain the power β at level α and the alternative value θ of the parameter. The number NV(α, β, θ) is defined in the same way.

It is natural to prefer the sequence with smaller N. Therefore the relative efficiency of the sequence {Tn} with respect to the sequence {Vn} is specified as the quantity

eT,V(α, β, θ) = NV(α, β, θ)/NT(α, β, θ),

so that it is the reciprocal ratio of the sample sizes NT and NV.

The merits of the relative efficiency as a means of comparing tests are universally


Page 128: Asymptotic Relative Efficiency

acknowledged. Unfortunately, it is extremely difficult to compute NT(α, β, θ) explicitly even for the simplest sequences of statistics {Tn}. At present it is recognized that this difficulty can be avoided by calculating the limiting values of eT,V(α, β, θ) as θ → θ0, as α → 0, and as β → 1, keeping the two other parameters fixed. These limiting values e^P_{T,V}, e^B_{T,V}, and e^{HL}_{T,V} are called, respectively, the Pitman, Bahadur, and Hodges–Lehmann asymptotic relative efficiency (ARE); they were proposed in Pitman (1949), Bahadur (1960), and Hodges and Lehmann (1956), respectively.

From the practical point of view, only close alternatives, high powers, and small levels are of most interest. This gives assurance that knowledge of these ARE types will facilitate the comparison of competing tests and produce well-founded recommendations for applications.

The calculation of the three basic types of efficiency just mentioned is not easy; see the description of the theory and many examples in Serfling (1980), Nikitin (1995), and Van der Vaart (1998). We only mention here that Pitman efficiency is based on the central limit theorem (see Central Limit Theorems) for test statistics. In contrast, Bahadur efficiency requires the large deviation asymptotics of test statistics under the null hypothesis, while Hodges–Lehmann efficiency is connected with large deviation asymptotics under the alternative. Each type of efficiency has its own merits and drawbacks.

Pitman efficiency is the classical notion used most often for the asymptotic comparison of various tests. Under some regularity conditions assuming asymptotic normality of the test statistics under H and A, it is a number, which has gradually been calculated for numerous pairs of tests.

We now quote, as an example, one of the first of Pitman's results, which stimulated the development of nonparametric statistics. Consider the two-sample problem where under the null hypothesis both samples have the same continuous distribution and under the alternative they differ only in location.

Let e^P_{W,t} be the Pitman ARE of the two-sample Wilcoxon rank sum test (see Wilcoxon–Mann–Whitney Test) with respect to the corresponding Student test (see Student's t-Tests). Pitman proved that for Gaussian samples e^P_{W,t} = 3/π ≈ 0.955, which shows that the ARE of the Wilcoxon test in comparison with the Student test (which is optimal in this problem) is unexpectedly high. Later Hodges and Lehmann (1956) proved that

inf_F e^P_{W,t}(F) = 108/125 = 0.864

if one rejects the assumption of normality; moreover, the lower bound is attained at the density

f(x) = 3(5 − x²)/(20√5) for |x| ≤ √5 (and 0 otherwise).

Hence the Wilcoxon rank test can be infinitely better than the parametric Student test, but their ARE never falls below 0.864. See analogous results in Serfling (2010), where the calculation of the ARE of related estimators is discussed.
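The numbers quoted above can be reproduced from the classical expression e^P_{W,t}(F) = 12σF²(∫f²(x)dx)², which is not displayed in the text reproduced here but underlies the quoted values (compare the asymptotic variance of the Hodges–Lehmann estimator in the companion entry on ARE in estimation). A small Python check, a sketch rather than anything from the article:

```python
import numpy as np
from scipy import integrate

def pitman_are_wilcoxon_t(density, variance, support=(-np.inf, np.inf)):
    # e^P_{W,t}(F) = 12 * sigma_F^2 * (integral of f^2)^2
    int_f2, _ = integrate.quad(lambda x: density(x) ** 2, *support)
    return 12.0 * variance * int_f2 ** 2

gauss = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
laplace = lambda x: 0.5 * np.exp(-abs(x))
# Hodges-Lehmann extremal (parabolic) density on [-sqrt(5), sqrt(5)], variance 1
c = np.sqrt(5.0)
parabolic = lambda x: 3.0 * (5.0 - x**2) / (20.0 * c)

print("Normal :", pitman_are_wilcoxon_t(gauss, 1.0))                 # 3/pi ~ 0.955
print("Laplace:", pitman_are_wilcoxon_t(laplace, 2.0))               # 1.5
print("Bound  :", pitman_are_wilcoxon_t(parabolic, 1.0, (-c, c)))    # 108/125 = 0.864
```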


Page 129: Asymptotic Relative Efficiency

Another example is the comparison of independence tests based on Spearman and Pearson

correlation coefficients in bivariate normal samples. Then the value of the Pitman efficiency is 9/π² ≈ 0.912.

In numerical comparisons, the Pitman efficiency appears to be more relevant for moderate sample sizes than the other efficiencies (Groeneboom and Oosterhoff 1981). On the other hand, the Pitman ARE can be insufficient for the comparison of tests. Suppose, for instance, that we have a normally distributed sample with mean θ and variance 1 and we are testing H : θ = 0 against A : θ > 0. Let us compare two significance tests based on the sample mean and the Student ratio t. As the t-test does not use the information that the variance is known, it should be inferior to the optimal test using the sample mean. However, from the point of view of Pitman efficiency, these two tests are equivalent. In contrast, the Bahadur efficiency of the t-test with respect to the sample-mean test is strictly less than 1 for any θ > 0.

If the condition of asymptotic normality fails, considerable difficulties arise when calculating the Pitman ARE, as the latter may not exist at all or may depend on α and β. Usually one then considers the limiting Pitman ARE as α → 0. Wieand (1976) established the correspondence between this kind of ARE and the limiting approximate Bahadur efficiency, which is easy to calculate.

The Bahadur approach to measuring the ARE, proposed in Bahadur (1960, 1967), prescribes fixing the power of the tests and comparing the exponential rate of decrease of their sizes for an increasing number of observations and a fixed alternative. This exponential rate for a sequence of statistics {Tn} is usually proportional to some nonrandom function cT(θ), depending on the alternative parameter θ, which is called the exact slope of the sequence {Tn}. The Bahadur ARE e^B_{V,T}(θ) of two sequences of statistics {Vn} and {Tn} is defined by means of the formula

e^B_{V,T}(θ) = cV(θ)/cT(θ).

It is known that for the calculation of exact slopes it is necessary to determine the large deviation asymptotics of the sequence {Tn} under the null hypothesis. This problem is always nontrivial, and the calculation of Bahadur efficiency heavily depends on advances in large deviation theory; see Dembo and Zeitouni (1998) and Deuschel and Stroock (1989).

It is important to note that there exists an upper bound for exact slopes,

cT(θ) ≤ 2K(θ),

in terms of the Kullback–Leibler information number K(θ), which measures the "statistical distance" between the alternative and the null hypothesis. It is sometimes compared in the literature with the Cramér–Rao inequality in estimation theory. Therefore the absolute (nonrelative) Bahadur efficiency of the sequence {Tn} can be defined as e^B_T(θ) = cT(θ)/2K(θ).

It has been proved that, under some regularity conditions, the likelihood ratio statistic is asymptotically optimal in the Bahadur sense (Bahadur 1967; Van der Vaart 1998, Sect. 16.6; Arcones 2005).


Page 130: Asymptotic Relative Efficiency

Often the exact Bahadur ARE is uncomputable for a general alternative θ, but it is possible to calculate the limit of the Bahadur ARE as θ approaches the null hypothesis. One then speaks of the local Bahadur efficiency.

The indisputable merit of Bahadur efficiency is that it can be calculated for statistics with non-normal asymptotic distributions, such as the Kolmogorov-Smirnov, omega-square, Watson, and many other statistics.

Consider, for instance, a sample with distribution function (df) F, and suppose we are testing the goodness-of-fit hypothesis H0 : F = F0 for some known continuous df F0 against the alternative of location. Well-known distribution-free statistics for this hypothesis are the Kolmogorov statistic Dn and the omega-square (Cramér-von Mises) statistic ωn². Table 1 presents their local absolute efficiencies in the case of six standard underlying distributions.

Asymptotic Relative Efficiency in Testing. Table 1 Some local Bahadur efficiencies

Statistic   Gauss    Logistic   Laplace   Hyperbolic cosine   Cauchy   Gumbel
Dn          0.637    0.750      1         0.811               0.811    0.541
ωn²         0.907    0.987      0.822     1                   0.750    0.731

We see from Table 1 that the integral statistic ωn² is in most cases preferable to the supremum-type statistic Dn. However, in the case of the Laplace distribution the Kolmogorov statistic is locally optimal, and the same happens for the Cramér-von Mises statistic in the case of the hyperbolic cosine distribution. This observation can be explained in the framework of Bahadur local optimality; see Nikitin (1995, Chap. 6).

See also Nikitin (1995) for the calculation of local Bahadur efficiencies for many other statistics.

This type of ARE, proposed in Hodges and Lehmann (1956), is in conformity with the classical Neyman-Pearson approach. In contrast with Bahadur efficiency, one fixes the level of the tests and compares the exponential rate of decrease of their second-kind errors for an increasing number of observations and a fixed alternative. This exponential rate for a sequence of statistics {Tn} is measured by some nonrandom function dT(θ), which is called the Hodges–Lehmann index of the sequence {Tn}. For two such sequences, the Hodges–Lehmann ARE is equal to the ratio of the corresponding indices.

The computation of Hodges–Lehmann indices is difficult, as it requires large deviation asymptotics of test statistics under the alternative.

There exists an upper bound for the Hodges–Lehmann indices analogous to the upper bound for


Page 131: Asymptotic Relative Efficiency

Bahadur exact slopes. As in the Bahadur theory, a sequence of statistics {Tn} is said to be asymptotically optimal in the Hodges–Lehmann sense if this upper bound is attained.

The drawback of Hodges–Lehmann efficiency is that most two-sided tests, like the Kolmogorov and Cramér-von Mises tests, are asymptotically optimal, and hence this kind of efficiency cannot discriminate between them. On the other hand, under some regularity conditions, one-sided tests like linear rank tests can be compared on the basis of their indices, and their Hodges–Lehmann efficiency coincides locally with Bahadur efficiency; see details in Nikitin (1995).

In addition to the three "basic" approaches to the ARE calculation described above, intermediate approaches are also possible, in which the passage to the limit occurs simultaneously for two parameters in a controlled way. Thus emerged the Chernoff ARE, introduced by Chernoff (1952), see also Kallenberg (1982); the intermediate, or Kallenberg, ARE, introduced by Kallenberg (1983); and the Borovkov–Mogulskii ARE, proposed in Borovkov and Mogulskii (1993).

The large deviation approach to the asymptotic efficiency of tests has been applied in recent years to more general problems. For instance, the change-point, "signal plus white noise," and regression problems were treated in Puhalskii and Spokoiny (1998), tests for the spectral density of a stationary process were discussed in Kakizawa (2005), Taniguchi (2001) deals with time series problems, and Otsu (2010) studies empirical likelihood for testing moment condition models.

Professor Nikitin is Chair of Probability and Statistics of St. Petersburg University. He is an Associate Editor of Statistics and Probability Letters, and a member of the editorial boards of Mathematical Methods of Statistics and Metron. He is a Fellow of the Institute of Mathematical Statistics. Professor Nikitin is the author of the text Asymptotic Efficiency of Nonparametric Tests, Cambridge University Press, NY, 1995, and has authored more than 100 papers, in many international journals, in the field of asymptotic efficiency of statistical tests, large deviations of test statistics, and nonparametric statistics.

Asymptotic Relative Efficiency in Estimation

Chernoff-Savage Theorem

Nonparametric Statistical Inference

Robust Inference

Arcones M (2005) Bahadur efficiency of the likelihood ratio test. Math Method Stat 14:163–179


Page 132: Asymptotic Relative Efficiency

Bahadur RR (1960) Stochastic comparison of tests. Ann Math Stat 31:276–295

Bahadur RR (1967) Rates of convergence of estimates and test statistics. Ann Math Stat 38:303–324

Borovkov A, Mogulskii A (1993) Large deviations and testing of statistical hypotheses. Siberian Adv Math 2(3, 4); 3(1,2)

Chernoff H (1952) A measure of asymptotic efficiency for tests of a hypothesis based on sums of observations. Ann Math Stat 23:493–507

Dembo A, Zeitouni O (1998) Large deviations techniques and applications, 2nd edn. Springer, New York

Deuschel J-D, Stroock D (1989) Large deviations. Academic, Boston

Groeneboom P, Oosterhoff J (1981) Bahadur efficiency and small sample efficiency. Int Stat Rev 49:127–141

Hodges J, Lehmann EL (1956) The efficiency of some nonparametric competitors of the t-test. Ann Math Stat 26:324–335

Kakizawa Y (2005) Bahadur exact slopes of some tests for spectral densities. J Nonparametric Stat 17:745–764

Kallenberg WCM (1983) Intermediate efficiency, theory and examples. Ann Stat 11:170–182

Kallenberg WCM (1982) Chernoff efficiency and deficiency. Ann Stat 10:583–594

Nikitin Y (1995) Asymptotic efficiency of nonparametric tests. Cambridge University Press, Cambridge

Otsu T (2010) On Bahadur efficiency of empirical likelihood. J Econ 157:248–256

Pitman EJG (1949) Lecture notes on nonparametric statistical inference. Columbia University, Mimeographed

Puhalskii A, Spokoiny V (1998) On large-deviation efficiency in statistical inference. Bernoulli 4:203–272


Page 133: Asymptotic Relative Efficiency

Serfling R (1980) Approximation theorems of mathematical statistics. Wiley, New York

Serfling R (2010) Asymptotic relative efficiency in estimation. In: Lovric M (ed) International encyclopedia of statistical sciences. Springer

Taniguchi M (2001) On large deviation asymptotics of some tests in time series. J Stat Plann Inf 97:191–200

Van der Vaart AW (1998) Asymptotic statistics. Cambridge University Press, Cambridge

Wieand HS (1976) A condition under which the Pitman and Bahadur approaches to efficiency coincide. Ann Statist 4:1003–1011


Page 134: Asymptotic Relative Efficiency

International Encyclopedia of Statistical Science

Springer-Verlag Berlin Heidelberg 2011

10.1007/978-3-642-04898-2_126

Asymptotic Relative Efficiency in Estimation

Robert Serfling

(1) University of Texas at Dallas, Richardson, TX, USA

For statistical estimation problems, it is typical and even desirable that several reasonable estimators can arise for consideration. For example, the mean and median parameters of a symmetric distribution coincide, and so the sample mean and the sample median become competing estimators of the point of symmetry. Which is preferred? By what criteria shall we make a choice?

One natural and time-honored approach is simply to compare the sample sizes at which two competing estimators meet a given standard of performance. This depends upon the chosen measure of performance and upon the particular population distribution F.

To make the discussion of sample mean versus sample median more precise, consider a distribution function F with density function f symmetric about an unknown point θ to be estimated. For {X1, …, Xn} a sample from F, put X̄n = n^{−1}(X1 + ⋯ + Xn) and Medn = median{X1, …, Xn}. Each of X̄n and Medn is a consistent estimator of θ in the sense of convergence in probability to θ as the sample size n → ∞. To choose between these estimators we need to use further information about their performance. In this regard, one key aspect is efficiency, which answers: How spread out about θ is the sampling distribution of the estimator? The smaller the variance in its sampling distribution, the more "efficient" is that estimator.

Here we consider "large-sample" sampling distributions. For X̄n, the classical central limit theorem (see Central Limit Theorems) tells us: if F has finite variance σF², then the sampling distribution of X̄n is approximately N(θ, σF²/n), i.e., Normal with mean θ and variance σF²/n. For Medn, a similar classical result (Serfling 1980) tells us: if the density f is continuous and positive at


Page 135: Asymptotic Relative Efficiency

θ, then the sampling distribution of Medn is approximately N(θ, 1/(4[f(θ)]²n)). On this basis, we consider X̄n and Medn to perform equivalently at respective sample sizes n1 and n2 if

σF²/n1 = 1/(4[f(θ)]²n2).

Keeping in mind that these sampling distributions are only approximations assuming that n1 and n2 are "large," we define the asymptotic relative efficiency (ARE) of Med to X̄ as the large-sample limit of the ratio n1/n2, i.e.,

ARE(Med, X̄; F) = lim n1/n2 = 4[f(θ)]²σF².    (1)

For any parameter η of a distribution F, and for estimators η̂^(1) and η̂^(2) approximately N(η, V1(F)/n) and N(η, V2(F)/n), respectively, the ARE of η̂^(1) to η̂^(2) is given by

ARE(η̂^(1), η̂^(2); F) = V2(F)/V1(F).    (2)

Interpretation. If η̂^(1) is used with a sample of size n, the number of observations needed for η̂^(2) to perform equivalently is ARE(η̂^(1), η̂^(2); F) × n.

Extension to the case of a multidimensional parameter. For a parameter η taking values in ℝ^k, and two estimators η̂^(i) which are k-variate Normal with mean η and nonsingular covariance matrices Σi(F)/n, i = 1, 2, we use [see Serfling (1980)]

ARE(η̂^(1), η̂^(2); F) = (|Σ2(F)|/|Σ1(F)|)^{1/k},    (3)

the ratio of generalized variances (determinants of the covariance matrices), raised to the power 1/k.

Let F have density f(x | η) parameterized by η ∈ ℝ and satisfying some differentiability conditions with respect to η. Suppose also that I(F) = E_η[(∂/∂η) log f(X | η)]² (the Fisher information) is positive and finite. Then (Lehmann 1988) it follows that (a) the maximum likelihood estimator


Page 136: Asymptotic Relative Efficiency

of η is approximately N(η, 1/(I(F)n)), and (b) for a wide class of estimators η̂ that are approximately N(η, V(F)/n), a lower bound to V(F) is 1/I(F). In this situation, (2) yields

ARE(η̂, η̂_MLE; F) = 1/[I(F)V(F)] ≤ 1,    (4)

making η̂_MLE (asymptotically) the most efficient among the given class of estimators η̂. We note, however, as will be discussed later, that (4) does not necessarily make η̂_MLE the estimator of choice, when certain other considerations are taken into account.

Let us now discuss in detail the example treated above, with F a distribution with density f symmetric about an unknown point θ and {X1, …, Xn} a sample from F. For estimation of θ, we will consider not only X̄n and Medn but also a third important estimator.

Let us now formally compare X̄n and Medn and see how the ARE differs with the choice of F. Using (1) with F = N(θ, σF²), it is seen that

ARE(Med, X̄; N(θ, σF²)) = 4[1/(2πσF²)]σF² = 2/π ≈ 0.64.

Thus, for sampling from a Normal distribution, the sample mean performs as efficiently as the sample median using only 64% as many observations. (Since θ and σF are location and scale parameters of F, and since the estimators X̄n and Medn are location and scale equivariant, their ARE does not depend upon these parameters.) The superiority of X̄n here is no surprise, since it is the MLE of θ in the model N(θ, σF²).

As noted above, asymptotic relative efficiencies pertain to large-sample comparisons and need not reliably indicate small-sample performance. In particular, for F Normal, the exact relative efficiency of Med to X̄ for sample size n = 5 is a very high 95%, although this decreases quickly, to 80% for n = 10, to 70% for n = 20, and to 64% in the limit.

For sampling from a double exponential (or Laplace) distribution with density f(x) = (λ/2)e^{−λ|x−θ|}, −∞ < x < ∞ (and thus variance 2/λ²), the above result favoring X̄n over Medn is reversed: (1) yields

ARE(Med, X̄; F) = 4(λ/2)²(2/λ²) = 2,


Page 137: Asymptotic Relative Efficiency

so that the sample mean requires 200% as many observations to perform equivalently to the sample median. Again, this is no surprise, because for this model the MLE of θ is Medn.
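A minimal numerical check of these two values, plugging the Normal and Laplace densities into formula (1) (a sketch, not part of the original article; the standard parameterizations shown in the text are assumed):

```python
import numpy as np

def are_median_vs_mean(f_at_theta, variance):
    # Formula (1): ARE(Med, Xbar; F) = 4 * f(theta)^2 * sigma_F^2
    return 4.0 * f_at_theta**2 * variance

# Normal(theta, sigma^2): f(theta) = 1/(sigma*sqrt(2*pi)), variance sigma^2
sigma = 1.0
print("Normal :", are_median_vs_mean(1.0 / (sigma * np.sqrt(2 * np.pi)), sigma**2))  # 2/pi

# Laplace with density (lam/2) exp(-lam|x - theta|): f(theta) = lam/2, variance 2/lam^2
lam = 1.0
print("Laplace:", are_median_vs_mean(lam / 2.0, 2.0 / lam**2))  # 2.0
```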

We see from the above that the ARE depends dramatically upon the shape of the density f and thus must be used cautiously as a benchmark. For Normal versus Laplace, X̄n is either greatly superior or greatly inferior to Medn. This is a rather unsatisfactory situation, since in practice we might not be quite sure whether F is Normal or Laplace or some other type. A very interesting solution to this dilemma is given by an estimator that has excellent overall performance, the so-called Hodges–Lehmann location estimator (Hodges and Lehmann 1963; see Hodges-Lehmann Estimators):

HLn = median{(Xi + Xj)/2, i < j},

the median of all pairwise averages of the sample observations. (Some authors include the cases i = j, some not.) We have (Lehmann 1998a) that HLn is asymptotically N(θ, 1/(12[∫f²(x)dx]²n)), which yields that ARE(HLn, X̄n; N(θ, σF²)) = 3/π = 0.955 and ARE(HLn, X̄n; Laplace) = 1.5.

Also, for the Logistic distribution with density f(x) = σ^{−1}e^{(x−θ)/σ}/[1 + e^{(x−θ)/σ}]², −∞ < x < ∞, for which HLn is asymptotically efficient (and thus optimal), we have ARE(HLn, X̄n; Logistic) = π²/9 = 1.097 [see Lehmann (1998b)]. Further, for ℱ the class of all distributions symmetric about θ and having finite variance, we have inf_{F∈ℱ} ARE(HLn, X̄n; F) = 108/125 = 0.864 [see Lehmann (1998a)]. The estimator HLn is highly competitive with X̄n at Normal distributions, can be infinitely more efficient at some other symmetric distributions F, and is never much less efficient at any distribution F in ℱ.

The computation of HLn appears at first glance to require O(n²) steps, but a much more efficient O(n log n) algorithm is available [see Monohan (1984)].
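A direct O(n²) computation of HLn is a few lines in Python (a sketch, not part of the original article; pairs with i < j are used here, which is one of the conventions mentioned above, and the fast algorithm cited above would replace this for large n):

```python
import numpy as np

def hodges_lehmann(x):
    # Median of all pairwise averages (X_i + X_j)/2 with i < j; O(n^2) time and memory.
    x = np.asarray(x, dtype=float)
    i, j = np.triu_indices(len(x), k=1)
    return np.median((x[i] + x[j]) / 2.0)

rng = np.random.default_rng(2)
sample = rng.standard_normal(501) + 10.0      # symmetric about theta = 10
print("HL estimate  :", hodges_lehmann(sample))
print("sample mean  :", sample.mean())
print("sample median:", np.median(sample))
```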

Although the asymptotically most efficient estimator is given by the MLE, the particular MLE depends upon the shape of F and can be drastically inefficient when the actual F departs even a little bit from the nominal F. For example, if the assumed F is N(μ, 1) but the actual model differs by a small amount ε of "contamination," i.e., F = (1 − ε)N(μ, 1) + εN(μ, σ²), then

ARE(Med, X̄; F) = (2/π)[(1 − ε) + ε/σ]²[(1 − ε) + εσ²],


Page 138: Asymptotic Relative Efficiency

which equals 2/π in the "ideal" case ε = 0 but otherwise → ∞ as σ → ∞. A small perturbation of the assumed model thus can destroy the superiority of the MLE.
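Plugging the contaminated Normal into formula (1) shows the blow-up numerically; the following sketch (not part of the original article, with ε = 0.05 and the σ values chosen arbitrarily) evaluates the displayed expression:

```python
import numpy as np

def are_median_vs_mean_contaminated(eps, sigma):
    # F = (1 - eps) N(mu, 1) + eps N(mu, sigma^2); formula (1) with
    # f(mu) = ((1 - eps) + eps/sigma)/sqrt(2*pi) and Var(F) = (1 - eps) + eps*sigma^2.
    f_mu = ((1.0 - eps) + eps / sigma) / np.sqrt(2.0 * np.pi)
    var = (1.0 - eps) + eps * sigma**2
    return 4.0 * f_mu**2 * var

for s in (1.0, 3.0, 10.0, 100.0):
    print(f"sigma = {s:6.1f}  ARE(Med, Xbar) = {are_median_vs_mean_contaminated(0.05, s):.3f}")
```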

One way around this issue is to take a nonparametric approach and seek an estimator with ARE satisfying a favorable lower bound. Above we saw how the estimator HLn meets this need.

Another criterion by which to evaluate and compare estimators is robustness. Here let us use the finite-sample breakdown point (BP): the minimal fraction of sample points which may be taken to a limit L (e.g., ±∞) without the estimator also being taken to L. A robust estimator remains stable and effective when in fact the sample is only partly from the nominal distribution F and contains some non-F observations which might be relatively extreme contaminants.

A single observation taken to ∞ (with n fixed) takes X̄n with it, so X̄n has BP = 0. Its optimality at Normal distributions comes at the price of a complete sacrifice of robustness. In comparison, Medn has extremely favorable BP = 0.5, but at the price of a considerable loss of efficiency at Normal models.

On the other hand, the estimator HLn appeals broadly, possessing both quite high ARE over a wide class of F and relatively high BP = 1 − 2^{−1/2} ≈ 0.29.

As another example, consider the problem of estimation of scale. Two classical scale estimators are the sample standard deviation sn and the sample MAD (median absolute deviation about the median) MADn. They estimate scale in different ways but can be regarded as competitors in the problem of estimation of σ in the model F = N(μ, σ²), as follows. With both μ and σ unknown, the estimator sn is (essentially) the MLE of σ and is asymptotically most efficient. Also, for this F, the population MAD is equal to Φ^{−1}(3/4)σ, so that the estimator σ̂n = MADn/Φ^{−1}(3/4) = 1.4826 MADn competes with sn for estimation of σ. (Here Φ denotes the standard normal distribution function, and, for any F, F^{−1}(p) denotes the pth quantile, inf{x : F(x) ≥ p}, for 0 < p < 1.) To compare with respect to robustness, we note that a single observation taken to ∞ (with n fixed) takes sn with it, so sn has BP = 0. On the other hand, MADn and thus σ̂n have BP = 0.5, like Medn. However, ARE(σ̂n, sn; N(μ, σ²)) = 0.37, even worse than the ARE of Medn relative to X̄n. Clearly desired is a more balanced trade-off between efficiency and robustness than provided by either of sn and σ̂n. Alternative scale estimators having the same 0.5 BP as σ̂n but much higher ARE of 0.82 relative to sn are developed in Rousseeuw and Croux (1993). Also, further competitors offering a range of trade-offs given by (BP, ARE) = (0.29, 0.86) or (0.13, 0.91) or (0.07, 0.96), for example, are developed in Serfling (2002).
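A small sketch (not part of the original article) comparing sn and the rescaled MAD on Normal data, and showing the breakdown behavior described above when a single gross outlier is appended:

```python
import numpy as np

def mad_scale(x):
    # 1.4826 * median(|x - median(x)|); consistent for sigma at the Normal model.
    x = np.asarray(x, dtype=float)
    return 1.4826 * np.median(np.abs(x - np.median(x)))

rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=2.0, size=2000)
print("sample sd        :", x.std(ddof=1))
print("1.4826 * MAD     :", mad_scale(x))

# One gross outlier ruins s_n but barely moves the MAD-based estimate:
x_bad = np.append(x, 1e6)
print("with outlier, sd :", x_bad.std(ddof=1), "  MAD-based:", mad_scale(x_bad))
```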

In general, efficiency and robustness trade off against each other. Thus ARE should be considered in conjunction with robustness, choosing the balance appropriate to the particular application context. This theme is prominent in the many examples treated in Staudte and Sheather (1990).


Page 139: Asymptotic Relative Efficiency

In view of the asymptotic normal distribution underlying the above formulation of ARE in estimation, we may also characterize the ARE given by (2) as the limiting ratio of sample sizes at which the lengths of associated confidence intervals at approximate level 100(1 − α)%,

η̂^(i) ± Φ^{−1}(1 − α/2)√(Vi(F)/n), i = 1, 2,

converge to 0 at the same rate, when holding fixed the coverage probability 1 − α. (In practice, of course, consistent estimates of Vi(F), i = 1, 2, are used in forming the CI.)

One may alternatively consider confidence intervals of fixed length, in which case (under typical conditions) the noncoverage probability depends on n and tends to 0 at an exponential rate, i.e., −n^{−1} log αn → c > 0, as n → ∞. For fixed width confidence intervals of the form

η̂ ± d,

we thus define the fixed width asymptotic relative efficiency (FWARE) of two estimators as the limiting ratio of sample sizes at which the respective noncoverage probabilities αn^(i), i = 1, 2, of the associated fixed width confidence intervals converge to zero at the same exponential rate. In particular, for Med versus X̄, and letting η = 0 and σF = 1 without loss of generality, we obtain (Serfling and Wackerly 1976)

(5)

where m(−d) is a certain parameter of the moment generating function of F. The FWARE is derived using large deviation theory instead of the central limit theorem. As d → 0, the FWARE converges to the ARE. Indeed, for F a Normal distribution, this convergence (to 2/π = 0.64) is quite rapid: the expression in (5) rounds to 0.60 for d = 2, to 0.63 for d = 1, and to 0.64 for d ≤ 0.1.

For an estimator η̂ which is asymptotically k-variate Normal with mean η and covariance matrix Σ/n, as the sample size n → ∞, we may form (see Serfling 1980) an associated ellipsoidal confidence region of approximate level 100(1 − α)% for the parameter η,

En,α = {η : n(η̂ − η)′Σ^{−1}(η̂ − η) ≤ cα},

with P(χk² > cα) = α and in practice using a consistent estimate of Σ. The volume of the region En,α


Page 140: Asymptotic Relative Efficiency

is

Volume(En,α) = [π^{k/2}/Γ(k/2 + 1)] (cα/n)^{k/2} |Σ|^{1/2}.

Therefore, for two such estimators η̂^(i), i = 1, 2, the ARE given by (3) may be characterized as the limiting ratio of sample sizes at which the volumes of associated ellipsoidal confidence regions at approximate level 100(1 − α)% converge to 0 at the same rate, when holding fixed the coverage probability 1 − α.

Under regularity conditions on the model, the maximum likelihood estimator η̂_MLE has a confidence ellipsoid En,α attaining the smallest possible volume and, moreover, lying wholly within that for any other estimator η̂.

Parallel to ARE in estimation as developed here is the notion of Pitman ARE for the comparison of two hypothesis test procedures. Based on a different formulation, although the central limit theorem is used in common, the Pitman ARE agrees with (2) when the estimator and the hypothesis test statistic are linked, as for example X̄n paired with the t-test, or Medn paired with the sign test, or HLn paired with the Wilcoxon signed-rank test. See Lehmann (1998b), Nikitin (1995), Nikitin (2010), and Serfling (1980).

As illustrated above with FWARE, several other important approaches to ARE have been developed, typically using either moderate or large deviation theory. For example, instead of asymptotic variance parameters as the criterion, one may compare the probability concentrations of the estimators in an ε-neighborhood of the target parameter η, via the miss probabilities P(|η̂^(i) − η| > ε), i = 1, 2. When

n^{−1} log P(|η̂^(i) − η| > ε) → −γ^(i)(ε, η), i = 1, 2,

as is typical, then the ratio of sample sizes n1/n2 at which these concentration probabilities converge to 0 at the same rate is given by γ^(1)(ε, η)/γ^(2)(ε, η), which then represents another ARE measure for the efficiency of one estimator relative to the other. See Serfling (1980, 1.15.4) for discussion and Basu (1956) for illustration that the variance-based and concentration-based measures need not agree on which estimator is better. For general treatments, see Nikitin (1995), Puhalskii and Spokoiny (1998), Nikitin (2010), and Serfling (1980, Chap. 10), as well as the other references cited below. A comprehensive bibliography is beyond the present scope. However, very productive is ad hoc exploration of the literature using a modern search engine.


Page 141: Asymptotic Relative Efficiency

Support by NSF Grant DMS-0805786 and NSA Grant H98230-08-1-0106 is gratefully acknowledged.

Robert Serfling is author of the classic textbook Approximation Theorems of Mathematical Statistics, Wiley, 1980, and has published extensively in statistical journals. He received a Humboldt-Preis, awarded by the Alexander von Humboldt Stiftung, Germany, "in recognition of accomplishments in research and teaching" (1985). He is a Fellow of the American Statistical Association and of the Institute of Mathematical Statistics, and an Elected Member of the International Statistical Institute. Professor Serfling was Editor of the IMS Lecture Notes Monograph Series (1988–1993) and currently is an Associate Editor for Journal of Multivariate Analysis (2007–) and for Journal of Nonparametric Statistics (2007–).

Asymptotic Relative Efficiency in Testing

Estimation

Estimation: An Overview

Mean Median and Mode

Normality Tests

Properties of Estimators

Statistical Fallacies: Misconceptions, and Myths

Basu D (1956) On the concept of asymptotic relative efficiency. Sankhyā 17:193–196

Hodges JL, Lehmann EL (1963) Estimates of location based on rank tests. Ann Math Stat 34:598–611

Lehmann EL (1998a) Elements of large-sample theory. Springer, New York

Lehmann EL (1998b) Nonparametrics: statistical methods based on ranks. Prentice-Hall, Upper Saddle River, NJ


Page 142: Asymptotic Relative Efficiency

Lehmann EL, Casella G (1988) Theory of point estimation, 2nd edn. Springer, New York

Monohan JF (1984) Algorithm 616: fast computation of the Hodges–Lehmann location estimator. ACM T Math Software 10:265–270

Nikitin Y (1995) Asymptotic efficiency of nonparametric tests. Cambridge University Press, Cambridge

Nikitin Y (2010) Asymptotic relative efficiency in testing. International Encyclopedia of Statistical Sciences. Springer, New York

Puhalskii A, Spokoiny V (1998) On large-deviation efficiency in statistical inference. Bernoulli 4:203–272

Rousseeuw PJ, Croux C (1993) Alternatives to the median absolute deviation. J Am Stat Assoc 88:1273–1283

Serfling R (1980) Approximation Theorems of Mathematical Statistics. Wiley, New York

Serfling R (2002) Efficient and robust fitting of lognormal distributions. N Am Actuarial J 4:95–109

Serfling R, Wackerly DD (1976) Asymptotic theory of sequential fixed-width confidence interval procedures. J Am Stat Assoc 71:949–955

Staudte RG, Sheather SJ (1990) Robust estimation and testing. Wiley, New York
