Qualitative Robustness of Tests

Author(s): Diane Lambert. Source: Journal of the American Statistical Association, Vol. 77, No. 378 (Jun., 1982), pp. 352-357. Published by: American Statistical Association. Stable URL: http://www.jstor.org/stable/2287252

Qualitative Robustness of Tests DIANE LAMBERT*

A test is qualitatively robust by definition if its sequence of $-n^{-1}\log$ transformed P values, $n$ being a measure of the sample size, is continuous as a point function of the observations and weakly equicontinuous as a function of discrete probability measures. This definition is applicable both to unconditional and to conditional tests. Under weak regularity conditions, an unconditional test is qualitatively robust if and only if its test statistic is continuous; a counterexample shows that conditional tests do not share this property. The sample mean, Student's t, and $\bar Y - \bar X$ permutation tests are not qualitatively robust; the sign, Wilcoxon, Huber censored likelihood, and normal scores tests are qualitatively robust.

KEY WORDS: P value; Exact and approximate slope; Prokhorov metric.

1. INTRODUCTION AND SUMMARY

In practice, testing proceeds through the use of a P value. The P value, which is the tail probability of the realized test statistic under a hypothesis H, is chosen to be sensitive to specific departures from H. Small disturbances, such as rounding errors or infrequent inliers or outliers, are assumed to alter the assessment of significance only slightly or not at all. That is, a P value should be "continuous" with respect to the sample distribution. Such a concept of test robustness is developed here by applying Hampel's (1971) work on estimator continuity to P values.

Rieder (1978, 1981) defines test robustness in terms of the behavior of the error probabilities, which are averages over all samples. His theory is mathematically elegant but not directly relevant to the observable quantities that determine the decision for a particular sample. Consequently, his approach is not followed here.

Instead, the development is consistent with that of Ylvisaker (1977): robustness depends on observables rather than on averages. Ylvisaker defines test resistance as one minus the fraction of observations that determine the test decision regardless of the value of the other observations in the sample. The binary accept-reject scale is appealing because of its relevance to testing practice, but it is not sufficiently rich to support a definition of test continuity. The accept-reject scale is here refined by augmenting the decision with the strength of the supporting evidence or, actually and equivalently, by replacing the decision with a P value.

* Diane Lambert is Assistant Professor, Department of Statistics, Carnegie-Mellon University, Pittsburgh, PA 15213. She thanks Professor W.J. Hall for his guidance during the preparation of the doctoral dissertation on which this article is based. The research was supported by the U.S. Army Research Office through the University of Rochester and Office of Naval Research through Carnegie-Mellon University.

The major result of this article is that the use of $-n^{-1}\log$ transformed P values, $n$ being a measure of the sample size(s), conveniently converts the investigation of test continuity into an investigation of estimator continuity. A $-n^{-1}\log$ P value is typically a consistent estimator of the slope (see Bahadur 1971 and Sec. 3) and the problem of estimator continuity has been studied extensively by Hampel (1971). In Section 4 a test is defined to be qualitatively robust if its sequence of $-n^{-1}\log$ P values is equicontinuous both as a function of the observations and as a function of the empirical probability measure.

This definition of a qualitatively robust test is inspired by the work of Hampel (1971), but it is not equivalent to an application of his definition of a qualitatively robust estimator to the $-n^{-1}\log$ P value sequence. Hampel calls an estimator qualitatively robust if its sampling distributions are equicontinuous. That is, simply stated, a small change in the distribution sampled should cause a small change in the distribution of the estimator. Through several theorems Hampel shows that equicontinuity of the sampling distributions is a weaker, but not much weaker, requirement than is equicontinuity of the estimator as a function of the sample and empirical distribution function. In this article, equicontinuity of the $-n^{-1}\log$ P value sequence is preferred over equicontinuity of its sequence of sampling distributions because the latter describes "average" behavior over all the possible samples from a fixed distribution and the former describes the behavior in sequences of particular samples from a fixed distribution.

With the proposed definition, under weak regularity conditions, an unconditional test is qualitatively robust at some distribution $P$ if and only if its sequence of test statistics is continuous at $P$ and in the observations a.e. $[P]$. This equivalence of qualitatively robust unconditional tests and continuous test statistics is intuitively plausible since unconditional tests depend on the data through the test statistic alone. However, the class of qualitatively robust tests would be wider without a sample-size-dependent transformation, such as $-n^{-1}\log$, of the P value.

Hampel's work on continuous estimators is reviewed in Section 2. The pertinent facts about slopes, as developed in Bahadur (1971), are reviewed in Section 3. Qualitatively robust tests are defined in Section 4. Several one- and two-sample tests for location are considered from the perspective of qualitative robustness in Section 5. In particular, the normal scores test proves to be qualitatively robust under a wide class of alternative distributions and the two-sample permutation test based on $\bar Y - \bar X$ proves to be not qualitatively robust.

© Journal of the American Statistical Association, June 1982, Volume 77, Number 378, Theory and Methods Section

2. ESTIMATOR CONTINUITY

Hampel (1971) calls an estimator continuous if a slight disturbance of the sampled distribution affects the estimator only slightly and the effect does not intensify as the sample size $n$ increases. He formalizes this notion of equicontinuity in the following setting.

Let $\mathcal{P}$ be the set of all probability measures on a measurable space $(\mathcal{X}, \mathcal{A})$ with $\mathcal{A}$ the Borel subsets of $\mathcal{X}$. Let $\mathcal{P}_n$ be the subset of $\mathcal{P}$ containing the discrete measures that assign mass in multiples of $1/n$. Given a random sample $\mathbf{X}_n = (X_1, \ldots, X_n)$ from $P$, an estimator $T_n$ can and will be considered either as a point function in $\mathbf{X}_n$ or as a functional on $\mathcal{P}_n$. For an example, define the empirical probability measure $P_n = n^{-1}\sum_{i=1}^{n}\delta_{X_i}$, where $\delta_x$ is the probability measure supported on $x$, and take the sample mean $\bar X = \mu(P_n) = \int x\,P_n(dx)$. The continuity of an estimator may be examined with respect to either a metric on the space of random samples or a metric on the space of empirical probability measures $\mathcal{P}_n$.

The Prokhorov distance $\pi$ between probability measures $P, Q \in \mathcal{P}$ is $\pi(P, Q) = \inf\{\epsilon > 0: P(A) \le Q(A^{\epsilon}) + \epsilon \text{ for all } A \in \mathcal{A}\}$, where $A^{\epsilon}$ is the set of points within $\epsilon$ of $A$. The Prokhorov metric is well suited for the study of robustness because it allows for frequent small (size $\epsilon$) perturbations such as rounding errors and infrequent (probability less than $\epsilon$) large errors such as inliers or outliers. It also admits models that are only asymptotically correct because weak convergence and $\pi$ convergence are equivalent.
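As an illustration of the two kinds of perturbation just described, the following short calculation (a sketch added for orientation, with $x_i$, $y_i$, $\eta$, $P_n$, and $Q_n$ introduced here only for the example) bounds the Prokhorov distance between two empirical measures on the same number of points.

```latex
% Setting: P_n = n^{-1} \sum_i \delta_{x_i},  Q_n = n^{-1} \sum_i \delta_{y_i}.

% (i) Gross error: y_i = x_i for all i except one index.
%     The two measures differ by at most mass 1/n on any Borel set A, so
P_n(A) \le Q_n(A) + \tfrac{1}{n} \le Q_n(A^{\epsilon}) + \epsilon
\quad \text{for every } \epsilon \ge \tfrac{1}{n},
\qquad \text{hence } \pi(P_n, Q_n) \le \tfrac{1}{n}.

% (ii) Rounding: |x_i - y_i| \le \eta for every i.
%      If x_i \in A then y_i lies within \epsilon of A for any \epsilon > \eta, so
P_n(A) \le Q_n(A^{\epsilon}) \le Q_n(A^{\epsilon}) + \epsilon
\quad \text{for every } \epsilon > \eta,
\qquad \text{hence } \pi(P_n, Q_n) \le \eta.
```

In case (i) the replaced observation may be moved arbitrarily far without increasing the bound, which is why a single outlier counts as a small disturbance in this metric.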

Using the Prokhorov metric, Hampel defines a continuous estimator as follows:

Definition 1. A sequence of estimators $\{T_n\}$ is continuous at $P \in \mathcal{P}$ if for every $\epsilon > 0$ there is a $\delta > 0$ and $n_0$ such that

$$\sup_{n,m \ge n_0}\ \sup_{W_{nm}} |T_n(P_n) - T_m(P_m)| < \epsilon,$$

where

$$W_{nm} = \{(P_n, P_m) \in \mathcal{P}_n \times \mathcal{P}_m: \pi(P_l, P) < \delta,\ l = m, n\}.$$

For our purposes, the two most important consequences of Hampel's definition of a continuous sequence of estimators are described on page 1891 of Hampel (1971): (a) continuity of $\{T_n\}$ at $P$ implies consistency of $\{T_n\}$ at $P$ for some $T_\infty(P)$ and (b) continuity of a functional $T$ at $P$ implies continuity of the sequence of restrictions $T_n \equiv T \mid \mathcal{P}_n$ at $P$.
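The following is a minimal numerical sketch of what continuity in this sense rules out. It replaces one observation out of $n$ (a disturbance of Prokhorov size at most $1/n$) and compares the resulting shift in the sample mean with the shift in the sample median. The data, the helper name `shifts`, and the choice of the median as a comparison estimator are illustrative assumptions, not taken from the article.

```python
import statistics

# Illustrative sample: 101 equally spaced points, symmetric about 0.
base = [0.1 * k for k in range(-50, 51)]
n = len(base)

def shifts(outlier):
    """Replace the first observation by `outlier` and return
    (|change in mean|, |change in median|). The two empirical
    measures differ by mass 1/n at a single point."""
    disturbed = [outlier] + base[1:]
    mean_shift = abs(statistics.fmean(disturbed) - statistics.fmean(base))
    median_shift = abs(statistics.median(disturbed) - statistics.median(base))
    return mean_shift, median_shift

for outlier in (1e2, 1e4, 1e6):
    mean_shift, median_shift = shifts(outlier)
    print(f"outlier={outlier:g}  mean shift={mean_shift:.3f}  median shift={median_shift:.3f}")
```

The mean shift grows without bound in the size of the single outlier, while the median moves by at most one grid step, which is the qualitative distinction that Definition 1 formalizes.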

3. P VALUES AND SLOPES

Bahadur (1967, 1971) has developed the following framework for analyzing the behavior of P values as the sample size increases.

Take $\{P_\theta, \theta \in \Theta\} \subset \mathcal{P}$ to be a set of probability measures of interest and consider the hypothesis $H: \theta \in \Theta_0$. Let $\mathbf{X}_n$ be a random sample from $P_\theta$ for some $\theta \in \Theta$ and let $T_n(\mathbf{X}_n)$ be a real-valued test statistic for $H$. Define the null df $G_n$ of $T_n$ by $G_n(t) = \inf_{\theta \in \Theta_0} P_\theta(T_n < t)$. Denote the null tail df $1 - G_n(\cdot)$ by $\bar G_n(\cdot)$.

When only large values of $T_n$ are inconsistent with $H$, the P value or observed significance level $L_n(T_n)$ is defined by $\bar G_n(T_n)$. If the null df of $T_n$ is conditional on a statistic $U_n$, then $L_n$ is called a conditional P value. For example, the two-sample $\bar Y - \bar X$ permutation P value is conditional on the order statistics of the combined sample. If the null tail df $\bar G_n$ is approximated by another function $\bar G_n^{(a)}$, then $\bar G_n^{(a)}(T_n)$ is called an approximate P value. Approximate P values must be used if $G_n$ is unknown or untabulated.

Typically, any appropriate P value, whether exact or approximate, unconditional or conditional, approaches zero exponentially fast under any nonnull measure $P_\theta$ as the sample size $n$ increases. If $-n^{-1}\log L_n \to s(\theta)$ a.s. $[P_\theta]$, Bahadur (1971) calls the exponential rate $s(\theta)$ the slope (actually, half the slope). For examples, see Bahadur (1971) or Section 5. If $L_n$ is an approximate P value, Bahadur calls the corresponding slope $s^{(a)}(\theta)$ an approximate slope. An approximate slope need not be a good approximation to an exact slope, but an approximate slope (and not an exact slope) is appropriate whenever an approximate P value is used. Subsequently the term slope is used without the qualifier exact or approximate; the type of the slope is assumed to agree with the type of the P value.

Bahadur (1971) gives the following prescription for determining slopes in his Theorem 7.2, here referred to as Theorem 1.

Theorem 1. Suppose that $\lim_{n\to\infty} T_n = b(\theta)$ a.s. $[P_\theta]$ for each $\theta \in \Theta_1 = \Theta - \Theta_0$, where $-\infty < b(\theta) < \infty$, and that $\lim_{n\to\infty} n^{-1}\log \bar G_n(t) = -c(t)$ for each $t$ in an open interval $I$, where $c$ is a continuous function on $I$ and $\{b(\theta): \theta \in \Theta_1\} \subset I$. Then $\lim_{n\to\infty} n^{-1}\log \bar G_n(T_n) = -s(\theta)$ a.s. $[P_\theta]$ with $s(\theta) = c(b(\theta))$ for each $\theta \in \Theta_1$.

As Bahadur remarks, if $\theta \in \Theta_0$, then $\lim n^{-1}\log L_n = 0$ for any exact P value $L_n$.

4. QUALITATIVE ROBUSTNESS OF P VALUES

Because P values belong to $[0, 1]$ and typically stochastically decrease to zero under many nonnull distributions as the sample size increases, their absolute size is often uninteresting for large sample sizes. Consequently, from the perspective of robustness, the change in the size of a P value caused by a disturbance of the sample is uninteresting for large sample sizes and the continuity of a sequence of P values is likewise uninteresting. Of more concern is the change in the P value caused by a disturbance of the sample relative to the size of the undisturbed P value.

Hampel's definition of continuity could be restated for relative changes in P values, but it is more convenient (and accomplishes the same goal) to apply his definition to $-n^{-1}\log$-transformed P values, $n$ being a measure of the sample size. The sample-size dependent factor $n^{-1}$ is, of course, not needed to express relative change. It merely ensures, as discussed in Section 3, that typical transformed P values do not have a constant limit of zero under many disturbed as well as undisturbed alternative distributions. These considerations lead to the description of qualitatively robust tests next given as Definition 2.

Definition 2. A test is qualitatively robust under a measure $P \in \mathcal{P}$ if the corresponding sequence of transformed P values $\{-n^{-1}\log L_n\}$ is continuous at $P$ and each term in the sequence is continuous as a point function a.e. $[P]$.

Definition 2 reduces the investigation of test qualitative robustness to an investigation of estimator continuity since $-n^{-1}\log L_n$ is often a consistent estimator of the slope. Moreover, with Definition 2 the qualitative robustness character of an unconditional test generally depends only on the continuity character of its test statistic sequence. Given an unconditional test with statistic $T_n$, the transformed P value $-n^{-1}\log L_n$ can be written as $c_n(T_n)$, where $c_n = -n^{-1}\log \bar G_n$ and $\bar G_n$, the exact or approximate null tail df of $T_n$, does not vary with the data. Theorem 2 gives weak conditions on the sequence of functions $\{c_n\}$ and their limit $c$ (here $c(T_\infty(P))$ is the slope at $P$) that imply an unconditional test is qualitatively robust iff its test statistic is continuous. That these conditions are weak (i.e., satisfied by many tests of interest) is shown in Lambert and Hall (1982).

Theorem 2. Take $\{T_n\}$ to be a sequence of real-valued statistics and $\{c_n\}$ to be a sequence of nonrandom functions on the real line. Suppose there exists a function $c$ such that $c_n(b_n) \to c(b)$ whenever $b_n \to b$. Then $\{c_n(T_n)\}$ is continuous at $P$ if $\{T_n\}$ is continuous at $P$. If, in addition, $T_n \equiv T \mid \mathcal{P}_n$, $-\infty < T(P) < \infty$, and $c$ is strictly increasing, then $\{c_n(T_n)\}$ is discontinuous at $P$ if $\{T_n\}$ is discontinuous at $P$.

Proof. Suppose $\{T_n\}$ is continuous at $P$. If $\{c_n(T_n)\}$ is discontinuous at $P$, there exist an integer $n_0$, a constant $\epsilon > 0$, and sequences $\{P_n\}$, $\{Q_n\}$ such that

$$P_n, Q_n \in \mathcal{P}_n, \quad \pi(P_n, P) \to 0, \quad \pi(P_n, Q_n) < 1/n,$$

and

$$|c_n(T_n(P_n)) - c_n(T_n(Q_n))| > \epsilon \quad \text{for all } n > n_0.$$

Therefore,

$$|c_n(T_n(P_n)) - c(T_\infty(P))| + |c_n(T_n(Q_n)) - c(T_\infty(P))| > \epsilon$$

for all $n > n_0$, which is a contradiction, since $\{T_n\}$ converges to $T_\infty(P)$ under $P$ and hence both terms on the left tend to zero by the assumed convergence of $c_n$.

If $\{T_n\} \equiv \{T \mid \mathcal{P}_n\}$ is discontinuous at $P$, then $T$ is discontinuous at $P$ and there are $n_0$, $\epsilon > 0$, and $\{P_n\}$ satisfying $\pi(P_n, P) < 1/n$ such that $|T(P_n) - T(P)| > \epsilon$ for all $n > n_0$. If the sequence $\{T(P_n)\}$ is bounded, then it contains a convergent subsequence and without loss of generality we may take $T(P_n) \to T_* \ne T(P)$. On the other hand, if $\limsup |T(P_n)| = \infty$, then $\{T(P_n)\}$ contains a subsequence for which $\{c_n(T_n(P_n))\}$ approaches either $\lim_{t\to\infty} c_n(t)$ or $\lim_{t\to-\infty} c_n(t)$ monotonically. In all cases, $|c_n(T(P_n)) - c(T(P))|$ and, hence, $|c_n(T(P_n)) - c_n(T(P))|$ approaches a positive constant. Therefore $\{c_n(T_n)\}$ cannot be continuous at $P$.

The condition that $T_n \equiv T \mid \mathcal{P}_n$ and $|T(P)| < \infty$ is not necessary, but it is simple to state and weak enough for the usual tests of interest. Theorem 2 is applied to several one- and two-sample tests in Section 5.

Theorem 2 also identifies the class of transformations for which the qualitative robustness character of an unconditional test derives from the continuity properties of the test statistic. Simply stated, the qualitative robustness, or lack of qualitative robustness, of an unconditional test is invariant under any scale in which the P value has a limit that is not constant in the alternative. The $-n^{-1}\log$ transformation is emphasized here because it is traditional. Lambert (1978) suggests that the transformations $(-n^{-1}\log L_n)^{1/2}$ and $n^{-1/2}\bar\Phi^{-1}(L_n)$, where $\bar\Phi$ is the standard normal tail df, may be preferable to $-n^{-1}\log L_n$ since they may have more stable behavior in large samples. Test qualitative robustness is independent of which of the three scales is chosen.

The restriction to nonrandom functions $c_n$, and therefore to unconditional P values, in Theorem 2 cannot be relaxed. The last example of Section 5 describes a permutation test that is qualitatively robust even though its test statistic is discontinuous. Such examples are not surprising since a conditional P value does not depend on the data through the test statistic alone.

5. APPLICATIONS

Consider first the one-sample problem in which $X_1, X_2, \ldots$ are iid normal $(\theta, 1)$ random variables and the null hypothesis is $H: \theta \le 0$. Under the alternative the data are iid with positive mean $\theta$. With these conditions, the lack of qualitative robustness of the sample mean and Student's t tests, the qualitative robustness of the sign test at distributions continuous at zero, and the qualitative robustness of the Huber (1965, 1968) and Wilcoxon tests everywhere are next established by applying Theorem 2. Two-sample tests are considered later in this section. Throughout we identify a probability measure by its df and denote the empirical df (edf) by $F_n$.

For the sample mean test, the test statistic is $T_{\bar x}(F_n) = \int x\,dF_n(x)$ and the transformed P value is $c_n(T_{\bar x}(F_n))$ with $c_n(t) = -n^{-1}\log \bar\Phi(\sqrt{n}\,t)$. Since Mill's ratio asserts

$$\frac{1}{r} - \frac{1}{r^3} < \frac{\bar\Phi(r)}{\phi(r)} < \frac{1}{r}, \quad r > 0,$$

where $\phi(\cdot)$ is the standard normal pdf (Feller 1968, p. 175), it follows that $c_n(t_n)$ converges to $c(t) = t^2/2$ whenever $t_n \to t$, $0 < t < \infty$. Hence, by Theorem 2, the sample mean test is qualitatively robust at an alternative distribution $P$ with finite positive mean if and only if $T_{\bar x}$ is continuous at $P$. But $T_{\bar x}$ is discontinuous at every $P$ with finite mean since for any such $P$ there is a sequence of measures $\{P_n\}$ that converges in law but not in mean to $P$.
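The convergence of $c_n(t)$ to $t^2/2$ for the sample mean test can be checked numerically. The sketch below evaluates $-n^{-1}\log\bar\Phi(\sqrt{n}\,t)$ with the standard library's complementary error function; the function names and the sample sizes are illustrative choices, and the evaluation is limited to arguments for which the normal tail does not underflow in double precision.

```python
import math

def normal_tail(x):
    """Standard normal upper tail probability via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def c_n(t, n):
    """Transformed P value function -n^{-1} log Phi_bar(sqrt(n) t) for the sample mean test."""
    return -math.log(normal_tail(math.sqrt(n) * t)) / n

t = 1.0
for n in (25, 100, 400, 900):
    print(n, round(c_n(t, n), 4), "limit:", t * t / 2)

# Mill's ratio bounds at an illustrative point r = 3.
r = 3.0
ratio = normal_tail(r) / normal_pdf(r)
print(1 / r - 1 / r**3, "<", ratio, "<", 1 / r)
```

The gap between $c_n(t)$ and $t^2/2$ is of order $n^{-1}\log n$, so the approach to the limit is slow but monotone in this example.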

For the Student's t test, take the test statistic to be

$$T_t(F_n) = \int x\,dF_n(x) \Big/ \left(\int x^2\,dF_n(x) - \left(\int x\,dF_n(x)\right)^2\right)^{1/2}.$$

The exact null tail df evaluated at $x$ is $\bar G_n(\sqrt{n-1}\,x)$, where $G_n$ is the Student's t df with $n - 1$ degrees of freedom. Bahadur (1960) has shown that

$$c_n(t_n) = -n^{-1}\log \bar G_n(\sqrt{n-1}\,t_n) \to \tfrac{1}{2}\log(1 + t^2)$$

if $t_n \to t > 0$. Hence the Student's t test is qualitatively robust only if $T_t$ is continuous. But $T_t$ is discontinuous. If $P \in \mathcal{P}$ has finite positive mean and variance, then the sequence of probability measures $P_n = ((n-1)/n)P + (1/n)\delta_{\sqrt{n}}$, $n = 1, 2, \ldots$, where $\delta_x$ is the probability measure supported on $x$, converges weakly to $P$ but $T_t(P_n)$ does not converge to $T_t(P)$.
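The failure of $T_t$ to converge along a contaminated sequence can be made concrete through the first two moments. The sketch below assumes the contaminating point mass sits at $\sqrt{n}$ (the location is not fully legible in the source, so this is an assumption made for illustration) and takes a hypothetical $P$ with mean 1 and standard deviation 1.

```python
import math

def t_functional(m1, m2):
    """T_t = mean / standard deviation, written in terms of the first two moments."""
    return m1 / math.sqrt(m2 - m1 * m1)

def contaminated_t(mu, sigma, n):
    """T_t at P_n = ((n-1)/n) P + (1/n) delta_a, with a = sqrt(n) (assumed location),
    where P has mean mu and standard deviation sigma."""
    w = 1.0 / n
    a = math.sqrt(n)
    m1 = (1 - w) * mu + w * a                        # mean of the mixture
    m2 = (1 - w) * (sigma**2 + mu**2) + w * a * a    # second moment of the mixture
    return t_functional(m1, m2)

mu, sigma = 1.0, 1.0   # hypothetical undisturbed distribution
print("T_t(P)      =", t_functional(mu, sigma**2 + mu**2))
for n in (10, 10**3, 10**6):
    print(f"T_t(P_{n}) =", contaminated_t(mu, sigma, n))
print("limit       =", mu / math.sqrt(sigma**2 + 1))
```

Under this assumption the contaminating mass vanishes but adds one unit to the limiting second moment, so $T_t(P_n)$ tends to $\mu/\sqrt{\sigma^2 + 1}$ rather than to $\mu/\sigma$.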

The sign test statistic $T_s(F_n)$ equals the proportion $1 - F_n(0)$ of positive observations in the random sample of $n$; its exact null tail probabilities are available from the binomial distribution. Bahadur (1960) has shown that the $-n^{-1}\log$-transformed P value function $c_n(\cdot)$, in the notation established in Section 4, satisfies $c_n(t_n) \to c(t) = t\log t + (1 - t)\log(1 - t) + \log 2$ whenever $t_n \to t \in (\tfrac{1}{2}, 1)$. If the df $P$ is continuous at zero and $P(0) < \tfrac{1}{2}$, then the test statistic is continuous as a function of $P$ and as a point function a.e. $[P]$. Therefore, by Theorem 2, the sign test is qualitatively robust at all distributions continuous at zero with $P(0) < \tfrac{1}{2}$.
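The limit $c(t)$ for the sign test can be compared with the exact binomial tail. The sketch below computes $-n^{-1}\log P(\mathrm{Binomial}(n, \tfrac12) \ge nt)$ with exact integer arithmetic; the function names and the values of $n$ and $t$ are illustrative.

```python
import math

def sign_test_c_n(t, n):
    """-n^{-1} log P(Binomial(n, 1/2) >= n t), using exact integer binomial counts."""
    k0 = math.ceil(n * t)
    tail_count = sum(math.comb(n, k) for k in range(k0, n + 1))
    # P = tail_count / 2^n; take logs separately to avoid floating-point underflow.
    return -(math.log(tail_count) - n * math.log(2.0)) / n

def sign_test_c(t):
    """Limiting function c(t) = t log t + (1 - t) log(1 - t) + log 2."""
    return t * math.log(t) + (1 - t) * math.log(1 - t) + math.log(2.0)

t = 0.75
for n in (20, 200, 2000):
    print(n, round(sign_test_c_n(t, n), 5), "limit:", round(sign_test_c(t), 5))
```

Because $c$ is strictly increasing on $(\tfrac12, 1)$ and vanishes at $t = \tfrac12$, the transformed P value separates alternatives with different proportions of positive observations.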

Taking $R(X_i)$ to be the rank of $|X_i|$ in $|X_1|, \ldots, |X_n|$, the Wilcoxon test statistic is

$$T_W(F_n) = n^{-2}\sum_{i=1}^{n} R(X_i)\,\mathrm{sgn}(X_i) \doteq \int (F_n(x) - F_n(-x))\,dF_n(x).$$

The $-n^{-1}\log$-transformed Wilcoxon exact null tail df satisfies $-n^{-1}\log P[T_W(F_n) \ge t_n] = c_n(t_n) \to c(t)$ whenever $t_n \to t \in (0, \tfrac{1}{2})$ with $c(t) = \beta t - \int_0^1 \log\cosh(\beta y)\,dy$ and $\beta$ the solution to $\int_0^1 y\tanh(\beta y)\,dy = t$ (Lambert and Hall 1982). The function $c$ is strictly increasing, $T_n$ is a.e. continuous as a point function, and $T$ is continuous as a function of $P$ since

I TW(P) - T(Q) | - I (P(x)-Q(x)) d(P(x)-Q(x))|

for any $P, Q \in \mathcal{P}$. Thus, the Wilcoxon test is qualitatively robust everywhere. The Wilcoxon approximate P value $\bar\Phi((3n)^{1/2}T_W(F_n))$, which is based on the asymptotic distribution of the Wilcoxon test statistic, is shown to be qualitatively robust in a similar manner.

Huber (1965, 1968) constructed a maximin test for the composite, contaminated hypothesis $H_0^*: X_1, \ldots, X_n \sim$ iid $P$ for

$$P \in \mathcal{P}_0^* = \{(1 - \epsilon)\Phi_0 + \epsilon Q,\ Q \in \mathcal{P}\}$$

against the contaminated alternative

$$H_1^*: X_1, \ldots, X_n \sim \text{iid } P \in \mathcal{P}_1^* = \{(1 - \epsilon)\Phi_\theta + \epsilon Q,\ Q \in \mathcal{P}\},$$

where $\Phi_\theta$ is the normal $(\theta, 1)$ probability distribution. These hypotheses allow for contamination of any kind by an amount $\epsilon$ or less. (Huber's test is actually maximin for wider null and alternative hypotheses, but this is unimportant here; see Huber 1968.) Huber's test statistic is $T_H(F_n) = \int \mathrm{median}(a, x, b)\,dF_n(x)$, where the censoring parameters $a$, $b$ depend on the level $\epsilon$ of contamination and the alternative $\theta$ of interest. If $\epsilon$ is positive then $a$ and $b$ are both finite. Since the exact null df of $T_H(F_n)$ is intractable, an approximate P value based on a central limit theorem approximation to the exact distribution of $T_H(F_n)$ is likely to be used in practice. The convergence of $c_n(t_n)$ to $c(t)$ then follows as in the example of the sample mean test (the corresponding convergence for the exact P value is shown in Lambert and Hall 1982). However, unlike the sample mean, $T_H$ is continuous at any $P \in \mathcal{P}$ since weak convergence of $P_n$ to $P$ is equivalent to convergence of $\int f\,dP_n$ to $\int f\,dP$ for all bounded, uniformly continuous real $f$. Therefore, Huber's test based on approximate (or exact) P values is qualitatively robust everywhere.
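A small sketch of the censored statistic $T_H(F_n) = \int \mathrm{median}(a, x, b)\,dF_n(x)$ shows why it behaves differently from the sample mean. The censoring points `a = -1.0` and `b = 2.0` and the sample are arbitrary illustrative values; the article states only that $a$ and $b$ depend on $\epsilon$ and $\theta$ and does not give them.

```python
import statistics

def huber_statistic(sample, a, b):
    """T_H(F_n): the sample mean of median(a, x, b), i.e. of x censored to [a, b]."""
    return statistics.fmean(min(max(x, a), b) for x in sample)

a, b = -1.0, 2.0                                     # hypothetical censoring parameters
sample = [0.2, -0.7, 1.3, 0.4, 2.5, -1.9, 0.8, 1.1]  # illustrative data
n = len(sample)
disturbed = [1e9] + sample[1:]                       # one gross outlier

huber_shift = abs(huber_statistic(disturbed, a, b) - huber_statistic(sample, a, b))
mean_shift = abs(statistics.fmean(disturbed) - statistics.fmean(sample))
print("censored-mean shift:", huber_shift, "(bound (b - a)/n =", (b - a) / n, ")")
print("sample-mean shift:  ", mean_shift)
```

Changing one observation out of $n$ moves the censored mean by at most $(b - a)/n$ however extreme the new value is, because the integrand $\mathrm{median}(a, x, b)$ is bounded and continuous.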

In the case of two-sample tests, take the index $n$ to be the combined sample size. Assume each permissible $n$ has an associated partition into positive integers, say $n = n_1 + n_2$, such that $n_1/n \to \lambda$, $0 < \lambda < 1$. The test statistic $T_n$ is then actually a statistic $U_{n_1, n_2}$ for each permissible combined sample size $n$. With these conventions, the definitions and theorems of Sections 2, 3, and 4 generalize in a straightforward manner. In particular, Definition 1 extends to two-sample tests as follows:

Definition 1'. A sequence of estimators $\{T_n\}$ whose evaluation depends on two samples and whose index $n$ is partitioned into $n_1$, $n_2$ is continuous at $(P, Q) \in \mathcal{P} \times \mathcal{P}$ if for every $\epsilon > 0$ there is a $\delta > 0$ and $n_0$ such that

$$\sup_{n,m \ge n_0}\ \sup_{W'_{nm}} |T_n(P_{n_1}, Q_{n_2}) - T_m(P_{m_1}, Q_{m_2})| < \epsilon,$$

where

$$W'_{nm} = \{(P_{n_1}, Q_{n_2}, P_{m_1}, Q_{m_2}) \in \mathcal{P}_{n_1} \times \mathcal{P}_{n_2} \times \mathcal{P}_{m_1} \times \mathcal{P}_{m_2}: \pi(P_{l_1}, P) < \delta,\ \pi(Q_{l_2}, Q) < \delta,\ l = m, n\}.$$

The slope of a two-sample test is defined to be the a.s. limit, in the permissible combined sample size $n$, of the $-n^{-1}\log$-transformed P value. Definition 2 of a qualitatively robust test requires no modification with the understanding that the index $n$ runs through the permissible combined sample sizes.

For each index $n = n_1 + n_2$, take $X_1, \ldots, X_{n_1}$ and $Y_1, \ldots, Y_{n_2}$ to be two independent random samples. The proportion $n_1/n$ of observations labeled $X$ is assumed to approach $\lambda \in (0, 1)$. All marginal distributions are identical under the null hypothesis; each $X_i$ has mean zero and each $Y_i$ has positive mean $\theta$ under the alternative. The edf of $X_1, \ldots, X_{n_1}$ is denoted by $F_{xn}$ and the edf of $Y_1, \ldots, Y_{n_2}$ is denoted by $F_{yn}$.

Like their one-sample analogs, the two-sample $\bar Y - \bar X$ and Student's t tests with either pooled or unpooled variance estimates are not qualitatively robust and the exact and approximate Wilcoxon P values are qualitatively robust. The qualitative robustness of the van der Waerden test at all distributions under which the test statistic is asymptotically finite and the lack of qualitative robustness of the permutation test based on $\bar Y - \bar X$ are next established. It follows that a van der Waerden test can replace a permutation test under normal alternatives with no sacrifice in Pitman efficiency, a negligible sacrifice in exact slope (cf. Lambert and Hall 1982), no sacrifice in influence function behavior (Lambert 1981), and a gain in qualitative robustness.

The test statistic for van der Waerden's test may be written as

$$\int \Phi^{-1}\big((n_1 F_{xn}(y) + n_2 F_{yn}(y))/(n + 1)\big)\,dF_{yn}(y)$$

since $n_1 F_{xn}(Y_i) + n_2 F_{yn}(Y_i)$ is the rank of $Y_i$ in the combined sample of $n$. The exact slope of van der Waerden's test is given in Stone (1969); the approximate slope based on a normal approximation to the exact null df is available from Mill's ratio. The convergence $c_n(t_n) \to c(t)$ for $t_n \to t$, which is needed in the application of Theorem 2, is proved for exact P values and approximate P values in Lambert and Hall (1982). Hence, the exact and approximate van der Waerden's tests are qualitatively robust if and only if

$$T_v(F_x, F_y) = \int \Phi^{-1}(\lambda F_x(u) + \bar\lambda F_y(u))\,dF_y(u), \quad \bar\lambda = 1 - \lambda,$$

is finite and continuous at $(F_x, F_y)$. We will assume finiteness of the integral at $(F_x, F_y)$ and $F_x(t) > F_y(t)$ for all $t$ smaller than some $t_0$.

Continuity of $T_v$ at $(F_x, F_y)$ is proved by showing that the supremum and infimum of $\{T_v(F_1, F_2): (F_1, F_2) \in (F_{x\delta}, F_{y\delta})\}$, where $F_\delta$ is a $\delta$-Prokhorov neighborhood of the distribution $F$, both converge to $T_v(F_x, F_y)$ as $\delta$ approaches zero. Since $F_x$, $F_y$, and $\Phi^{-1}$ are nondecreasing, the supremum of $T_v$ over $(F_{x\delta}, F_{y\delta})$ occurs at the stochastically largest distributions $(U_{x\delta}, U_{y\delta})$ in $(F_{x\delta}, F_{y\delta})$ and the infimum at the stochastically smallest distributions $(L_{x\delta}, L_{y\delta})$ in $(F_{x\delta}, F_{y\delta})$.

The df's $U_{x\delta}$ and $U_{y\delta}$ are defined by $U_{z\delta}(u) = \max(F_z(u - \delta) - \delta, 0)$, $z = x$ or $y$. If $\delta$ is so small that $F_y^{-1}(\delta) < t_0$, then

$$T_v(U_{x\delta}, U_{y\delta}) = \int \Phi^{-1}(\lambda F_x(u) + \bar\lambda F_y(u) - \delta)\,dF_y(u) + \delta\,\Phi^{-1}(1 - \delta).$$

The term $\delta\,\Phi^{-1}(1 - \delta)$ approaches zero as $\delta \to 0$; the integral approaches $T_v(F_x, F_y)$ as $\delta$ approaches zero by the monotone convergence theorem. A similar argument shows $T_v(L_{x\delta}, L_{y\delta})$ also converges to $T_v(F_x, F_y)$ as $\delta \to 0$. Therefore $T_v$ is continuous and van der Waerden's test is qualitatively robust at $(F_x, F_y)$.
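The rank-based form of the van der Waerden statistic can be sketched directly. The code below averages the normal scores $\Phi^{-1}(\text{rank}/(n+1))$ of the $Y$ observations in the combined sample, assuming no ties; the function name and data are illustrative, and `statistics.NormalDist().inv_cdf` supplies $\Phi^{-1}$.

```python
from statistics import NormalDist

def van_der_waerden(xs, ys):
    """Mean normal score of the Y ranks in the combined sample (assumes no ties)."""
    combined = sorted(list(xs) + list(ys))
    n = len(combined)
    rank = {value: i + 1 for i, value in enumerate(combined)}
    inv = NormalDist().inv_cdf
    return sum(inv(rank[y] / (n + 1)) for y in ys) / len(ys)

xs = [-1.2, -0.4, 0.1, 0.5, 0.9]    # illustrative X sample
ys = [0.3, 1.1, 1.6, 2.4, 3.0]      # illustrative Y sample

t_clean = van_der_waerden(xs, ys)
t_high = van_der_waerden(xs, ys[:-1] + [1e9])   # largest Y pushed far up
t_low = van_der_waerden(xs, [-1e9] + ys[1:])    # smallest Y dragged far down
print(t_clean, t_high, t_low)
```

Pushing the largest $Y$ upward leaves every rank, and hence the statistic, unchanged, and dragging the smallest $Y$ downward changes only one normal score, so a single gross error has a bounded effect that shrinks as $n_2$ grows.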

The lack of qualitative robustness of the $\bar Y - \bar X$ permutation test or, equivalently, of the $\bar Y$ permutation test, can be shown as follows. The $\bar Y$-permutation P value $L_{pn}$ is evaluated by first calculating the mean $\bar Z_\pi$ of each size $n_2$ subset $\pi$ of the $n$ combined sample observations and then determining the proportion of the $\bar Z_\pi$'s that equal or exceed $\bar Y$. That is,

$$L_{pn} = \binom{n}{n_2}^{-1}\sum_{\pi} I(\bar Z_\pi, \bar Y),$$

where $I(x, y) = 1$ if $x \ge y$ and 0 otherwise.

Suppose $L_{pn}$ is asymptotically well defined at the alternative $(F_x, F_y)$ in the sense that its exact slope at $(F_x, F_y)$ exists and is nonzero (appropriate regularity conditions are given in Lambert and Hall 1982). Take $F_{xn} = n_1^{-1}\sum_{i=1}^{n_1}\delta_{X_i} \in \mathcal{P}_{n_1}$ and $F_{yn} = n_2^{-1}\sum_{i=1}^{n_2}\delta_{Y_i} \in \mathcal{P}_{n_2}$ to satisfy $\pi(F_{zn}, F_z) \to 0$ as $n$ increases, $z = x$ or $y$. By relabeling the $Y_i$'s, if necessary, we may assume $Y_1 = \min Y_i$ and define a new df $F'_{yn} \in \mathcal{P}_{n_2}$ by

$$F'_{yn} = \Big(\delta_{Y'_n} + \sum_{i=2}^{n_2}\delta_{Y_i}\Big)\Big/ n_2.$$

Clearly, $\pi(F'_{yn}, F_y) \to 0$. For $Y'_n$ small enough, the mean $\bar Z_\pi$ of any subset $\pi$ not containing $Y'_n$ is larger than the disturbed sample mean $\bar Y'$ and

$$L_{pn}(F_{xn}, F'_{yn}) = \binom{n}{n_2}^{-1}\Big(\binom{n-1}{n_2} + \sum_{\pi'} I(\bar Z_{\pi'}, \bar Y')\Big),$$

where $\pi'$ is any subset of size $n_2$ containing $Y'_n$. As $n$ increases,

$$\binom{n}{n_2}^{-1}\binom{n-1}{n_2}$$

converges to $\lambda$ and

$$\binom{n}{n_2}^{-1}\sum_{\pi'} I(\bar Z_{\pi'}, \bar Y')$$

behaves like $\bar\lambda L'_{pn}$, where $L'_{pn}$ is the $\bar Y$-permutation P value calculated from the $n - 1$ observations $X_1, \ldots, X_{n_1}, Y_2, \ldots, Y_{n_2}$. Since $L'_{pn}$ and $L_{pn}(F_{xn}, F_{yn})$ have the same exact slope, it follows that $-n^{-1}\log L_{pn}(F_{xn}, F'_{yn}) \to 0$ as $n \to \infty$. Consequently, the $\bar Y$ (or $\bar Y - \bar X$) permutation test is nowhere qualitatively robust.

It is interesting to note that increasing the maximum $Y_i$ or decreasing the minimum $X_i$ cannot degrade the $\bar Y - \bar X$ permutation test asymptotically. Suppose, for example, that $\max Y_i$ increases. Then the values $\bar Z_\pi$ associated with subsets of size $n_2$ that include $\max Y_i$ increase and all other $\bar Z_\pi$ are unchanged. Once $\max Y_i$ is so large that $\min\{\bar Z_\pi: \max Y_i \in \pi\} \ge \max\{\bar Z_\pi: \max Y_i \notin \pi\}$ there is no further decrease in the P value. Hence, as $\max Y_i$ increases the $\bar Y$-permutation P value decreases until it reaches $(n_2/n)L^*_{pn}$, where $L^*_{pn}$ is the $\bar Y$-permutation P value based on the $n - 1$ combined sample observations that exclude $\max Y_i$. Finally, $L_{pn}$ and $(n_2/n)L^*_{pn}$ have the same exact slope, if any.
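Both effects can be reproduced by exact enumeration on a small example. The sketch below computes the $\bar Y$-permutation P value by listing every size-$n_2$ subset of the pooled sample; it compares subset sums rather than means (equivalent for fixed subset size) and uses integer data so that ties are decided exactly. The data and function name are illustrative.

```python
from fractions import Fraction
from itertools import combinations

def y_permutation_p_value(xs, ys):
    """Exact Y-permutation P value: share of size-n2 subsets of the pooled sample
    whose sum equals or exceeds the observed sum of the Y sample."""
    pooled = list(xs) + list(ys)
    target = sum(ys)
    hits = total = 0
    for subset in combinations(pooled, len(ys)):
        total += 1
        hits += sum(subset) >= target
    return Fraction(hits, total)

xs = [-12, -4, 1, 5, 9]        # illustrative X sample, n1 = 5
ys = [21, 26, 30, 34, 39]      # illustrative Y sample, n2 = 5

p_clean = y_permutation_p_value(xs, ys)
p_low = y_permutation_p_value(xs, [-10**6] + ys[1:])    # min Y dragged far down
p_high = y_permutation_p_value(xs, ys[:-1] + [10**6])   # max Y pushed far up
print(p_clean, p_low, p_high)
```

In this example a single low outlier in the $Y$ sample lifts the P value from $1/252$ to $127/252$, just above $n_1/n = 1/2$, whereas a single high outlier leaves it at $(n_2/n)L^*_{pn} = (5/10)(1/126) = 1/252$, in line with the two arguments above.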

The insensitivity of the $\bar Y - \bar X$ permutation test to infrequent "large" outliers in the $Y_i$'s and infrequent "small" outliers in the $X_i$'s suggests the permutation test based on

$$T(F_{xn}, F_{yn}) = \int \max(a, y)\,dF_{yn}(y) - \int \min(b, x)\,dF_{xn}(x)$$

is continuous at all $(F_x, F_y)$. This continuity is indeed valid. Therefore, qualitatively robust permutation tests based on discontinuous test statistics do exist, and qualitative robustness of conditional tests is not equivalent to continuity of their test statistics.

[Received April 1979. Revised September 1981.]

REFERENCES

BAHADUR, R.R. (1960), "Simultaneous Comparison of the Optimum and Sign Tests of a Normal Mean," in Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling, eds. I. Olkin, S.G. Ghurye, W. Hoeffding, W.G. Madow, and Henry B. Mann, Stanford: Stanford University Press, 79-88.

BAHADUR, R.R. (1967), "Rates of Convergence of Estimates and Test Statistics," Annals of Mathematical Statistics, 38, 303-324.

BAHADUR, R.R. (1971), Some Limit Theorems in Statistics, Philadelphia: SIAM.

FELLER, W. (1968), An Introduction to Probability Theory and Its Applications (vol. 1, 3rd ed.), New York: John Wiley.

HAMPEL, F.R. (1971), "A General Qualitative Definition of Robustness," Annals of Mathematical Statistics, 42, 1887-1896.

HUBER, P.J. (1965), "A Robust Version of the Probability Ratio Test," Annals of Mathematical Statistics, 36, 1753-1758.

HUBER, P.J. (1968), "Robust Confidence Limits," Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 10, 269-278.

LAMBERT, D. (1978), "P-Values: Asymptotics and Robustness," unpublished PhD Dissertation, Department of Statistics, University of Rochester.

LAMBERT, D. (1981), "Influence Functions for Testing," Journal of the American Statistical Association, 76, 649-657.

LAMBERT, D., and HALL, W.J. (1982), "Asymptotic Lognormality of P-Values," Annals of Statistics, 10, 44-64.

RIEDER, H. (1978), "A Robust Asymptotic Testing Model," Annals of Statistics, 6, 1080-1094.

RIEDER, H. (1981), "Robustness of One- and Two-Sample Rank Tests Against Gross Errors," Annals of Statistics, 9, 245-265.

STONE, M. (1969), "Approximations to Extreme Tail Probabilities for Sampling Without Replacement," Proceedings of the Cambridge Philosophical Society, 66, 587-606.

YLVISAKER, D. (1977), "Test Resistance," Journal of the American Statistical Association, 72, 551-557.
