Trimmed Comparison of Distributions

Pedro César Álvarez-Esteban, Eustasio del Barrio, Juan Antonio Cuesta-Albertos, and Carlos Matrán. "Trimmed Comparison of Distributions." Journal of the American Statistical Association, Vol. 103, No. 482 (June 2008), pp. 697-704. Published by the American Statistical Association. Stable URL: http://www.jstor.org/stable/27640092



Trimmed Comparison of Distributions

Pedro César Álvarez-Esteban, Eustasio del Barrio, Juan Antonio Cuesta-Albertos, and Carlos Matrán

This article introduces an analysis of similarity of distributions based on the L2-Wasserstein distance between trimmed distributions. Our main innovation is the use of the impartial trimming methodology, already considered in robust statistics, which we adapt to this setup. Instead of simply removing data at the tails to provide some robustness to the similarity analysis, we develop a data-driven trimming method aimed at maximizing similarity between distributions. Dissimilarity is then measured in terms of the distance between the optimally trimmed distributions. We provide illustrative examples showing the improvements over previous approaches and give the relevant asymptotic results to justify the use of this methodology in applications.

KEY WORDS: Asymptotics; Impartial trimming; Similarity; Trimmed distributions; Wasserstein distance.

1. INTRODUCTION

An intrinsic consequence of randomness is variability. Samples obtained from a random experiment generally will differ, and even two ideal samples coming from the same random generator cannot be expected to be the same. A main challenge for the statistician is to be able to detect departures from this ideal equality that cannot reasonably be attributed to randomness.

Often the researcher is not really concerned about exact coincidence, but rather wants to guarantee that the random generators do not differ too much. The usual approach in the statistical literature to this "not differ too much" involves fixing a certain parameter related to the distribution of the random generators (possibly the distribution itself) and checking whether some distance between the parameters in the two samples lies below a given threshold. In this article we propose a different approach to the problem, with a motivation somewhat influenced by robust statistics.

Imagine that we want to compare two univariate data samples. We observe that the associated histograms look different, but we realize that we can remove a certain fraction, say 5%, of the data in the first sample and another 5% of the data in the second sample, in such a way that the remaining data in both samples produce very similar histograms. We then would be tempted to say that the (95%) cores of the underlying distributions are similar. This could be the case when, for instance, trying to assess the similarity of two human populations with respect to a given feature. Both populations could be initially equal, but the presence of different immigration patterns might cause a difference in the overall distribution of that feature, whereas, on the other hand, the "cores" of both populations remain equal. Another example in which we could be interested in comparing the "cores" of two distributions is when we want to check equality in the distributions generating two samples of a physical magnitude but find that the measuring devices are not perfect and introduce some distortions when the true values lie within a certain range, leaving other values unaffected. The distortions introduced by the two measuring devices could be of different types, but if they did not affect more than a small fraction of the observations, again the "core" of the distributions could be equal.

Pedro César Álvarez-Esteban is Associate Professor (E-mail: [email protected]), Eustasio del Barrio is Associate Professor, and Carlos Matrán is Professor, Department of Statistics and Operations Research, University of Valladolid, Valladolid, Spain. Juan Antonio Cuesta-Albertos is Professor, Department of Mathematics, Statistics, and Computation, University of Cantabria, Santander, Spain. This research was supported in part by the Spanish Ministry of Science and Technology and FEDER (grant BFM2005-04430-C02-01 and 02) and by the Consejería de Educación y Cultura de la Junta de Castilla y León (grant PAPIJCL VA 102/06). The data sets corresponding to the multiclinical study were kindly provided by Axel Munk and Claudia Czado. The data used in Section 3 are available in the majors.dat file in the example data sets of many statistics packages. We obtained them from the textbook by Moore and McCabe (2003). The computational analyses were done using R statistical software. The R programs and functions used to analyze the examples considered in this work are available at http://www.eio.uva.es/~pedroc/RJ. The authors thank the reviewing team for their careful reading of the manuscript and useful suggestions.

Let us formalize this idea of the core of a distribution. When trimming a fraction (of size at most $\alpha$) of the data in the sample to allow a better comparison with the other sample, we replace the empirical measure $\frac{1}{n}\sum_{i=1}^{n}\delta_{X_i}$ with a new probability measure that gives weight 0 to the observations in the bad set and weight $\frac{1}{n-k}$ to every observation remaining in the sample. Here $k$ is the number of trimmed observations; thus $k \le n\alpha$ and $\frac{1}{n-k} \le \frac{1}{n(1-\alpha)}$. Instead of simply keeping/removing data, we could increase the weight of data in good ranges (by a factor bounded by $\frac{1}{1-\alpha}$) and downplay the importance of data in bad zones, not necessarily removing them. The new trimmed empirical measure can be written as
$$\frac{1}{n}\sum_{i=1}^{n} b_i \delta_{X_i}, \quad \text{where } 0 \le b_i \le \frac{1}{1-\alpha} \text{ and } \frac{1}{n}\sum_{i=1}^{n} b_i = 1.$$
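The hard-trimming special case of the weights above (weight 0 on $k \le n\alpha$ removed observations, $n/(n-k)$ on the rest) can be sketched numerically. The authors' own software is in R; the Python helper below, with a name chosen here, only illustrates the weight constraints.

```python
def trimmed_weights(n, trimmed_idx, alpha):
    """Hard-trimming weights b_i: 0 on trimmed points, n/(n-k) elsewhere,
    so that each kept observation receives mass 1/(n-k) under (1/n)*sum b_i*delta_i."""
    k = len(trimmed_idx)
    assert k <= alpha * n, "at most a fraction alpha of the data may be trimmed"
    b = [n / (n - k)] * n
    for i in trimmed_idx:
        b[i] = 0.0
    # k <= n*alpha guarantees the bound b_i <= 1/(1-alpha),
    # and (1/n) * sum(b) == 1 so the trimmed measure is a probability.
    assert all(bi <= 1 / (1 - alpha) + 1e-12 for bi in b)
    assert abs(sum(b) / n - 1.0) < 1e-12
    return b
```

For example, trimming 2 of 10 points at level $\alpha = .2$ gives weight $10/8 = 1.25$ to each kept point, exactly the bound $1/(1-\alpha)$.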

If the random generator of the sample were $P$, then the theoretical counterpart of the trimming procedure would be to replace the probability $P(B) = \int_B 1\,dP$ by the new measure
$$P^*(B) = \int_B g\,dP, \quad \text{where } 0 \le g \le \frac{1}{1-\alpha} \text{ and } \int g\,dP = 1. \qquad (1)$$
We call a probability measure like $P^*$ in (1) an $\alpha$-trimming of $P$. We show in Section 2 that all $\alpha$-trimmings of $P$ can be expressed in terms of trimming functions. For a given trimming function, $h$, $P_h$ denotes the corresponding $\alpha$-trimming of $P$. The trimming function $h$ determines which zones in the distribution $P$ are downplayed or removed.

Turning to the measurements-with-errors example, the underlying distributions of the samples, $P$ and $Q$, could be different because of the distortions introduced by the measuring devices, but a suitable trimming function, $h$, could produce $\alpha$-trimmings, $P_h$ and $Q_h$, that are very similar or even equal. The right trimming function generally will be unknown, and

© 2008 American Statistical Association
Journal of the American Statistical Association
June 2008, Vol. 103, No. 482, Theory and Methods
DOI 10.1198/016214508000000274



we should look for the best possible one. This makes sense if we consider a metric, $d$, between probability measures and take
$$h_0 = \operatorname*{argmin}_{h} d(P_h, Q_h). \qquad (2)$$
If the $\alpha$-trimmings $P_{h_0}$ and $Q_{h_0}$ are equal, then we can say that the cores of the distributions coincide. It also would be of interest to check whether these optimally trimmed probabilities are close to one another. Our goal is to introduce and analyze methods for testing the similarity/dissimilarity of trimmed distributions.

In this article we consider the L2-Wasserstein (or Mallows) distance between distributions. Note that in a related work, Munk and Czado (1998) considered a trimmed version of the Wasserstein distance consisting in trimming both distributions solely in their tails and in a symmetric way. In the next section we discuss this approach in our context. However, we want to emphasize that in this article we use impartial trimming not only as a way to robustify a statistical procedure, but also as a method to discard part of the data to achieve the best possible fit between two given samples or between a sample and a theoretical distribution, thus searching for the maximum similarity between them. To the best of our knowledge, this point of view has not been previously considered in the literature and can lead to a new methodology in relation with the similarity concept. However, the fact that the data themselves decide the method of trimming is common to several statistical methodologies (see, e.g., Cuesta, Gordaliza, and Matrán 1997; García-Escudero, Gordaliza, Matrán, and Mayo-Iscar 2008; Gordaliza 1991; Maronna 2005; Rousseeuw 1985), described here by the term "impartial trimming."

The article is organized as follows. In Section 2 we formally introduce the trimming methodology to measure dissimilarities. We present the properties of trimming and a preliminary example describing the innovation of our methodology with respect to the naïve approach of symmetrically trimming to gain robustness. Asymptotics for our dissimilarity measure complete the mathematical analysis considered in Section 2. In Section 3 we compare our methodology with that of Munk and Czado on a real data set, showing the flexibility that impartial trimming introduces in the similarity setup. We give proofs of our results in the Appendix.

2. MEASURING DISSIMILARITIES THROUGH IMPARTIAL TRIMMING

As discussed earlier, we could consider trimmings of a probability distribution on a Borel set simply by considering the conditional probability given that set. But here it is convenient to introduce a slightly more general concept. Trimmed probabilities can be defined in general probability spaces, although for practical purposes we restrict ourselves to probabilities on the real line.

Definition 1. Let $P$ be a probability measure on $\mathbb{R}$ and let $0 < \alpha < 1$. We say that a probability measure $P^*$, on $\mathbb{R}$, is an $\alpha$-trimming of $P$ if $P^*$ is absolutely continuous with respect to $P$ ($P^* \ll P$) and $\frac{dP^*}{dP} \le \frac{1}{1-\alpha}$.

We denote the set of $\alpha$-trimmings of $P$ by $\mathcal{T}_\alpha(P)$; that is, if $\mathcal{P}$ denotes the set of probability measures on $\mathbb{R}$, then
$$\mathcal{T}_\alpha(P) = \left\{ P^* \in \mathcal{P} : P^* \ll P,\ \frac{dP^*}{dP} \le \frac{1}{1-\alpha}\ P\text{-a.s.} \right\}. \qquad (3)$$
The limit case in which $\alpha = 1$, $\mathcal{T}_1(P)$, is just the set of probability measures absolutely continuous with respect to $P$.

An equivalent characterization is that $P^* \in \mathcal{T}_\alpha(P)$ if and only if $P^* \ll P$ and $\frac{dP^*}{dP} = \frac{f}{1-\alpha}$ with $0 \le f \le 1$. If $f$ takes only the values 0 and 1, then it is the indicator of a set, say $A$, such that $P(A) = 1-\alpha$, and trimming corresponds to considering the conditional probability measure $P(\cdot \mid A)$. Definition (3) allows us to reduce the weight of some regions of the sample space without completely removing them from the feasible set.

The following proposition gives a useful characterization

completely removing them from the feasible set. The following proposition gives an useful characterization

of trimmings of a probability distribution in terms of the trim

mings of the U[0, 1] distribution.

Proposition I. Let Ca be the class of absolutely continuous functions h : [0, 1] -> [0, 1] such that, h(0) = 0 and h(\) = 1, with derivative h' such that 0 < t? < j^. For any real proba bility measure P, we have the following:

a. Ta(P) = {P* eV: P*(-oo, t] = h(P(-oo, t]), h e Ca} b. Ta(U[0, 1]) = {P* e V:P*(-oo,t] = h(t),0 < t < 1,

heCa}.

It will be useful to write Pn for the probability measure with distribution function h(P(?oo,t]), leading to Ta(P) =

{Ph:heCa}. To measure closeness between distributions, we resort to the

L2-Wasserstein distance defined in the set V2 of probabilities with finite second moment. For P and Q in V2, y\?i(P, Q) is defined as the lowest L2-distance between random variables

(rv's), defined on any probability space, with distributions P and Q. The measure of closeness, or matching, between P and

Q at a given level, a, or, equivalently, between their distribution functions F and G is now defined by

Ta(P, Q) = ra(F, G) := inf W2(Ph, Qh). (4) heCa

The following alternative expression for Wi(P, ?) is a key aspect to the usefulness of this distance in statistics on the line. If F and G are the distribution functions of P and Q and F~l and G~l are the respective (left-continuous) quantile functions, then the L2-Wasserstein distance between P and Q is given by (see, e.g., Bickel and Freedman 1981)

$$\mathcal{W}_2(P, Q) = \left( \int_0^1 \left(F^{-1}(t) - G^{-1}(t)\right)^2 dt \right)^{1/2}. \qquad (5)$$

Recall that $F^{-1}$ is defined on $(0,1)$ by $F^{-1}(t) = \inf\{s : F(s) \ge t\}$ and that, viewed as a random variable on $(0,1)$ endowed with the Lebesgue measure, its distribution function is $F$. From this, it is obvious that for the probability measures based on two samples (resp. one sample and a theoretical distribution), $\mathcal{W}_2$ coincides with the L2 distance to the diagonal in a Q-Q plot (resp. probability plot).
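For two samples of equal size, the quantile functions in (5) are step functions that match order statistic to order statistic, so the empirical $\mathcal{W}_2$ distance reduces to an L2 average over sorted pairs. A minimal sketch (in Python rather than the authors' R; `w2_distance` is a name chosen here):

```python
def w2_distance(xs, ys):
    """Empirical L2-Wasserstein distance between two equal-size samples.
    Matching the i-th order statistics is exactly the discrete form of (5):
    both empirical quantile functions are constant on ((i-1)/n, i/n]."""
    assert len(xs) == len(ys), "equal sample sizes assumed for this sketch"
    n = len(xs)
    xs, ys = sorted(xs), sorted(ys)
    return (sum((x - y) ** 2 for x, y in zip(xs, ys)) / n) ** 0.5
```

This is also the L2 distance of the Q-Q plot to the diagonal, as noted above.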

From (5) and Proposition 1, we obtain the equivalent expression of (4) as
$$\tau_\alpha(F, G) = \inf_{h \in \mathcal{C}_\alpha} \int_0^1 \left(F^{-1}(t) - G^{-1}(t)\right)^2 h'(t)\,dt. \qquad (6)$$



The infimum in (6) is easily attained at the function $h_0$ below (see Gordaliza 1991), associated with a set with Lebesgue measure $1-\alpha$. We call this minimizer, $h_0$, an impartial $\alpha$-trimming between $P$ and $Q$. Obviously, after Proposition 1, $h_0(F(x))$ and $h_0(G(x))$ are the distribution functions of the impartially $\alpha$-trimmed probabilities.

To analyze (6), let us consider the map $t \mapsto |F^{-1}(t) - G^{-1}(t)|$ as a random variable defined on $(0,1)$ endowed with the Lebesgue measure, $\ell$. Let
$$L_{F,G}(x) := \ell\{t \in (0,1) : |F^{-1}(t) - G^{-1}(t)| \le x\}, \quad x \ge 0,$$
denote its distribution function, and write $L_{F,G}^{-1}$ for the corresponding quantile inverse. If $L_{F,G}$ is continuous at $L_{F,G}^{-1}(1-\alpha)$, then
$$\inf_{h \in \mathcal{C}_\alpha} \int_0^1 \left(F^{-1}(t) - G^{-1}(t)\right)^2 h'(t)\,dt = \int_0^1 \left(F^{-1}(t) - G^{-1}(t)\right)^2 h_0'(t)\,dt, \qquad (7)$$
where
$$h_0'(t) = \frac{1}{1-\alpha}\,\mathbf{1}\!\left\{|F^{-1}(t) - G^{-1}(t)| \le L_{F,G}^{-1}(1-\alpha)\right\}. \qquad (8)$$
In this case $h_0$ is in fact the unique minimizer of the criterion functional.

Even if $L_{F,G}$ is not continuous at $L_{F,G}^{-1}(1-\alpha)$, we can ensure the existence of a set $A_0$ (not necessarily unique) such that $\ell(A_0) = 1-\alpha$ and
$$\{t \in (0,1) : |F^{-1}(t) - G^{-1}(t)| < L_{F,G}^{-1}(1-\alpha)\} \subset A_0 \subset \{t \in (0,1) : |F^{-1}(t) - G^{-1}(t)| \le L_{F,G}^{-1}(1-\alpha)\}.$$
Obviously, if for any such $A_0$ we consider the function $h_0 \in \mathcal{C}_\alpha$ with $h_0' = \frac{1}{1-\alpha}\mathbf{1}_{A_0}$, then the infimum in (6) is attained at $h_0$. Therefore, problem (6) is equivalent to
$$\min_A \left( \frac{1}{1-\alpha} \int_A \left(F^{-1}(t) - G^{-1}(t)\right)^2 dt \right), \qquad (9)$$
where $A$ varies over the Borel sets in $(0,1)$ with Lebesgue measure equal to $1-\alpha$.
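Problem (9) suggests a direct plug-in computation of $\tau_\alpha$ for two equal-size samples: evaluate the squared quantile differences on the grid $i/n$, keep the set of smallest values with Lebesgue measure $1-\alpha$, and renormalize by $1/(1-\alpha)$. A sketch under those assumptions (Python, not the authors' R code; `tau_alpha` is a name chosen here):

```python
def tau_alpha(xs, ys, alpha):
    """Impartially trimmed squared distance, following (9):
    among the squared differences of matched order statistics
    (each carrying Lebesgue measure 1/n), keep the smallest
    fraction 1 - alpha and renormalize by 1/(1 - alpha)."""
    assert len(xs) == len(ys), "equal sample sizes assumed for this sketch"
    n = len(xs)
    d2 = sorted((x - y) ** 2 for x, y in zip(sorted(xs), sorted(ys)))
    keep = round((1 - alpha) * n)      # number of grid points retained
    return sum(d2[:keep]) / n / (1 - alpha)
```

With $\alpha = 0$ this reduces to the untrimmed squared Wasserstein distance; for $\alpha > 0$ the trimmed set is chosen by the data, not by its position in the tails.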

2.1 Comparison With Symmetric Trimming

Munk and Czado (1998) (see also Czado and Munk 1998; Freitag, Czado, and Munk 2007) considered a trimmed version of the Wasserstein distance for the assessment of similarity between the distribution functions $F$ and $G$ as
$$T_\alpha(F, G) := \left( \frac{1}{1-\alpha} \int_{\alpha/2}^{1-\alpha/2} \left(F^{-1}(t) - G^{-1}(t)\right)^2 dt \right)^{1/2}. \qquad (10)$$

Note that the right side of the foregoing expression equals $\mathcal{W}_2(P_\alpha, Q_\alpha)$, where $P_\alpha$ is the probability measure with distribution function
$$F_\alpha(t) = \frac{1}{1-\alpha}\left(F(t) - \alpha/2\right), \quad F^{-1}(\alpha/2) \le t \le F^{-1}(1-\alpha/2), \qquad (11)$$

[Figure 1 here: histograms of cholesterol levels (0-800) for Center 1 ($\alpha$ = .1) and Center 2 ($\alpha$ = .1).]

Figure 1. Histograms of trimmed data (cholesterol levels) in two clinical centers. The white part of the bars shows the trimming proportion in the associated zone.

and similarly for $Q_\alpha$. When comparing two samples, this corresponds to the distance between the sample distributions associated with the symmetrically trimmed samples. This naïve way of trimming is widely used and confers protection against contamination by outliers. However, the arbitrariness in the choice of the trimming zones has been widely reported as a serious drawback of procedures based on this method (see, e.g., Cuesta et al. 1997; García-Escudero et al. 2008; Gordaliza 1991; Rousseeuw 1985). In our setting, the question is why two distributions that are very different in their tails are considered similar, but ones that differ in their central parts are considered nonsimilar.
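For contrast with the impartial version, the symmetric proposal (10) always removes mass $\alpha/2$ from each tail of the quantile scale, regardless of where the samples actually disagree. A matching sketch (same caveats as before: Python instead of the authors' R, name chosen here, and an integer-grid treatment of the integration limits as a simplification):

```python
def tau_alpha_symmetric(xs, ys, alpha):
    """Symmetric (tail-trimmed, Munk-Czado style) squared distance after (10):
    integrate the squared quantile difference over (alpha/2, 1 - alpha/2)
    on the order-statistic grid, then renormalize by 1/(1 - alpha)."""
    assert len(xs) == len(ys), "equal sample sizes assumed for this sketch"
    n = len(xs)
    xs, ys = sorted(xs), sorted(ys)
    lo, hi = int(n * alpha / 2), n - int(n * alpha / 2)  # symmetric tail cuts
    d2 = [(xs[i] - ys[i]) ** 2 for i in range(lo, hi)]
    return sum(d2) / n / (1 - alpha)
```

A single outlier in one tail is absorbed by both procedures; a discrepancy in the central part of the distributions is absorbed only by the impartial one.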

To get an idea of the differences between our approach and symmetrical trimming, let us recall example 1 of Munk and Czado (1998), which corresponds to a multiclinical study on cholesterol and fibrinogen levels in two sets of patients (of sizes 116 and 141) in two clinical centers. For the fibrinogen data, our impartial trimming proposal for $\alpha = .1$ essentially coincides with the symmetrical trimming. But Figure 1 displays the effects of our trimming proposal for the cholesterol data, showing a significant trimming in the middle part of the histograms of both centers as well. This even improves on the level of similarity shown by Munk and Czado (1998), strengthening their assessment of similarity on these data, but also provides a descriptive look at the way in which both populations (dis)agree. The subsequent analysis of the trimmed data can be very useful in a global comparison of the populations.

2.2 Nonparametric Test of Similarity

As is usual in many statistical analyses, the interest of statisticians when analyzing similarity of distributions lies in asserting the equivalence of the involved probability distributions. In hypothesis testing this is achieved by taking equivalence or similarity as the alternative hypothesis, whereas dissimilarity is the null hypothesis. In agreement with this point of view, Munk and Czado (1998) considered the testing problem with the null hypothesis that the trimmed distance (10) exceeds some value $\Delta$, a threshold to be analyzed by the experimenters and statisticians in an ad hoc way. Graphics of p values for different $\Delta$ values (see Fig. 4 in Sec. 3) play a key role in this analysis, and the fact that our measure of dissimilarity, $(\tau_\alpha(F, G))^{1/2}$, is measured on the same scale as the variable of interest favors this goal. We also note that it is the Wasserstein distance between trimmed versions of the original distributions. This allows us to



handle the very nice properties of this distance (see, e.g., Bickel and Freedman 1981) in a friendly way in connection with our problem.

Let $X_1, \ldots, X_n$ (resp. $Y_1, \ldots, Y_m$) be iid observations with common distribution function $F$ (resp. $G$), and let $X_{(1)}, \ldots, X_{(n)}$ (resp. $Y_{(1)}, \ldots, Y_{(m)}$) be the corresponding ordered samples. We base our test of $H_0 : \tau_\alpha(F, G) \ge \Delta_0^2$ against $H_a : \tau_\alpha(F, G) < \Delta_0^2$ on the empirical counterparts of $\tau_\alpha(F, G)$, namely $T_{n,\alpha} := \tau_\alpha(F_n, G)$, where $F_n$ denotes the empirical distribution function based on the data in the one-sample problem, and $T_{n,m,\alpha} := \tau_\alpha(F_n, G_m)$ in the two-sample case. Our next results show that under some mild assumptions on $F$ and $G$, $T_{n,\alpha}$ and $T_{n,m,\alpha}$ are asymptotically normal, a fact that we use later to approximate the critical values of $H_0$ against $H_a$. For notational reasons, in the Appendix we give the proof only of the one-sample statement.

To obtain the asymptotic behavior of our statistics, we assume that
$$F \text{ and } G \text{ have absolute moments of order } 4+\delta, \text{ for some } \delta > 0. \qquad (12)$$
A further regularity assumption is that $F$ has a continuously differentiable density, $F' = f$, such that
$$\sup_{x \in \mathbb{R}} \frac{F(x)\left(1 - F(x)\right)|f'(x)|}{f^2(x)} < \infty. \qquad (13)$$
Additional notation includes $h_0$ as defined in (8) and
$$l(t) := \int_{F^{-1}(1/2)}^{F^{-1}(t)} \left(x - G^{-1}(F(x))\right) h_0'(F(x))\,dx \qquad (14)$$
and
$$s_{n,\alpha}^2(G) := \frac{4}{(1-\alpha)^2} \sum_{i,j=1}^{n-1} \left( \frac{\min(i,j)}{n} - \frac{i}{n}\frac{j}{n} \right) a_{n,i}\,a_{n,j}, \qquad (15)$$
where
$$a_{n,i} = \left(X_{(i+1)} - X_{(i)}\right)\left(\frac{X_{(i+1)} + X_{(i)}}{2} - G^{-1}(i/n)\right) \times \mathbf{1}\!\left\{|X_{(i)} - G^{-1}(i/n)| \le L_{F_n,G}^{-1}(1-\alpha)\right\}. \qquad (16)$$

Theorem 2. Assume that $F$ and $G$ satisfy (12) and (13) and that $L_{F,G}$ is continuous at $L_{F,G}^{-1}(1-\alpha)$. Then $\sqrt{n}\left(T_{n,\alpha} - \tau_\alpha(F, G)\right)$ is asymptotically centered normal with variance
$$\sigma_\alpha^2(F, G) = 4\left( \int_0^1 l^2(t)\,dt - \left( \int_0^1 l(t)\,dt \right)^2 \right). \qquad (17)$$
This asymptotic variance can be consistently estimated by $s_{n,\alpha}^2(G)$ given by (15). If $G$ also satisfies (13) and $\frac{n}{n+m} \to \lambda \in (0,1)$, then $\sqrt{\frac{nm}{n+m}}\left(T_{n,m,\alpha} - \tau_\alpha(F, G)\right)$ is asymptotically centered normal with variance $(1-\lambda)\sigma_\alpha^2(F, G) + \lambda\sigma_\alpha^2(G, F)$. This variance can be consistently estimated by
$$s_{n,m,\alpha}^2 = \frac{m}{n+m}\,s_{n,\alpha}^2(G_m) + \frac{n}{n+m}\,s_{m,\alpha}^2(F_n).$$

If $\tau_\alpha(F, G) = 0$, then Theorem 2 reduces to $\sqrt{n}\,T_{n,\alpha} \to 0$ in probability; note that $\tau_\alpha(F, G) = 0$ implies that $\left(x - G^{-1}(F(x))\right)^2 h_0'(F(x)) = 0$ for almost every $x$, and thus $\sigma_\alpha^2(F, G) = 0$. This generally would suffice for our applications, but we also give the exact rate and the limiting distribution in the Appendix.


Figure 2. Histograms for variable GPA. (a) Males; (b) females;

(c) computer science students; (d) engineering students.

3. EXAMPLE AND SIMULATIONS

Our analysis is based on the variable college grade point average (GPA) collected from a group of 234 students. This variable takes values between 0 and 4. The students are classified by the variables gender and major (1 = computer science, 2 = engineering, 3 = other sciences). We are interested in studying the distributional similarity of the GPA between males ($n = 117$) and females ($m = 117$), and also between students with a major in computer science ($n = 78$) and students with a major in engineering ($m = 78$). Figure 2 shows the histogram for each sample.

Comparisons of these samples using classical procedures produce the results displayed in Table 1. Because the Shapiro-Wilk tests reject the normality of the four samples, we use nonparametric methods like the Kolmogorov-Smirnov test (KS) or the Wilcoxon-Mann-Whitney test (WMW) to analyze the null hypothesis that both samples come from the same distribution in the comparisons of GPA by sex and GPA by major. The p values of these tests clearly reject the null hypotheses.

Under the possibility of impartially trimming both samples as described in Section 2, we obtain the optimal trimming functions displayed in Figure 3. In this figure, and for each comparison, we plot the value of $|F^{-1}(t) - G^{-1}(t)|$ and the cutting values $L_{F_n,G_m}^{-1}(1-\alpha)$ for $\alpha = .05$, .1, and .2. Figure 3(a) shows that the optimal trimming involves the lower tail, but not exactly from the lower end point. When the trimming level grows

Table 1. Two-sample p values for classical tests

                                    p value
Test                       GPA by gender   GPA by major
Shapiro-Wilk (sample 1)        .0176           .0360
Shapiro-Wilk (sample 2)        .0217           .0001
KS                             .0028           .0040
WMW                            .0004           .0175


[Figure 3 here: two panels plotting $|F^{-1}(t) - G^{-1}(t)|$ against $t \in [0,1]$.]

Figure 3. Trimming functions. (a) GPA by gender; (b) GPA by major. (— $\alpha$ = .05; - - $\alpha$ = .1; ··· $\alpha$ = .2.)

($\alpha = .1$ and .2), the trimmed zone is not an interval, and it includes points around percentiles 20%, 40%, 60%, and 70%. Figure 3(b) shows that the points that should be trimmed to make both samples more similar are between percentiles 10% and 30%. This example illustrates a nonsymmetrical dissimilarity between samples; in fact, in the first comparison, the less similar zone is close to the lower tail, but not to the upper tail, where the values are more similar.

3.1 p Value Curve

To gain some insight into the assessment of the similarity or dissimilarity of the underlying distributions, we can use the p value curve to test the null hypothesis $H_0 : \tau_\alpha(F, G) \ge \Delta_0^2$ against $H_a : \tau_\alpha(F, G) < \Delta_0^2$. In the two-sample comparison case, we use the statistic
$$Z_{n,m,\alpha} = \sqrt{\frac{nm}{n+m}}\,\frac{T_{n,m,\alpha} - \Delta_0^2}{s_{n,m,\alpha}}. \qquad (18)$$

To obtain the values of $T_{n,m,\alpha}$, we compute $|F^{-1}(t) - G^{-1}(t)|^2$ over a grid in $[0,1]$, using the $(1-\alpha)$-quantile of these values to determine $L_{F_n,G_m}^{-1}(1-\alpha)$. The integral is then calculated numerically. The computation of $s_{n,m,\alpha}$ is done similarly. The asymptotic p value curve, $P(\Delta_0)$, is defined as
$$P(\Delta_0) := \sup_{(F,G) \in H_0} \lim_{n,m \to \infty} P_{F,G}(Z_{n,m,\alpha} \le z_0) = \Phi(z_0),$$
where $z_0$ is the observed value of $Z_{n,m,\alpha}$. (Note that the supremum is attained when the distance between both distributions is exactly $\Delta_0$.) These asymptotic p value curves can be used in two ways. On one hand, given a fixed value of $\Delta_0$ that controls the degree of dissimilarity, it is possible to find the p value associated with the corresponding null hypothesis to decide whether or not the distributions are similar. On the other hand, given a fixed test level (p value), we can find the value of $\Delta_0$ such that for every $\Delta > \Delta_0$ we should reject the hypothesis $H_0 : \tau_\alpha(F, G) \ge \Delta^2$. In this way, we can get a sound idea of the degree of dissimilarity between the distributions. To handle the values of $\Delta_0$, the experimenter should take into account how to interpret the Wasserstein distance, recalling that in the case where $F$ and $G$ belong to the same location family, their Wasserstein distance is the absolute difference of their locations.
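The normal approximation of Theorem 2 turns an observed value of (18) into an asymptotic p value via $\Phi$. A sketch (Python's standard-library `statistics.NormalDist` stands in for the normal cdf; `p_value` is a name chosen here):

```python
from statistics import NormalDist

def p_value(tau_nm, s_nm, n, m, delta0):
    """Asymptotic p value for H0: tau_alpha(F, G) >= delta0**2,
    using the normal limit of the statistic Z_{n,m,alpha} in (18).
    tau_nm: empirical trimmed squared distance; s_nm: estimated std dev."""
    z = (n * m / (n + m)) ** 0.5 * (tau_nm - delta0 ** 2) / s_nm
    # small p values reject dissimilarity, i.e., support similarity
    return NormalDist().cdf(z)
```

Sweeping `delta0` over a grid of thresholds traces the p value curve $P(\Delta_0)$ used in Figure 4.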

Figure 4 illustrates the improved assessment obtained by impartial trimming over the Munk and Czado methodology. It displays the p value curves using impartial trimming and symmetrical trimming for both comparisons for different trimming levels ($\alpha = .05$, .1, and .2). For each plot, a horizontal line marks a reference level for the test (.05). The GPAs of males and females are similar up to $\Delta_0$ ranging from .32 to .36 (depending on the trimming size) when impartial trimmings are used. These values represent between $100 \times .32/2.815 = 11.4\%$ and 12.8% of the average of the medians of the samples. But when using symmetrical trimmings, the horizontal line cuts the p value curves for $\Delta_0$ ranging from .56 to .59, between 20% and 21% of the average of the medians. A similar analysis of the comparison of GPAs by major leads us to values of $\Delta_0$ ranging from .29 to .36, which represent between 9.6% and 11.9% of the average of the medians when using impartial trimming. Instead, when using symmetrical trimming, these percentages range from 16.6% to 19.5%.

[Figure 4 here: two panels of p value curves.]

Figure 4. p value curves using impartial and symmetrical (MC) trimmings. (a) GPA by gender; (b) GPA by major. [$\alpha$ = .051; $\alpha$ = .051 (MC); $\alpha$ = .102; $\alpha$ = .102 (MC); $\alpha$ = .205; $\alpha$ = .205 (MC).]

3.2 Simulations

We end this section by reporting a small simulation study to illustrate our procedure's performance for finite samples for testing $H_0 : \tau_\alpha(F, G) \ge \Delta_0^2$ against $H_a : \tau_\alpha(F, G) < \Delta_0^2$ in the two-sample problem. We considered two different contaminated normal models, two different trimming sizes, and several values of the threshold $\Delta_0$. In each situation we generated 10,000 replicas of the trimmed score $Z_{n,m,\alpha}$ as defined in (18) for several values of $n = m$. We compared these replicas with the .05 theoretical quantile of the standard normal distribution, rejecting $H_0$ for observed values smaller than this quantity. Table 2 shows the observed rejection frequencies. We find good agreement with our asymptotic results even for moderate sample sizes, with low rejection frequencies for thresholds $\Delta_0$ smaller than the true distance and high rejection frequencies otherwise. When the threshold equals the true distance, we also can see how the observed frequency approximates the nominal level.
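The contaminated-normal setting of Table 2 is easy to reproduce in outline. The sketch below (Python instead of the authors' R; function names chosen here) draws one pair of samples from the first model and shows that impartial trimming absorbs most of the distance created by the contamination; it stops short of the full 10,000-replica power study.

```python
import random

def contaminated(n, eps, shift, rng):
    """Draw n points from the mixture (1 - eps) N(0, 1) + eps N(shift, 1)."""
    return [rng.gauss(shift if rng.random() < eps else 0.0, 1.0)
            for _ in range(n)]

def tau_alpha(xs, ys, alpha):
    """Impartially trimmed squared distance on matched order statistics,
    as in (9): keep the smallest fraction 1 - alpha of squared differences."""
    n = len(xs)
    d2 = sorted((x - y) ** 2 for x, y in zip(sorted(xs), sorted(ys)))
    keep = round((1 - alpha) * n)
    return sum(d2[:keep]) / n / (1 - alpha)

rng = random.Random(0)
xs = contaminated(1000, 0.05, 5.0, rng)    # P in the first model of Table 2
ys = contaminated(1000, 0.05, -5.0, rng)   # Q in the first model of Table 2
print(tau_alpha(xs, ys, 0.05))  # trimming absorbs most of the contamination
print(tau_alpha(xs, ys, 0.0))   # the untrimmed distance is much larger
```

Repeating this draw 10,000 times, forming the score (18) with the variance estimate, and counting rejections would reproduce one cell of Table 2.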

4. CONCLUSIONS AND POSSIBLE EXTENSIONS

We have introduced a procedure to compare two samples or probability distributions on the real line based on the impartial trimming methodology. The procedure is designed mainly to assess similarity of the cores of two samples by discarding the part of the data that has a greater influence on the dissimilarity of the distributions. Our method is based on trimming the corresponding samples according to the same trimming function, but it allows nonsymmetrical trimming; thus it can greatly improve the previous methodology based on simply trimming the tails.

We have evaluated the performance of our procedure through an analysis of some real data samples that emphasized the appealing possibilities in data analysis and the significance of the analysis of the p value curves for assessing similarities. A simulation study also has provided evidence about the behavior of the procedure for finite samples, in agreement with the asymptotic results. Although we treated only dissimilarities based on the Wasserstein distance, other metrics or dissimilarities could be handled under the same scheme.

Representation of trimmings of any distribution in terms of those of the uniform distribution is no longer possible in the multivariate setting. However, under very general assumptions, it has been proven (see Cuesta and Matrán 1989) that given two probabilities $P$ and $Q$ on $\mathbb{R}^k$, there exists an "optimal transport map" $T$ such that $Q = P \circ T^{-1}$, and if $X$ is any random vector with law $P$, then $E\|X - T(X)\|^2 = W_2^2(P, Q)$. Moreover, if $P_\alpha$ is an $\alpha$-trimming of $P$, then $P_\alpha \circ T^{-1}$ is an $\alpha$-trimming of $Q$ and $T$ is an optimal map between $P_\alpha$ and $P_\alpha \circ T^{-1}$, so the multivariate version of (4) would be the minimization, over the set of $\alpha$-trimmings of $P$, of the expression $W_2^2(P_\alpha, P_\alpha \circ T^{-1})$. We also should mention that obtaining the optimal map $T$ remains an open problem for $k > 1$. Although these are troubling facts, obtaining the optimal trimming between two samples is already possible through standard optimization procedures. A final difficulty concerns the asymptotic behavior of the involved statistics, to which the techniques used in our proofs do not extend.
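For equal-size samples in $\mathbb{R}^k$, one concrete instance of such a standard optimization procedure can be sketched as follows. This is a heuristic of ours, not the article's algorithm (the function name and the match-then-trim shortcut are assumptions): compute the empirical optimal matching between the two samples and then discard the $\alpha$-fraction of matched pairs with the largest cost. The jointly optimal trimming would instead solve a linear program over trimmed transport plans.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def trimmed_match_cost(x, y, alpha=0.1):
    """Heuristic sketch: optimally pair two equal-size samples in R^k under
    squared Euclidean cost (empirical optimal transport), then discard the
    alpha-fraction of pairs with the largest cost, mimicking a trimming of
    the empirical optimal transport plan."""
    # n x n matrix of squared Euclidean distances between sample points
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
    rows, cols = linear_sum_assignment(cost)       # optimal one-to-one matching
    pair_costs = np.sort(cost[rows, cols])
    k = int(np.ceil((1 - alpha) * len(pair_costs)))  # pairs kept after trimming
    return pair_costs[:k].mean()
```

The assignment step is the expensive part (cubic in the sample size), which is consistent with the remark that the population-level optimal map $T$ is unavailable for $k > 1$ while the empirical problem remains solvable.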

Table 2. Simulated powers for the trimmed scores Z_{n,α} (α = .05)

                        P = .95N(0,1) + .05N(5,1)    P = .9N(0,1) + .1N(5,1)
                        Q = .95N(0,1) + .05N(-5,1)   Q = .9N(0,1) + .1N(-5,1)
                        [(1-α)τ_α(P,Q) = .384]       [(1-α)τ_α(P,Q) = 1.004]

(1-α)Δ_0        n       Frequency                    Frequency
  .25           100     .0320                        .0028
                200     .0268                        0
                500     .0086                        0
                1,000   .0021                        0
                5,000   0                            0
  .5            100     .1412                        .0109
                200     .1633                        .0031
                500     .2264                        .0002
                1,000   .3134                        0
                5,000   .7648                        0
 1              100     .4912                        .0850
                200     .6957                        .0727
                500     .9474                        .0657
                1,000   .9989                        .0584
                5,000   1.0000                       .0486



Allowing trimming in both samples with different trimming functions would provide an interesting alternative to our present proposal. Through our research, still in progress, we have identified a radically different behavior from that presented in this article for identical trimming in both samples.

APPENDIX: PROOFS AND FURTHER RESULTS

In this appendix, $\rho_n(t) = \sqrt{n}\,f(F^{-1}(t))(F^{-1}(t) - F_n^{-1}(t))$ denotes the weighted quantile process, where $f$ is the density function of $F$.

Proof of Proposition 1

Let $A = \{P^* \in \mathcal{P} : P^*(-\infty, t] = h(P(-\infty, t]),\ h \in \mathcal{C}_\alpha\}$. For $P^* \in A$, absolute continuity of $h$ entails

$$P^*(s, t] = h(P(-\infty, t]) - h(P(-\infty, s]) = \int_{P(-\infty, s]}^{P(-\infty, t]} h'(x)\,dx \le \frac{1}{1-\alpha}\,P(s, t].$$

Thus $P^* \ll P$ and $\frac{dP^*}{dP} \le \frac{1}{1-\alpha}$, and, therefore, $P^* \in \mathcal{T}_\alpha(P)$.

Conversely, given $P^* \in \mathcal{T}_\alpha(P)$, if $F$ is the distribution function of $P$ and we define $h(t) = \int_0^t \frac{dP^*}{dP}(F^{-1}(s))\,ds$, then it is immediate that $h \in \mathcal{C}_\alpha$ and

$$P^*(-\infty, t] = \int_{-\infty}^{t} \frac{dP^*}{dP}(s)\,dF(s) = \int_0^{F(t)} \frac{dP^*}{dP}(F^{-1}(s))\,ds = h(P(-\infty, t]).$$

Therefore, $P^* \in A$, and part a is proven. The proof of part b is immediate from the proof of part a.
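As a concrete instance of this correspondence (our illustration, not part of the original proof), the $\alpha$-trimming that removes the lower $\alpha$-fraction of $P$ corresponds to the trimming function

$$h(t) = \begin{cases} 0, & 0 \le t \le \alpha, \\ \dfrac{t - \alpha}{1 - \alpha}, & \alpha < t \le 1, \end{cases} \qquad h'(t) = \frac{1}{1-\alpha}\,I(t > \alpha),$$

which is absolutely continuous with $0 \le h' \le \frac{1}{1-\alpha}$, $h(0) = 0$, and $h(1) = 1$, so $h \in \mathcal{C}_\alpha$, and $P^*(-\infty, t] = h(P(-\infty, t])$ is the corresponding element of $\mathcal{T}_\alpha(P)$.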

The following lemmas collect some results that we use in our proofs of Theorems 2 and A.1. These results can be easily proven using Schwarz's inequality, standard arguments from empirical process theory, or the Arzelà-Ascoli theorem.

Lemma A.1. If $F$ and $G$ have finite absolute moment of order $r > 4$, then the following hold:

a. $\sqrt{n}\int_0^{1/n} (F^{-1}(t))^2\,dt \to 0$ and $\sqrt{n}\int_{1-1/n}^{1} (F^{-1}(t))^2\,dt \to 0$.

b. $\sqrt{n}\int_0^{1/n} (F_n^{-1}(t))^2\,dt \to 0$ and $\sqrt{n}\int_{1-1/n}^{1} (F_n^{-1}(t))^2\,dt \to 0$ in probability.

c. $\int_0^1 \frac{\sqrt{t(1-t)}}{g(G^{-1}(t))}\,|F^{-1}(t) - G^{-1}(t)|\,dt < \infty$.

d. Furthermore, if $G$ satisfies (13), then $\frac{1}{\sqrt{n}}\int_{1/n}^{1-1/n} \frac{t(1-t)}{g^2(G^{-1}(t))}\,dt \to 0$.

Lemma A.2. Under the $\|\cdot\|_\infty$ topology, the set $\mathcal{C}_\alpha$ in Proposition 1 and the set $\mathcal{C}_\alpha(F, G) = \{h \in \mathcal{C}_\alpha : \int_0^1 (F^{-1}(t) - G^{-1}(t))^2 h'(t)\,dt = 0\}$, for $F, G \in \mathcal{P}_2$, are compact.

Proof of Theorem 2

From theorem 6.2.1 of Csörgő and Horváth (1993) and (13), we can assume that there exist Brownian bridges $B_n$ satisfying

$$\sup_{1/n \le t \le 1-1/n} \frac{n^{1/2-\nu}\,|\rho_n(t) - B_n(t)|}{(t(1-t))^{\nu}} = \begin{cases} O_P(\log n), & \text{if } \nu = 0, \\ O_P(1), & \text{if } 0 < \nu \le 1/2. \end{cases} \quad (A.1)$$

Now we set $M_n(h) = \sqrt{n}\int_0^1 (F_n^{-1}(t) - G^{-1}(t))^2 h'(t)\,dt$ and

$$N_n(h) = 2\int_{1/n}^{1-1/n} \frac{B_n(t)}{f(F^{-1}(t))}\,(G^{-1}(t) - F^{-1}(t))\,h'(t)\,dt + \sqrt{n}\int_{1/n}^{1-1/n} (G^{-1}(t) - F^{-1}(t))^2 h'(t)\,dt.$$

Observe that

$$\sup_{h \in \mathcal{C}_\alpha} |M_n(h) - N_n(h)| \le \frac{\sqrt{n}}{1-\alpha}\int_0^{1/n} (F_n^{-1}(t) - G^{-1}(t))^2\,dt + \frac{\sqrt{n}}{1-\alpha}\int_{1-1/n}^{1} (F_n^{-1}(t) - G^{-1}(t))^2\,dt$$
$$\qquad + \frac{2}{(1-\alpha)\sqrt{n}}\int_{1/n}^{1-1/n} \frac{(\rho_n(t) - B_n(t))^2}{f^2(F^{-1}(t))}\,dt + \frac{2}{(1-\alpha)\sqrt{n}}\int_{1/n}^{1-1/n} \frac{B_n^2(t)}{f^2(F^{-1}(t))}\,dt$$
$$\qquad + \frac{2}{1-\alpha}\int_{1/n}^{1-1/n} \frac{|\rho_n(t) - B_n(t)|}{f(F^{-1}(t))}\,|G^{-1}(t) - F^{-1}(t)|\,dt =: A_{n,1} + A_{n,2} + A_{n,3} + A_{n,4} + A_{n,5}.$$

Lemma A.1 implies that $A_{n,1} \to 0$ and $A_{n,2} \to 0$ in probability. From (A.1), we get $A_{n,3} \le O_P(1)\,\frac{1}{\sqrt{n}}\int_{1/n}^{1-1/n} \frac{t(1-t)}{f^2(F^{-1}(t))}\,dt$, and this last integral tends to 0 by Lemma A.1 (applied with the roles of $F$ and $G$ interchanged). Thus $A_{n,3} \to 0$ in probability. Similarly, $A_{n,4} \to 0$ in probability. Finally, (A.1) yields

$$A_{n,5} \le O_P(1)\,n^{\nu - 1/2}\int_{1/n}^{1-1/n} \frac{(t(1-t))^{\nu}}{f(F^{-1}(t))}\,|G^{-1}(t) - F^{-1}(t)|\,dt$$

for some $\nu \in (0, 1/2)$. Lemma A.1 shows that $\int_0^1 \frac{\sqrt{t(1-t)}}{f(F^{-1}(t))}\,|G^{-1}(t) - F^{-1}(t)|\,dt < \infty$. Thus, by dominated convergence, we obtain that $A_{n,5} \to 0$ in probability. Collecting the foregoing estimates, we obtain $\sup_{h \in \mathcal{C}_\alpha} |M_n(h) - N_n(h)| \to 0$ in probability, and thus $\sqrt{n}(T_{n,\alpha} - S_{n,\alpha}) \to 0$ in probability, where $\sqrt{n}\,S_{n,\alpha} = \inf_{h \in \mathcal{C}_\alpha} N_n(h)$. Therefore, we need only show that $\sqrt{n}(S_{n,\alpha} - \tau_\alpha(F, G)) \to_w N(0, \sigma^2(F, G))$, where

$$\sqrt{n}\,S_{n,\alpha} = \inf_{h \in \mathcal{C}_\alpha}\left[\,2\int_0^1 \frac{B_n(t)}{f(F^{-1}(t))}\,(G^{-1}(t) - F^{-1}(t))\,h'(t)\,dt + \sqrt{n}\int_0^1 (G^{-1}(t) - F^{-1}(t))^2 h'(t)\,dt\right].$$

Let us denote

$$h_0 = \arg\min_{h \in \mathcal{C}_\alpha} \int_0^1 (F^{-1}(t) - G^{-1}(t))^2 h'(t)\,dt$$

and

$$h_n = \arg\min_{h \in \mathcal{C}_\alpha} \int_0^1 \left[(F^{-1}(t) - G^{-1}(t))^2 + \frac{2}{\sqrt{n}}\,\frac{B_n(t)}{f(F^{-1}(t))}\,(G^{-1}(t) - F^{-1}(t))\right] h'(t)\,dt.$$

Clearly, $h'_n(t) \to h'_0(t)$ for almost every $t$. Furthermore, optimality of $h_n$ shows that

$$B_n := \sqrt{n}\,S_{n,\alpha} - 2\int_0^1 \frac{B_n(t)}{f(F^{-1}(t))}\,(G^{-1}(t) - F^{-1}(t))\,h'_0(t)\,dt - \sqrt{n}\int_0^1 (F^{-1}(t) - G^{-1}(t))^2 h'_0(t)\,dt \le 0,$$

but, in contrast,

$$B_n = \sqrt{n}\left(\int_0^1 (F^{-1}(t) - G^{-1}(t))^2 h'_n(t)\,dt - \int_0^1 (F^{-1}(t) - G^{-1}(t))^2 h'_0(t)\,dt\right)$$
$$\qquad + 2\int_0^1 \frac{B_n(t)}{f(F^{-1}(t))}\,(G^{-1}(t) - F^{-1}(t))\,(h'_n(t) - h'_0(t))\,dt =: B_{n,1} + B_{n,2},$$



and $B_{n,1} \ge 0$ by optimality of $h_0$, whereas $B_{n,2} = o_P(1)$ by the dominated convergence theorem. Therefore, $B_n \to 0$ in probability, which shows that

$$\sqrt{n}(S_{n,\alpha} - \tau_\alpha(F, G)) \to_w 2\int_0^1 \frac{B(t)}{f(F^{-1}(t))}\,(G^{-1}(t) - F^{-1}(t))\,h'_0(t)\,dt. \quad (A.2)$$

Integrating by parts, we obtain $\int_0^1 B(t)\,\frac{G^{-1}(t) - F^{-1}(t)}{f(F^{-1}(t))}\,h'_0(t)\,dt = -\int_0^1 l(t)\,dB(t)$, which proves the asymptotic normality and the expression (17) for the variance. The claim about the variance estimator readily follows by noting that $s_\alpha^2(G) = 4\left(\int_0^1 l_n^2(t)\,dt - \left(\int_0^1 l_n(t)\,dt\right)^2\right)$, where $l_n(t) = \int_{F_n^{-1}(1/2)}^{F_n^{-1}(t)} (x - G^{-1}(F_n(x)))\,h'_n(F_n(x))\,dx$ and $h_n = \arg\min_{h \in \mathcal{C}_\alpha} \int_0^1 (F_n^{-1} - G^{-1})^2 h'$. It can be shown that, with probability 1, $l_n(t) \to l(t)$ for almost every $t \in (0, 1)$. A standard uniform integrability argument completes the proof.

The final result in this section establishes the asymptotic behavior of $nT_{n,\alpha}$ when $F$ and $G$ are equivalent at trimming level $\alpha$. Recall the definition of $\mathcal{C}_\alpha(F, G)$ in Lemma A.2 and note that $\mathcal{C}_\alpha(F, F) = \mathcal{C}_\alpha$, but also that for $F \ne G$, $\mathcal{C}_\alpha(F, G)$ is a proper subset of $\mathcal{C}_\alpha$. Also note that $\mathcal{C}_\alpha(F, G) \ne \emptyset$ if and only if $\tau_\alpha(F, G) = 0$. In fact, the size of $\mathcal{C}_\alpha(F, G)$ depends on the Lebesgue measure of the set $\{t \in (0, 1) : F^{-1}(t) \ne G^{-1}(t)\}$: $\tau_\alpha(F, G) = 0$ if and only if the measure of this last set is at most $\alpha$; if it equals $\alpha$, then the only function in $\mathcal{C}_\alpha(F, G)$ corresponds to $h'(t) = \frac{1}{1-\alpha}\,I(F^{-1}(t) = G^{-1}(t))$.

Theorem A.1. If $\tau_\alpha(F, G) = 0$, $F$ satisfies (13), and

$$\int_0^1 \frac{t(1-t)}{f^2(F^{-1}(t))}\,dt < \infty, \quad (A.3)$$

then

$$nT_{n,\alpha} \to_w \min_{h \in \mathcal{C}_\alpha(F, G)} \int_0^1 \frac{B^2(t)}{f^2(F^{-1}(t))}\,h'(t)\,dt,$$

where $\{B(t)\}_{0 \le t \le 1}$ is a Brownian bridge. Because $h \mapsto \int_0^1 B^2(t)/f^2(F^{-1}(t))\,h'(t)\,dt$ is $\|\cdot\|_\infty$-continuous as a function of $h$, it attains its minimum value on $\mathcal{C}_\alpha(F, G)$.

Proof. We define $D_n(h) = n\int_0^1 (F_n^{-1}(t) - G^{-1}(t))^2 h'(t)\,dt$ and $D(h) = \int_0^1 B^2(t)/f^2(F^{-1}(t))\,h'(t)\,dt$ for $h \in \mathcal{C}_\alpha$. Note that

$$D_n(h) = \int_0^1 \frac{\rho_n^2(t)}{f^2(F^{-1}(t))}\,h'(t)\,dt + n\int_0^1 (F^{-1}(t) - G^{-1}(t))^2 h'(t)\,dt + 2\sqrt{n}\int_0^1 \frac{\rho_n(t)}{f(F^{-1}(t))}\,(G^{-1}(t) - F^{-1}(t))\,h'(t)\,dt.$$

Also observe that $nT_{n,\alpha} = D_n(h_n)$ for some $h_n \in \mathcal{C}_\alpha$. If $h \in \mathcal{C}_\alpha(F, G)$, then the second and third summands on the right side vanish and $D_n(h) = \int_0^1 \frac{\rho_n^2(t)}{f^2(F^{-1}(t))}\,h'(t)\,dt$. By (13), (A.3), and almost sure representation of weak convergence, versions of $\rho_n(\cdot)/f(F^{-1}(\cdot))$ and $B(\cdot)/f(F^{-1}(\cdot))$ exist (for which we keep the same notation) such that $\|\rho_n(\cdot)/f(F^{-1}(\cdot)) - B(\cdot)/f(F^{-1}(\cdot))\|_2 \to 0$ a.s.

Now, for these versions, we have

$$\sup_{h \in \mathcal{C}_\alpha(F, G)} |D_n(h) - D(h)| \le \frac{1}{1-\alpha}\int_0^1 \left|\frac{\rho_n^2(t)}{f^2(F^{-1}(t))} - \frac{B^2(t)}{f^2(F^{-1}(t))}\right| dt \to 0 \quad \text{a.s.},$$

whereas for $h_0 \in \mathcal{C}_\alpha - \mathcal{C}_\alpha(F, G)$, we have a.s. that $D_n(h) \to \infty$ uniformly in a sufficiently small neighborhood of $h_0$. Furthermore, if $h_n \to h \in \mathcal{C}_\alpha(F, G)$, then we can extract a subsequence such that $n\int_0^1 (F^{-1}(t) - G^{-1}(t))^2 h'_n(t)\,dt \to 0$. The result follows from the next technical lemma, the easy proof of which is omitted.

Lemma A.3. Let $(X, d)$ be a compact metric space, let $A$ be a compact subset of $X$, and let $\{f_n\}$ and $f$ be real-valued, continuous functions on $X$ such that the following hold:

a. $\sup_{x \in A} |f_n(x) - f(x)| \to 0$ as $n \to \infty$.

b. For $x \in X - A$, there exists $\varepsilon_x > 0$ such that $\inf_{d(y, x) < \varepsilon_x} f_n(y) \to \infty$ as $n \to \infty$.

c. If $x_n \to x \in A$, there exists a subsequence $\{x_m\}$ such that $f_m(x_m) \to f(x)$.

Then $\min_{x \in X} f_n(x) \to \min_{x \in A} f(x)$.
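The limit law in Theorem A.1 is simple to simulate in the special case where $F = G$ is the uniform distribution on $(0, 1)$, so that $f(F^{-1}(t)) = 1$ and $\mathcal{C}_\alpha(F, F) = \mathcal{C}_\alpha$: the limit is then the trimmed integral of $B^2$, obtained by discarding the $\alpha$-fraction of $t$ where $B^2(t)$ is largest. The Monte Carlo sketch below is our illustration only (grid size, seed, and number of replicates are arbitrary choices):

```python
import numpy as np

def limit_draw(rng, alpha=0.1, m=1000):
    """One draw from min over h in C_alpha of the integral of B(t)^2 h'(t) dt
    when f(F^{-1}(t)) = 1: simulate a Brownian bridge on a grid and average
    the smallest (1 - alpha)-fraction of the B(t)^2 values."""
    t = np.arange(1, m) / m
    w = np.cumsum(rng.normal(scale=1.0 / np.sqrt(m), size=m - 1))  # Brownian motion
    b = w - t * w[-1]                                              # bridge: B = W - t W(1)
    b2 = np.sort(b ** 2)
    k = int(np.ceil((1 - alpha) * len(b2)))                        # grid points kept
    return b2[:k].mean()

rng = np.random.default_rng(0)
draws = np.array([limit_draw(rng) for _ in range(500)])
```

Since $E\int_0^1 B^2(t)\,dt = \int_0^1 t(1-t)\,dt = 1/6$, the trimmed draws concentrate below $1/6$, and increasing $\alpha$ can only decrease each draw.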

[Received October 2007. Revised January 2008.]

REFERENCES

Bickel, P., and Freedman, D. (1981), "Some Asymptotic Theory for the Bootstrap," The Annals of Statistics, 9, 1196-1217.

Csörgő, M., and Horváth, L. (1993), Weighted Approximations in Probability and Statistics, New York: Wiley.

Cuesta, J. A., and Matrán, C. (1989), "Notes on the Wasserstein Metric in Hilbert Spaces," The Annals of Probability, 17, 1264-1276.

Cuesta-Albertos, J. A., Gordaliza, A., and Matrán, C. (1997), "Trimmed k-Means: An Attempt to Robustify Quantizers," The Annals of Statistics, 25, 553-576.

Czado, C., and Munk, A. (1998), "Assessing the Similarity of Distributions: Finite-Sample Performance of the Empirical Mallows Distance," Journal of Statistical Computation and Simulation, 60, 319-346.

Freitag, G., Czado, C., and Munk, A. (2007), "A Nonparametric Test for Similarity of Marginals With Applications to the Assessment of Population Bioequivalence," Journal of Statistical Planning and Inference, 137, 691-714.

García-Escudero, L., Gordaliza, A., Matrán, C., and Mayo-Iscar, A. (2008), "A General Trimming Approach to Robust Cluster Analysis," The Annals of Statistics, to appear.

Gordaliza, A. (1991), "Best Approximations to Random Variables Based on Trimming Procedures," Journal of Approximation Theory, 64, 162-180.

Maronna, R. (2005), "Principal Components and Orthogonal Regression Based on Robust Scales," Technometrics, 47, 264-273.

Moore, D. S., and McCabe, G. P. (2003), Introduction to the Practice of Statistics (4th ed.), New York: W. H. Freeman.

Munk, A., and Czado, C. (1998), "Nonparametric Validation of Similar Distributions and Assessment of Goodness of Fit," Journal of the Royal Statistical Society, Ser. B, 60, 223-241.

Rousseeuw, P. (1985), "Multivariate Estimation With High Breakdown Point," in Mathematical Statistics and Applications, Vol. B, eds. W. Grossmann, G. Pflug, I. Vincze, and W. Wertz, Dordrecht: Reidel, pp. 283-297.
