
Advanced Review

Energy distance

Maria L. Rizzo1* and Gábor J. Székely2,3

Energy distance is a metric that measures the distance between the distributions of random vectors. Energy distance is zero if and only if the distributions are identical; thus it characterizes equality of distributions and provides a theoretical foundation for statistical inference and analysis. Energy statistics are functions of distances between observations in metric spaces. As a statistic, energy distance can be applied to measure the difference between a sample and a hypothesized distribution or the difference between two or more samples in arbitrary, not necessarily equal, dimensions. The name energy is inspired by the close analogy with Newton's gravitational potential energy. Applications include testing independence by distance covariance, goodness-of-fit, nonparametric tests for equality of distributions and extension of analysis of variance, generalizations of clustering algorithms, change point analysis, feature selection, and more. © 2015 Wiley Periodicals, Inc.

How to cite this article: WIREs Comput Stat 2016, 8:27–38. doi: 10.1002/wics.1375

Keywords: Multivariate, goodness-of-fit, distance correlation, DISCO, independence

INTRODUCTION

Energy distance is a distance between probability distributions. The name 'energy' is motivated by analogy to the potential energy between objects in a gravitational space. The potential energy is zero if and only if the locations (the gravitational centers) of the two objects coincide, and it increases as their distance in space increases. One can apply the notion of potential energy to data as follows. Let X and Y be independent random vectors in ℝ^d, with cumulative distribution functions (CDF) F and G, respectively. In what follows, ‖·‖ denotes the Euclidean norm (length) of its argument, E denotes expected value, and a primed random variable X′ denotes an independent and identically distributed (iid) copy of X; that is, X and X′ are iid. Similarly, Y and Y′ are iid. The squared energy distance can be defined in terms of expected distances between the random vectors:

$$D^2(F, G) := 2E\|X - Y\| - E\|X - X'\| - E\|Y - Y'\| \ge 0,$$

and the energy distance between distributions F and G is defined as the square root of D²(F, G).

It can be shown that energy distance D(F, G) satisfies all axioms of a metric, and in particular D(F, G) = 0 if and only if F = G. Therefore, energy distance provides a characterization of equality of distributions, and a theoretical basis for the development of statistical inference and multivariate analysis based on Euclidean distances. In this review we discuss several of the important applications and illustrate their implementation.

*Correspondence to: [email protected]

1 Department of Mathematics and Statistics, Bowling Green State University, Bowling Green, OH, USA
2 National Science Foundation, Arlington, VA, USA
3 Rényi Institute of Mathematics, Hungarian Academy of Sciences, Hungary

Conflict of interest: The authors have declared no conflicts of interest for this article.


BACKGROUND AND APPLICATION OF ENERGY DISTANCE

The notion of 'energy statistics'1–3 was introduced by Székely in 1984–1985 in a series of lectures given in Budapest, Hungary, and at MIT, Yale, and Columbia. As mentioned above, Székely's main idea and the name derive from the concept of Newton's potential energy. Statistical observations can be considered as objects in a metric space that are governed by a statistical potential energy that is zero if and only if an underlying statistical null hypothesis is true. Energy statistics (E-statistics) are a class of functions of distances between statistical observations. Several examples of one-sample, two-sample, and multi-sample energy statistics will be illustrated below.

Cramér's distance4 is closely related, but only in the univariate (real-valued) case. For two real-valued random variables with CDFs F and G, the squared energy distance is exactly twice the distance proposed by Harald Cramér:

$$D^2(F, G) = 2\int_{-\infty}^{\infty} \left(F(x) - G(x)\right)^2 \, dx.$$

However, the equivalence of energy distance with Cramér's distance cannot extend to higher dimensions, because while energy distance is rotation invariant, Cramér's distance does not have this property.

A proof of the basic energy inequality, D(F, G) ≥ 0 with equality if and only if F = G, follows from Ref 5 and also from Mattner's result.6 An alternate proof related to a result of Morgenstern7 appears in Refs 8,9.

Application to testing for equality of two distributions appeared in Refs 8,10–12, as well as a multi-sample test for equality of distributions, 'distance components' (DISCO).13 Goodness-of-fit tests have been developed for multivariate normality,8,9 the stable distribution,14 the Pareto distribution,15 and the multivariate Dirichlet distribution.16 Hierarchical clustering and a generalization of k-means clustering based on energy distance are developed in Refs 17,18.

Generalizations and interesting special cases of the energy distance have appeared in the recent literature; see Refs 19–22. A similar idea related to energy distance and E-statistics was considered as N-distances and N-statistics in Ref 23; see also Ref 5. Measures of the energy distance type have also been studied in the machine learning literature.21

TESTING FOR EQUAL DISTRIBUTIONS

Consider the null hypothesis that two random variables, X and Y, have the same cumulative distribution functions: F = G. For samples x₁, …, x_n and y₁, …, y_m from X and Y, respectively, the E-statistic for testing this null hypothesis is

$$\mathcal{E}_{n,m}(X, Y) := 2A - B - C,$$

where A, B, and C are simply averages of pairwise distances:

$$A = \frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} \|x_i - y_j\|, \quad B = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} \|x_i - x_j\|, \quad C = \frac{1}{m^2}\sum_{i=1}^{m}\sum_{j=1}^{m} \|y_i - y_j\|.$$

One can prove2,9 that E(X, Y) := D²(F, G) is zero if and only if X and Y have the same distribution (F = G). It is also true that the statistic $\mathcal{E}_{n,m}$ is always non-negative. When the null hypothesis of equal distributions is true, the test statistic

$$T = \frac{nm}{n + m}\,\mathcal{E}_{n,m}(X, Y)$$

converges in distribution to a quadratic form of independent standard normal random variables. Under an alternative hypothesis, the statistic T tends to infinity stochastically as the sample sizes tend to infinity, so the energy test for equal distributions that rejects the null for large values of T is consistent.12
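As a concrete illustration of the definitions above, the following is a minimal from-scratch sketch of the statistic in R (not the energy package implementation; the function name energy_stat is ours):

```r
# Two-sample energy statistic E_{n,m} = 2A - B - C, computed directly
# from the averages of pairwise distances defined above.
energy_stat <- function(x, y) {
  x <- as.matrix(x); y <- as.matrix(y)
  n <- nrow(x); m <- nrow(y)
  d <- as.matrix(dist(rbind(x, y)))               # pooled pairwise distances
  A <- mean(d[1:n, (n + 1):(n + m)])              # mean of ||x_i - y_j||
  B <- mean(d[1:n, 1:n])                          # (1/n^2) sum ||x_i - x_j||
  C <- mean(d[(n + 1):(n + m), (n + 1):(n + m)])  # (1/m^2) sum ||y_i - y_j||
  2 * A - B - C
}
```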

Because the null distribution of T depends onthe distributions of X and Y, the test is implementedas a permutation test in the energy package,24 whichis available for R25 on the Comprehensive R ArchiveNetwork (CRAN) under general public license. Thetest is implemented in the function eqdist.etest.The data argument can be the data matrix or dis-tance matrix of the pooled sample, with thedefault being the data matrix. The second argumentis a vector of sample sizes. Here we test whether twospecies of iris data differ in the distribution of theirfour dimensional measurements. In this example,n = m =50 and we use 999 permutation replicatesfor the test decision.
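A minimal sketch of this example, assuming the interface just described (pooled data matrix followed by the vector of sample sizes):

```r
library(energy)

# Pooled four-dimensional measurements of iris setosa and versicolor
x <- as.matrix(iris[1:100, 1:4])

# Two-sample energy test of equal distributions, 999 permutation replicates
eqdist.etest(x, sizes = c(50, 50), R = 999)
```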


To compute the energy statistic only:
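A sketch, assuming the eqdist.e helper of the energy package, which returns the statistic without a test:

```r
# E-statistic for the pooled iris sample, no permutation test
eqdist.e(as.matrix(iris[1:100, 1:4]), sizes = c(50, 50))
```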

The E-statistic is not standardized, so it is reported only for reference. To interpret the value, one should normalize the statistic. One way of doing this is to divide by an estimate of E‖X − Y‖. Note that if

$$H := \frac{D^2(F_X, F_Y)}{2E\|X - Y\|} = \frac{2E\|X - Y\| - E\|X - X'\| - E\|Y - Y'\|}{2E\|X - Y\|},$$

then 0 ≤ H ≤ 1, with H = 0 if and only if X and Y are identically distributed.

For background on permutation tests, see Ref 26 or Ref 27.

For more details, applications, and power comparisons see Refs 11,12 and the documentation included with the energy package. The same functions are generalized to handle multi-sample problems, discussed below.

MULTI-SAMPLE ENERGY STATISTICS

Distance Components: A Nonparametric Extension of ANOVA

Analogous to the ANOVA decomposition of variance, we partition the total dispersion of the pooled samples into between and within components, called distance components (DISCO). For two samples, the between-sample component is the two-sample energy statistic above. For several samples, the between component is a weighted combination of pairwise two-sample energy statistics.

To test the K-sample hypothesis H₀ : F₁ = … = F_K, K ≥ 2, one can apply the between-sample test statistic, or alternately a ratio statistic similar to the familiar F statistic of ANOVA, both implemented as a randomization test.

First, let us introduce a generalization of the energy distance. The characterization of equality of distributions by energy distance also holds if we replace Euclidean distance by ‖X − Y‖^α, where 0 < α < 2. The characterization does not hold if α = 2, because 2E‖X − Y‖² − E‖X − X′‖² − E‖Y − Y′‖² = 0 whenever EX = EY. We denote the corresponding two-sample energy statistic by $\mathcal{E}^{(\alpha)}_{n,m}$.

Let A = {a₁, …, a_{n₁}}, B = {b₁, …, b_{n₂}} be two samples, and define

$$g_\alpha(A, B) := \frac{1}{n_1 n_2}\sum_{i=1}^{n_1}\sum_{m=1}^{n_2} \|a_i - b_m\|^\alpha, \qquad (1)$$

for 0 < α ≤ 2. If A₁, …, A_K are samples of sizes n₁, n₂, …, n_K, respectively, and $N = \sum_{j=1}^{K} n_j$, the within-sample dispersion statistic is

$$W_\alpha = W_\alpha(A_1, \ldots, A_K) = \sum_{j=1}^{K} \frac{n_j}{2}\, g_\alpha(A_j, A_j), \qquad (2)$$

and the total dispersion of the observed response is

$$T_\alpha = T_\alpha(A_1, \ldots, A_K) = \frac{N}{2}\, g_\alpha(A, A), \qquad (3)$$

where A is the pooled sample. The between-sample energy statistic is

$$S_{n,\alpha} = \sum_{1 \le j < k \le K} \frac{n_j + n_k}{2N} \left[ \frac{n_j n_k}{n_j + n_k}\, \mathcal{E}^{(\alpha)}_{n_j, n_k}(A_j, A_k) \right] = \sum_{1 \le j < k \le K} \frac{n_j n_k}{2N} \left\{ 2 g_\alpha(A_j, A_k) - g_\alpha(A_j, A_j) - g_\alpha(A_k, A_k) \right\}. \qquad (4)$$

Note that S_{n,α} weights each pairwise E-statistic based on the proportion of data in the two samples. Analogous to the decomposition of between-sample variance and within-sample variance of ANOVA, we have the decomposition of distances

$$T_\alpha = S_\alpha + W_\alpha,$$

where both S_α and W_α are nonnegative.


For every 0 < α < 2, the statistic Eq. (4) determines a consistent test of the multi-sample hypothesis of equal distributions.13 In the special case where all F_j are univariate distributions and α = 2, S_{n,2} is the ANOVA between-sample sum of squared error, and the decomposition T₂ = S₂ + W₂ is the ANOVA decomposition. The ANOVA test statistic measures differences in means, not distributions. However, if we apply α = 1 (Euclidean distance) or any 0 < α < 2 as the exponent on Euclidean distance, the corresponding energy test is consistent against all alternatives with finite α moments. If any of the underlying distributions may have non-finite first moment, a suitable choice of α extends the energy test to this situation.

Returning to the iris data example, we can easily apply the test for equality of the three species' distributions using a choice of methods; the relevant options here are method="discoB" or method="discoF". The first uses the between-sample statistic, and the second uses an 'F' ratio like ANOVA, both implemented as permutation tests. The ratio statistic is

$$F_{n,\alpha} = \frac{S_{n,\alpha}/(K - 1)}{W_{n,\alpha}/(N - K)},$$

and the decomposition details are displayed in a table similar to an ANOVA table. Although it has the same form as an F statistic, it does not have an F distribution, and a permutation test is applied.
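A sketch of how the DISCO test might be called, assuming the disco function of the energy package with a factors argument for the group labels:

```r
library(energy)

x <- as.matrix(iris[, 1:4])

# DISCO tests of equal distributions for the three species, 999 replicates
disco(x, factors = iris$Species, R = 999, method = "discoB")  # between-sample statistic
disco(x, factors = iris$Species, R = 999, method = "discoF")  # F-ratio version
```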

One can obtain a table of pairwise energy statistics for any number of samples in one step using the edist function in the energy package. It can be used, for example, to display E-statistics for the result of a cluster analysis.
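For example, a sketch assuming the edist interface (pooled data matrix and vector of sample sizes):

```r
# Pairwise two-sample E-statistics for the three iris species
edist(as.matrix(iris[, 1:4]), sizes = c(50, 50, 50))
```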

E-clustering

Energy distance has been applied in hierarchical cluster analysis.17 It generalizes the well-known Ward's minimum variance method in a similar way that DISCO generalizes ANOVA. The DISCO decomposition has recently been applied to generalize k-means clustering.18

In an agglomerative hierarchical clustering algorithm, starting with single observations, at each step we merge the clusters that have minimum cluster distance. In the energy distance algorithm, the cluster distance is the two-sample energy statistic.

There is a general class of hierarchical clustering algorithms uniquely determined by their respective recursive formula for updating all cluster distances following each merge of two clusters. One can show17 that the energy clustering algorithm is also a member of this class, and its recursive formula shows that it is formally similar to Ward's method.

Suppose that the disjoint clusters C_i, C_j are to be merged at the current step. If C_k is a disjoint cluster, then the new cluster distances can be computed by the following recursive formula:

$$d(C_i \cup C_j, C_k) = \frac{n_i + n_k}{n_i + n_j + n_k}\, d(C_i, C_k) + \frac{n_j + n_k}{n_i + n_j + n_k}\, d(C_j, C_k) - \frac{n_k}{n_i + n_j + n_k}\, d(C_i, C_j), \qquad (5)$$


where $d(C_i, C_j) = \mathcal{E}_{n_i, n_j}(C_i, C_j)$, and n_i, n_j, n_k are the sizes of clusters C_i, C_j, C_k, respectively. Let d_{ij} := d(C_i, C_j). Then

$$d_{(ij)k} := d(C_i \cup C_j, C_k) = \alpha_i\, d(C_i, C_k) + \alpha_j\, d(C_j, C_k) + \beta\, d(C_i, C_j) = \alpha_i d_{ik} + \alpha_j d_{jk} + \beta d_{ij} + \gamma\, |d_{ik} - d_{jk}|,$$

where

$$\alpha_i = \frac{n_i + n_k}{n_i + n_j + n_k}, \quad \beta = \frac{-n_k}{n_i + n_j + n_k}, \quad \gamma = 0.$$

If we substitute squared Euclidean distances for Euclidean distances in this recursive formula, keeping the same parameters (α_i, α_j, β, γ), then we obtain the updating formula for Ward's minimum variance method. However, we know that Ward's method (with exponent α = 2 on distances) is a geometrical method that separates clusters by their centers, not by their distributions. E-clustering generalizes Ward because for every 0 < α < 2, the energy clustering algorithm separates clusters that differ in distribution.

Overall, in simulations and real data examples,17 the characterization property of E is a clear advantage for certain clustering problems, without sacrificing the good properties of Ward's minimum variance method for separating spherical clusters.

The hclust hierarchical clustering function provided in R25 implements exactly the above recursive formula Eq. (5), and therefore one can use hclust to apply either the E-clustering solution or Ward's method by specifying method 'ward.D' (energy) or 'ward.D2' (Ward).
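A minimal sketch using only base R, per the correspondence just described:

```r
# Euclidean distance matrix for the iris measurements
d <- dist(iris[, 1:4])

ec <- hclust(d, method = "ward.D")    # E-clustering (energy recursion, Eq. (5))
wc <- hclust(d, method = "ward.D2")   # Ward's minimum variance method

plot(ec, labels = FALSE, main = "E-clustering of iris data")
```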

TESTING INDEPENDENCE

An important application for two samples applies to testing independence of random vectors. In this case, we test whether the joint distribution of X and Y is equal to the product of their marginal distributions. Interestingly, the statistics can be expressed in a product–moment expression involving the double-centered distance matrices of the X and Y samples. The statistics based on distances are analogous to, but more general than, product–moment covariance and correlation. This suggests the names distance covariance (dCov) and distance correlation (dCor), defined below.

Distance Covariance

The simplest formula for the distance covariance statistic is the square root of

$$\mathcal{V}^2_n(X, Y) = \frac{1}{n^2}\sum_{i,j=1}^{n} \hat{A}_{ij} \hat{B}_{ij},$$

where $\hat{A}$ and $\hat{B}$ are the double-centered distance matrices of the X sample and the Y sample, respectively, and the subscript ij denotes the entry in the i-th row and j-th column. The double-centered distance matrices are computed as in classical multidimensional scaling. Given a random sample (x, y) = {(x_i, y_i) : i = 1, …, n} from the joint distribution of random vectors X in ℝ^p and Y in ℝ^q, compute the Euclidean distance matrix (a_{ij}) = (‖x_i − x_j‖) for the X sample and (b_{ij}) = (‖y_i − y_j‖) for the Y sample. The ij-th entry of $\hat{A}$ is

$$\hat{A}_{ij} = a_{ij} - \bar{a}_{i\cdot} - \bar{a}_{\cdot j} + \bar{a}_{\cdot\cdot}, \quad i, j = 1, \ldots, n,$$

where

$$\bar{a}_{i\cdot} = \frac{1}{n}\sum_{j=1}^{n} a_{ij}, \quad \bar{a}_{\cdot j} = \frac{1}{n}\sum_{i=1}^{n} a_{ij}, \quad \bar{a}_{\cdot\cdot} = \frac{1}{n^2}\sum_{i,j=1}^{n} a_{ij}.$$

Similarly, the ij-th entry of $\hat{B}$ is

$$\hat{B}_{ij} = b_{ij} - \bar{b}_{i\cdot} - \bar{b}_{\cdot j} + \bar{b}_{\cdot\cdot}, \quad i, j = 1, \ldots, n.$$

The sample distance variance is

$$\mathcal{V}^2_n(X) = \mathcal{V}^2_n(X, X) = \frac{1}{n^2}\sum_{i,j=1}^{n} \hat{A}^2_{ij}.$$

The distance covariance statistic is always non-negative, and $\mathcal{V}^2_n(X) = 0$ only if all of the sample observations are identical (see Ref 28).

Distance Correlation

Distance correlation is the standardized distance covariance. We have defined the squared distance covariance statistic

$$\mathcal{V}^2_n(X, Y) = \frac{1}{n^2}\sum_{i,j=1}^{n} \hat{A}_{ij} \hat{B}_{ij},$$

and the squared distance correlation is defined by


$$\mathcal{R}^2_n(X, Y) = \begin{cases} \dfrac{\mathcal{V}^2_n(X, Y)}{\sqrt{\mathcal{V}^2_n(X)\,\mathcal{V}^2_n(Y)}}, & \mathcal{V}^2_n(X)\,\mathcal{V}^2_n(Y) > 0; \\[1ex] 0, & \mathcal{V}^2_n(X)\,\mathcal{V}^2_n(Y) = 0. \end{cases}$$

Distance correlation satisfies

1. 0 ≤ R_n(X, Y) ≤ 1.

2. If R_n(X, Y) = 1, then there exists a vector a, a non-zero real number b, and an orthogonal matrix R such that Y = a + bXR, for the data matrices X and Y.

Remark 1. One could also define dCov as $\mathcal{V}^2_n$ and dCor as $\mathcal{R}^2_n$, rather than by their respective square roots. There are reasons to prefer each definition, but historically28,29 the above definitions were used. When we deal with unbiased statistics, we no longer have the non-negativity property, so we cannot take the square root and need to work with the square.

Population Coefficients

Suppose that X ∈ ℝ^p, Y ∈ ℝ^q, E‖X‖ < ∞, and E‖Y‖ < ∞. The squared population distance covariance coefficient can be written in terms of expected distances:

$$\mathcal{V}^2(X, Y) = E\|X - X'\|\|Y - Y'\| + E\|X - X'\| \cdot E\|Y - Y'\| - 2E\|X - X'\|\|Y - Y''\|,$$

where (X, Y), (X′, Y′), and (X″, Y″) are iid.29 Here $\mathcal{V}^2(X, Y)$ is an energy distance between the joint distribution of (X, Y) and the product of the marginal distributions of X and Y.

We have a characterization of independence: $\mathcal{V}^2(X, Y) \ge 0$, with equality to 0 if and only if X and Y are independent. Population distance correlation R(X, Y) is the square root of the standardized coefficient:

$$\mathcal{R}^2(X, Y) = \begin{cases} \dfrac{\mathcal{V}^2(X, Y)}{\sqrt{\mathcal{V}^2(X)\,\mathcal{V}^2(Y)}}, & \mathcal{V}^2(X)\,\mathcal{V}^2(Y) > 0; \\[1ex] 0, & \mathcal{V}^2(X)\,\mathcal{V}^2(Y) = 0. \end{cases}$$

The empirical coefficients $\mathcal{V}_n(X, Y)$ and $\mathcal{R}_n(X, Y)$ converge almost surely to the population coefficients $\mathcal{V}(X, Y)$ and $\mathcal{R}(X, Y)$ as n → ∞. The distribution of $Q_n = n\mathcal{V}^2_n(X, Y)$ converges to a quadratic form $\sum_{i=1}^{\infty} \lambda_i Z_i^2$, where the Z_i are iid standard normal and the λ_i are non-negative coefficients that depend on the distributions of the underlying random variables. Values of Q_n near zero are consistent with the null hypothesis of independence, while large Q_n support the alternative. Thus, a consistent test of multivariate independence is based on the distance covariance $Q_n = n\mathcal{V}^2_n$, and it can be implemented as a nonparametric permutation test. The dCov test applies to random vectors in arbitrary, not necessarily equal, dimensions, for any sample size n ≥ 4. For high-dimensional X and Y, there is also a distance correlation t-test of independence, introduced in Ref 30, which is applicable when dimension exceeds sample size.

For the permutation test in this case, the null distribution is sampled by permuting the indices of one of the two variables each time a sample is drawn. The replicates then provide a reference distribution for estimating the tail probability to the right of the observed test statistic Q_n.

Remark 2. Note that the permutation test is only applicable if the observations are exchangeable under the null, which would not be true, e.g., for time series data.

The statistics and tests are implemented in the energy package for R.24,25 Functions dcor, dcov, dcov.test, and dcor.ttest compute the statistics and tests of independence. The following example uses the crabs data in the MASS package. After converting the binary factors to integers, we test whether the two-dimensional categorical variable (species, sex) is independent of the vector of body measurements.
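A sketch of this example, assuming the crabs factors sp and sex and the five body-measurement columns (FL, RW, CL, CW, BD):

```r
library(energy)
library(MASS)   # crabs data

# Two-dimensional categorical variable: species and sex as integers
x <- cbind(as.integer(crabs$sp), as.integer(crabs$sex))
# Five body measurements
y <- as.matrix(crabs[, 4:8])

# dCov permutation test of independence, 999 replicates
dcov.test(x, y, R = 999)
```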


The data arguments to dcov.test can be data matrices or distance objects returned by the R dist function. Here the arguments are data matrices.

The test is significant and we reject independence. One may also want to compute the statistics using dcor or dcov:
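Continuing the crabs example:

```r
dcor(x, y)   # distance correlation statistic R_n
dcov(x, y)   # distance covariance statistic V_n
```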

To recover all of the statistics from dCor, a utility function is provided:
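A sketch, assuming the DCOR utility of the energy package, which returns the four statistics in one list:

```r
DCOR(x, y)   # list with components dCov, dCor, dVarX, dVarY
```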

An Unbiased Distance Covariance Statistic

The population distance covariance is zero under independence, but the dCov statistic is non-negative, hence its expected value is positive except in degenerate cases. Clearly the dCov statistic in its original formulation is biased for the population coefficient. The bias is in fact increasing with dimension.

An unbiased estimator of $\mathcal{V}^2(X, Y)$ was given in Ref 30, and an equivalent unbiased statistic was given in Ref 31. Although the latter one looks simpler, the original may be faster to compute. The following is from Ref 31.

Let A = (a_{ij}) be a symmetric, real-valued n × n matrix with zero diagonal (not necessarily Euclidean distances). Instead of the classical method of double centering A, we introduce the U-centered matrix Ã. The (i, j)-th entry of Ã is

$$\tilde{A}_{i,j} = \begin{cases} a_{i,j} - \dfrac{1}{n-2}\displaystyle\sum_{k=1}^{n} a_{k,j} - \dfrac{1}{n-2}\displaystyle\sum_{l=1}^{n} a_{i,l} + \dfrac{1}{(n-1)(n-2)}\displaystyle\sum_{k,l=1}^{n} a_{k,l}, & i \ne j; \\[1ex] 0, & i = j. \end{cases}$$

Here 'U-centered' refers to the result that the corresponding squared distance covariance statistic is an unbiased estimator of the population coefficient.

As the first step in computing the unbiased statistic, we replace the double centering operation with U-centering, to obtain U-centered distance matrices Ã and B̃. Then

$$(\tilde{A} \cdot \tilde{B}) := \frac{1}{n(n-3)}\sum_{i \ne j} \tilde{A}_{i,j} \tilde{B}_{i,j}$$

is an unbiased estimator of the squared population distance covariance $\mathcal{V}^2(X, Y)$. The inner product notation is due to the fact that this statistic is an inner product in the Hilbert space of U-centered distance matrices.31
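A from-scratch sketch of U-centering and the inner product statistic (not the energy package implementation; the helper names are ours):

```r
# U-center a zero-diagonal symmetric matrix a, per the formula above
u_center <- function(a) {
  n <- nrow(a)
  A <- sweep(a, 2, colSums(a) / (n - 2))   # subtract column sums / (n - 2)
  A <- sweep(A, 1, rowSums(a) / (n - 2))   # subtract row sums / (n - 2)
  A <- A + sum(a) / ((n - 1) * (n - 2))    # add the grand-sum term
  diag(A) <- 0                             # zero diagonal by definition
  A
}

# Unbiased estimator of squared population distance covariance (needs n >= 4)
unbiased_dcov2 <- function(x, y) {
  A <- u_center(as.matrix(dist(x)))
  B <- u_center(as.matrix(dist(y)))
  n <- nrow(A)
  sum(A * B) / (n * (n - 3))   # off-diagonal inner product; diagonals are zero
}
```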

A bias-corrected $\mathcal{R}^2_n$ is defined by normalizing the inner product statistic with the bias-corrected dVar statistics. The bias-corrected dCor statistic is implemented in the R energy package by the bcdcor function. Returning to the example, here is a comparison of the biased and bias-corrected $\mathcal{R}^2_n$:

The above unbiased inner product dCov statistic is easy to compute, but since it can take negative values, we cannot define the bias-corrected statistic to be the square root of it. Thus we avoid the 'squared' notation and use the inner product operator or $\mathcal{V}^*_n$ and $\mathcal{R}^*_n$. One could have defined 'dCov' from the start to be the square of the energy distance. Historically, the rationale for choosing the square root definition is that in this case distance covariance is the energy distance between the joint distribution of the variables and the product of their marginals. A disadvantage is that the distance variance, rather than the distance standard deviation, is measured in the same units as the distances.

It is clear that both the sample dCov and the sample dCor can be computed in O(n²) steps. Recently Huo and Székely32 proved that for real-valued samples the unbiased estimator of the squared population distance covariance can be computed by an O(n log n) algorithm. The supplementary files to Ref 32 include an implementation in Matlab.

Distance Correlation for Dissimilarity Matrices

It is important to notice that Ã does not change if we add the same constant to all off-diagonal entries and U-center the result. In Ref 31, it is shown that the inner product version of dCov can be applied to any U-centered dissimilarity matrix (zero-diagonal symmetric matrix). The algorithm is outlined in Ref 31, and based on the comparisons in that paper it is a strong competitor of the Mantel test for association between dissimilarity matrices. This makes dCov tests and energy statistics ready to apply to problems in, e.g., community ecology, where one must often work with data in the form of non-Euclidean dissimilarity matrices.

Partial Distance Correlation

Based on the inner product dCov statistic (the unbiased estimator of $\mathcal{V}^2$), theory is developed to define partial distance correlation analogous to (linear) partial correlation. There is a simple computing formula for the pdCor statistic, and there is a test for the hypothesis of zero pdCor based on the inner product. Energy statistics are defined for random vectors, so pdCor(X, Y; Z) is a scalar coefficient defined for random vectors X, Y, and Z in arbitrary dimension. The statistics and tests are described in detail in Ref 31 and currently implemented in an R package pdcor,33 which will become part of the energy package.

GOODNESS-OF-FIT

Goodness-of-fit is a one-sample problem, but there are two distributions to consider: one is the hypothesized distribution, and the other is the underlying distribution from which the observed sample has been drawn. Energy distance applies to compare these two distributions with a variation of the two-sample energy distance.

The energy distance for this problem must be the same as E(X, Y) where one of the variables now represents the unknown sampled distribution. Suppose that a random sample x₁, …, x_n is observed and the problem is to test whether the sampled distribution F is equal to the hypothesized distribution F_X. The energy goodness-of-fit statistic is

$$\mathcal{E}_n = n\left(\frac{2}{n}\sum_{i=1}^{n} E\|x_i - X\|^\alpha - E\|X - X'\|^\alpha - \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} \|x_i - x_j\|^\alpha\right), \qquad (6)$$

where X and X′ are iid with distribution F_X, and 0 < α < 2. The statistic is defined in arbitrary dimension and is not restricted by sample size. The only required condition is that ‖X‖ has finite α moment under the null hypothesis. Under the null hypothesis, $E\mathcal{E}_n = E\|X - X'\|^\alpha$, and the asymptotic distribution of $\mathcal{E}_n$ is a quadratic form of centered Gaussian random variables. The rejection region is in the upper tail. Under an alternative hypothesis, $\mathcal{E}_n$ tends to infinity stochastically, and therefore $\mathcal{E}_n$ determines a consistent goodness-of-fit test.

For most applications the exponent α = 1 (Euclidean distance) can be applied, but smaller exponents have been applied for testing distributions with heavy tails, including the Pareto, Cauchy, and stable distributions.14,15 The important special case of testing multivariate normality8,9 is fully implemented in the energy24 package for R.

A detailed introduction to the energy goodness-of-fit tests with several simple examples can be found in Ref 3. Here we will focus on the important applications of testing for multivariate normality, starting with the special case of univariate normality.

The energy statistic for testing whether a sample X₁, …, X_n is from a multivariate normal distribution N(μ, Σ) was developed by Székely and Rizzo.9 Let x₁, …, x_n denote an observed random sample.

Univariate Normality

For a test of univariate normality, we apply the statistic Eq. (6). Suppose that the null hypothesis is that the sample has a Normal distribution with mean μ and variance σ². Then it can be derived that

$$E\|x_i - X\| = 2(x_i - \mu)F(x_i) + 2\sigma^2 f(x_i) - (x_i - \mu), \qquad E\|X - X'\| = \frac{2\sigma}{\sqrt{\pi}},$$

where F and f denote the CDF and density of the hypothesized Normal(μ, σ²) distribution. The last sum in the statistic $\mathcal{E}_n$ can be linearized in terms of the ordered sample, which allows computation in O(n log n) time.

Generally, the parameters μ and σ are unknown. In that case we first standardize the sample using the sample mean and the sample standard deviation, then test the fit to the standard normal distribution. The estimated parameters change the critical values of the distribution but not the general shape, and the rejection region is in the upper tail. See Ref 8 for a detailed proof that the test with estimated parameters is statistically consistent (also for the multivariate case) against all alternatives with finite moments.
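A from-scratch sketch under these formulas, standardizing first and using the ordered-sample linearization (our function name; not the energy package implementation):

```r
# Energy goodness-of-fit statistic for univariate normality, Eq. (6) with
# alpha = 1, after standardizing with the sample mean and standard deviation.
energy_normal_stat <- function(x) {
  n <- length(x)
  y <- sort((x - mean(x)) / sd(x))       # standardized, ordered sample
  # E|y_i - Z| for Z ~ N(0,1): 2*y*Phi(y) + 2*phi(y) - y
  m1 <- mean(2 * y * pnorm(y) + 2 * dnorm(y) - y)
  m2 <- 2 / sqrt(pi)                     # E|Z - Z'|
  # (1/n^2) sum_{i,j} |y_i - y_j|, linearized via the ordered sample
  k <- seq_len(n)
  m3 <- 2 * sum((2 * k - 1 - n) * y) / n^2
  n * (2 * m1 - m2 - m3)
}
```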

Multivariate Normality

The statistic $\mathcal{E}_n$ for testing multivariate normality is considerably more difficult to derive. We first standardize the sample using the sample mean vector and the sample covariance matrix to estimate parameters. Then we test the sample for fit to the standard multivariate normal distribution. If Z and Z′ are iid standard normal in dimension d, we have


$$E\|Z - Z'\|_d = \sqrt{2}\, E\|Z\|_d = 2\,\frac{\Gamma\!\left(\frac{d+1}{2}\right)}{\Gamma\!\left(\frac{d}{2}\right)},$$

where Γ(·) is the complete gamma function. If y₁, …, y_n are the standardized sample elements, the computing formula for the test of multivariate normality in ℝ^d is

$$n\mathcal{E}_{n,d} = n\left(\frac{2}{n}\sum_{j=1}^{n} E\|y_j - Z\|_d - 2\,\frac{\Gamma\!\left(\frac{d+1}{2}\right)}{\Gamma\!\left(\frac{d}{2}\right)} - \frac{1}{n^2}\sum_{j,k=1}^{n} \|y_j - y_k\|_d\right),$$

where

$$E\|a - Z\|_d = \sqrt{2}\,\frac{\Gamma\!\left(\frac{d+1}{2}\right)}{\Gamma\!\left(\frac{d}{2}\right)} + \sqrt{\frac{2}{\pi}}\sum_{k=0}^{\infty}\frac{(-1)^k}{k!\,2^k}\,\frac{\|a\|_d^{2k+2}}{(2k+1)(2k+2)}\,\frac{\Gamma\!\left(\frac{d+1}{2}\right)\Gamma\!\left(k+\frac{3}{2}\right)}{\Gamma\!\left(k+\frac{d}{2}+1\right)}.$$

The expression for E‖a − Z‖_d follows from the fact that if Z is a d-variate standard normal random vector, then $\|a - Z\|_d^2$ has a noncentral chi-square distribution χ²[ν; λ] with noncentrality parameter $\lambda = \|a\|_d^2 / 2$ and degrees of freedom ν = d + 2ψ, where ψ is a Poisson random variable with mean λ. Typically the sum in E‖a − Z‖_d converges to within a small tolerance after 40–60 terms, except when a is a true outlier of the standard multivariate normal distribution (when ‖a‖ is very large). However, E‖a − Z‖ converges to ‖a‖ as ‖a‖ → ∞, so we can evaluate E‖a − Z‖ ≈ ‖a‖ in that case. See the source code in 'energy.c' of the energy package24 for an implementation.

The test is implemented for the multivariate normal distribution as follows, in the energy package24 with the mvnorm.etest function. If the observed sample size is n, it is standardized using the sample mean vector and sample covariance, and the observed test statistic $\mathcal{E}_{n,d}$ is computed. A large number M of standard multivariate normal samples of size n are generated, j = 1, …, M, each standardized using the j-th sample mean vector and sample covariance, and $\mathcal{E}^{(j)}_{n,d}$ is computed for each of these samples to obtain a reference distribution. An estimated p-value is obtained by finding the proportion of replicates $\mathcal{E}^{(j)}_{n,d}$ that exceed the observed $\mathcal{E}_{n,d}$ statistic. An example illustrating the test of normality for the four-dimensional iris setosa data (included in the R distribution) follows. In this example, the hypothesis of normality is rejected at significance level 0.05.

Under the null hypothesis, $n\mathcal{E}_{n,d}$ converges in distribution to a quadratic form $Q_d = \sum_{i=1}^{\infty} \lambda_i Z_i^2$ as n → ∞, where the Z_i are iid standard normal random variables and the λ_i are non-negative constants that depend on the parameters of the null distribution. Figure 1 displays replicates for testing the iris data, with the observed $\mathcal{E}_{n,d}$ identified by the large black dot. The density curve overlaid on the plot is an approximation (n = 50) of the density of the asymptotic distribution Q_d.

[FIGURE 1 | Replicates generated under the null hypothesis in a test of multivariate normality for the iris setosa data. The test statistic $\mathcal{E}_{n,d}$ of the observed iris sample is located by the black dot. Horizontal axis: $\mathcal{E}$; vertical axis: density.]


The energy goodness-of-fit test could alternately be implemented by evaluating the constants λ_i, but that is a difficult problem except in special cases. Some analytical results for univariate normality are given in Ref 3, and a numerical approach is investigated in Ref 14. See also Ref 8 for tabulated critical values of $n\mathcal{E}_{n,d}$ for several (n, d), obtained by large-scale simulation.

The energy test of multivariate normality is practical to apply via parametric bootstrap as illustrated above for arbitrary dimension, and sample size n < d is not a problem. Monte Carlo power comparisons9 suggest that the energy test is a powerful competitor to other tests of multivariate normality. Indeed, there are very few other tests in the literature for multivariate normality like energy that are consistent, powerful, omnibus tests with practical implementation; the BHEP tests,34,35 which also apply a characterization of equality between distributions, share these properties, and have recently been implemented in an R package MVN. See Ref 9 for comparisons.

GENERALIZATIONS

One might wonder what makes the functions |·|^α, 0 < α < 2, special in the definition above. One can show that the key property is that |·|^α, 0 < α < 2, is strongly negative definite (see Lyons20). In this case, the generalized distance function remains a metric for measuring the distance between probability distributions F and G; it is nonnegative and equals zero if and only if F = G. What makes the distance function special in the class of strongly negative definite functions is that the distance function is scale equivariant. If we change the scale by replacing x by cx and y by cy, then the squared distance D²(F, G) is multiplied by c, and therefore the ratio of these functions does not depend on the constant c.

This property of invariance also holds for 0 < α < 2, because if we change the measurement units in D^{2(α)}(F, G), replacing X by cX and Y by cY, then D^{2(α)}(F, G) is multiplied by c^α, and the ratio of two statistics of this type is again invariant with respect to c. Hence the statistical decisions based on these ratios do not depend on the choice of measurement units. This invariance property is essential.

One can further generalize energy distance to probability distributions on metric spaces. Consider a metric space (M, d) with Borel sigma algebra B(M), so that (M, B(M)) is a measurable space. Let P(M) denote the set of probability measures on (M, B(M)). Then for any pair of probability measures μ and ν in P(M), we can define the energy distance in terms of the associated random variables X and Y and the metric d as the square root of

$$D^2(\mu, \nu) = 2E[d(X, Y)] - E[d(X, X')] - E[d(Y, Y')],$$

provided that D²(μ, ν) ≥ 0. However, in general D²(μ, ν) can be negative. In order that D is a metric, it is necessary and sufficient that (M, d) is strongly negative definite (see Lyons20). When (M, d) is strongly negative definite, the energy distance D(μ, ν) equals zero if and only if the distributions are equal. A commonly applied metric that is negative definite but not strongly negative definite is the taxicab metric in ℝ². Lyons20 showed that all separable Hilbert spaces (and in particular Euclidean spaces) have strong negative type.

CONCLUSION

Energy distance is a powerful tool for multivariate analysis. It applies to random vectors in arbitrary dimensions, and the methodology requires only the mild assumption of finite first moments, or at least finite α moments for some positive α. Computing formulas are simple, and the tests have been implemented by nonparametric methods using resampling or Monte Carlo methods. We have illustrated the use of the functions in the energy package24 for several of the methods. The package is open source and distributed under general public license. To scale up to big data problems, for problems such as cluster analysis, one could apply a 'divide and recombine' (D&R) analysis.

Readers may also refer to several interesting applications in a variety of disciplines, under Further Reading. Our review of the background and applications of energy distance is not an exhaustive bibliography, but is intended as a starting point.

FURTHER READING

Dueck J, Edelmann D, Gneiting T, Richards D. The affinely invariant distance correlation. Bernoulli 2014, 20:2305–2330.

Feuerverger A. A consistent test for bivariate dependence. Int Stat Rev 1993, 61:419–433.


Székely GJ, Rizzo ML. On the uniqueness of distance covariance. Stat Probab Lett 2012, 82:2278–2282. doi: 10.1016/j.spl.2012.08.007.

Wahba G. Positive definite functions, reproducing kernel Hilbert spaces and all that. The Fisher Lecture at JSM 2014, 2014. Available at: http://www.stat.wisc.edu/wahba/talks1/fisher.14/wahba.fisher.7.11.pdf.

Gneiting T, Raftery AE. Strictly proper scoring rules, prediction, and estimation. J Am Stat Assoc 2007, 102:359–378. doi: 10.1198/016214506000001437.

Gretton A. A simpler condition for consistency of a kernel independence test, 2015. Available at: http://arxiv.org/abs/1501.06103v1.

Gretton A, Györfi L. Consistent nonparametric tests of independence. J Mach Learn Res 2010, 11:1391–1423.

Kong J, Wang S, Wahba G. Using distance covariance for improved variable selection with application to learning genetic risk models. Stat Med 2015, 34/10:1097–1258. doi: 10.1002/sim.6441.

Kong J, Klein BEK, Klein R, Lee KE, Wahba G. Using distance correlation and SS-ANOVA to assess associations of familial relationships, lifestyle factors, diseases, and mortality. Proc Natl Acad Sci USA 2012, 109:20352–20357. doi: 10.1073/pnas.1217269109.

Li R, Zhong W, Zhu L. Feature screening via distance correlation learning. J Am Stat Assoc 2012, 107:1129–1139. doi: 10.1080/01621459.2012.695654.

Martinez-Gomez E, Richards MT, Richards DSP. Distance correlation methods for discovering associations in large astrophysical databases. Astrophys J 2014, 781:39.

Kim AY, Marzban C, Percival DB, Stuetzle W. Using labeled data to evaluate change detectors in a multivariate streaming environment. Signal Process 2009, 89/12:2529–2536. doi: 10.1016/j.sigpro.2009.04.011.

Menshenin DD, Zubkov AM. Properties of the Szekely-Mori symmetry criterion statistics in the case of binary vectors. Math Notes 2012, 91:62–72.

Székely GJ, Móri TF. A characteristic measure of asymmetry and its application for testing diagonal symmetry. Commun Stat Theory Methods 2001, 30:1633–1639.

Székely GJ, Bakirov NK. Extremal probabilities for Gaussian quadratic forms. Probab Theory Relat Fields 2003, 126:184–202.

Varin T, Bureau R, Mueller C, Willett P. Clustering files of chemical structures using the Szekely-Rizzo generalization of Ward's method. J Mol Graphics Modell 2009, 28/2:187–195. doi: 10.1016/j.jmgm.2009.06.006.

Zhou Z. Measuring nonlinear dependence in time-series, a distance correlation approach. J Time Ser Anal 2012, 33:438–457.

REFERENCES

1. Székely GJ. Potential and Kinetic Energy in Statistics, Lecture Notes, Budapest Institute of Technology (Technical University), 1989.

2. Székely GJ. E-statistics: The Energy of Statistical Samples, Technical Report, Bowling Green State University, Department of Mathematics and Statistics No. 02–16, and Technical Reports by the same title from 2000–2003, e.g., No. 03-05 and NSA grant # MDA 904-02-1-s0091 (2000–2002), 2002.

3. Székely GJ, Rizzo ML. Energy statistics: statistics based on distances. J Stat Plann Infer 2013, 143:1249–1272. doi: 10.1016/j.jspi.2013.03.018.

4. Cramér H. On the composition of elementary errors. Skand Aktuar 1928, 11:141–180.

5. Zinger AA, Kakosyan AV, Klebanov LB. Characterization of distributions by means of mean values of some statistics in connection with some probability metrics. In: Stability Problems for Stochastic Models. Moscow: VNIISI, 1989, 47–55 (in Russian). English translation: A characterization of distributions by mean values of statistics and certain probabilistic metrics, Journal of Soviet Mathematics (1992), 2012.

6. Mattner L. Strict negative definiteness of integrals via complete monotonicity of derivatives. Trans Am Math Soc 1997, 349:3321–3342.

7. Morgenstern D. Proof of a conjecture by Walter Deubner concerning the distance between points of two types in Rd. Discrete Math 2001, 226:347–349.

8. Rizzo ML. A new rotation invariant goodness-of-fit test, PhD dissertation, Bowling Green State University, 2002.


9. Székely GJ, Rizzo ML. A new test for multivariate normality. J Multivar Anal 2005, 93:58–80.

10. Baringhaus L, Franz C. On a new multivariate two-sample test. J Multivar Anal 2004, 88:190–206.

11. Rizzo ML. A test of homogeneity for two multivariate populations. In: 2002 Proceedings of the American Statistical Association, Physical and Engineering Sciences Section. Alexandria, VA: American Statistical Association, 2003.

12. Székely GJ, Rizzo ML. Testing for equal distributions in high dimension. InterStat 2004, Nov.

13. Rizzo ML, Székely GJ. DISCO analysis: a nonparametric extension of analysis of variance. Ann Appl Stat 2010, 4:1034–1055.

14. Yang G. The energy goodness-of-fit test for univariate stable distributions. PhD Thesis, Bowling Green State University, 2012.

15. Rizzo ML. New goodness-of-fit tests for Pareto distributions. ASTIN Bull 2009, 39:691–715.

16. Li Y. Goodness-of-fit tests for Dirichlet distributions with applications. PhD Thesis, Bowling Green State University, 2015.

17. Székely GJ, Rizzo ML. Hierarchical clustering via joint between-within distances: extending Ward's minimum variance method. J Classif 2005, 22:151–183.

18. Li S. k-groups: a generalization of k-means by energy distance. PhD Thesis, Bowling Green State University, 2015.

19. Baringhaus L, Franz C. Rigid motion invariant two-sample tests. Stat Sin 2010, 20:1333–1361.

20. Lyons R. Distance covariance in metric spaces. Ann Probab 2013, 41:3284–3305.

21. Sejdinovic D, Gretton A, Sriperumbudur B, Fukumizu K. Hypothesis testing using pairwise distances and associated kernels. In: Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Available at: http://arxiv.org/abs/1205.0411v2.

22. Sejdinovic D, Sriperumbudur B, Gretton A, Fukumizu K. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann Stat 2013, 41:2263–2291.

23. Klebanov LB. N-distances and Their Applications. Prague: Karolinum Press, Charles University; 2005.

24. Rizzo ML, Székely GJ. energy: E-statistics (energy statistics). R package version 1.6.2, 2014. Available at: http://CRAN.R-project.org/package=energy.

25. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2015. Available at: http://www.R-project.org/.

26. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. Boca Raton, FL: Chapman & Hall/CRC; 1993.

27. Davison AC, Hinkley DV. Bootstrap Methods and their Application. Cambridge: Cambridge University Press; 1997.

28. Székely GJ, Rizzo ML, Bakirov NK. Measuring and testing independence by correlation of distances. Ann Stat 2007, 35:2769–2794. doi: 10.1214/009053607000000505.

29. Székely GJ, Rizzo ML. Brownian distance covariance. Ann Appl Stat 2009, 3:1236–1265. doi: 10.1214/09-AOAS312.

30. Székely GJ, Rizzo ML. The distance correlation t-test of independence in high dimension. J Multivar Anal 2013, 117:193–213. doi: 10.1016/j.jmva.2013.02.012.

31. Székely GJ, Rizzo ML. Partial distance correlation with methods for dissimilarities. Ann Stat 2014, 42:2382–2412.

32. Huo X, Székely G. Fast computing for distance covariance. Technometrics 2015. doi: 10.1080/00401706.2015.1054435.

33. Rizzo ML, Székely GJ. pdcor: Partial distance correlation. R package version 1.0.0, 2014.

34. Henze N, Zirkler B. A class of invariant consistent tests for multivariate normality. Commun Stat Theory Methods 1990, 19:3595–3618.

35. Henze N, Wagner T. A new approach to the BHEP tests for multivariate normality. J Multivar Anal 1997, 62:1–23.
