Partial Distance Correlation

Partial Distance Correlation Gábor J. Székely Rényi Institute of the Hungarian Academy of Sciences Columbia University, May 2, 2014


Page 1: Partial Distance Correlation

Partial Distance Correlation

Gábor J. Székely Rényi Institute of the Hungarian Academy of Sciences

Columbia University, May 2, 2014

Page 2: Partial Distance Correlation

Abstract

Invariance problems of classical dependence measures like Pearson's classical correlation led me about ten years ago to introduce a new dependence measure, distance correlation, based on a generalization of Newton's gravitational potential energy. Distance covariance is the covariance of double centered distances between data in metric spaces. For random variables with finite expectation, the population value of distance covariance and the corresponding distance correlation is zero if and only if the variables are independent. Until recently the definition of partial distance correlation remained an open problem. This can be solved by defining a new Hilbert space of distance matrices where the inner product corresponds to distance covariance. In this Hilbert space, classical theorems on partial correlation of multivariate Gaussian random variables can be revitalized and proved for the general case. Applications include variable selection, dissimilarities, dimension reduction, etc.

References

Székely, G. J. (1985-2005) Technical Reports on Energy (E-)statistics and on distance correlation.

Székely, G. J., Rizzo, M. L. and Bakirov, N. K. (2007) Measuring and testing independence by correlation of distances, Ann. Statistics 35/6, 2769-2794.

Székely, G. J. and Rizzo, M. L. (2009) Brownian distance covariance, Discussion paper, Ann. Applied Statistics 3/4, 1236-1265.

Lyons, R. (2013) Distance covariance in metric spaces, Ann. Probability 41/5, 3284-3305.

Székely, G. J. and Rizzo, M. L. (2013) Energy statistics: statistics based on distances, Invited Paper, J. Statistical Planning and Inference 143/8, 1249-1272.

Page 3: Partial Distance Correlation

A. N. Kolmogorov: “Independence is the most important notion of probability theory”

What is Pearson’s correlation?

Sample: $(X_k, Y_k)$, $k = 1, 2, \dots, n$. Centered sample: $A_k = X_k - \bar{X}$, $B_k = Y_k - \bar{Y}$.

$\mathrm{cov}(x,y) = \frac{1}{n}\sum_k A_k B_k$

$r := \mathrm{cor}(x,y) = \mathrm{cov}(x,y) / [\mathrm{cov}(x,x)\,\mathrm{cov}(y,y)]^{1/2}$
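The definition above translates almost line by line into code. The following is a minimal sketch in plain Python; the function name `pearson_r` and the sample data are illustrative choices, not part of the slides. The example sample is fully dependent (Y is a function of X), yet its Pearson correlation vanishes, which is the kind of behaviour the slide's closing question points at.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation built from the centered samples A_k, B_k."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    a = [xk - x_bar for xk in x]            # A_k = X_k - mean(X)
    b = [yk - y_bar for yk in y]            # B_k = Y_k - mean(Y)
    cov_xy = sum(ak * bk for ak, bk in zip(a, b)) / n
    cov_xx = sum(ak * ak for ak in a) / n
    cov_yy = sum(bk * bk for bk in b) / n
    return cov_xy / math.sqrt(cov_xx * cov_yy)

# Illustrative data: Y = X^2 on a symmetric sample (dependent, but uncorrelated).
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [v * v for v in x]
print(pearson_r(x, y))
```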

(i) De Moivre (1738), The Doctrine of Chances, introduces the notion of independent events.

(ii) Gauss (1823): normal surface with n correlated variables; for Gauss this was just one of the several parameters.

(iii) Auguste Bravais (1846) referred to one of the parameters of the bivariate normal distribution as “une correlation” but, like Gauss, he did not recognize the importance of correlation as a measure of dependence between variables. [Analyse mathématique sur les probabilités des erreurs de situation d'un point. Mémoires présentés par divers savants à l'Académie royale des sciences de l'Institut de France, 9, 255-332.]

(iv) Francis Galton (1885-1888)

(v) Karl Pearson (1895): product-moment r. [LIII. On lines and planes of closest fit to systems of points in space, Philosophical Magazine Series 6, 1901.] Pearson had no unpublished thoughts.

Why do we (NOT) like Pearson’s correlation? What is the remedy?

Page 4: Partial Distance Correlation

A. Rényi (1959): 7 natural axioms of dependence measures.

Axiom 4. ρ(X, Y) = 0 iff X, Y are independent.
Axiom 5. For 1-1 f and g, ρ(X,Y) = ρ(f(X),g(Y)).
Axiom 7. For bivariate normal, ρ = |cor|.

Thm (Rényi). The 7 axioms are satisfied by the maximal correlation only.

Definition of max cor: $\sup_{f,g} \mathrm{Cor}(f(X), g(Y))$ over all Borel functions f, g with $0 < \mathrm{Var}\, f(X),\ \mathrm{Var}\, g(Y) < \infty$.

Corollary of Rényi’s thm. Forget the topic of dependence measures! I did it until 2005.

Why should we (not) like max cor?

For partial sums of iid random variables, $\mathrm{maxcor}^2(S_m, S_n) = m/n$ for $m \le n$.

For $0 \le i \le j \le n$, for the order statistics, $\mathrm{maxcor}^2(X_{i:n}, X_{j:n}) = i(n+1-j)/[j(n+1-i)]$ (Székely, G. J. and Móri, T. F. 1985, Letters). Hint: Jacobi polynomials. Sarmanov (1958) Dokl. Nauk. SSSR

Page 5: Partial Distance Correlation

What is wrong with max cor ?

Page 6: Partial Distance Correlation

Székely (2005) Distance correlation

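Data: for $k = 1, 2, \dots, n$ we have $(X_k, Y_k)$.

(i) Compute their distances (this is the next level of abstraction):

$a_{k,l} := |X_k - X_l|$, $b_{k,l} := |Y_k - Y_l|$ for $k, l = 1, 2, \dots, n$

(ii) Double center these distances:

$A_{k,l} := a_{k,l} - a_{k\cdot} - a_{\cdot l} + a_{\cdot\cdot}$ and $B_{k,l} := b_{k,l} - b_{k\cdot} - b_{\cdot l} + b_{\cdot\cdot}$,

where $a_{k\cdot}$, $a_{\cdot l}$ and $a_{\cdot\cdot}$ denote the row, column and grand means of the distance matrix.

(iii) Distance covariance:

$\mathrm{dCov}^2(X,Y) := V^2(X,Y) := \mathrm{dcov}(X,Y) := \frac{1}{n^2}\sum_{k,l} A_{k,l} B_{k,l} \ge 0$ (!?!)

See Székely, G. J., Rizzo, M. L. and Bakirov, N. K. (2007) Ann. Statist. 35/6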
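Steps (i) to (iii) can be sketched in a few lines of plain Python for real-valued samples. The names `double_center` and `dcov_sq` are illustrative, and the data are the same toy sample with Y = X², for which Pearson's correlation is zero while the distance covariance is strictly positive.

```python
def double_center(d):
    """A[k][l] = d[k][l] - (row mean k) - (column mean l) + (grand mean)."""
    n = len(d)
    row = [sum(r) / n for r in d]
    col = [sum(d[k][l] for k in range(n)) / n for l in range(n)]
    grand = sum(row) / n
    return [[d[k][l] - row[k] - col[l] + grand for l in range(n)]
            for k in range(n)]

def dcov_sq(x, y):
    """(1/n^2) * sum over k, l of A[k][l] * B[k][l] for real-valued samples."""
    n = len(x)
    a = double_center([[abs(xk - xl) for xl in x] for xk in x])   # step (i), (ii)
    b = double_center([[abs(yk - yl) for yl in y] for yk in y])
    return sum(a[k][l] * b[k][l] for k in range(n) for l in range(n)) / n**2

# Illustrative data: Y = X^2, so X and Y are dependent but uncorrelated.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [v * v for v in x]
print(dcov_sq(x, y))
```

The cost is quadratic in n because every pair of observations contributes one distance.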

Page 7: Partial Distance Correlation

Population (probability) definition of dCov

$(X,Y)$, $(X',Y')$, $(X'',Y'')$ are iid.

$\mathrm{dcov}(X,Y) = E[|X-X'|\,|Y-Y'|] + E|X-X'|\,E|Y-Y'| - E[|X-X'|\,|Y-Y''|] - E[|X-X''|\,|Y-Y'|]$

$\mathrm{dcov} = \mathrm{cov}(|X-X'|, |Y-Y'|) - 2\,\mathrm{cov}(|X-X'|, |Y-Y''|)$

Declaration of Dependence: we have dependence iff dcov is not zero.
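That the two population expressions for dcov agree can be checked directly. The short derivation below uses only the assumption already stated on this slide, that the three pairs are iid, so that primed and double-primed copies can be relabelled freely.

```latex
\begin{align*}
\operatorname{cov}(|X-X'|,|Y-Y'|)
  &= E\big[|X-X'|\,|Y-Y'|\big] - E|X-X'|\,E|Y-Y'|, \\
\operatorname{cov}(|X-X'|,|Y-Y''|)
  &= E\big[|X-X'|\,|Y-Y''|\big] - E|X-X'|\,E|Y-Y'|, \\
\operatorname{cov}(|X-X'|,|Y-Y'|) - 2\operatorname{cov}(|X-X'|,|Y-Y''|)
  &= E\big[|X-X'|\,|Y-Y'|\big] + E|X-X'|\,E|Y-Y'|
     - 2E\big[|X-X'|\,|Y-Y''|\big].
\end{align*}
```

Since $E[|X-X'|\,|Y-Y''|] = E[|X-X''|\,|Y-Y'|]$ (swap the roles of the primed and double-primed pairs), the last line is exactly the four-term expression.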

Page 8: Partial Distance Correlation

Pearson vs Distance Correlation

Pearson's correlation (cor): constraints
• 1 Linear dependence
• 2 Two random variables
• 3 Under normality, cor = 0 ⇔ independence

Distance correlation R is more effective:
• 1 Any dependence
• 2 dcor(X;Y) is defined for X and Y in arbitrary dimensions
• 3 dcor(X;Y) = 0 ⇔ independence for arbitrary distribution
• 4 If first we take the α > 0 powers of distances, then for the existence of the population value it is enough to suppose that we have finite α moments.
• 5 dcor(X,Y) has the same geometric interpretation as Pearson’s cor = cos φ (φ = angle between X and Y): dcor = cos φ where φ = angle between the distance matrices in their Hilbert space.

dcor = R is easy to compute, even in high school. Teach it!
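To illustrate points 2 and 5, here is a minimal sketch that computes the cosine of the angle between two double-centered Euclidean distance matrices, for samples whose coordinates may live in different dimensions. The function names are illustrative. In the notation of the 2007 paper this cosine is the squared sample distance correlation; whether one reports it or its square root is a convention, and the sketch returns the cosine itself. It needs Python 3.8+ for `math.dist`.

```python
import math

def centered_distances(points):
    """Double-centered Euclidean distance matrix of a list of equal-length tuples."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    row = [sum(r) / n for r in d]       # d is symmetric: row means = column means
    grand = sum(row) / n
    return [[d[k][l] - row[k] - row[l] + grand for l in range(n)]
            for k in range(n)]

def inner(a, b):
    """(1/n^2) * sum of elementwise products, i.e. the sample dcov of step (iii)."""
    n = len(a)
    return sum(a[k][l] * b[k][l] for k in range(n) for l in range(n)) / n**2

def dcor_cos(x, y):
    """Cosine of the angle between the two double-centered distance matrices."""
    a, b = centered_distances(x), centered_distances(y)
    denom = math.sqrt(inner(a, a) * inner(b, b))
    return inner(a, b) / denom if denom > 0 else 0.0

# Illustrative data: X in the plane, Y on the line.
x = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0), (2.0, 2.0)]
y = [(u * u + v,) for u, v in x]
print(dcor_cos(x, y))
```

Rescaling and shifting a sample multiplies all its distances by a constant, so the cosine is unchanged; this is the scale and rigid motion invariance mentioned on the next slide.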

Page 9: Partial Distance Correlation

Why distance? Why distance correlation?

Why distance? Distance eliminates dimension problems.

Distance correlation has the following properties:
• 0 ≤ dcor(X,Y) ≤ 1, with = 0 iff X, Y are independent and = 1 iff X, Y are linearly dependent
• dcor is rigid motion and scale invariant
• dcor is simple to compute: O(n^2) operations

Why not maximal correlation? Too invariant! (= 1 too often, even for uncorrelated variables)

Distance correlation ≤ 1/√2 < 0.71 for uncorrelated variables.

Prove it or disprove it!

Page 10: Partial Distance Correlation

The dual space

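Thm: $\mathrm{dCov}(X,Y) = \|f(s,t) - f(s)f(t)\|$

where $f(s,t)$ is the joint characteristic function of (X, Y), $f(s)$ and $f(t)$ are the marginal characteristic functions, and $\|\cdot\|$ is the $L_2$-norm with the singular kernel $w(s,t) := c/(st)^2$.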

WHY is this true?

Page 11: Partial Distance Correlation

A beautiful theorem of Fourier transforms

$\int \frac{1-\cos(tx)}{t^2}\,dt = c\,|x|$

The Fourier transform of any power of |t| is a constant times a power of |x|.
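For real x the constant can be made explicit with one substitution. The derivation below is a sketch for the one-dimensional case only, integrating over the whole real line and using the standard value of the integral of $(1-\cos u)/u^2$.

```latex
\begin{align*}
\int_{-\infty}^{\infty} \frac{1-\cos(tx)}{t^{2}}\,dt
  &= |x| \int_{-\infty}^{\infty} \frac{1-\cos u}{u^{2}}\,du
     && (u = t\,|x|,\ x \neq 0) \\
  &= \pi\,|x| .
\end{align*}
```

So in dimension one $c = \pi$: distances $|x|$ appear as weighted integrals of $1 - \cos(tx)$, which is how a statistic defined through distances acquires a characteristic-function representation.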

Gel’fand, I. M. and Shilov, G. E. (1958, 1964), Generalized Functions.

See also Feuerverger, A. (1993) for a bivariate test of independence.

Page 12: Partial Distance Correlation

Uniqueness of the kernel

If X is p-dimensional and Y is q-dimensional, then the kernel $w(s,t) := c/(|s|^{p+1}|t|^{q+1})$ is unique if dcov(X,Y) is rigid motion invariant and scale equivariant (implying that dcor(X,Y) is invariant with respect to rigid motions and with respect to similarities).

Proof: G. J. Székely and M. L. Rizzo (2012). On the uniqueness of distance covariance. Statistics & Probability Letters, Volume 82, Issue 12, 2278-2282.

Page 13: Partial Distance Correlation

Why is pdCor difficult?

pdcor is more complex than pcor because
• squared distance covariance is NOT an inner product in the usual linear space
• the “residuals” (differences of certain distance matrices) are typically not distance matrices

We need to introduce a new Hilbert space where we can “interpret” the residuals.

Page 14: Partial Distance Correlation
Page 15: Partial Distance Correlation

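$a_{k,l} := |X_k - X_l|$, $b_{k,l} := |Y_k - Y_l|$ for $k, l = 1, 2, \dots, n$

$A_{k,l} := a_{k,l} - a_{k\cdot} - a_{\cdot l} + a_{\cdot\cdot}$, $B_{k,l} := b_{k,l} - b_{k\cdot} - b_{\cdot l} + b_{\cdot\cdot}$

(Biased) $\mathrm{dcov}_n(X,Y) := \frac{1}{n^2}\sum_{k,l} A_{k,l} B_{k,l}$

$A^*_{k,k} := 0$ and for $k \ne l$: $A^*_{k,l} := a_{k,l} - \frac{n}{n-2}\,a_{k\cdot} - \frac{n}{n-2}\,a_{\cdot l} + \frac{n^2}{(n-1)(n-2)}\,a_{\cdot\cdot}$

Unbiased $\mathrm{dcov}^*_n(X,Y) := \frac{1}{n(n-3)}\sum_{k,l} A^*_{k,l} B^*_{k,l}$

The corresponding distance correlation is $R^*(X,Y)$.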
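The modified centering and the unbiased estimator above can be sketched as follows for real-valued samples with n ≥ 4. The function names are illustrative; the row and grand quantities are means, matching the $n/(n-2)$ and $n^2/[(n-1)(n-2)]$ factors in the definition of $A^*$.

```python
import math

def u_center(x):
    """A*: zero diagonal, modified double centering off the diagonal."""
    n = len(x)
    d = [[abs(xk - xl) for xl in x] for xk in x]
    row = [sum(r) / n for r in d]          # a_{k.} (equals a_{.k} by symmetry)
    grand = sum(row) / n                   # a_{..}
    a = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for l in range(n):
            if k != l:
                a[k][l] = (d[k][l]
                           - n / (n - 2) * row[k]
                           - n / (n - 2) * row[l]
                           + n * n / ((n - 1) * (n - 2)) * grand)
    return a

def dcov_star(a, b):
    """[1 / (n (n - 3))] * sum of A*[k][l] * B*[k][l]; the diagonal is zero."""
    n = len(a)
    return sum(a[k][l] * b[k][l]
               for k in range(n) for l in range(n) if k != l) / (n * (n - 3))

def r_star(x, y):
    """Bias-corrected distance correlation R*; unlike dcor it may be negative."""
    a, b = u_center(x), u_center(y)
    denom = math.sqrt(dcov_star(a, a) * dcov_star(b, b))
    return dcov_star(a, b) / denom if denom > 0 else 0.0

# Illustrative data.
x = [0.0, 1.0, 3.0, 6.0, 10.0, 15.0]
y = [2.0, 7.0, 1.0, 8.0, 2.0, 8.0]
print(r_star(x, y))
```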

Page 16: Partial Distance Correlation

Bias corrected distance correlation

The power of dCor test for independence is very good especially for high dimensions p,q

Denote the unbiased version by $\mathrm{dcov}^*_n$.

The corresponding bias corrected distance correlation is $R^*_n$. This is the correlation for the 21st century.

Theorem. In high dimension, if the CLT holds for the coordinates, then $T_n := \sqrt{M-1}\; R^*_n / \sqrt{1-(R^*_n)^2}$, with $M = n(n-3)/2$, is t-distributed with d.f. $M-1$.
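Given a value of $R^*_n$ and the sample size, the test statistic of the theorem is a one-line computation. The sketch below only evaluates $T_n$ and the degrees of freedom; the function names are illustrative, and it assumes $|R^*_n| < 1$ and n ≥ 4.

```python
import math

def dcor_t_statistic(r_star, n):
    """T_n = sqrt(M - 1) * R* / sqrt(1 - R*^2) with M = n (n - 3) / 2."""
    m = n * (n - 3) / 2
    return math.sqrt(m - 1) * r_star / math.sqrt(1 - r_star ** 2)

def dcor_t_df(n):
    """Degrees of freedom M - 1 of the reference t distribution."""
    return n * (n - 3) / 2 - 1

print(dcor_t_statistic(0.5, 10), dcor_t_df(10))
```

An upper-tail p-value could then be read from a Student t distribution function, for example `scipy.stats.t.sf(T, df)` if SciPy is available.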

Page 17: Partial Distance Correlation

Additive constant invariance

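$A^*_{k,l} := a_{k,l} - \frac{n}{n-2}\,a_{k\cdot} - \frac{n}{n-2}\,a_{\cdot l} + \frac{n^2}{(n-1)(n-2)}\,a_{\cdot\cdot}$

Add a constant c to all off-diagonal elements; the resulting change in $A^*_{k,l}$ is

$c - \frac{n-1}{n-2}\,c - \frac{n-1}{n-2}\,c + \frac{n(n-1)}{(n-1)(n-2)}\,c = 0$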
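The coefficients in this cancellation come from how the row mean and the grand mean react to the shift. A short derivation, assuming as above that $a_{k\cdot}$ and $a_{\cdot\cdot}$ are means and that the diagonal stays at zero:

```latex
\begin{align*}
a_{k\cdot} = \frac{1}{n}\sum_{l} a_{k,l}
  &\;\longmapsto\; a_{k\cdot} + \frac{n-1}{n}\,c, \\
a_{\cdot\cdot} = \frac{1}{n^{2}}\sum_{k,l} a_{k,l}
  &\;\longmapsto\; a_{\cdot\cdot} + \frac{n-1}{n}\,c, \\
\Delta A^{*}_{k,l}
  &= c - 2\cdot\frac{n}{n-2}\cdot\frac{n-1}{n}\,c
       + \frac{n^{2}}{(n-1)(n-2)}\cdot\frac{n-1}{n}\,c \\
  &= c\,\frac{(n-2) - 2(n-1) + n}{n-2} = 0 .
\end{align*}
```

Each row has $n-1$ off-diagonal entries, which is where the factor $(n-1)/n$ comes from.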

Every symmetric, 0 diagonal matrix (dissimilarity matrix) becomes a distance matrix once a big enough constant c is added to its off-diagonal elements.

Denote by $H_n$ the Hilbert space of n×n symmetric, 0 diagonal matrices where the inner product is $\mathrm{dcov}^*_n(X,Y)$. In $H_n$ we can project, we have orthogonal residuals, and their $\mathrm{dcor}_n$ is $\mathrm{pdcor}_n$.

Page 18: Partial Distance Correlation

Dissimilarities

Thm. All dissimilarities are $H_n$ equivalent to distance matrices.

Proof. Multidimensional scaling combined with the additive constant theorem.

Cailliez, F. (1983). The analytical solution of the additive constant problem. Psychometrika, 48, 343-349.

Page 19: Partial Distance Correlation

Mantel test

How to “Dismantel” the Mantel test (1967)?

Mantel: test of the correlation between two dissimilarity matrices of the same rank. This is commonly used in ecology.

The various papers introducing the Mantel test and its extension, the partial Mantel test, lack a clear statistical framework specifying fully the null and alternative hypotheses.

$\mathrm{dcov}(X,Y) = \mathrm{cov}(|X-X'|, |Y-Y'|) - 2\,\mathrm{cov}(|X-X'|, |Y-Y''|)$

The first term is what Mantel applies, but $\mathrm{cov}(|X-X'|, |Y-Y'|) = 0$ does not characterize independence of X and Y: $|f(s,t)| - |f(s)f(t)| \equiv 0$ does not imply $f(s,t) - f(s)f(t) \equiv 0$.

Instead of Mantel, apply the bias corrected $R^*_n$.

Page 20: Partial Distance Correlation

Population coefficients

$A^*(x,x) := 0$

and for $x \ne x'$ define

$A^*(x,x') := |x-x'| - \frac{E|x-X'|}{\Pr(X' \ne x)} - \frac{E|X-x'|}{\Pr(X \ne x')} + \frac{E|X-X'|}{\Pr(X \ne X')}$

whenever the denominators are not 0. If any of the denominators is 0 then by definition $A^*(x,x') := 0$.

$A^* := A^*(X,X')$.

Finally, $\mathrm{dCov}^*(X,Y) := E(A^* B^*)$.

The population Hilbert space generated by A*’s with this inner product is H.

The bias corrected distance correlation computed from dCov* is R*.

Page 21: Partial Distance Correlation

How to compute pdCor?

Exactly the same way as we compute pcor:

pdCor(X,Y;Z) =[R*(X,Y) – R*(X,Z)R*(Y,Z)]/...

but in case of pcor this formula is valid only for real X, Y, Z. The pdCor formula is valid for all X, Y, Z in arbitrary (not necessarily the same) dimensions.
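The claim that pdCor is computed "the same way as pcor" can be checked numerically: project the U-centered matrices of X and Y onto the orthogonal complement of the matrix of Z, take the cosine of the residuals, and compare with the correlation formula. The sketch below assumes that the denominator elided on the slide is the usual partial-correlation one, $\sqrt{(1-R^*(X,Z)^2)(1-R^*(Y,Z)^2)}$; all function names are illustrative and the samples are real-valued with n ≥ 4.

```python
import math

def u_center(x):
    """U-centered distance matrix A* of a real-valued sample."""
    n = len(x)
    d = [[abs(xk - xl) for xl in x] for xk in x]
    row = [sum(r) for r in d]
    total = sum(row)
    a = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for l in range(n):
            if k != l:
                a[k][l] = (d[k][l] - row[k] / (n - 2) - row[l] / (n - 2)
                           + total / ((n - 1) * (n - 2)))
    return a

def inner(a, b):
    """Inner product [1 / (n (n - 3))] * sum over k != l of a[k][l] * b[k][l]."""
    n = len(a)
    return sum(a[k][l] * b[k][l]
               for k in range(n) for l in range(n) if k != l) / (n * (n - 3))

def cosine(a, b):
    den = math.sqrt(inner(a, a) * inner(b, b))
    return inner(a, b) / den if den > 0 else 0.0

def residual(a, c):
    """Orthogonal projection of a onto the complement of c."""
    cc = inner(c, c)
    coef = inner(a, c) / cc if cc > 0 else 0.0
    n = len(a)
    return [[a[k][l] - coef * c[k][l] for l in range(n)] for k in range(n)]

def pdcor_projection(x, y, z):
    a, b, c = u_center(x), u_center(y), u_center(z)
    return cosine(residual(a, c), residual(b, c))

def pdcor_formula(x, y, z):
    a, b, c = u_center(x), u_center(y), u_center(z)
    rxy, rxz, ryz = cosine(a, b), cosine(a, c), cosine(b, c)
    den = math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))   # assumed denominator
    return (rxy - rxz * ryz) / den if den > 0 else 0.0

# Illustrative data.
x = [0.0, 1.0, 3.0, 6.0, 10.0, 15.0]
y = [2.0, 7.0, 1.0, 8.0, 2.0, 8.0]
z = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]
print(pdcor_projection(x, y, z), pdcor_formula(x, y, z))
```

The two numbers agree because the identity between the residual cosine and the correlation formula holds in any inner product space.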

Page 22: Partial Distance Correlation

Conditional independence and pdCor = 0 ?

Are they equivalent? In the multivariate normal case pCor = 0 is equivalent to conditional independence, but this cannot be expected in general, even for pdCor = 0, because pdcor = 0 is a global property while conditional independence is local: pdcor = 0 or pcor = 0 has no close ties with conditional independence. Exception: multivariate normal and pcor = 0.

Example: Let Z1, Z2, Z be iid standard normal. Then (X := Z1+Z, Y := Z2+Z, Z) is multivariate normal with cor(X,Y) = 1/2 and cor(X,Z) = cor(Y,Z) = 1/√2, thus cor(X,Y) - cor(X,Z)cor(Y,Z) = 0, hence pCor = 0 and X and Y are conditionally independent given Z. In the bivariate normal case we have a computing formula of dcor from cor. By this formula pdcor(X,Y;Z) = 0.0242. Similarly, pdcor can easily be 0 but pcor ≠ 0.

But who wants to apply distance based methods for multivariate normal where cor, pcor are ideal?
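The value 0.0242 in the example can be reproduced numerically. The sketch below assumes that the "computing formula" is the closed form for the squared distance correlation of a bivariate normal pair with correlation ρ from Székely, Rizzo and Bakirov (2007), and that the resulting values are plugged into the partial correlation formula with the standard denominator. The function name is illustrative.

```python
import math

def dcor_sq_normal(rho):
    """Squared distance correlation of a bivariate normal pair (assumed closed form)."""
    num = (rho * math.asin(rho) + math.sqrt(1 - rho ** 2)
           - rho * math.asin(rho / 2) - math.sqrt(4 - rho ** 2) + 1)
    den = 1 + math.pi / 3 - math.sqrt(3)
    return num / den

r_xy = dcor_sq_normal(0.5)                   # cor(X, Y) = 1/2
r_xz = r_yz = dcor_sq_normal(1 / math.sqrt(2))   # cor(X, Z) = cor(Y, Z) = 1/sqrt(2)

pdcor = (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
print(round(pdcor, 4))   # 0.0242, matching the value quoted in the example
```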

Page 23: Partial Distance Correlation

What if (X, Y, Z) is not Gaussian?

For Gaussian: pcor(X,Y;Z) = 0 implies that the residuals, X - aZ and Y - bZ, are independent. What if (X, Y, Z) is not Gaussian?

New idea: Two rv’s or dissimilarities are equivalent (~) if they have the same A*(x,x’).

Thm. pdcov(X,Y;Z) = 0 implies there exist $\ell_2$-valued rv’s X* ~ AX - AaZ and Y* ~ BY - BbZ such that X* and Y* are independent.

For more details see the preprint: Partial Distance Correlation with Methods for Dissimilarities, on arXiv.

Page 24: Partial Distance Correlation

Applications of pdcor

• Variable selection
• (i) select $x_i$ that maximizes dcor(y, $x_i$)
• (ii) select $x_j$ that maximizes pdcor(y, $x_j$; $x_i$), etc.
• Continue until all remaining pdcor = 0 or ≤ epsilon

Example: prostate cancer and age / Gleason
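The selection procedure above can be sketched as a greedy loop. This is a minimal illustration in plain Python (3.8+ for `math.dist`), not the author's implementation: it uses the bias-corrected $R^*$ in place of dcor, assumes that later steps condition jointly on all variables selected so far, assumes the standard partial-correlation denominator, and uses a hypothetical threshold `eps`. All names are illustrative.

```python
import math

def u_center(points):
    """U-centered Euclidean distance matrix of a list of equal-length tuples."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    row = [sum(r) for r in d]
    total = sum(row)
    a = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for l in range(n):
            if k != l:
                a[k][l] = (d[k][l] - row[k] / (n - 2) - row[l] / (n - 2)
                           + total / ((n - 1) * (n - 2)))
    return a

def r_star(a, b):
    """Bias-corrected distance correlation of two U-centered matrices."""
    n = len(a)
    def ip(u, v):
        return sum(u[k][l] * v[k][l]
                   for k in range(n) for l in range(n) if k != l) / (n * (n - 3))
    den = math.sqrt(ip(a, a) * ip(b, b))
    return ip(a, b) / den if den > 0 else 0.0

def pdcor(a, b, c):
    """Partial distance correlation of a and b given c (assumed denominator)."""
    rab, rac, rbc = r_star(a, b), r_star(a, c), r_star(b, c)
    den = math.sqrt(max(0.0, 1 - rac ** 2) * max(0.0, 1 - rbc ** 2))
    return (rab - rac * rbc) / den if den > 0 else 0.0

def forward_select(y, columns, eps=0.05):
    """Greedy selection: stop when no remaining score exceeds eps (hypothetical)."""
    ay = u_center([(v,) for v in y])
    selected, remaining = [], dict(columns)
    while remaining:
        if selected:
            # Condition jointly on everything selected so far (an assumption).
            z = u_center(list(zip(*(columns[s] for s in selected))))
            score = {name: pdcor(ay, u_center([(v,) for v in col]), z)
                     for name, col in remaining.items()}
        else:
            score = {name: r_star(ay, u_center([(v,) for v in col]))
                     for name, col in remaining.items()}
        best = max(score, key=score.get)
        if score[best] <= eps:
            break
        selected.append(best)
        del remaining[best]
    return selected

# Illustrative data: y coincides with column "a"; column "b" is unrelated.
y = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0, 22.0]
columns = {"a": list(y), "b": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0]}
print(forward_select(y, columns))
```

In practice the stopping rule would more naturally be a pdcor test of significance than a fixed threshold.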

Page 25: Partial Distance Correlation

My Erlangen program in Statistics

Klein, Felix 1872. "A comparative review of recent researches in geometry". This is a classification of geometries via invariances (Euclidean, Similarity, Affine, Projective,…) Klein was then at Erlangen.

Energy statistics are always rigid motion invariant; dcor is also invariant wrt scaling, i.e. invariant wrt the units of measurement (angles remain invariant like in Thales’ geometry of similarities). Thus energy statistics are functions of distances and invariant wrt the ratios of distances, thus they are “rational” statistics. Pythagoras: harmony depends on ratios (of integers). (Affine invariance, etc.??)

Rank statistics are invariant wrt univariate monotone transformations. The importance of a given invariance can be time dependent, e.g. before computers, distribution-free was a crucial invariance.

In case of testing for normality affine invariance is natural but not in testing for independence. Multivariate affine/projective invariant continuous statistics are constant.

BUT dcor = 0 is invariant with respect to all Borel functions. Invariance of the population value is different from invariance of the test statistics.

Maximal correlation is too invariant. Why? Max correlation can easily be 1 for uncorrelated rv’s but the max of dCor for uncorrelated variables is < $2^{-1/2}$ < 0.71 (X = -1, 0, 1 with probabilities ε, 1-2ε, ε, Y := |X|)
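The three-point example in parentheses can be evaluated exactly, since the population quantities are finite sums. The sketch below computes the ratio $V^2(X,Y)/\sqrt{V^2(X)\,V^2(Y)}$ for a finite joint distribution, using the population double centering $|x-x'| - E|x-X'| - E|X-x'| + E|X-X'|$. It assumes that this ratio (the squared distance correlation in the 2007 paper's notation) is the quantity the bound refers to; the function names and the choice ε = 0.01 are illustrative.

```python
import math

def pop_centered(values, probs):
    """Population double centering over the atoms of a finite distribution."""
    m = len(values)
    d = [[abs(u - v) for v in values] for u in values]
    row = [sum(probs[j] * d[i][j] for j in range(m)) for i in range(m)]   # E|x_i - X'|
    grand = sum(probs[i] * row[i] for i in range(m))                      # E|X - X'|
    return [[d[i][j] - row[i] - row[j] + grand for j in range(m)]
            for i in range(m)]

def dcor_ratio(xs, ys, probs):
    """V^2(X,Y) / sqrt(V^2(X) V^2(Y)) when (X, Y) = (xs[i], ys[i]) w.p. probs[i]."""
    a = pop_centered(xs, probs)
    b = pop_centered(ys, probs)
    m = len(probs)
    def v2(u, w):
        return sum(probs[i] * probs[j] * u[i][j] * w[i][j]
                   for i in range(m) for j in range(m))
    return v2(a, b) / math.sqrt(v2(a, a) * v2(b, b))

eps = 0.01
xs = [-1.0, 0.0, 1.0]
ys = [abs(v) for v in xs]            # Y = |X|, uncorrelated with X
probs = [eps, 1 - 2 * eps, eps]
print(dcor_ratio(xs, ys, probs), 2 ** -0.5)
```

For this family the ratio stays below $2^{-1/2}$ and approaches it as ε tends to 0.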

Page 26: Partial Distance Correlation

Symmetries – invariances -- Energy

Page 27: Partial Distance Correlation

Thank you
