Multivariate Tests Based on Pairwise Distance or Similarity Measures Siegfried Kropf Institute for Biometry and Medical Informatics Otto von Guericke University Magdeburg, Germany The 8th Tartu Conference on Multivariate Statistics The 6th Conference on Multivariate Distributions with Fixed Marginals Tartu, Estonia, 26-29 June 2007


Multivariate Tests Based on Pairwise Distance or Similarity Measures

Siegfried Kropf

Institute for Biometry and Medical Informatics

Otto von Guericke University Magdeburg, Germany

The 8th Tartu Conference on Multivariate Statistics

The 6th Conference on Multivariate Distributions with Fixed Marginals

Tartu, Estonia, 26-29 June 2007


Contents

• Introduction
• Example data - microbial fingerprints
• “Usual” way as multivariate test based on spherically distributed scores (PC scores)
• Test based on pairwise similarity measures
• Comparison of results in example data
• Application to other data
• Extensions of permutation test
• Parametric “rotation” test for small n
• Simulation studies on robustness
• Summary


Introduction

Consider global multivariate comparisons between two or more populations of high-dimensional data:

• Gene expression data (all genes, known groups of genes)
• Neuroimaging
• Genetic fingerprints (e.g. microbial DNA in soil samples)
• …

Formal description: independent sample vectors

$x_{kj} \sim N_p(\mu_k, \Sigma)$, $k = 1, 2, \dots, K$; $j = 1, \dots, n_k$, $p \gg n = n_1 + \dots + n_K$

or, more generally,

$x_{kj} \sim F_k(x)$, $k = 1, 2, \dots, K$; $j = 1, \dots, n_k$, $p \gg n$

Wanted: test for $H_0: \mu_1 = \dots = \mu_K$ or $F_1(x) = \dots = F_K(x) \ \forall x$


Example data - microbial fingerprints

• Question: What is the impact of different (natural or genetically modified) plant cultures on the soil microbial population?
• Extraction of bacterial samples, DNA parts amplified by PCR
• Several samples investigated together in electrophoresis gels (e.g. denaturing gradient gel electrophoresis, DGGE)
• Gels scanned and analyzed with GelCompar, giving a vector of hundreds or thousands of greyscale values per lane.


[Figure: gel image, lanes from left to right: M1, B1 to B4, S1 to S3, M2, P1 to P4, R1 to R4, M3]

Denaturing gradient gel with fingerprints of bacterial communities from rhizosphere soil (lanes S1 to S3: strawberry, P1 to P4: potato, R1 to R4: oilseed rape) and unplanted soil (lanes B1 to B4). Lanes M1 to M3: standard bacterial mix.


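“Usual” way as multivariate test based on spherically distributed scores (PC scores)

Exact test based on spherically distributed scores: PCq test (Läuter et al., 1998)

1. Transformation of raw data vectors into q-dimensional score vectors:

$x_{kj} \to z_{kj} = D' x_{kj}$ $(k = 1, \dots, K;\ j = 1, \dots, n_k)$

with $(p \times q)$-matrix $D$ $(q \ll p)$ from the EVP

$(X - \bar{X})'(X - \bar{X})\, D = D \Lambda$, where $X = (x_{11}, \dots, x_{K n_K})'$ and $\bar{X} = (\bar{x}, \dots, \bar{x})'$,

or better from the dual EVP

$(X - \bar{X})(X - \bar{X})'\, \tilde{D} = \tilde{D} \tilde{\Lambda}$ with $D = (X - \bar{X})' \tilde{D} \tilde{\Lambda}^{-1/2}$ and $\tilde{\lambda}_i > 0$

2. Multivariate test (here Wilks' $\Lambda$) with q-dimensional scores $z_{kj}$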
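The score construction in steps 1 and 2 can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the function names `pc_scores` and `wilks_lambda`, the simulated data and the group design are hypothetical, the normalisation $D'D = I_q$ is one possible choice, and the exact null distribution of Läuter et al. (1998) needed for a p-value is not reproduced here. The point is that $D$ is computed from the total sums of products only, via the small $n \times n$ dual eigenproblem.

```python
import numpy as np

def pc_scores(X, q):
    """Return (Z, D): q-dimensional scores Z = X D and the p x q weight matrix D."""
    Xc = X - X.mean(axis=0)                 # X - Xbar, centred at the overall mean
    G = Xc @ Xc.T                           # n x n matrix of the dual eigenproblem
    eigval, eigvec = np.linalg.eigh(G)      # eigenvalues in ascending order
    top = np.argsort(eigval)[::-1][:q]      # positions of the q largest eigenvalues
    lam, D_dual = eigval[top], eigvec[:, top]
    D = Xc.T @ D_dual / np.sqrt(lam)        # D = (X - Xbar)' D~ Lambda~^(-1/2)
    return X @ D, D

def wilks_lambda(Z, groups):
    """Wilks' Lambda = det(W) / det(T) for scores Z and group labels."""
    groups = np.asarray(groups)
    Zc = Z - Z.mean(axis=0)
    T = Zc.T @ Zc                           # total sums of squares and products
    W = np.zeros_like(T)
    for g in np.unique(groups):
        Zg = Z[groups == g]
        Zg = Zg - Zg.mean(axis=0)
        W += Zg.T @ Zg                      # within-group sums of squares and products
    return np.linalg.det(W) / np.linalg.det(T)

rng = np.random.default_rng(1)
groups = np.repeat([0, 1, 2], 5)            # hypothetical design: 3 groups of 5
X = rng.normal(size=(15, 200))              # p = 200 >> n = 15, simulated under H0
Z, D = pc_scores(X, q=2)
lam_wilks = wilks_lambda(Z, groups)
print(Z.shape, lam_wilks)
```

Working with the $n \times n$ matrix instead of the $p \times p$ matrix keeps the eigenproblem small when $p \gg n$, which is presumably why the slide calls the dual EVP the better route.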


Test based on pairwise similarity measures

1. Calculate pairwise similarity measures between sample elements, e.g., Pearson's r

R =
      B1    B2    B3    B4    S1    S2    S3    P1    P2    P3    P4    R1    R2    R3    R4
B1    1     .965  .982  .977  .904  .948  .845  .954  .957  .945  .915  .898  .949  .855  .866
B2    .965  1     .964  .948  .890  .933  .833  .932  .942  .921  .903  .884  .927  .840  .854
B3    .982  .964  1     .982  .916  .960  .862  .958  .962  .952  .935  .902  .946  .873  .876
B4    .977  .948  .982  1     .907  .953  .849  .957  .958  .954  .935  .902  .943  .864  .864
S1    .904  .890  .916  .907  1     .954  .907  .898  .904  .883  .891  .858  .900  .825  .832
S2    .948  .933  .960  .953  .954  1     .920  .959  .962  .946  .945  .922  .946  .895  .893
S3    .845  .833  .862  .849  .907  .920  1     .861  .870  .847  .874  .840  .858  .827  .832
P1    .954  .932  .958  .957  .898  .959  .861  1     .991  .982  .967  .959  .972  .923  .918
P2    .957  .942  .962  .958  .904  .962  .870  .991  1     .983  .967  .960  .976  .929  .926
P3    .945  .921  .952  .954  .883  .946  .847  .982  .983  1     .962  .955  .977  .939  .929
P4    .915  .903  .935  .935  .891  .945  .874  .967  .967  .962  1     .948  .950  .947  .930
R1    .898  .884  .902  .902  .858  .922  .840  .959  .960  .955  .948  1     .962  .955  .953
R2    .949  .927  .946  .943  .900  .946  .858  .972  .976  .977  .950  .962  1     .932  .945
R3    .855  .840  .873  .864  .825  .895  .827  .923  .929  .939  .947  .955  .932  1     .974
R4    .866  .854  .876  .864  .832  .893  .832  .918  .926  .929  .930  .953  .945  .974  1
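As a minimal sketch of step 1, the matrix of pairwise Pearson correlations between sample elements (lanes) can be obtained with `numpy.corrcoef`, which treats each row of its input as one variable. The data below are simulated stand-ins, not the gel data from the talk, and the dimensions are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 15, 500                           # hypothetical: 15 lanes, 500 greyscale values each
X = rng.gamma(shape=2.0, size=(n, p))    # simulated stand-in for real fingerprint data

# np.corrcoef correlates the ROWS of X, so R is the n x n matrix
# of Pearson correlations between sample elements (not between variables).
R = np.corrcoef(X)

print(R.shape)
print(np.round(R[:3, :3], 3))
```

Note that the correlation here runs across the p variables for each pair of lanes, so R stays small (n × n) no matter how large p is.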


(2. Investigate similarities by cluster analyses, supported by GelCompar)


3. Calculate the test statistic

$d = \bar{r}_{\text{within}} - \bar{r}_{\text{between}}$

(difference between the mean within-group and the mean between-group similarity) and use it as the basis for a permutation test, where in each new permutation step the n = n1 + … + nK sample elements are randomly allocated to the K groups of sizes n1, …, nK (simultaneous exchanges of rows and columns in the correlation matrix).

• The test can be carried out with all groups or pairwise.

• We used random permutations.

• Problems occur with small samples because of the restricted number of permutations, e.g. with two samples of size 4 there are only $\binom{4+4}{4}/2 = 35$ different permutations.

• The permutation test in its basic form is a special case of the Mantel test (Mantel, 1967); a similar application to electrophoresis data is given by Aittokallio et al. (2000).
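The permutation step can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: the function names, the simulated data, the pooling of all within-group and all between-group pairs into two plain means, and the p-value convention (the observed arrangement is counted as one of the permutations) are choices made here. Permuting the group labels is equivalent to exchanging rows and columns of R simultaneously, so R is computed only once.

```python
import numpy as np

def d_statistic(R, groups):
    """Mean within-group similarity minus mean between-group similarity."""
    groups = np.asarray(groups)
    i, j = np.triu_indices(len(groups), k=1)     # each pair of sample elements once
    same = groups[i] == groups[j]
    return R[i, j][same].mean() - R[i, j][~same].mean()

def permutation_test(R, groups, n_perm=4999, seed=0):
    """Random-permutation p-value for large values of d."""
    rng = np.random.default_rng(seed)
    groups = np.asarray(groups)
    d_obs = d_statistic(R, groups)
    hits = 1                                     # the observed allocation counts once
    for _ in range(n_perm):
        # re-allocating labels keeps the group sizes n_1, ..., n_K fixed
        if d_statistic(R, rng.permutation(groups)) >= d_obs - 1e-12:
            hits += 1
    return d_obs, hits / (n_perm + 1)

# Hypothetical example: two groups of 4 with clearly different profiles.
rng = np.random.default_rng(42)
groups = np.repeat([0, 1], 4)
profiles = rng.normal(size=(2, 50))              # one underlying profile per group
X = profiles[groups] + 0.3 * rng.normal(size=(8, 50))
R = np.corrcoef(X)
d_obs, p_value = permutation_test(R, groups)
print(d_obs, p_value)
```

With two groups of size 4 the p-value cannot fall much below 1/35, however strong the group difference, which is the small-sample limitation noted on the slide.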


Comparison of results in example data

Groups       d      PC1    PC2    PC3    PC4    PC5
all groups   <.001  .136   .016   .001   <.001  <.001
B – S        .029   .116   .221   .454   .260   .129
B – P        .029   .038   .011   .042   .046   .150
B – R        .029   .023   .015   .031   .045   .055
S – P        .029   .264   .035   .100   .174   .360
S – R        .029   .392   .175   .135   .159   .016
P – R        .057   .442   .251   .056   .168   .062

p-values for global test and unadjusted pairwise tests

The d version performs quite well but is at its limits here: the smallest attainable pairwise p-value is 1/35 ≈ .029, so no Bonferroni adjustment is possible.
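The remark about Bonferroni can be checked with a short count. The sketch below assumes the group sizes from the gel description (B: 4, S: 3, P: 4, R: 4 lanes) and a familywise level of 0.05 over the six pairwise comparisons; the function name is hypothetical. For two groups of equal size, swapping the two labels leaves d unchanged, so the number of distinct allocations is halved.

```python
from math import comb

def n_distinct_allocations(n1, n2):
    """Number of distinct two-group allocations for a label-symmetric statistic."""
    total = comb(n1 + n2, n1)
    # with equal group sizes, each split and its mirror image give the same d
    return total // 2 if n1 == n2 else total

sizes = {"B": 4, "S": 3, "P": 4, "R": 4}     # lanes per group in the example gel
names = list(sizes)
pairs = [(a, b) for k, a in enumerate(names) for b in names[k + 1:]]
bonferroni_level = 0.05 / len(pairs)         # 6 pairwise comparisons

for a, b in pairs:
    m = n_distinct_allocations(sizes[a], sizes[b])
    print(f"{a} - {b}: {m} allocations, smallest p = {1 / m:.3f}, "
          f"Bonferroni level = {bonferroni_level:.4f}")
```

Both the 4 vs 4 and the 4 vs 3 comparisons give 35 distinct allocations, so no pairwise p-value can reach the adjusted level of about .0083.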


The same data with transformation x := ln(1+x)

Groups       d      PC1    PC2    PC3    PC4    PC5
all groups   <.001  .116   .011   <.001  <.001  <.001
B – S        .029   .098   .119   .288   .155   .290
B – P        .029   .003   .019   .036   .021   .068
B – R        .029   .058   .001   .006   .002   .017
S – P        .029   .837   .104   .091   .253   .451
S – R        .029   .736   .245   .024   .107   .358
P – R        .029   .773   .151   .051   .156   .166

p-values for global test and unadjusted pairwise tests


Application to other data

• Other microbiological fingerprints (DGGE) from soil of four different regions, four samples each

• Gene expression analyses from microarrays

The permutation test based on pairwise correlation coefficients of sample elements performed very well and outperformed the PC test in these examples (Kropf et al., 2007).


Extensions of permutation test

The correlation based test can be extended in different ways (Kropf et al., 2004):

• Inclusion of block designs (e.g., use of several gels, where lanes may not be compared across different gels).

• Comparison of dependent samples (e.g., the same soil samples analyzed with different types of gels).

• Use of other distance or similarity measures instead of r (e.g., z-dot transformation of r, squared Euclidean distance, other distances for binary or ordinal data, …).

High flexibility for applications.
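As a sketch of the third extension, Pearson's r can be swapped for the squared Euclidean distance. The code below is illustrative only: the function names and the `similarity` flag are hypothetical, and the sign convention for distances (between minus within, so that large values still indicate group differences) is an assumption made here, not taken from the cited paper.

```python
import numpy as np

def squared_euclidean(X):
    """n x n matrix of squared Euclidean distances between the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    D2 = np.maximum(D2, 0.0)                 # clip tiny negative rounding errors
    np.fill_diagonal(D2, 0.0)
    return D2

def d_statistic(M, groups, similarity=True):
    """Group-separation statistic for a similarity or a distance matrix M."""
    groups = np.asarray(groups)
    i, j = np.triu_indices(len(groups), k=1)
    same = groups[i] == groups[j]
    diff = M[i, j][same].mean() - M[i, j][~same].mean()
    # similarities: within minus between; distances: between minus within
    return diff if similarity else -diff

rng = np.random.default_rng(3)
groups = np.repeat([0, 1], 4)                # hypothetical two-group design
X = rng.normal(size=(8, 30))
X[groups == 1] += 1.5                        # shift the second group
D2 = squared_euclidean(X)
d_dist = d_statistic(D2, groups, similarity=False)
print(d_dist)
```

Because only the pairwise matrix enters the statistic, the permutation scheme itself does not change when the measure is replaced.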


Parametric “rotation” test for small n

Usual assumptions:

$x_{lj} \sim N_p(\mu_l, \Sigma)$, $l = 1, \dots, k$; $j = 1, \dots, n_l$, independent from each other

under $H_0$: $x_{lj} \sim N_p(\mu, \Sigma)$

As the distribution of the test statistic might be too complicated for a “closed” solution, we look for a Monte Carlo version:

The test statistic is traced back to a left-spherically distributed matrix (in particular, a matrix with iid multivariate normal rows with expectation zero), whose distribution under H0 is invariant under random orthogonal rotations.

Use an unlimited number of random rotations instead of the restricted number of permutations.

“Rotation” test (cf. Langsrud, 2005; Läuter et al., 2005)


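data matrix X (n × p, columns 1 … p), rows ordered by group:

$X = (x_{(1)}, \dots, x_{(n_1)}, \dots, x_{(n - n_K + 1)}, \dots, x_{(n)})'$ with $x_{11} = x_{(1)}$, $x_{1 n_1} = x_{(n_1)}$, …, $x_{K1} = x_{(n - n_K + 1)}$, $x_{K n_K} = x_{(n)}$

$H_0$: $x_{(j)} \sim N_p(\mu, \Sigma)$ iid $(j = 1, \dots, n)$

→ dist./sim. matrix $R = (r_{ij})$, $r_{ij} = r(x_{(i)}, x_{(j)})$

→ test statistic $d = d(R)$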


data matrix X (n × p), $H_0$: $x_{(j)} \sim N_p(\mu, \Sigma)$ iid $(j = 1, \dots, n)$

→ reduced data matrix X* ((n−1) × p, columns 1 … p) with rows

$x^*_{(j)} = x_{(j)} - \bar{x}$, $j = 1, \dots, n-1$, where $\bar{x} = \frac{1}{n} \sum_{j=1}^{n} x_{(j)}$

$x^*_{(n)} = -\sum_{j=1}^{n-1} x^*_{(j)}$ is redundant and omitted.

Under $H_0$: $x^*_{(j)} \sim N_p\left(0, \left(1 - \frac{1}{n}\right)\Sigma\right)$, but $\mathrm{Cov}(x^*_{(i)}, x^*_{(j)}) = -\frac{1}{n}\Sigma$ for $i \ne j$, so that

$X^* \sim N_{(n-1) \times p}(0, B \otimes \Sigma)$ with $B = I_{n-1} - \frac{1}{n} 1_{n-1} 1_{n-1}'$

→ dist./sim. matrix $R = (r_{ij})$, $r_{ij} = r(x_{(i)}, x_{(j)})$

→ test statistic $d = d(R)$
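The covariance structure of the reduced matrix X* can be checked numerically. The sketch below uses a small simulated matrix with assumed dimensions and relies on a standard fact: centring is multiplication by the symmetric idempotent matrix C = I − (1/n)11', so if the rows of X are iid with covariance Σ, the row covariance of CX is CC' = C, and its upper-left (n−1) × (n−1) block is the matrix B with 1 − 1/n on the diagonal and −1/n elsewhere.

```python
import numpy as np

n, p = 6, 4                                   # small hypothetical dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(n, p))

C = np.eye(n) - np.ones((n, n)) / n           # centring matrix, C X = X - xbar
X_star_full = C @ X
X_star = X_star_full[:-1]                     # drop the redundant last row

B = np.eye(n - 1) - np.ones((n - 1, n - 1)) / n

centred_ok = np.allclose(X_star_full, X - X.mean(axis=0))
last_row_ok = np.allclose(X_star_full[-1], -X_star.sum(axis=0))
idempotent_ok = np.allclose(C @ C.T, C)       # row covariance factor of C X is C
block_ok = np.allclose(C[:-1, :-1], B)        # its upper-left block equals B
print(centred_ok, last_row_ok, idempotent_ok, block_ok)
```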


data matrix X (n × p), $H_0$: $x_{(j)} \sim N_p(\mu, \Sigma)$ iid $(j = 1, \dots, n)$

→ reduced data matrix X* ((n−1) × p), rows $x^*_{(j)} = x_{(j)} - \bar{x}$, with $X^* \sim N_{(n-1) \times p}(0, B \otimes \Sigma)$

→ “decorrelated” matrix X+:

$X^+ = B^{-1/2} X^*$ (suitable root), $X^+ \sim N_{(n-1) \times p}(0, I_{n-1} \otimes \Sigma)$

→ random rotations, applied repeatedly:

$X^+ := \Gamma X^+$, $\Gamma = \Gamma^* (\Gamma^{*\prime} \Gamma^*)^{-1/2}$, with $\Gamma^*$ an $(n-1) \times (n-1)$ matrix of iid standard normal elements

→ dist./sim. matrix $R = (r_{ij})$, $r_{ij} = r(x_{(i)}, x_{(j)})$

→ test statistic $d = d(R)$

R has to be invariant with respect to a constant vector shift in its arguments, $r(x_{(i)}, x_{(j)}) = r(x_{(i)} + a, x_{(j)} + a)$, e.g. squared Euclidean distance, no longer Pearson's r!
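The chain from X to X*, to the decorrelated X+, and on to randomly rotated versions can be sketched as below. Several details are assumptions made for this illustration and are not stated on the slide: the symmetric square root of B is used as the “suitable root”, rotated matrices are mapped back with B^{1/2} before distances are computed, the statistic is the mean between-group minus the mean within-group squared Euclidean distance, and the Monte Carlo p-value counts the observed value as one draw. All function names are hypothetical.

```python
import numpy as np

def sym_power(A, power):
    """Power of a symmetric positive definite matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * w ** power) @ V.T

def sq_dist(X):
    """Squared Euclidean distances between rows (shift-invariant measure)."""
    diff = X[:, None, :] - X[None, :, :]
    return (diff ** 2).sum(axis=2)

def d_statistic(D2, groups):
    """Mean between-group minus mean within-group distance."""
    i, j = np.triu_indices(len(groups), k=1)
    same = groups[i] == groups[j]
    return D2[i, j][~same].mean() - D2[i, j][same].mean()

def random_rotation(m, rng):
    """Random orthogonal matrix G (G'G)^(-1/2) from iid standard normal G."""
    G = rng.normal(size=(m, m))
    return G @ sym_power(G.T @ G, -0.5)

def rotation_test(X, groups, n_rot=999, seed=0):
    rng = np.random.default_rng(seed)
    groups = np.asarray(groups)
    n = X.shape[0]
    X_star = (X - X.mean(axis=0))[:-1]               # reduced matrix, last row omitted
    B = np.eye(n - 1) - np.ones((n - 1, n - 1)) / n
    B_half, B_inv_half = sym_power(B, 0.5), sym_power(B, -0.5)
    X_plus = B_inv_half @ X_star                     # "decorrelated" matrix

    def stat(Xp):
        Xs = B_half @ Xp                             # back to the scale of X*
        full = np.vstack([Xs, -Xs.sum(axis=0)])      # restore the omitted row
        return d_statistic(sq_dist(full), groups)

    d_obs = stat(X_plus)
    hits = 1 + sum(stat(random_rotation(n - 1, rng) @ X_plus) >= d_obs
                   for _ in range(n_rot))
    return d_obs, hits / (n_rot + 1)

# Hypothetical example: two groups of 4, second group shifted in all components.
rng = np.random.default_rng(7)
groups = np.repeat([0, 1], 4)
X = rng.normal(size=(8, 50))
X[4:] += 2.0
d_obs, p_value = rotation_test(X, groups)
print(d_obs, p_value)
```

Unlike the permutation test, the attainable p-values here are limited only by the number of rotations drawn, which is what makes the approach attractive for very small samples.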


Example data (4 groups: bulk soil, strawberry, potato, oilseed rape)

Groups   d      PC1    PC2    PC3    PC4    PC5    dEucl2  drot
all 4    <.001  .136   .016   .001   <.001  <.001  <.001   <.001
1 – 2    .029   .116   .221   .454   .260   .129   .029    .024
1 – 3    .029   .038   .011   .042   .046   .150   .029    .001
1 – 4    .029   .023   .015   .031   .045   .055   .029    .004
2 – 3    .029   .264   .035   .100   .174   .360   .029    .017
2 – 4    .029   .392   .175   .135   .159   .016   .057    .043
3 – 4    .057   .442   .251   .056   .168   .062   .057    .051

p-values from global test and with unadjusted pairwise comparisons (dEucl2: squared Euclidean distance, drot: rotation test)


The same data with transformation x := ln(1+x)

Groups       d      PC1    PC2    PC3    PC4    PC5    dEucl2  drot
all groups   <.001  .116   .011   <.001  <.001  <.001  <.001   <.001
B – S        .029   .098   .119   .288   .155   .290   .029    .012
B – P        .029   .003   .019   .036   .021   .068   .029    <.001
B – R        .029   .058   .001   .006   .002   .017   .029    .006
S – P        .029   .837   .104   .091   .253   .451   .029    .002
S – R        .029   .736   .245   .024   .107   .358   .029    .028
P – R        .029   .773   .151   .051   .156   .166   .086    .036

p-values for global test and unadjusted pairwise tests


Simulation studies on robustness

e.g. p independent components from an exponential distribution

others:

uniform distribution: slightly anticonservative

sum of normal and one of the above: nearly exact


Summary

• Tests based on pairwise similarity or distance measures show high power in high-dimensional data.

• The permutation tests for the pairwise methods are not dependent on normality assumptions and performed surprisingly well in many situations.

• The basic idea is not new (cf. Mantel, 1967), but might have lost attention, at least in the field of medical biometry.

• Extensions for other designs are possible to some degree.

• Similar (partly asymptotic) methods are available in the software “CANOCO” (Canonical Community Ordination) by ter Braak and Šmilauer (2002).

• Small number of possible permutations restricts application for very small samples. In this case the rotation test can help.

• It is, however, dependent on the parametric assumptions, so variables should be checked and – if necessary – transformed.


References

• Aittokallio, T., Ojala, P., Nevalainen, T.J., Nevalainen, O. (2000). Analysis of similarity of electrophoretic patterns in mRNA differential display. Electrophoresis 21, 2947-2956.

• Kropf, S., Heuer, H., Grüning, M., Smalla, K. (2004). Significance test for comparing complex microbial community fingerprints using pairwise similarity measures. Journal of Microbiological Methods 57/2, 187-195.

• Kropf, S., Lux, A., Eszlinger, M., Heuer, H., Smalla, K. (2007). Comparison of independent samples of high-dimensional data by pairwise distance measures. Biometrical Journal 49, 230-241.

• Langsrud, Ø. (2005). Rotation Tests. Statistics and Computing 15, 53-60.

• Läuter, J., Glimm, E., Kropf, S. (1998). Multivariate Tests Based on Left-Spherically Distributed Linear Scores. Annals of Statistics 26, 1972-1988. Erratum: Annals of Statistics 27, 1441.

• Läuter, J., Glimm, E., Eszlinger, M. (2005). Search for Relevant Sets of Variables in a High-Dimensional Setup Keeping the Familywise Error Rate. Submitted to Statistica Neerlandica.

• Mantel, N. (1967). The Detection of Disease Clustering and a Generalized Regression Approach. Cancer Res. 27, 209-220.

• ter Braak, C.J.F., Šmilauer, P. (2002). CANOCO Reference Manual and CanoDraw for Windows User’s Guide: Software for Canonical Community Ordination (Version 4.5). Microcomputer Power, Ithaca NY, USA.