
Statistical analysis of compositional data

G. Mateu-Figueras

Dep. d’Informàtica, Matemàtica Aplicada i Estadística, Universitat de Girona

February 26, 2014


Outline

1 compositional data

2 Aitchison geometry of the simplex

3 exploratory analysis

4 distributions on $S^D$

5 conclusions


introduction

compositional data

- compositional data are parts of some whole which only carry relative information
- the simplex (for $\kappa$ a constant):

$$S^D = \left\{ \mathbf{x} = (x_1, \ldots, x_D) \in \mathbb{R}^D \;\middle|\; x_i > 0,\ \sum_{i=1}^{D} x_i = \kappa \right\}$$

- standard representation for $D = 3$: ternary diagram
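A minimal sketch of the closure operation that maps a vector of positive parts onto the simplex. The helper name `closure` and the default $\kappa = 1$ are assumptions made for illustration; they do not come from the slides.

```python
def closure(parts, kappa=1.0):
    """Rescale strictly positive parts so that they sum to kappa (hypothetical helper)."""
    if any(p <= 0 for p in parts):
        raise ValueError("all parts must be strictly positive")
    total = sum(parts)
    return [kappa * p / total for p in parts]


# Raw amounts, proportions and percentages all describe the same composition.
raw = [2.0, 6.0, 12.0]
as_proportions = closure(raw)             # closed to kappa = 1
as_percentages = closure(raw, kappa=100)  # closed to kappa = 100
```

Only the ratios between parts survive the closure, which is the sense in which the data carry relative information.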


examples

some compositional problems

- MN blood system: frequencies of the MM, NN and MN blood types in different ethnic populations. Despite the high variability, is there any stability in the data? Do they follow any genetic law?
- elections to the Parlament de Catalunya: the total votes achieved by each party in each county. Goal: to characterize the regions.
- Skye lavas: relative proportions of A (Na$_2$O + K$_2$O), F (Fe$_2$O$_3$) and M (MgO) in 23 basalt specimens from the Isle of Skye. Goal: to describe the variability of the geochemical composition.


difficulties

spurious correlations (Pearson, 1897)

$$\mathbf{x} = (x_1, \ldots, x_D), \quad \sum_{i=1}^{D} x_i = \kappa \;\Longrightarrow\; \mathrm{cov}(x_i, x_1) + \cdots + \mathrm{cov}(x_i, x_D) = 0$$

sample   x1     x2     x3     x4
1        0.1    0.2    0.1    0.6
2        0.2    0.2    0.3    0.3
3        0.3    0.3    0.1    0.3

cov      x1      x2      x3      x4
x1       0.007   0.003   0.000  -0.010
x2       0.003   0.002  -0.002  -0.003
x3       0.000  -0.002   0.009  -0.007
x4      -0.010  -0.003  -0.007   0.020

corr     x1      x2      x3      x4
x1       1.000   0.866   0.000  -0.866
x2       0.866   1.000  -0.500  -0.500
x3       0.000  -0.500   1.000  -0.500
x4      -0.866  -0.500  -0.500   1.000
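The zero-sum constraint on the covariances can be checked numerically on the three samples of this slide. The sketch below uses the population covariance (divisor $n$), which is an assumption chosen because it reproduces the rounded values in the table; the function name is hypothetical.

```python
from math import sqrt

# The three samples from the slide (each row sums to 1).
samples = [
    [0.1, 0.2, 0.1, 0.6],
    [0.2, 0.2, 0.3, 0.3],
    [0.3, 0.3, 0.1, 0.3],
]


def covariance_matrix(rows):
    """Population covariance matrix (divisor n) of the columns of rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [
        [sum((r[j] - means[j]) * (r[k] - means[k]) for r in rows) / n
         for k in range(d)]
        for j in range(d)
    ]


cov = covariance_matrix(samples)
# Because every sample sums to the same constant, each row of cov sums to zero.
row_sums = [sum(row) for row in cov]
corr_x1_x2 = cov[0][1] / sqrt(cov[0][0] * cov[1][1])
```

Since every variance is positive, each row must contain at least one negative covariance, whatever the underlying process is.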


difficulties

subcompositional incoherence (Aitchison, 1997)

Example. Scientists A and B record the composition of aliquots of soil samples: A records (animal, vegetable, mineral, water) compositions; B records (animal, vegetable, mineral) after drying the sample. Both are absolutely accurate [adapted from Aitchison, 2005].

sample A   x1     x2     x3     x4
1          0.1    0.2    0.1    0.6
2          0.2    0.1    0.2    0.5
3          0.3    0.3    0.1    0.3

sample B   x1*    x2*    x3*
1          0.25   0.50   0.25
2          0.40   0.20   0.40
3          0.43   0.43   0.14

corr A     x1     x2     x3     x4
x1         1.00   0.50   0.00  -0.98
x2                1.00  -0.87  -0.65
x3                       1.00   0.19
x4                              1.00

corr B     x1*    x2*    x3*
x1*        1.00  -0.57  -0.05
x2*               1.00  -0.79
x3*                      1.00
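The two scientists' tables can be reproduced from scientist A's data alone, since drying the sample amounts to dropping the water part and re-closing. This is a small sketch with hypothetical helper names (`closure`, `pearson`).

```python
from math import sqrt


def closure(parts):
    total = sum(parts)
    return [p / total for p in parts]


def pearson(u, v):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / sqrt(suu * svv)


# Scientist A: (animal, vegetable, mineral, water).
sample_a = [
    [0.1, 0.2, 0.1, 0.6],
    [0.2, 0.1, 0.2, 0.5],
    [0.3, 0.3, 0.1, 0.3],
]
# Scientist B: the same samples after drying, i.e. the closed subcomposition.
sample_b = [closure(row[:3]) for row in sample_a]

corr_a = pearson([r[0] for r in sample_a], [r[1] for r in sample_a])  # about 0.50
corr_b = pearson([r[0] for r in sample_b], [r[1] for r in sample_b])  # about -0.57
```

The animal and vegetable parts appear positively correlated to A and negatively correlated to B, although both describe the same material.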


principles

- scale invariance: the analysis should not depend on the closure constant $\kappa$

$$f(\alpha \mathbf{x}) = f(\mathbf{x}), \quad \alpha > 0$$

- subcompositional coherence: studies performed on subcompositions should not stand in contradiction with those performed on the full composition


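Euclidean space structure of $S^D$

for $\mathbf{x}, \mathbf{y} \in S^D$, $\alpha \in \mathbb{R}$, and $\mathcal{C}$ the closure operation:

- perturbation: $\mathbf{x} \oplus \mathbf{y} = \mathcal{C}(x_1 y_1, \ldots, x_D y_D)$
- powering: $\alpha \odot \mathbf{x} = \mathcal{C}(x_1^{\alpha}, \ldots, x_D^{\alpha})$
- inner product:

$$\langle \mathbf{x}, \mathbf{y} \rangle_a = \frac{1}{D} \sum_{i<j} \ln\frac{x_i}{x_j} \, \ln\frac{y_i}{y_j}$$

- associated norm and distance:

$$\|\mathbf{x}\|_a^2 = \frac{1}{D} \sum_{i<j} \left( \ln\frac{x_i}{x_j} \right)^2; \qquad d_a^2(\mathbf{x}, \mathbf{y}) = \frac{1}{D} \sum_{i<j} \left( \ln\frac{x_i}{x_j} - \ln\frac{y_i}{y_j} \right)^2$$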
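The operations on this slide translate almost literally into code. The following is a sketch with hypothetical function names; the distance is computed as the norm of $\mathbf{x}$ perturbed by the inverse of $\mathbf{y}$, which gives the same value as the pairwise log-ratio formula.

```python
from math import log, sqrt


def closure(parts):
    total = sum(parts)
    return [p / total for p in parts]


def perturb(x, y):
    """Perturbation: component-wise product followed by closure."""
    return closure([a * b for a, b in zip(x, y)])


def power(alpha, x):
    """Powering: component-wise power followed by closure."""
    return closure([a ** alpha for a in x])


def inner(x, y):
    """Aitchison inner product written with pairwise log-ratios."""
    d = len(x)
    return sum(
        log(x[i] / x[j]) * log(y[i] / y[j])
        for i in range(d) for j in range(i + 1, d)
    ) / d


def norm(x):
    return sqrt(inner(x, x))


def distance(x, y):
    """Aitchison distance: norm of x perturbed by the inverse of y."""
    return norm(perturb(x, power(-1.0, y)))


x = [0.1, 0.2, 0.7]
y = [0.3, 0.3, 0.4]
p = [0.93, 0.05, 0.02]
d_xy = distance(x, y)
# Perturbing both compositions by the same p leaves the distance unchanged.
d_shifted = distance(perturb(x, p), perturb(y, p))
```

Because only ratios of parts enter `inner`, the result is also unchanged when either argument is rescaled, in line with the scale-invariance principle.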


orthonormal coordinates

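- orthonormal basis on $S^D$: $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_{D-1}$ (not unique)
- coordinates of $\mathbf{x} \in S^D$ in this basis, or ilr coordinates:

$$\mathbf{x}^* = (\langle \mathbf{x}, \mathbf{e}_1 \rangle_a, \ldots, \langle \mathbf{x}, \mathbf{e}_{D-1} \rangle_a)$$

- example:

$$\mathbf{e}_1 = \mathcal{C}\left(\exp\left(\tfrac{1}{\sqrt{6}}, \tfrac{1}{\sqrt{6}}, \tfrac{-2}{\sqrt{6}}\right)\right), \qquad \mathbf{e}_2 = \mathcal{C}\left(\exp\left(\tfrac{1}{\sqrt{2}}, \tfrac{-1}{\sqrt{2}}, 0\right)\right)$$

$$\mathbf{x}^* = \left( \sqrt{\tfrac{2}{3}} \, \ln\frac{(x_1 \cdot x_2)^{1/2}}{x_3},\ \frac{1}{\sqrt{2}} \, \ln\frac{x_1}{x_2} \right)$$

Egozcue et al. (2003)

- compositional operations are reduced to ordinary vector operations when compositions are represented by their coordinates
- the principle of working on coordinates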
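For the $D = 3$ example, the coordinates can be obtained in two equivalent ways: as inner products with the basis, or from the closed-form expression. The sketch below relies on the standard fact that the Aitchison inner product equals the ordinary dot product of clr-transformed vectors; the function names are hypothetical.

```python
from math import log, sqrt


def clr(x):
    logs = [log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


# clr images of e1 and e2 from the example (the vectors inside C(exp(...))).
CLR_E1 = [1 / sqrt(6), 1 / sqrt(6), -2 / sqrt(6)]
CLR_E2 = [1 / sqrt(2), -1 / sqrt(2), 0.0]


def ilr3(x):
    """Coordinates of a 3-part composition in the basis (e1, e2)."""
    c = clr(x)
    return [
        sum(a * b for a, b in zip(c, CLR_E1)),
        sum(a * b for a, b in zip(c, CLR_E2)),
    ]


def ilr3_closed_form(x):
    """The same coordinates, written as on the slide."""
    x1, x2, x3 = x
    return [
        sqrt(2 / 3) * log(sqrt(x1 * x2) / x3),
        log(x1 / x2) / sqrt(2),
    ]


x = [0.1, 0.2, 0.7]
coords = ilr3(x)
coords_check = ilr3_closed_form(x)
```

Working on coordinates then means applying ordinary Euclidean tools to `coords` rather than to the raw parts.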


parallel lines

[Figure: parallel lines, shown in $S^3$ (ternary diagram with vertices $x_1$, $x_2$, $x_3$) and in the coordinate representation]


circles and ellipses

[Figure: circles and ellipses, shown in $S^3$ and in the coordinate representation]


the MN blood system

Hardy-Weinberg law: $MN^2 = 4 \, MM \cdot NN$

$$\sqrt{\tfrac{2}{3}} \, \ln\frac{(MM \cdot NN)^{1/2}}{MN} = -0.57$$
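The constant value of this balance follows from the Hardy-Weinberg relation in one step; a short worked derivation:

```latex
\[
MN^2 = 4\,MM \cdot NN
\;\Longrightarrow\;
(MM \cdot NN)^{1/2} = \frac{MN}{2}
\;\Longrightarrow\;
\sqrt{\tfrac{2}{3}}\,\ln\frac{(MM \cdot NN)^{1/2}}{MN}
= \sqrt{\tfrac{2}{3}}\,\ln\frac{1}{2}
\approx -0.57
\]
```

Populations that follow the law therefore lie on a line of constant first coordinate, whatever their allele frequencies.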


building an orthonormal basis

using sequential binary partitions (SBP)

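example: sequential binary partition for $\mathbf{x} \in S^5$; coordinates in the corresponding orthonormal basis

order   x1   x2   x3   x4   x5   coordinate
1       +1   −1   +1   +1   −1   $x_1^* = \sqrt{\frac{3 \cdot 2}{3 + 2}} \, \ln\frac{(x_1 \cdot x_3 \cdot x_4)^{1/3}}{(x_2 \cdot x_5)^{1/2}}$
2        0   +1    0    0   −1   $x_2^* = \sqrt{\frac{1 \cdot 1}{1 + 1}} \, \ln\frac{x_2}{x_5}$
3       +1    0   −1   −1    0   $x_3^* = \sqrt{\frac{1 \cdot 2}{1 + 2}} \, \ln\frac{x_1}{(x_3 \cdot x_4)^{1/2}}$
4        0    0   +1   −1    0   $x_4^* = \sqrt{\frac{1 \cdot 1}{1 + 1}} \, \ln\frac{x_3}{x_4}$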


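coordinates ⇒ balances

coordinates in an orthonormal basis obtained from a sequential binary partition:

$$x_i^* = \sqrt{\frac{r_i \cdot s_i}{r_i + s_i}} \, \ln\frac{\left( \prod_{j \in R_i} x_j \right)^{1/r_i}}{\left( \prod_{\ell \in S_i} x_\ell \right)^{1/s_i}}$$

where $i$ is the order of the partition, $R_i$ and $S_i$ are index sets, $r_i$ is the number of indices in $R_i$, and $s_i$ the number in $S_i$

Egozcue, Pawlowsky-Glahn (2005)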
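The balance formula can be evaluated directly from the sign matrix of a sequential binary partition. The sketch below encodes the $S^5$ partition of the preceding slide; the composition `x` and the function name `balance` are illustrative assumptions.

```python
from math import log, sqrt

# Sign matrix of the sequential binary partition in the S^5 example:
# +1 marks parts in R_i, -1 parts in S_i, 0 parts not involved at that step.
SBP = [
    [+1, -1, +1, +1, -1],
    [0, +1, 0, 0, -1],
    [+1, 0, -1, -1, 0],
    [0, 0, +1, -1, 0],
]


def balance(x, signs):
    """Balance between the +1 group and the -1 group of one partition step."""
    plus = [v for v, s in zip(x, signs) if s > 0]
    minus = [v for v, s in zip(x, signs) if s < 0]
    r, s = len(plus), len(minus)
    log_gm_plus = sum(log(v) for v in plus) / r     # log geometric mean over R_i
    log_gm_minus = sum(log(v) for v in minus) / s   # log geometric mean over S_i
    return sqrt(r * s / (r + s)) * (log_gm_plus - log_gm_minus)


x = [0.10, 0.25, 0.30, 0.15, 0.20]   # arbitrary illustrative composition
balances = [balance(x, row) for row in SBP]
```

Each balance compares the geometric means of two groups of parts, which is what makes these coordinates easier to interpret than an arbitrary orthonormal basis.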


Log-ratio approach (Aitchison, 1980-86)

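log-ratio transformations introduced by J. Aitchison:

- alr: $S^D \to \mathbb{R}^{D-1}$, $\mathrm{alr}(\mathbf{x}) = \left( \ln\frac{x_1}{x_D}, \ldots, \ln\frac{x_{D-1}}{x_D} \right)$
  drawback: not an isometry
- clr: $S^D \to \mathbb{R}^D$, $\mathrm{clr}(\mathbf{x}) = \left( \ln\frac{x_1}{g(\mathbf{x})}, \ldots, \ln\frac{x_D}{g(\mathbf{x})} \right)$, with $g(\mathbf{x}) = \prod_{i=1}^{D} x_i^{1/D}$
  drawback: a constrained transformed vector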
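Both drawbacks can be seen on a small example. In this sketch (function names are hypothetical), the clr components sum to zero, which is the constraint on the transformed vector, and the Euclidean distance between alr vectors differs from the distance between clr vectors, which coincides with the Aitchison distance.

```python
from math import log, sqrt


def alr(x):
    """Additive log-ratio transformation with the last part as divisor."""
    return [log(v / x[-1]) for v in x[:-1]]


def clr(x):
    """Centered log-ratio transformation."""
    logs = [log(v) for v in x]
    mean_log = sum(logs) / len(logs)   # log of the geometric mean g(x)
    return [v - mean_log for v in logs]


def euclidean(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


x = [0.1, 0.2, 0.7]
y = [0.3, 0.3, 0.4]

clr_sum = sum(clr(x))                # zero up to rounding: the clr constraint
d_clr = euclidean(clr(x), clr(y))    # equals the Aitchison distance
d_alr = euclidean(alr(x), alr(y))    # differs from d_clr: alr is not an isometry
```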


the treatment of zeros

- case 1: the part with zeros is not important for the study ⇒ the part should be omitted
- case 2: the part is important, the zeros are essential ⇒ divide the sample into two or more populations, according to the presence/absence of zeros
- case 3: the part is important, the zeros are rounded zeros ⇒ use imputation techniques

for a review, see Martín-Fernández et al. (2011)


center and variability

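let $X = \{ \mathbf{x}_i = (x_{i1}, \ldots, x_{iD}) \in S^D : i = 1, \ldots, n \}$

- center (closed geometric mean) of $X$:

$$\mathbf{g} = \mathcal{C}(g_1, g_2, \ldots, g_D), \quad \text{with } g_j = \left( \prod_{i=1}^{n} x_{ij} \right)^{1/n}$$

- total variance of $X$:

$$\mathrm{TotVar}[X] = \frac{1}{n} \sum_{i=1}^{n} d_a^2(\mathbf{x}_i, \mathbf{g})$$

- variation array of $X$ (log-ratio variances above the diagonal, log-ratio means below):

$$\begin{pmatrix}
- & \mathrm{var}\left[\ln\frac{x_1}{x_2}\right] & \cdots & \mathrm{var}\left[\ln\frac{x_1}{x_D}\right] \\
E\left[\ln\frac{x_1}{x_2}\right] & - & \ddots & \vdots \\
\vdots & \ddots & - & \mathrm{var}\left[\ln\frac{x_{D-1}}{x_D}\right] \\
E\left[\ln\frac{x_1}{x_D}\right] & \cdots & E\left[\ln\frac{x_{D-1}}{x_D}\right] & -
\end{pmatrix}$$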
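A sketch of these descriptive statistics in plain Python. The data set is a small illustrative one (the three samples from the spurious-correlation slide), the function names are hypothetical, and the squared Aitchison distance is computed through clr vectors, which gives the same value as the pairwise log-ratio formula.

```python
from math import exp, log


def closure(parts):
    total = sum(parts)
    return [p / total for p in parts]


def clr(x):
    logs = [log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def center(data):
    """Closed vector of column-wise geometric means."""
    n, d = len(data), len(data[0])
    return closure([exp(sum(log(row[j]) for row in data) / n) for j in range(d)])


def total_variance(data):
    """Mean squared Aitchison distance of the samples to the center."""
    g = clr(center(data))
    return sum(
        sum((a - b) ** 2 for a, b in zip(clr(row), g)) for row in data
    ) / len(data)


def logratio_variance(data, j, k):
    """Population variance of ln(x_j / x_k): an upper entry of the variation array."""
    values = [log(row[j] / row[k]) for row in data]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


data = [
    [0.1, 0.2, 0.1, 0.6],
    [0.2, 0.2, 0.3, 0.3],
    [0.3, 0.3, 0.1, 0.3],
]
g = center(data)
totvar = total_variance(data)
```

With the divisor $n$ used here, the total variance equals the sum of the upper-triangle entries of the variation array divided by $D$, so the array shows how the total variability is split among pairs of parts.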


example: ParlCat2010 data set

votes achieved by PP, CiU, SI, C’s, ERC, PSC, ICV

center: $\mathbf{g} = (0.097, 0.505, 0.044, 0.017, 0.102, 0.179, 0.056)$


clr biplot

- graphical display of a multivariate data set (individuals and variables)
- clr-biplot
- particular rules of interpretation:
  - ‖ray‖ ≈ variance of the clr component
  - ‖link‖ ≈ variance of the log-ratio
  - perpendicular links ⇒ possibly uncorrelated log-ratios
  - parallel links ⇒ possibly highly correlated log-ratios
  - coincident vertices ⇒ two redundant parts
  - collinear vertices ⇒ possibly one-dimensional variability

Aitchison and Greenacre (2002)


example: ParlCat2010 data set (explains 86% variance)

[Figure: biplot of the ParlCat2010 data set, annotated on successive overlays with the values below]

- clr variances: $\mathrm{var}\left(\ln\frac{ICV}{g}\right) = 0.0417$, $\mathrm{var}\left(\ln\frac{C’s}{g}\right) = 0.2898$
- log-ratio variances: $\mathrm{var}\left(\ln\frac{SI}{C’s}\right) = 0.8915$, $\mathrm{var}\left(\ln\frac{CiU}{ERC}\right) = 0.0732$
- log-ratio correlation: $\mathrm{corr}\left(\ln\frac{C’s}{ERC}, \ln\frac{PSC}{ICV}\right) = -0.041$


coda-dendrogram

to visualize:

- the sequential binary partition
- the center of each balance
- the proportion of the sample total variance corresponding to each balance
- summary statistics of each balance (box-plot of percentiles 5, 25, 50, 75, 95)
- adequate to represent different groups

Pawlowsky-Glahn and Egozcue (2011)


example: ParlCat2010 data set


logistic normal (Aitchison 1980-86)

$\mathbf{x} : \Omega \to S^D$

- transform $\mathbf{x}$ to $\mathbb{R}^{D-1}$ using a log-ratio transformation
- define the density of the transformed vector and go back to $S^D$ using the change of variable theorem
- the result is a density function for $\mathbf{x}$ with respect to $\lambda$ on $S^D$

⇓ (Aitchison, 1997)

- $E[\mathbf{x}]$ is not a meaningful measure of central location
- $\mathrm{cen}[\mathbf{x}]$ is the alternative, which minimizes $E[d_a^2(\mathbf{x}, \mathrm{cen}[\mathbf{x}])]$


densities and measures

- on $S^D$: density functions expressed with respect to the Aitchison measure $\lambda_a$
- density functions of the vector of coordinates with respect to $\lambda$

$$\frac{d\lambda}{d\lambda_a} = \sqrt{D} \, x_1 x_2 \cdots x_D, \qquad \lambda_a(A) = \lambda(A^*)$$


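normal on $S^D$

$\mathbf{x} : \Omega \to S^D$

a random composition $\mathbf{x}$ is normally distributed on $S^D$ with parameters $\boldsymbol{\mu}$ and $\Sigma$ if its density function is

$$f_{\mathbf{x}}(\mathbf{x}) = (2\pi)^{-(D-1)/2} \, |\Sigma|^{-1/2} \exp\left[ -\frac{1}{2} (\mathbf{x}^* - \boldsymbol{\mu}^*)' \, \Sigma^{-1} (\mathbf{x}^* - \boldsymbol{\mu}^*) \right]$$

- usual normal density applied to coordinates $\mathbf{x}^*$, with $f_{\mathbf{x}} = dP/d\lambda_a$
- $\boldsymbol{\mu} = E_a[\mathbf{x}] = \mathrm{cen}[\mathbf{x}]$

Mateu-Figueras et al. (2013)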
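For $D = 3$ the density reduces to a bivariate normal density evaluated at the coordinates. The sketch below assumes the coordinates of the earlier $D = 3$ example (any other orthonormal basis would need its own `ilr3`) and inverts the $2 \times 2$ matrix by hand; the function names are hypothetical.

```python
from math import exp, log, pi, sqrt


def ilr3(x):
    """Coordinates of a 3-part composition (basis of the D = 3 example, an assumption)."""
    x1, x2, x3 = x
    return [sqrt(2 / 3) * log(sqrt(x1 * x2) / x3), log(x1 / x2) / sqrt(2)]


def normal_density_s3(x, mu_star, sigma):
    """Density of the normal on S^3 with respect to the Aitchison measure.

    mu_star holds the coordinates of the center; sigma is the 2x2 covariance
    matrix of the coordinates, given as nested lists.
    """
    (a, b), (c, d) = sigma
    det = a * d - b * c
    if a <= 0 or det <= 0:
        raise ValueError("sigma must be positive definite")
    inv = [[d / det, -b / det], [-c / det, a / det]]
    z = [u - m for u, m in zip(ilr3(x), mu_star)]
    quad = sum(z[i] * inv[i][j] * z[j] for i in range(2) for j in range(2))
    return exp(-0.5 * quad) / (2 * pi * sqrt(det))


identity = [[1.0, 0.0], [0.0, 1.0]]
barycenter = [1 / 3, 1 / 3, 1 / 3]
# With mu* = (0, 0) and Sigma = Id the mode sits at the barycenter of the simplex.
peak = normal_density_s3(barycenter, [0.0, 0.0], identity)
off_peak = normal_density_s3([0.1, 0.2, 0.7], [0.0, 0.0], identity)
```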


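comparison

$\boldsymbol{\mu}^* = (0, 0)$, $\Sigma = \mathrm{Id}$

[Figure: two ternary diagrams with vertices $x_1$, $x_2$, $x_3$. Left: $S^D \subset \mathbb{R}^D$, logistic normal, Lebesgue measure $\lambda$. Right: $S^D$ as Euclidean space, normal on $S^D$, Aitchison measure $\lambda_a$.]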


invariance under perturbation

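$\mathbf{p} = (0.93, 0.05, 0.02)$, $\quad \mathbf{x}^* = \left( \frac{1}{\sqrt{2}} \ln\frac{x_1}{x_2},\ \frac{1}{\sqrt{6}} \ln\frac{x_1 x_2}{x_3 x_3} \right)$

[Figure: normal on $S^3$ (ternary diagram with vertices $x_1$, $x_2$, $x_3$) and its coordinate representation]

$\boldsymbol{\mu}^* = (-0.5, -0.5)$, $\boldsymbol{\mu}^* = (1.5, 1.5)$, $\Sigma = \mathrm{Id}$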
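In coordinates, perturbation by $\mathbf{p}$ is an ordinary translation by the coordinates of $\mathbf{p}$. The sketch below checks this with the coordinates written on this slide; reading the two centers as related through perturbation by $\mathbf{p}$ is an interpretation of the slide, and the composition `x` is an arbitrary illustrative choice.

```python
from math import log, sqrt


def closure(parts):
    total = sum(parts)
    return [v / total for v in parts]


def perturb(x, y):
    return closure([a * b for a, b in zip(x, y)])


def ilr3(x):
    """Coordinates as written on this slide."""
    x1, x2, x3 = x
    return [log(x1 / x2) / sqrt(2), log(x1 * x2 / (x3 * x3)) / sqrt(6)]


p = [0.93, 0.05, 0.02]
x = [0.2, 0.5, 0.3]          # arbitrary illustrative composition

shift = ilr3(p)              # roughly (2.07, 1.94)
moved = ilr3(perturb(x, p))
expected = [a + b for a, b in zip(ilr3(x), shift)]
```

The shift is close to the displacement $(2, 2)$ between the two centers shown on the slide, and since $\Sigma$ is unaffected by a translation, the family is closed under perturbation.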


tests of normality on $S^D$

H0: the sample of coordinates comes from a multivariate normal distribution

- based on empirical distribution function (EDF) tests
- Anderson-Darling, Cramér-von Mises and Watson statistics
- three possible cases:
  - all $(D - 1)$ marginal, univariate distributions
  - all $(D - 1)(D - 2)/2$ bivariate angle distributions
  - the $(D - 1)$-dimensional radius distribution

problem: dependence on the orthonormal basis


example: aphyric Skye lavas

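$X = (A, F, M)$: composition of 23 basalt specimens from the Isle of Skye (Aitchison, 1986)

$$\boldsymbol{\mu}^* = (0.555, 0.639), \qquad \Sigma = \begin{pmatrix} 0.126 & -0.229 \\ -0.229 & 0.456 \end{pmatrix}$$

[Figure: ternary diagram with vertices A, F, M]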


kernel density estimation

- the normal on $S^D$ for the kernel in the density estimator
- invariance with respect to the orthonormal basis

[Figure: density contours (levels 90, 75, 50, 25, 10) in the ternary diagram with vertices A, F, M and in the coordinate representation with axes $y_1$, $y_2$]

Chacón et al. (2010)


other distributions on $S^D$

- the skew-normal distribution on $S^D$
- the Dirichlet distribution
- the shifted-scaled Dirichlet distribution
- ...


conclusions

- treat compositional data (CoDa) in the simplex, with its specific geometry
- do not apply ordinary multivariate statistics directly to CoDa
- the simplex has a Euclidean structure: orthonormal coordinates are available
- multivariate statistical models and methods work properly on coordinates of CoDa
- problem (or advantage): interpretation of coordinates


references

Aitchison, J. (1986): The statistical analysis of compositional data. Monographs on Statistics and Applied Probability. Chapman and Hall, London.
Aitchison, J.; Greenacre, M. (2002): Biplots for compositional data. Journal of the Royal Statistical Society, Series C (Applied Statistics), 51(4), 375–392.
Billheimer, D.; Guttorp, P.; Fagan, W. (2001): Statistical interpretation of species composition. J. Am. Statistical Ass., 96(456), 1205–1214.
Chacón, J.E.; Mateu-Figueras, G.; Martín-Fernández, J.A. (2010): Gaussian kernels for density estimation with compositional data. Computers and Geosciences, 37, 702–711.
Egozcue, J.J.; Pawlowsky-Glahn, V. (2005): Groups of parts and their balances in compositional data analysis. Math. Geol., 37(7), 795–828.
Egozcue, J.J.; Pawlowsky-Glahn, V.; Mateu-Figueras, G.; Barceló-Vidal, C. (2003): Isometric logratio transformations for compositional data analysis. Mathematical Geology, 35(3), 279–300.
Martín-Fernández, J.A.; Palarea-Albaladejo, J.; Olea, R.A. (2011): Dealing with zeros. In Pawlowsky-Glahn, V. and Buccianti, A. (Eds.), Compositional Data Analysis: Theory and Applications. Wiley, Chichester, UK.
Mateu-Figueras, G.; Pawlowsky-Glahn, V.; Egozcue, J.J. (2013): The normal distribution in some constrained sample spaces. SORT, 37(1), 29–56.
Pawlowsky-Glahn, V.; Egozcue, J.J. (2001): Geometric approach to statistical analysis on the simplex. SERRA, 15(5), 384–398.
Pawlowsky-Glahn, V.; Egozcue, J.J. (2011): Exploring Compositional Data with the Coda-Dendrogram. Austrian Journal of Statistics, 40, 1-2.