
Geometrical Methods for Heterogeneous Data

Susan Holmes
http://www-stat.stanford.edu/~susan/

Bio-X and Statistics, Stanford University

IMA, 2013 and funded by NIH-R01GM086884


What's happening in Biological Data Analysis?

1985 1998 2012


Challenges for those of us working from the ground up

▶ Heterogeneity.
▶ Structured high-dimensionality.
▶ Graph or Tree integration.
▶ Identifiability.
▶ High quality graphics.
▶ Reproducibility.

Part I

Heterogeneity

`Homogeneous data are all alike; all heterogeneous data are heterogeneous in their own way.'

Heterogeneity of Data

▶ Status: response / explanatory.
▶ Hidden (latent) / measured.
▶ Types:
    ▶ Continuous
    ▶ Binary, categorical
    ▶ Graphs / Trees
    ▶ Images
    ▶ Maps / Spatial Information
    ▶ Rankings
▶ Amounts of dependency: independent / time series / spatial.
▶ Different technologies used (454, phyloseq, Illumina, MassSpec, RNA-seq).

Microbiome and other useful `ome' words

Joshua Lederberg: `the ecological community of commensal, symbiotic, and pathogenic microorganisms that literally share our body space and have been all but ignored as determinants of health and disease'

Microbiome: Complete collection of genes contained in the genomes of microbes living in a given environment.

Numbers: Humans shelter 100 trillion microbes ($10^{14}$); we are made of $10 \times 10^{12}$ cells. It is estimated that there are more than 1000 `species' of bacteria living in the human gut.

Metagenome: Composition of all genes present in an environment (soil, gut, seawater), regardless of species.

Transcriptome: The mRNA transcripts in the cell; it reflects the genes that are being actively expressed at any given time.

Bacteria etc... and Us

The human microbiome or human microbiota is the assemblage of microorganisms that reside on the surface and in deep layers of skin, in the saliva and oral mucosa, in the conjunctiva, and in the gastrointestinal tracts.

▶ They include bacteria, fungi, and archaea.
▶ Some of these organisms perform tasks that are useful for the human host (they live in symbiosis).
▶ The majority have no known beneficial or harmful effect.

YK Lee and SK Mazmanian Science, 2010.


Human Microbiome: What are the data?

DNA: The genomic material present (especially the 16S rRNA gene).
RNA: Which genes are being turned on (gene expression).
Mass Spec: Specific signatures of chemical compounds present.
Clinical: Multivariate information about patients' clinical status, medication, weight.
Environmental: Location, nutrition, time.
Domain Knowledge: Metabolic networks, phylogenetic trees, gene ontologies.

Heterogeneous Data Objects

Object Oriented Data Analysis
Input and data manipulation with phyloseq (McMurdie and Holmes, 2013, PLoS ONE).
As always in R: object oriented data.

Component data objects:
▶ OTU Abundance. class: otuTable; slots: .Data, speciesAreRows
▶ Sample Variables. class: sampleData; slots: .Data, names, row.names, .S3Class
▶ Taxonomy Table. class: taxonomyTable; slots: .Data
▶ Phylogenetic Tree. class: phylo; slots: see ape package

Base R types shown in the diagram: matrix, matrix, data.frame.

Experiment-level data object:
▶ phyloseq. slots: otuTable, sampleData, taxTab, tre

Questions from many Disciplines

Biomedical sciences: Clinical variables, immune history.
Genetics: Host and bacterial genomes and transcriptomes.
Bioinformatics: Databases and formats.
Biochemistry: Compounds, metabolic pathways involved.
Ecology: Ordination, diversity, environmental influences.
Statistics: Decompositions of variability, spatio-temporal analyses.
Systematic Biology: Phylogenetic trees, dynamics of evolution.
Network science: Microbial communities, gene interactions, metabolic pathways, ...
Visualization: Dimension reduction, rich representations.
Mathematics: Differential geometry, multilinear algebra, probability, topology.

Useful first order representation: Many Matrices

▶ Time series of abundance matrices.
▶ Different types of data on same samples (taxa counts, clinical variates, spatial location).
▶ Networks and trees over time.
▶ Explanatory (environmental) variables, Response variables.

Holmes (2005), Duality Diagrams.

Part II

Dimension Reduction: the Euclidean embedding workhorse: MDS

Principal Component Analysis: Dimension Reduction

PCA seeks to replace the original (centered) matrix X by a matrix of lower rank. This can be solved using the singular value decomposition of X:

\[ X = U S V', \quad \text{with } U' D U = I_n, \; V' Q V = I_p \text{ and } S \text{ diagonal} \]

\[ X X' = U S^2 U', \quad \text{with } U' D U = I_n \text{ and } S^2 = \Lambda \]

PCA is a linear nonparametric multivariate method for dimension reduction. D and Q are the relevant metrics on the dual row and column spaces of n samples and p variables.
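The decomposition above can be tried out on simulated data. The following is a minimal sketch in base R, assuming the simplest metrics D = I and Q = I (so U and V are orthonormal in the usual sense); the data and the choice r = 2 are made up for illustration.

```r
# Minimal sketch: PCA of a centered matrix through the SVD (case D = I, Q = I).
set.seed(1)                                   # simulated data, for illustration only
Y <- matrix(rnorm(20 * 4), nrow = 20, ncol = 4)
X <- scale(Y, center = TRUE, scale = FALSE)   # center the columns

sv <- svd(X)                                  # X = U S V'
U <- sv$u; S <- diag(sv$d); V <- sv$v

# Rank-r approximation: keep the r largest singular values.
r <- 2
X_r <- U[, 1:r] %*% S[1:r, 1:r] %*% t(V[, 1:r])

# The nonzero eigenvalues of XX' are the squared singular values.
lambda <- eigen(X %*% t(X), symmetric = TRUE)$values[1:4]

# Principal component scores US agree with prcomp() up to the sign of each axis.
scores <- U %*% S
pc <- prcomp(Y)$x
```

Comparing `abs(scores)` with `abs(pc)` is a quick way to see that the SVD route and `prcomp()` give the same configuration.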

Examples from Ecology

The Boomlake plant data:

Biplot representing both species and locations. Blue circles with letters are species scores. Sampling locations are green circles with numbers.

Sample 1 is actually in the lake, and sample 12 is far away. Species are located close to the samples they occur in. If you looked carefully into the data matrix, you would find that species R and Q are strictly aquatic, while species F is a dryland plant (cribs). There is an arch effect.

Psychological Data

Color confusion data (Ekman, 1954):

     w434 w445 w465 w472 w490 w504 w537 w555 w584 w600 w610 w628 w651 w674
 1   0.00 0.86 0.42 0.42 0.18 0.06 0.07 0.04 0.02 0.07 0.09 0.12 0.13 0.16
 2   0.86 0.00 0.50 0.44 0.22 0.09 0.07 0.07 0.02 0.04 0.07 0.11 0.13 0.14
 3   0.42 0.50 0.00 0.81 0.47 0.17 0.10 0.08 0.02 0.01 0.02 0.01 0.05 0.03
 4   0.42 0.44 0.81 0.00 0.54 0.25 0.10 0.09 0.02 0.01 0.00 0.01 0.02 0.04
 5   0.18 0.22 0.47 0.54 0.00 0.61 0.31 0.26 0.07 0.02 0.02 0.01 0.02 0.00
 6   0.06 0.09 0.17 0.25 0.61 0.00 0.62 0.45 0.14 0.08 0.02 0.02 0.02 0.01
 7   0.07 0.07 0.10 0.10 0.31 0.62 0.00 0.73 0.22 0.14 0.05 0.02 0.02 0.00
 8   0.04 0.07 0.08 0.09 0.26 0.45 0.73 0.00 0.33 0.19 0.04 0.03 0.02 0.02
 9   0.02 0.02 0.02 0.02 0.07 0.14 0.22 0.33 0.00 0.58 0.37 0.27 0.20 0.23
10   0.07 0.04 0.01 0.01 0.02 0.08 0.14 0.19 0.58 0.00 0.74 0.50 0.41 0.28
11   0.09 0.07 0.02 0.00 0.02 0.02 0.05 0.04 0.37 0.74 0.00 0.76 0.62 0.55
12   0.12 0.11 0.01 0.01 0.01 0.02 0.02 0.03 0.27 0.50 0.76 0.00 0.85 0.68
13   0.13 0.13 0.05 0.02 0.02 0.02 0.02 0.02 0.20 0.41 0.62 0.85 0.00 0.76
14   0.16 0.14 0.03 0.04 0.00 0.01 0.00 0.02 0.23 0.28 0.55 0.68 0.76 0.00

Results: Color Confusion

[Figure, two panels. Left: "Screeplot" of the eigenvalues (x axis: Index; y axis: clas.col$eig). Right: "Planar configuration" plotting cmdscale(colorc)[,1] against cmdscale(colorc)[,2], with the 14 colors labelled 1 to 14.]

Metric Multidimensional Scaling

Schoenberg (1935)

From Coordinates to Distances and Back

If we start with original data in $\mathbb{R}^p$ that are not centered, Y, we apply the centering matrix:

\[ X = HY, \quad \text{with } H = I - \frac{1}{n} \mathbf{1}\mathbf{1}' \text{ and } \mathbf{1}' = (1, 1, \ldots, 1) \]

Call $B = XX'$. If $D^{(2)}$ is the matrix of squared distances between rows of X in the Euclidean coordinates, we can show that

\[ -\frac{1}{2} H D^{(2)} H = B \]
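The double-centering identity is easy to check numerically. The sketch below uses simulated, uncentered data (an assumption made only for illustration) and base R's `dist()` for the Euclidean distances.

```r
set.seed(2)
n <- 6; p <- 3
Y <- matrix(rnorm(n * p), n, p)            # uncentered data (simulated)

H <- diag(n) - matrix(1 / n, n, n)         # centering matrix H = I - (1/n) 1 1'
X <- H %*% Y                               # centered data
B <- X %*% t(X)                            # Gram matrix B = XX'

D2 <- as.matrix(dist(Y))^2                 # squared Euclidean distances between rows
B_from_D <- -0.5 * H %*% D2 %*% H          # double centering of the squared distances
max(abs(B - B_from_D))                     # zero up to rounding error
```

Note that the distances are computed from Y rather than X: centering does not change distances between rows.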

Schoenberg's result: exact Euclidean distance

If B is positive semi-definite then D can be seen as a distance between points in a Euclidean space.

Reverse engineering a Euclidean embedding

We can go backwards from a matrix D to X by taking the eigendecomposition of $B = -\frac{1}{2} H D^{(2)} H$, in much the same way that PCA provides the best rank r approximation for data by taking the singular value decomposition of X, or the eigendecomposition of $XX'$.

\[ X^{(r)} = U S^{(r)} V' \quad \text{with} \quad S^{(r)} = \begin{pmatrix} s_1 & 0 & \cdots & 0 & \cdots \\ 0 & s_2 & \cdots & 0 & \cdots \\ \vdots & & \ddots & \vdots & \\ 0 & 0 & \cdots & s_r & \cdots \\ \vdots & \vdots & & \vdots & 0 \end{pmatrix} \]

Multidimensional Scaling (MDS) also called PCoA

Simple classical multidimensional scaling:
▶ Square D elementwise: $D^{(2)} = D^2$.
▶ Compute $-\frac{1}{2} H D^{(2)} H = B$.
▶ Diagonalize B to find the principal coordinates $SV'$.
▶ Choose a number of dimensions by inspecting the screeplot of the eigenvalues.
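The four steps can be sketched in a few lines of base R. The helper name `classical_mds` is hypothetical, and the coordinates are taken as eigenvectors of B scaled by the square roots of the eigenvalues, which is the convention `cmdscale()` uses; `eurodist` is a distance object shipped with R's datasets package and is used only as a convenient example.

```r
classical_mds <- function(D, k = 2) {
  D <- as.matrix(D)
  n <- nrow(D)
  H <- diag(n) - matrix(1 / n, n, n)
  B <- -0.5 * H %*% (D^2) %*% H            # square elementwise, then double center
  e <- eigen(B, symmetric = TRUE)          # diagonalize B
  # Coordinates: first k eigenvectors scaled by sqrt(eigenvalue);
  # this assumes the first k eigenvalues are positive.
  coords <- e$vectors[, 1:k, drop = FALSE] %*% diag(sqrt(e$values[1:k]), k)
  list(points = coords, eig = e$values)
}

d <- eurodist                              # road distances between 21 European cities
fit <- classical_mds(d, k = 2)
ref <- cmdscale(d, k = 2)                  # reference implementation in base R

# plot(fit$eig) would give the screeplot used to choose the number of dimensions.
```

The two configurations agree up to the sign of each axis, so comparing `abs(fit$points)` with `abs(ref)` is a simple consistency check.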

Malaria Data as seen using ape

[Figure: phylogenetic tree with tips Pre1, Pme2, Plo6, Pga11, Pma3, Pbe5, Pfr7, Pkn8, Pcy9, Pvi10, Pfa4.]

Bootstrap of Malaria Data

[Figure: "Eigenvalues of MDS for bootstrapped trees"; horizontal axis from 0 to 120000.]

[Figure: "Bootstrapped trees"; planar configuration with points labelled 1 to 100 and o.]

Data Analysis: Geometrical Approach

i. The data are p variables measured on n observations.
ii. X has n rows (the observations) and p columns (the variables).
iii. $D_n$ is an $n \times n$ matrix of weights on the ``observations'', which is most often diagonal.
iv. Q is a symmetric positive definite matrix, often

\[ Q = \begin{pmatrix} \frac{1}{\sigma_1^2} & 0 & \cdots & 0 \\ 0 & \frac{1}{\sigma_2^2} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{\sigma_p^2} \end{pmatrix}. \]

Euclidean Spaces

These three matrices form the essential ``triplet'' (X, Q, D) defining a multivariate data analysis. Q and D define geometries or inner products in $\mathbb{R}^p$ and $\mathbb{R}^n$, respectively, through

\[ x^t Q y = \langle x, y \rangle_Q, \quad x, y \in \mathbb{R}^p \]

\[ x^t D y = \langle x, y \rangle_D, \quad x, y \in \mathbb{R}^n. \]

A Commutative Diagram Approach

Statisticians search for approximations with certain properties. For the case of PCA, for instance, we rephrase the problem as follows:

▶ Q can be seen as a linear function from $\mathbb{R}^p$ to $\mathbb{R}^{p*} = L(\mathbb{R}^p)$, the space of scalar linear functions on $\mathbb{R}^p$.
▶ D can be seen as a linear function from $\mathbb{R}^n$ to $\mathbb{R}^{n*} = L(\mathbb{R}^n)$.
▶ The diagram:

\[ \begin{array}{ccc} \mathbb{R}^{p*} & \xrightarrow{\;X\;} & \mathbb{R}^{n} \\ Q \big\uparrow \; \big\downarrow V & & D \big\downarrow \; \big\uparrow W \\ \mathbb{R}^{p} & \xleftarrow{\;X^t\;} & \mathbb{R}^{n*} \end{array} \]

Properties of the Diagram

Rank of the diagram: $X$, $X^t$, $VQ$ and $WD$ all have the same rank.

For Q and D symmetric matrices, $VQ$ and $WD$ are diagonalisable and have the same eigenvalues:

\[ \lambda_1 \geq \lambda_2 \geq \lambda_3 \geq \ldots \geq \lambda_r \geq 0 \geq \cdots \geq 0. \]

Eigendecomposition of the diagram: $VQ$ is Q-symmetric, thus we can find Z such that

\[ VQZ = Z\Lambda, \quad Z^t Q Z = I_p, \quad \text{where } \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p). \tag{1} \]
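A small numerical check of these properties, as a sketch only. It assumes the usual duality-diagram definitions $V = X^t D X$ and $W = X Q X^t$, which the slides use but do not spell out, and it picks a diagonal inverse-variance Q and uniform weights D on simulated data.

```r
set.seed(3)
n <- 8; p <- 3
X <- scale(matrix(rnorm(n * p), n, p), scale = FALSE)   # centered data (simulated)
D <- diag(1 / n, n)                                     # uniform observation weights
Q <- diag(1 / apply(X, 2, var))                         # inverse-variance metric

V <- t(X) %*% D %*% X        # assumed definition V = X'DX
W <- X %*% Q %*% t(X)        # assumed definition W = XQX'

# Solve VQ Z = Z Lambda with Z'QZ = I through the symmetric matrix Q^{1/2} V Q^{1/2}.
Qh <- sqrt(Q)                              # elementwise sqrt is valid because Q is diagonal
e <- eigen(Qh %*% V %*% Qh, symmetric = TRUE)
Z <- solve(Qh) %*% e$vectors
Lambda <- diag(e$values)

# WD is n x n: it has the same nonzero eigenvalues, followed by zeros.
Dh <- sqrt(D)                              # D is diagonal as well
ev_WD <- eigen(Dh %*% W %*% Dh, symmetric = TRUE)$values
```

Going through the symmetrized matrices avoids calling `eigen()` on the non-symmetric products `V %*% Q` and `W %*% D` directly.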

Comparing Two Diagrams: the RV coefficient

Many problems can be rephrased in terms of comparison of two ``duality diagrams'' or, put more simply, two characterizing operators, built from two ``triplets'', usually with one of the triplets being a response or having constraints imposed on it. Most often what is done is to compare two such diagrams, and try to get one to match the other in some optimal way.

To compare two symmetric operators, there is either a vector covariance as inner product

\[ \mathrm{covV}(O_1, O_2) = \mathrm{Tr}(O_1^t O_2) = \langle O_1, O_2 \rangle \]

or a vector correlation (Escoufier, 1977)

\[ RV(O_1, O_2) = \frac{\mathrm{Tr}(O_1^t O_2)}{\sqrt{\mathrm{Tr}(O_1^t O_1)\,\mathrm{Tr}(O_2^t O_2)}}. \]

If we were to compare the two triplets $(X_{n \times 1}, 1, \frac{1}{n} I_n)$ and $(Y_{n \times 1}, 1, \frac{1}{n} I_n)$ we would have $RV = \rho^2$.
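The single-variable case can be verified directly. The sketch below defines a small helper `rv()` (a hypothetical name) following the formula above and applies it to the operators of two centered, simulated variables.

```r
rv <- function(O1, O2) {
  sum(diag(t(O1) %*% O2)) /
    sqrt(sum(diag(t(O1) %*% O1)) * sum(diag(t(O2) %*% O2)))
}

set.seed(4)
n <- 30
x <- rnorm(n); y <- 0.6 * x + rnorm(n)    # two correlated variables (simulated)
x <- x - mean(x); y <- y - mean(y)        # center them
D <- diag(1 / n, n)

Ox <- x %*% t(x) %*% D                    # operator X Q X' D with Q = 1
Oy <- y %*% t(y) %*% D
c(RV = rv(Ox, Oy), rho2 = cor(x, y)^2)    # the two values coincide
```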

PCA: Approximating one diagram by another

PCA can be seen as finding the matrix Y which maximizes the RV coefficient between characterizing operators, that is, between $(X_{n \times p}, Q, D)$ and $(Y_{n \times q}, I, D)$, under the constraint that Y be of rank $q < p$.

\[ RV(XQX^tD, YY^tD) = \frac{\mathrm{Tr}(XQX^tD \, YY^tD)}{\sqrt{\mathrm{Tr}\left((XQX^tD)^2\right) \mathrm{Tr}\left((YY^tD)^2\right)}}. \]

This maximum is attained where Y is chosen as the first q eigenvectors of $XQX^tD$ normed so that $Y^tDY = \Lambda_q$. The maximum RV is

\[ RV_{\max} = \frac{\sum_{i=1}^{q} \lambda_i^2}{\sum_{i=1}^{p} \lambda_i^2}. \]

Of course, classical PCA has $D = \frac{1}{n} I$, $Q = I$, but the extra flexibility is often useful. We define the distance between triplets (X, Q, D) and (Z, M, D), where Z is also $n \times p$, as the distance deduced from the RV inner product between operators $XQX^tD$ and $ZMZ^tD$.

Part III

Combine and Compare Trees, Graphs and Contingent Data

Find sub graphs which are well explained by variables

A subset of the skeleton gene interaction network. → most perturbed subgraph

Find sub graphs which are well explained by variables.

A subgraph related to a chemokine pathway. (GXNA, Serban et al., 2005, Bioinformatics).

Discriminant Analysis as a duality diagram

Let A be the $g \times p$ matrix of group means in each of the p variables. This satisfies

\[ Y^t D X = \Delta_Y A \quad \text{where} \quad \Delta_Y = Y^t D Y = \mathrm{diag}(w_1, w_2, \ldots, w_g), \]

and $w_k = \sum_{i : y_{ik} = 1} d_i$. The $w_k$'s are the group weights, as they are the sums of the weights as defined by D for all the elements in that group.

Call T the matrix $T = X^t D X$. In the standard case with all diagonal elements of D equal to $\frac{1}{n}$ this is just the standard variance-covariance, otherwise it is a generalization thereof.

The generalized between group variance-covariance is $B = A^t \Delta_Y A$, and we call the within group variance-covariance the matrix $W = (X - YA)^t D (X - YA)$.

Proposition 1 (a generalized Huyghens' formula):

\[ T = B + W \]

Proof: Expanding W gives

\[ \begin{aligned} W &= X^tDX - X^tDYA - A^tY^tDX + A^tY^tDYA \\ &= T - A^t\Delta_YA - A^t\Delta_YA + A^t\Delta_YA = T - B \end{aligned} \]
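The decomposition can be checked on a toy example. This sketch assumes that Y is the $n \times g$ matrix of group indicators (which is what the definition of $w_k$ suggests), uses uniform weights $d_i = 1/n$, and simulates the data and the group sizes.

```r
set.seed(5)
n <- 12; p <- 3; g <- 3
X <- scale(matrix(rnorm(n * p), n, p), scale = FALSE)  # centered data (simulated)
groups <- factor(rep(1:g, times = c(3, 4, 5)))
Y <- model.matrix(~ groups - 1)                 # n x g indicator matrix (assumed form of Y)
D <- diag(1 / n, n)                             # uniform weights d_i = 1/n

Delta_Y <- t(Y) %*% D %*% Y                     # diag(w_1, ..., w_g)
A <- solve(Delta_Y, t(Y) %*% D %*% X)           # group means, from Y'DX = Delta_Y A

Tmat <- t(X) %*% D %*% X                        # total
Bmat <- t(A) %*% Delta_Y %*% A                  # between groups
Wmat <- t(X - Y %*% A) %*% D %*% (X - Y %*% A)  # within groups
max(abs(Tmat - (Bmat + Wmat)))                  # zero up to rounding error
```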

Duality Diagram for LDA

The duality diagram for linear discriminant analysis is

\[ \begin{array}{ccc} \mathbb{R}^{p*} & \xrightarrow{\;A\;} & \mathbb{R}^{g} \\ T^{-1} \big\uparrow \; \big\downarrow B & & \Delta_Y \big\downarrow \; \big\uparrow A T^{-1} A^t \\ \mathbb{R}^{p} & \xleftarrow{\;A^t\;} & \mathbb{R}^{g*} \end{array} \]

This corresponds to the triple $(A, T^{-1}, \Delta_Y)$, because

\[ (X^tDY)\,\Delta_Y^{-1}\,(Y^tDX) = A^t \Delta_Y A \]

and gives equivalent results to the triple $(Y^tDX, T^{-1}, \Delta_Y^{-1})$.

The discriminating variables are the eigenvectors of the operator

\[ A^t \Delta_Y A \, T^{-1}. \]

Explaining one diagram by another

Principal Component Analysis with respect to Instrumental Variables was a technique developed by C.R. Rao (1964) to find the best set of coefficients in the multivariate regression setting where the response is multivariate, given by a matrix Y. In terms of diagrams and RV coefficients, this problem can be rephrased as that of finding M to associate to X so that (X, M, D) is as close as possible to (Y, Q, D) in the RV sense.

The answer is provided by defining M such that

\[ YQY^tD = \lambda XMX^tD. \]

If this is possible then the two eigendecompositions of the triple give the same answers. We simplify notation by the following abbreviations:

\[ X^tDX = S_{xx}, \quad Y^tDY = S_{yy}, \quad X^tDY = S_{xy} \quad \text{and} \quad R = S_{xx}^{-1} S_{xy} Q S_{yx} S_{xx}^{-1}. \]

Then

\[ \| YQY^tD - XMX^tD \|^2 = \| YQY^tD - XRX^tD \|^2 + \| XRX^tD - XMX^tD \|^2. \]

The first term on the right hand side does not depend on M, and the second term will be zero for the choice M = R.

If we add the extra constraint that we only allow ourselves a rank q approximation, with $q < \min(\mathrm{rank}(X), \mathrm{rank}(Y))$, the optimal choice of a positive definite matrix M is to take $M = RBB^tR$ where the columns of B are the eigenvectors of $X^tDXR$ with:

\[ B = \left( \frac{1}{\sqrt{\lambda_1}} \beta_1, \ldots, \frac{1}{\sqrt{\lambda_q}} \beta_q \right) \]

such that

\[ X^tDXR\,\beta_k = \lambda_k \beta_k, \qquad \beta_k^t R \beta_k = \lambda_k, \quad k = 1 \ldots q, \qquad \lambda_1 > \lambda_2 > \ldots > \lambda_q. \]

The PCA with regards to instrumental variables of rank q is equivalent to the PCA of rank q of the triple (X, R, D) where

\[ R = S_{xx}^{-1} S_{xy} Q S_{yx} S_{xx}^{-1}. \]

Canonical Correlation Analysis

Canonical correlation analysis was introduced by Hotelling to find the common structure in two sets of variables $X_1$ and $X_2$ measured on the same observations.

Maximizing Covariance

\[ \mathrm{cov}(x, y) = \frac{1}{n} \sum_i (x_i - \bar{x})(y_i - \bar{y}) \]

\[ \begin{pmatrix} v(x) & \mathrm{cov}(x, y) \\ \mathrm{cov}(y, x) & v(y) \end{pmatrix} \]

\[ \mathrm{cor}(x, y) = \frac{\mathrm{cov}(x, y)}{\sqrt{v(x)}\sqrt{v(y)}} = \frac{\langle x, y \rangle}{\sqrt{\langle x, x \rangle}\sqrt{\langle y, y \rangle}} \]

One Diagram to replace Two Diagrams

Canonical correlation analysis was introduced by Hotelling to find the common structure in two sets of variables $X_1$ and $X_2$ measured on the same observations. This is equivalent to merging the two matrices columnwise to form a large matrix with n rows and $p_1 + p_2$ columns and taking as the weighting of the variables the matrix defined by the two diagonal blocks $(X_1^tDX_1)^{-1}$ and $(X_2^tDX_2)^{-1}$:

\[ Q = \begin{pmatrix} (X_1^tDX_1)^{-1} & 0 \\ 0 & (X_2^tDX_2)^{-1} \end{pmatrix} \]

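\[ \begin{array}{ccc} \mathbb{R}^{p_1*} & \xrightarrow{\;X_1\;} & \mathbb{R}^{n} \\ I_{p_1} \big\uparrow \; \big\downarrow V_1 & & D \big\downarrow \; \big\uparrow W_1 \\ \mathbb{R}^{p_1} & \xleftarrow{\;X_1^t\;} & \mathbb{R}^{n*} \end{array} \qquad \begin{array}{ccc} \mathbb{R}^{p_2*} & \xrightarrow{\;X_2\;} & \mathbb{R}^{n} \\ I_{p_2} \big\uparrow \; \big\downarrow V_2 & & D \big\downarrow \; \big\uparrow W_2 \\ \mathbb{R}^{p_2} & \xleftarrow{\;X_2^t\;} & \mathbb{R}^{n*} \end{array} \]

\[ \begin{array}{ccc} \mathbb{R}^{(p_1+p_2)*} & \xrightarrow{\;[X_1; X_2]\;} & \mathbb{R}^{n} \\ Q \big\uparrow \; \big\downarrow V & & D \big\downarrow \; \big\uparrow W \\ \mathbb{R}^{p_1+p_2} & \xleftarrow{\;[X_1; X_2]^t\;} & \mathbb{R}^{n*} \end{array} \]

This analysis gives the same eigenvectors as the analysis of the triple $(X_2^tDX_1, (X_1^tDX_1)^{-1}, (X_2^tDX_2)^{-1})$, also known as the canonical correlation analysis of $X_1$ and $X_2$.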
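One way to see the link with canonical correlations numerically: with the block-diagonal Q, the operator $[X_1; X_2]\, Q\, [X_1; X_2]^t D$ is the sum of the two projectors onto the column spaces of $X_1$ and $X_2$, and the nonzero eigenvalues of such a sum are $1 \pm \rho_i$, where the $\rho_i$ are the canonical correlations. The sketch below checks this on simulated data with $p_1 = p_2 = 2$ and uniform weights (all assumptions for illustration), using `cancor()` from base R as the reference.

```r
set.seed(6)
n <- 50
X1 <- scale(matrix(rnorm(n * 2), n, 2), scale = FALSE)
X2 <- scale(cbind(X1[, 1] + rnorm(n), rnorm(n)), scale = FALSE)   # shares structure with X1
D <- diag(1 / n, n)

Z <- cbind(X1, X2)                                   # merged table with p1 + p2 columns
Q <- matrix(0, 4, 4)                                 # block-diagonal metric
Q[1:2, 1:2] <- solve(t(X1) %*% D %*% X1)
Q[3:4, 3:4] <- solve(t(X2) %*% D %*% X2)

# Eigenvalues of Z Q Z' D, computed from the symmetric form D^{1/2} Z Q Z' D^{1/2}.
Dh <- sqrt(D)                                        # D is diagonal
ev <- eigen(Dh %*% Z %*% Q %*% t(Z) %*% Dh, symmetric = TRUE)$values[1:4]

rho <- cancor(X1, X2)$cor                            # canonical correlations
expected <- sort(c(1 + rho, 1 - rho), decreasing = TRUE)
rbind(ev, expected)                                  # the two rows agree
```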

Incorporate taxa abundances and phylogenetic tree

[Figure: phylogenetic tree with tips labelled by genus (Epulopiscium, Clostridium, Adlercreutzia, Lachnospira, Alistipes, Roseburia, Coprococcus, Blautia, Dehalobacterium, Coprobacillus, Moryella). Legends: Abundance (1, 25, 625) and Class (Actinobacteria (class), Bacilli, Bacteroidia, Clostridia, Erysipelotrichi, Gammaproteobacteria, Mollicutes, Verrucomicrobiae, YS2).]

Duality diagram methods that can use any dependency structure.

Rao's Distance

We start with a distance between samples.

The heterogeneity of a population ($H_i$) is the average distance between members of that population.

The heterogeneity between two populations ($H_{ij}$) is the average distance between a member of population i and a member of population j.

The distance between two populations is

\[ D_{ij} = H_{ij} - \frac{1}{2}(H_i + H_j) \]

Decomposition of Diversity

If we have populations $1, \ldots, k$ with frequencies $\pi_1, \ldots, \pi_k$, then the diversity of all the populations together is

\[ H_0 = \sum_{i=1}^{k} \pi_i H_i + \sum_i \sum_j \pi_i \pi_j D_{ij} = H(w) + D(b) \]
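Both Rao's distance and the decomposition can be checked on a toy example. The sketch assumes that each population is described by relative abundances over a few types, that the dissimilarity between types is a squared Euclidean distance, and that "average distance" means the expectation over two independent draws, so that $H_{ij} = p_i^t \Delta p_j$. All numbers are made up.

```r
# Hypothetical example: 3 populations described by relative abundances over 4 types.
P <- rbind(c(0.7, 0.2, 0.1, 0.0),
           c(0.1, 0.6, 0.2, 0.1),
           c(0.0, 0.1, 0.3, 0.6))
pts <- cbind(c(0, 1, 3, 4), c(0, 2, 1, 3))     # made-up positions of the 4 types
Delta <- as.matrix(dist(pts))^2                # dissimilarity between types
pi_k <- c(0.5, 0.3, 0.2)                       # population frequencies

Hmat <- P %*% Delta %*% t(P)                   # H_ij; the diagonal holds the H_i
Hi <- diag(Hmat)
Dij <- Hmat - 0.5 * outer(Hi, Hi, "+")         # D_ij = H_ij - (H_i + H_j) / 2

p0 <- as.vector(pi_k %*% P)                    # pooled population
H0 <- as.numeric(t(p0) %*% Delta %*% p0)       # diversity of the pooled population
within <- sum(pi_k * Hi)                       # H(w)
between <- as.numeric(t(pi_k) %*% Dij %*% pi_k)  # D(b), double sum over i and j
c(H0 = H0, within_plus_between = within + between)
```

The double sum runs over all ordered pairs (i, j); the diagonal terms vanish because $D_{ii} = 0$.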

Double Principal Coordinate Analysis

Pavoine et al. (2004) developed a method known as DPCoA, implemented in ade4.

Suppose we have n species in p samples/locations and a matrix $\Delta$ giving the squares of the pairwise distances between the species. (Think distance along the tree of the species.) Then we can

▶ Use the distances between species to find an embedding in $(n-1)$-dimensional space such that the Euclidean distances between the species are the same as the distances between the species defined in $\Delta$.
▶ Place each of the p samples/locations at the barycenter of its species profile. The Euclidean distances between the samples/locations will be the same as the square root of the Rao dissimilarity between them.
▶ Use PCA to find a lower-dimensional representation of the locations.

Give the species and communities coordinates such that the inertia decomposes the same way the diversity does.
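The three steps can be imitated in base R. This is only a sketch of the geometry, not the ade4 implementation: the species distances are simulated to be exactly Euclidean, the samples get equal weights in the final PCA, and the Rao dissimilarity is computed with heterogeneity taken as $p^t \Delta p$ (conventions that may differ from the package by constant factors or weights).

```r
set.seed(7)
n_species <- 5; n_samples <- 3
tips <- matrix(rnorm(n_species * 2), n_species, 2)
Delta <- as.matrix(dist(tips))^2               # squared distances between species
counts <- matrix(rpois(n_samples * n_species, 10) + 1, n_samples, n_species)
P <- counts / rowSums(counts)                  # species profile of each sample

# Step 1: embed the species so that squared Euclidean distances equal Delta.
H <- diag(n_species) - matrix(1 / n_species, n_species, n_species)
e <- eigen(-0.5 * H %*% Delta %*% H, symmetric = TRUE)
keep <- e$values > 1e-10                       # drop numerically zero dimensions
Zs <- e$vectors[, keep, drop = FALSE] %*% diag(sqrt(e$values[keep]), sum(keep))

# Step 2: each sample sits at the barycenter of its species profile.
Cs <- P %*% Zs

# Squared distances between samples equal the Rao dissimilarities D_ij.
Hmat <- P %*% Delta %*% t(P)
Rao <- Hmat - 0.5 * outer(diag(Hmat), diag(Hmat), "+")

# Step 3: PCA of the sample positions (equal sample weights assumed).
pca <- prcomp(Cs)
```

Comparing `as.matrix(dist(Cs))^2` with `Rao` shows the second bullet of the slide at work.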

Double Principal Coordinate Analysis

Takes into account both the tree distances and the abundances.
Pavoine, Dufour and Chessel (2004), Purdom (2010) and Fukuyama et al. (2011).
Takes the tree into account in computation of principal coordinates.

    baselineDPCoA = ordinate(phy_pifn_glom, "DPCoA")
    p = plot_ordination(phy_pifn_glom, baselineDPCoA, type="taxa", color="Class")
    p + scale_colour_brewer(type = "qual", palette = "Set1") + geom_point(size=7, alpha=0.2)
    p = plot_ordination(phy_pifn_glom, baselineDPCoA, type="samples", color = "mousenames") +
        geom_point(size = 1) +
        geom_text(aes(label=matn[-which(matn[,2]=="154"),2]), size=8)
    p + scale_colour_hue(guide="none")
    baselineDPCoA$eig[1]/sum(baselineDPCoA$eig)
    0.88
    baselineDPCoA$eig[2]/sum(baselineDPCoA$eig)
    0.067
    sum(baselineDPCoA$eig[1:2])/sum(baselineDPCoA$eig)
    0.95

[Figure: DPCoA ordination of the samples, Axis1 against Axis2, points labelled by mouse name.]

[Figure: DPCoA ordination of the taxa, CS1 against CS2, points colored by Class (Actinobacteria (class), Bacilli, Bacteroidia, Clostridia, Erysipelotrichi, Gammaproteobacteria, Mollicutes, Verrucomicrobiae, YS2).]

Comparing Graphs and Covariates

Example: Co-occurrence graph network.
With the enterotype data we show the results of the makenetwork() function (igraph package) to create a graphical display of the ``connectedness'' of samples according to the profile of taxa that they share.

    > data(enterotype)
    > entc <- subset_samples(enterotype, !is.na(ClinicalStatus))
    > ig <- newnet(entc, TRUE, 0.02, 0.4)
    > p <- gggraph(ig)
    # Make a new plotting data.frame from the ggplot-plot
    > DF <- sampleData(entc)[ig[[9]][[3]][["name"]], ]
    > DF <- data.frame( p$data, DF[p$data$value, ] )
    > p <- p + geom_point(aes(x, y, color=ClinicalStatus,
          shape=Enterotype), data=DF, size=5, alpha=1)
    > print(p)

[Figure: Network analysis plot produced by phyloseq's makenetwork() wrapper on the enterotype dataset.]

Bacteria `sharing' between mice

Using the Jaccard index that measures the co-incidence or co-occurrence of species between mice.

\[ \text{Jaccard Similarity} = \frac{f_{11}}{f_{01} + f_{10} + f_{11}} \]

\[ \text{Jaccard Dissimilarity} = \frac{f_{01} + f_{10}}{f_{01} + f_{10} + f_{11}} \]

    mouse1  0 0 0 1 0 1 0 1 0 0 0 0 0 0 1
    mouse4  1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    vegdist(rbind(mouse1,mouse4),method="jaccard")
    0.8
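The value 0.8 can be recomputed by hand from the two presence/absence vectors. The sketch below reads them off the slide as 15 taxa per mouse (an interpretation of the flattened text) and cross-checks with base R's `dist(method = "binary")`, which computes the same Jaccard dissimilarity without needing the vegan package.

```r
mouse1 <- c(0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1)
mouse4 <- c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)

f11 <- sum(mouse1 == 1 & mouse4 == 1)   # taxa present in both mice
f10 <- sum(mouse1 == 1 & mouse4 == 0)   # present in mouse1 only
f01 <- sum(mouse1 == 0 & mouse4 == 1)   # present in mouse4 only

jaccard_similarity    <- f11 / (f01 + f10 + f11)
jaccard_dissimilarity <- (f01 + f10) / (f01 + f10 + f11)
jaccard_dissimilarity                   # 0.8, as on the slide

# Base R cross-check: the "binary" distance is the Jaccard dissimilarity.
jaccard_base <- as.numeric(dist(rbind(mouse1, mouse4), method = "binary"))
```

Here one taxon is shared, three occur only in mouse1 and one only in mouse4, giving 4/5.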

Bacteria `sharing' between mice as a network

    netbaseline = make_network(phy_pifn_glom)
    p = plot_network(netbaseline, phy_pifn_glom, color="mousenames", label="mousenames", point_size=7) +
        geom_text(aes(label=mousenames), size=7)
    p + scale_colour_hue(guide="none")

[Figure: co-occurrence network of the samples, nodes labelled and colored by mouse name.]

Does the network relate to `communities'?

Friedman and Rafsky (1979) devised a nonparametric test for multivariate data using the minimum spanning tree with any metric.

Then compute the number of `pure' edges connecting labels from the same groups compared to the mixed edges connecting labels from different groups; call $F_o$ the observed statistic.

In our example: $F_o = 82$.

Keeping the graph fixed, permute the labels and recompute the number of pure edges.

All 1000 simulated values had $F_s < 82$ so $p < 0.001$.
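The permutation scheme can be sketched in base R. Everything here is illustrative: the data are two simulated groups, the minimum spanning tree is built with a small Prim's algorithm helper (`mst_edges`, a hypothetical name), and the p-value uses the common "+1" correction rather than the raw proportion quoted on the slide.

```r
# Edges of a minimum spanning tree by Prim's algorithm (base R only).
mst_edges <- function(D) {
  D <- as.matrix(D)
  n <- nrow(D)
  in_tree <- c(TRUE, rep(FALSE, n - 1))
  edges <- matrix(NA_integer_, n - 1, 2)
  for (k in seq_len(n - 1)) {
    sub <- D[in_tree, !in_tree, drop = FALSE]           # candidate edges leaving the tree
    idx <- which(sub == min(sub), arr.ind = TRUE)[1, ]  # shortest one (first if tied)
    from <- which(in_tree)[idx[1]]
    to <- which(!in_tree)[idx[2]]
    edges[k, ] <- c(from, to)
    in_tree[to] <- TRUE
  }
  edges
}

# Number of edges whose two endpoints carry the same label.
count_pure <- function(edges, labels) sum(labels[edges[, 1]] == labels[edges[, 2]])

set.seed(8)
labels <- rep(c("A", "B"), each = 15)
X <- rbind(matrix(rnorm(30, mean = 0), 15, 2),
           matrix(rnorm(30, mean = 3), 15, 2))          # two simulated groups
edges <- mst_edges(dist(X))

F_obs <- count_pure(edges, labels)
F_sim <- replicate(1000, count_pure(edges, sample(labels)))  # graph fixed, labels permuted
p_value <- (1 + sum(F_sim >= F_obs)) / (1 + length(F_sim))
```

Because only the labels are permuted, the tree is computed once; any other graph on the samples could be passed to `count_pure()` in the same way.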

Co-occurrence networks for taxa of the baseline mice

    p = plot_network(netbasetaxa, phy_pifn_glom, color="Family", type="taxa", label=NULL)
    p + geom_text(aes(label=Class), size=3)

[Figure: co-occurrence network of the taxa, nodes labelled by Class (Mollicutes, Bacteroidia, Clostridia) and colored by Family (Anaeroplasmataceae, Catabacteriaceae, Clostridiales Family XIII. Incertae Sedis, Lachnospiraceae, Ruminococcaceae).]

Changes of the network over time?


Differences between two graphs?


Bootstrap for dependency neighborhoods

Given a graph, we want to create a new one by bootstrapping and be able to compare the bootstrapped graphs with our actual ones.

We want a ``Block-Bootstrap'' for graphs.

Following Holmes and Reinert (2004), we use the notion of neighborhood of dependency. This is what inspires our sampling approach.

To create the neighborhood of dependencies, we take a small random walk starting at a node (say 5 steps) and all nodes reached by the walk are sampled.

This is an ego-centric sampling scheme. Take the subset of nodes that have been walked upon and combine them into a new graph.
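The random-walk sampling step can be sketched as follows. This covers only the ego-centric neighborhood sampling on a toy graph; how several sampled neighborhoods are combined into a bootstrap graph is not specified on the slide and is not attempted here. The function name `walk_neighborhood` and the toy adjacency matrix are hypothetical.

```r
# Nodes visited by a short random walk on a graph given by its adjacency matrix.
walk_neighborhood <- function(adj, start, steps = 5) {
  visited <- start
  current <- start
  for (s in seq_len(steps)) {
    nbrs <- which(adj[current, ] > 0)
    if (length(nbrs) == 0) break                  # isolated node: the walk stops
    current <- nbrs[sample.int(length(nbrs), 1)]  # pick one neighbor uniformly
    visited <- c(visited, current)
  }
  unique(visited)
}

# Toy graph: a ring of 10 nodes plus two chords.
n <- 10
adj <- matrix(0, n, n)
adj[cbind(1:n, c(2:n, 1))] <- 1
adj[rbind(c(1, 6), c(3, 8))] <- 1
adj <- pmax(adj, t(adj))                          # make the graph undirected

set.seed(9)
start <- sample.int(n, 1)
nodes <- walk_neighborhood(adj, start, steps = 5)
sub_adj <- adj[nodes, nodes, drop = FALSE]        # induced subgraph on the visited nodes
```

Indexing the neighbor vector with `sample.int()` avoids the pitfall of `sample(x, 1)` when `x` has length one.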

Part IV

Combining Many Data Tables


Comparative Systems

[Figure: stacks of functional, genetic and taxonomic tables, each indexed by Samples and time, shown for an uncontaminated system (e.g. Sample A) and a contaminated system (e.g. Sample B).]

Multi-table methods

Inertia, Co-Inertia
We generalize variance in several directions through the idea of inertia.
As in physics, we define inertia as a weighted sum of squared distances of weighted points.
This enables us to use abundance data in a contingency table and compute its inertia, which in this case is the weighted sum of the squares of distances between observed and expected frequencies, as used in computing the chi-square statistic.
Another generalization of variance-inertia is the useful Phylogenetic diversity index (computing the sum of distances between a subset of taxa through the tree).
We also have such generalizations, taken from standard spatial statistics, that cover the variability of points on a graph.
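As a small check of the contingency-table case, the sketch below computes the inertia of a made-up table of counts and compares it with the chi-square statistic returned by base R. The table and variable names are illustrative assumptions; with frequencies normalized to sum to one, the two quantities agree up to the factor n.

```r
# Made-up 2 x 3 table of counts (all expected counts are above 5).
tab <- matrix(c(10, 5, 3,
                 2, 8, 12), nrow = 2, byrow = TRUE)
n  <- sum(tab)
P  <- tab / n                      # observed relative frequencies
r  <- rowSums(P)                   # row weights
cc <- colSums(P)                   # column weights
E  <- outer(r, cc)                 # expected frequencies under independence

# Inertia: weighted sum of squared distances between observed and expected.
inertia <- sum((P - E)^2 / E)

# Reference value from base R.
chisq <- unname(chisq.test(tab, correct = FALSE)$statistic)
```

Here `n * inertia` and `chisq` coincide.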

Co-Inertia

When studying two variables measured at the same locations, for instance pH and humidity, the standard quantification of covariation is the covariance, built on the sum of cross-products of the centered variables:

sum(x * y) = x1 ∗ y1 + x2 ∗ y2 + x3 ∗ y3

If x and y co-vary in the same direction, this will be large.
A simple generalization, for when the variability is more complicated to measure as above, is Co-Inertia analysis (CIA).
CIA is a multivariate method that identifies trends or co-relationships in multiple datasets which contain the same samples or the same time points. That is, the rows or columns of the matrices have to be weighted similarly and thus must be matchable.

RV coefficient

The global similarity of two data tables, as opposed to two vectors, can be measured by a generalization of covariance: an inner product between tables gives the RV coefficient, a number between 0 and 1, like a correlation coefficient, but for tables.

$$\mathrm{RV}(A,B) = \frac{\mathrm{Tr}(A'B)}{\sqrt{\mathrm{Tr}(A'A)}\,\sqrt{\mathrm{Tr}(B'B)}}$$
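A minimal base-R sketch of the RV coefficient follows. It assumes that each table is represented by its centered cross-product operator (A = XX', B = YY'), which is one common choice; the function name `rv_coef` and the simulated tables are hypothetical.

```r
# Hypothetical helper: RV coefficient between two tables with matched rows.
rv_coef <- function(X, Y) {
  X <- scale(X, center = TRUE, scale = FALSE)
  Y <- scale(Y, center = TRUE, scale = FALSE)
  A <- tcrossprod(X)                      # n x n operator X X'
  B <- tcrossprod(Y)                      # n x n operator Y Y'
  # For symmetric A and B, Tr(A'B) is the sum of elementwise products.
  sum(A * B) / sqrt(sum(A * A) * sum(B * B))
}

set.seed(42)
X <- matrix(rnorm(20 * 4), 20, 4)                          # made-up table
Y <- cbind(X[, 1:2] + rnorm(40, sd = 0.1), rnorm(20))      # shares structure with X

rv_xx <- rv_coef(X, X)    # a table compared with itself
rv_xy <- rv_coef(X, Y)    # two related tables
```

With positive semi-definite operators the trace inner product is non-negative, which is why the coefficient stays between 0 and 1 in this setting.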

Example
Current work with Joey McMurdie, Les Dethlefsen, David Relman, Angela Marcobal, Justin Sonnenburg. Data:

Taxa: read counts (3 patients taking cipro: two time courses).
Mass-Spec: positive and negative ion Mass Spec features and their intensities.
RNA-seq: metagenomic data on genes.

Here is the RV table of the three array types:

> fourtable$RV
           Taxa   Kegg   MassSpec+  MassSpec-
Taxa       1      0.565  0.561      0.670
Kegg       0.565  1      0.686      0.644
MassSpec+  0.561  0.686  1          0.568
MassSpec-  0.670  0.644  0.568      1

Multiple table methods

In PCA we compute the variance-covariance matrix; in multiple table methods we can take a cube of tables and compute the RV coefficients of their characterizing operators.
We then diagonalize this matrix of RV coefficients and find the best weighted `ensemble'.
This is called the `compromise', and all the individual tables can be projected onto it.
file:///Users/susan/MultiTable/ReprodResearch/statis-demo.html
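The compromise construction can be sketched in base R as follows. This is a simplified illustration, not the ade4 implementation: the simulated tables, the normalization of each operator, and the rescaling of the weights to sum to one are assumptions made for the example.

```r
set.seed(7)
n <- 15
# Three made-up tables measured on the same n samples.
tables <- list(matrix(rnorm(n * 3), n),
               matrix(rnorm(n * 5), n),
               matrix(rnorm(n * 4), n))

# Characterizing operator of each table, normalized to unit norm.
ops <- lapply(tables, function(X) {
  Xc <- scale(X, center = TRUE, scale = FALSE)
  W  <- tcrossprod(Xc)
  W / sqrt(sum(W * W))
})

# Matrix of RV coefficients between the operators.
K  <- length(ops)
RV <- matrix(0, K, K)
for (i in 1:K) for (j in 1:K) RV[i, j] <- sum(ops[[i]] * ops[[j]])

# First eigenvector of the RV matrix gives the table weights.
ev <- eigen(RV, symmetric = TRUE)
w  <- abs(ev$vectors[, 1])
w  <- w / sum(w)

# Compromise: weighted sum of the operators.
compromise <- Reduce(`+`, Map(`*`, w, ops))
```

Tables that agree with the others receive larger weights, so the compromise leans toward the structure shared across tables.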

Distances and Inertia

Variance is at the heart of ANOVA: decomposition of variability into a part that can be explained by a factor and a residual part.

Test statistic:

$$\frac{\text{average SS explained by factor}}{\text{average residual SS}}$$
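The decomposition can be checked by hand on simulated data. The data, group labels, and variable names below are made up for illustration; the reference value comes from base R's `anova` on a linear model.

```r
set.seed(3)
g <- factor(rep(c("a", "b", "c"), each = 8))             # made-up factor
y <- rnorm(24, mean = c(0, 0.5, 1)[as.integer(g)])       # made-up response

fitted_means <- ave(y, g)                                # group mean for each observation
ss_factor <- sum((fitted_means - mean(y))^2)             # SS explained by the factor
ss_resid  <- sum((y - fitted_means)^2)                   # residual SS

df_factor <- nlevels(g) - 1
df_resid  <- length(y) - nlevels(g)
F_stat <- (ss_factor / df_factor) / (ss_resid / df_resid)

F_ref <- anova(lm(y ~ g))$`F value`[1]                   # same statistic from base R
```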

Features

1. Inertia: Trace(VQ) = Trace(WD)
(inertia in the sense of Huygens' inertia formula, for instance; Huygens, C. (1657)),

$$\sum_{i=1}^{n} p_i \, d^2(x_i, a)$$

Inertia with regard to a point $a$ of a cloud of $p_i$-weighted points.
For PCA with $Q = I_p$, $D = \frac{1}{n} I_n$, and centered variables, the inertia is the sum of the variances of all the variables.
If the variables are standardized (Q is the diagonal matrix of inverse variances), then the inertia is the number of variables p.
For correspondence analysis the inertia is the Chi-squared statistic.
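The PCA statements can be verified numerically. The sketch assumes V = X'DX and W = XQX' for a centered table X, the usual duality-diagram definitions; the simulated data are made up.

```r
set.seed(11)
# Made-up table with unequal column variances, then centered.
X <- scale(matrix(rnorm(50 * 4), 50, 4) %*% diag(c(1, 2, 3, 4)),
           center = TRUE, scale = FALSE)
n <- nrow(X); p <- ncol(X)

D <- diag(1 / n, n)        # uniform weights on observations
Q <- diag(p)               # identity metric on variables

V <- t(X) %*% D %*% X      # variance-covariance matrix (divisor n)
W <- X %*% Q %*% t(X)

inertia_VQ <- sum(diag(V %*% Q))
inertia_WD <- sum(diag(W %*% D))

var_n <- apply(X, 2, function(x) mean(x^2))   # variances with divisor n
Qs <- diag(1 / var_n)                          # metric of inverse variances
inertia_std <- sum(diag(V %*% Qs))             # inertia for standardized variables
```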

Goals already attained:

▶ Mathematical generalizations using diagram and multilinear approaches (cubes with statis).
▶ Compute distances between general trees (distory) and use them in MDS.
▶ Data integration (phyloseq): networks, trees, abundance tables.
▶ Threshold, sensitivity tests and modeling simulations.
▶ Reproducibility: open source standards, publication of source code and data (R).

Questions:
More methods based on approximate diagrams (expectations as in Stein's method).
We know the distance between hierarchical clustering trees; what is the correct distance between barcodes?

Benefitting from the tools and schools of Statisticians...

Thanks to the R community: Chessel, Jombart, Dray, Thioulouse (ade4) and Emmanuel Paradis (ape).

Collaborators: David Relman, Alfred Spormann, Les Dethlefsen, Justin Sonnenburg, Persi Diaconis, Elizabeth Purdom.

Postdoctoral Fellows: Paul (Joey) McMurdie, Angela Marcobal, Serban Nacu.
Students: Alden Timme, Katie Shelef, Yana Hoy, John Chakerian, Julia Fukuyama, Kris Sankaran.
Funding from NIH/NIGMS R01, NSF-VIGRE and NSF-DMS.

phyloseq

Joey McMurdie (joey711 on github).
Available in Bioconductor.

[1] L. Billera, S. Holmes, and K. Vogtmann. The geometry of tree space. Adv. Appl. Maths, 771--801, 2001.

[2] J. Chakerian and S. Holmes. distory: Distances between trees, 2010.

[3] Daniel Chessel, Anne Dufour, and Jean Thioulouse. The ade4 package - I: One-table methods. R News, 4(1):5--10, 2004.

[4] P. Diaconis, S. Goel, and S. Holmes. Horseshoes in multidimensional scaling and kernel methods. Annals of Applied Statistics, 2007.

[5] Y. Escoufier. Operators related to a data matrix. In J.R. Barra et al., editors, Recent Developments in Statistics, pages 125--131. North Holland, 1977.

[6] Jerome H Friedman and Lawrence C Rafsky. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, pages 697--717, 1979.

[7] Susan Holmes. Multivariate analysis: The French way. In D. Nolan and T. P. Speed, editors, Probability and Statistics: Essays in Honor of David A. Freedman, volume 56 of IMS Lecture Notes--Monograph Series. IMS, Beachwood, OH, 2006.

[8] Purna C Kashyap, Angela Marcobal, Luke K Ursell, Samuel A Smits, Erica D Sonnenburg, Elizabeth K Costello, Steven K Higginbottom, Steven E Domino, Susan P Holmes, David A Relman, et al. Genetically dictated change in host mucus carbohydrate landscape exerts a diet-dependent effect on the gut microbiota. Proceedings of the National Academy of Sciences, page 201306070, 2013.

[9] S.W. Kembel, P.D. Cowan, M.R. Helmus, W.K. Cornwell, H. Morlon, D.D. Ackerly, S.P. Blomberg, and C.O. Webb. Picante: R tools for integrating phylogenies and ecology. Bioinformatics, 26(11):1463--1464, 2010.

[10] K. Mardia, J. Kent, and J. Bibby. Multivariate Analysis. Academic Press, NY, 1979.

[11] P. J. McMurdie and S. Holmes. Phyloseq: A bioconductor package for handling and analysis of high-throughput phylogenetic sequence data.

[12] P. J. McMurdie and S. Holmes. Phyloseq: Reproducible research platform for bacterial census data. PLoS ONE, April 22, 2013.

[13] Serban Nacu, Rebecca Critchley-Thorne, Peter Lee, and Susan Holmes. Gene expression network analysis and applications to immunology. Bioinformatics, 23(7):850--8, Apr 2007.

[14] Jari Oksanen, F. Guillaume Blanchet, Roeland Kindt, Pierre Legendre, R. G. O'Hara, Gavin L. Simpson, Peter Solymos, M. Henry H. Stevens, and Helene Wagner. vegan: Community Ecology Package, 2010.

[15] E. Paradis, J. Claude, and K. Strimmer. Ape: analyses of phylogenetics and evolution in R language. Bioinformatics, 20:289--290, 2004.

[16] Sandrine Pavoine, Anne-Béatrice Dufour, and Daniel Chessel. From dissimilarities among species to dissimilarities among communities: a double principal coordinate analysis. Journal of Theoretical Biology, 228(4):523--537, 2004.

[17] Katherine S. Pollard, Houston N. Gilbert, Yongchao Ge, Sandra Taylor, and Sandrine Dudoit. multtest: Resampling-based multiple hypothesis testing, 2010. R package version 2.4.0.

[18] C. R. Rao. The use and interpretation of principal component analysis in applied research. Sankhya A, 26:329--359, 1964.

Distances between Trees

▶ Robinson and Foulds (bipartitions).
▶ Nearest Neighbor Interchange (NNI): rotation moves.

[Figure: NNI rotation moves between trees on leaves 1-4 with root 0.]

▶ Subtree Prune and Regraft (SPR).
▶ Fill-in of NNI moves: Billera, Holmes, Vogtmann (BHV).

The boundaries between regions represent an area of uncertainty about the exact branching order. In biological terminology this is called an `unresolved' tree.
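To make the bipartition idea concrete, here is a hedged base-R sketch of a Robinson-Foulds style count on two small rooted trees. Each tree is written by hand as its set of non-trivial clades, which stand in for the bipartitions; the helper `splits` and the two example trees are assumptions for illustration and do not use the ape or distory packages.

```r
# Hypothetical helper: encode each clade as a sorted, comma-separated string.
splits <- function(clades) {
  vapply(clades, function(s) paste(sort(s), collapse = ","), "")
}

# Two rooted trees on leaves 1..4, given by their non-trivial clades.
t1 <- splits(list(c(1, 2), c(1, 2, 3)))   # tree ((1,2),3),4
t2 <- splits(list(c(1, 2), c(3, 4)))      # tree (1,2),(3,4)

# Distance: number of clades present in one tree but not in the other.
rf <- length(setdiff(t1, t2)) + length(setdiff(t2, t1))
rf   # 2
```

The clade {1,2} is shared, while {1,2,3} and {3,4} each appear in only one tree, giving a distance of 2.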


Boundary for trees with 3 leaves

[Figure: four trees on leaves 1, 2, 3 with root 0.]

The quadrant for one tree

[Figure: a quadrant with corners (0,0), (1,0), (0,1), (1,1); trees on leaves 1-4 with root 0 are drawn at points of the quadrant.]

Link of the origin
All 15 quadrants for n = 4 share the same origin. If we take the diagonal line segment x + y = 1 in each quadrant, we obtain a graph with an edge for each quadrant and a trivalent vertex for each boundary ray; this graph is called the link of the origin.

[Figure: a quadrant with the segment x + y = 1 joining (1,0) and (0,1); trees on leaves 1-4 with root 0.]
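The count of 15 quadrants matches the number of rooted binary trees on 4 labelled leaves, which is the double factorial (2n - 3)!!. A one-line check in R, where the function name `n_trees` is a hypothetical label:

```r
# Number of rooted binary trees on n labelled leaves: (2n - 3)!! = 1 * 3 * ... * (2n - 3).
n_trees <- function(n) prod(seq(1, 2 * n - 3, by = 2))

n_trees(3)   # 3
n_trees(4)   # 15
```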

Cube complex of Euclidean Orthants

A path between two trees consists of line segments through a sequence of orthants.
This sequence of orthants is the path.
A path is a geodesic when it has the smallest length of all paths between two points.

A Cone Path

A path between two trees T and T′ always exists. Since all orthants connect at the origin, any two trees T and T′ can be connected by a two-segment path through the origin; this is called the cone-path.
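The length of the cone-path is easy to compute once each tree is written as the vector of its internal edge lengths in its own orthant. The sketch below makes that representation explicit; the function name and the example edge lengths are assumptions. The cone-path length is an upper bound on the geodesic distance, not the geodesic distance itself.

```r
# Hypothetical helper: length of the two-segment path through the origin.
# e1, e2: internal edge lengths of the two trees (coordinates in their orthants).
cone_path_length <- function(e1, e2) {
  sqrt(sum(e1^2)) + sqrt(sum(e2^2))
}

e_T  <- c(1, 2)   # made-up internal edge lengths of T
e_Tp <- c(3, 4)   # made-up internal edge lengths of T'

len <- cone_path_length(e_T, e_Tp)   # sqrt(5) + 5
```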

Three orthants sharing a common boundary for n = 4 leaves.

[Figure: trees on leaves 1-4 with root 0, drawn in the three orthants and on their common boundary.]

[Figure: triangles with sides a, b, c.]

Tree space with the BHV metric is a CAT(0) space, that is, it has non-positive curvature. This implies there is a unique geodesic between any two trees (Gromov).
It is also not a Euclidean space.

Euclidean space (where through every point not on a line there is exactly one line parallel to it) is flat: the sum of the angles of a triangle is 180°.

Hyperbolic space is `negatively' curved:

Euclid's parallel postulate is replaced.
In hyperbolic geometry, given a line l and a point P not on l, there are at least two distinct lines through P which do not intersect l, so the parallel postulate is false.
A characteristic property of hyperbolic geometry is that the angles of a triangle add to less than 180°.

[Figure: triangles with sides a, b, c.]

Geodesic metric space:
If we have a distance defined between any two points of a space, we call it a metric space.
(The distance doesn't have to be defined through ordinary coordinates.)
A geodesic metric space is a metric space in which any two points are joined by a geodesic, a shortest path whose length equals the distance between them.

A δ-hyperbolic space is a geodesic metric space in which every geodesic triangle is δ-thin.
δ-thin: pick three points and draw geodesic lines between them to make a geodesic triangle. Then any point on any of the edges of the triangle is within a distance δ of one of the other two sides.

For example, trees are 0-hyperbolic: a geodesic triangle in a tree is just a subtree, so any point on a geodesic triangle is actually on two edges.

Normal Euclidean space is ∞-hyperbolic, i.e. not hyperbolic.
Generally, the higher δ has to be, the less curved the space is.
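A numerical illustration is possible with Gromov's four-point condition, which is a different but closely related way to quantify hyperbolicity than the thin-triangle definition used here. The sketch is an assumption-laden example: the function `four_point_delta`, the star-tree metric, and the planar point set are all made up, and on a finite Euclidean sample the value is finite even though the full plane is not hyperbolic.

```r
# Hypothetical helper: largest four-point defect over all quadruples.
# For each quadruple, sort the three pairwise-sum pairings and take
# half the gap between the two largest.
four_point_delta <- function(d) {
  worst <- 0
  for (q in combn(nrow(d), 4, simplify = FALSE)) {
    s <- sort(c(d[q[1], q[2]] + d[q[3], q[4]],
                d[q[1], q[3]] + d[q[2], q[4]],
                d[q[1], q[4]] + d[q[2], q[3]]), decreasing = TRUE)
    worst <- max(worst, (s[1] - s[2]) / 2)
  }
  worst
}

# Tree metric: leaves of a star tree with pendant edge lengths 1..5.
len <- c(1, 2, 3, 4, 5)
d_tree <- outer(len, len, "+")
diag(d_tree) <- 0

# Euclidean metric: corners of the unit square plus its centre.
pts <- rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0.5, 0.5))
d_euc <- as.matrix(dist(pts))

delta_tree <- four_point_delta(d_tree)   # 0 for a tree metric
delta_euc  <- four_point_delta(d_euc)    # strictly positive
```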

Is it better to represent the distances by a tree or a Euclidean projection?

Tree of Trees

A tree is a complete CAT(0) space.

[Figure: triangles with sides a, b, c.]

Since BHV, 2001 [1] have shown that the space of trees is non-positively curved (a CAT(0) space), the most natural representation of a collection of trees may be a tree.

Some Advantages of R

Cutting edge solutions: Availability of > 4000 packages, all statistical/machine learning methods available. Robust methods, sparse versions of all of the projection methods.

Data Input and cleaning: Multiple formats (Newick, RNA-seq, Qiime, HTS, Micro-array, Mass-Spec, Genome, ...).

Visualization and Presentation: High quality graphics: ggplot2.

Reproducibility: Sweave, history, ...

Open Source: Community of users, documentation.

Simulation: Random number generators.

Some Difficulties in R

Main Drawback: By default works with data in RAM.

Another Problem: Heterogeneity: Data, Packages, Documentation.

Open Source: Community of users, documentation.

Best Platform for Designing a Statistical Analysis

▶ Comparing sophisticated methods. Almost 5,000 packages available.
▶ Compare 20 methods on a new type of data in a week.
▶ High quality visualization.

Seamless Interaction with standardized Databases

▶ Through the Bioconductor package AnnotationDbi we have access to GenBank, the Gene Ontology Consortium, Entrez genes, UniGene, the UCSC Human Genome Project, KEGG.
▶ metlin, ChemmineR, rcdklibs, rpubchem.

http://www.bioconductor.org/packages/2.8/bioc/html/KEGGSOAP.html
http://bioweb.ucr.edu/ChemMineV2/
http://cran.r-project.org/web/packages/rpubchem/index.html

Packages Tailored to specific formats

▶ Maps.
▶ Trees, Networks and Graphs.
▶ Microarrays, Mass Spectroscopy.
▶ Texts.
▶ Images.
▶ DNA, AA, Sequencing Data, HTS, RNA-seq.
▶ Qiime / mothur formats.

Plotting

High powered graphics packages with layered functionalities:
▶ ade4.
▶ Philosophy: Leland Wilkinson's Grammar of Graphics.
▶ Implementation: ggplot2 (Hadley Wickham youtube video).
▶ Animated gifs, interactive graphics.

http://youtu.be/TaxJwC_MP9Q

One Diagram to replace Two Diagrams

Canonical correlation analysis was introduced by Hotelling to find the common structure in two sets of variables $X_1$ and $X_2$ measured on the same observations. This is equivalent to merging the two matrices columnwise to form a large matrix with n rows and $p_1 + p_2$ columns and taking as the weighting of the variables the matrix defined by the two diagonal blocks $(X_1^t D X_1)^{-1}$ and $(X_2^t D X_2)^{-1}$:

$$Q = \begin{pmatrix} (X_1^t D X_1)^{-1} & 0 \\ 0 & (X_2^t D X_2)^{-1} \end{pmatrix}$$

$$
\begin{array}{ccc}
\mathbb{R}^{p_1 *} & \xrightarrow{\;X_1\;} & \mathbb{R}^{n} \\
I_{p_1} \big\uparrow \; \big\downarrow V_1 & & D \big\downarrow \; \big\uparrow W_1 \\
\mathbb{R}^{p_1} & \xleftarrow{\;X_1^t\;} & \mathbb{R}^{n *}
\end{array}
\qquad
\begin{array}{ccc}
\mathbb{R}^{p_2 *} & \xrightarrow{\;X_2\;} & \mathbb{R}^{n} \\
I_{p_2} \big\uparrow \; \big\downarrow V_2 & & D \big\downarrow \; \big\uparrow W_2 \\
\mathbb{R}^{p_2} & \xleftarrow{\;X_2^t\;} & \mathbb{R}^{n *}
\end{array}
$$

$$
\begin{array}{ccc}
\mathbb{R}^{(p_1+p_2) *} & \xrightarrow{\;[X_1;X_2]\;} & \mathbb{R}^{n} \\
Q \big\uparrow \; \big\downarrow V & & D \big\downarrow \; \big\uparrow W \\
\mathbb{R}^{p_1+p_2} & \xleftarrow{\;[X_1;X_2]^t\;} & \mathbb{R}^{n *}
\end{array}
$$

This analysis gives the same eigenvectors as the analysis of the triple $(X_2^t D X_1, (X_1^t D X_1)^{-1}, (X_2^t D X_2)^{-1})$, also known as the canonical correlation analysis of $X_1$ and $X_2$.
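The triple formulation can be checked against base R's `cancor`. The sketch below is an illustration under assumptions: the tables are simulated, D is taken as uniform weights 1/n, and the eigen-decomposition used is one standard way of analysing such a triple rather than code from the talk.

```r
set.seed(5)
n <- 40
# Two made-up centered tables on the same n observations.
X1 <- scale(matrix(rnorm(n * 3), n, 3), scale = FALSE)
X2 <- scale(cbind(X1[, 1] + rnorm(n), rnorm(n)), scale = FALSE)

D   <- diag(1 / n, n)
S11 <- t(X1) %*% D %*% X1
S22 <- t(X2) %*% D %*% X2
S21 <- t(X2) %*% D %*% X1          # the cross-product X2' D X1 of the triple

# Eigenvalues of this product are the squared canonical correlations.
M <- solve(S22) %*% S21 %*% solve(S11) %*% t(S21)
rho_eigen  <- sqrt(sort(Re(eigen(M)$values), decreasing = TRUE))

rho_cancor <- cancor(X1, X2)$cor   # reference values from base R
```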

Part VI

Confirmatory Analysis: Nonparametric tests

Nonparametric Tests

▶ Canonical Correspondence Analysis tests on a factor.
▶ Mantel's Test between distance matrices.
▶ Multiple testing correction.
▶ Bootstrap tests.

Everything is done by shuffling labels.
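Label shuffling can be illustrated with a small Mantel-type permutation test written in base R. This is a sketch, not the vegan implementation: the function `mantel_perm`, the simulated distance matrices, and the one-sided p-value convention (hits + 1) / (permutations + 1) are assumptions for the example.

```r
# Hypothetical helper: permutation test for correlation between two distance matrices.
mantel_perm <- function(d1, d2, nperm = 199) {
  lt  <- lower.tri(d1)
  obs <- cor(d1[lt], d2[lt])                 # observed statistic
  perm <- replicate(nperm, {
    idx <- sample(nrow(d1))                  # shuffle the sample labels of d1
    cor(d1[idx, idx][lt], d2[lt])
  })
  list(statistic = obs,
       p.value = (sum(perm >= obs) + 1) / (nperm + 1))
}

set.seed(2)
x  <- matrix(rnorm(30 * 2), 30, 2)                   # made-up coordinates
d1 <- as.matrix(dist(x))
d2 <- as.matrix(dist(x + rnorm(60, sd = 0.2)))       # noisy copy of the same configuration

res <- mantel_perm(d1, d2)
```

Under this convention, 199 permutations make 1/200 = 0.005 the smallest attainable p-value.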

Example: Is there a shedding effect in Relman/Hoy Mice data?
> shed = scan("shed.txt")
> shedf = as.factor(shed)[-(70:71)]
> resca.shed = cca(t(pib.nz) ~ shedf)
> anova(resca.shed)
Permutation test for cca under reduced model

Model: cca(formula = t(pib.nz) ~ shedf)
         Df  Chisq      F N.Perm Pr(>F)
Model     6 0.3825 2.0627    199  0.005 **
Residual 87 2.6886
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Example: Is there a subject effect in Katie's data?
> subjcc = vegan::cca(tnorepnz ~ subject)
> anova(subjcc)
Permutation test for cca under reduced model

Model: cca(formula = tnorepnz ~ subject)
          Df  Chisq       F N.Perm Pr(>F)
Model      7 1.2177 10.7999    199  0.005 **
Residual 472 7.6027
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> cca.cage = vegan::cca(t(tcmall) ~ cagef)
> plot(cca.cage)
> text(cca.cage, choices = c(1, 2), label = cagef, display = "lc")
> anova(cca.cage)
Permutation test for cca under reduced model

Model: cca(formula = t(tcmall) ~ cagef)
          Df  Chisq      F N.Perm Pr(>F)
Model      2 0.3722 7.6501    199  0.005 **
Residual 178 4.3305
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

[Figure: CCA plot for the cage effect, axes CCA1 and CCA2; points drawn as + and ●, three centroids marked x, and sample scores labelled with the cage numbers 15, 7, and 4.]

> plot(cca.cage, scaling = 1)
> text(cca.cage, scaling = 1, choices = c(1, 2), display = "sp",
+     cex = 0.6)
> title("Species, Cage effect, 1,2")

[Figure: "Species, Cage effect, 1,2". CCA plot with scaling = 1, axes CCA1 and CCA2; species are labelled with abbreviated family names, mostly Lachno and Ruminoc, with occasional others such as Catabact, Erysip, ClostridI, Lactobaci, Coriobact, Verruc, and unassigned f__.]

Over-representation of certain phyla

Set               Over-represented     Universe
Microbiome        Families/Phyla       Species Present
Gene Expression   Ontological groups   Filtered Genes

Test in both cases: hypergeometric / Fisher's exact test.
We define the set of prefiltered species (species universe) as those that passed the threshold test of being present (> 6000) in at least 31 of the arrays.
This method is especially relevant here as the tree does not show equal representation of different families and phyla.
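The equivalence between the hypergeometric tail and the one-sided Fisher exact test can be sketched in base R. All counts below are invented for illustration and do not come from the data discussed in the talk.

```r
# Made-up counts.
universe  <- 200   # species present (the universe)
in_family <- 40    # species in the universe that belong to the family of interest
selected  <- 30    # species flagged as significant
hits      <- 12    # flagged species that belong to the family

# Hypergeometric upper tail: P(at least `hits` family members among the selected).
p_hyper <- phyper(hits - 1, in_family, universe - in_family, selected,
                  lower.tail = FALSE)

# Same question as a 2 x 2 table for Fisher's exact test.
tab <- matrix(c(hits,             selected - hits,
                in_family - hits, universe - in_family - (selected - hits)),
              nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value
```

Because the universe enters the calculation directly, the result depends on how the prefiltered species set is defined.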

Hypergeometric Tests for over-representation of certain phyla

▶ IBS higher group had significantly more Bacteroidetes.
▶ Overrepresentation of Firmicutes in the healthy controls.
▶ At the family level, the results showed that the families Oxalobacteraceae, Prevotellaceae, Burkholderiaceae, and Sphingobacteriaceae were significantly overrepresented in IBS.
▶ Conversely, the most significantly enriched family in control rats was Lachnospiraceae, including Ruminococcus sp., followed by Erysipelotrichaceae and Clostridiaceae.

Part VII

Robustness

High Breakdown (median, ...)

Sparse (L1 minimizing)

Nonparametric (ranks, nmMDS, ...)

Tumor Cells

[Figure: both axes run from 0 to 20000; a third scale runs from 0 to 0.001.]

0 1  1 0 -1  1 0 0 0 -1
0 1  1 0  0  0 0 0 0  1
0 1 -1 0 -1  0 0 0 0 -1
0 1  1 0  0 -1 1 0 1  1
0 1  1 0  0  0 0 1 0  1

Solutions: Ways of Using Trees

Main tool: ape.
Example:

> # Read the full tree and keep only the tips that match our OTU ids
> treeall = read.tree("/Users/susan/Dropbox/Yana/gg.tre")
> todrop = which(!(treeall$tip.label %in% otu.id))
> dropthese = treeall$tip.label[todrop]
> yana1772tree = drop.tip(treeall, dropthese)
> # Panel layout, then the subtree restricted to the selected tips
> layout(matrix(c(1,1,1,2,2,2,3,4,5,5,5,5), ncol=12, nrow=1))
> subtree = drop.tip(tree, which(!tree$tip.label %in% selected))
> plot(tree, use.edge.length=FALSE, no.margin=TRUE, edge.color=colo, show.tip.label=FALSE)
> ### Relevant subtree
> plot(subtree, edge.color='red', show.tip.label=TRUE, no.margin=TRUE, use.edge.length=FALSE)

Mapping Variables onto Phylogenies

Example using phyloseq package
> data(esophagus)
> es1 <- esophagus
> sn <- sample.names(esophagus)
> sampleData(es1) <- sampleData(data.frame(sample=sn, row.names=sn))

We create the tree graphic, grouping/coloring by our dummy sample-name variable, and also labelling the number of individuals observed in each sample (if at all). The symbols are slightly enlarged as the number of individuals increases.
> plot_tree_phyloseq(es1, color_factor="sample",
+     type_abundance_value=TRUE, treeTitle="esophagus example data")
