Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree...
Transcript of Geometrical Methods for Heterogeneous Datastatweb.stanford.edu/~susan/IMA_Geometry.pdfGraph or Tree...
Geometrical Methods forHeterogeneous Data
Susan Holmeshttp://www-stat.stanford.edu/˜susan/
Bio-X and Statistics, Stanford University
IMA, 2013 and funded by NIH-R01GM086884
ABabcdfghiejkl..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
What's happening in Biological Data Analysis?
1985 1998 2012
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
What's happening in Biological Data Analysis?1985 1998 2012
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
What's happening in Biological Data Analysis?1985 1998 2012
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
What's happening in Biological Data Analysis?1985 1998 2012
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Challenges for those of us working from the groundup
▶ Heterogeneity.▶ Structured high-dimensionality.▶ Graph or Tree integration.▶ Identifiability.▶ High quality graphics.▶ Reproducibility.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Part I
Heterogeneity
`Homogeneous data are all alike;
all heterogeneous data are heterogeneous
in their own way.'
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Heterogeneity of Data▶ Status : response/ explanatory.▶ Hidden (latent)/measured.▶ Types :
▶ Continuous▶ Binary, categorical▶ Graphs/ Trees▶ Images▶ Maps/ Spatial Information▶ Rankings
▶ Amounts of dependency: independent/time series/spatial.▶ Different technologies used (454, phyloseq, Illumina,
MassSpec, RNA-seq).
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Microbiome and other useful `ome'wordsJoshua Lederberg:`the ecological community of commensal,symbiotic, and pathogenic microorganisms that literally shareour body space and have been all but ignored asdeterminants of health and disease'Microbiome Complete collection of genes contained in the
genomes of microbes living in a givenenvironment.
Numbers Humans shelter 100 trillion microbes (1014), (weare made of 10 ×1012 cells)It is estimated that there are more than 1000`species' of bacteria living in the human gut.
Metagenome Composition of all genes present in anenvironment (soil, gut, seawater), regardless ofspecies.
Transciptome These are the mRNA transcripts in the cell, itreflects the genes that are being activelyexpressed at any given time.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Bacteria etc... and Us
The human microbiome or human microbiota is theassemblage of microorganisms that reside on the surface andin deep layers of skin, in the saliva and oral mucosa, in theconjunctiva, and in the gastrointestinal tracts.
▶ They include bacteria, fungi, and archaea.▶ Some of these organisms perform tasks that are useful
for the human host. (live in symbiosis)▶ Majority have no known beneficial or harmful effect.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
YK Lee and SK Mazmanian Science, 2010.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Human Microbiome: What are the data?
DNA The Genomic material present (16sRNA-geneespecially).
RNA What genes are being turned on (geneexpression).
Mass Spec Specific signatures of chemical compoundspresent.
Clinical Multivariate information about patients' clinicalstatus, medication, weight.
Environmental Location, nutrition, time.Domain Knowledge Metabolic networks, phylogenetic trees,
gene ontologies.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Heterogeneous Data ObjectsObject Oriented Data AnalysisInput and data manipulation with phyloseq(McMurdie and Holmes, 2013, Plos ONE)As always in R: object oriented data:
Taxonomy Table taxonomyTableslots: .Data
OTU Abundanceclass: otuTableslots: .Data, speciesAreRows
Sample VariablessampleDataslots: .Data,names,row.names,.S3Class
Phylogenetic Treeclass: phyloslots: see ape package
matrix matrixdata.frame
phyloseqslots:otuTablesampleDatataxTabtre
Experiment-level data object:
Component data objects:
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Questions from many Disciplines
Biomedical sciences Clinical variables, immune history.Genetics Host and bacterial genomes and transcriptomes.Bioinformatics Databases and formats.Biochemistry Compounds, Metabolic pathways involved.Ecology Ordination, diversity, environmental influences.Statistics Decompositions of variability, spatio-temperal
analyses.Systematic Biology Phylogenetic Trees, dynamics of evolution.Network science Microbial communities, gene interactions,
metabolic pathways,...Visualization Dimension reduction, rich representations.Mathematics Differential Geometry, multilinear algebra,
probability, topology.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Useful first order representation: Many Matrices
▶ Time series of abundance matrices.▶ Different types of data on same samples (taxa counts,
clinical variates, spatial location).▶ Networks and trees over time.▶ Explanatory (environmental) variables, Response variables.
Holmes (2005), Duality Diagrams.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Part II
Dimension Reduction: theEuclidean embedding workhorse:
MDS
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Principal Component Analysis: Dimension Reduction
PCA seeks to replace the original (centered) matrix X by amatrix of lower rank, this can be solved using the singularvalue decomposition of X:
X = USV′, with U′DU = In and V′QV = Ip and Sdiagonal
XX′ = US2U′, with U′DU = In and S2 = Λ
PCA is a linear nonparametric multivariate method fordimension reduction. D and Q are the relevant metrics on thedual row and column spaces of n samples and p variables.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Examples from Ecology
The Boomlake plant data:
Biplot representing both species and locationsBlue circles with letters are species scores
Sampling locations are green circles with numbers.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Sample 1 is actually in the lake, and sample 12 is far away.Species are located closely to the samples they occur in. Ifyou looked carefully into the data matrix, you would find thatspecies R and Q are strictly aquatic, while species F is adryland plant (cribs). There is an arch effect.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Psychological Data
Color confusion data (Ekman, 1954):
w434 w445 w465 w472 w490 w504 w537 w555 w584 w600 w610 w628 w651 w6741 0.00 0.86 0.42 0.42 0.18 0.06 0.07 0.04 0.02 0.07 0.09 0.12 0.13 0.162 0.86 0.00 0.50 0.44 0.22 0.09 0.07 0.07 0.02 0.04 0.07 0.11 0.13 0.143 0.42 0.50 0.00 0.81 0.47 0.17 0.10 0.08 0.02 0.01 0.02 0.01 0.05 0.034 0.42 0.44 0.81 0.00 0.54 0.25 0.10 0.09 0.02 0.01 0.00 0.01 0.02 0.045 0.18 0.22 0.47 0.54 0.00 0.61 0.31 0.26 0.07 0.02 0.02 0.01 0.02 0.006 0.06 0.09 0.17 0.25 0.61 0.00 0.62 0.45 0.14 0.08 0.02 0.02 0.02 0.017 0.07 0.07 0.10 0.10 0.31 0.62 0.00 0.73 0.22 0.14 0.05 0.02 0.02 0.008 0.04 0.07 0.08 0.09 0.26 0.45 0.73 0.00 0.33 0.19 0.04 0.03 0.02 0.029 0.02 0.02 0.02 0.02 0.07 0.14 0.22 0.33 0.00 0.58 0.37 0.27 0.20 0.2310 0.07 0.04 0.01 0.01 0.02 0.08 0.14 0.19 0.58 0.00 0.74 0.50 0.41 0.2811 0.09 0.07 0.02 0.00 0.02 0.02 0.05 0.04 0.37 0.74 0.00 0.76 0.62 0.5512 0.12 0.11 0.01 0.01 0.01 0.02 0.02 0.03 0.27 0.50 0.76 0.00 0.85 0.6813 0.13 0.13 0.05 0.02 0.02 0.02 0.02 0.02 0.20 0.41 0.62 0.85 0.00 0.7614 0.16 0.14 0.03 0.04 0.00 0.01 0.00 0.02 0.23 0.28 0.55 0.68 0.76 0.00
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Results: Color Confusion
Screeplot
●
●
●
●
●●
● ● ● ●
2 4 6 8 10
0.0
0.5
1.0
1.5
2.0
Index
clas
s.co
l$ei
g
Planar configuration
−0.4 −0.2 0.0 0.2 0.4
−0.
4−
0.2
0.0
0.2
0.4
cmdscale(colorc)[,1]cm
dsca
le(c
olor
c)[,2
]
1 2
34
5
6
78
9
10
11
1213
14
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Metric Multidimensional ScalingSchoenberg (1935)
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
From Coordinates to Distances and Back
If we started with original data in Rp that are not centered:Y, apply the centering matrix
X = HY, with H = (I− 1
n11′), and 1′ = (1, 1, 1 . . . , 1)
Call B = XX′, if D(2) is the matrix of squared distancesbetween rows of X in the euclidean coordinates, we can showthat
−1
2HD(2)H = B
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Schoenberg's result: exact Euclidean distance
If B is positive semi-definite then D can be seen as a distancebetween points in a Euclidean space.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Reverse engineering an Euclidean embedding
We can go backwards from a matrix D to X by taking theeigendecomposition of B = −1
2HD(2)H in much the same waythat PCA provides the best rank r approximation for data bytaking the singular value decomposition of X, or theeigendecomposition of XX′.
X(r) = US(r)V′ with S(r) =
s1 0 0 0 ...0 s2 0 0 ...0 0 ... ... ...0 0 ... sr ...... ... ... 0 0
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Multidimensional Scaling (MDS) also called PCoA
Simple classical multidimensional scaling.▶ Square D elementwise D(2) = D2.▶ Compute −1
2 HD2H = B.▶ Diagonalize B to find the principal coordinates SV′.▶ Choose a number of dimensions by inspecting the
eigenvalue's screeplot.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Malaria Data as seen using ape
Pre1
Pme2
Plo6
Pga11
Pma3
Pbe5
Pfr7
Pkn8
Pcy9
Pvi10
Pfa4
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Bootstrap of Malaria Data
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
0 20000 40000 60000 80000 120000
Eigenvalues of MDS for bootstrapped trees
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
−40 −20 0 20 40
−60
−40
−20
020
40
Bootstrapped trees
o1
2
3
4
5
6
7
8
9
1011
12
1314
15
1617
18
19
20
21
22
23
24
25 26
2728
29
30
31
32
33
34
35
36
373839
40
41
42
43
4445
46
47 4849
50
51
52
53
54
55
5657
5859
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
7576
77
78
7980
81
8283
84
85
86
8788
89
90
91
92
93
94
95
9697
98
99
100o
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Data Analysis: Geometrical Approach
i. The data are p variables measured on n observations.ii. X with n rows (the observations) and p columns (the
variables).iii. Dn is an n× n matrix of weights on the ``observations'',
which is most often diagonal.iv Symmetric definite positive matrix Q,often
Q =
1σ21
0 0 0 ...
0 1σ22
0 0 ...
0 0 1σ23
0 ...
... ... ... 0 1σ2
p
.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Euclidean Spaces
These three matrices form the essential ``triplet" (X,Q,D)defining a multivariate data analysis.Q and D define geometries or inner products in Rp and Rn,respectively, through
xtQy =< x, y >Q x, y ∈ Rp
xtDy =< x, y >D x, y ∈ Rn.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
A Commutative Diagram Approach
Statisticians search for approximations with certainproperties, for the case of PCA for instance, we rephrase theproblem as follows:
▶ Q can be seen as a linear function from Rp toRp∗ = L(Rp), the space of scalar linear functions on Rp.
▶ D can be seen as a linear function from Rn toRn∗ = L(Rn).
▶Rp∗ −−−−→
XRn
Qx yV D
y xW
Rp ←−−−−Xt
Rn∗
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Properties of the Diagram
Rank of the diagram: X,Xt,VQ and WD all have the samerank.For Q and D symmetric matrices, VQ and WD arediagonalisable and have the same eigenvalues.
λ1 ≥ λ2 ≥ λ3 ≥ . . . ≥ λr ≥ 0 ≥ · · · ≥ 0.
Eigendecomposition of the diagram: VQ is Q symmetric, thuswe can find Z such that
VQZ = ZΛ,ZtQZ = Ip, where Λ = diag(λ1, λ2, . . . , λp). (1)
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Comparing Two Diagrams: the RV coefficientMany problems can be rephrased in terms of comparison oftwo ``duality diagrams" or put more simply, two characterizingoperators, built from two ``triplets", usually with one of thetriplets being a response or having constraints imposed on it.Most often what is done is to compare two such diagrams,and try to get one to match the other in some optimal way.To compare two symmetric operators, there is either a vectorcovariance as inner productcovV(O1,O2) = Tr(Ot
1O2) =< O1,O2 > or a vector correlation(Escoufier, 1977)
RV(O1,O2) =Tr(Ot
1O2)√Tr(Ot
1O1)tr(Ot2O2)
.
If we were to compare the two triplets(Xn×1, 1,
1nIn)
and(Yn×1, 1,
1nIn)
we would have RV = ρ2.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
PCA: Approximating one diagram by another
PCA can be seen as finding the matrix Y which maximizes theRV coefficient between characterizing operators, that is,between
(Xn×p,Q,D
)and
(Yn×q, I,D
), under the constraint
that Y be of rank q < p .
RV(XQXtD,YYtD
)=
Tr(XQXtDYYtD
)√Tr (XQXtD)2 Tr (YYtD)2
.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
This maximum is attained where Y is chosen as the first qeigenvectors of XQXtD normed so that YtDY = Λq. Themaximum RV is
RVmax =
∑qi=1 λ
2i∑p
i=1 λ2i.
Of course, classical PCA has D = 1nI, Q = I, but the extra
flexibility is often useful. We define the distance betweentriplets (X,Q,D) and (Z,Q,M) where Z is also n× p, as thedistance deduced from the RV inner product betweenoperators XQXtD and ZMZtD.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Part III
Combine and Compare Trees,Graphs and Contingent Data
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Find sub graphs which are well explained by variables
A subset of the skeleton gene interaction network. −→most perturbed subgraph
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Find sub graphs which are well explained by variables.
A subgraph related to a chemokine pathway. (GXNA,Serban et al, 2005, Bioinformatics). ..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Discriminant Analysis as a duality diagram
Let A be the g× p matrix of group means in each of the pvariables. This satisfies
YtDX = ∆YA where ∆Y = YtDY = diag(w1,w2, . . . ,wg),
and wk =∑
i:yik=1 di, the wk's are the group weights, as theyare the sums of the weights as defined by D for all theelements in that group.Call T the matrix T = XtDX, in the standard case with alldiagonal elements of D equal to 1
n this is just the standardvariance-covariance, otherwise it is a generalization thereof.The generalized between group variance-covariance isB = At∆YA and call the between group variance covariancethe matrix W = (X− YA)tD(X− YA).
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Proposition 1
(a generalized Huyghens' formula):
T = B + W
Proof: Expanding W gives
W = XtDX− XtDYA− AtYtDX + AtYtDYA= T− A′∆YA− A′∆YA + A′∆YA = T− B
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Duality Diagram for LDA
The duality diagram for linear discriminant analysis is
Rp∗ −−→A
Rg
T−1
x yB ∆Y
y xAT−1At
Rp ←−−At
Rg∗
.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
This corresponds to the triple (A,T−1,∆Y), because
(XtDY)∆−1Y (YtDX) = At∆YA
and gives equivalent results to the triple (YtDX,T−1,∆−1Y ).
The discriminating variables are the eigenvectors of theoperator
At∆YAT−1.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Explaining one diagram by another
Principal Component Analysis with respect to InstrumentalVariables was a technique developed by C.R. Rao,(1964) tofind the best set of coefficients in the multivariate regressionsetting where the response is multivariate, given by a matrixY. In terms of diagrams and RV coefficients, this problem canbe rephrased as that of finding M to associate to X so that(X,M,D) is as close as possible to (Y,Q,D) in the RV sense.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
The answer is provided by defining M such that
YQYtD = λXMXtD.
If this is possible then the two eigendecompositions of thetriple give the same answers. We simplify notation by thefollowing abbreviations:
XtDX = Sxx YtDY = Syy XtDY = Sxy and R = S−1xx SxyQSyxS−1
xx .
Then
∥ YQYtD−XMXtD ∥2=∥ YQYtD−XRXtD ∥2 + ∥ XRXtD−XMXtD ∥2 .
The first term on the right hand side does not depend on M,and the second term will be zero for the choice M = R.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
If we add the extra constraint that we only allow ourselvesa rank q approximation, with q < min (rank (X), rank (Y)), theoptimal choice of a positive definite matrix M is to takeM = RBBtR where the columns of B are the eigenvectors ofXtDXR with:
B =
(1√λ1
β1, . . .1√λq
βq
)such that
XtDXRβk = λkβk
βtkRβk = λk, k = 1 . . . q
λ1 > λ2 > . . . > λq
The PCA with regards to instrumental variables of rank q isequivalent to the PCA of rank q of the triple (X,R,D) where
R = S−1xx SxyQSyxS−1
xx .
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Canonical Correlation Analysis
Canonical correlation analysis was introduced by Hotelling[?]to find the common structure in two sets of variables X1 andX2 measured on the same observations. MaximizingCovariance
cov(x, y) = 1
n∑
i(xi − x)(yi − y)
(v(x) cov(x, y)
cov(y, x) v(y)
)
cor(x, y) = cov(x, y)√v(x)
√v(y)
=< x, y >
√< x, x >
√< y, y >
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
One Diagram to replace Two Diagrams
Canonical correlation analysis was introduced by Hotelling tofind the common structure in two sets of variables X1 and X2
measured on the same observations. This is equivalent tomerging the two matrices columnwise to form a large matrixwith n rows and p1 + p2 columns and taking as the weightingof the variables the matrix defined by the two diagonalblocks (Xt
1DX1)−1 and (Xt
2DX2)−1
Q =
(Xt
1DX1)−1 0
0 (Xt2DX2)
−1
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
.
Rp1∗ −−−−→X1
Rn
Ip1
x yV1 Dy xW1
Rp1 ←−−−−Xt1
Rn∗
Rp2∗ −−−−→X2
Rn
Ip2
x yV2 Dy xW2
Rp2 ←−−−−Xt2
Rn∗
Rp1+p2∗ −−−−→[X1;X2]
Rn
Qx yV D
y xW
Rp1+p2 ←−−−−−[X1;X2]t
Rn∗
This analysis gives the same eigenvectors as the analysis ofthe triple (Xt
2DX1, (Xt1DX1)
−1, (Xt2DX2)
−1), also known as thecanonical correlation analysis of X1 and X2...... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Incorporate taxa abundances and phylogenetic tree
Epulopiscium
Clostridium
Adlercreutzia
Lachnospira
Alistipes
Roseburia
Coprococcus
Clostridium
Blautia
Coprococcus
Dehalobacterium
Clostridium
Clostridium
Clostridium
Coprobacillus
Coprococcus
Clostridium
Clostridium
Moryella
Abundance
1
25
625
Class
Actinobacteria (class)
Bacilli
Bacteroidia
Clostridia
Erysipelotrichi
Gammaproteobacteria
Mollicutes
Verrucomicrobiae
YS2
Duality diagram methods that can use any dependencystructure. ..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Rao's Distance
We start with a distance between samples.The heterogeneity of a population (Hi ) is the averagedistance between members of that population.The heterogeneity between two populations (Hij) is theaverage distance between a member of population i and amember of population j.The distance between two populations is
Dij = Hij −1
2(Hi + Hj)
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Decomposition of Diversity
If we have populations 1, . . . , k with frequencies π1, . . . , πk,then the diversity of all the populations together is
H0 =
k∑i=1
πiHi +∑
i
∑j
πiπjDij = H(w) + D(b)
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Double Principal Component AnalysisPavoine et al. 2004 developed a method known as DPCoAimplemented in ade4.Suppose we have n species in p samples/locations and amatrix ∆ giving the squares of the pairwise distancesbetween the species. (Think distance along the tree of thespecies) Then we can
▶ Use the distances between species to find an embeddingin n− 1 -dimensional space such that the euclideandistances between the species is the same as thedistances between the species defined in ∆.
▶ Place each of the p samples/locations at the barycenterof its species profile. The euclidean distances betweenthe samples/locations will be the same as the squareroot of the Rao dissimilarity between them.
▶ Use PCA to find a lower-dimensional representation ofthe locations.
Give the species and communities coordinates such that theinertia decomposes the same way the diversity does...... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Double Principal Coordinate Analysis
Takes into account both the tree distances and theabundances.Pavoine, Dufour and Chessel (2004), Purdom (2010) andFukuyama et al. (2011).Takes the tree into account in computation of principalcoordinates.baselineDPCoA=ordinate(phy_pifn_glom, "DPCoA")p=plot_ordination(phy_pifn_glom, baselineDPCoA,type="taxa",color="Class")p+scale_colour_brewer(type = "qual", palette = "Set1")+geom_point(size=7,alpha=0.2)p=plot_ordination(phy_pifn_glom, baselineDPCoA, type="samples", color = "mousenames")+ geom_point(size = 1)+geom_text(aes(label=matn[-which(matn[,2]=="154"),2]),size=8)
p+scale_colour_hue(guide="none")baselineDPCoA$eig[1]/sum(baselineDPCoA$eig)0.88baselineDPCoA$eig[2]/sum(baselineDPCoA$eig)0.067sum(baselineDPCoA$eig[1:2])/sum(baselineDPCoA$eig)0.95
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
161
161
165165
174
174
175175175
33
33
33
591591 71
71 7171
73
7373
73
91
91
92
92
94
94
152
152
153
153
162
162
163
163
164
164
172
172173
173
34
34
3435
35
35
41
41
43
43
592592
593593
7474
74
74
82
82
8383
85
85
95
95
151
171
1713131
3232
32
42
4445
45
727272 7275
757575
8181
8484
93
93
−0.1
0.0
0.1
−0.6 −0.3 0.0 0.3Axis1
Axi
s2
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
−0.2
−0.1
0.0
0.1
0.2
−0.5 0.0 0.5CS1
CS
2
Class
Actinobacteria (class)
Bacilli
Bacteroidia
Clostridia
Erysipelotrichi
Gammaproteobacteria
Mollicutes
Verrucomicrobiae
YS2
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Comparing Graphs and Covariates
Example: Co-occurrence graph network.With the enterotype data we show the results of themakenetwork() function igraph package to create agraphical display of the ``connectedness'' of samples accordingto the profile of taxa that they share.
> data(enterotype)> entc <- subset_samples(enterotype, !is.na(ClinicalStatus))> ig <- newnet(entc, TRUE, 0.02, 0.4)> p <- gggraph(ig)# Make a new plotting data.frame from the ggplot-plot> DF <- sampleData(entc)[ig[[9]][[3]][["name"]], ]> DF <- data.frame( p$data, DF[p$data$value, ] )> p <- p + geom_point(aes(x, y, color=ClinicalStatus,
shape=Enterotype), data=DF, size=5, alpha=1)> print(p)
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Network analysis plot produced by phyloseq'smakenetwork() wrapper on the enterotype dataset
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Bacteria `sharing' between mice
Using the Jaccard index that measures the co-incidence orco-occurrence of species between mice.
Jaccard Similarity =f11
f01 + f10 + f11
Jaccard Disimilarity =f01 + f10
f01 + f10 + f11mouse10 0 0 1 0 1 0 1 0 0 0 0 0 0 1mouse41 0 0 0 0 0 0 0 0 0 0 0 0 0 1vegdist(rbind(mouse1,mouse4),method="jaccard")0.8
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Bacteria `sharing' between mice as a network
netbaseline=make_network(phy_pifn_glom)p=plot_network(netbaseline,phy_pifn_glom,color="mousenames",label="mousenames",point_size=7)+geom_text(aes(label=mousenames),size=7)p+scale_colour_hue(guide="none")
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
161
161
165165
174
174
175
175
175
33
33591
71
71
71
73
73
73
73
91
91
92
152
152153
153
163
163
164
172
172
173
34
34
34
35
41
592592
593
74
74
74
74
82
83
83
85
171
31
32
32
32
45
72
72
72
72
7575
75
75
81
81
84
84
93
93
161
161
165 165
174174
175
175175
3333591
71
7171
73
73
73
73
91 91
92
152152 153153
163
163
164
172
172
173
34
34
34
35
41
592592
59374
74
74
74
82
8383
85
171
31
3232
32
45
72
72
72727575
75
75
81
81
8484
9393
163
164
165
171
172
173
174
175
31
32
33
34
35
41
45
591
592
593
71
72
73
74
75
81
82
83
84
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Does the network relate to `communities'?
Friedman and Rafsky (1979) devised a nonparametric test formultivariate data using the minimum spanning tree with anymetric.Then compute the number of `pure' edging connecting labelsfrom the same groups compared to the mixed edgesconnecting labels from different groups, call Fo the observedstatistic.In our example: Fo = 82Keeping the graph fixed, permute the labels and recomputethe number of pure edges.All 1000 simulated values had Fs < 82 so p < 0.001.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Co-occurrence networks for taxa of the baselinemice
p=plot_network(netbasetaxa,phy_pifn_glom,color="Family",type="taxa",label=NULL)p+geom_text(aes(label=Class),size=3)
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Mollicutes
Bacteroidia
Bacteroidia
BacteroidiaBacteroidia
Bacteroidia
Bacteroidia
Bacteroidia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Clostridia
Family
Anaeroplasmataceae
Catabacteriaceae
Clostridiales Family XIII. Incertae Sedis
Lachnospiraceae
Ruminococcaceae
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Changes of the network over time?
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Differences between two graphs?
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Bootstrap for dependency neighborhoods
Given a graph, we want to create a new one by bootstrappingand being able to compare the bootstrapped graphs with ouractual ones.We want a ``Block-Bootstrap'' for graphs.Following Holmes and Reinert, 2004 we use the notion ofneighborhood of dependency.This is what inspires our sampling approach.To create the neighborhood of dependencies,we take a smallrandom walk starting at a node (say 5 steps) and all nodesreached by the walk are sampled.This is an ego-centric sampling scheme. Take the subset ofnodes that have been walked upon and combine them into anew graph.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Part IV
Combining Many Data Tables
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Comparative Systems
functionalSamples tim
e
functionalSamples tim
e
geneticSamples tim
e
geneticSamples tim
e
taxonomicSamples tim
e
Samples tim
e
Uncontaminated (eg Sample A) Contaminated (eg Sample B)
taxonomic
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Multi-table methods
Inertia, Co-InertiaWe generalize it in several directions through the idea ofinertia.As in physics, we define inertia as a weighted sum ofdistances of weighted points.This enables us to use abundance data in a contingency tableand compute its inertia which in this case will be theweighted sum of the squares of distances between observedand expected frequencies, such as is used in computing thechisquare statistic.Another generalization of variance-inertia is the usefulPhylogenetic diversity index. (computing the sum of distancesbetween a subset of taxa through the tree).We also have such generalizations that cover variability ofpoints on a graph taken from standard spatial statistics.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Co-Inertia
When studying two variables measured at the same locations,for instance PH and humidity the standard quantification ofcovariation is the covariance.
sum(x1 ∗ y1 + x2 ∗ y2 + x3 ∗ y3)
if x and y co-vary -in the same direction this will be big.A simple generalization to this when the variability is morecomplicated to measure as above is done through Co-Inertiaanalysis (CIA).Co-inertia analysis (CIA) is a multivariate method thatidentifies trends or co-relationships in multiple datasets whichcontain the same samples or the same time points. That isthe rows or columns of the matrix have to be weightedsimilarly and thus must be matchable.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
RV coefficient
The global measure of similarity of two data tables asopposed to two vectors can be done by a generalization ofcovariance provided by an inner product between tables thatgives the RV coefficient, a number between 0 and 1, like acorrelation coefficient, but for tables.
RV(A,B) = Tr(A′B)√Tr(A′A)
√Tr(B′B)
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
ExampleCurrent work with Joey McMurdie, Les Dethfelsen, DavidRelman, Angela Marcobal, Justin Sonnenburg. Data:
Taxa Read counts (3 patients taking cipro: two timecourses) : .
Mass-Spec Positive and Negative ion Mass Spec featuresand their intensities: .
RNA-seq Metagenomic data on genes :.
Here is the RV table of the three array types:
> fourtable$RVTaxa Kegg MassSpec+ MassSpec-
Taxa 1 0.565 0.561 0.670Kegg 0.565 1 0.686 0.644MassSpec+ 0.561 0.686 1 0.568MassSpec- 0.670 0.644 0.568 1
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Multiple table methods
In PCA we compute the variance-covariance matrix, inmultiple table methods we can take a cube of tables andcompute the RV coefficient of their characterizing operators.We then diagonalize this and find the best weighted`ensemble'.This is called the `compromise' and all the individual tablescan be projected onto it. file:///Users/susan/MultiTable/ReprodResearch/statis-demo.html
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Distances and Inertia
Variance is at the heart of ANOVA: decomposition ofvariability into parts that can be explained by a factor and aresidual part.
Test statistic:average ss explained by factoraverage of residual ss
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Features
1. Inertia : Trace(VQ) = Trace(WD)(inertia in the sense of Huyghens inertia formula forinstance). Huygens, C. (1657),
n∑i=1
pid2(xi, a)
Inertia with regards to a point a of a cloud of pi-weightedpoints.PCA with Q = Ip, D = 1
nIn, and the variables are centered,the inertia is the sum of the variances of all the variables.If the variables are standardized (Q is the diagonal matrix ofinverse variances), then the inertia is the number ofvariables p.For correspondence analysis the inertia is the Chi-squaredstatistic.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Goals already attained:
▶ Mathematical Generalizations using diagram andmultilinear approaches (cubes with statis).
▶ Compute distances between general trees distory anduse this in MDS.
▶ Data integration phyloseq, networks, trees, abundancetables.
▶ Threshold, sensitivity tests and modeling simulations.▶ Reproducibility: open source standards, publication of
source code and data. (R).Questions:More methods based on approximate diagrams (expectationsas in Stein's method).We know the distance between hierarchical clustering trees,what is the correct distance between barcodes?
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Benefitting from the tools and schools ofStatisticians.......
Thanks to the R community: Chessel, Jombart, Dray,Thioulouse ade4 and Emmanuel Paradis ape.
Collaborators: David Relman, Alfred Spormann, LesDethfelsen, Justin Sonnenburg, Persi Diaconis, ElisabethPurdom.
Postdoctoral Fellows Paul (Joey) McMurdie.Angela Marcobal, Serban Nacu.Students: Alden Timme, Katie Shelef, Yana Hoy, JohnChakerian, Julia Fukuyama, Kris Sankaran.Funding from NIH/ NIGMS R01, NSF-VIGRE and NSF-DMS.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
phyloseq
Joey McMurdie (joey711 on github).Available in Bioconductor.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
L. Billera, S. Holmes, and K. Vogtmann.The geometry of tree space.Adv. Appl. Maths, 771--801, 2001.
J. Chakerian and S. Holmes.distory:Distances between trees, 2010.
Daniel Chessel, Anne Dufour, and Jean Thioulouse.The ade4 package - i: One-table methods.R News, 4(1):5--10, 2004.
P. Diaconis, S. Goel, and S. Holmes.Horseshoes in multidimensional scaling and kernelmethods.Annals of Applied Statistics, 2007.
Y. Escoufier.Operators related to a data matrix.In J.R. et al. Barra, editor, Recent developments inStatistics., pages 125--131. North Holland,, 1977.
Jerome H Friedman and Lawrence C Rafsky...... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Multivariate generalizations of the Wald-Wolfowitz andSmirnov two-sample tests.The Annals of Statistics, pages 697--717, 1979.
Susan Holmes.Multivariate analysis: The French way.In D. Nolan and T. P. Speed, editors, Probability andStatistics: Essays in Honor of David A. Freedman,volume 56 of IMS Lecture Notes--Monograph Series.IMS, Beachwood, OH, 2006.
Purna C Kashyap, Angela Marcobal, Luke K Ursell,Samuel A Smits, Erica D Sonnenburg, Elizabeth K Costello,Steven K Higginbottom, Steven E Domino, Susan P Holmes,David A Relman, et al.Genetically dictated change in host mucus carbohydratelandscape exerts a diet-dependent effect on the gutmicrobiota.Proceedings of the National Academy of Sciences, page201306070, 2013.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
S.W. Kembel, P.D. Cowan, M.R. Helmus, W.K. Cornwell,H. Morlon, D.D. Ackerly, S.P. Blomberg, and C.O. Webb.Picante: R tools for integrating phylogenies and ecology.Bioinformatics, 26(11):1463--1464, 2010.
K. Mardia, J. Kent, and J. Bibby.Multiariate Analysis.Academic Press, NY., 1979.
P. J. McMurdie and S. Holmes.Phyloseq: A bioconductor package for handling andanalysis of high-throughput phylogenetic sequence data.
P. J. McMurdie and S. Holmes.Phyloseq: Reproduible research platform for bacterialcensus data.PlosONE, 2013.April 22,.
Serban Nacu, Rebecca Critchley-Thorne, Peter Lee, andSusan Holmes.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Gene expression network analysis and applications toimmunology.Bioinformatics, 23(7):850--8, Apr 2007.
Jari Oksanen, F. Guillaume Blanchet, Roeland Kindt, PierreLegendre, R. G. O'Hara, Gavin L. Simpson, Peter Solymos,M. Henry H. Stevens, and Helene Wagner.vegan:Community Ecology Package, 2010.
E. Paradis, J. Claude, and K. Strimmer.Ape: analyses of phylogenetics and evolution in Rlanguage.Bioinformatics, 20:289--290, 2004.
Sandrine Pavoine, Anne-Béatrice Dufour, and DanielChessel.From dissimilarities among species to dissimilarities amongcommunities: a double principal coordinate analysis.Journal of Theoretical Biology, 228(4):523--537, 2004.
Katherine S. Pollard, Houston N. Gilbert, Yongchao Ge,Sandra Taylor, and Sandrine Dudoit...... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
multtest: Resampling-based multiple hypothesis testing,2010.R package version 2.4.0.
C. R. Rao.The use and interpretation of principal componentanalysis in applied research.Sankhya A, 26:329--359., 1964.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Distances between Trees
▶ Robinson and Foulds, (bipartitions).▶ Nearest Neighbor Interchange (NNI).
Rotation Moves
4
0
2 31
0
4321
0
41 32
▶ Subtree Prune Rebranch. (SPR)▶ Fill-in of NNI moves: Billera, Holmes, Vogtmann (BHV).
The boundaries between regions represent an area ofuncertainty about the exact branching order. Inbiological terminology this is called an `unresolved' tree.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Distances between Trees
▶ Robinson and Foulds, (bipartitions).▶ Nearest Neighbor Interchange (NNI). Rotation Moves
4
0
2 31
0
4321
0
41 32
▶ Subtree Prune Rebranch. (SPR)▶ Fill-in of NNI moves: Billera, Holmes, Vogtmann (BHV).
The boundaries between regions represent an area ofuncertainty about the exact branching order. Inbiological terminology this is called an `unresolved' tree.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Distances between Trees
▶ Robinson and Foulds, (bipartitions).▶ Nearest Neighbor Interchange (NNI). Rotation Moves
4
0
2 31
0
4321
0
41 32
▶ Subtree Prune Rebranch. (SPR)▶ Fill-in of NNI moves: Billera, Holmes, Vogtmann (BHV).
The boundaries between regions represent an area ofuncertainty about the exact branching order. Inbiological terminology this is called an `unresolved' tree.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Distances between Trees
▶ Robinson and Foulds, (bipartitions).▶ Nearest Neighbor Interchange (NNI). Rotation Moves
4
0
2 31
0
4321
0
41 32
▶ Subtree Prune Rebranch. (SPR)▶ Fill-in of NNI moves: Billera, Holmes, Vogtmann (BHV).
The boundaries between regions represent an area ofuncertainty about the exact branching order. Inbiological terminology this is called an `unresolved' tree.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Distances between Trees
▶ Robinson and Foulds, (bipartitions).▶ Nearest Neighbor Interchange (NNI). Rotation Moves
4
0
2 31
0
4321
0
41 32
▶ Subtree Prune Rebranch. (SPR)▶ Fill-in of NNI moves: Billera, Holmes, Vogtmann (BHV).
The boundaries between regions represent an area ofuncertainty about the exact branching order. Inbiological terminology this is called an `unresolved' tree.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Boundary for trees with 3 leaves
1 2 3
1 2 3
3 2 1
3 1 2
0
0
0
0
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
The quadrant for one tree
(0,0) (1,0)
(0,1) (1,1)
0
1 2 3 4
11
{{
1 2 3 4
0
21 3 4
0
1 2 3 4
{1
0
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
The quadrant for one tree
(0,0) (1,0)
(0,1) (1,1)
0
1 2 3 4
11
{{
1 2 3 4
0
21 3 4
0
1 2 3 4
{1
0
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
The quadrant for one tree
(0,0) (1,0)
(0,1) (1,1)
0
1 2 3 4
11
{{
1 2 3 4
0
21 3 4
0
1 2 3 4
{1
0
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
The quadrant for one tree
(0,0) (1,0)
(0,1) (1,1)
0
1 2 3 4
11
{{
1 2 3 4
0
21 3 4
0
1 2 3 4
{1
0
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Link of the originAll 15 quadrants for n = 4 share the same origin. If we takethe diagonal line segment x + y = 1 in each quadrant, weobtain a graph with an edge for each quadrant and atrivalent vertex for each boundary ray; this graph is calledthe link of the origin.
1 42 3
0
1 42 3
0
(1,0)
(0,1)
x+y=1
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Cube complex of Euclidean Orthants
A path between two treesconsists of line segments through a sequence of orthants.This sequence of orthants is the path.A path is a geodesic when it has the smallest length of allpaths between two points.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Cube complex of Euclidean Orthants
A path between two treesconsists of line segments through a sequence of orthants.This sequence of orthants is the path.A path is a geodesic when it has the smallest length of allpaths between two points.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
A Cone Path
A path between two trees T and T′ always exists. Since allorthants connect at the origin, any two trees T and T′ canbe connected by a two-segment path, this is called thecone-path.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Three orthants sharing a common boundary for n = 4 leaves.
1 42 3
0
1 42 3
0
1 42 3
0
1 43 2
0
2 43 1
0
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
c
a
c
b ba
Tree space with BHV metric is a CAT(0) space, that is, it hasnon-positive curvature. This implies there are geodesicbetween any two trees (Gromov).It is also not an Euclidean space.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Euclidean space (where through every point not on a line) isflat:(sum of angles of a triangle is 180 o),
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Hyperbolic space is ` negatively' curved:
Euclid's parallel postulate is replaced.In hyperbolic geometry there are at least two distinct linesthrough P which do not intersect l, so the parallel postulateis false.A characteristic property of hyperbolic geometry is that theangles of a triangle add to less than 180 o.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
c
a
c
b ba
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Geodesic metric space:If we have a distance defined between any two points of aspace, we call it a metric space.(The distance doesn't have to be defined through ordinarycoordinates)A geodesic metric space is a metric space where geodesicsare defined to be the shortest path between points in thespace.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
δ-hyperbolic space is a geodesic metric space in which everygeodesic triangle is δ-thin.δ-thin: pick three points and draw geodesic lines betweenthem to make a geodesic triangle. Then any point on any ofthe edges of the triangle is within a distance of δ from oneof the other two sides.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
For example, trees are 0-hyperbolic: a geodesic triangle in atree is just a subtree, so any point on a geodesic triangle isactually on two edges.
Normal Euclidean space is ∞-hyperbolic; i.e. not hyperbolic.Generally, the higher δ has to be, the less curved the spaceis.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Is it better to represent the distances by a tree or aEuclidean projection?
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Tree of Trees
A tree is a complete CAT(0) space.
c
a
c
b baa
c
b
Since BHV,2001 [1] have shown that the space of trees isnegatively curved (a CAT(0) space), the most naturalrepresentation of a collection of trees may be a tree.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Some Advantages of R
Cutting edge solutions Availability of > 4000 packages, allstatistical/machine learning methods available.Robust methods, sparse versions of all of theprojection methods.
Data Input and cleaning Multiple formats, (Newick, RNA-seq,Qiime, HTS, Micro-array, Mass-Spec, Genome,...).
Visualization and Presentation High quality graphics:ggplot2.
Reproducibilty Sweave, history, ...Open Source Community of users, documentation.Simulation Random number generators.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Some Difficulties in R
Main Drawback By default works with data in RAM.Another Problem Heterogeneity: Data, Packages,
Documentation.Open Source Community of users, documentation.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Best Platform for Designing a Statistical Analysis
▶ Comparing sophisticated methods.Almost 5,000 packages available.
▶ Compare 20 methods on a new type of data in a week.▶ High quality visualization.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Seamless Interaction with standardized Databases
▶ Through Bioconductor package AnnotationDbi wehave access to GenBank, the Gene OntologyConsortium, Entrez genes, UniGene, the UCSCHuman Genome Project KEGG
▶ metlin ChemmineR , rcdklibs, rpubchem
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Packages Tailored to specific formats
.
▶ Maps.▶ Trees, Networks and Graphs▶ Microarrays, Mass Spectroscopy,▶ Texts.▶ Images.▶ DNA, AA, Sequencing Data, HTS, RNA-seq.▶ Qiime / mothur formats.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Plotting
.
High powered graphics packages with layered functionalities:▶ ade4 .▶ Philosophy: Leland Wilkinson's Grammar of Graphics.▶ Implementation :ggplot2
(Hadley Wickham youtube video)▶ Animated gifs, interactive graphics.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
One Diagram to replace Two Diagrams
Canonical correlation analysis was introduced by Hotelling tofind the common structure in two sets of variables X1 and X2
measured on the same observations. This is equivalent tomerging the two matrices columnwise to form a large matrixwith n rows and p1 + p2 columns and taking as the weightingof the variables the matrix defined by the two diagonalblocks (Xt
1DX1)−1 and (Xt
2DX2)−1
Q =
(Xt
1DX1)−1 0
0 (Xt2DX2)
−1
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
.
Rp1∗ −−−−→X1
Rn
Ip1
x yV1 Dy xW1
Rp1 ←−−−−Xt1
Rn∗
Rp2∗ −−−−→X2
Rn
Ip2
x yV2 Dy xW2
Rp2 ←−−−−Xt2
Rn∗
Rp1+p2∗ −−−−→[X1;X2]
Rn
Qx yV D
y xW
Rp1+p2 ←−−−−−[X1;X2]t
Rn∗
This analysis gives the same eigenvectors as the analysis ofthe triple (Xt
2DX1, (Xt1DX1)
−1, (Xt2DX2)
−1), also known as the..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
canonical correlation analysis of X1 and X2.
Part VI
Confirmatory Analysis:Nonparametric tests
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Nonparametric Tests
▶ Canonical Correspondence Analysis tests on a factor.▶ Mantel's Test between distance matrices▶ Multiple testing correction.▶ Bootstrap tests.
Everything is done by shuffling labels
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Example:Is there a shedding effect in Relman/Hoy Mice data?> shed = scan("shed.txt")> shedf = as.factor(shed)[-(70:71)]> resca.shed = cca(t(pib.nz) ~ shedf)> anova(resca.shed)Permutation test for cca under reduced model
Model: cca(formula = t(pib.nz) ~ shedf)Df Chisq F N.Perm Pr(>F)
Model 6 0.3825 2.0627 199 0.005 **Residual 87 2.6886---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Example: Is there a subject effect in Katie's data?subjcc = vegan::cca(tnorepnz ~ subject)anova(subjcc)
Permutation test for cca under reduced model
Model: cca(formula = tnorepnz ~ subject)Df Chisq F N.Perm Pr(>F)
Model 7 1.2177 10.7999 199 0.005 **Residual 472 7.6027---Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
> cca.cage = vegan::cca(t(tcmall) ~ cagef)> plot(cca.cage)> text(cca.cage, choices = c(1, 2), label = cagef, display = "lc")> anova(cca.cage)Permutation test for cca under reduced model
Model: cca(formula = t(tcmall) ~ cagef)Df Chisq F N.Perm Pr(>F)
Model 2 0.3722 7.6501 199 0.005 **Residual 178 4.3305---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
−4 −2 0 2 4
−2
02
4
CCA1
CC
A2
++
++ +
++
++
++++ +
++
+
+++ +
++
+
+
+ +
+
++
+
+
++
+ +
++
+
++
++
++
+
+
+
+
+
+
+
++
+
+
+
++
++++
+
+
+
+++
+
+
+++
+
++
++
++++
+
+
+
++
++
+
+
++
++
++
+
++
+++
+
+
++
+++
+
+
+
+
+
+
+++
++ ++
+
+ ++
+
++
+
++ ++
++
+
++ +
++
++
+
++
+ +++
++
+ +
++
+
+
++
+ ++
+
+
+
+++ +
++
++
+
+
++
++ +
+
+
+
+
+
++
++
+
+
++
+
+
++
+++
+++
++
++
++ +
++
+ + ++
+
+
+
+
+
+
+++ +
+
++
++++ +
+
+
++
+
+
+
+++
++
+
+
++
++
+
+
+
+++
+++
++++ ++
+
+
++ +
+
++
+++
++
+
++
++
+ +
++
++
+
+ ++
++
+
+
++
+
+
●
● ●
●
●
●
●
●
● ●
●
●●●
●
●
●
●
●●
●●
●●
●
●●●
●
● ●
●
●●●
●
●
●
●
●●
●
●●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●●●
●
●●
●●
●
●
●
●
●●
●
●
●
● ●●●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●●
●
●
●●●
●
●
●
●
●
●
●
●●●
● ●
●
●
●
● ●
●
●
●
●
●●
●
●●
●●
●●
●●●
●●
●
●
●
●● ●●
●
●
●
●
●●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
x
x
x
1515151515151515151515
777777777777777777777777
4444444444444444444444444444
1515151515151515151515151515
4444444444444444444444444444
7777777777777777777777777777777777777777777777777777777777777777777777777777
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
> plot(cca.cage, scaling = 1)> text(cca.cage, scaling = 1, choices = c(1, 2), display = "sp",+ cex = 0.6)> title("Species, Cage effect, 1,2")
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
−4 −2 0 2 4
−2
02
4
CCA1
CC
A2
+
+
++ +
+
+
+
+
+
+
++
+
+
+
+
++
++
+
+
+
+
++
+
++
+
+
+
+
++
+
+
+
+
+
+
+
++
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+++
+
+
+
+
+
++
+
+
+
++
+
++
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
++
++
+
+
+
+
+
++
+
+
+
+
+
+
+
+
+
+
++
+
+
+
++
+
+
+
+
+
+++
+
+
+
+
+
++
++
+
+
+
+
+
+ ++
+
+
+
+ +
+
+
+
+
+
+
++
+
+
+
+
++
+ +
+
+
+
+
+
+
+
+
++
+
+
+
+
+
+
++
+
+
+
+
+
+
+
+
+
+
+
+
+
++
+
+
+
+
+
++ +
+
+
+ + +
+
+
+
+
+
+
+
+
++
+
+
+
+
+
+
++
+
+
+
++
+
+
+
+
++
+
+
+
+
+
+
+
+
+
+
+
+
++
+
++
+
++
++ +
+
+
+
+
+
+
+
+
++
+
+
+
+
+
+
+
+
+
+
++
+
+
+
+ +
+
+
+
+
+
+
+
+
+
●● ●
●
●●
●●
● ●●
●● ●●●●
●●●
●● ●●●
●●●●● ●
●●●●
●●
●
●
●●●
●●●
●
●
●
●●
●●
●●
●●
●
●
●
●●●●
●
●●
●●●●
●
●
●●
●●
●
● ●● ●
●
●●●●●
●
●
●●
●
●
●
●
●
●
●
●●●
●● ●●
●
●
●●●
●●
●
●●
●●
●● ●● ●
●
●
●● ●
●●
●
●●●
●
●●
●●●●
●●●
●●
●●●
●●●●●
●●
●
●●●
●
●
●
●●●●
●
●
●
●●
●
●
●●●●●
●●●
x
xx
f__
Lachno
LachnoLachno Lachno
Lachno
Ruminoc
Lachno
Lachno
Lachno
Lachno
LachnoRuminoc
f__
Lachno
Lachno
f__
LachnoLachno
LachnoRuminoc
f__
Lachno
f__
Lachno
Lachnof__
Lachno
LachnoLachno
Lachno
Lachno
Ruminoc
Lachno
RuminocLachno
f__
Ruminoc
Lachno
f__
Lachno
f__
Coriobact
Lachnof__
f__
f__
Lachno
Lachno
Lachno
Lachno
Lachno
Lachno
f__
Lachno
Lachno
Ruminoc
Lachno
Lachno
LachnoLachnoLachno
Lachno
Lachno
Ruminoc
Lachno
Ruminoc
LachnoLachno
Lachno
Lachno
Ruminoc
LachnoLachno
Lachno
Lachno
Lachno
Lachno
Lachno
Lachno
f__
Lachno
Lachno
Lachno
Lachno
Lachno
Lachno
Lachno
Ruminoc
f__
Lachno
Lachno
Lachno
Lachno
f__
Ruminoc
Lachno
Ruminoc
f__
Catabact
f__
AnaeroplCatabact
f__
Lachno
Ruminoc
Lachno
Catabact
f__Ruminoc
Lachno
Lachno
Ruminoc
Dehalobact
f__
Ruminoc
Lachno
Lachno
Lachno
Lachno
Lachnof__
Ruminoc
Ruminoc
Lactobaci
RuminocRuminoc
ClostridI
Lachno
Ruminoc
Lachno
f__
LachnoRuminocRuminoc
Lachno
Lachno
f__
Lachno
Lachno
LachnoRuminoc
LachnoLachno
Lachno
Lachno
Ruminoc
Ruminoc
Ruminoc
Lachno RuminocLachno
Ruminoc
Lachno
Lachno
RuminocLachno
Lachno
Lactobaci
Ruminoc
f__
f__
Erysip
LachnoLachno
Lachno
Lachno
Lachno
Ruminoc
LachnoLachno
RuminocLachno
Lachno
Lachno
Lachno
Lachno
Lachno
Lachno
Lachno
Lachno
RuminocRuminoc
f__
Lachno
Lachno
Ruminoc
Lachno
Lachno
RuminocLachno
Lachno
Lachno
Lachno
Lachno
Lachno
f__
Lachno
Ruminoc
Lachno
Lachno
Lachno
Lachno
Lachno
LachnoLachno
Lachno
Lachno
Lachno
Lachno
Lachno
LachnoEntero Lachno
Lachno
Lachno
Ruminoc Lachno Lachno
Ruminoc
Lachno
Lachno
Lachno
Lachno
Ruminoc
Lachno
Lachno
Lachnof__
Ruminoc
Lachno
Lachno
Lachno
Lachno
Lachno
LachnoLachno
Lachno
Lachno
Lachno
Lachno
Lachno
Lachno
Lachno
Lachno
Lachno
LachnoLachno
Lachno
ClostridI
Leuconost
Lachno
Lachno
Ruminoc
f__
Lachno
Lachno
Lachno
Lachno
Lachno
LachnoLachno
Lachno
LachnoLachno
Lachno
RuminocLachno
Lachno
f__f__
Lachno
Lachno
Lachno
Lachno
Lachno
Ruminoc
Erysip
Lachno
Lachno
Catabact
Lachno
Lachno
Lachno
Ruminoc
Lachno
Lachno
Catabact
ClostridIII
Lachno
Lachno
f__Lachno
Lachno
Ruminoc
f__
Lachno f__
Lachno
Lachno
Erysip
Lachno
Verruc
Lachno
Lachno
Lachno
Lachno
Species, Cage effect, 1,2
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Over-representation of certain phyla
Set Over-represented UniverseMicrobiome Families/Phyla Species Present
Gene Expression Ontological groups Filtered GenesTest in both cases: hypergeometric / Fisher's exact test.We define the set of prefiltered species (species universe) asthose that passed the threshold test of being present (>6000) in at least 31 of the arrays.This method is especially relevant here as the tree does notshow equal representation of different families and phyla.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Hypergeometric Tests for over representation ofcertain phyla)
▶ IBS higher group had significantly more Bacteriodetes▶ overrepresentation of Firmicutes in the healthy controls.▶ At the family level, the results showed that the families
of Oxalobacteraceae, Prevotellaceae, Burkholderiaceae,Sphingobacteriaceae were significantly overrepresentedin IBS.
▶ Conversely, the most significantly enriched family incontrol rats were Lachnospiraceae, includingRuminococcus sp., followed by Erysipelotrichaeceae andClostridiaceae.
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Part VII
RobustnessHigh Breakdown (median,)
Sparse (L1 minimizing)
Nonparameteric (ranks, nmMDS, ..)
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Tumor Cells
0 5000 10000 15000 20000
05000
10000
15000
20000
05e-0
40.0
010 1 1 0 -1 1 0 0 0 -1
0 1 1 0 0 0 0 0 0 1
0 1 -1 0 -1 0 0 0 0 -1
0 1 1 0 0 -1 1 0 1 1
0 1 1 0 0 0 0 1 0 1( (
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Solutions: Ways of Using Trees
Main tool: ape.Example:
> treeall=read.tree("/Users/susan/Dropbox/Yana/gg.tre")> todrop=which(!(treeall$tip.label %in% otu.id))>> todrop=which(!(treeall$tip.label %in% otu.id))> dropthese=treeall$tip.label[todrop]> yana1772tree=drop.tip(treeall,dropthese)>layout(matrix(c(1,1,1,2,2,2,3,4,5,5,5,5),ncol=12,nrow=1))>subtree = drop.tip(tree, which(!tree$tip.label %in% selected))>plot(tree, use.edge.length=F, no.margin=T, edge.color=colo, show.tip.label=F)###Relevant subtree>plot(subtree, edge.color='red', show.tip.label=T, no.margin=T, use.edge.length=F)
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .
Mapping Variables onto Phylogenies
Example using phyloseq package> data(esophagus)> es1 <- esophagus> sn <- sample.names(esophagus)> sampleData(es1) <- sampleData(data.frame(sample=sn, row.names=sn))
We create the tree graphic, grouping/coloring by our dummysample-name variable, and also labelling the number ofindividuals observed in each sample (if at all). The symbolsare slightly enlarged as the number of individuals increases.> plot_tree_phyloseq(es1, color_factor="sample",
type_abundance_value=TRUE, treeTitle="esophagus example data")
..... ..... ..... . ..................... ..................... ..................... ..... ..... . ..... ..... ..... .