0.1cm Statistical challenges and new data analysis methods ...
Transcript of 0.1cm Statistical challenges and new data analysis methods ...
;
Context Data challenges Proposed solutions Multivariate statistical methods Conclusions
Statistical challenges and new dataanalysis methods for microbiome data
integration
Kim-Anh Lê Cao
NHMRC Career Development FellowCentre for Systems Genomics
School of Mathematics and Statistics
The University of MelbourneKim-Anh Lê Cao May 19-20 2017
207 Falk Foundation Gut Microbiome Symposium
Context Data challenges Proposed solutions Multivariate statistical methods Conclusions
Microbiome is a complex organism
Uncover the role of microbiome, aswell as its composition
Technological advances:culture-independent & NGS
Not only about cataloging organisms,but biological functions that affect hostand participate in disease processes
Require advanced bioinformatics andcomputational statistics
Kim-Anh Lê Cao May 19-20 2017
207 Falk Foundation Gut Microbiome Symposium
Context Data challenges Proposed solutions Multivariate statistical methods Conclusions
We need a holistic view
Morgan et al. 2013, Trends in Genetics, 26Kim-Anh Lê Cao May 19-20 2017
207 Falk Foundation Gut Microbiome Symposium
Context Data challenges Proposed solutions Multivariate statistical methods Conclusions
We need cutting-edge statistical methods to study themicrobiome as a whole
Many organisms are co-dependent on each other within anecosystem limited insight when considered individually
Microbiome has a major impact on human health and diseasetreatment outcomes Understand microbiome - host interactions
Multivariate statistical methods for microbiome data analysis:
Consider all bacterial taxa togetherReduce data dimension and enable data visualisation (e.g PCA,PCoA)Enable the integration of multiple sources of biological data
Kim-Anh Lê Cao May 19-20 2017
207 Falk Foundation Gut Microbiome Symposium
Context Data challenges Proposed solutions Multivariate statistical methods Conclusions
We need cutting-edge statistical methods to study themicrobiome as a whole
Many organisms are co-dependent on each other within anecosystem limited insight when considered individually
Microbiome has a major impact on human health and diseasetreatment outcomes Understand microbiome - host interactions
Multivariate statistical methods for microbiome data analysis:
Consider all bacterial taxa togetherReduce data dimension and enable data visualisation (e.g PCA,PCoA)Enable the integration of multiple sources of biological data
Kim-Anh Lê Cao May 19-20 2017
207 Falk Foundation Gut Microbiome Symposium
Context Data challenges Proposed solutions Multivariate statistical methods Conclusions
We need open-source and user-friendly tools
The R toolkit (since 2009)
French’Oz team:4 core, 1 developer, students, collaborators
Today: 17 novel multivariate methods
21K downloads in 2016
14 multi-day workshops since 2014 (FR, AUS, NZ)
Our research program focuses on the development of multivariatestatistical methodologies, their applications in areas informed bybiology, and the training of the new generation of computationalbiologists. www.mixOmics.org
Kim-Anh Lê Cao May 19-20 2017
207 Falk Foundation Gut Microbiome Symposium
Context Data challenges Proposed solutions Multivariate statistical methods Conclusions
Sparse data
Microbial count data
Example of data from 16S rRNA sequencing after taxonomicclassification:
→ we want to identify bacterial taxa that explain differencesbetween sample groups (beta diversity analyses)
Kim-Anh Lê Cao May 19-20 2017
207 Falk Foundation Gut Microbiome Symposium
Context Data challenges Proposed solutions Multivariate statistical methods Conclusions
Sparse data
Data are sparse with skewed distribution
Many microbial organisms are absent from a large number ofsamples
’zero-inflated’ distribution
Kim-Anh Lê Cao May 19-20 2017
207 Falk Foundation Gut Microbiome Symposium
Context Data challenges Proposed solutions Multivariate statistical methods Conclusions
Sparse data
Data are sparse with skewed distribution
Many microbial organisms are absent from a large number ofsamples
’zero-inflated’ distribution
Kim-Anh Lê Cao May 19-20 2017
207 Falk Foundation Gut Microbiome Symposium
Context Data challenges Proposed solutions Multivariate statistical methods Conclusions
Variability
Data with high variability
need to be accounted for in the statistical analysis
Kim-Anh Lê Cao May 19-20 2017
207 Falk Foundation Gut Microbiome Symposium
Context Data challenges Proposed solutions Multivariate statistical methods Conclusions
Variability
Variability also due to varying sequencing depth per sample
⇒
Pragmatic solution: transform count data into relative proportionsOther solution: rarefaction, although quite controversial (McMurdie &
Holmes (2014) PLoS ONE ; Weiss et al. (2017) Microbiome)
Kim-Anh Lê Cao May 19-20 2017
207 Falk Foundation Gut Microbiome Symposium
Context Data challenges Proposed solutions Multivariate statistical methods Conclusions
Variability
Variability also due to varying sequencing depth per sample
⇒
Pragmatic solution: transform count data into relative proportionsOther solution: rarefaction, although quite controversial (McMurdie &
Holmes (2014) PLoS ONE ; Weiss et al. (2017) Microbiome)
Kim-Anh Lê Cao May 19-20 2017
207 Falk Foundation Gut Microbiome Symposium
Context Data challenges Proposed solutions Multivariate statistical methods Conclusions
Compositional data
Relative proportion data are compositional
Relative proportion data ‘live’ ina constraint space (a ‘simplex’)
image adapted from http://qedinsight.com
Kim-Anh Lê Cao May 19-20 2017
207 Falk Foundation Gut Microbiome Symposium
Context Data challenges Proposed solutions Multivariate statistical methods Conclusions
Compositional data
Compositional data: consequencesBacterial counts are relative to each other spurious correlation (Pearson in 1897!)
Compositional data are bounded data distribution is not normal most statistical methods assumenon-bounded data!
‘History has proved him [Pearson] correct: over the succeeding years andindeed right up to the present day, there has been no other form of dataanalysis where more confusion has reigned and where more improper andinadequate statistical methods have been applied’ (Aitchison, 2011).
potentially spurious statistical results!Aitchison J (2011), The Statistical Analysis of Compositional Data, SpringerPawlowsky-Glahn V, Egozcue J, Tolosana-Delgado R (2015) Modeling and Analysis of CompositionalData, Wiley
Kim-Anh Lê Cao May 19-20 2017
207 Falk Foundation Gut Microbiome Symposium
Context Data challenges Proposed solutions Multivariate statistical methods Conclusions
Compositional data
Compositional data: consequencesBacterial counts are relative to each other spurious correlation (Pearson in 1897!)
Compositional data are bounded data distribution is not normal most statistical methods assumenon-bounded data!
‘History has proved him [Pearson] correct: over the succeeding years andindeed right up to the present day, there has been no other form of dataanalysis where more confusion has reigned and where more improper andinadequate statistical methods have been applied’ (Aitchison, 2011).
potentially spurious statistical results!Aitchison J (2011), The Statistical Analysis of Compositional Data, SpringerPawlowsky-Glahn V, Egozcue J, Tolosana-Delgado R (2015) Modeling and Analysis of CompositionalData, Wiley
Kim-Anh Lê Cao May 19-20 2017
207 Falk Foundation Gut Microbiome Symposium
Context Data challenges Proposed solutions Multivariate statistical methods Conclusions
Batch effects
Batch effects: systematic sources of variation that may outweightbiological variation
e.g. Sequencing runs (left), batches of expt (right), cage, country,differences in protocols or laboratories
Sinha R [...], Knight R, Huttenhower C (2015), The microbiome quality control project: baseline studydesign and future directions, Genome Biology 16:275The microbiome Quality Control project http://www.mbqc.org/
Kim-Anh Lê Cao May 19-20 2017
207 Falk Foundation Gut Microbiome Symposium
Context Data challenges Proposed solutions Multivariate statistical methods Conclusions
Sparse data
Data are sparse and skewed
• Solution 1: apply or develop non parametric methods:→ Univariate methods limited by small sample sizes→ Multivariate methods (focus of this talk)
• Solution 2: develop statistical models to handle zero-inflateddistributions (univariate)→ Generalised Linear Model not appropriate→ Zero Inflated Poisson, Zero Inflated Negative Binomial models
and variants∗
Review: Xu et al. (2015), Assessment and Selection of Competing Models for Zero-Inflated MicrobiomeData, PLOS ONE 10(7)
Kim-Anh Lê Cao May 19-20 2017
207 Falk Foundation Gut Microbiome Symposium
Context Data challenges Proposed solutions Multivariate statistical methods Conclusions
Compositional data
Data are compositional
• Pragmatic solution 1: log ratio transformation→ Project data into a domain of real numbers (this talk)
from DOI:10.1029/2005GC001092
• Solution 2: novel univariate statistical models to respect theintrinsic data characteristics→ Constraint regression models∗
∗Review: Hongzhe Li (2015), Microbiome, Metagenomics, and High-Dimensional Compositional DataAnalysis, Annual Review of Statistics and Its Application and his latest arXiv tech report
Kim-Anh Lê Cao May 19-20 2017
207 Falk Foundation Gut Microbiome Symposium
Context Data challenges Proposed solutions Multivariate statistical methods Conclusions
Batch effects
Data contain batch effects (sometimes)
• Solution 1: correct for batch effect prior to statistical analysis→ Batch Mean Centering, Combat∗ (originally developed for
microarray data), may also remove biological variation!
Solution 2: account for batch effect directly into the statisticalmodel→ univariate methods: include batch information in the
regression model→ multivariate methods: preliminary results in this talk
∗Johnson et al (2007), Adjusting batch effects in microarray expression data using empirical Bayesmethods, Biostatistics 8(1)
Kim-Anh Lê Cao May 19-20 2017
207 Falk Foundation Gut Microbiome Symposium
Context Data challenges Proposed solutions Multivariate statistical methods Conclusions
Our multivariate methods for microbiome data analysis
Methods implemented in :Identification of microbial signatures to characterise aphenotypeIntegration with other high-throughput ‘omics data for abetter understanding of host - microbiome interactionsLongitudinal analyses to understand disease progression
Lê Cao K-A.†,Costello M-E†, Lakis VA, et al (2016) mixMC: a multivariate statisticalframework to gain insight into Microbial Communities. PLoS ONE 11(8).
Rohart F, Gautier B, Singh A, Lê Cao K-A. mixOmics: an R package for ’omicsfeature selection and multiple data integration, bioRxiv 108597
Kim-Anh Lê Cao May 19-20 2017
207 Falk Foundation Gut Microbiome Symposium
Context Data challenges Proposed solutions Multivariate statistical methods Conclusions
A method for batch effects
A multivariate method to account for batch effectsEva Wang, Helen Benham, Ranjeny Thomas, UQDI
Classical discriminant analysis todiscriminate 28 healthy controls vs86 rheumatoid arthritis patients
−4
0
4
8
−8 −4 0 4
Comp 1
Com
p 2
Patients
HC
RA
Batches
1
2
New method accounts for batcheffects and identifies a microbialsignature
−4
0
4
8
−8 −4 0 4
Comp 1
Com
p 2
Patients
HC
RA
Batches
1
2
RA study: Oral microbiomeunpublished preliminary results
Kim-Anh Lê Cao May 19-20 2017
207 Falk Foundation Gut Microbiome Symposium
Context Data challenges Proposed solutions Multivariate statistical methods Conclusions
A method to identify a multivariate microbial signature
A predictive statistical model based on microbial signatureVanessa Lakis, Helen Benham, Ranjeny Thomas, UQDI
−6 −4 −2 0 2 4 6
−10
−50
5
sPLS−DA Component 1
sPLS−D
A C
ompo
nent
2
Method identifies a microbialsignature:• 94% accuracy to predict healthycontrols vs rheumatoid arthritis
• First Degree Relatives predictedas likely-HC (29) or likely-RA (61)
RA study: Faecal microbiomeunpublished preliminary results
Kim-Anh Lê Cao May 19-20 2017
207 Falk Foundation Gut Microbiome Symposium
Context Data challenges Proposed solutions Multivariate statistical methods Conclusions
A method to identify a multivariate microbial signature
A predictive statistical model based on microbial signatureVanessa Lakis, Helen Benham, Ranjeny Thomas, UQDI
Microbial signature discriminates HCvs RA
HC and RA based on microbial signature
SA
6164S
A6167rep3
S9215
SA
6166S
A6163
S9220
SA
6169S
9216S
9208S
9217S
9218S
9210S
A6161
S9209
S9207
S9214
S9155
S9158
S9147
SA
6165S
A6120
S9112
SA
6109S
9219S
9108S
9206S
9148S
9205S
9153S
9125S
9203S
9117S
A6110
SA
6105S
9221S
9120S
A6111
S9212
S9113
S9213
SA
6101S
A6112
S9204
SA
6168rep3S
9163S
A6162
S9110
SA
6124S
9157S
9162S
9160S
9154S
9128S
A6099
SA
6100S
9150S
A6103
S9146
SA
6106S
9129S
A6097
SA
6108S
9126S
A6123
S9143
S9141
S9161
SA
6095S
A6115
S9135
SA
6113S
A6107
SA
6117S
9109S
A6170
S9156
SA
6125S
A6102
SA
6121S
9121S
9149S
9119S
9164S
A6094
S9142
S9116
S9123
S9152
S9131
SA
6122S
9122S
9130S
A6096
S9136
S9115
SA
6098S
9114S
9111S
A6104
S9145
S9137
S9144
SA
6114S
9165S
9118S
A6118
SA
6119S
9134S
9127S
9159S
9132S
9124S
9140S
A6116
RuminococcaceaeRuminococcaceaeLachnospiraceaeRuminococcaceaeClostridialesRuminococcaceaeBifidobacteriaceaeRuminococcaceaeLachnospiraceaeRuminococcaceaeBacteroidaceaeRuminococcaceaeRuminococcaceaeLachnospiraceaeLachnospiraceaeLachnospiraceaeRuminococcaceaeBacteroidaceaeRuminococcaceaeClostridialesRuminococcaceaeEnterococcaceaeRF32RF32BacteroidaceaeLachnospiraceaeClostridiaceaeVeillonellaceaeOxalobacteraceaeLachnospiraceaeAlcaligenaceaeRuminococcaceae[Odoribacteraceae]RuminococcaceaePorphyromonadaceaePseudomonadaceaePseudomonadaceaeEnterobacteriaceaeRuminococcaceaeLachnospiraceaeRuminococcaceaeClostridiaceaeRuminococcaceaeLachnospiraceaeLachnospiraceaeLachnospiraceaeLachnospiraceaeBacteroidaceaeLachnospiraceaeVeillonellaceae
sPLSDA_ContinuumPatient_Class Patient_Class
HCRA
sPLSDA_Continuum0.005
−0.005
−2
−1
0
1
2
3
4
5
Prediction of FDR as a risk scorefrom likely-HC to likely-RA
FDR prediction based on microbial signature
S9191
S9178
SA
6143
S9170
SA
6145
S9189
S9198
S9190
S9196
S9186
S9172
SA
6146
S9181
S9194
S9167
S9176
S9173
S9184
S9169
S9179
SA
6142
S9182
S9166
SA
6138
SA
6139
S9188
S9199
S9201
SA
6140
S9174
S9177
SA
6144
S9193
S9200
S9202
S9168
S9180
SA
6147
S9197
S9187
SA
6141
S9192
S9171
S9175
S9183
RuminococcaceaeRuminococcaceaeLachnospiraceaeRuminococcaceaeClostridialesRuminococcaceaeBifidobacteriaceaeRuminococcaceaeLachnospiraceaeRuminococcaceaeBacteroidaceaeRuminococcaceaeRuminococcaceaeLachnospiraceaeLachnospiraceaeLachnospiraceaeRuminococcaceaeBacteroidaceaeRuminococcaceaeClostridialesRuminococcaceaeEnterococcaceaeRF32RF32BacteroidaceaeLachnospiraceaeClostridiaceaeVeillonellaceaeOxalobacteraceaeLachnospiraceaeAlcaligenaceaeRuminococcaceae[Odoribacteraceae]RuminococcaceaePorphyromonadaceaePseudomonadaceaePseudomonadaceaeEnterobacteriaceaeRuminococcaceaeLachnospiraceaeRuminococcaceaeClostridiaceaeRuminococcaceaeLachnospiraceaeLachnospiraceaeLachnospiraceaeLachnospiraceaeBacteroidaceaeLachnospiraceaeVeillonellaceae
Predicted_ContinuumPredicted_Class Predicted_Class
HCRA
Predicted_Continuum0.12
−0.02
−2
−1
0
1
2
3
4
RA study: Faecal microbiomeunpublished preliminary results
Kim-Anh Lê Cao May 19-20 2017
207 Falk Foundation Gut Microbiome Symposium
Context Data challenges Proposed solutions Multivariate statistical methods Conclusions
A method to integrate ‘omics data
A multivariate method to integrate ‘omics dataAmrit Singh (UBC), Florian Rohart (UQDI)
14 asthmatic individuals undergoing allergen inhalation challenge
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
● ●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
−50
−40
−30
−20
−10
0
10
0 25 50 75 100 125Time (minutes)
Perc
ent d
rop
in F
EV1
PRE inhalationchallenge POST
mRNA(15,683)
Genepathways(229)
Metabolitepathways
(60)
Cells(9)
PRE
POST
Metabolites(292)
Leukocyte gene expression and plasma metabolite abundance reduced to pathwaysusing eigengene summarisation (Langfelder et al. 2008).
DIABLO seeks for subsets of ‘omics variables maximally correlated.
Singh A, Gautier B, Shannon C, Vacher M, Rohart F, Tebbutt S, Lê Cao K-A. DIABLO - an integrative,multi-omics, multivariate method for multi-group classification. bioRxiv 067611
Kim-Anh Lê Cao May 19-20 2017
207 Falk Foundation Gut Microbiome Symposium
Context Data challenges Proposed solutions Multivariate statistical methods Conclusions
A method to integrate ‘omics data
A multivariate method to integrate ‘omics dataAmrit Singh (UBC), Florian Rohart (UQDI)
Our multi-‘omics signature suggests mechanistic link with response toallergen challenge across different biological layers
−1 −0.5 0 0.5 1
Color key
Valine, leucine and isoleucine biosynthesis
Thyroid cancer
Glyoxylate and dicarboxylate metabolism
Valine, leucine and isoleucine metabolism
Tryptophan metabolism
Lysolipid
Butanoate metabolism
Purine metabolism, (hypo)xanthine/inosine containing
Relative.Eosinophils
Dipeptide
Relative.Basophils
Alanine and aspartate metabolism
gamma−glutamyl
Drug metabolism − other enzymes
Glycerolipid metabolism
Polypeptide
Pantothenate and CoA metabolism
Staphylococcus aureus infection
Autoimmune thyroid disease
Allograft rejection
Graft−versus−host disease
Asthma
●
●
●
CellsgeneModsmetMods
Correlation btw cells, genes/metab pathways
• Cell types: eosinophils and basophils(hallmarks of allergic asthma)
• Gene pathway: Valine, leucine andisoleucine biosynthesis (↑ postchallenge)
• Metabolite pathway: Valine, leucineand isoleucine metabolism (↑ postchallenge)
Singh A, Gautier B, Shannon C, Vacher M, Rohart F, Tebbutt S, Lê Cao K-A. DIABLO - an integrative,multi-omics, multivariate method for multi-group classification. bioRxiv 067611
Kim-Anh Lê Cao May 19-20 2017
207 Falk Foundation Gut Microbiome Symposium
Context Data challenges Proposed solutions Multivariate statistical methods Conclusions
A method to integrate longitudinal data
Integrate longitudinal microbial profiles with islet autoantibody markersDIABIMMUNE T1D study: 33 genetically predisposed infants, faecalsamples from 50 to 1,500 days to elucidate the role of microbial and hostinflammatory activity in T1D (Kostic et al. 2015, Cell host & microbe)
−2
−1
0
1
2
300 600 900Time
Inte
nsity
−1.0
−0.5
0.0
0.5
1.0
Feature 1
Associations with feature 1 with shift Av cor pos= 0.93 (# 15 ),neg= −0.92 (# 44 )
E006574: seroconvert at 532Days, T1D at 1339Days
−2
−1
0
1
2
0 250 500 750 1000Time
Inte
nsity
−1.0
−0.5
0.0
0.5
1.0
Feature 1
Associations with feature 1 with shift Av cor pos= 0.92 (# 15 ),neg= −0.92 (# 12 )
E022137: seroconvert at 562Days
Methods from:Straube S, Huang BE, Lê Cao K-A (2017). DynOmics to identify delays and co-expression patternsacross time course experiments. Sci Rep, 7:40131;Straube J, Gorse AD, Huang BE and Lê Cao K-A (2015). A linear mixed model spline framework foranalysing time course ‘omics data. PLoS ONE 10(8).
unpublished preliminary resultsKim-Anh Lê Cao May 19-20 2017
207 Falk Foundation Gut Microbiome Symposium
Context Data challenges Proposed solutions Multivariate statistical methods Conclusions
A method to integrate longitudinal data
Integrate longitudinal microbial profiles with islet autoantibody markersDIABIMMUNE T1D study: 33 genetically predisposed infants, faecalsamples from 50 to 1,500 days to elucidate the role of microbial and hostinflammatory activity in T1D (Kostic et al. 2015, Cell host & microbe)
−2
−1
0
1
2
300 600 900Time
Inte
nsity
−1.0
−0.5
0.0
0.5
1.0
Feature 1
Associations with feature 1 with shift Av cor pos= 0.93 (# 15 ),neg= −0.92 (# 44 )
E006574: seroconvert at 532Days, T1D at 1339Days
Microbial signature:• Blautia, Coprococcus andStreptococcus positively correlatedwith ↗ insulin autoantibody IAA• Bacteroides, members ofLachnospiraceae and Veillonellaceae,Faecalibacterium (prausnitzii)negatively correlated with IAA.
Methods from:Straube S, Huang BE, Lê Cao K-A (2017). DynOmics to identify delays and co-expression patternsacross time course experiments. Sci Rep, 7:40131;Straube J, Gorse AD, Huang BE and Lê Cao K-A (2015). A linear mixed model spline framework foranalysing time course ‘omics data. PLoS ONE 10(8).
unpublished preliminary resultsKim-Anh Lê Cao May 19-20 2017
207 Falk Foundation Gut Microbiome Symposium
Context Data challenges Proposed solutions Multivariate statistical methods Conclusions
Statistics in the microbiome era
After some initial stalling, computational field is moving at a(too?) fast pace, yet no consensus (normalisation, analysis)
Rigorous statistics needed to address the data challenges andensure reliable results
Flexibility of multivariate methods and novel developments:
data exploration; classification; integration of multiple datasets; biomarker identificationprovide a deeper understanding of the microbiome as abiological system
www.mixOmics.org @mixOmics_team
[email protected] [email protected] Lê Cao May 19-20 2017
207 Falk Foundation Gut Microbiome Symposium
...Context Data challenges Proposed solutions Multivariate statistical methods Conclusions
AcknowledgementsmixOmics team
France: Sébastien Déjean, FrançoisBartoloAustralia: K-A Lê Cao, Florian Rohart,Benoît Gautier
Many mixOmics users
Lê Cao past lab at UQStaff: Florian Rohart, NicholasMatigian, Anne Bernard, BenoîtGautier
PhD students: Ralph Patrick, ChaoLiu, Amrit Singh, Jasmin Straube,Aimee Hanson
Masters/Hons: Eva Wang, VanessaLakis, Zoe Welham, Nick d’Arcy, ThomCuddihy
My dear microbiome collaboratorsThomas lab: Helen Benham, LindaRehaume, Ranjeny Thomas (UQ)
Brown lab: Mary-Ellen Costello, MattBrown (QUT)
Hugenholtz lab (UQ)
two senior postdocs incomputational biostatistics,University of Melbourne
Kim-Anh Lê Cao May 19-20 2017
207 Falk Foundation Gut Microbiome Symposium