0.1cm Statistical challenges and new data analysis methods ...

;

Context Data challenges Proposed solutions Multivariate statistical methods Conclusions

Statistical challenges and new dataanalysis methods for microbiome data

integration

Kim-Anh Lê Cao

NHMRC Career Development FellowCentre for Systems Genomics

School of Mathematics and Statistics

The University of MelbourneKim-Anh Lê Cao May 19-20 2017

207 Falk Foundation Gut Microbiome Symposium


Microbiome is a complex organism

Uncover the role of microbiome, aswell as its composition

Technological advances:culture-independent & NGS

Not only about cataloging organisms,but biological functions that affect hostand participate in disease processes

Require advanced bioinformatics andcomputational statistics

Kim-Anh Lê Cao May 19-20 2017



We need a holistic view

Morgan et al. 2013, Trends in Genetics, 26Kim-Anh Lê Cao May 19-20 2017



We need cutting-edge statistical methods to study themicrobiome as a whole

Many organisms are co-dependent on each other within anecosystem limited insight when considered individually

Microbiome has a major impact on human health and diseasetreatment outcomes Understand microbiome - host interactions

Multivariate statistical methods for microbiome data analysis:

Consider all bacterial taxa togetherReduce data dimension and enable data visualisation (e.g PCA,PCoA)Enable the integration of multiple sources of biological data




We need open-source and user-friendly tools

The R toolkit (since 2009)

French’Oz team:4 core, 1 developer, students, collaborators

Today: 17 novel multivariate methods

21K downloads in 2016

14 multi-day workshops since 2014 (FR, AUS, NZ)

Our research program focuses on the development of multivariatestatistical methodologies, their applications in areas informed bybiology, and the training of the new generation of computationalbiologists. www.mixOmics.org



www.mixOmics.org


Sparse data

Microbial count data

Example of data from 16S rRNA sequencing after taxonomicclassification:

→ we want to identify bacterial taxa that explain differencesbetween sample groups (beta diversity analyses)




Sparse data

Data are sparse with skewed distribution

Many microbial organisms are absent from a large number ofsamples

’zero-inflated’ distribution




Variability

Data with high variability

need to be accounted for in the statistical analysis




Variability

Variability also due to varying sequencing depth per sample

⇒

Pragmatic solution: transform count data into relative proportionsOther solution: rarefaction, although quite controversial (McMurdie &

Holmes (2014) PLoS ONE ; Weiss et al. (2017) Microbiome)




Compositional data

Relative proportion data are compositional

Relative proportion data ‘live’ ina constraint space (a ‘simplex’)

image adapted from http://qedinsight.com




Compositional data

Compositional data: consequencesBacterial counts are relative to each other spurious correlation (Pearson in 1897!)

Compositional data are bounded data distribution is not normal most statistical methods assumenon-bounded data!

‘History has proved him [Pearson] correct: over the succeeding years andindeed right up to the present day, there has been no other form of dataanalysis where more confusion has reigned and where more improper andinadequate statistical methods have been applied’ (Aitchison, 2011).

potentially spurious statistical results!Aitchison J (2011), The Statistical Analysis of Compositional Data, SpringerPawlowsky-Glahn V, Egozcue J, Tolosana-Delgado R (2015) Modeling and Analysis of CompositionalData, Wiley




Batch effects

Batch effects: systematic sources of variation that may outweightbiological variation

e.g. Sequencing runs (left), batches of expt (right), cage, country,differences in protocols or laboratories

Sinha R [...], Knight R, Huttenhower C (2015), The microbiome quality control project: baseline studydesign and future directions, Genome Biology 16:275The microbiome Quality Control project http://www.mbqc.org/




Sparse data

Data are sparse and skewed

• Solution 1: apply or develop non parametric methods:→ Univariate methods limited by small sample sizes→ Multivariate methods (focus of this talk)

• Solution 2: develop statistical models to handle zero-inflateddistributions (univariate)→ Generalised Linear Model not appropriate→ Zero Inflated Poisson, Zero Inflated Negative Binomial models

and variants∗

Review: Xu et al. (2015), Assessment and Selection of Competing Models for Zero-Inflated MicrobiomeData, PLOS ONE 10(7)




Compositional data

Data are compositional

• Pragmatic solution 1: log ratio transformation→ Project data into a domain of real numbers (this talk)

from DOI:10.1029/2005GC001092

• Solution 2: novel univariate statistical models to respect theintrinsic data characteristics→ Constraint regression models∗

∗Review: Hongzhe Li (2015), Microbiome, Metagenomics, and High-Dimensional Compositional DataAnalysis, Annual Review of Statistics and Its Application and his latest arXiv tech report




Batch effects

Data contain batch effects (sometimes)

• Solution 1: correct for batch effect prior to statistical analysis→ Batch Mean Centering, Combat∗ (originally developed for

microarray data), may also remove biological variation!

Solution 2: account for batch effect directly into the statisticalmodel→ univariate methods: include batch information in the

regression model→ multivariate methods: preliminary results in this talk

∗Johnson et al (2007), Adjusting batch effects in microarray expression data using empirical Bayesmethods, Biostatistics 8(1)




Our multivariate methods for microbiome data analysis

Methods implemented in :Identification of microbial signatures to characterise aphenotypeIntegration with other high-throughput ‘omics data for abetter understanding of host - microbiome interactionsLongitudinal analyses to understand disease progression

Lê Cao K-A.†,Costello M-E†, Lakis VA, et al (2016) mixMC: a multivariate statisticalframework to gain insight into Microbial Communities. PLoS ONE 11(8).

Rohart F, Gautier B, Singh A, Lê Cao K-A. mixOmics: an R package for ’omicsfeature selection and multiple data integration, bioRxiv 108597




A method for batch effects

A multivariate method to account for batch effectsEva Wang, Helen Benham, Ranjeny Thomas, UQDI

Classical discriminant analysis todiscriminate 28 healthy controls vs86 rheumatoid arthritis patients

−4

0

4

8

−8 −4 0 4

Comp 1

Com

p 2

Patients

HC

RA

Batches

1

2

New method accounts for batcheffects and identifies a microbialsignature

−4

0

4

8

−8 −4 0 4

Comp 1

Com

p 2

Patients

HC

RA

Batches

1

2

RA study: Oral microbiomeunpublished preliminary results




A method to identify a multivariate microbial signature

A predictive statistical model based on microbial signatureVanessa Lakis, Helen Benham, Ranjeny Thomas, UQDI

−6 −4 −2 0 2 4 6

−10

−50

5

sPLS−DA Component 1

sPLS−D

A C

ompo

nent

2

Method identifies a microbialsignature:• 94% accuracy to predict healthycontrols vs rheumatoid arthritis

• First Degree Relatives predictedas likely-HC (29) or likely-RA (61)

RA study: Faecal microbiomeunpublished preliminary results




A method to identify a multivariate microbial signature

A predictive statistical model based on microbial signatureVanessa Lakis, Helen Benham, Ranjeny Thomas, UQDI

Microbial signature discriminates HCvs RA

HC and RA based on microbial signature

SA

6164S

A6167rep3

S9215

SA

6166S

A6163

S9220

SA

6169S

9216S

9208S

9217S

9218S

9210S

A6161

S9209

S9207

S9214

S9155

S9158

S9147

SA

6165S

A6120

S9112

SA

6109S

9219S

9108S

9206S

9148S

9205S

9153S

9125S

9203S

9117S

A6110

SA

6105S

9221S

9120S

A6111

S9212

S9113

S9213

SA

6101S

A6112

S9204

SA

6168rep3S

9163S

A6162

S9110

SA

6124S

9157S

9162S

9160S

9154S

9128S

A6099

SA

6100S

9150S

A6103

S9146

SA

6106S

9129S

A6097

SA

6108S

9126S

A6123

S9143

S9141

S9161

SA

6095S

A6115

S9135

SA

6113S

A6107

SA

6117S

9109S

A6170

S9156

SA

6125S

A6102

SA

6121S

9121S

9149S

9119S

9164S

A6094

S9142

S9116

S9123

S9152

S9131

SA

6122S

9122S

9130S

A6096

S9136

S9115

SA

6098S

9114S

9111S

A6104

S9145

S9137

S9144

SA

6114S

9165S

9118S

A6118

SA

6119S

9134S

9127S

9159S

9132S

9124S

9140S

A6116

RuminococcaceaeRuminococcaceaeLachnospiraceaeRuminococcaceaeClostridialesRuminococcaceaeBifidobacteriaceaeRuminococcaceaeLachnospiraceaeRuminococcaceaeBacteroidaceaeRuminococcaceaeRuminococcaceaeLachnospiraceaeLachnospiraceaeLachnospiraceaeRuminococcaceaeBacteroidaceaeRuminococcaceaeClostridialesRuminococcaceaeEnterococcaceaeRF32RF32BacteroidaceaeLachnospiraceaeClostridiaceaeVeillonellaceaeOxalobacteraceaeLachnospiraceaeAlcaligenaceaeRuminococcaceae[Odoribacteraceae]RuminococcaceaePorphyromonadaceaePseudomonadaceaePseudomonadaceaeEnterobacteriaceaeRuminococcaceaeLachnospiraceaeRuminococcaceaeClostridiaceaeRuminococcaceaeLachnospiraceaeLachnospiraceaeLachnospiraceaeLachnospiraceaeBacteroidaceaeLachnospiraceaeVeillonellaceae

sPLSDA_ContinuumPatient_Class Patient_Class

HCRA

sPLSDA_Continuum0.005

−0.005

−2

−1

0

1

2

3

4

5

Prediction of FDR as a risk scorefrom likely-HC to likely-RA

FDR prediction based on microbial signature

S9191

S9178

SA

6143

S9170

SA

6145

S9189

S9198

S9190

S9196

S9186

S9172

SA

6146

S9181

S9194

S9167

S9176

S9173

S9184

S9169

S9179

SA

6142

S9182

S9166

SA

6138

SA

6139

S9188

S9199

S9201

SA

6140

S9174

S9177

SA

6144

S9193

S9200

S9202

S9168

S9180

SA

6147

S9197

S9187

SA

6141

S9192

S9171

S9175

S9183

RuminococcaceaeRuminococcaceaeLachnospiraceaeRuminococcaceaeClostridialesRuminococcaceaeBifidobacteriaceaeRuminococcaceaeLachnospiraceaeRuminococcaceaeBacteroidaceaeRuminococcaceaeRuminococcaceaeLachnospiraceaeLachnospiraceaeLachnospiraceaeRuminococcaceaeBacteroidaceaeRuminococcaceaeClostridialesRuminococcaceaeEnterococcaceaeRF32RF32BacteroidaceaeLachnospiraceaeClostridiaceaeVeillonellaceaeOxalobacteraceaeLachnospiraceaeAlcaligenaceaeRuminococcaceae[Odoribacteraceae]RuminococcaceaePorphyromonadaceaePseudomonadaceaePseudomonadaceaeEnterobacteriaceaeRuminococcaceaeLachnospiraceaeRuminococcaceaeClostridiaceaeRuminococcaceaeLachnospiraceaeLachnospiraceaeLachnospiraceaeLachnospiraceaeBacteroidaceaeLachnospiraceaeVeillonellaceae

Predicted_ContinuumPredicted_Class Predicted_Class

HCRA

Predicted_Continuum0.12

−0.02

−2

−1

0

1

2

3

4

RA study: Faecal microbiomeunpublished preliminary results




A method to integrate ‘omics data

A multivariate method to integrate ‘omics dataAmrit Singh (UBC), Florian Rohart (UQDI)

14 asthmatic individuals undergoing allergen inhalation challenge

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

● ●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

−50

−40

−30

−20

−10

0

10

0 25 50 75 100 125Time (minutes)

Perc

ent d

rop

in F

EV1

PRE inhalationchallenge POST

mRNA(15,683)

Genepathways(229)

Metabolitepathways

(60)

Cells(9)

PRE

POST

Metabolites(292)

Leukocyte gene expression and plasma metabolite abundance reduced to pathwaysusing eigengene summarisation (Langfelder et al. 2008).

DIABLO seeks for subsets of ‘omics variables maximally correlated.

Singh A, Gautier B, Shannon C, Vacher M, Rohart F, Tebbutt S, Lê Cao K-A. DIABLO - an integrative,multi-omics, multivariate method for multi-group classification. bioRxiv 067611




A method to integrate ‘omics data

A multivariate method to integrate ‘omics dataAmrit Singh (UBC), Florian Rohart (UQDI)

Our multi-‘omics signature suggests mechanistic link with response toallergen challenge across different biological layers

−1 −0.5 0 0.5 1

Color key

Valine, leucine and isoleucine biosynthesis

Thyroid cancer

Glyoxylate and dicarboxylate metabolism

Valine, leucine and isoleucine metabolism

Tryptophan metabolism

Lysolipid

Butanoate metabolism

Purine metabolism, (hypo)xanthine/inosine containing

Relative.Eosinophils

Dipeptide

Relative.Basophils

Alanine and aspartate metabolism

gamma−glutamyl

Drug metabolism − other enzymes

Glycerolipid metabolism

Polypeptide

Pantothenate and CoA metabolism

Staphylococcus aureus infection

Autoimmune thyroid disease

Allograft rejection

Graft−versus−host disease

Asthma

●

●

●

CellsgeneModsmetMods

Correlation btw cells, genes/metab pathways

• Cell types: eosinophils and basophils(hallmarks of allergic asthma)

• Gene pathway: Valine, leucine andisoleucine biosynthesis (↑ postchallenge)

• Metabolite pathway: Valine, leucineand isoleucine metabolism (↑ postchallenge)

Singh A, Gautier B, Shannon C, Vacher M, Rohart F, Tebbutt S, Lê Cao K-A. DIABLO - an integrative,multi-omics, multivariate method for multi-group classification. bioRxiv 067611




A method to integrate longitudinal data

Integrate longitudinal microbial profiles with islet autoantibody markersDIABIMMUNE T1D study: 33 genetically predisposed infants, faecalsamples from 50 to 1,500 days to elucidate the role of microbial and hostinflammatory activity in T1D (Kostic et al. 2015, Cell host & microbe)

−2

−1

0

1

2

300 600 900Time

Inte

nsity

−1.0

−0.5

0.0

0.5

1.0

Feature 1

Associations with feature 1 with shift Av cor pos= 0.93 (# 15 ),neg= −0.92 (# 44 )

E006574: seroconvert at 532Days, T1D at 1339Days

−2

−1

0

1

2

0 250 500 750 1000Time

Inte

nsity

−1.0

−0.5

0.0

0.5

1.0

Feature 1


E022137: seroconvert at 562Days

Methods from:Straube S, Huang BE, Lê Cao K-A (2017). DynOmics to identify delays and co-expression patternsacross time course experiments. Sci Rep, 7:40131;Straube J, Gorse AD, Huang BE and Lê Cao K-A (2015). A linear mixed model spline framework foranalysing time course ‘omics data. PLoS ONE 10(8).

unpublished preliminary resultsKim-Anh Lê Cao May 19-20 2017



A method to integrate longitudinal data

Integrate longitudinal microbial profiles with islet autoantibody markersDIABIMMUNE T1D study: 33 genetically predisposed infants, faecalsamples from 50 to 1,500 days to elucidate the role of microbial and hostinflammatory activity in T1D (Kostic et al. 2015, Cell host & microbe)

−2

−1

0

1

2

300 600 900Time

Inte

nsity

−1.0

−0.5

0.0

0.5

1.0

Feature 1


E006574: seroconvert at 532Days, T1D at 1339Days

Microbial signature:• Blautia, Coprococcus andStreptococcus positively correlatedwith ↗ insulin autoantibody IAA• Bacteroides, members ofLachnospiraceae and Veillonellaceae,Faecalibacterium (prausnitzii)negatively correlated with IAA.

Methods from:Straube S, Huang BE, Lê Cao K-A (2017). DynOmics to identify delays and co-expression patternsacross time course experiments. Sci Rep, 7:40131;Straube J, Gorse AD, Huang BE and Lê Cao K-A (2015). A linear mixed model spline framework foranalysing time course ‘omics data. PLoS ONE 10(8).

unpublished preliminary resultsKim-Anh Lê Cao May 19-20 2017



Statistics in the microbiome era

After some initial stalling, computational field is moving at a(too?) fast pace, yet no consensus (normalisation, analysis)

Rigorous statistics needed to address the data challenges andensure reliable results

Flexibility of multivariate methods and novel developments:

data exploration; classification; integration of multiple datasets; biomarker identificationprovide a deeper understanding of the microbiome as abiological system

www.mixOmics.org @mixOmics_team

[email protected] [email protected] Lê Cao May 19-20 2017


www.mixOmics.org

https://twitter.com/mixOmics_team

...Context Data challenges Proposed solutions Multivariate statistical methods Conclusions

AcknowledgementsmixOmics team

France: Sébastien Déjean, FrançoisBartoloAustralia: K-A Lê Cao, Florian Rohart,Benoît Gautier

Many mixOmics users

Lê Cao past lab at UQStaff: Florian Rohart, NicholasMatigian, Anne Bernard, BenoîtGautier

PhD students: Ralph Patrick, ChaoLiu, Amrit Singh, Jasmin Straube,Aimee Hanson

Masters/Hons: Eva Wang, VanessaLakis, Zoe Welham, Nick d’Arcy, ThomCuddihy

My dear microbiome collaboratorsThomas lab: Helen Benham, LindaRehaume, Ranjeny Thomas (UQ)

Brown lab: Mary-Ellen Costello, MattBrown (QUT)

Hugenholtz lab (UQ)

two senior postdocs incomputational biostatistics,University of Melbourne



0.1cm Statistical challenges and new data analysis methods ...

Documents

Transcript of 0.1cm Statistical challenges and new data analysis methods ...