Recent Computational Advances in Metagenomics...

Recent Computational Advances in Metagenomics (RCAM’17) RCAM program committee Institut Pasteur Auditorium F. Jacob Oct. 9-10, 2017 RCAM is supported by the metaprogramme "Microbial Ecosystems and Metaomics (MEM)" www.mem.inra.fr from the French National Institute for Agricultural Research (INRA), by the "GdR Molecular Bioinformatics (GdR BiM)" www.gdr-bim.cnrs.fr and by the "Centre de Bioinformatique, Biostatistique et Biologie Intégrative (C3BI)" from Institut Pasteur www.pasteur.fr




RCAM 2017 Workshop Schedule

9-10 October, 2017

Time | Speaker | Title

Monday 9 October

13:30 - 13:55 | Arrival of attendees
13:55 - 14:00 | Opening Remarks
14:00 - 15:00 | Nicola Segata (Key) | Computational metagenomics for large-scale strain-resolved microbiome profiling
15:00 - 15:35 | Camille Marchet | A highly scalable data structure for read similarity computation and its application to marine plankton holobionts
15:35 - 16:10 | Vitor Piro | Distributed indices for metagenomic big data applications
16:10 - 16:40 | Coffee break
16:40 - 17:15 | Simone Pignotti | ProPhyle: a phylogeny based metagenomic classifier using Burrows-Wheeler Transform
17:15 - 17:50 | Florian Plaza Oñate | Abundance-based reconstitution of microbial pan-genomes from whole-metagenome shotgun sequencing data

Tuesday 10 October

09:30 - 10:30 | Léa Siegwald & Ségolène Caboche (Key) | Analytical biases in targeted metagenomics studies
10:30 - 11:05 | Ioannis Nicolis | A comparison of methods to analyse microbiota profiles between paired samples. Motivating example: faeces storage conditions.
11:05 - 11:35 | Coffee break
11:35 - 12:10 | Julien Chiquet | PCA for count data in microbial ecology.
12:10 - 12:45 | Noam Shental | Towards a highly efficient diversity census of the prokaryotic biosphere: a group testing approach.
12:45 - 14:15 | Lunch (not included)
14:15 - 15:15 | Vanessa Jurtz (Key) | Identifying bacteriophage sequences in metagenomic sample
15:15 - 15:50 | Clovis Galiez | WIsH: Who is the host? Predicting prokaryotic hosts from metagenomic phage contigs.
15:50 - 16:00 | Closing Remarks


Contents

1 Computational metagenomics for large-scale strain-resolved microbiome profiling (keynote)

2 A highly scalable data structure for read similarity computation and its application to marine plankton holobionts

3 Distributed indices for metagenomic big data applications

4 ProPhyle: a phylogeny based metagenomic classifier using Burrows-Wheeler Transform

5 Abundance-based reconstitution of microbial pan-genomes from whole-metagenome shotgun sequencing data

6 Analytical biases in targeted metagenomics studies (keynote)

7 A comparison of methods to analyse microbiota profiles between paired samples. Motivating example: faeces storage conditions

8 PCA for count data in microbial ecology.

9 Towards a highly efficient diversity census of the prokaryotic biosphere: a group testing approach.

10 Identifying bacteriophage sequences in metagenomic sample

11 WIsH: Who is the host? Predicting prokaryotic hosts from metagenomic phage contigs.

12 Close-by Restaurants


Computational metagenomics for large-scale strain-resolved microbiome profiling

Dr Nicola Segata1

1Centre for Integrative Biology, University of Trento, Italy


A highly scalable data structure for read similarity computation and its application to marine plankton holobionts

Camille Marchet*1,2†, Arnaud Meng3†, Erwan Corre3, Stephane Le Crom4, Fabrice Not5,6, Lucie Bittner3, Pierre Peterlongo1

† Those authors contributed equally to this work.

1 INRIA Rennes - Bretagne Atlantique/IRISA, EPI GenScale, Rennes, France
2 Univ Rennes 1, Rennes, France
3 Sorbonne Univ, UPMC Univ Paris 06, Univ Antilles Guyane, Univ Nice Sophia Antipolis, CNRS, Evolution Paris Seine - Institut de Biologie Paris Seine, Paris, France
4 CNRS, UPMC, FR2424, ABiMS, Station Biologique, Roscoff, France
5 CNRS, UMR 7144, Station Biologique de Roscoff, Roscoff, France
6 Sorbonne Univ, Univ Pierre et Marie Curie, Paris 06, France

* corresponding author: [email protected]

Genome and transcriptome sequencing generates huge sets of sequences, which are often chopped into voluminous sets of k-mers for further analysis, and this brings its own share of high-performance computing problems. The situation is often exacerbated when sequencing is used to obtain metagenomes or metatranscriptomes. To extract relevant information from the large data sets generated by current sequencing techniques, one must rely on extremely scalable methods and solutions, in particular at the early stages of analysis [2]. We present a lightweight indexing structure that scales to billions of elements and is meant to help compare (meta)genomic and (meta)transcriptomic data sets.

Indexing has often proven extremely expensive for large-scale problems, although it is a fundamental need, for instance for mapping. To address the scalability problem, we rely on an in-house implementation [3] that can construct a Minimal Perfect Hash Function (MPHF) for up to 100 billion elements in hours with a limited memory footprint (<4 bits per element). We combine this MPHF with a quasi-dictionary to associate information with the indexed elements. This quasi-dictionary is the basis for straightforward applications such as computing sequence similarity between large sets using k-mer composition. One such application is embodied in a tool dubbed Short Read Connector Linker (SRC) [4].
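The idea of associating k-mers with reads and then comparing read sets by k-mer composition can be sketched in a few lines. This is only an illustration: an ordinary Python dictionary stands in for the MPHF plus quasi-dictionary, and the function names, the value of k and the `min_shared` threshold are hypothetical choices, not part of SRC.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-mer of a sequence, left to right."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_index(bank, k):
    """Map each k-mer to the ids of the bank reads that contain it.

    A plain dict is used here as a stand-in for the MPHF + quasi-dictionary.
    """
    index = defaultdict(set)
    for read_id, seq in enumerate(bank):
        for km in kmers(seq, k):
            index[km].add(read_id)
    return index


def similar_reads(query, index, k, min_shared):
    """Return {bank read id: number of distinct shared k-mers} above a threshold."""
    hits = defaultdict(int)
    for km in set(kmers(query, k)):
        for read_id in index.get(km, ()):
            hits[read_id] += 1
    return {rid: n for rid, n in hits.items() if n >= min_shared}


bank = ["ACGTACGTAC", "TTTTGGGGCC", "ACGTTTTTTT"]
index = build_index(bank, k=4)
print(similar_reads("CGTACGTA", index, k=4, min_shared=2))
```

A dictionary like this stores every key explicitly, which is exactly the memory cost that a minimal perfect hash function avoids at the scale of billions of k-mers.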

We present the general performance of the structure before focusing on a recent application of SRC in marine biology. The overall project aims to characterize the molecular bases of symbiotic relationships in a plankton holobiont [1]. SRC was integrated into the protocol to split the metatranscriptomic data set into two distinct read subsets (host and symbiont), in order to facilitate two independent de novo transcriptome assemblies used for further analysis. Results show that including SRC accelerated the procedure and made it possible to produce unambiguous transcripts for the host. We suggest that similar initiatives could offer a large-scale comparison strategy for assembling and studying holobionts.

References

[1] Balzano et al. Transcriptome analyses to investigate symbiotic relationships between marine protists.

[2] Dubinkina et al. Assessment of k-mer spectrum applicability for metagenomic dissimilarity analysis. December.

[3] Limasset et al. Fast and scalable minimal perfect hashing for massive key sets. 2017.

[4] Marchet et al. A resource-frugal probabilistic dictionary and applications in bioinformatics. 2017.


Distributed indices for metagenomic big data applications

Vitor C. Piro (1), Andreas Andrusch (2), Temesgen Dadi (3), Knut Reinert (3), Bernhard Renard (1)

(1) MF1 Bioinformatics, Robert Koch Institute, Nordufer 20, 13353 Berlin, Germany
(2) Centre for Biological Threats and Special Pathogens, Robert Koch Institute, Seestraße 10, 13353 Berlin, Germany
(3) Department of Mathematics and Computer Science, Freie Universität Berlin, Takustr. 9, 14195 Berlin, Germany

New and more efficient methods are constantly being developed to overcome the problem of mapping hundreds of millions of reads against ever-growing sets of reference sequences, accounting for mismatches, insertions and deletions [1]. Although many advances in read mapping were achieved in the last decade, the huge amount of genomic data is pushing such algorithms to their limits in terms of memory and time for index construction, in particular in metagenomics. Raw data generation is steadily increasing and consequently more assembled reference sequences are available. The increase in read quality and length provided by evolving sequencing technologies [2] facilitates assembly and causes reference sequence repositories such as the NCBI Whole Genome Shotgun database [3] and Assembly [4] to grow exponentially. In the last 2 years (June 2015-2017) the number of WGS bases deposited in NCBI repositories doubled [5].

In the realm of metagenomic analysis, some alternative methods were developed to overcome the computational burden of big data and read mapping when classifying NGS reads. K-mer classification methods (e.g. kraken [6]) are widely used to produce approximate matches between reads and references. They are fast and scalable compared to exact read mappers. However, they do not generate full alignments and, owing to the fragmentation of sequences into smaller parts, they are not suitable for fully resolving indels and SNPs. Pseudo-alignment, an even faster variation of the k-mer approach (e.g. metakallisto [7]), gives very low resolution on read assignment without producing alignments. Even though such approaches are very useful for some applications, they are not suitable for high-resolution studies such as strain classification, pathogen identification or SNP calling. In addition, such methods focus on fast processing of query data but in most cases rely on a single reference index.

Reference sets for metagenomic analysis are usually very large in order to cover as much diversity as possible. Building a single index for complete reference sequences is becoming more and more prohibitive with current methods and data sizes. The set of complete bacterial genomes sums to 60.6 Gbp, while the whole set of unfinished assembled genomes reaches 782.4 Gbp (August 2017).

We propose a new method to generate distributed indices for read mapping against large sets of sequences, aiming at faster construction and updates. References are split into several clusters by taxonomic proximity, by sequence similarity based on canonical k-mer profile clustering, and possibly by other methods. For each cluster, a k-mer index and an FM-index are generated for fast look-up and full read mapping, respectively. By splitting the references in a structured way, it is possible to obtain problem-oriented clusters that optimize the solution for a specific need. For example, when analyzing a metagenomic sample, sequences clustered by taxonomic proximity provide a reduced search space by segregating different groups of taxa, so that only relevant groups need to be searched. Smaller clusters of sequences can also be updated quickly, allowing auto-updatable indices that keep up with the fast influx of new sequences in public databases. Clusters with high sequence similarity can also be optimized with special data structures (e.g. Journaled string tree [8]), reducing storage and redundancy. We aim to enable high-resolution metagenomic analysis through a distributed read mapping approach over very large sets of reference sequences, giving more accurate read assignment for further analysis.

[1] Canzar, S., & Salzberg, S. L. (2017). Short Read Mapping: An Algorithmic Tour. Proceedings of the IEEE, 105(3), 436–458. http://doi.org/10.1109/JPROC.2015.2455551
[2] Goodwin, S., McPherson, J. D., & McCombie, W. R. (2016). Coming of age: ten years of next-generation sequencing technologies. Nature Reviews Genetics, 17(6), 333–351. http://doi.org/10.1038/nrg.2016.49
[3] www.ncbi.nlm.nih.gov/genbank/wgs/
[4] Kitts, P. a, Church, D. M., Thibaud-Nissen, F., Choi, J., Hem, V., Sapojnikov, V., … Kimchi, A. (2016). Assembly: a resource for assembled genomes at NCBI. Nucleic Acids Research, 44(D1), D73–D80. http://doi.org/10.1093/nar/gkv1226
[5] https://www.ncbi.nlm.nih.gov/genbank/statistics/
[6] Wood, D. E., & Salzberg, S. L. (2014). Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology, 15(3), R46. http://doi.org/10.1186/gb-2014-15-3-r46
[7] Schaeffer, L., Pimentel, H., Bray, N., Melsted, P., & Pachter, L. (2017). Pseudoalignment for metagenomic read assignment. Bioinformatics, 33(14), 2082–2088. http://doi.org/10.1093/bioinformatics/btx106
[8] Rahn, R., Weese, D., & Reinert, K. (2014). Journaled string tree--a scalable data structure for analyzing thousands of similar genomes on your laptop. Bioinformatics, 30(24), 3499–3505. http://doi.org/10.1093/bioinformatics/btu438
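The two-level scheme described in this abstract (a fast k-mer look-up per cluster that decides where full read mapping is worth running) might be sketched as follows. This is a toy illustration under stated assumptions: plain Python sets stand in for the per-cluster k-mer index, the FM-index mapping step is omitted, and the cluster names, `min_fraction` threshold and function names are hypothetical.

```python
def kmer_set(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_cluster_indices(clusters, k):
    """Build one small k-mer index per cluster of reference sequences."""
    return {
        name: set().union(*(kmer_set(s, k) for s in seqs))
        for name, seqs in clusters.items()
    }


def candidate_clusters(read, indices, k, min_fraction=0.5):
    """Clusters sharing enough k-mers with the read to justify full mapping."""
    read_kmers = kmer_set(read, k)
    if not read_kmers:
        return []
    return [
        name
        for name, idx in indices.items()
        if len(read_kmers & idx) / len(read_kmers) >= min_fraction
    ]


def update_cluster(indices, name, new_seq, k):
    """Add a new reference to one cluster without touching the others."""
    indices.setdefault(name, set()).update(kmer_set(new_seq, k))


clusters = {"cluster_A": ["ACGTACGTGG"], "cluster_B": ["TTGACCTTGA"]}
indices = build_cluster_indices(clusters, k=5)
print(candidate_clusters("ACGTACG", indices, k=5))
```

Because each cluster owns its index, adding a sequence only rebuilds or extends one small structure, which is the property that makes auto-updatable indices plausible.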


ProPhyle: a phylogeny-based metagenomic classifier using Burrows-Wheeler Transform

Karel Břinda1, Kamil Salikhov2,3, Simone Pignotti2, Gregory Kucherov2

1 Harvard T.H. Chan School of Public Health, Boston MA 02115, USA
2 LIGM/CNRS Université Paris-Est, 77454 Marne-la-Vallée, France
3 Mechanics and Mathematics Department, Lomonosov Moscow State University, Russia

Metagenomics is a powerful approach to study the genetic content of environmental samples and it has been strongly promoted by NGS technologies. The metagenomic classification problem consists in assigning each sequence of the metagenome to a corresponding taxonomic unit, or classifying it as “novel”. To cope with increasingly large metagenomic projects, researchers resort to alignment-free methods. The most popular tool, Kraken [1], performs metagenomic classification of NGS reads based on the analysis of shared k-mers between an input read and each genome from a pre-compiled database. Kraken provides extremely rapid read classification, but its index suffers from two major limitations. First, its enormous memory consumption, due to a large hash table, does not allow one to perform classification other than on high-performance clusters. This prohibits the use of Kraken when computational resources are limited. For instance, point-of-care sequencing and real-time disease surveillance projects often have to rely on data analysis on laptops with little memory. Second, every k-mer in Kraken’s index is represented through its lowest common ancestor, which can result in inaccurate classification, especially when many k-mers are present in multiple branches of the tree, as is common, e.g., in phylogenetic trees for a single species.

We present ProPhyle, a metagenomic classifier based on a BWT index. ProPhyle uses a classification algorithm similar to Kraken’s but with an indexing strategy based on a bottom-up propagation of k-mers in the tree, assembling contigs at each node and matching using a standard full-text search. The resulting index occupies only a fraction of the RAM required by Kraken: 13 GB instead of 120 GB for index construction and 14 GB instead of 75 GB for index querying. The index is also more expressive, as we can retrieve, for every queried k-mer, a list of all genomes in which the k-mer occurs. Overall, ProPhyle provides an index for resource-frugal metagenomic classification, which is accurate even with single-species phylogenetic trees.

GitHub: https://github.com/karel-brinda/ProPhyle
Documentation: http://prophyle.readthedocs.io/en/latest/

[1] D. E. Wood and S. L. Salzberg, “Kraken: Ultrafast metagenomic sequence classification using exact alignments”, Genome Biology, vol. 15, no. 3, R46, 2014.
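The difference between a lowest-common-ancestor representation and bottom-up k-mer propagation can be made concrete with a toy tree. The sketch below assumes one plausible propagation rule (a k-mer moves up to a parent only when every child holds it); the abstract does not spell out the exact rule, and the contig assembly and BWT index are left out. The tree, genome names and function names are hypothetical.

```python
def propagate(tree, leaf_kmers, node):
    """Post-order pass: k-mers shared by all children move up to the parent.

    Returns {node: k-mers stored at that node} for the subtree rooted at `node`.
    """
    children = tree.get(node, [])
    if not children:
        return {node: set(leaf_kmers[node])}
    stored = {}
    for child in children:
        stored.update(propagate(tree, leaf_kmers, child))
    shared = set.intersection(*(stored[c] for c in children))
    for child in children:
        stored[child] -= shared  # keep each k-mer only at the highest node
    stored[node] = shared
    return stored


def leaves_below(tree, node):
    """All leaf genomes in the subtree rooted at `node`."""
    children = tree.get(node, [])
    if not children:
        return [node]
    return [leaf for c in children for leaf in leaves_below(tree, c)]


def genomes_with_kmer(tree, stored, kmer):
    """Every genome containing the k-mer, recovered from the propagated index."""
    found = set()
    for node, kmer_set in stored.items():
        if kmer in kmer_set:
            found.update(leaves_below(tree, node))
    return sorted(found)


tree = {"root": ["n1", "g3"], "n1": ["g1", "g2"]}
leaf_kmers = {
    "g1": {"AAA", "AAC", "CCC"},
    "g2": {"AAA", "AAG", "CCC"},
    "g3": {"AAA", "TTT", "AAG"},
}
stored = propagate(tree, leaf_kmers, "root")
print(genomes_with_kmer(tree, stored, "AAG"))  # g2 and g3, but not g1
```

In this example an LCA-only index would attach "AAG" to the root, which cannot distinguish g1 from g2 and g3; the propagated index keeps it at g2 and g3 separately.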


Abundance-based reconstitution of microbial pan-genomes from whole-metagenome shotgun sequencing data

Florian Plaza Oñate1,2, Alessandra C. L. Cervino1, Frédéric Magoulès3, S. Dusko Ehrlich2 & Matthieu Pichaud1

1 Enterome, 94-96 Avenue Ledru Rollin, 75011 Paris, France 2 MGP MetaGénoPolis, INRA, Université Paris-Saclay, 78350 Jouy en Josas, France 3 CentraleSupélec, Université Paris-Saclay, 92290 Châtenay-Malabry, France § Corresponding author

Email addresses: FPO: fplaza-onate [at] enterome [dot] com

Abstract

Analysis toolkits for whole-metagenome shotgun sequencing data have achieved strain-level characterization of complex microbial communities by capturing intra-species gene content variation [1,2]. Yet these tools are limited by the set of available reference genomes, which is far from covering all microbial variability, as many species have not been sequenced or have only a few strains available.

Binning co-abundant genes obtained from de novo assembly is a powerful reference-free technique for discovering and reconstituting the gene repertoire of microbial species [3]. While current methods accurately identify species core genes, they miss many accessory genes or split them into small separate clusters. However, characterizing accessory genes is crucial, as they can significantly impact the functional potential of the studied strains, for instance by providing antibiotic resistance or pathogenicity [4,5].

We introduce MSPminer [6], the first reference-free tool that discovers, delineates and structures Metagenomic Species Pan-genomes (MSPs) by binning co-abundant genes across large-scale shotgun metagenomics datasets. MSPminer relies on a new robust measure for grouping not only species core genes but also accessory genes. In MSPs, an empirical classifier distinguishes core from accessory and shared genes.
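To give a feel for what binning co-abundant genes means, here is a deliberately simple sketch. It uses plain Pearson correlation and a greedy seed-based grouping as stand-ins; the robust measure and the classifier used by MSPminer are not described in this abstract and are not reproduced here. The gene names, abundance values and threshold are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no co-abundance signal
    return sxy / sqrt(sxx * syy)


def bin_genes(abundance, threshold=0.9):
    """Greedy binning: each unassigned gene seeds a bin of correlated genes."""
    bins = []
    unassigned = list(abundance)
    while unassigned:
        seed = unassigned.pop(0)
        members = [seed]
        for gene in unassigned[:]:
            if pearson(abundance[seed], abundance[gene]) >= threshold:
                members.append(gene)
                unassigned.remove(gene)
        bins.append(members)
    return bins


# Rows are genes, columns are samples (toy values).
abundance = {
    "geneA": [10, 20, 0, 40],
    "geneB": [5, 10, 0, 20],
    "geneC": [0, 3, 30, 1],
    "geneD": [0, 6, 60, 2],
}
print(bin_genes(abundance))
```

An accessory gene is absent from some samples in which its species is present, so its profile correlates less well with the core genes; that is one reason a plain correlation like this one tends to miss or fragment accessory genes.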

We applied MSPminer to the largest publicly available gene abundance table, which is composed of 9.9M genes quantified in 1,267 stool samples [7]. We show that MSPminer successfully reconstitutes, in a matter of several hours, the gene repertoires of more than 1,600 microbial species (some hitherto unknown) and detects many more accessory genes than existing tools. By compiling the information from thousands of samples, species gene content variability is better accounted for and species quantification is subsequently more precise.

References

1. Scholz M, Ward D V, Pasolli E, Tolio T, Zolfo M, Asnicar F, et al. Strain-level microbial epidemiology and population genomics from shotgun metagenomics. Nat. Methods [Internet]. 2016;13:435–8. Available from: http://www.nature.com/doifinder/10.1038/nmeth.3802

2. Nayfach S, Rodriguez-Mueller B, Garud N, Pollard KS. An integrated metagenomics pipeline for strain profiling reveals novel patterns of bacterial transmission and biogeography. Genome Res. [Internet]. 2016;26:1612–25. Available from: http://genome.cshlp.org/lookup/doi/10.1101/gr.201863.115


3. Nielsen HB, Almeida M, Juncker AS, Rasmussen S, Li J, Sunagawa S, et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat. Biotechnol. [Internet]. 2014 [cited 2014 Jul 9];32:822–8. Available from: http://www.ncbi.nlm.nih.gov/pubmed/24997787

4. Loman NJ, Constantinidou C, Christner M, Rohde H, Chan JZ-M, Quick J, et al. A Culture-Independent Sequence-Based Metagenomics Approach to the Investigation of an Outbreak of Shiga-Toxigenic Escherichia coli O104:H4. JAMA [Internet]. 2013;309:1502. Available from: http://jama.jamanetwork.com/article.aspx?doi=10.1001/jama.2013.3231

5. Scaria J, Ponnala L, Janvilisri T, Yan W, Mueller LA, Chang Y-F. Analysis of Ultra Low Genome Conservation in Clostridium difficile. Horsburgh MJ, editor. PLoS One [Internet]. 2010;5:e15147. Available from: http://dx.plos.org/10.1371/journal.pone.0015147

6. Plaza Oñate F, Cervino ACL, Magoulès F, Ehrlich SD, Pichaud M. Abundance-based reconstitution of microbial pan-genomes from whole-metagenome shotgun sequencing data. bioRxiv [Internet]. 2017; Available from: http://www.biorxiv.org/content/early/2017/08/08/173203

7. Li J, Jia H, Cai X, Zhong H, Feng Q, Sunagawa S, et al. An integrated catalog of reference genes in the human gut microbiome. Nat. Biotechnol. [Internet]. 2014;32:834–41. Available from: http://www.nature.com/doifinder/10.1038/nbt.2942


Analytical biases in targeted metagenomics studies

Lea Siegwald & Segolene Caboche1

1Institut Pasteur de Lille, France


A comparison of methods to analyse microbiota profiles between paired samples. Motivating example: faeces storage conditions.

Ioannis Nicolis1 & Anne-Judith Waligora2

1: EA 4064 «Épidémiologie environnementale: impact sanitaire des pollutions»; Faculté de Pharmacie; Université Paris Descartes; 4 avenue de l’Observatoire; 75270 Paris cedex 06
2: EA 4065 «Écosystème intestinal, probiotiques, antibiotiques»; Faculté de Pharmacie; Université Paris Descartes; 4 avenue de l’Observatoire; 75270 Paris cedex 06

Metagenomic analysis based on the 16S rRNA gene is a widely used method for studying the diversity and abundance of commensal bacteria in faeces. One critical parameter for avoiding bias when analysing microbial communities is the sample storage conditions. Immediate freezing at -80°C is largely accepted as the optimum storage. However, immediate freezing is not always feasible, in particular when the faeces samples are collected at the participants’ residences. In this study we evaluate three alternative storage conditions against immediate freezing at -80°C: storage at room temperature and use of two common preservative buffers, RNAlater and OMNIgene.GUT. After 72 hours of storage, the faecal microbial composition was evaluated by sequencing the V3-V5 region of the 16S rDNA (Illumina MiSeq).

A large variety of methods exist to compare microbiota profiles (1,2). For example, ordination methods such as Principal Coordinates Analysis (PCoA), redundancy analysis (RDA), canonical analysis of principal coordinates (CAP) and distance-based redundancy analysis (dbRDA), using Euclidean or other (Bray-Curtis, Jaccard…) distances, make it possible to detect and visualize clustering between groups, and subsequent tests such as PERMANOVA assess the statistical significance of the differences. Among the assumptions underlying these approaches is the independence between samples. However, in our experimental setting the same sample is split in four and assessed after storage in four different conditions, so the independence assumption is not satisfied. We are only interested in differences between storage conditions, but these can be masked by large between-subject differences if the pairing is ignored.

In this work we compare different methods that take this pairing between samples into account. We tested each of the three alternative conditions against the standard -80°C freezing using multivariate methods, in order to consider the ensemble of species present rather than each one separately. We applied PCA on the differences, SPLS-DA multilevel analysis (3), the paired Hotelling T² test and MANOVA with the subject as random effect. Univariate methods were also used on individual phyla or families of bacteria, in particular mixed-effects linear models and paired Wilcoxon tests. To account for multiple comparisons, the results were subsequently controlled for false discovery rate. Finally, Bland-Altman plots (4) for the major phyla and families were constructed to assess the agreement between each of the three alternative storage conditions and standard freezing.


Most of the methods used assume that the data are sampled from multinormal distributions and that variables are independent, assumptions that are not always satisfied with microbiota data. In particular, because the counts are normalized, they cannot be considered as absolute amounts but as compositional data (5). Thus, we repeated all the above-mentioned analyses after transforming the data using the isometric log ratio (ilr) transform (6). The easier-to-interpret centered log ratio (clr) transform (7) was also tried, but as it yields singular covariance matrices some methods cannot be applied.
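The two transforms can be illustrated on a single composition. The sketch below is a minimal version under stated assumptions: all parts are strictly positive (real count data would need zeros handled first, for example with a pseudocount), and the ilr uses one particular orthonormal basis (sequential "pivot" balances); other bases are equally valid. The function names and example values are hypothetical.

```python
from math import log, sqrt


def clr(x):
    """Centered log ratio: log of each part minus the mean log."""
    logs = [log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def ilr(x):
    """Isometric log ratio in a pivot basis: D parts give D-1 coordinates."""
    logs = [log(v) for v in x]
    coords = []
    for i in range(1, len(x)):
        mean_log_first_i = sum(logs[:i]) / i  # log geometric mean of parts 1..i
        coords.append(sqrt(i / (i + 1)) * (mean_log_first_i - logs[i]))
    return coords


composition = [0.5, 0.3, 0.2]  # relative abundances of three taxa
print(clr(composition))  # three values that always sum to zero
print(ilr(composition))  # two unconstrained coordinates
```

The clr coordinates always sum to zero, which is why their covariance matrix is singular; the ilr removes that redundant dimension while preserving distances between compositions.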

The methods used proved to be complementary: while most methods detected the same general difference trends, they sometimes differed in the details and in the type of information obtained, such as the degree of the observed effect, the particular species responsible for the differences or the confidence intervals associated with these differences. Further work with larger simulated data will be pursued in order to refine these preliminary conclusions.

(1) Paliy, O.; Shankar, V. Application of Multivariate Statistical Techniques in Microbial Ecology. Mol. Ecol. 2016, 25 (5), 1032–1057.

(2) Ramette, A. Multivariate Analyses in Microbial Ecology. FEMS Microbiol. Ecol. 2007, 62 (2), 142–160.

(3) Westerhuis, J. A.; van Velzen, E. J. J.; Hoefsloot, H. C. J.; Smilde, A. K. Multivariate Paired Data Analysis: Multilevel PLSDA versus OPLSDA. Metabolomics 2010, 6 (1), 119–128.

(4) Bland, J. M.; Altman, D. G. Measuring Agreement in Method Comparison Studies. Stat. Methods Med. Res. 1999, 8 (2), 135–160.

(5) Gloor, G. B.; Wu, J. R.; Pawlowsky-Glahn, V.; Egozcue, J. J. It’s All Relative: Analyzing Microbiome Data as Compositions. Ann. Epidemiol. 2016, 26 (5), 322–329.

(6) Egozcue, J. J.; Pawlowsky-Glahn, V.; Mateu-Figueras, G.; Barceló-Vidal, C. Isometric Logratio Transformations for Compositional Data Analysis. Math. Geol. 2003, 35 (3), 279–300.

(7) Aitchison, J. The Statistical Analysis of Compositional Data; Monographs on statistics and applied probability; Chapman and Hall: London; New York, 1986.


Variational Inference for Probabilistic Poisson PCA

RCAM 2017 - October 2017

Julien Chiquet1, Mahendra Mariadassou2, Stéphane Robin1

1 MIA 518, AgroParisTech/INRA, Université Paris-Saclay, 75005 Paris, France
2 MaIAGE, INRA, Université Paris-Saclay, 78350 Jouy-en-Josas, France

Many application domains such as ecology or genomics have to deal with multivariate non-Gaussian observations. A typical example is the joint observation of the respective abundances of a set of species in a series of sites, aimed at understanding the co-variations between these species. The Gaussian setting provides a canonical way to model such dependencies, but does not apply in general to such data.

We consider here the multivariate exponential family framework, for which we introduce a generic hierarchical model with multivariate Gaussian latent variables. This model can be seen as an extension of probabilistic Principal Component Analysis (pPCA) to non-Gaussian settings and enables us to account for covariates and offsets at little additional cost.

Unlike the purely Gaussian setting, the likelihood is generally not tractable in this framework. We resort instead to a variational approximation for parameter inference and solve the corresponding optimization problem using gradient descent. The formal expression of the gradient depends on the exponential family at hand and does not always have an analytical expression. However, the coordinates of the gradient depend only on one-dimensional integrals of the form $\int_{\mathbb{R}} f(x)\, e^{-x^2}\, dx$ for smooth functions $f$, which can be evaluated efficiently using Gauss-Hermite quadrature.
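As a reminder of how Gauss-Hermite quadrature handles integrals of this form, the sketch below hard-codes the standard 3-point rule, which is exact for polynomial f up to degree 5. It is only an illustration of the quadrature itself, not of the authors' inference code; in practice one would use more nodes (for example from `numpy.polynomial.hermite.hermgauss`), and the function name here is hypothetical.

```python
from math import pi, sqrt

# 3-point Gauss-Hermite rule for the weight function exp(-x^2):
# nodes are the roots of the Hermite polynomial H_3.
NODES = (-sqrt(1.5), 0.0, sqrt(1.5))
WEIGHTS = (sqrt(pi) / 6, 2 * sqrt(pi) / 3, sqrt(pi) / 6)


def gauss_hermite(f):
    """Approximate the integral over R of f(x) * exp(-x^2) dx."""
    return sum(w * f(x) for x, w in zip(NODES, WEIGHTS))


# Integral of x^2 * exp(-x^2) over R equals sqrt(pi) / 2.
approx = gauss_hermite(lambda x: x ** 2)
print(approx, sqrt(pi) / 2)
```

The cost is a handful of function evaluations per integral, which is what makes it practical to call inside every gradient step.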

We then focus on the case of the Poisson-lognormal model, for which both the variational approximation and its gradient have closed-form expressions, in the context of microbial ecology. We illustrate the importance of accounting for offsets and covariates on two datasets. Finally, we sketch some promising extensions of the framework, most notably to inference of co-occurrence networks.

Chiquet, Julien, Mahendra Mariadassou, and Stéphane Robin. 2017. “Variational Inference for Probabilistic Poisson PCA.” ArXiv E-Prints, March. https://arxiv.org/abs/1703.06633.


Towards a highly efficient diversity census of the prokaryotic biosphere: a group testing approach

B. Shalem1, A. Amir2, E. Porat1, N. Shental3*

1 Bar Ilan University, Department of Computer Science, Ramat Gan, Israel
2 University of California, San Diego, Department of Pediatrics, La Jolla, CA
3 The Open University of Israel, Department of Computer Science, Raanana, Israel

[email protected]
* Corresponding Author

Abstract

Exploration of the microbial biosphere has grown exponentially in recent years, although we are far from understanding its entirety. We present the "diversity census" problem of exploring all bacterial species in a large cohort of specimens, and detecting a specimen that contains each species (see Figure 1). The naive approach to this problem is to sequence each specimen, thus requiring costly sample preparation steps.

We suggest an approach to diversity censusing that is orders of magnitude more efficient. Specimens are pooled according to a predefined design and standard 16S rRNA sequencing is performed over each pool. For each bacterial species, from the ultra-rare to the most common, the algorithm detects a single specimen that contains the bacterial species. The approach can be applied to large cohorts of monomicrobial cultures or to complex samples containing a mixture of organisms (see a cartoon in Figure 2).
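The pooling-and-decoding idea can be illustrated with a toy example for monomicrobial cultures. The decoding below is a standard textbook group-testing rule (rule out every culture that sits in a pool lacking the species, then accept a culture that is the only remaining candidate in some positive pool); it is not claimed to be the authors' algorithm, and the pooling design, species labels and function names are hypothetical.

```python
def pool_contents(design, cultures):
    """Simulate sequencing: the set of species observed in each pool."""
    return [{cultures[i] for i in pool} for pool in design]


def census(design, observed, n_cultures):
    """For each observed species, find one culture that certainly contains it."""
    result = {}
    for species in sorted(set().union(*observed)):
        # Cultures in a pool that lacks the species cannot contain it.
        candidates = set(range(n_cultures))
        for pool, seen in zip(design, observed):
            if species not in seen:
                candidates -= set(pool)
        # A positive pool with a single remaining candidate pins it down.
        for pool, seen in zip(design, observed):
            remaining = candidates & set(pool)
            if species in seen and len(remaining) == 1:
                result[species] = next(iter(remaining))
                break
    return result


cultures = ["a", "b", "a", "c"]  # hidden ground truth, one species per culture
design = [[0, 2], [1, 3], [0, 1], [2, 3]]  # which cultures go into each pool
observed = pool_contents(design, cultures)
print(census(design, observed, len(cultures)))
```

With only four cultures this toy design uses as many pools as cultures, so it shows the decoding logic rather than any saving; the gains reported in the abstract come from designs tailored to large cohorts, and a species that this simple rule leaves unresolved would need additional pools.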

We model the experimental procedure and show via in silico simulations that the approach enables censusing more than 95% of the species while using 10- to 70-fold fewer resources. Simulated experiments using real samples demonstrate the utility of the approach in censusing large cohorts of samples.

Diversity censusing presents a novel problem in the mathematical field of group testing that may also be applied to other biological problems and in other domains.

The manuscript is under review and a preprint appeared as http://www.biorxiv.org/content/early/2017/07/23/167502


Figure 1: A bacterial diversity census The left column presents cartoons of the bacterial identities in a cohort of 16 monomicrobial cultures (upper subplot) or samples (lower subplot). This ground truth (shown as opaque) is, of course, provided only for demonstration purposes, and is unknown in practice. The right column shows the required output of a diversity census of each cohort of specimens. A. Diversity census of a monomicrobial culture repository results in detecting a single culture that contains each bacterium in the cohort (depicted by the bacterial species in the center of each dish). Since the cohort contains four bacterial species diversity censusing should detect four cultures. Cultures that contain the same bacterium are `symmetric' for the purpose of diversity censusing, e.g. each of the eight cultures that contain the `green-bacilli-like' bacterium may be equally detected. B. Diversity censusing a repository of samples, where each sample contains many species, should detect a sample for each bacterial species across the cohort. For each bacterium, censusing should preferably detect a sample in which the bacterium abundance is high, and hence samples that contain the same bacterium are not 'symmetric'. There are eight 'species' in this cartoon and seven detected samples, since two bacteria were detected in the same sample.

Figure 2: A diagram of the suggested approach. A. Diversity censusing culture repositories. The input is a set of cultures (C1-C4), where each culture contains a single unknown species (i; species in each culture appear as opaque to designate that their identities are unknown). Equal amounts of material from each culture are pooled according to a predefined design (ii), then DNA is extracted and 16S rRNA NGS is performed (iii). The sets of bacteria identified in each pool (iv) are analyzed to detect one culture for each bacterial species, independent of their abundance (v). B. Diversity censusing repositories of samples, where each sample (S1-S4) contains many species (i). The flowchart is similar to the case of cultures. The approach detects a representative sample for each bacterial species that appears in at least one sample and whose abundance is higher than some predefined threshold (v). The same sample (S3) was assigned to both the 'triangle' and 'square' bacteria.
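The pooling workflow for monomicrobial cultures can be illustrated with a toy simulation. The sketch below is only an illustration under strong assumptions and is not the authors' design or decoding algorithm: it assumes a simple row/column pooling grid, noise-free detection of every species present in a pool, and exactly one species per culture. The function names, the species labels and the layout of the 16 cultures are all hypothetical. A culture is reported for a species only when that species is the single one shared by all pools the culture was added to, which guarantees the reported culture really contains it.

```python
def grid_design(n_rows, n_cols):
    """Hypothetical row/column design: culture (r, c) is added to row pool r
    and to column pool n_rows + c."""
    return {r * n_cols + c: (r, n_rows + c)
            for r in range(n_rows) for c in range(n_cols)}


def sequence_pools(design, truth, n_pools):
    """Simulate idealized 16S profiling: the set of species seen in each pool."""
    observed = [set() for _ in range(n_pools)]
    for culture, pools in design.items():
        for pool in pools:
            observed[pool].add(truth[culture])
    return observed


def census(design, observed):
    """Return one representative culture per species, where it can be resolved.

    The true species of a culture appears in every pool containing it, so if
    the pools of a culture share exactly one species, that is its content.
    """
    representative = {}
    for culture in sorted(design):
        common = set.intersection(*(observed[p] for p in design[culture]))
        if len(common) == 1:
            representative.setdefault(next(iter(common)), culture)
    return representative


# Hypothetical ground truth for 16 cultures (unknown in practice), row by row.
truth = ["green", "green", "red", "blue",
         "green", "green", "red", "blue",
         "green", "green", "red", "yellow",
         "green", "green", "blue", "blue"]

design = grid_design(4, 4)                           # 8 pools for 16 cultures
observed = sequence_pools(design, truth, n_pools=8)
print(census(design, observed))
```

Here eight pooled sequencing runs replace sixteen individual ones. Whether every species can be resolved depends on the design and on how the species are distributed across cultures; species that remain ambiguous would need additional pools or a more elaborate decoding step.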

Identifying bacteriophage sequences in metagenomic sample

Vanessa Jurtz

Department for Bio and Health Informatics, Technical University of Denmark

WIsH: Who is the host?

Predicting prokaryotic hosts from metagenomic phage contigs

Viruses are central for understanding microbial ecology and dynamics. Even though phages

(i.e. viruses infecting bacteria and archaea) represent the majority of the global virosphere, their

comprehensive study has been hampered by the necessity of isolating and cultivating their host.

Viral metagenomics circumvents this limitation, increasingly unveiling new viral genomic

sequences from a wide range of environments. As a drawback, the identity of the hosts remains

unknown for these newly discovered viruses, limiting our ecological understanding of the

microbiome.

We present here a new tool, WIsH [1], that predicts prokaryotic hosts of phages from their

genomic sequence. It can process hundreds of thousands of viral metagenomic contigs in a few hours on a

single computer, and shows much improved accuracy on short contigs (a few kbp), typical

of viral studies due to shallow coverage and intra-population variation. Compared with previously available

tools, it improves genus-level prediction accuracy by 65% for 3 kbp-long phage contigs,

while running hundreds of times faster.
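The abstract does not describe how candidate hosts are scored. As a rough illustration of composition-based host prediction in general, and not of WIsH's actual implementation, the sketch below trains a small Markov model on each candidate host genome and assigns a contig to the host whose model gives it the highest average log-likelihood. The function names, the toy sequences, the model order and the pseudocount are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "ACGT"


def train_markov(genome, order):
    """Count, for every context of length `order`, which nucleotide follows it."""
    counts = defaultdict(Counter)
    for i in range(len(genome) - order):
        counts[genome[i:i + order]][genome[i + order]] += 1
    return counts


def mean_log_likelihood(contig, counts, order, pseudocount=1.0):
    """Average per-base log-likelihood of a contig under one host model."""
    n = len(contig) - order
    if n <= 0:
        raise ValueError("contig must be longer than the model order")
    total = 0.0
    for i in range(n):
        row = counts.get(contig[i:i + order], Counter())
        denom = sum(row.values()) + pseudocount * len(ALPHABET)
        total += math.log((row[contig[i + order]] + pseudocount) / denom)
    return total / n


def predict_host(contig, host_genomes, order):
    """Return the best-scoring host name and the score of every host."""
    scores = {name: mean_log_likelihood(contig, train_markov(seq, order), order)
              for name, seq in host_genomes.items()}
    return max(scores, key=scores.get), scores


# Toy, made-up sequences; real genomes and contigs are far longer.
hosts = {"host_A": "AT" * 100, "host_B": "GC" * 100}
best, scores = predict_host("TATATATATATA", hosts, order=2)
print(best, scores)
```

Averaging the log-likelihood per base keeps scores comparable across contigs of different lengths; with real data, a much higher model order and a null model for calibration would typically be needed.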

[1] Galiez et al. WIsH: who is the host? Predicting prokaryotic hosts from metagenomic phage

contigs. Bioinformatics, 2017. https://doi.org/10.1093/bioinformatics/btx383

Clovis Galiez, PostDoc in the Computational Biology group,

Max-Planck Institute for Biophysical Chemistry in Göttingen.

Contact: [email protected]

12 Close-by Restaurants

There are plenty of restaurants in the neighborhood but here are some nice ones.

Name                                  Address                    Phone Number
Le Marcab                             225 Rue de Vaugirard       01 43 06 51 66
Le clin d’œil                         15 rue Copreaux            01 43 06 83 35
Brasserie Le baribal                  186 Rue de Vaugirard       01 47 34 15 32
Les Jardins de Shah Jahan (Indian)    179 Rue de Vaugirard       01 47 34 09 62
Kito Kito (Japanese)                  45 Rue Mathurin Régnier    01 47 34 12 09
