BIOINFORMATICS APPROACHES FOR METAGENOMICS DATA … Work… · bioinformatics approaches for...

Post on 26-May-2020

BIOINFORMATICS APPROACHES FOR METAGENOMICS DATA ANALYSIS

ADI DORON-FAIGENBOIM

PLANT SCIENCES, VEGETABLE AND FIELD CROPS, ARO, THE VOLCANI CENTER, ISRAEL, RISHON LEZION 7528809

Metagenomics

o “Metagenomics is the study of the collective genomes of all microorganisms from an environmental sample”

o Community

o Environmental

o Ecological

DNA sequencing & microbial profiling

Traditional microbiology relies on isolation and culture of bacteria

o Cumbersome and labour intensive process

o Fails to account for the diversity of microbial life

o Great plate-count anomaly

Staley, J. T., and A. Konopka. 1985. Measurements of in situ activities of nonphotosynthetic microorganisms in aquatic and terrestrial habitats. Annu. Rev. Microbiol. 39:321-346

Why environmental sequencing?

An estimated 1000 trillion tons of bacterial/archaeal life on Earth

o Only a small proportion of organisms have been grown in culture

o Species do not live in isolation

o Clonal cultures fail to represent the natural environment of a given organism

o Many proteins and protein functions remain undiscovered

Why environmental sequencing?

o Human microbiome
o Rhizobiome
o Pollutant sites
o Non-human microbiomes

The revolution in sequencing technologies

High-throughput technologies promote the accumulation of enormous volumes of genomic and metagenomic data.

Next-Generation Sequencing: A Review of Technologies and Tools for Wound Microbiome Research Brendan P. Hodkinson and Elizabeth A. Grice*. Adv Wound Care (New Rochelle). 2015

HiSeq, MiSeq

Experimental Approaches

Community composition
◦ Microbiome (16S rRNA gene, 18S, ITS, etc.)

Community composition and functional potential
◦ Metagenomics

Functional genetic response
◦ Metatranscriptomics

16S vs. shotgun metagenomics

o 16S: targeted sequencing of a single gene
◦ Marker for identification
◦ Well established
◦ Cheap
◦ Amplifies only what you want

o Shotgun sequencing: sequence all the DNA
◦ No primer bias
◦ Can identify all microbes
◦ Functional information

16S rRNA sequencing

Erlandsen S L et al. J Histochem Cytochem 2005;53:917-927

• 16S rRNA forms part of bacterial ribosomes.

• Contains regions of highly conserved and highly variable sequence.

• The variable sequence can be thought of as a molecular “fingerprint” that can be used to identify bacterial genera and species.

• Large public databases are available for comparison.
– The Ribosomal Database Project (RDP) currently contains >1.5 million rRNA sequences.

• Conserved regions can be targeted to amplify broad range of bacteria from environmental samples.

• Not quantitative due to copy number variation

16S rRNA gene sequencing

o Pros

◦ Well established

◦ Sequencing costs are relatively cheap (~50,000 reads/sample)

◦ Only amplifies what you want (no host contamination)

o Cons

◦ Primer choice can bias results towards certain organisms

◦ Usually not enough resolution to identify to the strain level

◦ Need different primers usually for archaea & eukaryotes (18S)

◦ Cannot identify viruses

◦ No direct functional profiling

Binning sequences to OTUs

o Operational Taxonomic Unit (OTU): an arbitrary definition of a taxonomic unit based on sequence divergence

o Composition-based binning

− GC content

− Di/Tri/Tetra/... nucleotide composition (kmer-based frequency comparison)

− Codon usage statistics
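
The composition features listed above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular binning tool: the function names `gc_content` and `kmer_frequencies` are made up for this example, and ambiguous bases such as N are simply skipped.

```python
from collections import Counter
from itertools import product


def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous A/C/G/T bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def kmer_frequencies(seq: str, k: int = 4) -> dict:
    """Relative frequency of every A/C/G/T k-mer, using overlapping windows.

    k=4 gives the tetranucleotide profile (256 values); windows that
    contain other characters (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}
```

Sequences from the same genome tend to have similar profiles, so the resulting vectors can be compared with any distance measure or fed to a clustering method.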

o Similarity-based binning

− Direct comparison of OTU sequence to a reference database

− Identity cut-off varies depending on the resolution required: Species - 97%, Genus - 90%, Family - 80%

MEGAN Blast against NCBI database

Clustering of OTUs based on sequence similarity

Sample 2 Sample 1

OTU present 50:50 in both samples

Software for binning

o Composition-based binning
◦ TETRA - Maximal-Order Markov Model
◦ PhyloPythia - Support Vector Machine
◦ Seeded Growing Self-Organising Maps (S-GSOM)
◦ TETRA + codon-based usage

o Similarity-based binning
◦ Requires that most sequences in a sample are present in a primary or secondary reference database
◦ QIIME
◦ MEGAN (comparison against Blast NCBI NR)
◦ Mothur (RDP)
◦ CARMA (comparison against PFAM)
◦ ARB (linked with Silva database)

Sequence databases

Measuring diversity of OTUs

Two primary measures for sequence-based studies:

• Alpha diversity

−What is there? How much is there?

−Diversity within a sample

• Beta diversity

−How similar are two samples?

−Diversity between samples

Alpha diversity – human microbiome

C Huttenhower et al. Nature 486, 207-214 (2012) doi:10.1038/nature11234

Alpha diversity

o Species count in the sample
◦ What is a species?
◦ OTUs
◦ Misses the level of evolutionary diversity

o Phylogenetic diversity (PD)
◦ Sum of the branch lengths covered by a sample
◦ Misses the distribution of the species

Alpha diversity

o Simpson’s diversity index (also Shannon, Chao indexes)
◦ Gives less weight to the rarest species

S is the number of species
N is the total number of organisms
ni is the number of organisms of species i
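
The formula shown on the slide did not survive in this transcript. As a hedged sketch, the code below uses one common form of Simpson's index, D = sum(ni(ni - 1)) / (N(N - 1)), written with the symbols defined above, plus the Shannon index for comparison. Which variant the slide actually showed (D, 1 - D or 1/D) is an assumption, and the function names are illustrative.

```python
import math


def simpson_index(counts):
    """Simpson's index D = sum(n_i * (n_i - 1)) / (N * (N - 1)).

    counts holds n_i, the number of organisms of each species.
    D is the chance that two organisms drawn without replacement belong
    to the same species, so a LOWER D means a MORE diverse sample.
    """
    n_total = sum(counts)
    if n_total < 2:
        raise ValueError("need at least two organisms")
    return sum(n * (n - 1) for n in counts) / (n_total * (n_total - 1))


def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln(p_i)) with p_i = n_i / N."""
    n_total = sum(counts)
    return -sum((n / n_total) * math.log(n / n_total) for n in counts if n > 0)
```

Because each species contributes roughly the square of its proportion, rare species barely move D, which matches the note that the index gives less weight to the rarest species.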

Whittaker, R.H. (1972). "Evolution and measurement of species diversity". Taxon(International Association for Plant Taxonomy (IAPT)) 21 (2/3): 213–251

Beta diversity – human microbiome

C Huttenhower et al. Nature 486, 207-214 (2012) doi:10.1038/nature11234

Beta diversity

o Diversity between samples

o UniFrac distance

o Phylogenetic-based beta diversity

o Percentage of observed branch length unique to either sample

Lozupone and Knight, 2005. UniFrac: A new phylogenetic method for comparing microbial communities. Appl Environ Microbiol 71:8228
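
The "percentage of observed branch length unique to either sample" idea can be sketched as follows. This is a simplified illustration of the unweighted measure, assuming the tree has already been reduced to a flat list of branches, each flagged by whether it leads to sequences from sample A and/or sample B. The input layout and the function name are assumptions made for this example, not the interface of any UniFrac implementation.

```python
def unweighted_unifrac(branches):
    """Fraction of observed branch length that is unique to one sample.

    branches: iterable of (length, in_a, in_b), where in_a / in_b say
    whether the branch has descendants in sample A / sample B.
    Returns 0.0 for identical communities and 1.0 for communities that
    share no branch of the tree.
    """
    unique = 0.0
    observed = 0.0
    for length, in_a, in_b in branches:
        if in_a or in_b:          # branch is observed in at least one sample
            observed += length
            if in_a != in_b:      # branch is observed in exactly one sample
                unique += length
    if observed == 0:
        raise ValueError("no branch is observed in either sample")
    return unique / observed
```

For example, a shared branch of length 1.0 plus sample-specific branches of length 2.0 and 1.0 gives a distance of 3.0 / 4.0 = 0.75.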

Other useful data representations

Simple bar charts - what species are present?

Other useful data representations

Rarefaction curves - How much of a community have we sampled?

[Figure: rarefaction curve; x-axis: number of sequences, y-axis: number of OTUs]

Adapted from Wooley et al. A Primer on Metagenomics, PLoS Computational Biology, Feb 2010, Vol 6(2)
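
A rarefaction curve can be approximated by repeated random subsampling. The sketch below assumes each read has already been assigned an OTU label. The function name, the number of repetitions and the fixed seed are choices made for this example only.

```python
import random


def rarefaction_curve(otu_labels, depths, n_reps=10, seed=0):
    """Mean number of distinct OTUs seen when subsampling reads.

    otu_labels: list with one OTU label per read.
    depths: subsample sizes (number of sequences) to evaluate.
    Returns a list of (depth, mean_number_of_otus) pairs.
    """
    rng = random.Random(seed)
    curve = []
    for depth in depths:
        if depth > len(otu_labels):
            raise ValueError("depth exceeds the number of reads")
        total = 0
        for _ in range(n_reps):
            # sample reads without replacement and count distinct OTUs
            total += len(set(rng.sample(otu_labels, depth)))
        curve.append((depth, total / n_reps))
    return curve
```

A curve that flattens as depth increases suggests that most of the community has been sampled; a curve that is still rising suggests that deeper sequencing would reveal more OTUs.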

Shotgun whole metagenome

o Unlike 16S, metagenomic sequencing is not targeted to a specific gene, but takes an unbiased sample of the entire genomic DNA.

o Typically, shorter sequence reads are used to obtain >5 Gb of data per sample.

o HiSeq or NextSeq platforms are typically more cost-effective for metagenomic sequencing

Shotgun metagenomics

• Pros
◦ No primer bias
◦ Can identify all microbes (e.g. eukaryotes, viruses)
◦ Direct functional profiling

• Cons
◦ More expensive (millions of sequences needed)
◦ Host/site contamination can be significant
◦ May not be able to sequence “rare” microbes
◦ Required computational resources can be restrictive
◦ More complex bioinformatic analyses required
◦ Chimeras, unknown function

Sequence coverage

Complexity

Diversity & Coverage

Luis M Rodriguez-R and Konstantinos T Konstantinidis. Estimating coverage in metagenomic data sets and why it matters. ISME J. 2014

Metagenomics' assembly

Metagenomic Assembly: Overview, Challenges and Applications. Yale J Biol Med. 2016 Sep; 89(3): 353–362

Metagenomics' assembly

o Greedy assembler:
◦ Reads with maximum overlaps are iteratively merged into contigs

o Overlap-Layout-Consensus:
◦ A graph is constructed by finding overlaps between all pairs of reads

o de Bruijn graph:
◦ Reads are chopped into short overlapping segments (k-mers)
◦ K-mers are organized in a de Bruijn graph based on their co-occurrence across reads
◦ The graph is simplified to remove artifacts due to sequencing errors
◦ Branch-less paths are reported as contigs
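
The de Bruijn steps can be illustrated with a toy example. The sketch below assumes error-free reads, a single strand and no graph simplification, so it is far from what a real assembler does. The names `build_de_bruijn` and `walk_unbranched_path` are made up for illustration.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Chop reads into k-mers; nodes are (k-1)-mers, each k-mer is an edge."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])   # prefix -> suffix
    return graph


def walk_unbranched_path(graph, start):
    """Extend a contig from start while the path has no branches."""
    indegree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            indegree[node] += 1
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if indegree[nxt] != 1 or nxt in seen:   # branch point or cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


reads = ["ATGGCGT", "GGCGTGC"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unbranched_path(graph, "ATG")   # the two reads merge into one path
```

The choice of k matters: a short k connects low-coverage reads more easily but creates more branches at repeats, which is the trade-off behind multiple k-mer strategies.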

de Bruijn graph approach

o Low abundance genomes may end up fragmented if overall sequencing depth is insufficient to form connections in the graph

o Using a short k-mer size

o The assembler must strike a balance between recovering low abundance genomes and obtaining long, accurate contigs for high abundance genomes

o Computational time and memory may be insufficient to complete such assemblies.

o Multiple k-mer approach

o Spread memory load over a cluster of computers

Metagenome assembly tools

John Vollmers, Sandra Wiegand, Anne-Kristin Kaster. Comparing and Evaluating Metagenome Assembly Tools from a Microbiologist’s Perspective - Not Only Size Matters!

What we do with the assembly

o Characterizing the contigs/scaffolds
◦ Mapping statistics
◦ Compositions (%GC, codon usage)
◦ Annotation - taxonomy & function assignments

o Binning

o Comparative genomics

o Metabolic pathways

Binning over read mapping

o Partition the metagenome into species
◦ Read coverage (multiple samples)
◦ Composition

Metagenomic Assembly: Overview, Challenges and Applications. Yale J Biol Med. 2016 Sep; 89(3): 353–362

Scaffold    sample1  sample2  sample3  GC%
scaffold1   27       7        60       34
scaffold2   29       6        61       33
scaffold3   5        21       20       51
scaffold4   7        20       22       50

Binning over read mapping

[Bar chart: GC% and read coverage in sample1, sample2 and sample3 for scaffold1 to scaffold4 (same data as the preceding table)]
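
Binning by read coverage and composition can be sketched as a toy grouping: scaffolds whose coverage rises and falls together across samples, and whose GC% is similar, are placed in the same bin. The numbers, the thresholds (`min_r`, `max_gc_diff`) and the greedy strategy below are illustrative assumptions. Dedicated binning tools use more elaborate statistical models.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long coverage profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def bin_scaffolds(features, min_r=0.9, max_gc_diff=5.0):
    """Greedy binning on coverage profile and GC%.

    features: {name: (coverage_per_sample, gc_percent)}.
    A scaffold joins the first bin whose first member has a correlated
    coverage profile and a similar GC%; otherwise it starts a new bin.
    """
    bins = []
    for name, (coverage, gc) in features.items():
        for members in bins:
            seed_cov, seed_gc = features[members[0]]
            if pearson(coverage, seed_cov) >= min_r and abs(gc - seed_gc) <= max_gc_diff:
                members.append(name)
                break
        else:
            bins.append([name])
    return bins


# Illustrative values only: (coverage in three samples, GC%)
features = {
    "scaffold1": ((27, 7, 60), 34),
    "scaffold2": ((29, 6, 61), 33),
    "scaffold3": ((5, 21, 20), 51),
    "scaffold4": ((7, 20, 22), 50),
}
bins = bin_scaffolds(features)
```

With these values, scaffold1 and scaffold2 end up in one bin and scaffold3 and scaffold4 in another, since each pair shares both a coverage pattern and a GC%.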

Binning contigs

o Completely automated approach
◦ CONCOCT
◦ GroopM
◦ MetaBAT

o Completeness of metagenome-assembled genomes (MAGs)
◦ Single-copy core genes (tRNA synthetases, ribosomal proteins)
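
The single-copy core gene idea can be reduced to a simple calculation: the share of an expected marker set that is found at least once in a bin. The sketch below is an assumption-laden simplification, with hypothetical marker names and no weighting or lineage-specific marker sets.

```python
def completeness(marker_hits, marker_set):
    """Percentage of expected single-copy marker genes found in a bin.

    marker_hits: marker names detected among the genes of the bin
                 (duplicates and unrelated names are allowed).
    marker_set:  the set of single-copy core genes expected in a
                 complete genome.
    """
    if not marker_set:
        raise ValueError("marker_set must not be empty")
    found = set(marker_hits) & set(marker_set)
    return 100.0 * len(found) / len(marker_set)


# Hypothetical marker names, for illustration only
markers = {"rpsC", "rplB", "pheS", "leuS"}
hits = ["rpsC", "rplB", "rplB", "unrelated_gene"]
pct = completeness(hits, markers)
```

Markers that appear more than once in a bin (such as `rplB` here) may hint that the bin mixes more than one genome, which is worth checking before interpreting a MAG.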

Gene annotation

o Find bacterial genes in the contigs/scaffolds
◦ Prodigal
◦ Prokka

o Annotation of the genes
◦ By homology searches (DIAMOND)
◦ Domain finding

o Comparisons
◦ Gene family
◦ Distribution among the samples (CD-HIT)

Functional potential - The annotations suggest the functional potential of the community

Biological activity is not certain (the genes may not be transcribed and translated)

Common functional databases

o NCBI

o COG
◦ Well known but original classification (not updated since 2003)

o PFAM
◦ Focused more on protein domains based on hidden Markov models

o KEGG
◦ Very popular, each entry is well annotated, and often linked into “Modules” or “Pathways”
◦ Full access now requires a license fee

o MetaCyc
◦ Similar to KEGG, but more microbe focused

o UniRef
◦ Has clustering at different levels (e.g. UniRef100, UniRef90, UniRef50)
◦ Most comprehensive and is constantly updated
◦ These gene families are typically less functionally informative

Metagenomic annotation systems

Web-based
◦ EBI
◦ MG-RAST

GUI-based
◦ MEGAN

Local-based
◦ Kraken
◦ MetAMOS

Post-processing analysis

o Data matrices of samples versus microbial features
◦ Species
◦ Genes
◦ Pathways

o Unsupervised methods
◦ Clustering and correlations
◦ PCA

o Statistical differences between sample types
◦ Taxa or functional genes

A Review of Bioinformatics Tools for Bio-Prospecting from Metagenomic Sequence Data. Front. Genet., 06 March 2017

Case study: the microbiome of fruit peel

Maria Vetcos Edoardo Piombo Shlomit Medina

Shiri Freilich

Samir Droby Michael Wisniewski

Case study: the microbiome of fruit peel

Read length: 150
Total of 472 million quality reads

Sequencing output: files in FASTQ format
o Format: FASTQ
o Total of 472 million quality reads
o Total of 71 Gbp

Assembly: MEGAHIT
o Format: FASTA
o Total number of contigs (all / > 2k): 4,000,000 / 200,000
o Average contig length (all / > 2k): 820 / 4,600 bp
o N50 (all / > 2k): 980 / 5000 bp
o Total #bp (all / > 2k): 3 Gbp / 1 Gbp

Sample  #raw reads   #clean reads  %clean reads  #PE          %mapping vs. filtered set
A1      26,692,151   22,638,404    84.81296243   45,276,808   75.59
A2      32,550,741   27,819,952    85.46641688   55,639,904   69.84
A3      24,083,541   20,677,583    85.85773579   41,355,166   82.77
C1W     29,722,008   25,416,861    85.51528887   50,833,722   78.32
C2W     24,125,961   20,451,024    84.76770728   40,902,048   76.01
C3W     24,956,733   21,353,952    85.56389172   42,707,904   87.48
M1      26,211,005   21,974,866    83.83831906   43,949,732   66.52
M2      5,640,819    4,765,939     84.49019548   9,531,878    62.97
M3      6,113,051    5,137,683     84.04449758   10,275,366   57.24
O1S     23,760,866   19,848,045    83.53249835   39,696,090   57.85
O2S     28,317,777   23,141,736    81.72158429   46,283,472   57.22
O3S     28,604,975   22,679,029    79.28351275   45,358,058   64.43
Total   280,779,628  235,905,074   84.02         471,810,148

                           Full contig set   Contig > 2K
Total number of sequences  3,762,133         206,575
Total number of bps        3,085,995,440     945,480,334
Average sequence length    820.27            4,576.93
N50                        979               4,926
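
N50 is the contig length at which contigs of that length or longer contain at least half of the assembled bases. A minimal sketch of the calculation is given below. The function name is illustrative, and assembly statistics tools may report further variants such as L50 or NG50.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:   # reached half of the assembled bases
            return length


toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]   # toy assembly, 54 bases in total
toy_n50 = n50(toy_lengths)                   # 10 + 9 + 8 = 27 covers half, so N50 is 8
```

Filtering out short contigs removes many sequences but few bases, which is why the N50 of the > 2K set is much higher than that of the full set.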

Format: FASTA
Total number of contigs > 2k bp: 200,000

Gene calling: Prodigal

Format: FASTA
Total number of genes: 1,000,000

From sequence to gene: summary

o Raw genomic data
◦ 4 treatments x 3 repeats = 12 libraries
◦ ~45 million reads per library
◦ Total of ~472 million quality reads

o Genome/gene assembly (pooled data)
◦ ~200,000 contigs with N50 of ~5000 bp
◦ 60% of reads mapped

o Gene calling
◦ ~1,000,000 genes

o Annotations
◦ Functional and taxonomic annotations

JGI annotation platform

Annotation in MEGAN based on DIAMOND similarity search

o 1,000,000 genes searched against NCBI NR with DIAMOND (similarity search)
◦ Detection of homologs for 75% of genes
◦ Condensation into DAA binary format

o MEGAN annotation platform (input: DAA file)
◦ Taxonomy output files: TaxonPath, Taxon ID, etc.
◦ KEGG output files: KEGGPath, KEGGName, etc.
◦ SEED output files: SEEDPath, SEEDName, etc.

Taxonomic annotations

Krona chart: dynamic representation
MEGAN file - Taxonomy ID

assigned_Krona_All.html

Annotations of most genes on the same contig are consistent

SEED

KEGG

Functional annotations

Annotation statistics

Annotation    Genes     Assigned genes  Fraction assigned  % assigned
Taxa          759,353   570,702         0.75               75
Interpro2go   759,353   367,789         0.48               48
Eggnog        759,353   255,892         0.34               34
KEGG*         759,353   187,842         0.25               25

* from seed 2015 mapping file

Count data

The count data are presented as a table that reports, for each sample, the number of sequence fragments assigned to each gene.
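
Such a count table can be assembled from per-read assignments. The sketch below assumes the input is a list of (sample, gene) pairs, one per assigned sequence fragment. That layout and the function name are assumptions for illustration, not the output format of any mapping tool.

```python
from collections import defaultdict


def build_count_table(assignments):
    """Build a genes x samples count table.

    assignments: iterable of (sample, gene) pairs, one per assigned fragment.
    Returns (samples, table), where samples is the sorted column order and
    table maps each gene to its list of counts, one per sample.
    """
    counts = defaultdict(lambda: defaultdict(int))
    samples = set()
    for sample, gene in assignments:
        counts[gene][sample] += 1
        samples.add(sample)
    ordered = sorted(samples)
    table = {gene: [row.get(s, 0) for s in ordered]
             for gene, row in sorted(counts.items())}
    return ordered, table


# Hypothetical gene names; sample names follow the style of the case study
assignments = [("A1", "gene1"), ("A1", "gene1"), ("M1", "gene1"), ("M1", "gene2")]
samples, table = build_count_table(assignments)
```

Raw counts depend on sequencing depth and gene length, so they are usually normalized before samples are compared.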

PCA & correlations

[Figure: PCA plot; sample groups: Israel organic, Israel conventional, US conventional]

Pathway | compounds_contig_conventional | compounds_contig_organic | compounds_gene_conventional | compounds_gene_organic

Cutin, suberine and wax biosynthesis 0 5 0 6

Biosynthesis of alkaloids derived from shikimate pathway 0 5 0 4

Drug metabolism - cytochrome P450 0 10 0 9

Glycerophospholipid metabolism 5 0 5 0

Tyrosine metabolism 2 6 2 6

Bisphenol degradation 0 4 0 4

Penicillin and cephalosporin biosynthesis 2 4 2 4

Chlorocyclohexane and chlorobenzene degradation 0 6 0 5

Steroid hormone biosynthesis 10 1 10 1

Inflammatory mediator regulation of TRP channels 3 1 3 0

Isoquinoline alkaloid biosynthesis 0 6 0 6

Arachidonic acid metabolism 17 0 17 0

Aminobenzoate degradation 0 7 0 7

Retinol metabolism 0 6 0 6

Flavonoid biosynthesis 8 0 8 0

Flavone and flavonol biosynthesis 7 1 6 1

Fluorobenzoate degradation 11 0 11 0

Anthocyanin biosynthesis 12 0 12 0

Betalain biosynthesis 8 0 8 0

Steroid biosynthesis 12 0 12 0

Polycyclic aromatic hydrocarbon degradation 0 21 0 21

Porphyrin and chlorophyll metabolism 14 0 14 0

Amino sugar and nucleotide sugar metabolism 0 9 0 9

Biosynthesis of plant secondary metabolites 4 2 4 1

Biosynthesis of type II polyketide products 5 0 5 0

Ubiquinone and other terpenoid-quinone biosynthesis 1 10 1 10

Linoleic acid metabolism 5 0 5 0

Biosynthesis of 12-, 14- and 16-membered macrolides 21 4 21 4

Glycine, serine and threonine metabolism 4 1 4 1

Differential abundance of enzymes in the KEGG metabolic pathway

[Figure: table with columns Name, Conventional, Organic]

Thank you