BIOINFORMATICS APPROACHES FOR METAGENOMICS DATA ANALYSIS
ADI DORON-FAIGENBOIM
PLANT SCIENCES, VEGETABLE AND FIELD CROPS, ARO, THE VOLCANI CENTER, ISRAEL, RISHON LEZION 7528809
Metagenomics
o “Metagenomics is the study of the collective genomes of all microorganisms from an environmental sample”
o Community
o Environmental
o Ecological
DNA sequencing & microbial profiling
Traditional microbiology relies on isolation and culture of bacteria
o Cumbersome and labour intensive process
o Fails to account for the diversity of microbial life
o Great plate-count anomaly
Staley, J. T., and A. Konopka. 1985. Measurements of in situ activities of nonphotosynthetic microorganisms in aquatic and terrestrial habitats. Annu. Rev. Microbiol. 39:321-346
Why environmental sequencing?
Estimated 1000 trillion tons of bacterial/archaeal life on Earth
o Only a small proportion of organisms have been grown in culture
o Species do not live in isolation
o Clonal cultures fail to represent the natural environment of a given organism
o Many proteins and protein functions remain undiscovered
Why environmental sequencing?
o Human microbiome
o Rhizobiome
o Pollutant sites
o Non-human microbiomes
The revolution in sequencing technologies
High-throughput technologies promote the accumulation of enormous volumes of genomic and metagenomic data.
Next-Generation Sequencing: A Review of Technologies and Tools for Wound Microbiome Research Brendan P. Hodkinson and Elizabeth A. Grice*. Adv Wound Care (New Rochelle). 2015
HiSeq, MiSeq
Experimental approaches
o Community composition
◦ Microbiome (16S rRNA gene, 18S, ITS, etc.)
o Community composition and functional potential
◦ Metagenomics
o Functional genetic response
◦ Metatranscriptomics
16S vs. shotgun metagenomics
o 16S – targeted sequencing of a single gene
◦ Marker for identification
◦ Well established
◦ Cheap
◦ Amplifies only what you want
o Shotgun sequencing – sequence all the DNA
◦ No primer bias
◦ Can identify all microbes
◦ Functional information
16S rRNA sequencing
Erlandsen S L et al. J Histochem Cytochem 2005;53:917-927
• 16S rRNA forms part of bacterial ribosomes.
• Contains regions of highly conserved and highly variable sequence.
• The variable sequence can be thought of as a molecular “fingerprint” that can be used to identify bacterial genera and species.
• Large public databases are available for comparison.
– Ribosomal Database Project (RDP) currently contains >1.5 million rRNA sequences.
• Conserved regions can be targeted to amplify broad range of bacteria from environmental samples.
• Not strictly quantitative, because the number of 16S rRNA gene copies varies between taxa.
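The copy-number point can be illustrated with a small correction step. This is a minimal sketch, assuming the 16S copy number of each taxon is known; the taxon names, read counts and copy numbers below are invented for illustration.

```python
def copy_number_corrected_abundance(read_counts, copy_numbers):
    """Divide 16S read counts by gene copy number, then rescale to fractions."""
    corrected = {taxon: count / copy_numbers[taxon]
                 for taxon, count in read_counts.items()}
    total = sum(corrected.values())
    return {taxon: value / total for taxon, value in corrected.items()}


# Hypothetical example: equal read counts, different 16S copy numbers.
reads = {"taxon_A": 600, "taxon_B": 600}
copies = {"taxon_A": 6, "taxon_B": 2}
abundance = copy_number_corrected_abundance(reads, copies)
print(abundance)
```

With equal read counts, the taxon carrying fewer 16S copies ends up with the larger estimated share of cells.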
16S rRNA gene sequencing
o Pros
◦ Well established
◦ Sequencing costs are relatively cheap (~50,000 reads/sample)
◦ Only amplifies what you want (no host contamination)
o Cons
◦ Primer choice can bias results towards certain organisms
◦ Usually not enough resolution to identify to the strain level
◦ Different primers are usually needed for archaea & eukaryotes (18S)
◦ Cannot identify viruses
◦ No direct functional profiling
Binning sequences to OTUs
o Operational Taxonomic Unit (OTU): an arbitrary definition of a taxonomic unit based on sequence divergence
o Composition-based binning
− GC content
− Di/tri/tetra/... nucleotide composition (k-mer-based frequency comparison)
− Codon usage statistics
o Similarity-based binning
− Direct comparison of OTU sequence to a reference database
− Identity cut-off varies depending on the resolution required: species - 97%, genus - 90%, family - 80%
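The composition features listed above can be computed directly from a sequence. The following is a minimal sketch, not the method of any particular binning tool: it computes GC content and a tetranucleotide frequency vector and compares two sequences by Euclidean distance. The function names and the example sequences are assumptions made for illustration, and reverse complements are not merged, which real tools usually do.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    gc = sum(1 for base in seq if base in "GC")
    return gc / len(seq) if seq else 0.0


def kmer_frequencies(seq, k=4):
    """Relative frequency of every possible k-mer over A, C, G, T (fixed order)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in kmers)
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


def euclidean(u, v):
    """Euclidean distance between two equally long vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


# Hypothetical sequences, for illustration only.
seq_a = "ATGCGCGCTAGCGCGGCTAGCGC"
seq_b = "ATATTTAATATAAATTTATATAT"
print(gc_content(seq_a), gc_content(seq_b))
print(euclidean(kmer_frequencies(seq_a), kmer_frequencies(seq_b)))
```

Sequences from the same genome tend to have similar vectors, which is what composition-based binning exploits.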
[Figure: MEGAN, BLAST against the NCBI database; clustering of OTUs based on sequence similarity, with an OTU present 50:50 in Sample 1 and Sample 2]
Software for binning
o Composition-based binning
◦ TETRA - maximal-order Markov model
◦ PhyloPythia – support vector machine
◦ Seeded Growing Self-Organising Maps (S-GSOM)
◦ TETRA + codon-based usage
o Similarity-based binning
◦ Requires that most sequences in a sample are present in a primary or secondary reference database
◦ QIIME
◦ MEGAN (comparison against BLAST NCBI NR)
◦ Mothur (RDP)
◦ CARMA (comparison against PFAM)
◦ ARB (linked with Silva database)
Sequence databases
Measuring diversity of OTUs
Two primary measures for sequence-based studies:
• Alpha diversity
−What is there? How much is there?
−Diversity within a sample
• Beta diversity
−How similar are two samples?
−Diversity between samples
Alpha diversity – human microbiome
C Huttenhower et al. Nature 486, 207-214 (2012) doi:10.1038/nature11234
Alpha diversity
o Species count in the sample
◦ What is a species?
◦ OTUs
◦ Misses the level of evolutionary diversity
o Phylogenetic diversity (PD)
◦ Sum of the branch lengths covered by a sample
◦ Misses the distribution of the species
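Phylogenetic diversity as "sum of the branch lengths covered by a sample" can be sketched on a toy tree. This is a minimal illustration, assuming the tree is stored as a mapping from each node to its parent and branch length; the tree, node names and branch lengths are hypothetical.

```python
def phylogenetic_diversity(tree, observed_tips):
    """Sum the branch lengths on the paths from the observed tips to the root.

    tree maps node -> (parent, branch_length); the root has no entry.
    """
    covered = set()
    for tip in observed_tips:
        node = tip
        # Walk towards the root, stopping once an already counted branch is met.
        while node in tree and node not in covered:
            covered.add(node)
            node = tree[node][0]
    return sum(tree[node][1] for node in covered)


# Hypothetical tree: root -> X (1.0), root -> C (2.0), X -> A (0.5), X -> B (0.5).
toy_tree = {
    "X": ("root", 1.0),
    "C": ("root", 2.0),
    "A": ("X", 0.5),
    "B": ("X", 0.5),
}
print(phylogenetic_diversity(toy_tree, ["A", "B"]))
print(phylogenetic_diversity(toy_tree, ["A", "C"]))
```

Both samples contain two tips, yet the sample spanning both sides of the root covers more branch length, which a plain species count cannot show.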
Alpha diversity
o Simpson’s diversity index (also Shannon, Chao indexes)
◦ Gives less weight to the rarest species
◦ S is the number of species
◦ N is the total number of organisms
◦ n_i is the number of organisms of species i
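The symbols defined above are enough to compute these indices. The sketch below assumes one widely used form of Simpson's index, D = Σ n_i(n_i − 1) / (N(N − 1)), and the Shannon index H = −Σ p_i ln p_i with p_i = n_i / N. The exact variant intended on the slide is not stated, so treat the formulas and the example counts as assumptions.

```python
import math


def simpson_index(counts):
    """Simpson's D: probability that two organisms drawn without replacement share a species."""
    n_total = sum(counts)
    return sum(n * (n - 1) for n in counts) / (n_total * (n_total - 1))


def shannon_index(counts):
    """Shannon index H, using natural logarithms."""
    n_total = sum(counts)
    return -sum((n / n_total) * math.log(n / n_total) for n in counts if n > 0)


# Hypothetical community: organisms counted per species (S = 3, N = 100).
counts = [50, 30, 20]
d = simpson_index(counts)
print("Simpson D:", d, "diversity 1 - D:", 1 - d)
print("Shannon H:", shannon_index(counts))
```

Because each species contributes roughly the square of its proportion, rare species barely move D, which is the "less weight to the rarest species" property.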
Whittaker, R.H. (1972). "Evolution and measurement of species diversity". Taxon (International Association for Plant Taxonomy (IAPT)) 21 (2/3): 213–251
Beta diversity – human microbiome
C Huttenhower et al. Nature 486, 207-214 (2012) doi:10.1038/nature11234
Beta diversity
o Diversity between samples
o UniFrac distance
o Phylogeny-based beta diversity
o Percentage of observed branch length unique to either sample
Lozupone and Knight, 2005. UniFrac: A new phylogenetic method for comparing microbial communities. Appl Environ Microbiol 71:8228
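The "branch length unique to either sample" idea can be sketched as an unweighted UniFrac calculation on a toy tree. This is an illustrative simplification, not the reference implementation; the tree, its branch lengths and the sample contents are hypothetical.

```python
def covered_branches(tree, tips):
    """Return the set of branches on the paths from the given tips to the root.

    tree maps node -> (parent, branch_length); the root has no entry.
    """
    covered = set()
    for tip in tips:
        node = tip
        while node in tree and node not in covered:
            covered.add(node)
            node = tree[node][0]
    return covered


def unweighted_unifrac(tree, tips_a, tips_b):
    """Branch length unique to one sample divided by branch length seen in either."""
    a = covered_branches(tree, tips_a)
    b = covered_branches(tree, tips_b)
    unique = sum(tree[node][1] for node in a ^ b)
    total = sum(tree[node][1] for node in a | b)
    return unique / total if total else 0.0


# Hypothetical tree: root -> X (1.0), root -> C (2.0), X -> A (0.5), X -> B (0.5).
toy_tree = {
    "X": ("root", 1.0),
    "C": ("root", 2.0),
    "A": ("X", 0.5),
    "B": ("X", 0.5),
}
print(unweighted_unifrac(toy_tree, ["A", "B"], ["B", "C"]))
```

Identical samples give 0 and samples sharing no branches give 1.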
Other useful data representations
Simple bar charts - what species are present?
Other useful data representations
Rarefaction curves - how much of a community have we sampled?
[Figure: rarefaction curve; x-axis: number of sequences, y-axis: number of OTUs]
Adapted from Wooley et al. A Primer on Metagenomics, PLoS Computational Biology, Feb 2010, Vol 6(2)
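A rarefaction curve can be approximated by repeatedly subsampling the reads of a sample and counting the OTUs observed at each depth. The sketch below assumes each read has already been assigned an OTU label; the labels, depths and repeat count are hypothetical choices.

```python
import random


def rarefaction_curve(otu_labels, depths, n_repeats=10, seed=0):
    """Mean number of distinct OTUs seen when subsampling reads without replacement."""
    rng = random.Random(seed)
    curve = []
    for depth in depths:
        observed = [len(set(rng.sample(otu_labels, depth)))
                    for _ in range(n_repeats)]
        curve.append((depth, sum(observed) / n_repeats))
    return curve


# Hypothetical sample: one OTU label per read, with a few rare OTUs.
reads = ["OTU1"] * 60 + ["OTU2"] * 30 + ["OTU3"] * 9 + ["OTU4"] * 1
for depth, mean_otus in rarefaction_curve(reads, [10, 50, 100]):
    print(depth, mean_otus)
```

A curve that flattens as depth increases suggests most of the community has been sampled; a curve that is still rising suggests more sequencing would reveal more OTUs.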
Shotgun whole metagenome
o Unlike 16S, metagenomic sequencing is not targeted to a specific gene, but takes an unbiased sample of the entire genomic DNA.
o Typically shorter sequence reads are used to obtain >5 Gb of data per sample.
o HiSeq or NextSeq platforms are typically more cost-effective for metagenomic sequencing.
Shotgun metagenomics
• Pros
◦ No primer bias
◦ Can identify all microbes (e.g. eukaryotes, viruses)
◦ Direct functional profiling
• Cons
◦ More expensive (millions of sequences needed)
◦ Host/site contamination can be significant
◦ May not be able to sequence “rare” microbes
◦ Required computational resources can be restrictive
◦ More complex bioinformatic analyses required
◦ Chimeras, unknown function
Sequence coverage
Complexity
Diversity & Coverage
Luis M Rodriguez-R and Konstantinos T Konstantinidis. Estimating coverage in metagenomic data sets and why it matters. ISME J. 2014
Metagenomics' assembly
Metagenomic Assembly: Overview, Challenges and Applications. Yale J Biol Med. 2016 Sep; 89(3): 353–362
Metagenomics' assembly
o Greedy assembler
◦ Reads with maximum overlaps are iteratively merged into contigs
o Overlap-Layout-Consensus
◦ A graph is constructed by finding overlaps between all pairs of reads
o de Bruijn graph
◦ Reads are chopped into short overlapping segments (k-mers)
◦ K-mers are organized in a de Bruijn graph based on their co-occurrence across reads
◦ The graph is simplified to remove artifacts due to sequencing errors
◦ Branch-less paths are reported as contigs
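The de Bruijn steps above can be sketched in a few lines. This is a toy illustration under strong assumptions: error-free reads, a single strand, and no graph simplification step, so it is not how a production assembler works. Nodes are (k-1)-mers, edges are the distinct k-mers, and each branch-less path is reported as a contig; isolated cycles are ignored.

```python
from collections import defaultdict


def de_bruijn_contigs(reads, k):
    """Report branch-less paths of a de Bruijn graph built from the reads."""
    # Edges are the distinct k-mers; nodes are their (k-1)-mer prefix and suffix.
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    out_edges = defaultdict(list)
    in_degree = defaultdict(int)
    for kmer in sorted(kmers):
        out_edges[kmer[:-1]].append(kmer[1:])
        in_degree[kmer[1:]] += 1
    nodes = set(out_edges) | set(in_degree)

    def is_simple(node):
        # Exactly one edge in and one edge out: the path continues unbranched.
        return in_degree[node] == 1 and len(out_edges[node]) == 1

    contigs = []
    for node in sorted(nodes):
        if is_simple(node):
            continue  # contigs start only at branch points or path ends
        for nxt in out_edges[node]:
            path = node + nxt[-1]
            while is_simple(nxt):
                nxt = out_edges[nxt][0]
                path += nxt[-1]
            contigs.append(path)
    return contigs


# Hypothetical overlapping reads drawn from the sequence ATGGCGTGCA.
print(de_bruijn_contigs(["ATGGCGT", "GGCGTGCA"], 4))
```

A repeat longer than k-1 creates branch points and splits the contigs, which is one reason the choice of k matters.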
de Bruijn graph approach
o Low abundance genomes may end up fragmented if overall sequencing depth is insufficient to form connections in the graph
o Using a short k-mer size
o The assembler must strike a balance between recovering low abundance genomes and obtaining long, accurate contigs for high abundance genomes
o Computational time and memory may be insufficient to complete such assemblies.
o Multiple k-mer approach
o Spread memory load over a cluster of computers
Metagenome assembly tools
Comparing and Evaluating Metagenome Assembly Tools from a Microbiologist’s Perspective - Not Only Size Matters! John Vollmers, Sandra Wiegand, Anne-Kristin Kaster
What we do with the assembly
o Characterizing the contigs/scaffolds
◦ Mapping statistics
◦ Composition (%GC, codon usage)
◦ Annotation - taxonomy & function assignments
o Binning
o Comparative genomics
o Metabolic pathways
Binning over read mapping
o Partition the metagenome into species
◦ Read coverage (multiple samples)
◦ Composition
Metagenomic Assembly: Overview, Challenges and Applications. Yale J Biol Med. 2016 Sep; 89(3): 353–362
            sample1  sample2  sample3  GC%
scaffold1        27        7       60   34
scaffold2        29        6       61   33
scaffold3         5       21       20   51
scaffold4         7       20       22   50
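Binning by coverage and composition amounts to grouping scaffolds whose per-sample coverage and GC% are similar. The sketch below uses a simple greedy distance threshold as a stand-in; tools such as CONCOCT or MetaBAT use statistical models instead. The scaffold names, values and threshold are invented for illustration, and the features are left unscaled for brevity.

```python
def bin_scaffolds(profiles, max_distance):
    """Greedy binning: a scaffold joins the first bin whose seed profile is close enough."""
    bins = []  # each entry: (seed_profile, [scaffold names])
    for name, profile in profiles.items():
        for seed, members in bins:
            dist = sum((a - b) ** 2 for a, b in zip(profile, seed)) ** 0.5
            if dist <= max_distance:
                members.append(name)
                break
        else:
            # No existing bin is close enough, so this scaffold seeds a new bin.
            bins.append((profile, [name]))
    return [members for _, members in bins]


# Hypothetical profiles: (coverage in three samples, GC%).
profiles = {
    "contig_a": (30, 8, 55, 35),
    "contig_b": (28, 7, 58, 34),
    "contig_c": (5, 20, 21, 51),
    "contig_d": (6, 22, 19, 50),
}
print(bin_scaffolds(profiles, max_distance=10))
```

Scaffolds from the same genome rise and fall together across samples, so several samples separate bins far better than one.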
[Figure: bar chart of GC% and read coverage in sample1, sample2 and sample3 for scaffold1 to scaffold4]
Binning contigs
o Completely automated approach
◦ CONCOCT
◦ GroopM
◦ MetaBAT
o Completeness of metagenome-assembled genomes (MAGs)
◦ Single-copy core genes (tRNA synthetases, ribosomal proteins)
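Completeness from single-copy core genes can be sketched as the share of expected marker genes found in a bin, with extra copies hinting at contamination. This is a simplified illustration, not the scoring used by any specific tool; the marker list and the hits are hypothetical.

```python
from collections import Counter


def completeness_and_contamination(marker_hits, marker_set):
    """Return (% markers present, % extra marker copies) for one genome bin."""
    counts = Counter(hit for hit in marker_hits if hit in marker_set)
    present = sum(1 for marker in marker_set if counts[marker] >= 1)
    extra_copies = sum(counts[marker] - 1
                       for marker in marker_set if counts[marker] > 1)
    n_markers = len(marker_set)
    return 100.0 * present / n_markers, 100.0 * extra_copies / n_markers


# Hypothetical marker set and the marker genes detected in one bin.
markers = {"rpsA", "rplB", "rpsC", "alaS", "glyS"}
hits = ["rplB", "rpsC", "alaS", "alaS", "glyS"]
completeness, contamination = completeness_and_contamination(hits, markers)
print(completeness, contamination)
```

A marker expected once but found twice suggests that sequences from more than one genome were placed in the same bin.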
Gene annotation
o Find bacterial genes in the contigs/scaffolds
◦ Prodigal
◦ Prokka
o Annotation of the genes
◦ By homology searches (DIAMOND)
◦ Domain finding
o Comparisons
◦ Gene family
◦ Distribution among the samples (CD-HIT)
o Functional potential - the annotations suggest the functional potential of the community
◦ They do not establish biological activity (a gene may not be transcribed and translated)
Common functional databases
o NCBI
o COG
◦ Well known, but original classification (not updated since 2003)
o PFAM
◦ Focused more on protein domains, based on hidden Markov models
o KEGG
◦ Very popular; each entry is well annotated and often linked into “Modules” or “Pathways”
◦ Full access now requires a license fee
o MetaCyc
◦ Similar to KEGG, but more microbe focused
o UniRef
◦ Has clustering at different levels (e.g. UniRef100, UniRef90, UniRef50)
◦ Most comprehensive and constantly updated
◦ These gene families are typically less functionally informative
Metagenomic annotation systems
o Web-based
◦ EBI
◦ MG-RAST
o GUI-based
◦ MEGAN
o Local
◦ Kraken
◦ MetAMOS
Post-processing analysis
o Data matrices of samples versus microbial features
◦ Species
◦ Genes
◦ Pathways
o Unsupervised methods
◦ Clustering and correlations
◦ PCA
o Statistical differences between sample types
◦ Taxa or functional genes
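The "clustering and correlations" step starts from a samples-by-features matrix. As a minimal sketch, the code below converts counts to relative abundances and computes the Pearson correlation between two samples. The sample names and counts are hypothetical, and a real analysis would typically use a statistics library and add normalisation.

```python
def relative_abundance(counts):
    """Convert raw feature counts for one sample into fractions."""
    total = sum(counts)
    return [count / total for count in counts]


def pearson(x, y):
    """Pearson correlation coefficient between two equally long vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


# Hypothetical count matrix: rows are samples, columns are features (e.g. genes).
matrix = {
    "sample_1": [120, 30, 50, 0],
    "sample_2": [240, 55, 110, 5],
    "sample_3": [10, 200, 15, 90],
}
profiles = {name: relative_abundance(row) for name, row in matrix.items()}
print(pearson(profiles["sample_1"], profiles["sample_2"]))
print(pearson(profiles["sample_1"], profiles["sample_3"]))
```

Samples with similar community profiles correlate strongly, and the full set of pairwise values can feed clustering.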
A Review of Bioinformatics Tools for Bio-Prospecting from Metagenomic Sequence Data. Front. Genet., 06 March 2017
Case study: the microbiome of fruit peel
Maria Vetcos, Edoardo Piombo, Shlomit Medina, Shiri Freilich, Samir Droby, Michael Wisniewski
Case study: the microbiome of fruit peel
o Sequencing output: files in FASTQ format
◦ Read length: 150
◦ Total of 472 million quality reads
◦ Total of 71 Gbp
o Assembly: MEGAHIT
◦ Format: FASTA
◦ Total number of contigs / contigs > 2k: 4,000,000 / 200,000
◦ Average contig length: 820 / 4,600 bp
◦ N50: 980 / 5,000 bp
◦ Total #bp: 3 Gbp / 1 Gbp
Sample   #raw reads    #clean reads   %clean reads   #PE           %mapping vs. filtered set
A1       26,692,151    22,638,404     84.81          45,276,808    75.59
A2       32,550,741    27,819,952     85.47          55,639,904    69.84
A3       24,083,541    20,677,583     85.86          41,355,166    82.77
C1W      29,722,008    25,416,861     85.52          50,833,722    78.32
C2W      24,125,961    20,451,024     84.77          40,902,048    76.01
C3W      24,956,733    21,353,952     85.56          42,707,904    87.48
M1       26,211,005    21,974,866     83.84          43,949,732    66.52
M2        5,640,819     4,765,939     84.49           9,531,878    62.97
M3        6,113,051     5,137,683     84.04          10,275,366    57.24
O1S      23,760,866    19,848,045     83.53          39,696,090    57.85
O2S      28,317,777    23,141,736     81.72          46,283,472    57.22
O3S      28,604,975    22,679,029     79.28          45,358,058    64.43
Total   280,779,628   235,905,074     84.02         471,810,148
                            Full contig set   Contigs > 2K
Total number of sequences         3,762,133        206,575
Total number of bps           3,085,995,440    945,480,334
Average sequence length              820.27       4,576.93
N50                                     979          4,926
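N50 is the contig length at which the contigs of that length or longer contain at least half of the assembled bases. A minimal sketch of the calculation follows; the contig lengths are a hypothetical toy set, not the case-study data.

```python
def n50(lengths):
    """Length of the contig at which the running total reaches half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


# Hypothetical contig lengths (total 54, so half is 27).
contig_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(contig_lengths))
```

Dropping short contigs raises N50 without improving the assembly, which is why the full set and the > 2K subset are reported separately.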
Format: FASTA
Total number of contigs > 2k bp: 200,000
Gene calling: Prodigal
Format: FASTA
Total number of genes: 1,000,000
From sequence to gene: summary
o Raw genomic data
◦ 4 treatments x 3 repeats = 12 libraries
◦ ~45 million reads per library
◦ Total of ~472 million quality reads
o Genome/gene assembly (pooled data)
◦ ~200,000 contigs with N50 of ~5000 bp
◦ With 60% of reads mapped
o Gene calling
◦ ~1,000,000 genes
o Annotations
◦ Functional and taxonomic annotations
JGI annotation platform
MEGAN annotation platform
Annotation in MEGAN based on DIAMOND similarity search
o 1,000,000 genes
o DIAMOND similarity search against NCBI NR
◦ Detection of homologs for 75% of genes
◦ Condensation into DAA binary format
o Input DAA file
o Taxonomy: output files (TaxonPath, Taxon ID, etc.)
o KEGG: output files (KEGGPath, KEGGName, etc.)
o SEED: output files (SEEDPath, SEEDName, etc.)
Taxonomic annotations
o Krona chart: dynamic representation
◦ MEGAN file - Taxonomy ID
◦ assigned_Krona_All.html
Functional annotations
o Annotations of most genes on the same contig are consistent
◦ SEED
◦ KEGG
Annotation statistics
              Genes     Assigned genes   Fraction assigned   % assigned
Taxa          759,353   570,702          0.75                75
Interpro2go   759,353   367,789          0.48                48
Eggnog        759,353   255,892          0.34                34
KEGG*         759,353   187,842          0.25                25
* from seed 2015 mapping file
Count data
The count data are presented as a table that reports, for each sample, the number of sequence fragments assigned to each gene.
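Such a count table can be built by tallying, per sample, the fragments assigned to each gene. The sketch below assumes the assignments are available as (sample, gene) pairs; the gene names and the assignments themselves are hypothetical.

```python
from collections import defaultdict


def build_count_table(assignments):
    """Tally fragments per gene and sample from (sample, gene) pairs."""
    table = defaultdict(lambda: defaultdict(int))
    for sample, gene in assignments:
        table[gene][sample] += 1
    return table


# Hypothetical fragment assignments for two samples.
assignments = [
    ("A1", "gene_1"), ("A1", "gene_1"), ("A1", "gene_2"),
    ("M1", "gene_1"), ("M1", "gene_3"), ("M1", "gene_3"),
]
table = build_count_table(assignments)
samples = ["A1", "M1"]
print("gene", *samples)
for gene in sorted(table):
    print(gene, *(table[gene][sample] for sample in samples))
```

Genes with no fragments in a sample appear as 0, which keeps the matrix rectangular for PCA and correlation analysis.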
PCA & correlations
[Figure legend: Israel organic, Israel conventional, US conventional]
Differential abundance of enzymes in the KEGG metabolic pathway
Columns, in order: compounds_contig_conventional, compounds_contig_organic, compounds_gene_conventional, compounds_gene_organic, pathway
   0    5    0    6  Cutin, suberine and wax biosynthesis
   0    5    0    4  Biosynthesis of alkaloids derived from shikimate pathway
   0   10    0    9  Drug metabolism - cytochrome P450
   5    0    5    0  Glycerophospholipid metabolism
   2    6    2    6  Tyrosine metabolism
   0    4    0    4  Bisphenol degradation
   2    4    2    4  Penicillin and cephalosporin biosynthesis
   0    6    0    5  Chlorocyclohexane and chlorobenzene degradation
  10    1   10    1  Steroid hormone biosynthesis
   3    1    3    0  Inflammatory mediator regulation of TRP channels
   0    6    0    6  Isoquinoline alkaloid biosynthesis
  17    0   17    0  Arachidonic acid metabolism
   0    7    0    7  Aminobenzoate degradation
   0    6    0    6  Retinol metabolism
   8    0    8    0  Flavonoid biosynthesis
   7    1    6    1  Flavone and flavonol biosynthesis
  11    0   11    0  Fluorobenzoate degradation
  12    0   12    0  Anthocyanin biosynthesis
   8    0    8    0  Betalain biosynthesis
  12    0   12    0  Steroid biosynthesis
   0   21    0   21  Polycyclic aromatic hydrocarbon degradation
  14    0   14    0  Porphyrin and chlorophyll metabolism
   0    9    0    9  Amino sugar and nucleotide sugar metabolism
   4    2    4    1  Biosynthesis of plant secondary metabolites
   5    0    5    0  Biosynthesis of type II polyketide products
   1   10    1   10  Ubiquinone and other terpenoid-quinone biosynthesis
   5    0    5    0  Linoleic acid metabolism
  21    4   21    4  Biosynthesis of 12-, 14- and 16-membered macrolides
   4    1    4    1  Glycine, serine and threonine metabolism
Thank you