CBC Data Therapy - University of Connecticut

21
CBC Data Therapy Metagenomics Discussion

Transcript of CBC Data Therapy - University of Connecticut

CBCDataTherapyMetagenomicsDiscussion

GeneralWorkflow

Institute for Systems Genomics: Computational Biology Core

bioinformatics.uconn.edu

Microbial sample

Generate “Meta-omic” data

Process data (QC,

etc.)Analysis

MarkerGenes

Institute for Systems Genomics: Computational Biology Core

bioinformatics.uconn.edu

Extract DNA

Amplify with

targeted primers

Filter errors, build

clusters

Diversity analysis

Metagenomics

Institute for Systems Genomics: Computational Biology Core

bioinformatics.uconn.edu

Extract DNA

Sequence random

fragments

QC, assemble, annotate

Diversity, function analysis

Metatranscriptomics

Institute for Systems Genomics: Computational Biology Core

bioinformatics.uconn.edu

Extract RNA,

subtract rRNA

SequencecDNA QC

Geneexpression,function

Sequencing

Institute for Systems Genomics: Computational Biology Core

bioinformatics.uconn.edu

Sanger Ion Torrent Roche 454

Illumina *Seq Pacific Biosciences Nanopore

Resources(16S)

Institute for Systems Genomics: Computational Biology Core

bioinformatics.uconn.edu

SILVA: Quast et al. NAR (2015)

rrnDB: Stoddard et al. NAR (2016)

RDP II: Cole et al. NAR (2013)

Resources(Genomes)

Institute for Systems Genomics: Computational Biology Core

bioinformatics.uconn.edu

PATRIC (host and bacterial)GenBank Genomes

GOLD (JGI metagenomes) Ensembl Genomes

Resources(Metagenomes)

Institute for Systems Genomics: Computational Biology Core

bioinformatics.uconn.edu

EBI metagenomicsMG-RAST

HMP DACC

Resources(Function)

Institute for Systems Genomics: Computational Biology Core

bioinformatics.uconn.edu

KEGG

UniProtKB

CARD

Gene Ontology

GeneralChallenges/Considerations

Institute for Systems Genomics: Computational Biology Core

bioinformatics.uconn.edu

• Sequencing errors• Error rates, error type (PacBio: 10% random, Illumina – 0.1% substitution)

• Chimeras• Amplification artifacts, cloning of restriction fragments

• 16S: different V regions give different results• Different sequencing platforms / sampling conditions ALSO give

different results• Workflow complexity / plethora of tools

GeneralChallenges/Considerations

Institute for Systems Genomics: Computational Biology Core

bioinformatics.uconn.edu

•Strain-level diversity in metagenomes will often be missed by amplicon (esp. short-read) and shotgun approaches• This may be especially important between samples

•Taxonomy•Database predictions (RDP)

•Functional Annotation• Coverage versus accuracy

MarkerGenes

Institute for Systems Genomics: Computational Biology Core

bioinformatics.uconn.edu

• EukaryoticOrganisms(protists,fungi)• 18S(http://www.arb-silva.de)• ITS(http://www.mothur.org/wiki/UNITE_ITS_database)

• Bacteria• CPN60(http://www.cpndb.ca/cpnDB/home.php)• ITS(Martiny,Env Micro2009)• RecA gene

• Viruses• Gp23forT4-likebacteriophage• RdRp forpicornaviruses

Fasterevolvingmarkersusedforstrain-leveldifferentiation

MarkerGenes

Institute for Systems Genomics: Computational Biology Core

bioinformatics.uconn.edu

• Focusoncontaminationreductionduringpreparation• 16SrRNA contains9hypervariableregions(V1-V9)• V4waschosenbecauseofitssize(suitableforIllumina150bppaired-endsequencing)andphylogeneticresolution

• DifferentVregionshave differentphylogeneticresolutions– givingrisetoslightlydifferentcommunitycompositionresults

• Sequencing:• MiSeq capacityallowsmultiplesamplestobecombinedintoasinglerun• Numberofreadsneededtodifferentiatesamplesdependsonthenatureofthestudies• UniqueDNAbarcodescanbeincorporatedintoyourampliconstodifferentiatesamples

MarkerGenes

Institute for Systems Genomics: Computational Biology Core

bioinformatics.uconn.edu

• QIIME(http://qiime.org)

• Mothur (http://www.mothur.org)

Bioinformatics

Institute for Systems Genomics: Computational Biology Core

bioinformatics.uconn.edu

OverallBioinformaticsWorkflow

Sequencedata(fastq)

Metadata aboutsamples(mapping

file)

1)Preprocessing:removeprimers,

demultiplex,qualityfilter,decontamination

2)OTUPicking /RepresentativeSequences

5)SequenceAlignment

3)TaxonomicAssignment

4)BuildOTUTable(BIOMfile)

6)PhylogeneticAnalysis

PhylogeneticTreeOTUTable

7)DownstreamanalysisandVisualization

– knowledgediscovery

ProcessedData

Inputs

Outputs

QIIMEversusMOTHUR

Institute for Systems Genomics: Computational Biology Core

bioinformatics.uconn.edu

QIIME Mothur

Apythoninterfaceto gluetogethermanyprograms Singleprogramwithminimalexternaldependency

Wrappersforexistingprograms Reimplementationofpopular algorithms

Large numberofdependencies/VMavailable Easy toinstallandsetup;workbestonsinglemulti-coreserverwithlotsofmemory

Morescalable Lessscalable

Steeperlearningcurvebutmoreflexibleworkflowifyou canwriteyourownscripts

Easytolearnandworksthebestwithbuilt-intools

http://www.ncbi.nlm.nih.gov/pubmed/24060131 http://www.mothur.org/wiki/MiSeq_SOP

Metagenomics

Institute for Systems Genomics: Computational Biology Core

bioinformatics.uconn.edu

• Goal:Identifytherelativeabundanceofdifferentmicrobesinasamplegivenusingmetagenomics• Problems:

• Readsareallmixedtogether• Readscanbeshort(~100bp)• Lateralgenetransfer

• Twobroadapproaches1. BinningBased2. MarkerBased

Metagenomics

Institute for Systems Genomics: Computational Biology Core

bioinformatics.uconn.edu

• Attemptsto“bin”readsintothegenomefromwhichtheyoriginated• Composition-based

• UsesGCcompositionork-mers (e.g.NaïveBayesClassifier)• Generallynotverypreciseandnotrecommended

• Sequence-based• ComparereadstolargereferencedatabaseusingBLAST(orsomeothersimilaritysearchmethod)• Readsareassignedbasedon“Best-hit”or“LowestCommonAncestor”approach

LCA

Institute for Systems Genomics: Computational Biology Core

bioinformatics.uconn.edu

• UseallBLASThitsaboveathresholdandassigntaxonomyatthelowestlevelinthetreewhichcoversthesetaxa.

• NotableExamples:• MEGAN:http://ab.inf.uni-tuebingen.de/software/megan5/

• Oneofthefirstmetagenomic tools• Doesfunctionalprofilingtoo!

• MG-RAST:https://metagenomics.anl.gov/• Web-basedpipeline(mightneedtowaitawhileforresults)

• Kraken:https://ccb.jhu.edu/software/kraken/• Fastestbinningapproachtodateandveryaccurate.• Largecomputingrequirements(e.g.>128GBRAM)

Metagenomic Assembly

Institute for Systems Genomics: Computational Biology Core

bioinformatics.uconn.edu

• “MetaSPAdes showedtheoverallbestassemblysizestatisticswhilealsocapturingarelativelylargefractionoftheexpecteddiversity.Theusageofthistoolisrelativelysimpleandconvenient,beingbasicallyidenticaltothatofSPAdes,andlargelyflexibleregardingtheformatoftheinputdata.Adrawbackmaybethereducedsensitivityformicrodiversity.However,forthemajorityofmetagenomeresearchquestions,accurateandrepresentativeconsensusgenomesofspeciesshouldbemorethansufficient. ”