CBC Data Therapy - University of Connecticut
Transcript of CBC Data Therapy - University of Connecticut
GeneralWorkflow
Institute for Systems Genomics: Computational Biology Core
bioinformatics.uconn.edu
Microbial sample
Generate “Meta-omic” data
Process data (QC,
etc.)Analysis
MarkerGenes
Institute for Systems Genomics: Computational Biology Core
bioinformatics.uconn.edu
Extract DNA
Amplify with
targeted primers
Filter errors, build
clusters
Diversity analysis
Metagenomics
Institute for Systems Genomics: Computational Biology Core
bioinformatics.uconn.edu
Extract DNA
Sequence random
fragments
QC, assemble, annotate
Diversity, function analysis
Metatranscriptomics
Institute for Systems Genomics: Computational Biology Core
bioinformatics.uconn.edu
Extract RNA,
subtract rRNA
SequencecDNA QC
Geneexpression,function
Sequencing
Institute for Systems Genomics: Computational Biology Core
bioinformatics.uconn.edu
Sanger Ion Torrent Roche 454
Illumina *Seq Pacific Biosciences Nanopore
Resources(16S)
Institute for Systems Genomics: Computational Biology Core
bioinformatics.uconn.edu
SILVA: Quast et al. NAR (2015)
rrnDB: Stoddard et al. NAR (2016)
RDP II: Cole et al. NAR (2013)
Resources(Genomes)
Institute for Systems Genomics: Computational Biology Core
bioinformatics.uconn.edu
PATRIC (host and bacterial)GenBank Genomes
GOLD (JGI metagenomes) Ensembl Genomes
Resources(Metagenomes)
Institute for Systems Genomics: Computational Biology Core
bioinformatics.uconn.edu
EBI metagenomicsMG-RAST
HMP DACC
Resources(Function)
Institute for Systems Genomics: Computational Biology Core
bioinformatics.uconn.edu
KEGG
UniProtKB
CARD
Gene Ontology
GeneralChallenges/Considerations
Institute for Systems Genomics: Computational Biology Core
bioinformatics.uconn.edu
• Sequencing errors• Error rates, error type (PacBio: 10% random, Illumina – 0.1% substitution)
• Chimeras• Amplification artifacts, cloning of restriction fragments
• 16S: different V regions give different results• Different sequencing platforms / sampling conditions ALSO give
different results• Workflow complexity / plethora of tools
GeneralChallenges/Considerations
Institute for Systems Genomics: Computational Biology Core
bioinformatics.uconn.edu
•Strain-level diversity in metagenomes will often be missed by amplicon (esp. short-read) and shotgun approaches• This may be especially important between samples
•Taxonomy•Database predictions (RDP)
•Functional Annotation• Coverage versus accuracy
MarkerGenes
Institute for Systems Genomics: Computational Biology Core
bioinformatics.uconn.edu
• EukaryoticOrganisms(protists,fungi)• 18S(http://www.arb-silva.de)• ITS(http://www.mothur.org/wiki/UNITE_ITS_database)
• Bacteria• CPN60(http://www.cpndb.ca/cpnDB/home.php)• ITS(Martiny,Env Micro2009)• RecA gene
• Viruses• Gp23forT4-likebacteriophage• RdRp forpicornaviruses
Fasterevolvingmarkersusedforstrain-leveldifferentiation
MarkerGenes
Institute for Systems Genomics: Computational Biology Core
bioinformatics.uconn.edu
• Focusoncontaminationreductionduringpreparation• 16SrRNA contains9hypervariableregions(V1-V9)• V4waschosenbecauseofitssize(suitableforIllumina150bppaired-endsequencing)andphylogeneticresolution
• DifferentVregionshave differentphylogeneticresolutions– givingrisetoslightlydifferentcommunitycompositionresults
• Sequencing:• MiSeq capacityallowsmultiplesamplestobecombinedintoasinglerun• Numberofreadsneededtodifferentiatesamplesdependsonthenatureofthestudies• UniqueDNAbarcodescanbeincorporatedintoyourampliconstodifferentiatesamples
MarkerGenes
Institute for Systems Genomics: Computational Biology Core
bioinformatics.uconn.edu
• QIIME(http://qiime.org)
• Mothur (http://www.mothur.org)
Bioinformatics
Institute for Systems Genomics: Computational Biology Core
bioinformatics.uconn.edu
OverallBioinformaticsWorkflow
Sequencedata(fastq)
Metadata aboutsamples(mapping
file)
1)Preprocessing:removeprimers,
demultiplex,qualityfilter,decontamination
2)OTUPicking /RepresentativeSequences
5)SequenceAlignment
3)TaxonomicAssignment
4)BuildOTUTable(BIOMfile)
6)PhylogeneticAnalysis
PhylogeneticTreeOTUTable
7)DownstreamanalysisandVisualization
– knowledgediscovery
ProcessedData
Inputs
Outputs
QIIMEversusMOTHUR
Institute for Systems Genomics: Computational Biology Core
bioinformatics.uconn.edu
QIIME Mothur
Apythoninterfaceto gluetogethermanyprograms Singleprogramwithminimalexternaldependency
Wrappersforexistingprograms Reimplementationofpopular algorithms
Large numberofdependencies/VMavailable Easy toinstallandsetup;workbestonsinglemulti-coreserverwithlotsofmemory
Morescalable Lessscalable
Steeperlearningcurvebutmoreflexibleworkflowifyou canwriteyourownscripts
Easytolearnandworksthebestwithbuilt-intools
http://www.ncbi.nlm.nih.gov/pubmed/24060131 http://www.mothur.org/wiki/MiSeq_SOP
Metagenomics
Institute for Systems Genomics: Computational Biology Core
bioinformatics.uconn.edu
• Goal:Identifytherelativeabundanceofdifferentmicrobesinasamplegivenusingmetagenomics• Problems:
• Readsareallmixedtogether• Readscanbeshort(~100bp)• Lateralgenetransfer
• Twobroadapproaches1. BinningBased2. MarkerBased
Metagenomics
Institute for Systems Genomics: Computational Biology Core
bioinformatics.uconn.edu
• Attemptsto“bin”readsintothegenomefromwhichtheyoriginated• Composition-based
• UsesGCcompositionork-mers (e.g.NaïveBayesClassifier)• Generallynotverypreciseandnotrecommended
• Sequence-based• ComparereadstolargereferencedatabaseusingBLAST(orsomeothersimilaritysearchmethod)• Readsareassignedbasedon“Best-hit”or“LowestCommonAncestor”approach
LCA
Institute for Systems Genomics: Computational Biology Core
bioinformatics.uconn.edu
• UseallBLASThitsaboveathresholdandassigntaxonomyatthelowestlevelinthetreewhichcoversthesetaxa.
• NotableExamples:• MEGAN:http://ab.inf.uni-tuebingen.de/software/megan5/
• Oneofthefirstmetagenomic tools• Doesfunctionalprofilingtoo!
• MG-RAST:https://metagenomics.anl.gov/• Web-basedpipeline(mightneedtowaitawhileforresults)
• Kraken:https://ccb.jhu.edu/software/kraken/• Fastestbinningapproachtodateandveryaccurate.• Largecomputingrequirements(e.g.>128GBRAM)
Metagenomic Assembly
Institute for Systems Genomics: Computational Biology Core
bioinformatics.uconn.edu
• “MetaSPAdes showedtheoverallbestassemblysizestatisticswhilealsocapturingarelativelylargefractionoftheexpecteddiversity.Theusageofthistoolisrelativelysimpleandconvenient,beingbasicallyidenticaltothatofSPAdes,andlargelyflexibleregardingtheformatoftheinputdata.Adrawbackmaybethereducedsensitivityformicrodiversity.However,forthemajorityofmetagenomeresearchquestions,accurateandrepresentativeconsensusgenomesofspeciesshouldbemorethansufficient. ”