Post on 25-Feb-2016
Overview of Vibrio vulnificus and V. navarrensis
Computational assembly for prokaryotic sequencing projects
Lee Katz, Ph.D.
Bioinformatician, Enteric Diseases Laboratory Branch
January 15, 2014
Disclaimers
The findings and conclusions in this presentation have not been formally disseminated by the Centers for Disease Control and Prevention and should not be construed to represent any agency determination or policy.
The findings and conclusions in this presentation are those of the author and do not necessarily represent the official position of CDC.
Partners in Public Health
Graduated Oct 2010
CDC 2010 - present
Lee Katz, Present
Currently in the National Enteric Reference Laboratory
Vibrio, Campylobacter, Escherichia, Shigella, Yersinia, Salmonella
Focusing on Listeria and Vibrio
One of my projects is #2 on CDC's list of accomplishments for 2013!
#2
http://www.cdc.gov/features/endofyear/

Outline
Sequencing
  1st gen
  2nd gen
  3rd gen
Reads
  Quality control (Q/C)
  Read metrics
  Read-cleaning
Assembly
  Algorithms
  Assembly metrics
Prokaryotic Sequencing Projects
Stages
  Sequencing
  Assembly
  Feature prediction
  Functional annotation
  Analysis
  Display (Genome Browser)
Examples
  Haemophilus influenzae
  Neisseria meningitidis
  Bordetella bronchiseptica
  Vibrio cholerae
  Listeria monocytogenes
Fleischmann et al. (1995) Whole-Genome Random Sequencing and Assembly of Haemophilus influenzae Rd. Science 269:5223
Kislyuk et al. (2010) A computational genomics pipeline for prokaryotic sequencing projects. Bioinformatics 26:15

Out with the old; in with the new:
Two new technologies to the compgenomics class!
  454
  Illumina single end reads
  Illumina paired end reads
  PacBio

Sanger Sequencing (1st gen)
Sequencing: first generation
Margulies et al. (2005) Genome sequencing in open microfabricated high density picoliter reactors. Nature 437:7057
Sanger sequencing output
Usually .ab1/.scf file format
454 Sequencing (2nd Gen)
454 Pyrosequencing
  Mix DNA library & capture beads (limited dilution)
  Create water-in-oil emulsion (+ PCR reagents + emulsion oil)
  Perform emulsion PCR
  Break micro-reactors
  Isolate DNA-containing beads
454 Pyrosequencing
  Load beads into PicoTiterPlate
  Load enzyme beads
  PicoTiterPlate
    Well diameter = 44 µm
    Depth = 55 µm
    Well size = 75 pl
    Well density = 480 wells mm^-2
    1.6 million wells per slide
454 Pyrosequencing
  Reagent flow
  Sequencing by synthesis
  Photons generated are captured by CCD camera
  Margulies et al., 2005
454 sequencing output
  Flow order: TACG
  Signal levels: 1-mer, 2-mer, 3-mer, 4-mer
  KEY (TCAG)
  Measures the presence or absence of each nucleotide at any given position
  Flowgram (.sff file format)

Illumina sequencing (2nd Gen)
[Figure: library fragment on the flow cell. Labels: region complementary to P5 grafting primer, Index 2, P5 primer, DNA insert, P7 primer, Index 1, P7 grafting primer, P5 grafting primer, flow cell surface]
The following animations are courtesy of Illumina, Inc.
Animation steps:
  SBS Sequencing Primer Hybridization
  Sequence (Cycle 1)
  Index 1 Seq Primer Hybridization
  Index 1 read: 8 cycles
  Unblock (P5 grafting primer)
  7 dark cycles
  Index 2 index read: 8 cycles
  Extension (original strand, new strand)
  Linearization
Illumina sequencing video: http://www.youtube.com/watch?v=womKfikWlxM

PacBio sequencing* (3rd Gen)
*Pacific Biosciences
http://www.youtube.com/watch?v=NHCJ8PtYCFc
Eid et al., Science, January 2009, doi:10.1126/science.1162986
Thanks to PacBio for donating some slide materials in this section
SMRT Bell
Zero-mode waveguide (ZMW), a very fancy and very small well
PacBio video: http://www.youtube.com/watch?v=NHCJ8PtYCFc

Reads
Q/C + cleaning + metrics
Q/C
  You need to know if your data are good!
  Example software
    FastQC
    Computational Genomics Pipeline (CG-Pipeline)
Quality Control
FastQC output
Quality Control bioinformatics
FastQC output
The CG-Pipeline way
  run_assembly_readMetrics.pl
  File       avgReadLength  totalBases  minReadLength  maxReadLength  avgQuality
  tmp.fastq  80.00          177777760   80             80             35.39

Read cleaning
Read cleaning with CG-Pipeline (not validated; please use with caution)
http://sourceforge.net/projects/cg-pipeline/
[Figure: %ACGT and Phred plots for the forward (F.) and reverse (R.) reads. Graphs made with FastqQC (AMOS)]
1. Trimming low-qual ends
   run_assembly_trimLowQualEnds.pl
   [Figure: 1A. %ACGT and 1B. Phred plots for the forward (F.) and reverse (R.) reads. Graphs made with FastqQC (AMOS)]
2a. Removing duplicate reads
2b. Sometimes: downsampling
   run_assembly_removeDuplicateReads.pl
3. Trimming and filtering (input: trimmed reads)
   run_assembly_trimClean.pl
   3A. trimming
   3B. filtering
   Criteria: min length, min avg. quality
More software
  Fastx toolkit http://hannonlab.cshl.edu/fastx_toolkit/
  EA-utils https://code.google.com/p/ea-utils/
  AMOS (amos: SourceForge.net)
  and more is out there!
Evaluation
  Del Fabbro et al. 2013, An extensive evaluation of read trimming effects on Illumina NGS data analysis
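The cleaning steps above (trim low-quality ends, remove duplicate reads, filter on minimum length and minimum average quality) can be sketched in a few lines of Python. This is only an illustration of the idea, not how the CG-Pipeline scripts are implemented. The function names, the Phred+33 encoding, and all thresholds are assumptions chosen for the example.

```python
def phred_scores(qual, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual]


def trim_low_qual_ends(seq, qual, min_qual=20):
    """Step 1: drop bases from both ends while their quality is below min_qual."""
    scores = phred_scores(qual)
    start, end = 0, len(scores)
    while start < end and scores[start] < min_qual:
        start += 1
    while end > start and scores[end - 1] < min_qual:
        end -= 1
    return seq[start:end], qual[start:end]


def remove_duplicate_reads(reads):
    """Step 2a: keep only the first read seen for each distinct sequence."""
    seen = set()
    unique = []
    for seq, qual in reads:
        if seq not in seen:
            seen.add(seq)
            unique.append((seq, qual))
    return unique


def passes_filter(seq, qual, min_length=4, min_avg_qual=20):
    """Step 3B: require a minimum length and a minimum average quality."""
    if not seq or len(seq) < min_length:
        return False
    scores = phred_scores(qual)
    return sum(scores) / len(scores) >= min_avg_qual


def clean_reads(reads, min_qual=20, min_length=4, min_avg_qual=20):
    """Run trimming, duplicate removal, and filtering in that order."""
    trimmed = [trim_low_qual_ends(seq, qual, min_qual) for seq, qual in reads]
    unique = remove_duplicate_reads(trimmed)
    return [(s, q) for s, q in unique if passes_filter(s, q, min_length, min_avg_qual)]


# Toy reads as (sequence, quality) pairs: '#' is Phred 2, '5' is Phred 20, 'I' is Phred 40.
toy_reads = [
    ("ACGTACGT", "##IIII##"),      # low-quality ends get trimmed
    ("TTGTACAA", "##IIII##"),      # identical to the first read after trimming
    ("ACGTACGT", "########"),      # nothing survives trimming
    ("GGCCGGCCAA", "IIIIIIII5#"),  # only the last base is trimmed
]
cleaned = clean_reads(toy_reads)
for seq, qual in cleaned:
    print(seq, qual)
```

The thresholds here are tiny so the toy reads exercise every step. Real data would need values suited to the read length and sequencing platform.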
Assembly
Algorithms + metrics

Whole genome sequencing: WGS
Large pieces and de novo assembly
Business dog http://www.buzzfeed.com/tiad/business-dog

Whole genome sequencing: WGS
Small pieces and reference assembly
Business cat http://www.quickmeme.com/Business-Cat/

Assembly
  Overlaps between reads
  Generate contigs (contiguous sequences)
  Generate scaffolds

Derive consensus sequence
Slide adapted from Andrey Kislyuk, http://www.compgenomics2009.biology.gatech.edu/images/1/12/2009-01-14-compgenomics-kislyuk.pdf
TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA

TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
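One way to get a consensus row like the one above from the aligned reads is a per-column vote. The sketch below is a minimal illustration, not the method of any particular assembler. The function name and the per-read weights are assumptions; with equal weights it reduces to a simple majority vote.

```python
from collections import defaultdict


def weighted_consensus(rows, weights=None):
    """Pick, for every alignment column, the symbol with the largest total weight.

    rows    -- aligned reads of equal length ('-' marks a gap)
    weights -- one weight per read (hypothetical; for example a quality-derived
               confidence). Defaults to 1.0 per read, i.e. a majority vote.
    """
    if not rows:
        raise ValueError("need at least one aligned read")
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all aligned reads must have the same length")
    if weights is None:
        weights = [1.0] * len(rows)
    if len(weights) != len(rows):
        raise ValueError("need exactly one weight per read")

    consensus = []
    for col in range(width):
        votes = defaultdict(float)
        for row, weight in zip(rows, weights):
            votes[row[col]] += weight
        # Iterate symbols in sorted order so that ties are broken deterministically.
        consensus.append(max(sorted(votes), key=votes.get))
    return "".join(consensus)


# The five aligned reads from the slide.
aligned_reads = [
    "TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAAACTA",
    "TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA",
]
print(weighted_consensus(aligned_reads))
```

With equal weights, each disagreement in the example is outvoted four to one, so the result matches the consensus row on the slide. Giving one read a much larger weight lets it win columns where it disagrees with the others.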
Derive each consensus base by weighted voting

Recap of assembly
  Scaffold
  contigs
  Paired end reads
  reads

CG-Pipeline way for Illumina
  run_assembly reads.fastq.gz -o assembly.fasta
No module yet in CGP for PacBio, unfortunately
Be on the lookout for several papers that compare Illumina assemblers.

PacBio Assembly
The following slides are courtesy of PacBio
Finishing Genomes Using Only PacBio Reads
  Utilizes all PacBio data from a single, long-insert library
  Longest reads for continuity
  All reads for high consensus accuracy
  Now available through SMRT Portal in SMRT Analysis v2.0.1
Hierarchical Genome Assembly Process (HGAP)
Chin et al. (2013), Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nature Methods. doi:10.1038/nmeth.2474
Hierarchical Genome Assembly Process significantly advances our understanding of microbial genomes using only PacBio reads. High-quality and high-accuracy microbial genomes can be obtained from genomic DNA to final assembly in a few days.

Hierarchical Genome Assembly Process (HGAP)
  Start with long seed reads
  Align other reads
  Build consensus
  Construct accurate (>99%) pre-assembled reads

HGAP Example - Meiothermus ruber
  Pre-assembly
  Celera Assembler
  Polish, Quiver
[Figure: read length profile. Labels: 250 Mb, >5 kb]
Collaboration with A. Clum, A. Copeland (Joint Genome Institute)
In a collaboration with the Joint Genome Institute, we demonstrated that the HGAP assembly method could de novo assemble the M. ruber genome in three SMRT Cells.
First, a single, large-insert library (10 kb) was generated. From that library, 3 SMRT Cells were run on a PacBio RS instrument with the C2-C2 chemistry in use at the time. This produced 250 Mb of data, with the read length profile shown on the right.
The PacBio reads >5 kb were selected as the seed reads. All the other reads were aligned to these long reads in a pre-assembly step.
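The seed selection described above amounts to splitting the read set by length. The sketch below shows only that split. The alignment and consensus steps are not shown, and the function name and toy read names are made up for the example. The 5,000 base cutoff is the ">5 kb" from the M. ruber example, not a general recommendation.

```python
def split_seed_reads(reads, min_seed_length=5000):
    """Split reads into seed reads (strictly longer than the cutoff) and the rest.

    reads -- dict mapping read name to sequence
    Returns (seeds, others); in HGAP-style pre-assembly the 'others' would then
    be aligned to the seeds, which is not done here.
    """
    seeds = {}
    others = {}
    for name, seq in reads.items():
        if len(seq) > min_seed_length:
            seeds[name] = seq
        else:
            others[name] = seq
    return seeds, others


# Toy reads with artificial sequences; only the lengths matter here.
toy_reads = {
    "read_a": "A" * 7200,
    "read_b": "C" * 5000,  # exactly 5 kb, so not a seed under a strict ">5 kb" rule
    "read_c": "G" * 1800,
    "read_d": "T" * 9100,
}
seeds, others = split_seed_reads(toy_reads)
seed_bases = sum(len(seq) for seq in seeds.values())
total_bases = sum(len(seq) for seq in toy_reads.values())
print(sorted(seeds), f"{seed_bases / total_bases:.0%} of bases are in seed reads")
```

Tracking the share of bases that land in the seed set is useful because the cutoff trades continuity (longer seeds) against how much data remains to correct them.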
Following pre-assembly to the 5 kb seed reads, the alignment identity of the >5 kb reads improved to about 99%. These pre-assembled long reads are the input to the assembly algorithms.
HGAP Example - Meiothermus ruber
  Pre-assembly
  Celera Assembler
  Minimus2
  Quiver
Result: 1 contig
  Single-contig assembly
  99.99965% concordance with reference
  99.3% genes predicted
From a single, large-insert library, a single-contig assembly was generated with >99.999% concordance with JGI's reference.
Polish with Quiver for High Accuracy
  Organism: Meiothermus ruber
  Assembly size (bases): 3,098,781
  Differences with Sanger reference: 11
  Concordance with Sanger reference: 99.99965%
  Nominal QV: 54.5
  SNPs validated as correct PacBio calls: 8
  Remaining differences: 1 (3)
  QV: 60
[Figure labels: M. ruber Sanger reference, PacBio reads]
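The concordance and nominal QV in the table follow from the number of differences and the assembly size, assuming QV is the usual Phred-scaled value, -10 * log10(differences / assembly size). The sketch below redoes that arithmetic with the numbers from the table. The function name is made up for the example.

```python
import math


def concordance_and_qv(differences, assembly_size):
    """Return (percent concordance, Phred-scaled QV) for a count of differences.

    Assumes QV = -10 * log10(differences / assembly_size).
    """
    if differences <= 0 or assembly_size <= 0:
        raise ValueError("differences and assembly_size must be positive")
    error_rate = differences / assembly_size
    concordance = (1.0 - error_rate) * 100.0
    qv = -10.0 * math.log10(error_rate)
    return concordance, qv


ASSEMBLY_SIZE = 3_098_781  # M. ruber assembly size from the table

concordance, nominal_qv = concordance_and_qv(11, ASSEMBLY_SIZE)
print(f"concordance = {concordance:.5f}%, nominal QV = {nominal_qv:.1f}")

# Assumption: treat the 3 differences shown in parentheses as the remaining errors.
_, final_qv = concordance_and_qv(3, ASSEMBLY_SIZE)
print(f"QV with 3 remaining differences = {final_qv:.1f}")
```

With 11 differences this gives 99.99965% and QV 54.5, matching the table. With 3 remaining differences it gives about 60.1, in line with the final QV of 60, though how that final figure was derived is an assumption here.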
Targeted Sanger validation
To characterize the 11 differences that remained between the HGAP assembly and the original Sanger reference, targeted Sanger validation was done on the sequenced M. ruber sample. Of the 9 clones that could be amplified, eight were validated as being correct in the PacBio consensus sequence, and one was different. The remaining three could not be validated. The final QV for the PacBio consensus sequences was at least 60.

Estimated Coverage Targets for Finishing Smaller Genomes
Assembly Approach / Software Tool | Recommended PacBio Coverage | Additional Data Sets | Genome Size Constraints
Hierarchical
  SMRT Analysis implementation of HGAP (uses Celera Assembler 7.0) | 75-100X PacBio CLR | None | < 10 MB (SMRT Portal), < 130 MB (Command Line)
  Celera Assembler via PacBiotoCA (recent compilation), see Koren et al (2013) http://arxiv.org/abs/1304.3752 | 75-100X PacBio CLR | None | Similar to above
Hybrid
  Celera Assembler 7.0 with PacBiotoCA (SMRT Analysis) | 20-50X PacBio CLR | 50X short reads |
  ALLPATHS-LG | 50X PacBio 3 kb CLR | 50X Illumina PE, 50X Illumina jumping libraries | 20 MB
  MIRA (with PacBiotoCA) | 20-50X PacBio CLR | 50X short reads |
Scaffolding
  AHA (SMRT Analysis) | 10X PacBio CLR | High-confidence contigs |
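To relate the coverage targets above to the M. ruber example, fold coverage can be estimated as total sequenced bases divided by genome size. The sketch below assumes "250 Mb" means 250 million bases and uses the 3,098,781-base assembly size as the genome size. The function names are made up for the example.

```python
def fold_coverage(total_bases, genome_size):
    """Average fold coverage: sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def bases_needed(target_coverage, genome_size):
    """Sequenced bases required to reach a target fold coverage."""
    return target_coverage * genome_size


GENOME_SIZE = 3_098_781      # M. ruber assembly size, used here as the genome size
SEQUENCED_BASES = 250_000_000  # "250 Mb" of PacBio data, taken as 250 million bases

coverage = fold_coverage(SEQUENCED_BASES, GENOME_SIZE)
print(f"M. ruber example: about {coverage:.1f}X")

# Bases needed to hit the 75-100X range recommended for HGAP.
for target in (75, 100):
    print(f"{target}X needs about {bases_needed(target, GENOME_SIZE) / 1e6:.0f} Mb")
```

This puts the three SMRT Cells at roughly 80X, which sits inside the 75-100X range recommended for HGAP in the table.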