A different kettle of fish entirely: bioinformatic challenges and solutions for whole de novo genome...



description

A talk I gave at the 4th yearly seminar of the Norwegian Sequencing Centre (www.sequencing.uio.no)

Transcript of A different kettle of fish entirely: bioinformatic challenges and solutions for whole de novo genome...

  • 1. A different kettle of fish entirely: bioinformatic challenges and solutions for whole de novo genome assembly of Atlantic cod and Atlantic salmon. Lex Nederbragt, NSC and CEES, @lexnederbragt

  • 2. Developments in High Throughput Sequencing
  • 3. Developments in High Throughput Sequencing. [Figure: gigabases per run (log scale) against read length (log scale) for ABI 3730xl (Sanger), Roche/454 GS FLX and GS Junior, Illumina GA II, HiSeq and MiSeq, Life Tech SOLiD, Ion Torrent PGM and Ion Proton, and PacBio RS.] http://dx.doi.org/10.6084/m9.figshare.100940
  • 4. Developments in High Throughput Sequencing. [Same figure, with read-length classes marked: short, intermediate, Sanger-like, long.] http://dx.doi.org/10.6084/m9.figshare.100940
  • 5. Developments in High Throughput Sequencing. [Same figure as slide 4.] http://dx.doi.org/10.6084/m9.figshare.100940
  • 6. What is this thing called genome assembly?
  • 7. Hierarchical structure: reads, contigs, scaffolds
  • 8. Sequence data. Reads: original DNA fragments, sequenced ends. http://www.cbcb.umd.edu/research/assembly_primer.shtml
  • 9. Reads! http://www.sciencephoto.com/media/210915/enlarge
  • 10. Contigs: building contigs. Aligned reads: ACGCGATTCAGGTTACCACG, GCGATTCAGGTTACCACGCG, GATTCAGGTTACCACGCGTA, TTCAGGTTACCACGCGTAGC, CAGGTTACCACGCGTAGCGC, GGTTACCACGCGTAGCGCAT, TTACCACGCGTAGCGCATTA, ACCACGCGTAGCGCATTACA, CACGCGTAGCGCATTACACA, CGCGTAGCGCATTACACAGA, CGTAGCGCATTACACAGATT, TAGCGCATTACACAGATTAG. Consensus contig: ACGCGATTCAGGTTACCACGCGTAGCGCATTACACAGATTAG
  • 11. Contigs: building contigs. Repeat copy 1, repeat copy 2. Contig orientation? Contig order? Collapsed repeat consensus. http://www.cbcb.umd.edu/research/assembly_primer.shtml
  • 12. Mate pairs: other read type. Repeat copy 1, repeat copy 2. (Much) longer fragments; mate pair reads.
  • 13. Mate pairs. Paired end reads: 100-500 bp insert (original DNA fragments, sequenced ends). Mate pairs: 2-20 kb insert. Repeat copy 1, repeat copy 2; mate pair reads.
  • 14. Scaffolds: ordered, oriented contigs. Mate pairs, contigs, gap size estimate. Scaffold: contig, gap, contig. http://dx.doi.org/10.6084/m9.figshare.100940
  • 15. Hierarchical structure. Reads (aligned reads, clipped at the slide edge): ACGCGATTCAGGTTACCACG, GCGATTCAGGTTACCACGCG, GATTCAGGTTACCACGCGTA, TTCAGGTTACCACGCGTAGC, CAGGTTACCACGCGTAGCGC, GGTTACCACGCGTAGCGCAT, TTACCACGCGTAGCGCATTA, ACCACGCGTAGCGCATTACA, CACGCGTAGCGCATTACACA, CGCGTAGCGCATTACACAGA, CGTAGCGCATTACACAGA, TAGCGCATTACACAGA. Contigs (consensus contig): ACGCGATTCAGGTTACCACGCGTAGCGCATTACACAGA. Scaffolds: contig, gap.
  • 16. Why is genome assembly such a difficult problem?
  • 17. 1) Repeats. Repeat copy 1, repeat copy 2. Repeats break up assembly. Collapsed repeat consensus. http://www.cbcb.umd.edu/research/assembly_primer.shtml
  • 18. 2) Diploidy. Differences between sister* chromosomes: heterozygosity**. http://commons.wikimedia.org/wiki/File:Chromosome_1.svg
  • 19. 2) Diploidy. [Figure labels: region 1, polymorphic region 2, polymorphic region 3, region 4; homozygous, heterozygous, homozygous.]
  • 20. 2) Diploidy. http://www.astraean.com/borderwars/wp-content/uploads/2012/04/heterozygoats.jpg and many other sites
  • 21. 3) Polyploidy. http://en.wikipedia.org/wiki/Polyploidy
  • 22. 4) Many programs to choose from. Zhang et al., PLoS ONE 2011
  • 23. The Atlantic salmon and Atlantic cod genome projects. http://kettleoffish.net/
  • 24. Salmon: the players. Sally: the female named Sally with double-haploid genome of estimated length 3 Gbp.
  • 25. Salmon: the genome. Pseudotetraploid; 3 billion bases (Gbp); double haploid. Repeat copy 1, repeat copy 2. 30-35%: repetitive DNA. DNA transposons ~1500 bp: 6-10%. ** Davidson et al., 2010 http://genomebiology.com/2010/11/9/403
  • 26. Salmon: phase 1. Sanger sequencing; Illumina sequencing. Phase 1 assembly: 555 960 sequences; 2.4 Gbp of 3 Gbp; half of that in pieces of 9 300 bp or longer. http://www.flickr.com/photos/jurvetson/57080968/
  • 27. Salmon: phase 2. Illumina sequencing: paired end; mate pair, 3 kb and longer. Phase 2 stated goal: scaffolds greater than 1 Mbp; half the genome in contigs of at least 50 000 bp.
  • 28. Cod: the players. Unnamed Atlantic cod.
  • 29. Cod: the genome. Heterozygote; 850 million bases (Mbp)*; wild-caught**.
  • 30. Cod: phase 1. 454 sequencing (Sanger sequencing). Phase 1 assembly: 157 887 sequences; 753 Mbp of 830 Mbp; half in scaffolds of at least 460 000 bp; half in contigs of at least 2 800 bp.
  • 31. Cod: phase 1
  • 32. Cod: phase 2. Illumina sequencing: paired end, >200x; mate pair 5 kb, >100x. Phase 2 goal: half in scaffolds of at least 1 Mbp; half in contigs of at least 10-15 000 bp.
  • 33. Atlantic salmon and Atlantic cod. Pseudotetraploid; heterozygosity**; long repeats (repeat copy 1, repeat copy 2). Reads, contigs, ? scaffolds*.
  • 34. What we need? Long reads!
  • 35. Longer reads! Repeat copy 1, repeat copy 2. Long reads can span repeats and heterozygous regions. [Figure labels: contig 1, polymorphic contig 2, polymorphic contig 3, contig 4.]
  • 36. Developments in High Throughput Sequencing. [Same figure as slide 3.] http://dx.doi.org/10.6084/m9.figshare.100940
  • 37. PacBio sequencing. Single-molecule. C2 (current) chemistry: average read length 3100 bp; 36 000 reads; 110 Mbp per run.
  • 38. PacBio sequencing. SMRTbell template. Sequencing modes: standard sequencing (large insert sizes; generates one pass on each molecule sequenced; single pass; subreads) and circular consensus sequencing (small insert sizes; generates multiple passes on each molecule sequenced; multiple passes).
  • 39. PacBio: uses. Long reads, low quality: standard sequencing, single pass, 85-87% accuracy. Useful for assembly?
  • 40. Solutions for assembly
  • 41. PacBio for salmon and cod. Libraries: aim for looooong insert sizes.
  • 42. Salmon: PacBio reads. Data set 1: 1.1x coverage; half of all bases in reads at least 5.5 kbp; longest 26.5 kbp; 104 SMRT Cells. Data set 2: latest chemistry and enzyme (C2-XL); 0.7x coverage; by PacBio, Menlo Park; half of all bases in reads at least 6 kbp; longest 25 kbp.
  • 43. Salmon: PacBio reads. Alignments of at least 1 kb to released assembly. [Figure: alignments binned by % identity; axes: portion of the alignments, bin for read accuracy reported in the alignment, cumulative alignment quantity.] Figure courtesy of Jason Miller, JCVI, USA.
  • 44. Salmon: PacBio reads. Repeat copy 1, repeat copy 2. [Figure labels: salmon repeat database, mapping; scaffold (contig, gap), mapping.]
  • 45. Salmon: repeats. 1.6 kb repeats mapped to PacBio reads. [Figure: left flank, repeat, right flank; scale (bp) from 0 to 25 000.]
  • 46. Salmon: repeats. 3-7 kb repeats mapped to PacBio reads. [Figure: left flank, repeat, right flank; scale (bp) from 0 to 25 000.]
  • 47. Salmon: error-correction. PacBioToCA (Jason Miller, JCVI): low fraction of reads recovered; improves contig lengths by enabling new joins. Challenge for error-correction: polymorphic repeat copies (repeat copy 1, repeat copy 2).
  • 48. Salmon: prospect. PacBio reads span even the longest repeats. 3-7 kb repeats mapped to PacBio reads. [Figure: left flank, repeat, right flank; repeat copy 1, repeat copy 2.]
  • 49. Cod: PacBio reads. 8.1x coverage; half of all bases in reads at least 4 kbp; longest 16.5 kbp; 104 SMRT Cells; regular C2 chemistry; Univ. of Oslo, Norway.
  • 50. Cod: PacBio reads. [Figure labels: mapping; scaffold (contig, gap).]
  • 51. Cod: PacBio results. Mapping to the published genome: 11.4 kbp subread, 10.6 kbp subread, 10.9 kbp subread.
  • 52. Cod: example 1. Assembly: ...ACACAC [232 bp gap] TGTGTG...
  • 53. Cod: example 1. ACACAC repeat, 232 bp gap, TGTGTG repeat.
  • 54. Cod: example 1
  • 55. Cod: example 1
  • 56. Cod: example 1. Assembly: ...ACACAC TGTGTG...; ...ACACACAC TGTGTG...; ...ACACACAC TGTGTG... Unplaced region: AC TGTGTG...
  • 57. Cod: example 2. Assembly: ...TGTGTG [344 bp gap]
  • 58. Cod: example 2. TGTGTG repeat, 344 bp gap.
  • 59. Cod: example 2
  • 60. Cod: example 2. Assembly: ...TGTGTG; ...TGTGTG; ...TGTGTG; ...TGTGTG. Heterozygosity?
  • 61. Cod: example 3. Assembly: 300 bp misassembly?
  • 62. Cod: error-correction. P_errorCorrection pipeline from [source not legible in transcript]. 93% of reads recovered; 2.7x; alignments of at least 1 kb to published assembly; +23x; +24 cpus; 4.5 days; 100 Gb RAM.
  • 63. Cod: prospect. PacBio reads span many gaps. PacBio reads may span heterozygous regions. [Figure labels: contig 1, polymorphic contig 2, polymorphic contig 3, contig 4.]
  • 64. Summary. Salmon and cod extra challenging. Assembly is difficult (reads, contigs, scaffolds). PacBio has a huge potential: 3-7 kb repeats mapped to PacBio reads (left flank, repeat, right flank). http://en.wikipedia.org, http://fishandboat.com
  • 65. Acknowledgements. University of Oslo; Jason Miller, JCVI; Pacific Biosciences; Sequencing team NSC; ICSASG; Ole Kristian Tørresen; Kjetill Jakobsen; Sissel Jentoft; Cod genome group.
  • 66. http://wiki.galaxyproject.org/Events/GCC2013
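
The assembly summaries on the salmon and cod slides ("half of that in pieces of 9 300 bp or longer", "half in contigs of at least 2 800 bp") are what is commonly called an N50. As a rough illustration of how such a figure is computed, here is a minimal Python sketch. The function name `n50` and the toy contig lengths are made up for this example; they are not data from either genome project.

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L
    together hold at least half of all bases in the assembly."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no sequence lengths given")


# Hypothetical toy assembly (lengths in bp), for illustration only.
toy_contigs = [9300, 5000, 4000, 2000, 1000, 700]
print(n50(toy_contigs))
```

Note that the slides quote these figures relative to the assembled bases (for salmon, "half of that" refers to the 2.4 Gbp assembled, not the 3 Gbp genome), which is what the sketch does as well.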
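
The "building contigs" slide shows twelve aligned 20 bp reads and the consensus contig derived from them. The sketch below reproduces that step as a per-column majority vote. It assumes each read starts 2 bp after the previous one; the offsets are inferred from the read sequences because the transcript does not preserve the slide's alignment. The function `consensus` is a toy written for this example. Real assemblers work from overlap or de Bruijn graphs and do not know the read positions in advance.

```python
from collections import Counter


def consensus(placed_reads):
    """Majority-vote consensus from (offset, sequence) pairs.
    Columns without any read coverage are reported as 'N'."""
    length = max(offset + len(seq) for offset, seq in placed_reads)
    columns = [Counter() for _ in range(length)]
    for offset, seq in placed_reads:
        for i, base in enumerate(seq):
            columns[offset + i][base] += 1
    return "".join(
        col.most_common(1)[0][0] if col else "N" for col in columns
    )


# The twelve reads from the slide, in the order shown.
slide_reads = [
    "ACGCGATTCAGGTTACCACG",
    "GCGATTCAGGTTACCACGCG",
    "GATTCAGGTTACCACGCGTA",
    "TTCAGGTTACCACGCGTAGC",
    "CAGGTTACCACGCGTAGCGC",
    "GGTTACCACGCGTAGCGCAT",
    "TTACCACGCGTAGCGCATTA",
    "ACCACGCGTAGCGCATTACA",
    "CACGCGTAGCGCATTACACA",
    "CGCGTAGCGCATTACACAGA",
    "CGTAGCGCATTACACAGATT",
    "TAGCGCATTACACAGATTAG",
]

# Assumption: read i starts at offset 2 * i (inferred, not stated on the slide).
placed = [(2 * i, read) for i, read in enumerate(slide_reads)]
print(consensus(placed))
```

With these offsets the reads agree at every position, so the vote is trivial. With a collapsed repeat or a heterozygous site, columns would contain conflicting bases, which is the situation the later slides describe.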