2012 talk to CSE department at U. Arizona


1. Streaming lossy compression of biological sequence data using probabilistic data structures. C. Titus Brown, Assistant Professor, CSE, MMG, BEACON, Michigan State University. August 2012. [email protected]

2. Acknowledgements. Lab members involved: Adina Howe (w/ Tiedje), Jason Pell, Arend Hintze, Rosangela Canino-Koning, Qingpeng Zhang, Elijah Lowe, Likit Preeyanon, Jiarong Guo, Tim Brom, Kanchan Pavangadkar, Eric McDonald. Collaborators: Jim Tiedje, MSU; Billie Swalla, UW; Janet Jansson, LBNL; Susannah Tringe, JGI. Funding: USDA NIFA; NSF IOS; BEACON.

3. We practice open science! Be the change you want. Everything discussed here: Code: github.com/ged-lab/ (BSD license). Blog: http://ivory.idyll.org/blog ("titus brown blog"). Twitter: @ctitusbrown. Grants on lab web site: http://ged.msu.edu/interests.html. Preprints: on arXiv, q-bio ("diginorm arxiv").

4. Shotgun metagenomics. Collect samples; extract DNA; feed into sequencer; computationally analyze. (Wikipedia: Environmental shotgun sequencing.)

5. Assembly. Reconstruct the original text from overlapping fragments:

   "It was the best of times, it was the wor"
   ", it was the worst of times, it was the"
   "isdom, it was the age of foolishness"
   "mes, it was the age of wisdom, it was th"

   => "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness"

   ...but for lots and lots of fragments!

6. Assemble based on word overlaps. Repeats cause problems.

7. Sequencers also produce errors:

   "It was the Gest of times, it was the wor"
   ", it was the worst of timZs, it was the"
   "isdom, it was the age of foolisXness"
   ", it was the worVt of times, it was the"
   "mes, it was Ahe age of wisdom, it was th"
   "It was the best of times, it Gas the wor"
   "mes, it was the age of witdom, it was th"
   "isdom, it was tIe age of foolishness"

   => "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness"

8. Shotgun sequencing & assembly. Randomly fragment & sequence from DNA; reassemble computationally. (UMD assembly primer, cbcb.umd.edu)

9. Assembly: no subdivision! Assembly is inherently an all-by-all process. There is no good way to subdivide the reads without potentially missing a key connection.

10. Assembly: no subdivision! Assembly is inherently an all-by-all process.
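The word-overlap idea on slides 5-6 can be made concrete with a toy greedy assembler. This is a hypothetical illustration (the names `overlap` and `greedy_assemble` are mine, not from the ged-lab code): repeatedly merge the pair of fragments with the longest suffix-prefix overlap.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(fragments):
    """Repeatedly merge the two fragments with the largest overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        best = (0, 0, 1)  # (overlap length, i, j)
        for i in range(len(frags)):
            for j in range(len(frags)):
                if i != j:
                    n = overlap(frags[i], frags[j])
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)]
        frags.append(merged)
    return frags[0]

fragments = [
    "It was the best of times, it was the worst",
    "it was the worst of times, it was the age of wisdom",
    "the age of wisdom, it was the age of foolishness",
]
print(greedy_assemble(fragments))  # reconstructs the full Dickens line
```

With highly repetitive text ("it was the" appears everywhere), a greedy merge can pick the wrong pair, which is exactly the repeat problem slide 6 warns about.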
There is no good way to subdivide the reads without potentially missing a key connection. I am, of course, lying: there were no good ways.

11. Four main challenges for de novo sequencing: repeats, low coverage, and errors, which introduce breaks in the construction of contigs; and variation in coverage (transcriptomes and metagenomes, as well as amplified genomic DNA). This challenges the assembler to distinguish between erroneous connections (e.g. repeats) and real connections.

12. Repeats. Overlaps don't place sequences uniquely when there are repeats present. (UMD assembly primer, cbcb.umd.edu)

13. Coverage. Easy calculation: (# reads x avg read length) / genome size. So, for the haploid human genome: 30m reads x 100 bp = 3 bn bp, over a 3 bn bp genome = 1x coverage.

14. Coverage. 1x doesn't mean every DNA sequence is read once. It means that, if sampling were systematic, it would be. Sampling isn't systematic; it's random!

15. Actual coverage varies widely from the average.

16. Actual coverage varies widely from the average. Low coverage introduces unavoidable breaks.

17. Two basic assembly approaches: overlap/layout/consensus, and De Bruijn or k-mer graphs. The former is used for long reads, especially all Sanger-based assemblies. The latter is used because of memory efficiency.

18. Overlap/layout/consensus. Essentially: 1. Calculate all overlaps (n^2). 2. Cluster based on overlap. 3. Do a multiple sequence alignment. (UMD assembly primer, cbcb.umd.edu)

19. K-mer graph. Break reads (of any length) down into multiple overlapping words of fixed length k. ATGGACCAGATGACAC (k=12) =>

   ATGGACCAGATG
   TGGACCAGATGA
   GGACCAGATGAC
   GACCAGATGACA
   ACCAGATGACAC

20. K-mer graphs: overlaps. (J.R. Miller et al., Genomics, 2010)

21. K-mer graph (k=14). Each node represents a 14-mer; links between nodes are 13-mer overlaps.

22. K-mer graph (k=14). Branches in the graph represent partially overlapping sequences.

23. K-mer graph (k=14). Single nucleotide variations cause long branches.

24. K-mer graph (k=14). Single nucleotide variations cause long branches; they don't rejoin quickly.
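Slide 19's k-mer decomposition is simple to state in code. A minimal sketch (the function name `kmers` is mine, not khmer's API):

```python
def kmers(read, k):
    """Yield every overlapping k-mer of a read. These are the nodes of the
    k-mer (de Bruijn) graph; consecutive k-mers share k-1 bases."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

# The slide's example: a 16 bp read at k=12 yields five 12-mers.
print(list(kmers("ATGGACCAGATGACAC", 12)))
```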
25. K-mer graphs: choosing paths. For decisions about which paths to take, etc., biology-based heuristics come into play as well.

26. The computational conundrum. More data => better. And: more data => computationally more challenging.

27. Reads vs. edges (memory) in de Bruijn graphs. (Conway T.C. & Bromage A.J., Bioinformatics 2011;27:479-486)

28. The scale of the problem is stunning. I estimate a worldwide capacity for DNA sequencing of 15 petabases/yr (it's probably larger). Individual labs can generate ~100 Gbp in ~1 week for $10k. This sequencing is at a boutique level: sequencing formats are semi-standard; basic analysis approaches are ~80% cookbook; every biological prep, problem, and analysis is different. Traditionally, biologists receive no training in computation (and computational people receive no training in biology :), and our computational infrastructure is optimizing for high-performance computing, not high throughput.

29. My problems are also very annoying. (From Monday's seminar:) Est. ~50 Tbp to comprehensively sample the microbial composition of a gram of soil. Currently we have approximately 2 Tbp spread across 9 soil samples. Need 3 TB RAM on a single chassis to do assembly of 300 Gbp; estimate 500 TB RAM for 50 Tbp of sequence. That just won't do.

30. Theoretical => applied solutions: theoretical advances in data structures and algorithms; demonstrated effectiveness on real data; practically useful & usable implementations, at scale.

31. Three parts to our solution: 1. Adaptation of a suite of probabilistic data structures for representing set membership and counting (Bloom filters and CountMin Sketch). 2. An online streaming approach to lossy compression. 3. Compressible de Bruijn graph representation.
32. 1. CountMin Sketch. To add an element: increment the associated counter at all hash locales. To get a count: retrieve the minimum counter across all hash locales. (http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/)

33. Our approach is very memory efficient...

34. ...and does not introduce significant miscounts on NGS data sets.

35. 2. Online, streaming, lossy compression. (NOVEL) Much of next-gen sequencing is redundant.

36. Uneven coverage => even more redundancy. (NOVEL) Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory.

37. Can we preferentially retain reads that contain true edges? (Conway T.C. & Bromage A.J., Bioinformatics 2011;27:479-486)

38. Downsample based on de Bruijn graph structure; this can be derived online.

39. Digital normalization algorithm:

   for read in dataset:
       if estimated_coverage(read) < CUTOFF:
           update_kmer_counts(read)
           save(read)
       # else: discard read

   Note: single pass; fixed memory.

40. The median k-mer count in a sentence is a good estimator of redundancy within the graph. This gives us a reference-free measure of coverage.

41. Digital normalization retains information, while discarding data and errors.

42. Contig assembly now scales with underlying genome size. Transcriptomes, microbial genomes (incl. MDA), and most metagenomes can be assembled in under 50 GB of RAM, with identical or improved results. Memory efficiency is improved by use of the CountMin Sketch.

43. 3. Compressible de Bruijn graphs. (NOVEL) Each node represents a 14-mer; links between nodes are 13-mer overlaps.

44. Can store implicit de Bruijn graphs in a Bloom filter. [Figure: the k-mers of AGTCGGCATGAC (AGTCGG, GTCGGC, TCGGCA, CGGCAT, GGCATG, GCATGA, CATGAC) stored in a Bloom filter; edges are found by querying the filter for each of the four possible one-base extensions (A/C/G/T).]

45. False positives introduce false nodes/edges. When does this start to distort the graph?
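Slides 32 and 39 combine naturally: the CountMin Sketch supplies the k-mer counts that `estimated_coverage` needs, via the median-count estimator of slide 40. A minimal, hypothetical Python sketch (tiny toy parameters, md5-based hashing; khmer's actual implementation differs):

```python
import hashlib

class CountMinSketch:
    """To add an element, increment its counter at all hash locales;
    to query, take the minimum across locales (it can over- but never
    under-count)."""
    def __init__(self, num_tables=4, table_size=10007):
        self.tables = [[0] * table_size for _ in range(num_tables)]

    def _locales(self, item):
        for i, table in enumerate(self.tables):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield table, int(digest, 16) % len(table)

    def add(self, item):
        for table, idx in self._locales(item):
            table[idx] += 1

    def count(self, item):
        return min(table[idx] for table, idx in self._locales(item))

def median_kmer_count(read, sketch, k):
    """Slide 40's reference-free coverage estimate: the median count of
    the read's k-mers."""
    counts = sorted(sketch.count(read[i:i + k])
                    for i in range(len(read) - k + 1))
    return counts[len(counts) // 2]

def digital_normalize(reads, k=4, cutoff=3):
    """Slide 39's single-pass, fixed-memory loop: keep a read (and count
    its k-mers) only while its estimated coverage is below the cutoff."""
    sketch = CountMinSketch()
    for read in reads:
        if median_kmer_count(read, sketch, k) < cutoff:
            for i in range(len(read) - k + 1):
                sketch.add(read[i:i + k])
            yield read
        # else: discard the read

# Ten identical reads collapse to three: after three copies the
# estimated coverage reaches the cutoff and the rest are discarded.
kept = list(digital_normalize(["ACGTGCA"] * 10))
print(len(kept))  # -> 3
```

The redundancy argument of slides 35-36 is visible here: the more over-represented a locus, the larger the fraction of its reads that are dropped, while low-coverage loci are kept untouched.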
46. Average component size remains low through 18% FPR.

47. Graph diameter remains constant through 18% FPR.

48. Global graph structure is retained past 18% FPR. [Figure panels at 1%, 5%, 10%, and 15% FPR.]

49. Equivalent to a bond percolation problem; the percolation threshold is independent of k (?).

50. This data structure is strikingly efficient for storing sparse k-mer graphs. ("Exact" is the best possible information-theoretical storage.)

51. We implemented graph partitioning on top of this probabilistic de Bruijn graph: split reads into bins belonging to different source species. Can do this based almost entirely on connectivity of sequences.

52. Partitioning scales assembly for a subset of problems. It can be done in ~10x less memory than assembly. Partition at low k and assemble exactly at any higher k (DBG). Partitions can then be assembled independently: multiple processors -> scaling; multiple k, coverage -> improved assembly; multiple assembly packages (tailored to high variation, etc.). Small partitions/contigs can be eliminated in the partitioning phase. An incredibly convenient approach enabling divide & conquer approaches across the board.

53. Technical challenges met (and defeated): exhaustive in-memory traversal of graphs containing 5-15 billion nodes; sequencing technology that introduces false connections in the graph (Howe et al., in prep.). Our implementation lets us scale ~20x over other approaches.

54. Minia assembler (minia.genouest.org). (Chikhi thesis presentation)

55. Our approaches yield a variety of strategies. [Diagram: shotgun and metagenomic data flow through digital normalization and/or partitioning before assembly.]

56. Concluding thoughts, thus far. Our approaches provide significant and substantial practical and theoretical leverage on one of the most challenging current problems in computational biology: assembly. They also improve the quality of analysis, in some cases. They provide a path to the future: many-core compatible; distributable? Decreased memory footprint => cloud computing can be used for many analyses.
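Slides 44-45 describe storing the graph implicitly: only k-mer membership goes into the Bloom filter, and edges are recovered by querying the four possible one-base extensions. A small hypothetical sketch of that idea (toy parameters, sha1-based hashing; not khmer's actual hash functions), including the false-positive caveat:

```python
import hashlib

class BloomFilter:
    """Set-membership only: no false negatives, but queries can return
    false positives, which is what creates the false nodes/edges of
    slide 45."""
    def __init__(self, size=100003, num_hashes=4):
        self.bits = [False] * size
        self.num_hashes = num_hashes

    def _indices(self, item):
        for i in range(self.num_hashes):
            h = int(hashlib.sha1(f"{i}:{item}".encode()).hexdigest(), 16)
            yield h % len(self.bits)

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = True

    def __contains__(self, item):
        return all(self.bits[idx] for idx in self._indices(item))

def load_kmers(bf, sequence, k):
    """Store a sequence's k-mers; the graph itself is never materialized."""
    for i in range(len(sequence) - k + 1):
        bf.add(sequence[i:i + k])

def neighbors(bf, kmer):
    """Walk the implicit graph: a successor is the (k-1)-suffix plus one
    of the four bases, if the filter claims that k-mer is present."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bf]

bf = BloomFilter()
load_kmers(bf, "AGTCGGCATGAC", 6)
print(neighbors(bf, "AGTCGG"))
```

Because the filter never returns false negatives, every true edge is found; the partitioning of slides 51-52 can therefore traverse connectivity on this structure, at the cost of occasional spurious edges when the false-positive rate climbs.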
They are in use: ~dozens of labs are using digital normalization.

57. Future research. Many directions in the works! (See posted grant proposals.) Theoretical groundwork for the normalization approach. Graph search & alignment algorithms. Error detection & correction. Resequencing analysis. Online (infinite) assembly.

58. Streaming Twitter analysis.

59. Running HMMs over de Bruijn graphs (=> cross-validation). hmmgs: assemble based on good-scoring HMM paths through the graph. Independent of other assemblers; very sensitive, specific. 95% of hmmgs rplB domains are present in our partitioned assemblies. (Jordan Fish, Qiong Wang, and Jim Cole, RDP)

60. Side note: error correction is the biggest data problem left in sequencing, both for mapping & assembly.

61. Streaming error correction. First pass: does the read come from a high-coverage locus? If yes, error-correct low-abundance k-mers in the read. If no, add the read to the graph and save it for later. Second pass, over only the saved reads: does the read come from a now-high-coverage locus? If yes, error-correct low-abundance k-mers in the read; if no, leave it unchanged. We can do error trimming of genomic, MDA, transcriptomic, and metagenomic data in < 2 passes, fixed memory. We have just submitted a proposal to adapt Euler or Quake-like error correction (e.g. spectral alignment
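Slide 61's two-pass flow can be sketched as follows. This is a hypothetical illustration: a plain Counter stands in for the CountMin Sketch, simple abundance-trimming stands in for full error correction, and all names and cutoffs (`two_pass_error_trim`, `coverage_cutoff`, `trust_cutoff`) are my inventions, not the khmer API.

```python
from collections import Counter
from statistics import median

def median_kmer_count(read, counts, k):
    """Reference-free coverage estimate: median abundance of the read's k-mers."""
    return median(counts[read[i:i + k]] for i in range(len(read) - k + 1))

def trim_low_abundance(read, counts, k, trust):
    """Stand-in for error correction: truncate at the first untrusted k-mer."""
    for i in range(len(read) - k + 1):
        if counts[read[i:i + k]] < trust:
            return read[:i + k - 1]
    return read

def two_pass_error_trim(reads, k=4, coverage_cutoff=3, trust_cutoff=2):
    counts = Counter()
    saved, corrected = [], []
    # First pass: reads from already-high-coverage loci are trimmed
    # immediately; everything else is counted and saved for later.
    for read in reads:
        if median_kmer_count(read, counts, k) >= coverage_cutoff:
            corrected.append(trim_low_abundance(read, counts, k, trust_cutoff))
        else:
            for i in range(len(read) - k + 1):
                counts[read[i:i + k]] += 1
            saved.append(read)
    # Second pass, over only the saved reads, against the now-complete counts.
    for read in saved:
        if median_kmer_count(read, counts, k) >= coverage_cutoff:
            corrected.append(trim_low_abundance(read, counts, k, trust_cutoff))
        else:
            corrected.append(read)  # still low coverage: leave unchanged
    return corrected

reads = ["ACGTACGTAC"] * 6 + ["ACGTACGGAC"]
out = two_pass_error_trim(reads)
# The six clean reads pass through untouched; the erroneous read is
# trimmed at its first low-abundance k-mer.
print(out.count("ACGTACGTAC"), "ACGTACG" in out)  # -> 6 True
```

The fixed-memory property comes from the same place as in digital normalization: once a locus is saturated, further reads from it are processed immediately and never stored.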