CT Brown - Doing next-gen sequencing analysis in the cloud

Transcript of CT Brown - Doing next-gen sequencing analysis in the cloud

2. Acknowledgements
Lab members involved: Adina Howe (w/ Tiedje), Jason Pell, Arend Hintze, Rosangela Canino-Koning, Qingpeng Zhang, Elijah Lowe, Likit Preeyanon, Jiarong Guo, Tim Brom, Kanchan Pavangadkar, Eric McDonald.
Collaborators: Jim Tiedje, MSU; Billie Swalla, UW; Janet Jansson, LBNL; Susannah Tringe, JGI.
Funding: USDA NIFA; NSF IOS; BEACON.

3. Be the change you want to see
We are aggressively open. Everything discussed here:
Code: github.com/ged-lab/ ; BSD license.
Blog: http://ivory.idyll.org/blog ("titus brown blog").
Twitter: @ctitusbrown.
Grants on lab Web site: http://ged.msu.edu/interests.html (What's a good license??)
Preprints: on arXiv, q-bio: kmer-percolation arxiv, diginorm arxiv.

4. The data catastrophe!
Data set sizes are growing faster than compute capacity (esp. RAM). Many biological algorithms don't scale all that well, anyway. Algorithmically, we want: single-pass algorithms; compression approaches (lossy or otherwise); low-memory data structures. I, personally, think the last thing in the world we need is another standalone package: pre-filtering approaches. Run our nifty approaches first, then feed into the existing tools.

5. Digital normalization
Suppose you have a dilution factor of A (10) to B (1). To get 10x coverage of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory.

6. Downsample based on de Bruijn graph structure (which can be derived online).

7. Digital normalization algorithm:

    for read in dataset:
        if median_kmer_count(read) < CUTOFF:
            update_kmer_counts(read)
            save(read)
        else:
            pass  # discard read

Note: single pass; fixed memory.

8. Digital normalization is efficient & effective
Single-pass algorithm; fixed memory: algorithmic nerdvana! Cheaper than assembly; reduces assembly time; scales assembly memory. Brown et al., in review, PLoS ONE.

9. Digital normalization removes errors.

10. Shotgun data is often (1) high coverage and (2) biased in coverage.

11. Here we discard > 95% of data!

12. Other key points
Virtually identical contig assembly; scaffolding works but is not yet cookie-cutter.
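The slide-7 loop can be sketched as runnable Python. This is only an illustration of the algorithm as stated on the slide: the exact dict used for counting here stands in for the fixed-memory probabilistic counting structure that the real khmer implementation uses, and the class name and parameter defaults are assumptions, not the author's API.

```python
from statistics import median

def kmers(read, k):
    """All k-length substrings of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

class DigitalNormalizer:
    """Sketch of the slide-7 digital normalization loop.

    An exact dict stands in for khmer's fixed-memory probabilistic
    k-mer counting structure; names here are illustrative only.
    """
    def __init__(self, k=20, cutoff=20):
        self.k = k
        self.cutoff = cutoff
        self.counts = {}  # k-mer -> observed count

    def median_kmer_count(self, read):
        # median abundance of the read's k-mers seen so far
        return median(self.counts.get(km, 0) for km in kmers(read, self.k))

    def update_kmer_counts(self, read):
        for km in kmers(read, self.k):
            self.counts[km] = self.counts.get(km, 0) + 1

    def normalize(self, reads):
        # Single pass over the data; once a region's estimated coverage
        # reaches the cutoff, further reads from it are discarded.
        for read in reads:
            if self.median_kmer_count(read) < self.cutoff:
                self.update_kmer_counts(read)
                yield read
```

Because the decision uses the *median* k-mer abundance, a read that is mostly novel (e.g. full of error k-mers seen once) can still be kept, while redundant high-coverage reads are dropped; this is the downsampling-by-graph-structure idea of slide 6.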
Digital normalization changes the way de Bruijn graph assembly scales: from the size of your data set to the size of the source sample. Always lower memory than assembly: we never collect most erroneous k-mers. Digital normalization can be done once, and then assembly parameter exploration can be done.

13. Quotable quotes
Comment: "This looks like a great solution for people who can't afford real computers." OK, but: "Buying ever bigger computers is a great solution for people who don't want to think hard." To be less snide: both kinds of scaling are needed, of course.

14. Why use diginorm?
Use the cloud to assemble any microbial genome (incl. single-cell), many eukaryotic genomes, most mRNAseq, and many metagenomes. Seems to provide leverage on addressing many biological or sample-prep problems (single-cell & genome amplification (MDA); metagenomes; heterozygosity). And, well, the general idea of locus-specific graph analysis solves lots of things.

15. Some interim concluding thoughts
Digital normalization-like approaches provide a path to solving the majority of assembly scaling problems, and will enable assembly on current cloud computing hardware. This is not true for highly diverse metagenome environments: for soil, we estimate that we need 50 Tbp / gram of soil. Sigh. Biologists and bioinformaticians hate: throwing away data; caveats in bioinformatics papers (which reviewers like, note).

16. Streaming error correction
We can do error trimming of genomic, MDA, transcriptomic, and metagenomic data in < 2 passes, fixed memory. We have just submitted a proposal to adapt Euler- or Quake-like error correction (e.g. the spectral alignment problem) to this framework.

17. Side note: error correction is the biggest data problem left in sequencing. Both for mapping & assembly.

18. Replication fu
In December 2011, I met Wes McKinney on a train and he convinced me that I should look at IPython Notebook. This is an interactive Web notebook for data analysis. Hey, neat! We can use this for replication!
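Slide 16 states only the property (under two passes, fixed memory), not the algorithm. As background, one common k-mer-abundance trimming scheme, in the spirit of what the slide describes, looks like the sketch below: count k-mers, then truncate each read at its first low-abundance k-mer, on the theory that low-abundance k-mers are mostly sequencing errors. All names and thresholds here are assumptions for illustration; the plain version shown is two full passes, not the semi-streaming variant the slide claims.

```python
def count_kmers(reads, k):
    """Pass 1: count every k-mer across the data set (exact dict for clarity)."""
    counts = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            km = read[i:i + k]
            counts[km] = counts.get(km, 0) + 1
    return counts

def trim_read(read, counts, k, min_count):
    """Truncate the read just before its first low-abundance k-mer."""
    for i in range(len(read) - k + 1):
        if counts.get(read[i:i + k], 0) < min_count:
            return read[:i + k - 1]
    return read

def abundance_trim(reads, k=20, min_count=2):
    counts = count_kmers(reads, k)                              # pass 1: count
    return [trim_read(r, counts, k, min_count) for r in reads]  # pass 2: trim
```

A read whose tail contains an error shared with no other read ends up truncated at the error position, while reads fully supported by other reads pass through unchanged.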
All of our figures can be regenerated from scratch, on an EC2 instance, using a Makefile (data pipeline) and IPython Notebook (figure generation). Everything is version controlled. Honestly not much work, and it will be less the next time.

19. So how'd that go?
People who already cared thought it was nifty. http://ivory.idyll.org/blog/replication-i.html Almost nobody else cares ;( Presub enquiry to editor: "Be sure that your paper can be reproduced." Uh, please read my letter to the end? "Could you improve your Makefile? I want to reimplement diginorm in another language and reuse your pipeline, but your Makefile is a mess." Incredibly useful, nonetheless. Already part of undergraduate and graduate training in my lab; helping us and others with next papers; etc. etc. etc. Life is way too short to waste on unnecessarily replicating your own workflows, much less other people's.

20. Acknowledgements (as on slide 2).

21. Advertisement!
Qingpeng Zhang (QP) will talk about our very useful khmer software for efficiently counting k-mers. Want a simple Python lib for reading & indexing FASTA/FASTQ? Check out screed. Better science through superior software.

22. Advertisement
Panel on "Should we have voluntary review standards for bioinformatics?" Tomorrow, 4:30pm.

23. We are aggressively open (as on slide 3): code, blog, Twitter, grants, preprints.
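To make slide 21's pitch concrete: screed itself provides a richer, indexed interface over FASTA/FASTQ, but a minimal stand-in shows the kind of record iteration such a library offers. `read_fasta` is a hypothetical name, not part of screed's API.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from FASTA-formatted lines.

    Minimal stand-in for a library like screed (slide 21): handles
    multi-line sequences and skips blank lines; no indexing, no FASTQ.
    """
    name, seq = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(seq)   # emit the previous record
            name, seq = line[1:], []
        elif line:
            seq.append(line)
    if name is not None:
        yield name, "".join(seq)           # emit the final record
```

Usage: `for name, seq in read_fasta(open("reads.fa")): ...` — the real screed additionally provides random access to records by name, which matters for the large data sets discussed above.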