2014 khmer protocols

1. Making de novo assembly cheap & easy: standardized protocols for mRNAseq and metagenome assembly and analysis
    C. Titus Brown
    Assistant Professor, CSE, MMG, BEACON
    Michigan State University
    Jan 2014
    [email protected]

2. My lab's focus
    De novo assembly and efficient/effective use of NGS, especially for non-model organisms.
    Open source software engineering.
    Training and education in NGS.

3. There is quite a bit of life left to sequence & assemble.
    http://pacelab.colorado.edu/

4. Three problems:
    1. Assembly memory & compute requirements?
    2. It's a complex process; what are good defaults?
    3. Training is limited in opportunity, difficult for students, not always effective.

5. First problem: lots of data!

6. So, we want to go from raw data:
    @SRR606249.17/1    (name)
    GAGTATGTTCTCATAGAGGTTGGTANNNNT
    +
    B@BDDFFFHHHHHJIJJJJGHIJHJ####1    (quality score)
    @SRR606249.17/2
    CGAANNNNNNNNNNNNNNNNNCCTGGCTCA
    +
    CCCF#################22@GHIJJJ

7. to assembled original sequence.
    UMD assembly primer (cbcb.umd.edu)

8. Practical memory measurements
    Velvet measurements (Adina Howe)

9. Shotgun sequencing & de novo assembly:
    It was the Gest of times, it was the wor
    , it was the worst of timZs, it was the
    isdom, it was the age of foolisXness
    , it was the worVt of times, it was the
    mes, it was Ahe age of wisdom, it was th
    It was the best of times, it Gas the wor
    mes, it was the age of witdom, it was th
    isdom, it was tIe age of foolishness

    It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness

10. Why are big data sets difficult?
    Need to resolve errors: the more coverage there is, the more errors there are.
    Memory usage ~ real variation + number of errors
    Number of errors ~ size of data set

11. The scaling problem
    We can cheaply gather DNA data in quantities sufficient to swamp straightforward assembly algorithms running on commodity hardware.
    Since ~2008: the field has engaged in lots of engineering optimization, but the data generation rate has consistently outstripped Moore's Law.

12. Our solution: Digital normalization

13-17. Digital normalization

18.
Contig assembly now scales a lot better.
    Most samples can be assembled in < 50 GB of memory.

19. Diginorm is widely useful, becoming widely used:
    1. Assembly of the H. contortus parasitic nematode genome, a high polymorphism/variable coverage problem. (Schwarz et al., 2013; pmid 23985341)
    2. Reference-free assembly of the lamprey (P. marinus) transcriptome, a big assembly problem. (in prep)
    3. Osedax symbiont metagenome, a contaminated metagenome problem (Goffredi et al, 2013; pmid

20. Second problem: too many choices!
    What programs and options do you use??
    Read trimming and filtering (x100)
    Assembly (x10)
    Quantification (x20)
    Science! (x 10,000)
    Annotation (x20)

21. Third problem: training
    I teach:
        Summer NGS course (two weeks, KBS); heavily oversubscribed.
        Many ad hoc workshops
        Fall BEACON course (intro computational science)
    Others teach:
        Summer/fall workshops (Robin Buell)
        Various genomics/bioinformatics courses (Shin-han Shiu, Rob Britton, ???)

22. Overall training results:
    We can fairly easily get people over the initial technical hump (here are some programs, here's how to use them).
    We can begin to teach people the way to think about the problem.
    People have a really tough time connecting generic instruction to their own research, however!
    (And people need to learn how to analyze their own

23. Three problems:
    1. Assembly memory & compute requirements?
    2. It's a complex process; what are good defaults?
    3. Training is limited in opportunity, difficult for students, not always effective.

24. Solution? khmer-protocols
    Effort to provide standard cheap assembly protocols for Illumina mRNAseq & metagenomes in the cloud.
    Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 on Amazon per data set.
    Open, versioned, forkable, citable.
    [Diagram: Read cleaning, Diginorm, Assembly, Annotation, RSEM differential expression]

25.
Eel Pond mRNAseq protocol
    Adapter trim & quality filter
    Diginorm to C=20
    Trim high-coverage reads at low-abundance k-mers
    Assemble with Trinity
    Group transcripts
    Annotate x database
    RSEM (Map QC reads to count)
    EBSeq (Differential expression analysis)
    Extracting differentially expressed genes & graphing

26. Kalamazoo metagenome protocol
    Adapter trim & quality filter
    Diginorm to C=10
    Trim high-coverage reads at low-abundance k-mers
    Diginorm to C=5
    Too big to assemble?
        Partition graph
        Split into "groups"
        Reinflate groups (optional)
    Small enough to assemble?
        Assemble!!!
    Map reads to assembly
    Annotate contigs with abundances
    Prokka

27. Show: Web site
    http://khmer-protocols.readthedocs.org/

28. Show: mRNAseq output
    Differential expression graph

29. Show: mRNAseq spreadsheet

30. Show: BLAST server

31. Soon: Galaxy integration

32. What khmer-protocols is:
    Starting point.
    Defensible initial solution to get initial results.
    Works on ~80% or more of samples, guesstimated.
    Great (?) way to learn
    100% reproducible; methods section on computational analysis is more or less written for you.
    Fairly fast and inexpensive (comparatively) (~$100/data set)

33. What khmer-protocols is not:
    The One True Solution.
    The Best Solution.
    Proprietary. Closed.
    Slow and expensive (comparatively).

34. Speed up/efficiency?
    [Figure: two panels, "Walltime to complete assemblies" (total walltime, hrs) and "RAM needed to complete assemblies" (total memory used, GB); samples occ oases, occ trinity, ocu oases, ocu trinity, each shown as DN vs. RAW.]
    Elijah Lowe

35. Diginorm increases sensitivity (very slightly :)
    Evaluation by homology against a reference gene
    37 extra from diginorm, vs 17 lost;
    64 extra from diginorm, vs 15 lost;
    Elijah Lowe

36. Please use!
    Would love feedback: what worked? What didn't work?
    Cannot support khmer protocols on HPC, but can support it in the cloud; iCER may (?) support it on HPC -- all of the software is installed.
    (We are working on better default support for HPC.)

37. Links & more references
    ged.msu.edu/angus/ - NGS course materials
    khmer-protocols.readthedocs.org - khmer-protocols
    Cloud computing discussion next Wed, 1/22, 2pm, iCER.
    Don't e-mail me at: [email protected]
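
Note on the "Diginorm to C=..." steps: the slides name digital normalization but do not spell out the procedure. The sketch below follows the commonly described idea: stream through the reads, estimate each read's coverage as the median count of its k-mers among the reads kept so far, and keep the read only if that estimate is still below the cutoff C. It is an illustration under stated assumptions, not the khmer implementation. It uses an exact in-memory Counter where khmer uses a memory-efficient probabilistic counter, it ignores reverse complements and paired reads, and the function names and the toy reads are made up for the example.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Return all overlapping k-mers of a read (no reverse-complement handling)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def normalize_by_median(reads, k=20, cutoff=20):
    """Keep a read only while its estimated coverage is below `cutoff`.

    Coverage is estimated as the median count of the read's k-mers among
    the reads kept so far. Exact counting is an assumption made here for
    clarity; it does not scale to real data sets.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k, nothing to count
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add to the counts
    return kept


if __name__ == "__main__":
    # Toy input: one read sampled ten times, one read sampled once.
    reads = ["ACGTTGCAAG"] * 10 + ["TTTTGGGGCC"]
    kept = normalize_by_median(reads, k=4, cutoff=3)
    print(len(kept))  # 4: three copies of the abundant read plus the rare read
```

The high-coverage read is capped at C copies while the low-coverage read is kept, which is why the data handed to the assembler shrinks without dropping rare sequence.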