Post on 16-Dec-2015
DNA Subway Green Line Onramp to HPC in Biology Education
Dave Micklos and Uwe Hilgert
iPlant Collaborative
DNA Learning Center,
Cold Spring Harbor Laboratory; Bio5 Institute,
University of Arizona
…ride an educational Discovery Environment
Green Line: RNA Sequence (RNA-Seq) Analysis
• First fully graphical interface for RNA-Seq analysis: no command line or data conversions
• Accesses XSEDE system through the iPlant Agave API
• Co-localizes up to 100 GB of data in iPlant Data Store
• Look for differential gene expression in different tissues, life stages, or treatments
• Generate lists of expressed genes and fold-changes
• Annotate sequenced genomes; add results to Red Line projects
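As a rough illustration of what a fold-change list contains, the sketch below computes log2 fold-changes from made-up expression values. The gene names, the numbers, and the pseudocount are all assumptions chosen for illustration; this is not necessarily how the Green Line tools compute their reported values.

```python
import math

def log2_fold_change(control, treated, pseudocount=1.0):
    """Return log2(treated / control); the pseudocount (an assumption) avoids division by zero."""
    return math.log2((treated + pseudocount) / (control + pseudocount))

# Hypothetical expression values (control, treated) for three made-up genes
expression = {
    "geneA": (7.0, 31.0),
    "geneB": (15.0, 15.0),
    "geneC": (63.0, 7.0),
}

fold_changes = {gene: log2_fold_change(c, t) for gene, (c, t) in expression.items()}

# List genes from the largest to the smallest absolute change
for gene, lfc in sorted(fold_changes.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{gene}\t{lfc:+.2f}")
```

A positive value means higher expression in the treated sample, a negative value means lower, and zero means no change.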
150 feet
RNA code represents “active” DNA in genome
Homo sapiens bitter taste receptor (TAS2R38) DNA code > RNA code
CCTTTCTGCACTGGGTGGCAACCAGGTCTTTAGATTAGCCAACTAGAGAAGAGAAGTAGAATAGCCAATTAGAGAAGTGACATCATGTTGACTCTAACTCGCATCCGCACTGTGTCCTATGAAGTCAGGAGTACATTTCTGTTCATTTCAGTCCTGGAGTTTGCAGTGGGGTTTCTGACCAATGCCTTCGTTTTCTTGGTGAATTTTTGGGATGTAGTGAAGAGGCAGGCACTGAGCAACAGTGATTGTGTGCTGCTGTGTCTCAGCATCAGCCGGCTTTTCCTGCATGGACTGCTGTTCCTGAGTGCTATCCAGCTTACCCACTTCCAGAAGTTGAGTGAACCACTGAACCACAGCTACCAAGCCATCATCATGCTATGGATGATTGCAAACCAAGCCAACCTCTGGCTTGCTGCCTGCCTCAGCCTGCTTTACTGCTCCAAGCTCATCCGTTTCTCTCACACCTTCCTGATCTGCTTGGCAAGCTGGGTCTCCAGGAAGATCTCCCAGATGCTCCTGGGTATTATTCTTTGCTCCTGCATCTGCACTGTCCTCTGTGTTTGGTGCTTTTTTAGCAGACCTCACTTCACAGTCACAACTGTGCTATTCATGAATAACAATACAAGGCTCAACTGGCAGATTAAAGATCTCAATTTATTTTATTCCTTTCTCTTCTGCTATCTGTGGTCTGTGCCTCCTTTCCTATTGTTTCTGGTTTCTTCTGGGATGCTGACTGTCTCCCTGGGAAGGCACATGAGGACAATGAAGGTCTATACCAGAAACTCTCGTGACCCCAGCCTGGAGGCCCACATTAAAGCCCTCAAGTCTCTTGTCTCCTTTTTCTGCTTCTTTGTGATATCATCCTGTGCTGCCTTCATCTCTGTGCCCCTACTGATTCTGTGGCGCGACAAAATAGGGGTGATGGTTTGTGTTGGGATAATGGCAGCTTGTCCCTCTGGGCATGCAGCCATCCTGATCTCAGGCAATGCCAAGTTGAGGAGAGCTGTGATGACCATTCTGCTCTGGGCTCAGAGCAGCCTGAAGGTAAGAGCCGACCACAAGGCAGATTCCCGGACACTGTGCTGAGAATGGACATGAAATGAGCTCTTCATTAATACGCCTGTGAGTCTTCATAAATATGCC
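The "DNA code > RNA code" step can be sketched in a few lines, assuming the sequence shown is the coding strand, so that the RNA reads the same with U in place of T. The function name is hypothetical, and only the first 15 bases of the sequence above are used.

```python
def transcribe(dna):
    """Convert a coding-strand DNA sequence to its RNA equivalent (T becomes U)."""
    dna = dna.upper()
    unexpected = set(dna) - set("ACGT")
    if unexpected:
        raise ValueError(f"unexpected characters in DNA sequence: {sorted(unexpected)}")
    return dna.replace("T", "U")

# First 15 bases of the TAS2R38 sequence shown above
print(transcribe("CCTTTCTGCACTGGG"))  # CCUUUCUGCACUGGG
```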
Differential Gene Expression
RNA Sequence (RNA-Seq) gives “snapshot” of genes active in different cells at different times
RNA Sequence (RNA-Seq) Analysis
Isolate total RNA; convert to DNA library
Design RNA-Seq experiment, e.g., differential expression
Sequence experiment and control libraries
Analyze sequence data on DNA Subway Green Line
Follow-up experimental validation
Image source: http://www.bgisequence.com
1) Manage Data: Quality assessment with FastQC; ~100 million 75/150-nucleotide reads in <1 hr
2) FastX Toolkit: Quality control with FastX Toolkit; ~100 million 75/150-nucleotide reads in <1 hr (some took up to 19 hr…)
3) TopHat: Aligns ~100 million 75/150-nucleotide (paired-end) reads to a reference genome of 100 million to 5 billion bases in 6–19 hr
TopHat Alignment: JBrowse
4) Cufflinks: Assembles transcripts and estimates their abundance from BAM files of 1–12 GB in 6–19 hr
5) Cuffdiff: Merges assemblies from Cufflinks and performs differential expression analysis on 4–9 samples in 6–19 hr
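For readers curious what the five steps look like without the GUI, here is a minimal sketch that builds (but does not run) command lines for the underlying tools. File names, directory names, threshold values, and the single-end simplification are assumptions made for illustration; the parameters DNA Subway actually passes are not given in these slides.

```python
import shlex

def sample_commands(sample, reads, bowtie_index, threads=16):
    """Return argument lists for steps 1-4 on one single-end FASTQ file (hypothetical names)."""
    filtered = f"{sample}.filtered.fastq"
    tophat_dir = f"{sample}_tophat"
    return [
        # 1) Quality assessment report
        ["fastqc", reads],
        # 2) Keep reads in which at least 80% of bases have quality >= 20
        #    (-Q33 selects Phred+33 quality encoding; thresholds are assumptions)
        ["fastq_quality_filter", "-Q33", "-q", "20", "-p", "80",
         "-i", reads, "-o", filtered],
        # 3) Spliced alignment to the reference (Bowtie index prefix)
        ["tophat", "-p", str(threads), "-o", tophat_dir, bowtie_index, filtered],
        # 4) Transcript assembly and abundance estimation
        ["cufflinks", "-p", str(threads), "-o", f"{sample}_cufflinks",
         f"{tophat_dir}/accepted_hits.bam"],
    ]

def comparison_commands(control_bams, treated_bams, assembly_list, threads=16):
    """Return argument lists for step 5: merge assemblies, then test for differences."""
    return [
        ["cuffmerge", "-p", str(threads), "-o", "merged", assembly_list],
        ["cuffdiff", "-p", str(threads), "-o", "cuffdiff_out",
         "-L", "control,treated", "merged/merged.gtf",
         ",".join(control_bams), ",".join(treated_bams)],
    ]

commands = sample_commands("control1", "control1.fastq", "genome_index")
commands += comparison_commands(
    ["control1_tophat/accepted_hits.bam"],
    ["treated1_tophat/accepted_hits.bam"],
    "assemblies.txt",
)
for cmd in commands:
    print(shlex.join(cmd))
```

Building the commands as lists keeps the sketch safe to run anywhere: nothing is executed, and the printed lines can be inspected before being handed to a scheduler.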
Green Line: Queue time vs. run time
• Asking for a long run time leads to longer queue times
• Asking for a short run time may lead to the job being terminated
• Users don't like to wait too long
• Users want the results right away
• Finding the right balance is not easy
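One simple way to approach this balance, sketched here under stated assumptions, is to request slightly more than the longest run seen so far for similar jobs. The function name, the 25% safety margin, and the 48-hour cap are hypothetical and do not describe how the Green Line actually sets its requests.

```python
import math

def suggest_walltime(past_run_hours, safety_factor=1.25, max_hours=48.0):
    """Suggest a run-time request (hours) from past run times of similar jobs."""
    if not past_run_hours:
        raise ValueError("need at least one past run time")
    # Pad the longest observed run, round up to a whole hour, respect the cap
    padded = max(past_run_hours) * safety_factor
    return min(float(math.ceil(padded)), max_hours)

# Hypothetical run times (hours) for earlier alignment jobs
history = [6.0, 9.5, 12.0, 19.0]
print(suggest_walltime(history))  # 24.0
```

A smaller safety factor shortens the expected queue time but raises the risk that a slow job is terminated before it finishes.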
Green Line: Dealing with the unexpected
• Systems taken offline
• Maintenance
• Network outages, data transfer issues
• Science API glitches
• Authentication
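Transient failures such as network outages are commonly handled by retrying with an increasing delay. The sketch below shows that general pattern with a simulated failure; the function names and delay values are hypothetical, and this is not taken from the DNA Subway code.

```python
import time

def run_with_retries(task, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call task(); on a connection or timeout error, wait and retry with doubling delays."""
    for attempt in range(attempts):
        try:
            return task()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)

# Simulated submission that fails twice, then succeeds
calls = {"n": 0}

def flaky_submit():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network outage")
    return "job-accepted"

delays = []  # record the waits instead of actually sleeping
result = run_with_retries(flaky_submit, sleep=delays.append)
print(result, delays)  # job-accepted [1.0, 2.0]
```

Retries help only with temporary problems; scheduled maintenance or an expired login still needs a clear message to the user.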
Green Line: “Monitoring XSEDE”
DNA Subway: “Power Desktop”
• Intuitive interface to support seamless genome “round trip” for eukaryote of choice
• Access high performance computing to analyze whole genome data (RNA-seq, initially)
• Scaffold data to sequenced genomes available in iPlant Data Store
• Directly upload RNA-seq reads as biological evidence for genome annotation using Red Line
NSF CCLI Project Retreat: June 8–20, 2014, CSHL
• 11 faculty from primarily undergraduate institutions (PUIs)
• Program included lectures/practical sessions
  - Wet lab: RNA library prep
  - Green Line analysis & bioinformatics
  - Pedagogy/teaching resources
  - Virtual training materials
NSF CCLI Project Retreat: Faculty Participants
Agnes Ayme-Southgate, College of Charleston, SC
Flight muscle development during life-stage transitions in Apis mellifera (honeybee)
Judy Brusslan, California State University, Long Beach, CA
Leaf development and senescence in Arabidopsis thaliana
Raymond Enke, James Madison University, VA
Retina development in Gallus gallus
Shaye Lewis, Prairie View A&M University, TX
Testes development from juvenile stage to puberty in Capra hircus (goat)
Irina Makarevitch, Hamline University, MN
Response to cold stress in maize
Judith Ogilvie, Saint Louis University, MO
Retinal changes of mice with retinitis pigmentosa
Jeremy Seto, New York City College of Technology, CUNY, NY
Differentiation of rat pheochromocytoma line cells (PC12) to a neuronal-like phenotype
Carrie Thurber, Abraham Baldwin Agricultural College, GA
Seed abscission in Sorghum bicolor
George Ude, Bowie State University, MD
Floral inflorescence genes in banana/plantains
Deirdre Vaden, Prairie View A&M University, TX
Peripheral blood mononuclear cells from hypertensive rats treated with captopril
Scott Woody, University of Wisconsin, WI
Gibberellic acid exposure in Brassica rapa (Fast Plants) gibberellic acid (gad) mutants
NSF CCLI Project Retreat: Flight muscle development during life-stage transitions in Apis mellifera (honeybee)
Agnes Ayme-Southgate, College of Charleston, SC
All honeybees begin as worker bees, flying short distances. Some honeybees transition into foragers, flying long distances. This transition necessitates major changes in flight muscles. Goal is to identify the gene expression changes in flight muscles during this transition.
Courses
• Biol 322: Developmental Biology, 30–38 students
• Genetics, 100 students
• Undergraduate research in lab, 2–3 students
NSF CCLI Project Retreat: Differential gene expression in Capra hircus (goat) testes during juvenile development
Shaye Lewis, Prairie View A&M University, TX
Fertility phenotypes show low heritability, and semen analysis parameters cannot determine fertility status. Molecular biomarkers can increase efficiency of artificial insemination and embryo transfer in goats. Goal is to identify genes important for normal testes development and function.
Courses
• 4533: Animal Breeding & Genetics, 20 students
• Undergraduate research in lab, 4 students
NSF CCLI Project Retreat: Understanding transcriptional response to cold stress in maize
Irina Makarevitch, Hamline University, MN
Maize is grown worldwide and is a staple for >1 billion people. Maize is thermophilic and sensitive to low temperatures, and understanding how plants respond to cold can improve yields. Goal is to identify genes that are differentially expressed when maize is grown under cold stress.
Courses
• Biol 201: Principles of Genetics, 80 students
• Biol 301: Genomics & Bioinformatics, 20 students
• Undergraduate research in lab, 4 students
NSF CCLI Project Retreat: RNA-Seq Datasets Generated and Analyzed Using the Green Line of DNA Subway
• 8 eukaryotic organisms
• 21 controls paired with 26 experimental conditions
• 402 Gbases sequenced
• 837 jobs submitted to TACC
• 87% of jobs completed
• 695 hours total CPU time
• 16 threads/processors running concurrently
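The job figures above can be turned into approximate counts with simple arithmetic. This sketch uses only the numbers on this slide; because the 87% figure is rounded, the derived values are estimates, and the variable names are chosen for illustration.

```python
jobs_submitted = 837     # jobs submitted to TACC
completion_rate = 0.87   # fraction of jobs completed (rounded on the slide)
total_cpu_hours = 695    # total CPU time in hours

# Approximate number of completed and uncompleted jobs
jobs_completed = round(jobs_submitted * completion_rate)
jobs_not_completed = jobs_submitted - jobs_completed

# Average CPU time per submitted job, in hours
cpu_hours_per_job = total_cpu_hours / jobs_submitted

print(jobs_completed, jobs_not_completed, round(cpu_hours_per_job, 2))  # 728 109 0.83
```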
Intended Implementation 2014-15 (students by course level)
• 100 level: Intro Biology
• 200 level: Genetics, 270; Molecular & Cell Biology, 50
• 300 level: Genetics, 220; Molecular Biology, 100; Genomics & Bioinformatics, 70; Developmental Biology, 35; Cell Structure & Function, 30; Synthetic Biology, 30; Anatomy/Physiology, 50; Advanced Genetic Techniques, 15
• 400 level: Cell & Molecular Biology, 75; Genomics, 40; Animal Breeding & Genetics, 20; Independent Research, 5
• 500 level: Molecular Applications in Crop Improvement, 15
• Undergrad Research
Totals: 100 level, 100s; 200 level, 320; 300 level, 550; 400 level, 140; 500 level, 20; Undergrad Research, 15
DNA Subway is…
Producers: Uwe Hilgert, David Micklos, Jason Williams
Designers: Eun-Sook Jeong, Susan Lauter
Programmers: Cornel Ghiban, Mohammed Khalfan, Sheldon McKay
Contributors: Matt Vaughn, Rion Dooley, Anthony Biondo, Jim Burnette, Scott Cain, Ed Lee, Zhenyuan Lu
Advisors: Matt Conte, Carson Holt, Bruce Nash, Oscar Pineda-Catalan
HPC in Undergraduate Biology Education
Banbury Center, CSHL, September 3-5, 2014
Contact Dave Micklos (micklos@cshl.edu)
A Great Gatsby era estate on Long Island’s “Gold Coast”
Funded by NSF and the Alfred P. Sloan Foundation