DNA Subway Green Line Onramp to HPC in Biology Education Dave Micklos and Uwe Hilgert iPlant...

Post on 16-Dec-2015


DNA Subway Green Line Onramp to HPC in Biology Education

Dave Micklos and Uwe Hilgert

iPlant Collaborative

DNA Learning Center, Cold Spring Harbor Laboratory; Bio5 Institute, University of Arizona

…ride an educational Discovery Environment

Green Line: RNA Sequence (RNA-Seq) Analysis

• First fully graphical (GUI) interface for RNA-Seq analysis, with no command line or data conversions

• Accesses XSEDE system through the iPlant Agave API

• Co-localizes up to 100 GB of data in the iPlant Data Store

• Look for differential gene expression in different tissues, life stages, or treatments

• Generate lists of expressed genes and fold-changes

• Annotate sequenced genomes; add results to Red Line projects
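A fold-change compares a gene's expression level between two conditions, usually on a log2 scale. The sketch below is a simplified illustration only, not the statistical model the Green Line tools apply; the gene names, expression values, and pseudocount are all hypothetical.

```python
import math


def log2_fold_change(control: float, treatment: float, pseudocount: float = 1.0) -> float:
    """Return log2((treatment + pseudocount) / (control + pseudocount)).

    The pseudocount (an assumption of this sketch) avoids division by zero
    when a gene is not detected in one condition.
    """
    return math.log2((treatment + pseudocount) / (control + pseudocount))


# Hypothetical expression values: gene -> (control, treatment)
expression = {
    "geneA": (3.0, 15.0),   # higher in treatment
    "geneB": (7.0, 7.0),    # unchanged
    "geneC": (31.0, 7.0),   # lower in treatment
}

for gene, (control, treatment) in expression.items():
    print(gene, round(log2_fold_change(control, treatment), 2))
```

A positive value means higher expression in the treatment, a negative value means lower expression, and zero means no change.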

150 feet

RNA code represents “active” DNA in genome

Homo sapiens bitter taste receptor (TAS2R38) DNA code > RNA code

CCTTTCTGCACTGGGTGGCAACCAGGTCTTTAGATTAGCCAACTAGAGAAGAGAAGTAGAATAGCCAATTAGAGAAGTGACATCATGTTGACTCTAACTCGCATCCGCACTGTGTCCTATGAAGTCAGGAGTACATTTCTGTTCATTTCAGTCCTGGAGTTTGCAGTGGGGTTTCTGACCAATGCCTTCGTTTTCTTGGTGAATTTTTGGGATGTAGTGAAGAGGCAGGCACTGAGCAACAGTGATTGTGTGCTGCTGTGTCTCAGCATCAGCCGGCTTTTCCTGCATGGACTGCTGTTCCTGAGTGCTATCCAGCTTACCCACTTCCAGAAGTTGAGTGAACCACTGAACCACAGCTACCAAGCCATCATCATGCTATGGATGATTGCAAACCAAGCCAACCTCTGGCTTGCTGCCTGCCTCAGCCTGCTTTACTGCTCCAAGCTCATCCGTTTCTCTCACACCTTCCTGATCTGCTTGGCAAGCTGGGTCTCCAGGAAGATCTCCCAGATGCTCCTGGGTATTATTCTTTGCTCCTGCATCTGCACTGTCCTCTGTGTTTGGTGCTTTTTTAGCAGACCTCACTTCACAGTCACAACTGTGCTATTCATGAATAACAATACAAGGCTCAACTGGCAGATTAAAGATCTCAATTTATTTTATTCCTTTCTCTTCTGCTATCTGTGGTCTGTGCCTCCTTTCCTATTGTTTCTGGTTTCTTCTGGGATGCTGACTGTCTCCCTGGGAAGGCACATGAGGACAATGAAGGTCTATACCAGAAACTCTCGTGACCCCAGCCTGGAGGCCCACATTAAAGCCCTCAAGTCTCTTGTCTCCTTTTTCTGCTTCTTTGTGATATCATCCTGTGCTGCCTTCATCTCTGTGCCCCTACTGATTCTGTGGCGCGACAAAATAGGGGTGATGGTTTGTGTTGGGATAATGGCAGCTTGTCCCTCTGGGCATGCAGCCATCCTGATCTCAGGCAATGCCAAGTTGAGGAGAGCTGTGATGACCATTCTGCTCTGGGCTCAGAGCAGCCTGAAGGTAAGAGCCGACCACAAGGCAGATTCCCGGACACTGTGCTGAGAATGGACATGAAATGAGCTCTTCATTAATACGCCTGTGAGTCTTCATAAATATGCC
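The "DNA code > RNA code" step can be sketched in a few lines. This assumes the sequence above is the coding (sense) strand, so the RNA has the same sequence with U in place of T; the function name is hypothetical.

```python
def dna_to_rna(coding_strand: str) -> str:
    """Return the RNA sequence matching a DNA coding strand (T replaced by U)."""
    seq = coding_strand.upper()
    invalid = set(seq) - set("ACGT")
    if invalid:
        raise ValueError(f"unexpected characters: {sorted(invalid)}")
    return seq.replace("T", "U")


# First 30 bases of the TAS2R38 sequence shown above
fragment = "CCTTTCTGCACTGGGTGGCAACCAGGTCTT"
print(dna_to_rna(fragment))
```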


Differential Gene Expression

RNA Sequence (RNA-Seq) gives a “snapshot” of genes active in different cells at different times


RNA Sequence (RNA-Seq) Analysis

Isolate total RNA; convert to DNA library

Design RNA-Seq experiment, e.g., differential expression

Sequence experiment and control libraries

Analyze sequence data on DNA Subway Green Line

Follow-up experimental validation

Image source: http://www.bgisequence.com

1) Manage Data: Quality assessment with FastQC; ~100 million 75/150-nucleotide reads in <1 hr

2) FastX Toolkit: Quality control with FastX Toolkit; ~100 million 75/150-nucleotide reads in <1 hr (some runs took up to 19 hr)

3) TopHat: Aligns ~100 million 75/150-nucleotide (paired-end) reads to a reference genome of 100M–5B bases in 6–19 hr

TopHat Alignment: JBrowse

4) CuffLinks: Assembles transcripts and calculates abundance from BAM files of 1–12 GB in 6–19 hr

5) CuffDiff: Merges assemblies from Cufflinks and performs differential expression analysis on 4–9 samples in 6–19 hr
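For readers curious what the Green Line hides behind its GUI, the five steps above correspond roughly to a chain of command-line calls. The sketch below only builds and prints those commands; nothing is executed. It is an assumption-laden illustration, not the commands DNA Subway actually submits: file names, thresholds, thread count, and the single-end layout are hypothetical, and a Cuffmerge step is assumed for the "merges assemblies" part of step 5.

```python
import shlex
from pathlib import Path


def pipeline_commands(samples, bowtie_index, workdir="rnaseq_out", threads=16):
    """Build one argv list per step; this function does not run anything.

    samples maps a condition label to a single-end FASTQ file (hypothetical names).
    """
    work = Path(workdir)
    commands, bam_files = [], []
    for label, fastq in samples.items():
        filtered = work / f"{label}.filtered.fastq"
        tophat_dir = work / f"{label}_tophat"
        bam = tophat_dir / "accepted_hits.bam"
        commands += [
            # 1) Quality assessment
            ["fastqc", "-o", str(work), fastq],
            # 2) Quality control: keep reads with >=80% of bases at quality >=20
            #    (-Q33 assumes Sanger / Illumina 1.8+ quality encoding)
            ["fastq_quality_filter", "-Q33", "-q", "20", "-p", "80",
             "-i", fastq, "-o", str(filtered)],
            # 3) Align reads to the reference genome
            ["tophat", "-p", str(threads), "-o", str(tophat_dir),
             bowtie_index, str(filtered)],
            # 4) Assemble transcripts and estimate abundance
            ["cufflinks", "-p", str(threads),
             "-o", str(work / f"{label}_cufflinks"), str(bam)],
        ]
        bam_files.append(str(bam))
    # 5) Merge assemblies, then test for differential expression.
    #    assemblies.txt must list each sample's transcripts.gtf, one per line.
    merged = work / "merged"
    commands.append(["cuffmerge", "-p", str(threads), "-o", str(merged),
                     str(work / "assemblies.txt")])
    commands.append(["cuffdiff", "-p", str(threads), "-o", str(work / "cuffdiff"),
                     "-L", ",".join(samples), str(merged / "merged.gtf"), *bam_files])
    return commands


demo = pipeline_commands(
    {"control": "control.fastq", "treated": "treated.fastq"}, "genome_index"
)
for argv in demo:
    print(shlex.join(argv))
```

Paired-end data, as used in the TopHat step above, would need both read files passed to the aligner and a filtering approach that keeps mates in sync; that is left out here for brevity.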

Green Line: Queue time vs. run time

• Asking for a high run time leads to longer queue times

• Asking for a short run time may lead to the job being terminated

• Users don't like to wait too long

• Users want the results right away

• Finding the right balance is not easy
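One simple way to approach this balance is to base the requested run time on how long similar jobs took before, plus a margin. The sketch below is hypothetical: the function name, the 25% margin, and the 48-hour queue limit are assumptions for illustration, not values used by the Green Line or XSEDE.

```python
import math


def suggest_walltime_hours(past_runtimes_hours, safety_margin=1.25, queue_limit_hours=48.0):
    """Suggest a run-time request from earlier runs of similar jobs.

    Takes the longest observed run, adds a safety margin so the job is not
    terminated early, rounds up to whole hours, and caps the request at the
    (hypothetical) queue limit so queue time does not grow needlessly.
    """
    if not past_runtimes_hours:
        raise ValueError("need at least one past run time")
    requested = math.ceil(max(past_runtimes_hours) * safety_margin)
    return min(float(requested), queue_limit_hours)


# Run times in the 6-19 hr range reported for the alignment step
print(suggest_walltime_hours([6, 11, 19]))
```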

Green Line: Dealing with the unexpected

• Systems taken offline

• Maintenance

• Network outages, data transfer issues

• Science API glitches

• Authentication
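Transient problems such as network outages are commonly handled by retrying with increasing delays. The sketch below shows that general pattern only; it is not how DNA Subway is implemented, and the function names and the simulated failure are hypothetical.

```python
import time


def run_with_retries(action, attempts=4, base_delay_s=1.0, sleep=time.sleep):
    """Call action(); on a connection or timeout error, wait and try again.

    The delay doubles after each failure (1 s, 2 s, 4 s, ...). The sleep
    function is injectable so the behavior can be checked without waiting.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return action()
        except (ConnectionError, TimeoutError) as err:
            last_error = err
            if attempt < attempts - 1:
                sleep(base_delay_s * 2 ** attempt)
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


# Simulated data transfer that fails twice, then succeeds
state = {"calls": 0}


def flaky_transfer():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("simulated network outage")
    return "transfer complete"


recorded_delays = []
print(run_with_retries(flaky_transfer, sleep=recorded_delays.append))
print(recorded_delays)
```

Retrying helps with short outages; scheduled maintenance or expired authentication still needs a clear message to the user.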

Green Line: “Monitoring XSEDE”

DNA Subway: “Power Desktop”

• Intuitive interface to support seamless genome “round trip” for eukaryote of choice

• Access high performance computing to analyze whole genome data (RNA-seq, initially)

• Scaffold data to sequenced genomes available in iPlant Data Store

• Directly upload RNA-seq reads as biological evidence for genome annotation using Red Line

NSF CCLI Project Retreat: June 8–20, 2014, CSHL

• 11 faculty from PUIs

• Program included lectures/practical sessions:

  • Wet lab: RNA library prep

  • Green Line analysis & bioinformatics

  • Pedagogy/teaching resources

  • Virtual training materials

NSF CCLI Project Retreat: Faculty Participants

Agnes Ayme-Southgate, College of Charleston, SC
Flight muscle development during life-stage transitions in Apis mellifera (honeybee)

Judy Brusslan, California State University, Long Beach, CA
Leaf development and senescence in Arabidopsis thaliana

Raymond Enke, James Madison University, VA
Retina development in Gallus gallus

Shaye Lewis, Prairie View A&M University, TX
Testes development from juvenile to puberty in caprine (goat)

Irina Makarevitch, Hamline University, MN
Response to cold stress in maize

Judith Ogilvie, Saint Louis University, MO
Retinal changes of mice with retinitis pigmentosa

Jeremy Seto, New York City College of Technology, CUNY, NY
Differentiation of rat pheochromocytoma line cells (PC12) to a neuronal-like phenotype

Carrie Thurber, Abraham Baldwin Agricultural College, IL
Seed abscission in Sorghum bicolor

George Ude, Bowie State University, MD
Floral inflorescence genes in banana/plantains

Deirdre Vaden, Prairie View A&M University, TX
Peripheral blood mononuclear cells from hypertensive rats treated with captopril

Scott Woody, University of Wisconsin, WI
Gibberellic acid exposure in Brassica rapa (Fast Plants) gibberellic acid (gad) mutants

NSF CCLI Project Retreat: Flight muscle development during life-stage transitions in Apis mellifera (honeybee)

Agnes Ayme-Southgate, College of Charleston, SC

All honeybees begin as worker bees, flying short distances. Some honeybees transition into foragers, flying long distances. This transition necessitates major changes in flight muscles. The goal is to identify the gene expression changes in flight muscles during this transition.

Courses

• Biol 322: Developmental Biology, 30–38 students

• Genetics, 100 students

• Undergraduate research in lab, 2–3 students

NSF CCLI Project Retreat: Differential gene expression in Capra hircus (goat) testes during juvenile development

Shaye Lewis, Prairie View A&M University, TX

Fertility phenotypes show low heritability, and semen analysis parameters cannot determine fertility status. Molecular biomarkers can increase the efficiency of artificial insemination and embryo transfer in goats. The goal is to identify genes important for normal testes development and function.

Courses

• 4533: Animal Breeding & Genetics, 20 students

• Undergraduate research in lab, 4 students

NSF CCLI Project Retreat: Understanding transcriptional response to cold stress in maize

Irina Makarevitch, Hamline University, MN

Maize is grown worldwide and is a staple for >1 billion people. Maize is thermophilic and sensitive to low temperatures, and understanding how plants respond to cold can improve yields. The goal is to identify genes that are differentially expressed when maize is grown under cold stress.

Courses

• Biol 201: Principles of Genetics, 80 students

• Biol 301: Genomics & Bioinformatics, 20 students

• Undergraduate research in lab, 4 students

NSF CCLI Project Retreat: RNA-Seq Datasets Generated and Analyzed Using the Green Line of DNA Subway

• 8 eukaryotic organisms

• 21 controls paired with 26 experimental conditions

• 402 Gbases sequenced

• 837 jobs submitted to TACC

• 87% of jobs completed

• 695 hours total CPU time

• 16 threads/processors running concurrently

Intended Implementation 2014-15

Students by course level:

100 level: 100s
200 level: 320
300 level: 550
400 level: 140
500 level: 20
Undergrad Research: 15

Courses (students): Intro Biology; Genetics, 270; Molecular & Cell Biology, 50; Genetics, 220; Molecular Biology, 100; Genomics & Bioinformatics, 70; Developmental Biology, 35; Cell Structure & Function, 30; Synthetic Biology, 30; Anatomy/Physiology, 50; Advanced Genetic Techniques, 15; Cell & Molecular Biology, 75; Genomics, 40; Animal Breeding & Genetics, 20; Independent Research, 5; Molecular Applications in Crop Improvement, 15

DNA Subway is…

Producers: Uwe Hilgert, David Micklos, Jason Williams

Designers: Eun-Sook Jeong, Susan Lauter

Programmers: Cornel Ghiban, Mohammed Khalfan, Sheldon McKay

Contributors: Matt Vaughn, Rion Dooley, Anthony Biondo, Jim Burnette, Scott Cain, Ed Lee, Zhenyuan Lu

Advisors: Matt Conte, Carson Holt, Bruce Nash, Oscar Pineda-Catalan

HPC in Undergraduate Biology Education

Banbury Center, CSHL, September 3-5, 2014

Contact Dave Micklos (micklos@cshl.edu)

A Great Gatsby era estate on Long Island’s “Gold Coast”

Funded by NSF and the Alfred P. Sloan Foundation