![Page 1: Genomics, Transcriptomics, and Proteomics: Engaging Biologists Richard LeDuc Manager, NCGAS eScience, Chicago 10/8/2012.](https://reader036.fdocuments.in/reader036/viewer/2022062308/56649e7d5503460f94b801f5/html5/thumbnails/1.jpg)
Genomics, Transcriptomics, and Proteomics: Engaging Biologists
Richard LeDuc
Manager, NCGAS
eScience, Chicago 10/8/2012
Central Dogma of Molecular Biology
• DNA (ATGGC ATA CC): DNA replicates itself
• mRNA: DNA is transcribed to RNA
• Protein: RNA is translated to protein
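The flow from DNA to mRNA to protein on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the talk: it assumes the input is the coding strand, uses the standard genetic code, and the function names (`transcribe`, `translate`) and the short input sequence are made up for the example.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna):
    """DNA -> mRNA: on the coding strand, T is replaced by U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """mRNA -> protein: read codons from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCATACC"  # illustrative input only
mrna = transcribe(dna)
print(mrna, translate(mrna))
```

Any bases left over after the last complete codon are ignored in this sketch.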
Central Dogma of Molecular Biology
• DNA: Genomics
• mRNA: Transcriptomics
• Protein: Proteomics
Tools of the Trade
Instruments
• Next-Generation Sequencers: Illumina, 454, PacBio
• Mass Spectrometers: 5 kinds of mass analyzers, hybrid analyzers, plus separation technology
Techniques
• Genome assembly
• RNA-sequencing
• ChIP-sequencing
• Methyl-sequencing
• Shotgun bottom-up proteomics
• 2D gel proteomics
• Top-down proteomics
Zhao et al. BMC Bioinformatics 2011, 12(Suppl 14):S2, http://www.biomedcentral.com/1471-2105/12/S14/S2
Figure © Vincent Montoya / wikipedia
Analysis as Data Reduction
• Proteomics (shotgun bottom-up)
3.4 GB of instrument data
172 MB (x1/20) of unstructured files (5,219 files in 67 folders)
13 MB of publishable results (x1/260)
Improved technology increases the size of the instrument files, but not usually the intermediate or final file sizes.
• DNA Sequencing
Often on the order of x1/2500 from start to finish
Instrument Data
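The reduction factors quoted for the proteomics example follow directly from the three file sizes. A minimal check, assuming decimal units (1 GB = 1000 MB); the helper name `reduction_factor` is introduced here only for illustration:

```python
def reduction_factor(start_mb, end_mb):
    """How many times smaller the data is at a later stage."""
    return start_mb / end_mb


# Sizes from the slide, converted to MB (assumes 1 GB = 1000 MB).
stages_mb = [
    ("instrument data", 3400.0),
    ("unstructured intermediate files", 172.0),
    ("publishable results", 13.0),
]

raw_mb = stages_mb[0][1]
for name, size_mb in stages_mb:
    factor = reduction_factor(raw_mb, size_mb)
    print(f"{name}: {size_mb:g} MB (x1/{factor:.0f} of instrument data)")
```

The computed ratios come out close to 20 and 260, matching the rounded figures on the slide.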
Options for Computational Support
Compute at the Instrument
• Supercomputer in a box
Many commercial vendors are entering with turn-key solutions to specific problems.
Limited variety of analytic expertise.
• Build Your Own Computational Center
A rack or two, a few servers, and you are good to go.
Only a subset of HPC skills is present in staff.
Computer Centers
• Biologists
Each has to learn to work with existing systems.
Few have specialized in HPC.
• Computer center
Support for hundreds of small projects.
Each project has different needs.
• Funded by National Science Foundation
1. Large memory clusters for assembly
2. Bioinformatics consulting for biologists
3. Optimized software for better efficiency
• Open for business at: http://ncgas.org
Making it easier for Biologists
• Web interface to NCGAS resources
• Supports many bioinformatics tools
• Available for both research and instruction.
[Figure labels: Common, Rare; Computational Skills from LOW to HIGH]
GALAXY.NCGAS.ORG Model
Virtual box hosting Galaxy.ncgas.org
The host for each tool is configured individually
[Diagram labels: Quarry, Mason, Data Capacitor, Archive]
NCGAS establishes tools, hardens them, and moves them into production.
Custom Galaxy tools can be made for moving data
Individual projects can get duplicate boxes – provided they support it themselves.
Policies on the DC guarantee that untouched data is removed with time.
NCGAS Sandbox Demo at SC 11
• STEP 1: data pre-processing, to evaluate and improve the quality of the input sequence
• STEP 2: sequence alignment to a known reference genome
• STEP 3: SNP detection to scan the alignment result for new polymorphisms
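The three steps can be illustrated with a toy pipeline. This is a sketch under strong simplifying assumptions, not the software used in the demo: quality trimming simply cuts at the first low-quality base, alignment is an ungapped brute-force scan, and a SNP is called wherever the majority base differs from the reference. All function names, thresholds, and sequences are hypothetical.

```python
from collections import Counter


def trim_read(seq, quals, min_q=20):
    """STEP 1: cut the read at the first base with quality below min_q."""
    for i, q in enumerate(quals):
        if q < min_q:
            return seq[:i]
    return seq


def align(read, reference):
    """STEP 2: return the ungapped offset with the fewest mismatches."""
    best_pos, best_mm = None, None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mm = sum(a != b for a, b in zip(read, window))
        if best_mm is None or mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos


def call_snps(reads, reference, min_depth=2):
    """STEP 3: report (position, reference base, observed base) where the
    majority base, seen at least min_depth times, differs from the reference."""
    pileup = {}
    for read in reads:
        pos = align(read, reference)
        if pos is None:  # read longer than the reference
            continue
        for offset, base in enumerate(read):
            pileup.setdefault(pos + offset, Counter())[base] += 1
    snps = []
    for pos in sorted(pileup):
        base, count = pileup[pos].most_common(1)[0]
        if count >= min_depth and base != reference[pos]:
            snps.append((pos, reference[pos], base))
    return snps


# Made-up data: two reads carry a G->T change at reference position 9.
reference = "ACGTTGCAAGCTTAGGCTAA"
raw_reads = [
    ("TGCAATCTTA", [30] * 10),
    ("CAATCTTAGG", [30] * 8 + [10, 10]),  # low-quality tail gets trimmed
    ("ACGTTGCAAGCT", [30] * 12),
]
clean_reads = [trim_read(seq, quals) for seq, quals in raw_reads]
snps = call_snps(clean_reads, reference)
print(snps)
```

Production workflows replace each function with a dedicated tool, but the data dependencies between the three steps are the same.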
Moving Forward
[Diagram labels: Your Friendly Neighborhood Sequencing Center (three shown); 10 Gbps; 100 Gbps; NCGAS Mason (free for NSF users); IU POD (12 cents per core hour); Data Capacitor (NO data storage charges)]
Other NCGAS XSEDE Resources…
Lustre WAN File System
Globus On-line and other tools
Optimized Software
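The link speeds and the IU POD price on this slide allow quick back-of-the-envelope estimates. The sketch below assumes ideal link throughput with no protocol overhead, reuses the 3.4 GB instrument file size from the proteomics example, and uses a hypothetical job size; the function names are illustrative only.

```python
def transfer_seconds(size_bytes, link_gbps):
    """Ideal transfer time: bits to move divided by link rate in bits/s."""
    return size_bytes * 8 / (link_gbps * 1e9)


def pod_cost_dollars(cores, hours, cents_per_core_hour=12):
    """Cost of a job at a flat per-core-hour price (12 cents on the slide)."""
    return cores * hours * cents_per_core_hour / 100


instrument_bytes = 3.4e9  # 3.4 GB of instrument data, decimal units assumed
for gbps in (10, 100):
    seconds = transfer_seconds(instrument_bytes, gbps)
    print(f"{gbps} Gbps: {seconds:.2f} s")

# Hypothetical job: 32 cores for 10 hours.
print(f"IU POD cost: ${pod_cost_dollars(32, 10):.2f}")
```

Real transfers are slower than these ideal figures because of protocol overhead, disk speed, and shared links.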
How would this work at scale?
1. Biologists use Galaxy and other web portals to move data and execute workflows
2. Instrument data transferred across Internet2
3. Data Capacitor flows data into Mason or other computational clusters
4. Data reduction allows “compute in place” to work
5. Data Capacitor mounts or mirrors reference data from NCBI or other sources
In Sum…
• Modern molecular biology, specifically the omics such as genomics, transcriptomics, and proteomics, provides many tools for answering many questions, but no single solution meets all needs.
• The amount of data generated decreases along a workflow. This has implications for both storage and analysis.
• NCGAS can provide a national scale infrastructure to better serve the needs of biologists who cannot become bioinformaticians to accomplish their research.
• Increasingly specialized skills are needed to provide best-practice solutions at all steps in a workflow.
Thank You
Questions?
Bill Barnett ([email protected])
Rich LeDuc ([email protected])
Le-Shin Wu ([email protected])
Carrie Ganote ([email protected])
NCGAS Cyberinfrastructure at IU
• Mason large memory cluster (512 GB/node)
• Quarry cluster (16 GB/node)
• Data Capacitor (1 PB at 20 Gbps throughput)
• Research File System (RFS) for data storage
• Research Database Cluster for managing data sets
• All interconnected with a high speed internal network (40 Gbps)
Acknowledgements & disclaimer
• This material is based upon work supported by the National Science Foundation under Grant No. ABI-1062432
• This work was supported in part by the Lilly Endowment, Inc. and the Indiana University Pervasive Technology Institute
• Any opinions presented here are those of the presenter(s) and do not necessarily represent the opinions of the National Science Foundation or any other funding agencies
License terms
• Please cite as: LeDuc, R.D., Genomics, Transcriptomics, and Proteomics: Engaging Biologists, presented at Extending High-Performance Computing Beyond its Traditional User Communities, Co-located with the 8th IEEE International Conference on eScience, Chicago, USA, October 8, 2012. Available from: http://hdl.handle.net/2022/14746
• Items indicated with a © are under copyright and used here with permission. Such items may not be reused without permission from the holder of copyright except where license terms noted on a slide permit reuse.
• Except where otherwise noted, contents of this presentation are copyright 2011 by the Trustees of Indiana University.
• This document is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.