National Cancer Institute
October 6, 2009
Jack R. Collins, Ph.D.
Director, Advanced Biomedical Computing Center
National Cancer Institute
Frederick, Maryland, USA
NCI-Frederick/SAIC-Frederick, Inc.
Applying HPC to Biology: The Digital Age
ABCC Mission Summary
Provide high performance computational resources to the NCI/NIH biomedical community
Provide storage, backup, network, access control, system administration and security functionalities to NCI/NCI-F
Support NCI initiatives including imaging, bioinformatics, proteomics, and nanobiology
Provide new computational technologies for application to biomedical problems
Time to solution must be measured in heartbeats
In 2008, one person is expected to die from cancer every 56 seconds in the United States. HPC must enable scientists to impact cancer treatment.
Paradigm Shift in Biology
Computers are getting fast enough and we are now collecting enough data that we can begin to generate reasonable models that can be tested and refined to better mimic reality.
If an approximate model can help “refine” 10% of the HT experiments at NCI, it could save over $1M per year in consumables and accelerate scientific understanding.
Computer Science is starting to notice: (Many recent articles in ACM/IEEE journals.)
NCI Vision for Translational Research
[Diagram labels: Functional Biology Ctr.; High-throughput target screening; Chemical Biology Consort.; GMP production; Preclinical testing; Academic res. labs; Private sector; CLOUD (Patient data, Science data); TCGA; TARGET; CGEMS; caHUB; caBIG; BigHEALTH; Ca eHR; Characterization center; University ca ctr.; NCCCP; SPOREs; CCOPs; Coop. Grps.; Clinical Ctr.; Causal pathways; Tissue; Patient selection; Grantee consortia; Imaging; Sequencing and Microarray; Nanotechnology; Proteomics; HIV Drug Resistance; CAPR; GEMs; Molecular Structure; caBIG®]
Data Driven Computation
Integration and Understanding is key
[Diagram topics: Next-Gen Sequencing; Metabolomics; Structural Biology; Epigenomics; Regulatory Networks; Nanotechnology; Micro-array; Protein Pathways; Drug Design (traditional); Comparative Genomics; GWAS; Systems Biology; Data Analytics; Pattern Recognition; Proteomics; Image Analysis / Visualization; Clinical Outcome]
“Next-Generation” Sequencing Technologies
Not just one.
But “farms” in multiple labs.
One Illumina paired-end run generates ~7 TB of raw data.
NextGen Data “Tsunami”
[Chart: TB of data per year, 2009 to 2015, y-axis 0 to 10,000 TB; sequencing throughput labels 20, 50, 100, 300, 500, 1000, and 2500 Mb/hr; instrument label SOLiD™]
The Cancer Genome Atlas
• Cancer Genome Atlas Gets $275M Funding from Stimulus, NCI and NHGRI (October 01, 2009)
• NEW YORK (GenomeWeb News) – The Cancer Genome Atlas project will receive a total of $275 million over the next two years to fund genomic mapping of more than 20 types of cancer.
• The $175 million in ARRA funding announced Wednesday by President Barack Obama will be buttressed by an additional $100 million from the National Cancer Institute and the National Human Genome Research Institute.
• Obama said in a speech at the National Institutes of Health's campus on Wednesday that genomics and human genetics research have begun to generate hope for cancer treatments, but, "We've only scratched the surface of these kinds of treatments, because we've only begun to understand the relationship between our environment and genetics in causing and promoting cancer."
• Over the two-year period, TCGA plans to collect more than 20,000 tissue samples from more than 20 cancer types, complete maps of the genomic changes in 10 of those cancers, and sequence and characterize at least 100 tumors of as many as 15 additional cancers. These maps will be deposited into public databases for use by the worldwide research community in research programs aimed at finding new ways to diagnose, treat, and prevent cancer.
TCGA Storage / Compute Requirements
− 600 GB per patient per disease
− 500 patients per disease
  • 300 TB of data per disease
− 20 cancer types
  • 6 PB of primary data
• Data Annotation
• Data Integration
• Analysis of high-dimensional data for patterns
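The storage totals above follow from simple multiplication. A minimal sketch, assuming decimal units (1 TB = 1000 GB, 1 PB = 1000 TB); the function and constant names are illustrative and do not come from the slides:

```python
# Back-of-the-envelope TCGA storage estimate using the figures on the slide.
# ASSUMPTION: decimal units (1 TB = 1000 GB, 1 PB = 1000 TB).

GB_PER_PATIENT = 600        # GB per patient per disease (from the slide)
PATIENTS_PER_DISEASE = 500  # patients per disease (from the slide)
CANCER_TYPES = 20           # cancer types (from the slide)


def tcga_storage_estimate(gb_per_patient, patients, diseases):
    """Return (TB per disease, PB of primary data)."""
    tb_per_disease = gb_per_patient * patients / 1000
    pb_total = tb_per_disease * diseases / 1000
    return tb_per_disease, pb_total


tb_per_disease, pb_total = tcga_storage_estimate(
    GB_PER_PATIENT, PATIENTS_PER_DISEASE, CANCER_TYPES
)
print(f"{tb_per_disease:.0f} TB per disease, {pb_total:.0f} PB of primary data")
```

This reproduces the slide's 300 TB per disease and 6 PB of primary data; annotation, integration, and derived analysis products would come on top of that.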
Google as an HPC model
But I don’t want just results, I want relationships between my results based on ontologies and other metrics.
Analyzing High-dimensional Data
o Complex Task: Computer Scientists / Mathematicians Needed!
o Non-intuitive properties (e.g. Mario Valle, 2008) … so Efficient Methods/Algorithms Needed!
o Appropriate Computing Platforms (memory, multi-core, cell, FPGA, GPGPU, ?)
o An Interface to Utilize the Compute Platform (Programming Model for mortals with finite time)
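One commonly cited non-intuitive property of high-dimensional data is distance concentration: as the number of dimensions grows, the nearest and farthest neighbors of a point end up at almost the same distance. This is not necessarily the example the slide has in mind. The sketch below is a small illustration on uniform random data; the function name, point count, and seed are assumptions chosen for the demonstration:

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query point
    to n_points uniform random points in the unit hypercube [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


# ASSUMPTION: dimensions chosen only to show the trend.
for dim in (2, 10, 100, 1000):
    print(f"dim={dim:5d}  relative contrast={relative_contrast(dim):.3f}")
```

The contrast shrinks sharply as the dimension rises, which is one reason nearest-neighbor and clustering methods that work in 2D or 3D need rethinking for high-dimensional biological data.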
NCI In Silico Research Centers
Supporting investigator-initiated, hypothesis-driven research into the etiology, treatment, and prevention of cancer using in silico methods
• Generating and publishing novel cancer research findings mining existing data resources such as TCGA
• Identifying novel bioinformatics processes and tools to exploit existing data resources
Advocacy for and input into caBIG enhancements
• integration and interoperability of data and analytical services
• Infrastructure
• NCI Investing in in silico research pilot over next 3 years
• Five extramural and one intramural award
Telomerase Targeted Anti-cancer Agents
Telomerase Discovery -> Nobel Prize in Medicine 2009
Mol. Cancer Therapeutics Vol. 1:103 (Dec 2001)
Imaging
Tumor Angiogenesis Modeling
Tumor Segmentation (GBM)
New Fluorescent Markers
Confocal Image Analysis
Cellular Imaging - Biomedical Uses
(Emphasis on oncology and personalized medicine)
• FRET Studies of protein dynamics and function
• Single cell molecular profiling via antibody labeling different proteins in different tissue sections.
• Accurate delineation of the edges of tumors.
• Assessment of vascularization of tumors.
• Assessment of immune cell infiltration.
• Localization of proliferating and apoptotic cells.
• Determine sites of extra-cellular matrix degradation and cell invasion.
• Investigate metastasis
• Analysis of genomic instability and gene organization using FISH labeling
• FRAP investigation of protein diffusion kinetics within cells
Fluorescent Proteins - NCI/ABCC focus
Green Fluorescent Protein
Teal Fluorescent Protein
Yellow Fluorescent Protein
Red Fluorescent Protein
Copyright 2004-2009 OLYMPUS CORPORATION All Rights Reserved
Approximately 3000 fluorescent probes for biology.
http://probes.invitrogen.com/handbook/
Protein Engineering / rational design
A priori calculation of spectroscopic characteristics due to different chromophores
A priori calculation of spectroscopic shifts due to mutations in protein.
Accurately estimate quantum yields
A priori calculation of maturation kinetics
Calculate factors in thermal stability and protein-protein interactions
Accuracy of calculations
Typical errors of standard quantum chemistry calculations for such systems, even in the gas phase, may amount to 0.2-0.5 eV
Errors of 50 nm for the optical range ~ 500 nm are too large
S0 -> S1: 2.5 eV ~ 500 nm (GREEN)
Let us assume the error +0.25 eV ~ 50 nm: S0 -> S1 becomes 2.75 eV ~ 450 nm (BLUE)
Let us assume the error -0.25 eV ~ 50 nm: S0 -> S1 becomes 2.25 eV ~ 550 nm (Yellow)
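The energy-to-wavelength figures above can be checked with the photon relation λ = hc/E. A minimal sketch, assuming hc ≈ 1239.84 eV·nm; the function name is illustrative:

```python
# Photon energy -> wavelength: lambda (nm) = h*c / E, with h*c ~ 1239.84 eV*nm.
HC_EV_NM = 1239.84  # Planck constant times speed of light, in eV*nm


def ev_to_nm(energy_ev):
    """Convert a transition energy in eV to a wavelength in nm."""
    return HC_EV_NM / energy_ev


for label, energy in (("reference", 2.50), ("+0.25 eV", 2.75), ("-0.25 eV", 2.25)):
    print(f"{label:>10}: {energy:.2f} eV -> {ev_to_nm(energy):.0f} nm")
```

The results land within a few nanometers of the slide's rounded 500, 450, and 550 nm, which shows how an error of only 0.25 eV is enough to move a predicted color from green to blue or yellow.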
S0-S1 excitation energies for the free neutral (N) and anionic (A) chromophore of GFP
A-Chromophore λ(exp) = 479 nm

Method                         N-Chromophore              A-Chromophore
                               ΔE (eV)  λ (nm)  f         ΔE (eV)  λ (nm)  f
TDDFT/B3LYP//B3LYP/6-31+G**    3.46     358.8   0.69      4.18     296.3   0.10
                                                          3.06     405.5   0.98
TDDFT/BP86//B3LYP/6-31+G**     3.19     389.0   0.55      3.65     340.0   0.15
                                                          2.94     422.1   0.86
CIS//B3LYP/6-31+G**            4.43     279.8   1.14      6.99     177.4   0.52
                                                          3.75     330.3   1.55
ZINDO//B3LYP/6-31+G**          3.45     359.9   0.96
                                                          2.59     479.1   1.22
GFP
mTFP1
*Hui-wang Ai et al., BMC Biology, 2008, 6:13
Crystal structure of mTFP1 (2HQK)
Absorption Fluorescence
Chromophore in GFP and mTFP1
Computational Cost
[Chart axis residue: 2K; Log2 scale with ticks 4, 8, 16, 32, 64, 256; 8x32]
Computational Storage / Cost O(N)
Increasing pixel size (intensity palette)
2K X 2K X 16bit = 8MB
200K X 200K X 48bit = 240GB
50 angles = 12TB per image
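The image sizes above follow from width × height × bits per pixel. A minimal sketch, assuming uncompressed storage, "2K" read as 2048 pixels, and "200K" read as 200,000 pixels; the function name is illustrative:

```python
def image_bytes(width_px, height_px, bits_per_pixel, n_angles=1):
    """Uncompressed storage in bytes for n_angles images of the given size."""
    return width_px * height_px * bits_per_pixel // 8 * n_angles


# ASSUMPTION: "2K" = 2048 pixels (binary), "200K" = 200,000 pixels (decimal).
small = image_bytes(2048, 2048, 16)                # 2K X 2K X 16bit
large = image_bytes(200_000, 200_000, 48)          # 200K X 200K X 48bit
large_50 = image_bytes(200_000, 200_000, 48, 50)   # same image at 50 angles

print(f"{small / 2**20:.0f} MiB")     # binary megabytes
print(f"{large / 1e9:.0f} GB")        # decimal gigabytes
print(f"{large_50 / 1e12:.0f} TB")    # decimal terabytes
```

Under these assumptions the numbers match the slide: 8 MB, 240 GB, and 12 TB per image set. Storage grows linearly with the pixel count, which is the O(N) cost noted above.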
Replica + Steered Dynamics
Ga = 7.3kcal/mol
50ns / trajectory
[Figure panels: MD1, MD2, MD3, MD4, MD5, MD6]
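To give a sense of why 50 ns trajectories call for HPC, the number of integration steps can be estimated. This is a rough sketch only: the 2 fs timestep is an assumption (a common choice in molecular dynamics, not stated on the slide), as is reading MD1 to MD6 as six separate trajectories:

```python
def md_steps(trajectory_ns, timestep_fs):
    """Number of integration steps needed for one trajectory."""
    return round(trajectory_ns * 1e6 / timestep_fs)  # 1 ns = 1e6 fs


TRAJECTORY_NS = 50   # from the slide
N_TRAJECTORIES = 6   # ASSUMPTION: MD1..MD6 are six separate trajectories
TIMESTEP_FS = 2.0    # ASSUMPTION: a common MD timestep, not stated on the slide

steps_each = md_steps(TRAJECTORY_NS, TIMESTEP_FS)
print(f"{steps_each:,} steps per trajectory")
print(f"{steps_each * N_TRAJECTORIES:,} steps for {N_TRAJECTORIES} trajectories")
```

Each step requires a full force evaluation over every atom in the system, so tens of millions of steps per trajectory quickly add up to a large compute budget.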
HPC Required to Differentiate Structure/Properties
Overall Computational Cost
[Chart axis residue: 2K; Log2 scale with ticks 4, 8, 16, 32, 64, 256; 8x32]
Computational Storage / Cost O(N)
Modeling Cost
Critical Biological/Toxicity Differences
DF1, DF1-Mini
[Diagram labels: Hemolysis / Cell Lysis; Membrane Interaction; Kidney Failure (salt, pH); Aggregation (salt, pH); Liver Failure (hydrophobic); Aggregation (hydrophobic); Immune Resp.; IgG; Anti-drug or Protective; Reactivity (OH Capture)]
Structure: Non-intuitive Results Explain Toxicity
Note the large differences in exposed fullerene surface between the two 3D model structures
DF1
DF1-Mini
HPC “Compute Cloud”
NIH currently gathering requirements across all of the Institutes
Virtualization may have a bigger impact in the near term.
Data Tsunami – What do we need?
“Bioinformatics is modern biology”
Not just more data
data is more complex
Storage
• High capacity (PB) storage farms
• High-speed access to PB of storage
• Automated MetaData Extraction
• Relational Data Integration
• Distribution / Security
Network
• Data Transfer
• Collaborative Interaction
• Access to National Resources, Both Compute and Experimental
Compute
• Not Just Floating Point!
• Data Analysis / Mining for High-Dimensional PB Datasets
• New Algorithms/Software -> Hypothesis Generation
• Software/Languages to implement the algorithms in parallel …
• On Heterogeneous compute platforms (CPU, GPGPU, Cell, FPGA)
Information -> Knowledge
• Proper analysis and Visualization lead to …
• Human Understanding
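The need for high-speed access and networks becomes concrete when estimating how long it takes to move a petabyte-scale dataset. A minimal sketch using the 6 PB TCGA figure from earlier in the talk; the link speeds and the assumption of a fully sustained transfer rate are illustrative, not from the slides:

```python
def transfer_days(size_bytes, link_gbit_per_s, efficiency=1.0):
    """Days to move size_bytes over a link at the given sustained rate."""
    bits = size_bytes * 8
    seconds = bits / (link_gbit_per_s * 1e9 * efficiency)
    return seconds / 86_400  # seconds per day


PB = 10**15  # decimal petabyte
# ASSUMPTION: illustrative link speeds with 100% sustained efficiency.
for gbit in (1, 10, 100):
    print(f"6 PB over {gbit:>3} Gbit/s: {transfer_days(6 * PB, gbit):7.1f} days")
```

Even at a sustained 10 Gbit/s, moving 6 PB takes on the order of eight weeks, which is why placing compute close to the data matters as much as raw capacity.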
Acknowledgements
• Igor Topol
• Bob Stephens, Alex Levitsky
• Brian Luke
• Robert Wells and Albino Bacolla
• Yanling Liu, Stephen Lockett, Joe Kalen, Chris Kurcz
• Raul Cachau
• And you - Thank You
What do we need?
Computation:
• New algorithms for parallel computers
• Software/Languages to implement the algorithms
• Heterogeneous compute platforms (CPU, GPGPU, Cell, FPGA)
Storage and Access:
• High capacity (PB) storage farms
• High-speed access to PB of storage
• High-speed networks
Information -> Knowledge:
• Proper analysis and Visualization
• Understanding and progress
“Consolidate locally, distribute globally”