Computational Biology: Practical lessons and thoughts for the future
Dr. Craig A. Stewart
[email protected]
Visiting Scientist, Höchstleistungsrechenzentrum Universität Stuttgart
Director, Research and Academic Computing, University Information Technology Services
Director, Information Technology Core, Indiana Genomics Initiative
19 June 2003
License terms
• Please cite as: Stewart, C.A. Computational Biology: Practical lessons and thoughts for the future. 2003. Presentation. Presented at: ZIH, Technische Universität Dresden (Dresden, Germany, 19 Jun 2003). Available from: http://hdl.handle.net/2022/14802
• Except where otherwise noted, by inclusion of a source url or some other note, the contents of this presentation are © by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.
Outline
• The revolution in biology & IU’s response – the Indiana Genomics Initiative
• Example software applications
  – Centralized Life Sciences Database Service
  – fastDNAml
• What are the grand challenge problems in computational biology?
• Some thoughts about dealing with biological and biomedical researchers in general
• A brief description of IU’s high performance computing, storage, and visualization environments
The revolution in biology
• Automated, high-throughput sequencing has revolutionized biology.
• Computing has been a part of this revolution in three ways so far:
  – Computing has been essential to the assembly of genomes
  – There is now so much biological data available that it is impossible to use it effectively without the aid of computers
  – Networking and the Web have made biological data generally and publicly available
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Indiana Genomics Initiative (INGEN)
• Created by a $105M grant from the Lilly Endowment, Inc. and launched in December 2000
• Build on traditional strengths and add new areas of research for IU
• Perform the research that will generate new treatments for human disease in the post-genomic era
• Improve human health generally and in the State of Indiana particularly
• Enhance economic growth in Indiana
INGEN Structure
• Programs
  – Bioethics
  – Genomics
  – Bioinformatics
  – Medical Informatics
  – Education
  – Training
• Cores
  – Tech Transfer
  – Gene Expression
  – Cell & Protein Expression
  – Human Expression
  – Proteomics
  – Integrated Imaging
  – In vivo Imaging
  – Animal
  – Information Technology ($6.7M)
Challenges for UITS and the INGEN IT Core
• Assist traditional biomedical researchers in adopting use of advanced information technology (massive data storage, visualization, and high performance computing)
• Assist bioinformatics researchers in use of advanced computing facilities
• Questions we are asked:
  – Why wouldn’t it be better just to buy me a newer PC?
• Questions we asked:
  – What do you do now with computers that you would like to do faster?
  – What would you do if computer resources were not a constraint?
So, why is this better than just buying me a new PC?
• Unique facilities provided by the IT Core
  – Redundant data storage
  – HPC – better uniprocessor performance; trivially parallel programming, parallel programming
  – Visualization in the research laboratories
• Hardcopy document – INGEN’s advanced IT facilities: The least you need to know
• Outreach efforts
• Demonstration projects
Example projects
• Multiple simultaneous Matlab jobs for brain imaging
• Installation of many commercial and open source bioinformatics applications
• Site licenses for several commercial packages
• Evaluation of several software products that were not implemented
• Creation of new software
Software packages from external sources
• Commercial
  – GCG/SeqWeb
  – DiscoveryLink
  – PAUP
• Open Source
  – BLAST
  – FASTA
  – CLUSTALW
  – AutoDock
• Several programs written by UITS staff
Creation of new software
• Gamma Knife – Penelope. Modified existing version for more precise targeting with IU’s Gamma Knife.
• Karyote(TM) cell model. Developed a portion of the code used to model cell function. http://biodynamics.indiana.edu/
• PiVNs. Software to visualize human family trees.
• 3-DIVE (3D Interactive Volume Explorer). http://www.avl.iu.edu/projects/3DIVE/
• Protein Family Annotator – collaborative development with IBM, Inc.
• Centralized Life Sciences Database service
• fastDNAml – maximum likelihood phylogenies (http://www.indiana.edu/~rac/hpc/fastDNAml/index.html)
Data Integration
• Goal set by the IU School of Medicine: any researcher within the IU School of Medicine should be able to transparently query all relevant public external data sources and all sources internal to the IU School of Medicine to which the researcher has read privileges
• IU has more than 1 TB of biomedical data stored in the massive data storage system
• There are many public data sources
• Different labs were independently downloading, subsetting, and formatting data
• Solution: IBM DiscoveryLink, DB2 Information Integrator
A life sciences data example – Centralized Life Sciences Database
• Based on use of IBM DiscoveryLink(TM) and DB2 Information Integrator(TM)
• Public data is still downloaded, parsed, and put into a database, but now the process is automated and centralized.
• Lab data and programs like BLAST are included via DiscoveryLink’s wrappers.
• Implemented in partnership with IBM Life Sciences via the IU–IBM strategic relationship in the life sciences
• IU contributed the writing of data parsers
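The centralized download–parse–load cycle described above can be sketched in a few lines. This is only an illustration: it uses Python with SQLite standing in for the DB2-based service, and a hypothetical `id<TAB>data` flat-file format; the real parsers and schemas were specific to each public source.

```python
import sqlite3

def load_source(con, name, raw_tsv):
    """Replace one public source's rows in a shared table.

    raw_tsv is the downloaded flat file: one record per line,
    formatted as 'id<TAB>data' (a hypothetical format for illustration).
    """
    con.execute(
        "CREATE TABLE IF NOT EXISTS records (source TEXT, id TEXT, data TEXT)"
    )
    # Centralized refresh: drop the old copy of this source, then reload it,
    # so every lab queries one current copy instead of keeping its own.
    con.execute("DELETE FROM records WHERE source = ?", (name,))
    for line in raw_tsv.strip().splitlines():
        rec_id, data = line.split("\t", 1)
        con.execute("INSERT INTO records VALUES (?, ?, ?)", (name, rec_id, data))
    con.commit()

con = sqlite3.connect(":memory:")
load_source(con, "genbank_subset", "AB000001\tATGGCCATTG\nAB000002\tTTGACAGGTA")
```

Because each refresh deletes and reloads a whole source, repeated runs stay idempotent, which is the property that lets one scheduled job replace many ad hoc per-lab downloads.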
A computational example - evolutionary biology
• Evolutionary trees describe how different organisms relate to each other
• This was originally done by comparison of fossils
• Statistical techniques and genomic data have made possible new approaches
fastDNAml: Building Phylogenetic Trees
• Goal: an objective means by which phylogenetic trees can be estimated
• The number of bifurcating unrooted trees for n taxa is (2n-5)! / ((n-3)! 2^(n-3))
• Solution: heuristic search
• Trees are built incrementally: trees are optimized in steps, and the best tree(s) are kept for the next round of additions
• High communication/compute ratio
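The tree-count formula above grows super-exponentially, which is why exhaustive search is hopeless and a heuristic is needed. A small sketch (in Python, not part of fastDNAml itself) makes the growth concrete:

```python
from math import factorial

def num_unrooted_trees(n):
    """Number of bifurcating unrooted trees for n taxa:
    (2n-5)! / ((n-3)! * 2^(n-3))."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return factorial(2 * n - 5) // (factorial(n - 3) * 2 ** (n - 3))

# Equivalent view: the i-th taxon can be attached to any of the 2i-5
# branches of the current tree, so the count is 1 * 3 * 5 * ... * (2n-5).
for n in (4, 5, 10, 20):
    print(n, num_unrooted_trees(n))
```

Already at 20 taxa the count exceeds 10^20, so even evaluating one tree per microsecond would take millions of years.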
fastDNAml algorithm: incremental tree building
• Compute the optimal tree for three taxa (chosen randomly) – only one topology is possible
• Randomly pick another taxon, and consider each of the 2i-5 trees possible when adding this (i-th) taxon to the tree built so far
• Keep the best (maximum likelihood) tree
fastDNAml algorithm: branch rearrangement
• Local branch rearrangement: move any subtree crossing n vertices (if n = 1 there are 2i-6 possibilities)
• Keep the best resulting tree
• Repeat this step until local swapping no longer improves the likelihood value
Because of local effects…
• Where you end up sometimes depends on where you start
• This process searches a huge space of possible trees, and is thus dependent upon the randomly selected initial taxa
• The search can get stuck in a local optimum, rather than the global one
• One must do multiple runs with different randomizations of taxon entry order, and compare the results
• Similar trees and likelihood values provide some confidence, but the space of all possible trees still has not been searched extensively
fastDNAml parallel algorithm
fastDNAml performance on IBM SP
[Figure: speedup vs. number of processors (0–70) for 50-, 101-, and 150-taxon runs, compared against perfect scaling.]
From Stewart et al., SC2001
Other grand challenge problems and some thoughts about the future
Real-time fMRI
[Figure: 3.0T MRI scanner, SGI Onyx, CRAY T3E]
• In 1996, this required a supercomputer. Today, it’s routine. (Images and work by PSC)
Gamma Knife
• Used to treat inoperable tumors
• Treatment methods currently use a standardized head model
• UITS is working with the IU School of Medicine to adapt the Penelope code to work with a detailed model of an individual patient’s head
“Simulation-only” studies
• Aquaporins – proteins which conduct large volumes of water through cell walls while filtering out charged particles like hydrogen ions.
• Massive simulation showed that water moves through aquaporin channels in single file. Oxygen leads the way in. Halfway through, the water molecule flips over. That breaks the ‘proton wire’.
• Work done at PSC. Klaus Schulten et al., U. of Illinois, SCIENCE (April 19, 2002). 35,000 hours on the TCS.
Integrated Genomic Annotation Pipeline (iGAP)
[Flow diagram; slide source: San Diego Supercomputer Center. Recoverable steps:]
• Deduced protein sequences: prediction of signal peptides (SignalP, PSORT), transmembrane regions (TMHMM, PSORT), coiled coils (COILS), and low-complexity regions (SEG)
• Create PSI-BLAST profiles for protein sequences
• Structural assignment of domains by PSI-BLAST on FOLDLIB; only sequences without an assignment go on to structural assignment of domains by 123D on FOLDLIB
• Functional assignment by PFAM, NR, and PSIPred assignments
• Domain location prediction by sequence (structure info + sequence info); assigned regions stored in the DB
• Building FOLDLIB from PDB chains, SCOP domains, PDP domains, and CE matches of PDB vs. SCOP: 90% sequence non-identical, minimum size 25 aa, coverage (90%, gaps < 30, ends < 30); ~10^4 entries
• Scale: ~800 genomes @ 10k–20k ORFs each = ~10^7 ORFs; per-stage costs of 3, 3, 4, 228, 252, and 570 CPU years across the pipeline
Drug Design
• Protein folding “the right way”
  – Homology modeling
  – Then adjust for sidechain variations, etc.
• Drug screening
  – Target generation – so what
  – Target verification – that’s important!
  – Toxicity prediction – VERY important
What is the killer application in computational biology?
• Systems biology – the latest buzzword, but… (see special issues in Nature and Science)
• Goal: multiscale modeling from cell chemistry up to multiple populations
• Current software tools are still inadequate
• Multiscale modeling calls for use of established HPC techniques – e.g. adaptive mesh refinement, coupled applications
• Current challenge examples: actin fiber creation, heart attack modeling
• Opportunity for predictive biology?
Current challenge areas

Problem                                     | High Throughput | Grid                 | Capability
--------------------------------------------|-----------------|----------------------|-----------
Protein modeling                            | X               |                      |
Genome annotation, alignment, phylogenetics | X               | X                    | x*
Drug target screening                       | X               | X (corporate grids)  | X
Systems biology                             | X               | X                    |
Medical practice support                    | X               | X                    |

*Only a few large-scale problems merit ‘capability’ status
Other example large-scale computational biology grid projects
• Department of Energy “Genomes to Life” (http://doegenomestolife.org/)
• Biomedical Informatics Research Network (BIRN) (http://birn.ncrr.nih.gov/birn/)
• Asia Pacific BioGrid (http://www.apbionet.org/)
• Encyclopedia of Life (http://eol.sdsc.edu/)
Thoughts about working with biologists
Bioinformatics and Biomedical Research
• Bioinformatics, Genomics, Proteomics, ____ics will radically change understanding of biological function and the way biomedical research is done.
• Traditional biomedical researchers must take advantage of new possibilities
• Computer-oriented researchers must take advantage of the knowledge held by traditional biomedical researchers
Anopheles gambiae
From www.sciencemag.org/feature/data/mosquito/mtm/index.html. Source Library: Centers for Disease Control. Photo Credit: Jim Gathany
INGEN IT Status Overall
• So far, so good
• 108 users of IU’s supercomputers
• 104 users of the massive data storage system
• Six new software packages created or enhanced; more than 20 packages installed for use by INGEN-affiliated researchers
• Three software packages made available as open source software as a direct result of INGEN. Opportunities for tech transfer due to use of the GNU Lesser General Public License.
• The INGEN IT Core is providing services valued by traditionally trained biomedical researchers as well as researchers in bioinformatics, genomics, proteomics, etc.
• Work on the Penelope code for the Gamma Knife is likely to be the first major transferable technology development. It stands to improve the efficacy of Gamma Knife treatment at IU.
So how do you find biologists with whom to collaborate?
• Chicken and egg problem?
• Or more like fishing?
• Or bank robbery?
Bank robbery
• Willie Sutton, a famous American bank robber, was asked why he robbed banks, and reportedly said “because that’s where the money is.”*
• Cultivating collaborations with biologists in the short run will require:
  – Active outreach
  – Different expectations than we might have when working with an aerospace design firm
  – Patience
• There are lots of opportunities open for HPC centers willing to take the effort to cultivate relationships with biologists and biomedical researchers. To do this, we’ll all have to spend a bit of time “going where the biologists are.”
*Unfortunately this is an urban legend; Sutton never said this.
Some information about the Indiana University high performance computing environment
Networking: I-light
• Network jointly owned by Indiana University and Purdue University
• 36 fibers between Bloomington and Indianapolis (IU’s main campuses)
• 24 fibers between Indianapolis and West Lafayette (Purdue’s main campus)
• Co-location with the Abilene GigaPOP
• Expansion to other universities recently funded
Sun E10000 (Solar)
• Acquired 4/00
• Shared memory architecture
• ~52 GFLOPS
• 64 400 MHz CPUs, 64 GB memory
• > 2 TB external disk
• Supports some bioinformatics software available only (or primarily) under Solaris (e.g. GCG/SeqWeb)
• Used extensively by researchers using large databases (db performance, cheminformatics, knowledge management)
IBM Research SP (Aries/Orion Complex)
• 632 CPUs, 1.005 TFLOPS. First university-owned supercomputer in the US to exceed 1 TFLOPS aggregate peak theoretical processing capacity.
• Geographically distributed at IUB and IUPUI
• Initially 50th, now 112th on the Top 500 supercomputer list (to be lower in just a few days!)
• Distributed memory system with shared memory nodes
AVIDD
• AVIDD: Analysis and Visualization of Instrument-Driven Data
• Project funded largely by the National Science Foundation (NSF), funds from Indiana University, and also by a Shared University Research grant from IBM, Inc.
• Hardware components:
  – Distributed Linux cluster
    • Three locations: IU Northwest, Indiana University Purdue University Indianapolis, IU Bloomington
    • 2.164 TFLOPS, 0.5 TB RAM, 10 TB disk
    • Tuned, configured, and optimized for handling real-time data streams
  – A suite of distributed visualization environments
  – Massive data storage
• Usage components:
  – Research by application scientists
  – Research by computer scientists
  – Education
Goals for AVIDD
• Create a massive, distributed facility ideally suited to managing the complete data/experimental lifecycle (acquisition to insight to archiving)
• Focused on modern instruments that produce data in digital format at high rates. Example instruments:
  – Advanced Photon Source, Advanced Light Source
  – Atmospheric science instruments in forests
  – Gene sequencers, expression chip readers
Goals for AVIDD, cont’d
• Performance goals:
  – Two researchers should be able simultaneously to analyze 1 TB data sets (along with other smaller jobs running)
  – The system should be able to give (nearly) immediate attention to real-time computing tasks, while still running at high rates of overall utilization
  – It should be possible to move 1 TB of data from the HPSS disk cache into the cluster in ~2 hours
• Science goals:
  – The distribution of 3D visualization environments in scientists’ labs should enhance the ability of scientists to spontaneously interact with their data
  – The ability to manage large data sets should no longer be an obstacle to scientific research
  – AVIDD should be an effective research platform for cluster engineering R&D as well as computer science research
More details on the Linux cluster
• AVIDD-N: IU Northwest: 18 1.3 GHz PIII processors. This cluster is for instructional use at the IU Northwest campus. (Funded primarily via a Shared University Research grant from IBM.)
• AVIDD-B and AVIDD-I: two identical clusters, each with 208 2.4 GHz Prestonia processors. Each cluster has three types of nodes: head nodes, storage nodes, and compute nodes. (Servers: IBM x335)
• AVIDD-I64: 36 1.0 GHz Itanium processors (Servers: IBM Tiger)
• Myrinet 2000, Gbit, and 100bT networks within each cluster. Non-routing network using Force10 equipment between Bloomington and Indianapolis
Linux Cluster Software
• GPFS (General Parallel File System; proprietary, from IBM)
• System management system from IBM
• Maui Scheduler
• PBS Pro
• LAM/MPI
• Red Hat Linux
Real-time pre-emption of jobs
• High overall rate of utilization, while able to respond ‘immediately’ to requests for real-time data analysis
• System design:
  – Maui Scheduler: support multiple QoS levels for jobs
  – PBS Pro: support multiple QoS, and provide signaling for job termination, job suspension, and job checkpointing
  – LAM/MPI and Red Hat: kernel-level checkpointing
• Options to be supported:
  – Cancel and terminate job
  – Re-queue job
  – Signal, wait, and re-queue job
  – Checkpoint job (as available)
  – Signal job (used to send SIGSTOP/SIGCONT)
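The signal-based suspension option above relies on ordinary POSIX job control. A minimal sketch, with a placeholder `sleep` process standing in for a batch job (in the real system the scheduler, not a script, would deliver these signals to an MPI job):

```python
import os
import signal
import subprocess

# Placeholder "batch job"; any long-running process works for illustration.
job = subprocess.Popen(["sleep", "60"])

# Suspend: the job keeps its memory but receives no CPU time,
# freeing processors for a real-time analysis task.
os.kill(job.pid, signal.SIGSTOP)

# ... the real-time data analysis would run here ...

# Resume the suspended job where it left off.
os.kill(job.pid, signal.SIGCONT)

job.terminate()
job.wait()
```

This stop/continue approach costs no job state, unlike cancel-and-requeue, but it ties up the suspended job's memory for the duration, which is why the slide lists checkpointing as a separate option.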
1 TFLOPS Achieved on Linpack!
• AVIDD-I and AVIDD-B together have a peak theoretical capacity of 1.997 TFLOPS.
• We have just achieved 1.02 TFLOPS on the Linpack benchmark for this distributed system.
• Details:
  – Force10 switches; non-routing 20 GB/sec network connecting AVIDD-I and AVIDD-B (~90 km distance)
  – LINPACK implementation from the University of Tennessee called HPL (High Performance LINPACK), ver 1.0 (http://www.netlib.org/benchmark/hpl/). The problem size we used is 220000, and the block size is 200.
  – LAM/MPI 6.6 beta development version (3/23/2003)
  – Tuning: block size (optimized for smaller matrices, and then seemed to continue to work well), increased the default frame size for communications, fiddled with the number of systems used, rebooted the entire system just before running the benchmark (!)
Cost of grid computing on performance
• Each of the two clusters alone achieved 682.5 GFLOPS, or 68% of peak theoretical of 998.4 GFLOPS per cluster
• The aggregate distributed cluster achieved 1.02 TFLOPS out of 1.997, or 51% of peak theoretical
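The efficiency figures above follow directly from achieved/peak; a short check of the arithmetic:

```python
def efficiency(achieved_gflops, peak_gflops):
    """Fraction of peak theoretical capacity actually achieved on Linpack."""
    return achieved_gflops / peak_gflops

# Figures from the slides: 682.5 GFLOPS per cluster out of 998.4 peak,
# and 1020 GFLOPS aggregate out of 1997 peak for the distributed run.
single_cluster = efficiency(682.5, 998.4)
distributed = efficiency(1020.0, 1997.0)
print(f"single cluster: {single_cluster:.0%}, distributed: {distributed:.0%}")
```

The 17-point drop (68% to 51%) is the measured cost of running one tightly coupled benchmark across the ~90 km inter-campus link.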
Massive Data Storage System
• Based on HPSS (High Performance Storage System)
• First HPSS installation with distributed movers; STK 9310 silos in Bloomington and Indianapolis
• Automatic replication of data between Indianapolis and Bloomington, via I-light, overnight. Critical for biomedical data, which is often irreplaceable.
• 180 TB capacity with existing tapes; total capacity of 480 TB. 100 TB currently in use; 1 TB for biomedical data.
• Common File System (CFS) – disk storage ‘for the masses’
John-E-Box
Invented by John N. Huffman, John C. Huffman, and Eric Wernert
Acknowledgments
• This research was supported in part by the Indiana Genomics Initiative (INGEN). The Indiana Genomics Initiative (INGEN) of Indiana University is supported in part by Lilly Endowment Inc.
• This work was supported in part by Shared University Research grants from IBM, Inc. to Indiana University.
• This material is based upon work supported by the National Science Foundation under Grant No. 0116050 and Grant No. CDA-9601632. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).
Acknowledgments, cont’d
• UITS Research and Academic Computing Division managers: Mary Papakhian, David Hart, Stephen Simms, Richard Repasky, Matt Link, John Samuel, Eric Wernert, Anurag Shankar
• Indiana Genomics Initiative Staff: Andy Arenson, Chris Garrison, Huian Li, Jagan Lakshmipathy, David Hancock
• UITS Senior Management: Associate Vice President and Dean Christopher Peebles, RAC(Data) Director Gerry Bernbom
• Assistance with this presentation: John Herrin, Malinda Lingwall
• Thanks to Dr. Michael Resch, Director, HLRS, for inviting me to visit HLRS
• Thanks to Dr. Wolfgang Nagel for inviting me to visit ZHR and Dresden!
• Further information is available at:
  – ingen.iu.edu
  – http://www.indiana.edu/~uits/rac/
  – http://www.ncsc.org/casc/paper.html
  – http://www.indiana.edu/~rac/staff_papers.html
• A recommended German bioinformatics site:– http://www.bioinformatik.de/
• Paper coming soon for SIGUCCS conference Oct. 2003