Computational Biology: Practical lessons and thoughts for the future
Dr. Craig A. Stewart
[email protected]
Visiting Scientist, Höchstleistungsrechenzentrum Universität Stuttgart
Director, Research and Academic Computing, University Information Technology Services
Director, Information Technology Core, Indiana Genomics Initiative
19 June 2003
License terms
• Please cite as: Stewart, C.A. Computational Biology: Practical lessons and thoughts for the future. 2003. Presentation. Presented at: ZIH, Technische Universität Dresden (Dresden, Germany, 19 Jun 2003). Available from: http://hdl.handle.net/2022/14802
• Except where otherwise noted, by inclusion of a source url or some other note, the contents of this presentation are © by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.
Outline
• The revolution in biology & IU’s response – the Indiana Genomics Initiative
• Example software applications
  – Centralized Life Sciences Database Service
  – fastDNAml
• What are the grand challenge problems in computational biology?
• Some thoughts about dealing with biological and biomedical researchers in general
• A brief description of IU’s high performance computing, storage, and visualization environments
The revolution in biology
• Automated, high-throughput sequencing has revolutionized biology.
• Computing has been a part of this revolution in three ways so far:
  – Computing has been essential to the assembly of genomes
  – There is now so much biological data available that it is impossible to use it effectively without the aid of computers
  – Networking and the Web have made biological data generally and publicly available
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Indiana Genomics Initiative (INGEN)
• Created by a $105M grant from the Lilly Endowment, Inc. and launched in December 2000
• Build on traditional strengths and add new areas of research for IU
• Perform the research that will generate new treatments for human disease in the post-genomic era
• Improve human health generally and in the State of Indiana particularly
• Enhance economic growth in Indiana
INGEN Structure
• Programs
  – Bioethics
  – Genomics
  – Bioinformatics
  – Medical Informatics
  – Education
  – Training
• Cores
  – Tech Transfer
  – Gene Expression
  – Cell & Protein Expression
  – Human Expression
  – Proteomics
  – Integrated Imaging
  – In vivo Imaging
  – Animal
  – Information Technology ($6.7M)
Challenges for UITS and the INGEN IT Core
• Assist traditional biomedical researchers in adopting use of advanced information technology (massive data storage, visualization, and high performance computing)
• Assist bioinformatics researchers in use of advanced computing facilities
• Questions we are asked:
  – Why wouldn’t it be better just to buy me a newer PC?
• Questions we asked:
  – What do you do now with computers that you would like to do faster?
  – What would you do if computer resources were not a constraint?
So, why is this better than just buying me a new PC?
• Unique facilities provided by the IT Core
  – Redundant data storage
  – HPC – better uniprocessor performance; trivially parallel programming, parallel programming
  – Visualization in the research laboratories
• Hardcopy document – INGEN’s advanced IT facilities: The least you need to know
• Outreach efforts
• Demonstration projects
Example projects
• Multiple simultaneous Matlab jobs for brain imaging
• Installation of many commercial and open source bioinformatics applications
• Site licenses for several commercial packages
• Evaluation of several software products that were not implemented
• Creation of new software
Software packages from external sources
• Commercial
  – GCG/SeqWeb
  – DiscoveryLink
  – PAUP
• Open Source
  – BLAST
  – FASTA
  – CLUSTALW
  – AutoDock
• Several programs written by UITS staff
Creation of new software
• Gamma Knife – Penelope. Modified existing version for more precise targeting with IU’s Gamma Knife.
• Karyote(TM) cell model. Developed a portion of the code used to model cell function. http://biodynamics.indiana.edu/
• PiVNs. Software to visualize human family trees.
• 3-DIVE (3D Interactive Volume Explorer). http://www.avl.iu.edu/projects/3DIVE/
• Protein Family Annotator – collaborative development with IBM, Inc.
• Centralized Life Sciences Database service
• fastDNAml – maximum likelihood phylogenies (http://www.indiana.edu/~rac/hpc/fastDNAml/index.html)
Data Integration
• Goal set by the IU School of Medicine: any researcher within the IU School of Medicine should be able to transparently query all relevant public external data sources and all sources internal to the IU School of Medicine to which the researcher has read privileges
• IU has more than 1 TB of biomedical data stored in the massive data storage system
• There are many public data sources
• Different labs were independently downloading, subsetting, and formatting data
• Solution: IBM DiscoveryLink, DB2 Information Integrator
A life sciences data example – Centralized Life Sciences Database
• Based on use of IBM DiscoveryLink(TM) and DB2 Information Integrator(TM)
• Public data is still downloaded, parsed, and put into a database, but now the process is automated and centralized.
• Lab data and programs like BLAST are included via DiscoveryLink’s wrappers.
• Implemented in partnership with IBM Life Sciences via the IU–IBM strategic relationship in the life sciences
• IU contributed the writing of data parsers
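The centralized download–parse–load cycle described above can be sketched in a few lines. This is only an illustration: it uses Python with SQLite standing in for the DB2-based service, and a hypothetical `id<TAB>data` flat-file format; the real parsers and schemas were specific to each public source.

```python
import sqlite3

def load_source(con, name, raw_tsv):
    """Replace one public source's rows in a shared table.

    raw_tsv is the downloaded flat file: one record per line,
    formatted as 'id<TAB>data' (a hypothetical format for illustration).
    """
    con.execute(
        "CREATE TABLE IF NOT EXISTS records (source TEXT, id TEXT, data TEXT)"
    )
    # Centralized refresh: drop the old copy of this source, then reload it,
    # so every lab queries one current copy instead of keeping its own.
    con.execute("DELETE FROM records WHERE source = ?", (name,))
    for line in raw_tsv.strip().splitlines():
        rec_id, data = line.split("\t", 1)
        con.execute("INSERT INTO records VALUES (?, ?, ?)", (name, rec_id, data))
    con.commit()

con = sqlite3.connect(":memory:")
load_source(con, "genbank_subset", "AB000001\tATGGCCATTG\nAB000002\tTTGACAGGTA")
```

Because each refresh deletes and reloads a whole source, repeated runs stay idempotent, which is the property that lets one scheduled job replace many ad hoc per-lab downloads.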
A computational example - evolutionary biology
• Evolutionary trees describe how different organisms relate to each other
• This was originally done by comparison of fossils
• Statistical techniques and genomic data have made possible new approaches
fastDNAml: Building Phylogenetic Trees
• Goal: an objective means by which phylogenetic trees can be estimated
• The number of bifurcating unrooted trees for n taxa is (2n-5)! / ((n-3)! 2^(n-3))
• Solution: heuristic search
• Trees are built incrementally: trees are optimized in steps, and the best tree(s) are kept for the next round of additions
• High communication/compute ratio
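The tree-count formula above grows super-exponentially, which is why exhaustive search is hopeless and a heuristic is needed. A small sketch (in Python, not part of fastDNAml itself) makes the growth concrete:

```python
from math import factorial

def num_unrooted_trees(n):
    """Number of bifurcating unrooted trees for n taxa:
    (2n-5)! / ((n-3)! * 2^(n-3))."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return factorial(2 * n - 5) // (factorial(n - 3) * 2 ** (n - 3))

# Equivalent view: the i-th taxon can be attached to any of the 2i-5
# branches of the current tree, so the count is 1 * 3 * 5 * ... * (2n-5).
for n in (4, 5, 10, 20):
    print(n, num_unrooted_trees(n))
```

Already at 20 taxa the count exceeds 10^20, so even evaluating one tree per microsecond would take millions of years.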
fastDNAml algorithm: incremental tree building
• Compute the optimal tree for three taxa (chosen randomly) – only one topology is possible
• Randomly pick another taxon, and consider each of the 2i-5 trees possible when adding this (i-th) taxon to the tree built so far
• Keep the best (maximum likelihood) tree
fastDNAml algorithm: branch rearrangement
• Local branch rearrangement: move any subtree crossing n vertices (if n = 1 there are 2i-6 possibilities)
• Keep the best resulting tree
• Repeat this step until local swapping no longer improves the likelihood value
Because of local effects…
• Where you end up sometimes depends on where you start
• This process searches a huge space of possible trees, and is thus dependent upon the randomly selected initial taxa
• The search can get stuck in a local optimum, rather than the global one
• One must do multiple runs with different randomizations of taxon entry order, and compare the results
• Similar trees and likelihood values provide some confidence, but the space of all possible trees still has not been searched extensively
fastDNAml parallel algorithm
fastDNAml performance on IBM SP
[Figure: speedup vs. number of processors (0–70) for 50-, 101-, and 150-taxon runs, compared against perfect scaling.]
From Stewart et al., SC2001
Other grand challenge problems and some thoughts about the future
Real-time fMRI
[Figure: 3.0T MRI scanner, SGI Onyx, CRAY T3E]
• In 1996, this required a supercomputer. Today, it’s routine. (Images and work by PSC)
Gamma Knife
• Used to treat inoperable tumors
• Treatment methods currently use a standardized head model
• UITS is working with the IU School of Medicine to adapt the Penelope code to work with a detailed model of an individual patient’s head
“Simulation-only” studies
• Aquaporins – proteins which conduct large volumes of water through cell walls while filtering out charged particles like hydrogen ions.
• Massive simulation showed that water moves through aquaporin channels in single file. Oxygen leads the way in. Halfway through, the water molecule flips over. That breaks the ‘proton wire’.
• Work done at PSC. Klaus Schulten et al., U. of Illinois, SCIENCE (April 19, 2002). 35,000 hours on the TCS.
Integrated Genomic Annotation Pipeline (iGAP)
[Flow diagram; slide source: San Diego Supercomputer Center. Recoverable steps:]
• Deduced protein sequences: prediction of signal peptides (SignalP, PSORT), transmembrane regions (TMHMM, PSORT), coiled coils (COILS), and low-complexity regions (SEG)
• Create PSI-BLAST profiles for protein sequences
• Structural assignment of domains by PSI-BLAST on FOLDLIB; only sequences without an assignment go on to structural assignment of domains by 123D on FOLDLIB
• Functional assignment by PFAM, NR, and PSIPred assignments
• Domain location prediction by sequence (structure info + sequence info); assigned regions stored in the DB
• Building FOLDLIB from PDB chains, SCOP domains, PDP domains, and CE matches of PDB vs. SCOP: 90% sequence non-identical, minimum size 25 aa, coverage (90%, gaps < 30, ends < 30); ~10^4 entries
• Scale: ~800 genomes @ 10k–20k ORFs each = ~10^7 ORFs; per-stage costs of 3, 3, 4, 228, 252, and 570 CPU years across the pipeline
Drug Design
• Protein folding “the right way”
  – Homology modeling
  – Then adjust for sidechain variations, etc.
• Drug screening
  – Target generation – so what
  – Target verification – that’s important!
  – Toxicity prediction – VERY important
What is the killer application in computational biology?
• Systems biology – the latest buzzword, but… (see special issues in Nature and Science)
• Goal: multiscale modeling from cell chemistry up to multiple populations
• Current software tools are still inadequate
• Multiscale modeling calls for use of established HPC techniques – e.g. adaptive mesh refinement, coupled applications
• Current challenge examples: actin fiber creation, heart attack modeling
• Opportunity for predictive biology?
Current challenge areas

Problem                                     | High Throughput | Grid                 | Capability
--------------------------------------------|-----------------|----------------------|-----------
Protein modeling                            | X               |                      |
Genome annotation, alignment, phylogenetics | X               | X                    | x*
Drug target screening                       | X               | X (corporate grids)  | X
Systems biology                             | X               | X                    |
Medical practice support                    | X               | X                    |

*Only a few large-scale problems merit ‘capability’ status
Other example large-scale computational biology grid projects
• Department of Energy “Genomes to Life” (http://doegenomestolife.org/)
• Biomedical Informatics Research Network (BIRN) (http://birn.ncrr.nih.gov/birn/)
• Asia Pacific BioGrid (http://www.apbionet.org/)
• Encyclopedia of Life (http://eol.sdsc.edu/)
Thoughts about working with biologists
Bioinformatics and Biomedical Research
• Bioinformatics, Genomics, Proteomics, ____ics will radically change understanding of biological function and the way biomedical research is done.
• Traditional biomedical researchers must take advantage of new possibilities
• Computer-oriented researchers must take advantage of the knowledge held by traditional biomedical researchers
Anopheles gambiae
From www.sciencemag.org/feature/data/mosquito/mtm/index.html. Source Library: Centers for Disease Control. Photo Credit: Jim Gathany
INGEN IT Status Overall
• So far, so good
• 108 users of IU’s supercomputers
• 104 users of the massive data storage system
• Six new software packages created or enhanced; more than 20 packages installed for use by INGEN-affiliated researchers
• Three software packages made available as open source software as a direct result of INGEN. Opportunities for tech transfer due to use of the GNU Lesser General Public License.
• The INGEN IT Core is providing services valued by traditionally trained biomedical researchers as well as researchers in bioinformatics, genomics, proteomics, etc.
• Work on the Penelope code for the Gamma Knife is likely to be the first major transferable technology development. It stands to improve the efficacy of Gamma Knife treatment at IU.
So how do you find biologists with whom to collaborate?
• Chicken and egg problem?
• Or more like fishing?
• Or bank robbery?
Bank robbery
• Willie Sutton, a famous American bank robber, was asked why he robbed banks, and reportedly said “because that’s where the money is.”*
• Cultivating collaborations with biologists in the short run will require:
  – Active outreach
  – Different expectations than we might have when working with an aerospace design firm
  – Patience
• There are lots of opportunities open for HPC centers willing to take the effort to cultivate relationships with biologists and biomedical researchers. To do this, we’ll all have to spend a bit of time “going where the biologists are.”
*Unfortunately this is an urban legend; Sutton never said this.
Some information about the Indiana University high performance computing environment
Networking: I-light
• Network jointly owned by Indiana University and Purdue University
• 36 fibers between Bloomington and Indianapolis (IU’s main campuses)
• 24 fibers between Indianapolis and West Lafayette (Purdue’s main campus)
• Co-location with the Abilene GigaPOP
• Expansion to other universities recently funded
Sun E10000 (Solar)
• Acquired 4/00
• Shared memory architecture
• ~52 GFLOPS
• 64 400 MHz CPUs, 64 GB memory
• > 2 TB external disk
• Supports some bioinformatics software available only (or primarily) under Solaris (e.g. GCG/SeqWeb)
• Used extensively by researchers using large databases (db performance, cheminformatics, knowledge management)
IBM Research SP (Aries/Orion Complex)
• 632 CPUs, 1.005 TFLOPS. First university-owned supercomputer in the US to exceed 1 TFLOPS aggregate peak theoretical processing capacity.
• Geographically distributed at IUB and IUPUI
• Initially 50th, now 112th on the Top 500 supercomputer list (to be lower in just a few days!)
• Distributed memory system with shared memory nodes
AVIDD
• AVIDD: Analysis and Visualization of Instrument-Driven Data
• Project funded largely by the National Science Foundation (NSF), funds from Indiana University, and also by a Shared University Research grant from IBM, Inc.
• Hardware components:
  – Distributed Linux cluster
    • Three locations: IU Northwest, Indiana University Purdue University Indianapolis, IU Bloomington
    • 2.164 TFLOPS, 0.5 TB RAM, 10 TB disk
    • Tuned, configured, and optimized for handling real-time data streams
  – A suite of distributed visualization environments
  – Massive data storage
• Usage components:
  – Research by application scientists
  – Research by computer scientists
  – Education
Goals for AVIDD
• Create a massive, distributed facility ideally suited to managing the complete data/experimental lifecycle (acquisition to insight to archiving)
• Focused on modern instruments that produce data in digital format at high rates. Example instruments:
  – Advanced Photon Source, Advanced Light Source
  – Atmospheric science instruments in forests
  – Gene sequencers, expression chip readers
Goals for AVIDD, cont’d
• Performance goals:
  – Two researchers should be able simultaneously to analyze 1 TB data sets (along with other smaller jobs running)
  – The system should be able to give (nearly) immediate attention to real-time computing tasks, while still running at high rates of overall utilization
  – It should be possible to move 1 TB of data from the HPSS disk cache into the cluster in ~2 hours
• Science goals:
  – The distribution of 3D visualization environments in scientists’ labs should enhance the ability of scientists to spontaneously interact with their data
  – The ability to manage large data sets should no longer be an obstacle to scientific research
  – AVIDD should be an effective research platform for cluster engineering R&D as well as computer science research
More details on the Linux cluster
• AVIDD-N: IU Northwest: 18 1.3 GHz PIII processors. This cluster is for instructional use at the IU Northwest campus. (Funded primarily via a Shared University Research grant from IBM.)
• AVIDD-B and AVIDD-I: two identical clusters, each with 208 2.4 GHz Prestonia processors. Each cluster has three types of nodes: head nodes, storage nodes, and compute nodes. (Servers: IBM x335)
• AVIDD-I64: 36 1.0 GHz Itanium processors (Servers: IBM Tiger)
• Myrinet 2000, Gbit, and 100bT networks within each cluster. Non-routing network using Force10 equipment between Bloomington and Indianapolis
Linux Cluster Software
• GPFS (General Parallel File System; proprietary, from IBM)
• System management system from IBM
• Maui Scheduler
• PBS Pro
• LAM/MPI
• Red Hat Linux
Real-time pre-emption of jobs
• High overall rate of utilization, while able to respond ‘immediately’ to requests for real-time data analysis
• System design:
  – Maui Scheduler: support multiple QoS levels for jobs
  – PBS Pro: support multiple QoS, and provide signaling for job termination, job suspension, and job checkpointing
  – LAM/MPI and Red Hat: kernel-level checkpointing
• Options to be supported:
  – Cancel and terminate job
  – Re-queue job
  – Signal, wait, and re-queue job
  – Checkpoint job (as available)
  – Signal job (used to send SIGSTOP/SIGCONT)
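The signal-based suspension option above relies on ordinary POSIX job control. A minimal sketch, with a placeholder `sleep` process standing in for a batch job (in the real system the scheduler, not a script, would deliver these signals to an MPI job):

```python
import os
import signal
import subprocess

# Placeholder "batch job"; any long-running process works for illustration.
job = subprocess.Popen(["sleep", "60"])

# Suspend: the job keeps its memory but receives no CPU time,
# freeing processors for a real-time analysis task.
os.kill(job.pid, signal.SIGSTOP)

# ... the real-time data analysis would run here ...

# Resume the suspended job where it left off.
os.kill(job.pid, signal.SIGCONT)

job.terminate()
job.wait()
```

This stop/continue approach costs no job state, unlike cancel-and-requeue, but it ties up the suspended job's memory for the duration, which is why the slide lists checkpointing as a separate option.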
1 TFLOPS Achieved on Linpack!
• AVIDD-I and AVIDD-B together have a peak theoretical capacity of 1.997 TFLOPS.
• We have just achieved 1.02 TFLOPS on the Linpack benchmark for this distributed system.
• Details:
  – Force10 switches; non-routing 20 GB/sec network connecting AVIDD-I and AVIDD-B (~90 km distance)
  – LINPACK implementation from the University of Tennessee called HPL (High Performance LINPACK), ver 1.0 (http://www.netlib.org/benchmark/hpl/). The problem size we used is 220000, and the block size is 200.
  – LAM/MPI 6.6 beta development version (3/23/2003)
  – Tuning: block size (optimized for smaller matrices, and then seemed to continue to work well), increased the default frame size for communications, fiddled with the number of systems used, rebooted the entire system just before running the benchmark (!)
Cost of grid computing on performance
• Each of the two clusters alone achieved 682.5 GFLOPS, or 68% of peak theoretical of 998.4 GFLOPS per cluster
• The aggregate distributed cluster achieved 1.02 TFLOPS out of 1.997, or 51% of peak theoretical
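The efficiency figures above follow directly from achieved/peak; a short check of the arithmetic:

```python
def efficiency(achieved_gflops, peak_gflops):
    """Fraction of peak theoretical capacity actually achieved on Linpack."""
    return achieved_gflops / peak_gflops

# Figures from the slides: 682.5 GFLOPS per cluster out of 998.4 peak,
# and 1020 GFLOPS aggregate out of 1997 peak for the distributed run.
single_cluster = efficiency(682.5, 998.4)
distributed = efficiency(1020.0, 1997.0)
print(f"single cluster: {single_cluster:.0%}, distributed: {distributed:.0%}")
```

The 17-point drop (68% to 51%) is the measured cost of running one tightly coupled benchmark across the ~90 km inter-campus link.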
Massive Data Storage System
• Based on HPSS (High Performance Storage System)
• First HPSS installation with distributed movers; STK 9310 silos in Bloomington and Indianapolis
• Automatic replication of data between Indianapolis and Bloomington, via I-light, overnight. Critical for biomedical data, which is often irreplaceable.
• 180 TB capacity with existing tapes; total capacity of 480 TB. 100 TB currently in use; 1 TB for biomedical data.
• Common File System (CFS) – disk storage ‘for the masses’
John-E-Box
Invented by John N. Huffman, John C. Huffman, and Eric Wernert
Acknowledgments
• This research was supported in part by the Indiana Genomics Initiative (INGEN). The Indiana Genomics Initiative (INGEN) of Indiana University is supported in part by Lilly Endowment Inc.
• This work was supported in part by Shared University Research grants from IBM, Inc. to Indiana University.
• This material is based upon work supported by the National Science Foundation under Grant No. 0116050 and Grant No. CDA-9601632. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).
Acknowledgments, cont’d
• UITS Research and Academic Computing Division managers: Mary Papakhian, David Hart, Stephen Simms, Richard Repasky, Matt Link, John Samuel, Eric Wernert, Anurag Shankar
• Indiana Genomics Initiative Staff: Andy Arenson, Chris Garrison, Huian Li, Jagan Lakshmipathy, David Hancock
• UITS Senior Management: Associate Vice President and Dean Christopher Peebles, RAC(Data) Director Gerry Bernbom
• Assistance with this presentation: John Herrin, Malinda Lingwall
• Thanks to Dr. Michael Resch, Director, HLRS, for inviting me to visit HLRS
• Thanks to Dr. Wolfgang Nagel for inviting me to visit ZHR and Dresden!
• Further information is available at:
  – ingen.iu.edu
  – http://www.indiana.edu/~uits/rac/
  – http://www.ncsc.org/casc/paper.html
  – http://www.indiana.edu/~rac/staff_papers.html
• A recommended German bioinformatics site:– http://www.bioinformatik.de/
• Paper coming soon for SIGUCCS conference Oct. 2003