
Current challenges and opportunities in Biogrids

Dr. Craig A. Stewart

stewart@iu.edu

Director, Research and Academic Computing, University Information Technology Services

Director, Information Technology Core, Indiana Genomics Initiative

Visiting Scientist, Höchstleistungsrechenzentrum Universität Stuttgart

6th Metacomputing Symposium 22 May 2003


License terms

• Please cite as: Stewart, C.A. Current challenges and opportunities in Biogrids. 2003. Presentation. Presented at: 6th Metacomputing Symposium (High Performance Computing Center, Universitaet Stuttgart, Stuttgart, Germany, 22 May 2003). Available from: http://hdl.handle.net/2022/15217

• Except where otherwise noted, by inclusion of a source url or some other note, the contents of this presentation are © by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.



Outline

• Background about grids and biology

• Biodata grids

• Biocomputation grids

• Some comments and suggestions regarding the challenges and opportunities for the computing community and the biology community

• NB:
  – Likely more questions than answers!
  – “Grids” will be defined loosely, and not necessarily consistently
  – Similar lack of precision will be employed with the various flavors of “–omics.” Ultimately it’s all computational biology.


Why do subject-specific grids exist?¹

• In general:
  – Practical issues
  – Communities of practice and trust
  – Existence of specific problems that appear to call for grid-based approaches (e.g. GriPhyN)

• In biology:
  – Rudimentary “grid” projects predate the Web. Example: Flybase via Gopher. [Flybase dates to 1993]
  – Fractionated communities
  – Many independent data sources suggest a grid approach

¹ These views may be peculiar to the US or to the speaker


The revolution in biology

• Automated, high-throughput sequencing has revolutionized biology.

• Computing has been a part of this revolution in three ways so far:
  – Computing has been essential to the assembly of genomes
  – There is now so much biological data available that it is impossible to utilize it effectively without the aid of computers
  – Networking and the Web have made biological data generally and publicly available

• In the future, computing should be critical for:
  – Automated data analysis
  – Simulation and prediction


http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html


Biodata Grids


So how big is big?

• GenBank has grown exponentially, but the total is still only ~30B base pairs

• All of the data and programs from NCBI could fit on one reasonably large supercomputer

• Even BIRN, the most ambitious of the planned biodata grid projects, has a data set that will grow by tens to hundreds of TB per year

• ‘large dataset’ in the biological sciences ≠ ‘large dataset’ in the physical sciences

• Complexity of linkages within the data, however…
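As a back-of-the-envelope illustration of the scale quoted above, the sketch below converts ~30B base pairs into storage. The two encodings (2 bits per base, 1 byte per base) are assumptions chosen for illustration and do not describe how GenBank actually stores its records, which also carry annotation.

```python
# Rough storage estimate for ~30 billion base pairs (figure from the slide).
# The two encodings below are illustrative assumptions.

BASE_PAIRS = 30e9  # ~30B base pairs


def storage_gb(base_pairs: float, bits_per_base: float) -> float:
    """Return storage in gigabytes (10^9 bytes) for a given encoding."""
    return base_pairs * bits_per_base / 8 / 1e9


packed_gb = storage_gb(BASE_PAIRS, 2)  # 4-letter alphabet packed into 2 bits
ascii_gb = storage_gb(BASE_PAIRS, 8)   # one ASCII character per base

print(f"2-bit packed: {packed_gb:.1f} GB")
print(f"1 byte/base:  {ascii_gb:.1f} GB")
```

Under either assumption the raw sequence amounts to tens of gigabytes at most, which is small next to the tens to hundreds of TB per year quoted for BIRN.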


How many data sources?

• DNA/Chromosomes
  – GenBank. Operated by NCBI (National Center for Biotechnology Information). http://www.ncbi.nlm.nih.gov
  – European Molecular Biology Laboratory – Nucleotide Sequence Database. http://www.ebi.ac.uk/genomes
  – DNA Data Bank of Japan (DDBJ). http://www.ddbj.nig.ac.jp

• Proteins
  – ExPASy http://www.expasy.org/
  – Protein Data Bank – PDB http://www.rcsb.org/pdb/

• Biochemistry & Enzymes
  – PathDB http://www.ncgr.org/software/version_2_0.html
  – Kegg WIT http://wit.mcs.anl.gov/WIT2/

• Not to mention the organism-specific databases


The needs and opportunities in Biodata grids

• Many disparate subcommunities, many funding sources, lots of history

• NCBI, DDBJ, and EMBL contain essentially the same data; they complement and compete with one another in features and functions.

• Web clicking is not a suitable way to do large-scale computing!

• Private companies may need to be very private
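To make the contrast with "Web clicking" concrete, here is a minimal sketch of scripted batch retrieval. It assumes NCBI's E-utilities `efetch` endpoint as one example of a programmatic interface; the accession IDs are hypothetical placeholders, and the code only builds request URLs without contacting any server.

```python
# Sketch: build batch-retrieval URLs instead of clicking through web pages.
# Assumption: NCBI's E-utilities "efetch" endpoint; the accession IDs are
# hypothetical placeholders. No network request is made here.
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(ids, db="nuccore", rettype="fasta"):
    """Return one URL that requests all given records in a single call."""
    params = {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": "text"}
    return f"{EFETCH}?{urlencode(params)}"


def batches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


accessions = [f"HYPOTHETICAL{i:04d}" for i in range(5)]  # placeholder IDs
urls = [efetch_url(chunk) for chunk in batches(accessions, 2)]
for url in urls:
    print(url)
```

A script of this kind can request thousands of records in a loop, which is the sort of access that large-scale computing needs and that a point-and-click web interface cannot provide.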

http://www.ncbi.nlm.nih.gov/


Data integration and management

• Person-intensive downloads

• Avaki (http://www.avaki.com/)

• Lion Biosciences (http://www.lionbioscience.com/)

• IBM – DB2 Information Integrator and DiscoveryLink (www.ibm.com/)

• Various XML-based efforts


IU Centralized Life Science Database (CLSD)

• Goal set by IU School of Medicine: Any researcher within the IU School of Medicine should be able to transparently query all relevant public external data sources and all sources internal to the IU School of Medicine to which the researcher has read privileges

• Based on use of IBM DiscoveryLink(TM) and DB2 Information Integrator(TM)

• Public data is still downloaded, parsed, and put into a database, but now the process is automated and centralized.

• Lab data and programs like BLAST are included via DiscoveryLink’s wrappers.

• Implemented in partnership with IBM Life Sciences via IU-IBM strategic relationship in the life sciences

• IU contributed writing of data parsers
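The "download, parse, load, query" cycle described above can be sketched with the Python standard library. This is only an illustration of the general pattern: SQLite stands in for the actual DiscoveryLink/DB2 deployment, and the FASTA text, table name, and column names are hypothetical.

```python
# Sketch of the "download, parse, load" step using only the standard library.
# SQLite is a stand-in for the real database; the FASTA text, table name,
# and column names are hypothetical.
import sqlite3

FASTA_TEXT = """>seq1 hypothetical record A
ACGTACGT
ACGT
>seq2 hypothetical record B
GGGCCC
"""


def _record(header, chunks):
    """Split a FASTA header into id and description; join sequence lines."""
    ident, _, desc = header.partition(" ")
    return ident, desc, "".join(chunks)


def parse_fasta(text):
    """Yield (identifier, description, sequence) tuples from FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield _record(header, chunks)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sequences (id TEXT PRIMARY KEY, description TEXT, seq TEXT)")
conn.executemany("INSERT INTO sequences VALUES (?, ?, ?)", parse_fasta(FASTA_TEXT))

rows = conn.execute("SELECT id, length(seq) FROM sequences ORDER BY id").fetchall()
print(rows)
```

Once parsers like this run on a schedule against each public source, researchers query one local database rather than repeating the download and parsing by hand.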


Biocomputation Grids


Orders of magnitude in biology

[Figure: modeling approaches plotted by timescale (seconds, 10^-15 to 10^9, extending to geologic and evolutionary timescales) against size scale (atoms, biopolymers, cells, organisms). Approaches shown: ab initio quantum chemistry; first principles molecular dynamics; empirical force field molecular dynamics; enzyme mechanisms; protein folding; homology-based protein modeling; DNA replication; cell signaling; organ function; evolutionary processes; ecosystems and epidemiology; finite element models; electrostatic continuum models; discrete automata models.]

Slide source: Rick Stevens, Argonne National Laboratory; information source DOE Genomes to Life ©


Example large-scale computational biology grid projects

• Department of Energy “Genomes to Life” http://doegenomestolife.org/

• Biomedical Informatics Research Network (BIRN) http://birn.ncrr.nih.gov/birn/

• Asia Pacific BioGrid (http://www.apbionet.org/)

• Encyclopedia of Life (http://eol.sdsc.edu/)


integrated Genomic Annotation Pipeline - iGAP

[Figure: pipeline flowchart. Labels recoverable from the slide:]

• Input: deduced protein sequences, ~800 genomes @ 10k-20k per genome ≈ 10^7 ORFs

• Pipeline steps:
  – Prediction of: signal peptides (SignalP, PSORT), transmembrane regions (TMHMM, PSORT), coiled coils (COILS), low complexity regions (SEG)
  – Create PSI-BLAST profiles for protein sequences
  – Structural assignment of domains by PSI-BLAST on FOLDLIB
  – Structural assignment of domains by 123D on FOLDLIB
  – Domain location prediction by sequence
  – Functional assignment by PFAM, NR, PSIPred assignments
  – Store assigned regions in the DB

• Filter label (shown twice between steps): only sequences w/out A-prediction

• Data sources: FOLDLIB; NR, PFAM; SCOP, PDB (structure info, sequence info)

• Building FOLDLIB (10^4 entries):
  – PDB chains, SCOP domains, PDP domains, CE matches PDB vs. SCOP
  – 90% sequence non-identical, minimum size 25 aa, coverage (90%, gaps <30, ends <30)

• CPU cost annotations on the flowchart: 4 CPU years, 228 CPU years, 3 CPU years, 570 CPU years, 252 CPU years, 3 CPU years

Slide source: San Diego Supercomputing Center ©
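To put the CPU-year figures on the iGAP slide in perspective, the sketch below sums them and converts the total into idealized elapsed time. It assumes the six figures are separate stages whose costs can simply be added and that the work divides evenly over the available processors; the pool sizes are hypothetical.

```python
# CPU-year figures as annotated on the iGAP slide.
# Assumption: the six figures are separate pipeline stages that can be summed.
cpu_years = [4, 228, 3, 570, 252, 3]
total = sum(cpu_years)


def wall_clock_years(total_cpu_years, n_cpus, efficiency=1.0):
    """Idealized elapsed time if the work spreads evenly over n_cpus."""
    return total_cpu_years / (n_cpus * efficiency)


print("total CPU years:", total)
for n in (100, 1000, 10000):  # hypothetical pool sizes
    print(n, "CPUs ->", round(wall_clock_years(total, n), 2), "years")
```

Under these assumptions the annotation job needs on the order of a thousand CPU years, so it is only practical when spread across a large pool of processors.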


One example: Building Phylogenetic Trees

• Goal: an objective means by which phylogenetic trees can be estimated

• The number of bifurcating unrooted trees for n taxa is $\frac{(2n-5)!}{(n-3)!\,2^{n-3}}$

• Solution: heuristic search

• Trees built incrementally. Trees are optimized in steps, and best tree(s) are then kept for next round of additions

• High compute/communication ratio
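The tree-count formula above grows so quickly that exhaustive search is out of the question, which is why a heuristic is needed. A short sketch that evaluates the formula exactly with integer arithmetic (the function name is ours, chosen for illustration):

```python
from math import factorial


def unrooted_tree_count(n: int) -> int:
    """Number of distinct bifurcating unrooted trees for n >= 3 taxa."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    # (2n-5)! / ((n-3)! * 2^(n-3)); the division is always exact.
    return factorial(2 * n - 5) // (factorial(n - 3) * 2 ** (n - 3))


for n in (4, 5, 10, 50):
    print(n, unrooted_tree_count(n))
```

Ten taxa already give about two million trees, and 50 taxa give a number with 75 digits.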


fastDNAml performance on an international Grid

[Figure: wall clock time (seconds, 0 to 3500) versus number of processors (0 to 18) for three configurations: IU Only, IU & NUS, IU & ANU.]

From iGrid ’98 at SC98


fastDNAml performance on IBM SP

[Figure: speedup (0 to 70) versus number of processors (0 to 70). Series: Perfect Scaling, 50 Taxa, 101 Taxa, 150 Taxa.]

From Stewart et al., SC2001


fastDNAml and Biogrid Computing

• IU-created library called SMBL (Simple Message Brokering Library) permits use of Condor flocks as “worker” processes

• fastDNAml has a very high compute/communicate ratio

• fastDNAml is one example of a general phenomenon in biogrid computation: How much of it is really capability computing, and how much of it would be high-throughput computing if the applications were really well written?
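The master/worker arrangement mentioned above can be sketched in a few lines. This is not SMBL's API or fastDNAml's code: a local thread pool stands in for remote "worker" processes such as a Condor flock, and `score()` is a toy placeholder rather than a likelihood computation.

```python
# Sketch of master/worker dispatch for independent candidate evaluations.
# Assumptions: ThreadPoolExecutor stands in for remote workers, and
# score() is a toy placeholder, not a real likelihood computation.
from concurrent.futures import ThreadPoolExecutor


def score(candidate):
    """Toy stand-in for an expensive, independent tree evaluation."""
    return -sum((x - 3) ** 2 for x in candidate)


def best_candidate(candidates, n_workers=4):
    """Master: farm out candidates, collect scores, keep the best one."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        scores = list(pool.map(score, candidates))
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index], scores[best_index]


candidates = [(1, 2, 3), (3, 3, 3), (0, 5, 4)]  # hypothetical candidates
winner, winner_score = best_candidate(candidates)
print(winner, winner_score)
```

Each evaluation is independent and returns only a small result, so little data moves relative to the computation done. Workloads with that shape tolerate slow or distant workers, which is the distinction between high-throughput and capability computing drawn above.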


Some thoughts about the future


Current challenge areas

Columns: Problem | High Throughput | Grid | Capability

• Protein modeling: X

• Genome annotation, alignment, phylogenetics: X, X, x*

• Drug Target Screening: X, X (corporate grids), X

• Systems biology: X, X

• Medical practice support: X, X

*Only a few large scale problems merit ‘capability’ status


What is the killer application for biocomputation grids?

• Systems biology – latest buzzword, but…. (see special issues in Nature and Science)

• Goal: multiscale modeling from cell chemistry up to multiple populations

• Current software tools still inadequate

• Multiscale modeling calls for use of established HPC techniques – e.g. adaptive mesh refinement, coupled applications

• The structure of the problems matches the structure of grids

• Current challenge examples: actin fiber creation, heart attack modeling

• Opportunity for predictive biology?


Opportunities in Computational Biology and Biomedical Research

• Bioinformatics and related areas offer tremendous new possibilities

• Computer-oriented biomedical researchers must utilize the detailed knowledge held by “traditional” researchers

• There are tremendous opportunities for computer scientists and computational scientists to find and solve interesting and important problems!

From www.sciencemag.org/feature/data/mosquito/mtm/index.html

Source Library: Centers for Disease Control

Photo Credit: Jim Gathany


Some thoughts about the future of Grids and biocomputing

• Biodata problems are largely solvable now without use of sophisticated grid technology. This will change!

• Biocomputation grids must be developed with appropriate technology choices. Enhancement of software must happen simultaneously!

• Until the grid software becomes substantially simpler for the end user, grid projects will likely continue to be based on communities of common interest.

• There are many biodata grid and biocomputation grid opportunities that are a good match for grid architectures. There are natural similarities between the structure of grids and the likely structure of significant grand challenge problems in computational biology, biomedicine, etc.


Acknowledgments

• This research was supported in part by the Indiana Genomics Initiative (INGEN). The Indiana Genomics Initiative (INGEN) of Indiana University is supported in part by Lilly Endowment Inc.

• This work was supported in part by Shared University Research grants from IBM, Inc. to Indiana University.

• This material is based upon work supported by the National Science Foundation under Grant No. 0116050 and Grant No. CDA-9601632. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

• Particular thanks to Dr. Michael Resch, Director, HLRS, for inviting me to visit HLRS, and to Dr. Matthias Müller and Peggy Lindner for inviting me to speak here today.


Acknowledgments, cont’d

• UITS Research and Academic Computing Division managers: Mary Papakhian, David Hart, Stephen Simms, Richard Repasky, Matt Link, John Samuel, Eric Wernert, Anurag Shankar

• Indiana Genomics Initiative Staff: Andy Arenson, Chris Garrison, Huian Li, Jagan Lakshmipathy, David Hancock

• UITS Senior Management: Associate Vice President and Dean Christopher Peebles, RAC(Data) Director Gerry Bernbom

• Assistance with this presentation: John Herrin, Malinda Lingwall


Additional Information

• Further information is available at
  – http://www.indiana.edu/~uits/rac/
  – http://www.indiana.edu/~rac/staff_papers.html
  – http://www.casc.org

• A recommended German bioinformatics site:
  – http://www.bioinformatik.de/