High Performance Computing for University Medical Research: A Successful Implementation

Dr. Craig A. Stewart, Ph.D. [email protected]

Director, Research and Academic Computing, University Information Technology Services

Director, Information Technology Core, Indiana Genomics Initiative

Dr. Richard Repasky, Ph.D. [email protected]

Bioinformatics Specialist

License Terms

• Please cite this presentation as: Stewart, C.A. and R. Repasky. High Performance Computing for University Medical Research: A Successful Implementation. 2007. Presentation. Presented at: Bio-IT World Conference & Expo (Boston, MA, 24-26 Apr 2007). Available from: http://hdl.handle.net/2022/14600

• Portions of this document that originated from sources outside IU are shown here and used by permission or under licenses indicated within this document.

• Items indicated with a © are under copyright and used here with permission. Such items may not be reused without permission from the holder of copyright except where license terms noted on a slide permit reuse.

• Except where otherwise noted, the contents of this presentation are copyright 2007 by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.

Bioinformatics and Biomedical Research

• Bioinformatics, Genomics, Proteomics, ____ics all promise to radically change our understanding of biological function and the way biomedical research is done.

• Traditional biomedical researchers must take advantage of new possibilities

• “Post-genomic” research must take advantage of the tremendous store of detailed knowledge held by traditional biomedical researchers

Anopheles gambiae

• From www.sciencemag.org/feature/data/mosquito/mtm/index.html

Source Library: Centers for Disease Control PHIL. Photo Credit: Jim Gathany

IU’s goals for the Indiana Genomics Initiative (INGEN)

• Build on traditional strengths of IU School of Medicine

• Build on IU's strength in Information Technology

• Add new programs of research made possible by the sequencing of the human genome

• Perform the research that will generate new treatments for human disease in the post-genomic era

• Improve human health generally and in the State of Indiana particularly

• Enhance economic growth in Indiana

• INGEN was created by a $105M grant from the Lilly Endowment, Inc. and launched December, 2000

• The goal of this talk is to explain how advanced information technology was implemented to aid in the meeting of these goals.

Outline

• Background information about IU

• The Indiana Genomics Initiative (INGEN)

• The INGEN Information Technology Core

• Facilities

• Service

• Some key projects

• Status and summary of success factors

• Acknowledgements

IU in a nutshell

• $2B Annual Budget

• 8 campuses, 90,000 students, 3,900 faculty

• 878 degree programs; > 100 programs ranked within top 20 of their type nationally

• Nation’s second largest school of medicine

• 1,347 M.D., Ph.D. and M.D./Ph.D. students

• Sole school of medicine in Indiana

• Traditional strengths in human genetic diseases (e.g., Alcoholism, Huntington's) and medical records (Regenstrief Institute)

IT @ IU in a nutshell

• CIO: Vice President Michael A. McRobbie

• ~$100M annual budget

• Technology services offered university-wide

• Networking

• IU operates the Network Operations Center for Abilene

• High Performance Computing

• First university in US to own a 1 TFLOPS supercomputer

• Top 500 list has for past several years included at least one IU supercomputer

INGEN Structure

Programs

• Bioethics

• Genomics

• Medical Informatics

• Education

• Training

Cores

• Tech Transfer

• Gene Expression

• Cell & Protein Expression

• Human Expression

• Information Technology

• Proteomics

• Integrated Imaging

• In vivo Imaging

• Animal

[Figure: Indiana Genomics Initiative structure diagram showing Programs and Cores. Labels: Education; Training; Bioinformatics; Medical Informatics; Genomics; Bioethics; Proteomics; Integrated Microscopy; Cell and Protein Expression; Technology Transfer; Information Technology; Drosophila Genotyping and Gene Expression; Human Expression; In Vivo Imaging; Animal]

Information Technology Core

• Foci:

• High Performance Computing

• Visualization (esp. 3D)

• Massive Data Storage

• Support for use of all of the above

• $6.7M budget for IT Core

• Baseline IT services for the School of Medicine are the responsibility of the School of Medicine CIO

Challenges for UITS and the INGEN IT Core

• Assist traditional biomedical researchers in adopting use of advanced information technology (massive data storage, visualization, and high performance computing)

• Assist bioinformatics researchers in use of advanced computing facilities

• Questions we are asked:

• Why wouldn't it be better just to buy me a newer PC?

• Questions we ask:

• What do you do now with computers that you would like to do faster?

• What would you do if computer resources were not a constraint?

Steps in meeting the challenge

• Use INGEN funding to enhance IU’s high performance computing hardware environment

• Use INGEN funding to add dedicated staff supporting INGEN researchers

• Proof of concept projects showing advanced capabilities of IU’s IT environment

• Outreach to get many people using at least the basic capabilities of IU’s advanced IT environment

Hardware Environment

• I-Light network

• High Performance Computing

• IBM SP – 1.005 TFLOPS

• Sun E10000 – 52 GFLOPS

• Large, distributed Linux cluster – 1.1 TFLOPS

• Massive Data Storage system

• Advanced Visualization Systems

• CAVE

• John-E-Box

IBM Research SP (Aries/Orion Complex)

• Acquired 9/96, expanded in 1998, 1999, 2000, 2001, 2002 with help of IU IT Strategic Plan funds, IBM SUR grants and INGEN grant from Lilly Endowment, Inc.

• Geographically distributed at IUB and IUPUI

• 632 CPUs, 1.005 TeraFLOPS

• First University-owned supercomputer in US to exceed 1 TFLOPS processing capacity

• Initially 50th, now 112th in Top 500 supercomputer list

• Distributed memory system with shared memory nodes

• AIX 5.1, wealth of software including SAS, SPSS, S-Plus, Mathematica, Matlab, Maple, Gaussian, GIS, scientific/numerical libraries, Oracle and DB2, and more

IBM Research SP (Aries/Orion)

©2000 Tyagan Miller

Sun E10000 (Solar)

• Acquired 4/00

• Shared memory architecture

• ~52 GFLOPS

• 64 400MHz cpus, 64GB memory

• > 2 TB external disk

• Solaris 2.8

• Supports some bioinformatics software not available under AIX (e.g. GCG/SeqWeb)

Sun E10000 (Solar)

©2000 Tyagan Miller

Distributed Linux Cluster

• AVIDD (Analysis and Visualization of Instrument-Driven Data)

• 1.1 TFLOPS, 0.5 TB RAM, 10 TB Disk

• Tuned, configured, and optimized for handling real-time data streams

Massive Data Storage System

• Based on HPSS (High Performance Storage System)

• 180 TB capacity with existing tapes; total capacity of 480 TB

• First distributed HPSS installation; STK 9310 Silos in Bloomington and Indianapolis

• Automatic overnight replication of data between Indianapolis and Bloomington via I-Light. Critical for biomedical data, which is often irreplaceable.
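The replication idea on this slide can be illustrated with a small sketch. This is not HPSS and not IU's actual mechanism: it assumes two ordinary directories standing in for the two sites, and the names `replicate`, `sha256_of`, and `demo` are hypothetical. It copies new or changed files to the replica and verifies each copy by checksum, which is the property that matters for data that cannot be regenerated.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate(source_dir: Path, replica_dir: Path) -> list[str]:
    """Copy new or changed files to the replica; return their relative names."""
    copied = []
    for source in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        relative = source.relative_to(source_dir)
        target = replica_dir / relative
        if target.exists() and sha256_of(target) == sha256_of(source):
            continue  # replica already holds an identical copy
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)
        # Verify the copy before trusting it as the redundant version.
        if sha256_of(target) != sha256_of(source):
            raise IOError(f"checksum mismatch after copying {relative}")
        copied.append(relative.as_posix())
    return copied


def demo() -> tuple[list[str], list[str]]:
    """Run two replication passes over a throwaway directory tree."""
    with tempfile.TemporaryDirectory() as tmp:
        primary = Path(tmp) / "primary"   # stands in for one site
        replica = Path(tmp) / "replica"   # stands in for the other site
        (primary / "scans").mkdir(parents=True)
        (primary / "scans" / "subject01.dat").write_bytes(b"irreplaceable data")
        first_pass = replicate(primary, replica)
        second_pass = replicate(primary, replica)
        return first_pass, second_pass
```

In this sketch the first pass copies the file and the second pass copies nothing, since the replica already matches. A production system would also handle deletions, partial transfers, and scheduling, none of which are shown here.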

STK Silo

©2000 Tyagan Miller

Advanced Visualization

• Advanced Visualization Lab – recognized as leader in implementation of 3D and other advanced visualization technologies

• CAVE – Immersive 3D environment

• John-E-Box – IU designed, low-cost passive 3D device. Under construction now, planned for installation in multiple INGEN-affiliated labs

John-E-Box

Invented by John N. Huffman, John C. Huffman, and Eric Wernert

Specific benefits in hardware environment as a result of INGEN funding:

• Funded significant fraction of upgrade of IU’s IBM SP to 1 TFLOPS

• Funded addition of STK Silo in Indianapolis (and tapes) to provide redundant storage of data

• Funded placement of visualization equipment within the School of Medicine

So, what now that we have all of this hardware?

• Strategic relationships with vendors

• University Information Technology Services has a history of excellent customer support and long-term, collaborative research.

• Focus on provision of facilities and services as a competitive advantage.

• Annual customer satisfaction survey – user satisfaction typically > 95%. These results are probably not representative of the School of Medicine (SoM) as of 2000.

• More information available at http://www.indiana.edu/~rac/siguccs_copyright.html

• It’s people – consulting staff – that make the hardware useful for researchers

INGEN IT Core Support Staff

• Visualization programmer, HPC programmer, and bioinformatics database specialist hired to support INGEN

• Staff added to existing management units within UITS

• Economy of scale (management, exchange of expertise)

• Assures addition rather than substitution for base-funded consulting support

So, why is this better than just buying me a new PC?

• Unique facilities provided by IT Core

• Redundant data storage

• HPC – better uniprocessor performance; trivially parallel programming, parallel programming

• Visualization in the research laboratories

• Hardcopy document – INGEN's advanced IT facilities: The least you need to know

• Outreach efforts

• Demonstration projects

Example projects

• Multiple simultaneous Matlab jobs for brain imaging.

• Installation of many commercial and open source bioinformatics applications.

• Site licenses for several commercial packages

• Evaluation of several software products that were not implemented.
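The "multiple simultaneous jobs" pattern above is what the earlier slide calls trivially parallel programming: each job is independent, so the work can simply be fanned out. The following is a minimal sketch in Python rather than Matlab, with a made-up `analyze_subject` workload standing in for a real imaging analysis. It uses threads so that it runs anywhere; on a cluster each task would more likely be a separate batch job or process.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_subject(subject_id: int) -> tuple[int, float]:
    """Stand-in for one independent analysis job (hypothetical workload)."""
    samples = [(subject_id * 31 + i) % 97 for i in range(1000)]
    return subject_id, sum(samples) / len(samples)


def run_all(subject_ids: list[int], workers: int = 4) -> dict[int, float]:
    """Run every job concurrently; no job depends on any other job."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(analyze_subject, subject_ids))


if __name__ == "__main__":
    for subject, mean_value in run_all(list(range(8))).items():
        print(f"subject {subject}: mean = {mean_value:.2f}")
```

Because the jobs share no state, the results are the same as running them one after another; only the elapsed time changes.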

Creation of new software

• Gamma Knife – Penelope. Modified existing version for more precise targeting with IU's Gamma Knife.

• Karyote (TM) Cell model. Developed a portion of the code used to model cell function. http://biodynamics.indiana.edu/

• PiVNs. Software to visualize human family trees

• 3-DIVE (3D Interactive Volume Explorer). http://www.avl.iu.edu/projects/3DIVE/

• fastDNAml – maximum likelihood phylogenies (http://www.indiana.edu/~rac/hpc/fastDNAml/index.html)

• Protein Family Annotator – collaborative development with IBM, Inc.

Data Integration

• Goal set by IU School of Medicine: Any researcher within the IU School of Medicine should be able to transparently query all relevant public external data sources and all sources internal to the IU School of Medicine to which the researcher has read privileges

• IU has more than 1 TB of biomedical data stored in massive data storage system

• There are many public data sources

• Different labs were independently downloading, subsetting, and formatting data

• Solution: IBM DiscoveryLink, DB2 Information Integrator

Centralized Life Science Database (CLSD)

• Based on use of IBM DiscoveryLink(TM) and DB2 Information Integrator(TM)

• Public data is still downloaded, parsed, and put into a database, but now the process is automated and centralized.

• Lab data and programs like BLAST are included via DiscoveryLink’s wrappers.

• Implemented in partnership with IBM Life Sciences via IU-IBM strategic relationship in the life sciences

• IU contributed writing of data parsers
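The "download, parse, and load" step described above can be sketched as follows. This is only an illustration: it uses SQLite in place of DB2 and DiscoveryLink, a tiny inline FASTA sample in place of a downloaded public data set, and hypothetical names (`parse_fasta`, `load`, the `sequences` table). It does not represent the parsers IU actually wrote.

```python
import sqlite3
from typing import Iterator

# Made-up sample standing in for a downloaded public data file.
SAMPLE_FASTA = """\
>seq1 hypothetical protein A
MKTAYIAKQR
QISFVKSHFS
>seq2 hypothetical protein B
MADEEKLPPG
"""


def _record(header: str, chunks: list[str]) -> tuple[str, str, str]:
    """Split a header into identifier and description; join sequence lines."""
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


def parse_fasta(text: str) -> Iterator[tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each FASTA record."""
    header = None
    chunks: list[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _record(header, chunks)


def load(connection: sqlite3.Connection, text: str) -> int:
    """Parse the text and upsert every record; return the record count."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS sequences ("
        "id TEXT PRIMARY KEY, description TEXT, residues TEXT)"
    )
    rows = list(parse_fasta(text))
    with connection:  # commit on success, roll back on error
        connection.executemany(
            "INSERT OR REPLACE INTO sequences VALUES (?, ?, ?)", rows
        )
    return len(rows)


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    print(load(db, SAMPLE_FASTA), "records loaded")
```

Keying the table on the sequence identifier makes the load repeatable, so an automated nightly refresh replaces records rather than duplicating them. That is one way to get the "automated and centralized" behavior the slide describes.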

Status Overall

• So far, so good

• 108 users of IU’s supercomputers

• 104 users of massive data storage system

• Six new software packages created or enhanced, more than 20 packages installed for use by INGEN-affiliated researchers

• 1 TB of biomedical data stored in the massive data storage system

• Three software packages made available as open source software as direct result of INGEN

• The INGEN IT Core is providing services valued by traditionally trained biomedical researchers as well as researchers in bioinformatics, genomics, proteomics, etc.

Success in meeting goals?

• Work on Penelope code for Gamma Knife likely to be first major transferable technology development. Stands to improve efficacy of Gamma Knife treatment at IU

• Excellent success in supporting basic research

• Development of open source software (licensed under terms similar to the GNU Lesser General Public License) provides opportunities for technology transfer

• Participation in grants and industrial partnerships provides economic benefit for IU

Success factors

• Creation of new position, Chief Information Officer and Associate Dean, within IU School of Medicine, and significant improvement in basic IT infrastructure within the IU School of Medicine

• INGEN has permitted IU to build on excellent IT infrastructure

• Dedicated (but not isolated) staff supporting INGEN researchers

• Commitment to customer service

• Outreach (in the proper formats)

Success factors, cont'd

• Scientific collaborations

• Strategy research on behalf of IU School of Medicine

• Accountability

• Leveraging of industrial partnerships

Funding Support

• This research was supported in part by the Indiana Genomics Initiative (INGEN). The Indiana Genomics Initiative (INGEN) of Indiana University is supported in part by Lilly Endowment Inc.

• Joint Study Agreement with IBM, Inc. Protein Family Annotator: School of Informatics - M Dalkilic, Center for Genomics and Bioinformatics - P Cherbas, Univ. Information Technology Services & INGEN IT Core - C Stewart.

• This work was supported in part by Shared University Research grants from IBM, Inc. to Indiana University.

• This material is based upon work supported by the National Science Foundation under Grant No. 0116050 and Grant No. CDA-9601632. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

Additional Information

• Further information is available at

• ingen.iu.edu

• http://www.indiana.edu/~uits/rac/

• http://cgb.indiana.edu/

• http://www.ncsc.org/casc/paper.html

Acknowledgements (People)

• UITS Research and Academic Computing Division managers: Mary Papakhian, David Hart, Stephen Simms, Richard Repasky, Matt Link, John Samuel, Eric Wernert, Anurag Shankar

• INGEN Staff: Andy Arenson, Chris Garrison, Huian Li, Jagan Lakshmipathy, David Hancock

• UITS Senior Management: Associate Vice President and Dean Christopher Peebles, RAC(Data) Director Gerry Bernbom

• Assistance with this presentation: John Herrin, Malinda Lingwall