
Page 1:

Genomes to Grids Thoughts on Building Data Grids for Biology

Biologists have discovered many millions of genes and genome features, now part of the bio-data "library" distributed on computers around the world. This talk discusses grid computing methods for finding and using interesting genome knowledge in this mountain of data: their promise, and the practical concerns in building usable bioinformatics grids.

Don Gilbert, [email protected]

Page 2:

Bio Databanks, EBI, Sept. 2002

Databank     Contents             Entries
EMBL         DNA sequences        18,800,000
SWALL        Protein sequences       900,000
InterPro+    Protein motifs        1,000,000
HGBASE       SNP database          1,500,000
Metabolic    Pathways                250,000
MEDLINE      Literature           11,350,000
Total                             33,800,000

Many data objects and data sets are updated frequently (daily) --> keeping data current is a problem

Page 3:

Constellation of Bio-Data (SRS - Lion Bioscience)

Many databanks, variously structured, widely distributed, loosely federated - finding the “best data” is a problem

Page 4:

Genome database & info system components

Any genome database relies on, and feeds into, many other databases

Page 5:

BioGrid Schematic

• Grid-aware client software
• Data and software directories
• Grid of processing computers

Page 6:

Moving Bio-Data on the Grid

1. @virtualdata = biodirectory("find protein coding sequences for Drosophila and Anopheles species")

2. @realdata = biodirectory("get locators for @virtualdata split n ways")

3. for i (1..n) { copydata(realdata[i], gridcpu[i]); runapp(gridcpu[i]) }
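
The three steps above can be sketched in Python. This is only an illustration under stated assumptions: the directory is an in-memory dictionary, the object IDs and `srs://example.org/...` locators are invented, and `run_on_node` is a hypothetical stand-in for the slide's `copydata` and `runapp` calls, which are not specified in the talk.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical directory: object ID -> locator. IDs and URLs are invented.
DIRECTORY = {
    "dmel-cds-0001": "srs://example.org/dmel/cds/0001",
    "dmel-cds-0002": "srs://example.org/dmel/cds/0002",
    "dmel-cds-0003": "srs://example.org/dmel/cds/0003",
    "agam-cds-0001": "srs://example.org/agam/cds/0001",
    "agam-cds-0002": "srs://example.org/agam/cds/0002",
    "hsap-cds-0001": "srs://example.org/hsap/cds/0001",
}


def find_objects(directory, organisms):
    """Step 1: resolve a query to 'virtual data' (object IDs only)."""
    return sorted(oid for oid in directory if oid.split("-")[0] in organisms)


def split_locators(directory, object_ids, n):
    """Step 2: look up a locator for each ID and split them n ways."""
    locators = [directory[oid] for oid in object_ids]
    return [locators[i::n] for i in range(n)]


def run_on_node(node, locators):
    """Step 3 stand-in: pretend to copy the data to a node and run an app.

    Returns the node name and the number of objects it was given.
    """
    return node, len(locators)


def run_grid_job(directory, organisms, nodes):
    """Run all three steps, one share of the data per node."""
    ids = find_objects(directory, organisms)
    shares = split_locators(directory, ids, len(nodes))
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        return list(pool.map(run_on_node, nodes, shares))
```

Calling `run_grid_job(DIRECTORY, {"dmel", "agam"}, ["cpu0", "cpu1"])` hands each node its own share of the matching locators. The point of the design is that the query returns identifiers first, and bulk data moves only after the work has been divided.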

Page 7:

Directories of Bio-Data

• Directories are a necessary step for usable grids of bio-data
  – "broad and shallow" directories federate the "narrow and deep" databases

• Bio-Data Access Tools
  SRS (Sequence Retrieval System); Entrez; AceDB; genome relational databases (Ensembl, FlyBase, WormBase); IBM DiscoveryLink; BioDAS; BioMoby

• Directory services for data access tools
  – Layer onto access tools for common query/retrieval of important data
  – LDAP: mature, efficient for high volumes, queries over distributed directories; works well with bio-access tools
  – Web Services: XML messages over Web; wide industry support, but standards are in progress
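
One way to picture a "broad and shallow" directory layered over "narrow and deep" databases is a small index that keeps only a few descriptive fields and a locator per object, leaving the full record in its source database. The sketch below assumes this reading; the class names, field names and example values are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirectoryEntry:
    """A shallow description of one bio-object (fields are illustrative)."""
    object_id: str
    source: str      # which deep database holds the full record
    organism: str
    kind: str        # e.g. "gene" or "protein"
    locator: str     # where to fetch the full record


class BioDirectory:
    """Broad, shallow index over many deep databases."""

    def __init__(self):
        self._entries = []

    def register(self, entry):
        """Add one entry to the index."""
        self._entries.append(entry)

    def search(self, **criteria):
        """Return entries whose fields equal every given criterion."""
        return [
            e for e in self._entries
            if all(getattr(e, name) == value for name, value in criteria.items())
        ]
```

A client would call something like `directory.search(organism="Drosophila melanogaster", kind="gene")` and then follow each entry's `locator` to the source database for the deep record.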

Page 8:

Bio-data Directory Needs

• Build on existing technology for finding distributed objects
• Efficient for millions of objects, by the gigabyte and terabyte
• Queries distributed across directories of collaborating services
• Support existing and new bioinformatics data access (relational dbs, object and XML dbs, SRS, Entrez, AceDB)
• Simple client program methods for computable use of directories
• Flexible, common schema for describing objects
• Replicate directories and objects among bioinformatics centers
• Peer-to-peer directories for collaborative projects
• Strong authentication and security for data access
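
As an illustration of a "simple client program method", the sketch below builds an LDAP search filter string in the standard parenthesized syntax, escaping the special characters as the LDAP filter specification (RFC 4515) requires. The attribute names `organism` and `molType` are hypothetical; the talk does not give a directory schema.

```python
# Characters that must be escaped inside an LDAP filter value (RFC 4515).
_ESCAPES = {
    "\\": r"\5c",
    "*": r"\2a",
    "(": r"\28",
    ")": r"\29",
    "\0": r"\00",
}


def escape_filter_value(value):
    """Escape one assertion value for use inside an LDAP search filter."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def and_filter(**attrs):
    """Build an AND filter such as (&(a=1)(b=2)) from keyword arguments."""
    parts = "".join(
        f"({name}={escape_filter_value(value)})" for name, value in attrs.items()
    )
    return f"(&{parts})"
```

The resulting string could be passed to any LDAP client library's search call. Escaping matters here because values taken from user queries may contain parentheses or asterisks that would otherwise change the meaning of the filter.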

Page 9:

Directory technology: LDAP, Web Services and/or?

• LDAP
  – Object-centric, optimized for efficient read operations
  – Hierarchical, network-able, distributed and replicated in nature
  – Has many features needed for bio-data access

• Web/XML
  – SOAP+: SOAP for directory requests, WSDL to interface the directory repository, UDDI to locate the service (some assembly still required…)
  – UDDI is potential match to LDAP as directory technology
  – DSML: layer on top of LDAP for Web/XML interoperability

• Peer-to-peer (JXTA)? Grid SQL? XML-query systems?
  – Possible future directory technology
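
To make the idea of an XML layer over LDAP concrete, the sketch below renders one directory entry (a distinguished name plus multi-valued attributes) as XML. It is loosely modeled on the entry/attribute/value shape that DSML uses, but the element names are illustrative and have not been checked against the DSML schema. The example DN and attribute names are also hypothetical.

```python
import xml.etree.ElementTree as ET


def entry_to_xml(dn, attributes):
    """Render a directory entry as an XML string.

    dn         -- distinguished name of the entry
    attributes -- mapping of attribute name to a list of string values
    """
    entry = ET.Element("entry", {"dn": dn})
    for name, values in attributes.items():
        attr = ET.SubElement(entry, "attr", {"name": name})
        for value in values:
            ET.SubElement(attr, "value").text = value
    return ET.tostring(entry, encoding="unicode")


def xml_to_entry(xml_text):
    """Parse the XML produced by entry_to_xml back into (dn, attributes)."""
    entry = ET.fromstring(xml_text)
    attributes = {
        attr.get("name"): [v.text for v in attr.findall("value")]
        for attr in entry.findall("attr")
    }
    return entry.get("dn"), attributes
```

A round trip such as `xml_to_entry(entry_to_xml(dn, attrs))` returns the original pair, which is the property an XML/LDAP conversion layer needs so that Web Services clients and LDAP clients see the same objects.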

Page 10:

BioDirectory Tests

• SRS bioinformatics data retrieval system, for efficient retrieval of millions of bio-objects
• OpenLDAP for high performance and JavaLDAP for easy-to-configure directory transport
• GLUE and Jakarta/Tomcat for Web Services tests
• DSML (Directory Services Markup Language) for XML/LDAP conversion
• Test queries: 20,000 to 1.2 million biosequence objects from GenBank, SwissProt and related dbs

IUBio SRS Server + LDAP, WebServices --> Bio-object directory search/retrieval

Page 12:

Using Bio Directories

• Simple client software
• Automated use
• People use
• Discovery
• Search by many criteria
• Retrieve bulk subsets
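
For the "retrieve bulk subsets" use, a client working through a large result set would typically fetch it in fixed-size batches and avoid holding everything in memory at once. The generator below is a minimal sketch of that pattern; the function name and batch size are assumptions, not part of the talk.

```python
from itertools import islice


def batches(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

A client could loop with `for chunk in batches(object_ids, 1000):` and request each chunk from the directory in turn, which keeps memory use bounded for result sets of any size.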

Page 13:

BioGrid Runner

A Globus CoG kit application for bioinformatics
http://iubio.bio.indiana.edu/biogrid/runner/

Page 14:

Wrap up

• Future of Bio-data on Grids
  – Globus Toolkit useful for bio-grid data & compute intensive tools (BLAST, HMMer, Meme, others)
  – High volume, complex, changing, distributed data
  – Add methods to find & move data among grid, directories of objects
  – LDAP works well; Web-XML is usable, being defined

• Bio. Community needs and uses
  – Common data descriptions, schema, ontologies
  – Simple, practical, flexible grid methods; use existing dbs

See http://iubio.bio.indiana.edu/biogrid/