IBM Watson Research

© 2004 IBM Corporation

BioHaystack: Gateway to the Biological Semantic Web

Dennis Quan [email protected]

Problems in bioinformatics

A myriad of public databases each hold specific facets of information about biological objects of interest (e.g., proteins, genes)

Databases have their own access protocols, data formats, naming conventions, and means of describing relationships between objects in different databases

Different software required to view information from different databases

– User must be keenly aware of which tool or site to use

– Relevant information comes in fragments

– Exploration process is discontinuous

A common naming convention: LSID URNs

Life Sciences Identifiers (LSIDs) are URNs for biological objects that are backed by RDF metadata:

– E.g., urn:lsid:ncbi.nlm.nih.gov.lsid.i3c.org:genbank:nm_001240

LSID and LSID protocol (SOAP-based) specification sponsored by I3C and undergoing standardization by OMG

Most publicly available bioinformatics databases are accessible via LSID today

– PDB LSID authority is online; “proxy” LSID authorities for databases such as the NIH databases and SwissProt are hosted by I3C

Really easy to set up LSID clients and servers

– IBM Internet Technology group provides Open Source LSID client and server software for a variety of languages and platforms
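
As a rough illustration of the naming convention on this slide, the sketch below splits an LSID URN into its parts. It assumes the usual LSID layout of `urn:lsid:authority:namespace:object` with an optional trailing revision; the `LSID` type and `parse_lsid` function are hypothetical names made up for this example and are not part of the IBM LSID toolkit.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    """Parts of an LSID URN (hypothetical helper type, not the IBM toolkit API)."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(urn: str) -> LSID:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into its parts."""
    parts = urn.split(":")
    # Expect the two fixed prefixes plus three or four non-empty components.
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID URN: {urn!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID URN: {urn!r}")
    if not all(parts[2:5]):
        raise ValueError(f"LSID has an empty component: {urn!r}")
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(parts[2], parts[3], parts[4], revision)


# The example identifier from the slide.
example = parse_lsid("urn:lsid:ncbi.nlm.nih.gov.lsid.i3c.org:genbank:nm_001240")
print(example.authority, example.namespace, example.object_id)
```

Resolving the identifier to its data and RDF metadata would go through an LSID authority and is not shown here.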

RDF/XML: on demand data integration

[Figure: GenBank, Gene Ontology, and PDB each supply an RDF fragment about the same "human hemoglobin" LSID. The fragments are combined (+) into one unified view containing three statements:]

– human hemoglobin LSID derives from atagccgtacctgcgagtctagaagct

– human hemoglobin LSID is a oxygen transport protein

– human hemoglobin LSID has 3D structure
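
The on-demand integration pictured on this slide can be sketched with plain sets of (subject, predicate, object) triples: because every source names the protein with the same LSID, merging is just a set union. This is a minimal sketch under stated assumptions, not how Haystack or any RDF library is implemented. The LSID value, the predicate spellings, the placeholder for the 3D structure, and the pairing of each statement with a particular database are all assumptions made for illustration.

```python
# Hypothetical LSID for human hemoglobin; the slide does not give the real one.
HEMOGLOBIN = "urn:lsid:example.org:proteins:human_hemoglobin"

# One small graph per source. Which source holds which statement is assumed.
genbank = {(HEMOGLOBIN, "derivesFrom", "atagccgtacctgcgagtctagaagct")}
gene_ontology = {(HEMOGLOBIN, "isA", "oxygen transport protein")}
pdb = {(HEMOGLOBIN, "has3DStructure", "<3D structure record>")}  # placeholder object


def merge(*graphs):
    """Union the triples of all graphs; shared subjects line up automatically."""
    unified = set()
    for graph in graphs:
        unified |= graph
    return unified


def describe(graph, subject):
    """Collect every predicate/object pair known about one subject."""
    return {predicate: obj for s, predicate, obj in graph if s == subject}


unified = merge(genbank, gene_ontology, pdb)
for predicate, obj in sorted(describe(unified, HEMOGLOBIN).items()):
    print(predicate, "->", obj)
```

The shared identifier is what makes the union meaningful; with per-database names, a mapping step would be needed before the graphs could be combined.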

Haystack: letting users interact with their data

Haystack is a tool for creating, exploring, and organizing information:

– Personal information: e-mails, contacts, documents, etc.

– Bioinformatics: proteins, publications, genes, etc.

Research project originating from MIT CSAIL

Uses RDF as an underlying data model

Built on Java and Eclipse, IBM’s Open Source rich client platform

http://haystack.lcs.mit.edu/

Browsing highly interconnected information

Single screen presents multiple facets of a single object originating from separate databases

Users navigate the information space as they would in a Web browser: hyperlinking, drag and drop, etc.

Personalization

People keep track of their information by personalizing their workspaces:

– Grouping paperwork into folders

– Highlighting important text in documents

– Attaching sticky notes as reminders

– Jotting down lists of related items

Haystack has pervasive support for annotation and allows users to group related objects together arbitrarily for their own purposes

BioHaystack

BioHaystack: application of Haystack technologies to bioinformatics problems

– Integrated environment for working with biological data

– Intended for end users, i.e., non-programmers

– Builds on LSID, RDF, and Haystack

Integration offers the promise of lowering barriers to access to different backend systems (e.g., LSID servers, Grids, Web Services, relational databases, annotation servers)

Just as the Web browser acts as a client for Web content, BioHaystack can act as a client for biological Semantic Web content and services

Real world collaboration: myGrid

UK-funded joint project with the University of Manchester and other UK research institutions

RDF-based platform for supporting e-Science experiments

Real use cases; developed in collaboration with bioinformaticians

myGrid creates LSIDs and RDF metadata in the process of enacting experiments for scientists

Using BioHaystack as a browser for metadata

myGrid Architecture

[Figure: myGrid architecture diagram. Components: Registry, mIR, Discovery View, Haystack Provenance Browser, FreeFluo Enactor, Taverna WF Builder, Pedro Annotation tool, Ontology Store, WSDL, Soap-lab, and others. Labels: Interface Description, Annotation/description, Annotation providers, Query & Retrieve, Workflow Execution, Store data/knowledge, Scientists, Bioinformaticians, invoking, Query & register, Service Providers, Data descriptions, Vocabulary.]

Courtesy of Professor Carole Goble, University of Manchester

BioHaystack + myGrid

Courtesy of Professor Carole Goble, University of Manchester

Thank you for your attention

Dennis Quan, [email protected] (IBM Watson Research)

Haystack project home page (download available May 24)

– http://haystack.lcs.mit.edu/

IBM LSID home page

– http://www.ibm.com/developerworks/oss/lsid/

myGrid home page

– http://www.mygrid.org.uk/

See also our session on constructing Haystack applications:

– Developer’s Day, Saturday, 4:30pm