Page 1: BioMart Databases made easy Richard Holland European Bioinformatics Institute Helsinki, September 2006.

BioMart

Databases made easy

Richard Holland
European Bioinformatics Institute
Helsinki, September 2006

Page 2:

BioMart

• A joint project
  – European Bioinformatics Institute (EBI)
  – Cold Spring Harbor Laboratory (CSHL)

• Aim
  – To develop a generic, query-oriented data management system capable of integrating distributed data sources.

Page 3:

Focus

• ‘Data mining’ or advanced search
  – Creating custom datasets
  – Querying multiple datasets
  – Interactive

• Users
  – People who provide database-based services
  – ‘Power user’ biologists and bioinformaticians

Page 4:

Requirements

• User
  – ‘One-stop shop’ for biological data
  – Suitable for power biologists and bioinformaticians
  – A set of interfaces that allow the user to group and refine biological data based upon many criteria

• Deployer
  – ‘Out of the box’ installation
  – Built-in query optimization
  – Easy data federation

• Architecture
  – Domain agnostic
  – Distributed
  – Platform independent

Page 5:

Advanced search GUIs

Page 6:

Single interface

Page 7:

Single access point

Page 8:

Queries across different databases

[Diagram: Dataset 1 and Dataset 2 connected by Links]

Page 9:

Main features

• Domain agnostic
• Platform independent (MySQL, Oracle, Postgres)
• Scalable for big datasets
• Federated architecture
• Automated UI configuration

Page 10:

How does it work?

Page 11:

BioMart

[Diagram labels: Source data; Data mart; XML meta data (three documents); BioMart software]

Page 12:

Query Engine

Federated architecture

Page 13:

Data model

[Diagram: source tables related through primary keys (PK) and foreign keys (FK)]

Page 14:

Data model

[Diagram: the same source tables related through primary keys (PK) and foreign keys (FK), with further FK-linked tables added]

Page 15:

Data model - ‘reversed star’

[Diagram: main tables (main1 with key PK1, a second main table with keys PK2 and PK1) surrounded by dimension (dm) tables that reference them through FK1 and/or FK2]
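The reversed-star layout can be tried out on a toy scale. The following is only a minimal sketch, assuming an in-memory SQLite database; the table names (`demo__gene__main`, `demo__xref__dm`), the column names and the sample rows are hypothetical and do not come from the slides. It shows a main table holding one row per central object and a dimension table holding 1:n data that points back to it through a shared key.

```python
import sqlite3


def build_demo_mart():
    """Create a tiny in-memory mart: one main table and one dimension (dm) table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE demo__gene__main (
            gene_id_key INTEGER PRIMARY KEY,  -- plays the role of PK1
            gene_name   TEXT
        );
        CREATE TABLE demo__xref__dm (
            gene_id_key INTEGER REFERENCES demo__gene__main (gene_id_key),  -- FK1
            xref        TEXT
        );
        INSERT INTO demo__gene__main VALUES (1, 'geneA'), (2, 'geneB');
        INSERT INTO demo__xref__dm VALUES (1, 'X100'), (1, 'X200'), (2, 'X300');
        """
    )
    return conn


def xrefs_for_gene(conn, gene_name):
    """Join the dimension table back to the main table through the shared key."""
    rows = conn.execute(
        """
        SELECT d.xref
        FROM demo__gene__main AS m
        JOIN demo__xref__dm AS d ON d.gene_id_key = m.gene_id_key
        WHERE m.gene_name = ?
        ORDER BY d.xref
        """,
        (gene_name,),
    ).fetchall()
    return [row[0] for row in rows]
```

Each query needs at most one join from the main table to a dimension table, which is the property the ‘reversed star’ shape is meant to give.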

Page 16:

Data mart and dataset

Dataset

Page 17:

Data mart, dataset and virtual schema

virtual schema

Page 18:

BioMart abstractions

• Dataset
  – A subset of data organized into 1 or more tables

• Attribute
  – A single data point, e.g. gene name

• Filter
  – An operation on an attribute, e.g. ‘Chromosome = 1’

Page 19:

Datasets, Attributes and Filters

GENE table columns: gene_id (PK), gene_stable_id, gene_start, gene_chrom_end, chromosome, gene_display_id, description

[Diagram labels: Mart; Dataset; Attribute; Filter]
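One way to picture the three abstractions is to map them onto plain SQL. This is a minimal sketch, not BioMart's own implementation: the `Filter` and `Query` classes are hypothetical, a dataset is treated as a single table, an attribute as a column to return, and a filter as a comparison on a column.

```python
from dataclasses import dataclass, field


@dataclass
class Filter:
    attribute: str  # column the filter operates on
    operator: str   # comparison, e.g. "=", ">=", "<="
    value: object   # value to compare against


@dataclass
class Query:
    dataset: str     # table holding the dataset (assumed to be a single table here)
    attributes: list  # columns to return
    filters: list = field(default_factory=list)

    def to_sql(self):
        """Render the query as parameterised SQL plus its bound values.

        Table and column names are interpolated directly, so they must come
        from trusted configuration, never from user input.
        """
        allowed = {"=", "<", ">", "<=", ">="}
        sql = f"SELECT {', '.join(self.attributes)} FROM {self.dataset}"
        clauses, params = [], []
        for f in self.filters:
            if f.operator not in allowed:
                raise ValueError(f"unsupported operator: {f.operator}")
            clauses.append(f"{f.attribute} {f.operator} ?")
            params.append(f.value)
        if clauses:
            sql += " WHERE " + " AND ".join(clauses)
        return sql, params
```

For example, `Query("gene", ["gene_stable_id", "description"], [Filter("chromosome", "=", "1")]).to_sql()` selects two attributes of the GENE table and restricts them with the ‘Chromosome = 1’ filter.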

Page 20:

BioMart abstractions (cont)

• Link
  – ‘Common currency’ between two datasets, e.g. accession

• Exportable
  – Potential links to export

• Importable
  – Potential links to import

Page 21:

Exportables, Importables and Links

[Diagram: Dataset 1 and Dataset 2 connected by Links]

Page 22:

Exportables, Importables and Links

Dataset 1 (Exportable)
  name = uniprot_id
  attributes = uniprot_ac

Dataset 2 (Importable)
  name = uniprot_id
  filters = uniprot_ac

[Diagram: the exportable of Dataset 1 is joined to the importable of Dataset 2 by Links]

Page 23:

Exportables, Importables and Links

Dataset 1 (Exportable)
  name = genomic_region
  attributes = chr_name, chr_start, chr_end

Dataset 2 (Importable)
  name = genomic_region
  filters = chr_name (=), chr_start (>=), chr_end (<=)

[Diagram: the exportable of Dataset 1 is joined to the importable of Dataset 2 by Links]
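The genomic_region link can be read as follows: each row exported by Dataset 1 supplies one value per attribute, and Dataset 2 applies those values through its filters with the stated operators. The sketch below illustrates that reading with in-memory rows; the `link` function and the sample data are hypothetical, and how BioMart evaluates links internally is not described on the slides.

```python
import operator

# Exportable of Dataset 1: attribute names, in link order.
EXPORTABLE = ["chr_name", "chr_start", "chr_end"]

# Importable of Dataset 2: filter names paired with the comparison each applies,
# in the same order as the exportable's attributes.
IMPORTABLE = [
    ("chr_name", operator.eq),
    ("chr_start", operator.ge),
    ("chr_end", operator.le),
]


def link(dataset1_rows, dataset2_rows):
    """Return the Dataset 2 rows that pass the importable filters for any exported region."""
    exported = [tuple(row[name] for name in EXPORTABLE) for row in dataset1_rows]
    matches = []
    for row in dataset2_rows:
        for values in exported:
            # Each filter compares the Dataset 2 column with the exported value.
            if all(op(row[name], value) for (name, op), value in zip(IMPORTABLE, values)):
                matches.append(row)
                break
    return matches
```

With a gene on chromosome 1 spanning 100 to 500 in Dataset 1, only Dataset 2 features on chromosome 1 that start at or after 100 and end at or before 500 are returned.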

Page 24:

Creating BioMart databases

Page 25:

Building BioMart databases

[Diagram: Transformation step, where MartBuilder turns source databases into a mart; Configuration step, where MartEditor and MartBuilder produce the XML configuration]

Page 26:

Schema transformation principles

• Central table
  – Longest n:1, 1:1 path

• Dimension table
  – Central transformation ‘around’ 1:n table.
  – Link tables are decomposed into a set of 1:n first

Page 27:

MartBuilder Application

• Reads database metadata
• Transforms a source schema into suggested datasets and lets you edit the process
• Produces a set of SQL statements (DDL) to run against the server to perform the transformation

Page 28:

Page 29:

Dataset Configuration

• Dataset configuration
• Attributes
• Filters
• Trees, Groups, Collections
• Exportables, Importables
• Semantics
• Relational mapping
• User interface
• Linking datasets
• XML-based

Page 30:

Table naming convention

Naïve configuration

• Tables
  – Meta tables: meta_content
  – Data tables: dataset__content__type

• Data tables
  – Main: __main
  – Dimension: __dm

• Columns
  – Key: _key

Page 31:

Naming convention examples

• Homo sapiens gene ensembl
  – hsapiens_gene_ensembl__gene__main
  – hsapiens_gene_ensembl__xref_hugo__dm

• Encode
  – hsapiens_encode__encode__main

• Uniprot
  – uniprot__protein__main
  – uniprot__interpro__dm

• Uniprot sequence
  – uniprot_sequence__sequence__main
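Because the three parts of a data table name are separated by double underscores, the names above can be taken apart mechanically. The helper below is a small sketch of that idea; the function name `parse_mart_table` and its return shape are hypothetical, and it only handles data tables (names of the form dataset__content__type), not meta tables.

```python
def parse_mart_table(name):
    """Split a data table name of the form dataset__content__type into its parts."""
    parts = name.split("__")
    if len(parts) != 3:
        raise ValueError(f"not a dataset__content__type name: {name!r}")
    dataset, content, table_type = parts
    if table_type not in ("main", "dm"):
        raise ValueError(f"unknown table type: {table_type!r}")
    return {"dataset": dataset, "content": content, "type": table_type}
```

For instance, `parse_mart_table("hsapiens_gene_ensembl__xref_hugo__dm")` yields the dataset `hsapiens_gene_ensembl`, the content `xref_hugo` and the type `dm`. Single underscores inside a part are left alone, which is why the convention uses a double underscore as the separator.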

Page 32:

Dataset Configuration

[Diagram: three XML configuration documents]

Page 33:

MartEditor

Page 34:

Accessing BioMart databases

Page 35:

BioMart architecture

[Diagram, by layer:
  Retrieval: MartExplorer, MartShell and MartView on top of the BioMart API (Java, Perl)
  Configuration: MartEditor (XML)
  Schema transformation: MartBuilder
  Databases: myDatabase and myMart, plus public data, local or remote (SNP, Vega, Ensembl, UniProt, MSD)]

Page 36:

MartView (current)

Page 37:

MartView (new 0_5)

Page 38:

MartExplorer

Page 39:

MartShell

Using = dataset

Get = attribute

Where = filter

Page 40:

MartShell (MQL)

● Uses Mart Query Language (MQL) to generate queries:

using <dataset> get <attributes> where <filters>

● Can join datasets together:

using Dataset1 get Attribute1 where Filter1=var1 as q;

using Dataset2 get Attribute2 where Filter2=var2 and filter3 in q

● Can script and pipe:

martshell.sh -E MQLscript.mql > results.txt
martshell.sh -E MQLscript.mql | wc
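Since MQL statements follow a fixed `using ... get ... where ...` shape, a script file for `martshell.sh -E` can be generated rather than typed by hand. The helper below is a minimal sketch: `build_mql` is a hypothetical name, filters are passed as ready-made MQL fragments, and joining several attributes with commas is an assumption, as the slides only show single-attribute queries.

```python
def build_mql(dataset, attributes, filters=None, alias=None):
    """Compose one MQL statement: using <dataset> get <attributes> where <filters>."""
    # Assumption: multiple attributes are comma-separated.
    mql = f"using {dataset} get {', '.join(attributes)}"
    if filters:
        # Each filter is an MQL fragment such as "Filter1=var1" or "filter3 in q".
        mql += " where " + " and ".join(filters)
    if alias:
        # "as <alias>" names the result so a later statement can refer to it.
        mql += f" as {alias}"
    return mql + ";"


def build_linked_script(first, second):
    """Join two statements into the text of an MQL script, one statement per line."""
    return "\n".join([first, second]) + "\n"
```

The text returned by `build_linked_script` could be written to a file such as `MQLscript.mql` and run with the `martshell.sh -E` commands shown on the slide.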

Page 41:

MartShell examples

MartShell> using MSD.msd get pdb_id where resolution_less < 1.5 and has_ec_info only;
193l
194l
1arb
...

MartShell> using MSD.msd get pdb_id where resolution_less < 1.5 and has_ec_info only as q;
MartShell> using Ensembl.hsapiens_gene_ensembl get sequence transcript_flanks+1000 where pdb in q;
ENST00000270142.2 ENSG00000142168.2 strand=forward chr=21 assembly=NCBI34 downstream flanking sequence of transcript only

AAACTAAATTAGCTCTGATACTTATTTATATAAACAGCTTCAGTGGAA ....

Page 42:

biomaRt

Page 43:

Taverna

Page 44:

DAS ProServer

Page 45:

BioMart deployers

• Large scale data federation (EBI)
• Optimising access to a large database (Ensembl, WormBase)
• Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc …)

Page 46:

Hinxton example

[Diagram labels: EBI (Uniprot, MSD); SANGER (Ensembl, SNP, Vega, Sequence); WWW]

Page 47:

BioMart deployers

• Large scale data federation (Hinxton)

• Optimising access to a large database (Ensembl, WormBase, ArrayExpress)

• Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc …)

Page 48:

WormBase

[Diagram labels: Genes; Expression; Phenotypes; Variations; Literature; Ontologies; Sequence]

Page 49:

Ensembl

[Diagram labels: Genes; Ontologies; Variations; Protein annotation; Disease; Homologies; Sequence; Array annotations]

Page 50:

HapMap

[Diagram labels: Population; Frequencies; Inter population comparisons; Gene annotation]

Page 51:

ArrayExpress

Page 52:

BioMart deployers

• Large scale data federation (Hinxton)
• Optimising access to a large database (Ensembl, WormBase)
• Federating third party data with public data (Pasteur, INRA, Bayer, Unilever, Serono, Sanofi-Aventis, DevGen, Solexa etc …)

Page 53:

In development

• CAPRISA
• RGD
• DICTYBASE
• PURDUE UNIVERSITY
• RZPD

Page 54:

Music Mart

Page 55:

BioMart model

• Already applied
  – Ensembl
  – Vega
  – SNP
  – Uniprot
  – MSD
  – ArrayExpress
  – WormBase
  – Gramene
  – HapMap
  – Variety of ‘in house’ projects (academic and industrial)

Page 56:

User restriction

[Diagram: one Dataset with two XML configurations, selected by martUser: “default” and “advanced”]

Page 57:

Interface configuration

[Diagram: one Dataset with two XML configurations, selected by Interface: “single-page web interface” and “wizard style web interface”]

Page 58:

Web services

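[Diagram: MartView connects to the Local Mart on port 3306. The direct connection to the Remote Mart on port 3306 is crossed out (X); instead MartView exchanges XML with MartService on port 80, which connects to the Remote Mart on port 3306.]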

Page 59:

Web services (cont)

MartService requests

• Registry XML

• Dataset information: name, type etc

• DatasetConfig XML

• Mart Query:
  – The API query object is converted to an XML representation on the client and sent to the server.
  – The query object is regenerated on the server and processed. Results are sent back to the client as a simple tab-delimited HTML page.
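The client side of a Mart Query request can be sketched in a few lines. This is an illustration only: the element and attribute names used here (`Query`, `Dataset`, `Filter`, `Attribute`, `name`, `value`) are assumptions about the XML representation, which the slides do not spell out, and the function names are hypothetical. No network call is made; the sketch covers serialising a query to XML and splitting a tab-delimited response into rows.

```python
import xml.etree.ElementTree as ET


def query_to_xml(dataset, attributes, filters):
    """Serialise a query (dataset, attribute names, filter name/value pairs) to XML."""
    query = ET.Element("Query")
    ds = ET.SubElement(query, "Dataset", name=dataset)
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", name=name, value=str(value))
    for name in attributes:
        ET.SubElement(ds, "Attribute", name=name)
    return ET.tostring(query, encoding="unicode")


def parse_tab_delimited(text):
    """Split a tab-delimited response body into rows of fields, skipping empty lines."""
    return [line.split("\t") for line in text.splitlines() if line]
```

A server following the description above would parse such a document back into a query object, run it, and return rows that `parse_tab_delimited` can split.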

Page 60:

Summary

• A generic data management system
  – A set of easily configurable user interfaces
  – Distributed data federation
  – Query optimization

Page 61:

BioMart

• www.biomart.org
• Open source (LGPL)
• Public MySQL server
• ftp
• [email protected]
• [email protected]

Page 62:

Acknowledgments

• BioMart
  – Arek Kasprzyk (EBI)
  – Damian Smedley (EBI)
  – Syed Haider (EBI)
  – Gudmundur Thorisson (CSHL)

• Contributors
  – Darin London (EBI)
  – Will Spooner (CSHL)
  – Damian Keefe (Ensembl)
  – Arne Stabenau (Ensembl)
  – Andreas Kahari (Ensembl)
  – Craig Melsopp (Ensembl)
  – Katerina Tzouvara (Uniprot)
  – Paul Donlon (Unilever)
  – Steffen Durinck (SCD-ESAT, Katholieke Universiteit Leuven)
  – Benoit Ballester (Universite de la Mediterranee)
  – Stephen Robinson (EBI)
  – Asif Kibria (EBI)