BioMart Query Network Arek Kasprzyk European Bioinformatics Institute 8 January 2005.

Page 1

BioMart Query Network

Arek Kasprzyk
European Bioinformatics Institute
8 January 2005

Page 2

Biological databases

• Distributed
• Different format
• Different focus
• Different release schedule
• Scalability factor

Page 3
Page 4

BioMart

Page 5

[Diagram labels]
• Databases: myDatabase
• Public data (local or remote): SNP, Vega, Ensembl, UniProt, MSD
• Schema transformation: MartBuilder
• Configuration: MartEditor, XML
• myMart
• Retrieval: BioMart API (Java, Perl); MartExplorer, MartShell, MartView

Page 6

MartView

Page 7

BioMart@Ensembl

Page 8

MartShell

Page 9

MartExplorer

Page 10

Database

Page 11

Schema

[Diagram: tables joined through primary keys (PK) and foreign keys (FK)]

Page 12

Schema

[Diagram: tables joined through primary keys (PK) and foreign keys (FK)]

Page 13

Schema

[Diagram: tables joined through primary keys (PK) and foreign keys (FK)]

Page 14

Schema - ‘reversed star’

[Diagram: main tables (main1, main2) with primary keys PK1 and PK2; dimension tables (dm) carrying foreign keys FK1 and FK2]

Page 15

Fixed schema transformation

[Diagram labels: A, B, C, TA, TB]

Page 16

Schema transformation

• Central table
– Longest n:1, 1:1 path

• Dimension table
– Central transformation ‘around’ 1:n table.
– Link tables are decomposed into a set of 1:n first
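
One possible reading of these rules, as a minimal Python sketch. It is a simplification: it follows every n:1 or 1:1 relation reachable from the central table rather than selecting only the longest such path, and it treats each 1:n relation as a dimension. The function name, the `relations` mapping, the cardinality codes and the example cardinalities are all hypothetical.

```python
from collections import deque

# Hypothetical input: relations[table] = [(other_table, cardinality), ...].
# Assumed cardinality codes: "11" = 1:1, "n1" = n:1, "1n" = 1:n.
def plan_transformation(central, relations):
    """Split tables reachable from `central` into main-table members and dimensions."""
    main_tables = [central]
    dimensions = []
    seen = {central}
    queue = deque([central])
    while queue:
        table = queue.popleft()
        for other, cardinality in relations.get(table, []):
            if other in seen:
                continue
            seen.add(other)
            if cardinality in ("11", "n1"):
                # Folded into the main table; keep following its relations.
                main_tables.append(other)
                queue.append(other)
            elif cardinality == "1n":
                # Becomes a dimension table; not traversed further here.
                dimensions.append(other)
    return main_tables, dimensions

# Illustrative cardinalities only, not taken from a real schema.
relations = {
    "gene": [
        ("gene_description", "11"),
        ("gene_stable_id", "11"),
        ("analysis", "n1"),
        ("transcript", "1n"),
    ],
}
main_tables, dimensions = plan_transformation("gene", relations)
print(main_tables)
print(dimensions)
```

A real transformation would also need the link-table decomposition mentioned in the last bullet, which this sketch leaves out.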

Page 17

MartBuilder

• Input
– central object
– database meta data
– cardinalities

• Output
– Set of SQL statements:
• “create table as select …”

• Transformations
– represented as asymmetric tree

Page 18

MartBuilder

DATASET: hsapiens_gene_ensembl
TYPE MAIN [M] DIMENSION [D] EXIT [E]: M
TABLE NAME: gene
gene: alt_allele cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: gene cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: gene_description cardinality [11] [n1] [0n] [1n] [SKIP S]: 11
gene: gene_stable_id cardinality [11] [n1] [0n] [1n] [SKIP S]: 11
gene: kk__gene__main cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: transcript cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: analysis cardinality [11] [n1] [0n] [1n] [SKIP S]: n1
gene: dna cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: dnac cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: seq_region cardinality [11] [n1] [0n] [1n] [SKIP S]: S
TYPE MAIN [M] DIMENSION [D] EXIT [E]: E
ADD EXTENSION: hsapiens_gene_ensembl__gene__MAIN [Y|N]: N
CHANGE FINAL TABLE NAME: hsapiens_gene_ensembl__gene__MAIN TO:

CREATE TABLE TEMP0 AS
SELECT gene.gene_id, gene.type, gene.analysis_id, gene.seq_region_id,
       gene.seq_region_start, gene.seq_region_end, gene.seq_region_strand,
       gene.display_xref_id,
       gene_description.gene_id AS gene_id_TEMP0, gene_description.description
FROM gene, gene_description
WHERE gene_description.gene_id = gene.gene_id;

CREATE TABLE hsapiens_gene_ensembl__gene__MAIN AS
SELECT TEMP0.gene_id, TEMP0.type, TEMP0.analysis_id, TEMP0.seq_region_id,
       TEMP0.seq_region_start, TEMP0.seq_region_end, TEMP0.seq_region_strand,
       TEMP0.display_xref_id, TEMP0.gene_id_TEMP0, TEMP0.description,
       gene_stable_id.gene_id AS gene_id_TEMP1, gene_stable_id.stable_id,
       gene_stable_id.version
FROM TEMP0, gene_stable_id
WHERE gene_stable_id.gene_id = TEMP0.gene_id;

DROP TABLE TEMP0;
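
The chained pattern in the statements above (join one table per step into a temporary table, then drop the temporary) can be sketched in Python and tried against an in-memory SQLite database. This is an illustration only, not MartBuilder's code: `build_statements` is a hypothetical helper, the rows are made-up sample data, and unlike the statements above it does not carry the joined key across under an alias.

```python
import sqlite3

def build_statements(final_name, base, joins, key):
    """Emit chained CREATE TABLE ... AS SELECT statements, one join per step.

    joins is a list of (table, [columns to add]) pairs; every table is
    assumed to share the column `key` with the base table.
    """
    statements = []
    current = base
    for i, (table, columns) in enumerate(joins):
        is_last = i == len(joins) - 1
        target = final_name if is_last else f"TEMP{i}"
        cols = ", ".join(f"{table}.{c}" for c in columns)
        statements.append(
            f"CREATE TABLE {target} AS SELECT {current}.*, {cols} "
            f"FROM {current}, {table} WHERE {table}.{key} = {current}.{key}"
        )
        if current != base:
            # The previous temporary table is no longer needed.
            statements.append(f"DROP TABLE {current}")
        current = target
    return statements

conn = sqlite3.connect(":memory:")
# Made-up sample schema and rows, for illustration only.
conn.executescript("""
CREATE TABLE gene (gene_id INTEGER, type TEXT);
CREATE TABLE gene_description (gene_id INTEGER, description TEXT);
CREATE TABLE gene_stable_id (gene_id INTEGER, stable_id TEXT, version INTEGER);
INSERT INTO gene VALUES (1, 'protein_coding');
INSERT INTO gene_description VALUES (1, 'example gene');
INSERT INTO gene_stable_id VALUES (1, 'EXAMPLE_ID', 1);
""")

statements = build_statements(
    "hsapiens_gene_ensembl__gene__MAIN",
    "gene",
    [("gene_description", ["description"]),
     ("gene_stable_id", ["stable_id", "version"])],
    "gene_id",
)
for statement in statements:
    conn.execute(statement)

rows = conn.execute(
    "SELECT gene_id, type, description, stable_id, version "
    "FROM hsapiens_gene_ensembl__gene__MAIN"
).fetchall()
print(rows)
```

Joining one table per step keeps each generated statement short, at the cost of materialising an intermediate table for every join.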

Page 19

Transformation configuration

satellog_repeats M repeats disease n1
satellog_repeats M repeats gc 11
satellog_repeats M repeats linkage_depth S
satellog_repeats M repeats repeats S
satellog_repeats M repeats transcripts S
satellog_repeats M repeats ugcount S
satellog_repeats M repeats ugstats S
satellog_repeats M repeats rep_class n1
satellog_repeats D ugcount ugcount S
satellog_repeats D ugcount ugstats S
satellog_repeats D ugcount gc S
satellog_repeats D ugcount repeats n1r
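
A configuration in this shape could be read with a few lines of Python. The sketch below assumes one rule per line with five whitespace-separated fields, and the field meanings (dataset, table type, central table, related table, cardinality) are inferred from the MartBuilder prompts rather than documented; `Rule` and `parse_rules` are hypothetical names.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    dataset: str
    table_type: str   # assumed: "M" for main, "D" for dimension
    central: str      # table the transformation starts from
    related: str      # table being related to the central one
    cardinality: str  # e.g. "11", "n1", or "S" for skip

def parse_rules(text):
    """Parse one rule per line; blank or malformed lines are ignored."""
    rules = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue
        rules.append(Rule(*fields))
    return rules

sample = """\
satellog_repeats M repeats disease n1
satellog_repeats M repeats gc 11
satellog_repeats M repeats linkage_depth S
satellog_repeats D ugcount repeats n1r
"""
rules = parse_rules(sample)

# Tables that take part in the transformation (everything not skipped).
kept = [rule.related for rule in rules if rule.cardinality != "S"]
print(kept)
```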

Page 20

Data access

Page 21

Dataset – Key Abstraction

• Dataset
– Organised into a single schema
– A BioMart database contains one or more datasets
– Attribute
– Filter
– Exportable/Importable (Links)

• Dataset - the equivalent of a relational table
– Exportable/Importable = PK/FK

Page 22

Key Abstractions

[Diagram: the GENE CENTRAL table with columns gene_id (PK), gene_stable_id, gene_start, gene_chrom_end, chromosome, gene_display_id and description, annotated with the labels Mart, Dataset, Attribute and Filter]

Page 23

Exportables, Importables and Links

• Exportable = ordered list of attributes
• Importable = ordered list of filters
– WHERE filt1=value1
– WHERE filt1=value1 or filt1=value2
– WHERE filt1>value1 and filt2<value2
• Links = matching importable and exportable
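
To illustrate how a link might carry values from one dataset into another, here is a minimal Python sketch: rows produced by an exportable (an ordered list of attribute values) are turned into a parameterised WHERE clause over the matching importable (an ordered list of filters). It covers only the equality forms shown above, and `where_from_link` and the filter name are hypothetical, not part of the BioMart API.

```python
def where_from_link(importable_filters, exported_rows):
    """Build a WHERE clause and its parameters from exported rows.

    importable_filters: ordered filter names of the receiving dataset.
    exported_rows: tuples of attribute values in the exportable's order.
    """
    if not exported_rows:
        # Nothing was exported, so nothing can match.
        return "WHERE 1 = 0", []
    clauses, params = [], []
    for row in exported_rows:
        if len(row) != len(importable_filters):
            raise ValueError("exportable and importable must have the same length")
        conditions = " AND ".join(f"{name} = ?" for name in importable_filters)
        clauses.append(f"({conditions})")
        params.extend(row)
    return "WHERE " + " OR ".join(clauses), params

clause, params = where_from_link(["filt1"], [("value1",), ("value2",)])
print(clause)
print(params)
```

Using placeholders instead of interpolating the values keeps the exported data out of the SQL text itself.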

Page 24

MartView

Page 25

Dataset Configuration

• Dataset configuration
• Attributes
• Filters
• Trees, Groups, Collections
• Links
• Semantics
• Relational mapping

• User interface
• Linking datasets
• XML-based

Page 26

Dataset Configuration

[Diagram labels: XML, XML, XML]

Page 27

Table naming convention
Naïve configuration

• Tables
– Meta tables: meta_content
– Data tables: dataset__content__type

• Data tables
– Main: __main
– Dimension: __dm

• Columns
– Key: _key
– Boolean filter: _bool
– List filter: _list
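
A naming convention like this lets a tool infer roles from names alone. As a minimal sketch in Python, assuming data table names always have exactly three parts separated by double underscores and that the suffix match is case-insensitive; both helper functions are hypothetical.

```python
def parse_table_name(name):
    """Split dataset__content__type into its three parts."""
    parts = name.split("__")
    if len(parts) != 3:
        raise ValueError(f"not a data table name: {name!r}")
    dataset, content, table_type = parts
    table_type = table_type.lower()
    if table_type not in ("main", "dm"):
        raise ValueError(f"unknown table type in {name!r}")
    return dataset, content, table_type

def column_role(column):
    """Classify a column by its suffix; anything else is a plain column."""
    suffixes = (("_key", "key"), ("_bool", "boolean filter"), ("_list", "list filter"))
    for suffix, role in suffixes:
        if column.lower().endswith(suffix):
            return role
    return "plain column"

print(parse_table_name("hsapiens_gene_ensembl__gene__MAIN"))
print(column_role("gene_id_key"))
```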

Page 28

MartEditor

Page 29

MartEditor

• Naïve configuration
• Updates
• Links
• Automatic discovery of new tables

Page 30

Class diagram - configuration

Page 31

Class diagram - querying

Page 32

Information flow

• Read connections
• Register individual datasets and create linked datasets
• Get input from the user, split queries to individual datasets
• Find the shortest path between datasets (Dijkstra)
• Compile SQL
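
The shortest-path step can be illustrated with a standard Dijkstra implementation in Python. The link graph below is invented for the example (the slides do not say which datasets are linked, or with what weights), and `shortest_path` is a hypothetical helper rather than BioMart's own code.

```python
import heapq

def shortest_path(links, start, goal):
    """Dijkstra's algorithm over a dataset link graph with non-negative weights.

    links[dataset] = {linked_dataset: weight, ...}
    Returns the list of datasets from start to goal, or None if unreachable.
    """
    best = {start: 0}
    previous = {}
    queue = [(0, start)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == goal:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, weight in links.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                previous[neighbour] = node
                heapq.heappush(queue, (new_cost, neighbour))
    if goal not in best:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(previous[path[-1]])
    return path[::-1]

# Hypothetical links between datasets, each with weight 1.
links = {
    "ensembl": {"vega": 1, "uniprot": 1},
    "vega": {"ensembl": 1},
    "uniprot": {"ensembl": 1, "msd": 1},
    "msd": {"uniprot": 1},
}
path = shortest_path(links, "vega", "msd")
print(path)
```

With unit weights a breadth-first search would give the same path; Dijkstra only matters once links carry different costs.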

Page 33

Summary

Page 34

BioMart

• Domain independent
• Platform independent
– MySQL 4
– Oracle 9i
• Plugin architecture

Page 35

BioMart model

• Already applied
– Ensembl
– Vega
– dbSNP
– UniProt
– MSD
– Variety of small projects

• In development
– ArrayExpress
– WormBase
– RGD

Page 36

Future work

• BioMart v0.2 to be released later in January

• Java library to be upgraded over coming months to the new architecture

• BioMart has been integrated with Taverna

• MartBuilder - to be properly implemented

Page 37

BioMart

• www.ebi.ac.uk/biomart
• Open source (LGPL)
• Public MySQL server
• ftp
• [email protected]
• [email protected]

Page 38

Acknowledgments

• BioMart
– Damian Smedley
– Darin London

• Contributors
– Arne Stabenau (Ensembl)
– Andreas Kahari (Ensembl)
– Craig Melsopp (Ensembl)
– Katerina Tzouvara (UniProt)
– Paul Donlon (Unilever)
– Will Spooner (CSHL)

Page 39
Page 40