Model Organism Linked Data (NIH Commons/MOD Interoperability supplement to SGD)

Michel Dumontier

Associate Professor of Medicine

Stanford University

Team

Michel Dumontier (Biomedical Informatics Research, Stanford)

Maxime Déraspe (U. Laval & Biomedical Informatics Research, Stanford)

Jacques Corbeil (U. Laval)

Mike Cherry (Department of Genetics, Stanford)

Kalpana Karra (Department of Genetics, Stanford)

Gail Binkley (Department of Genetics, Stanford)

Gos Micklem (Cambridge Systems Biology Centre, U. of Cambridge)

Julie Sullivan (Cambridge Systems Biology Centre, U. of Cambridge)

25+ endpoints available for different MODs:

YeastMine, WormMine, FlyMine, ZebrafishMine, MouseMine, ThaleMine, HumanMine

Access via (db query) API

Core data object model (commonly used) + Mine-specific customizations

-> heterogeneity in tables, fields, and terminologies poses challenges for interoperability and pan-database queries

InterMine is a platform for Model Organism Data

Linked Data and Semantic Web technologies (RDF, SPARQL) are increasingly adopted in the bioinformatics data provider community:

DBCLS, EBI, NCBI, NLM, and many others

MODs, like many Omics databases, often rely on other people’s content

Linked Data can offer dereferenceable links to authoritative sources

Opportunity to improve MOD data interoperability through mapping of their Ontologies and Vocabularies

Towards increased interoperability with Semantic Web technologies

Model Organism Linked Data (MO-LD)

Effort to expose InterMine data as FAIR: Findable, Accessible, Interoperable, Reusable

Specific Aims:

1. To improve interoperability of MOD data by publishing Linked Data

2. To enable and demonstrate federated queries between MOD data and the network of Linked Data

3. To package our software and data for easier local and cloud-based deployment

Includes 6 MODs: YeastMine, FlyMine, ZebrafishMine, RatMine, MouseMine, HumanMine

Linked with 38 Bio2RDF datasets

RefSeq, PantherDB, GO, NCBI gene, HGNC, ENSEMBL, OMIM, …

InterMine-RDFizer script to reproduce with any InterMine instance

Web application to visualize, explore and query the Linked Datasets

Model Organism Linked Database (MO-LD)

RDFization of InterMine

Query InterMine API with Object Model

Convert the tabular results into triples (RDF)

Merge the resources with the same primary keys

Link Data with external datasets

Load the RDF data into a triple store
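The middle steps of this pipeline (tabular results to triples, then merging on primary key) can be sketched as follows. This is a minimal illustration only, not the InterMine-RDFizer itself: the `BASE` namespace, the function names, and the sample rows are all made up for the example, and the output is plain N-Triples text of the kind a triple store could bulk load.

```python
from collections import defaultdict

# Hypothetical namespace for the example; not the project's real URI scheme.
BASE = "http://example.org/mine/"


def rows_to_triples(rows, key_field, class_name):
    """Convert tabular query results (a list of dicts) into (s, p, o) triples."""
    triples = []
    for row in rows:
        subject = f"{BASE}{class_name}/{row[key_field]}"
        for field, value in row.items():
            if field == key_field or value is None:
                continue
            triples.append((subject, f"{BASE}vocab/{field}", str(value)))
    return triples


def merge_by_subject(triples):
    """Group triples by subject so rows sharing a primary key become one resource."""
    merged = defaultdict(set)
    for s, p, o in triples:
        merged[s].add((p, o))  # the set drops duplicate statements
    return merged


def escape_literal(text):
    """Escape backslashes, quotes and newlines for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_ntriples(merged):
    """Serialize merged resources as N-Triples lines."""
    lines = []
    for s in sorted(merged):
        for p, o in sorted(merged[s]):
            lines.append(f'<{s}> <{p}> "{escape_literal(o)}" .')
    return "\n".join(lines)


# Rows as they might come back from a tabular query (made-up values).
rows = [
    {"id": "1001", "symbol": "geneA", "organism": "S. cerevisiae"},
    {"id": "1001", "symbol": "geneA", "length": 1200},
    {"id": "1002", "symbol": "geneB", "organism": "S. cerevisiae"},
]
merged = merge_by_subject(rows_to_triples(rows, "id", "Gene"))
ntriples = to_ntriples(merged)
print(ntriples)
```

Keying the merge on the subject URI means that two result rows for the same primary key contribute to a single resource, and repeated statements collapse automatically.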

InterMine-LD

External linked datasets (38) with the 6 MODs

Linking MODs with LOD: incomplete linking

Cross References Table* from InterMine

InterMine primary key   Identifier    DataSource
00001                   Q6GZX4        Uniprot
00002                   ASIC1         HGNC
00003                   GO:0004396    GO
00004                   AL732629.6    RefSeq

* Also done with Ontology tables
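One way to picture how a cross-reference row becomes an external link, and why linking stays incomplete, is the sketch below. It assumes Bio2RDF-style URIs of the form `http://bio2rdf.org/<namespace>:<identifier>`; the DataSource-to-namespace mapping, the function name, and the fifth row (`SomeLocalSource`) are hypothetical and only there to show an unmapped source.

```python
# Hypothetical mapping from InterMine DataSource names to Bio2RDF namespaces.
# The real namespaces used by the project may differ.
DATASOURCE_TO_NAMESPACE = {
    "Uniprot": "uniprot",
    "HGNC": "hgnc.symbol",
    "GO": "go",
    "RefSeq": "refseq",
}


def xref_to_uri(identifier, datasource):
    """Return a Bio2RDF-style URI, or None when the DataSource is not mapped."""
    namespace = DATASOURCE_TO_NAMESPACE.get(datasource)
    if namespace is None:
        return None  # no mapping available, so no link can be generated
    # Drop a redundant prefix such as "GO:" so it is not repeated in the URI.
    prefix = namespace.upper() + ":"
    if identifier.upper().startswith(prefix):
        identifier = identifier[len(prefix):]
    return f"http://bio2rdf.org/{namespace}:{identifier}"


# The four rows from the table, plus one made-up row with an unmapped source.
xrefs = [
    ("00001", "Q6GZX4", "Uniprot"),
    ("00002", "ASIC1", "HGNC"),
    ("00003", "GO:0004396", "GO"),
    ("00004", "AL732629.6", "RefSeq"),
    ("00005", "X123", "SomeLocalSource"),
]

links = {key: xref_to_uri(identifier, source) for key, identifier, source in xrefs}
unlinked = [key for key, uri in links.items() if uri is None]

for key, uri in links.items():
    print(key, uri)
print("unlinked:", unlinked)
```

Rows whose DataSource has no known counterpart are left without a link rather than given a guessed URI.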

Linked Data Platform

SPARQL Query Editor

Faceted Browser (Virtuoso)

RelFinder for Relation Visualization

Application Programming Interface (Swagger.io, OpenAPI Specification)

MO-LD.org

SPARQL Support for Programmers

Get all reactions from KEGG that are associated with genes that are extrinsic components of the cell membrane

Federated Query
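The shape of such a federated query can be sketched by assembling the SPARQL text in Python. The `SERVICE` keyword is standard SPARQL 1.1 Federated Query; everything else here is an assumption made for illustration: the `ex:` vocabulary, its predicate names, the GO term URI, and the remote endpoint URL are placeholders, not the actual MO-LD or KEGG identifiers.

```python
def build_federated_query(go_term_uri, remote_endpoint):
    """Build a SPARQL 1.1 federated query string.

    The ex: predicates are hypothetical placeholders; a real query would use
    the vocabularies actually exposed by the two endpoints.
    """
    lines = [
        "PREFIX ex: <http://example.org/vocab/>",
        "SELECT DISTINCT ?gene ?reaction",
        "WHERE {",
        "  # Local part: genes annotated with the cellular-component term.",
        f"  ?gene ex:annotatedWith <{go_term_uri}> ;",
        "        ex:xref ?keggGene .",
        "  # Remote part: this pattern is evaluated at the other endpoint.",
        f"  SERVICE <{remote_endpoint}> {{",
        "    ?reaction ex:involvesGene ?keggGene .",
        "  }",
        "}",
    ]
    return "\n".join(lines)


# Placeholder URIs; substitute the real GO term and endpoint when adapting this.
query = build_federated_query(
    "http://example.org/go/EXTRINSIC_MEMBRANE_TERM",
    "http://example.org/kegg/sparql",
)
print(query)
```

The local triple patterns select the genes, and the shared `?keggGene` variable joins them to the reactions returned by the remote endpoint inside the `SERVICE` block.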

RelFinder - Find connections between 2 or more entities

Infrastructure Deployment and Reusability

Docker (container engine) to build and deploy the MOLD infrastructure

https://hub.docker.com/u/mold

Microservices architecture for reusability and extensibility: Web application, API and Virtuoso images

Cloud-Ready - tested on Amazon EC2

Tutorial: https://github.com/mo-ld/mold-dock

Only 5 commands to deploy a Linked-MOD!

Reflections

Not all data in MODs are available in the InterMine instance

Not all references are in the cross-references table, which limits Linked Data generation

Team interactions led to changes in the export process

The RDFizer covers only two table pairs of the core object model that InterMine offers as a template (CrossReference + DataSource and Ontology + OntologyTerm).

Support for mine-specific tables would also improve coverage of contents and links

Can we improve the quality of the representation by using community vocabularies (FALDO, CiTo, SIO)?

Can we offer high-performance query services (Triple Pattern Fragments/HDT)?

How can we persist data in other archives (Wikidata / schema.org+cse)?

Are curation priorities in line with what users want?

Can pan-species analyses tell us something about success in drug discovery?

Future Directions