Model Organism Linked Data (NIH Commons/MOD Interoperability supplement to SGD)
Michel Dumontier
Associate Professor of Medicine
Stanford University
Team
Michel Dumontier (Biomedical Informatics Research, Stanford)
Maxime Déraspe (U. Laval & Biomedical Informatics Research, Stanford)
Jacques Corbeil (U. Laval)
Mike Cherry (Department of Genetics, Stanford)
Kalpana Karra (Department of Genetics, Stanford)
Gail Binkley (Department of Genetics, Stanford)
Gos Micklem (Cambridge Systems Biology Centre, U. of Cambridge)
Julie Sullivan (Cambridge Systems Biology Centre, U. of Cambridge)
25+ endpoints available for different MODs:
YeastMine, WormMine, FlyMine, ZebrafishMine, MouseMine, ThaleMine, HumanMine
Access via (db query) API
Core data object model (commonly used) + Mine-specific customizations
-> Heterogeneity in the tables, fields, and terminologies used poses challenges for interoperability and pan-database queries
InterMine is a platform for Model Organism Data
Linked Data and Semantic Web technologies (RDF, SPARQL) are increasingly adopted in the bioinformatics data provider community:
DBCLS, EBI, NCBI, NLM, and many others
MODs, like many Omics databases, often rely on other people’s content
Linked Data can offer dereferenceable links to authoritative sources
Opportunity to improve MOD data interoperability through mapping of their Ontologies and Vocabularies
Towards increased interoperability with Semantic Web technologies
Model Organism Linked Data (MO-LD)
Effort to expose InterMine data as FAIR: Findable, Accessible, Interoperable, Reusable
Specific Aims:
1. To improve interoperability of MOD data by publishing Linked Data
2. To enable and demonstrate federated queries between MOD data and the network of Linked Data
3. To package our software and data for easier local and cloud-based deployment
Includes 6 MODs: YeastMine, FlyMine, ZebrafishMine, RatMine, MouseMine, HumanMine
Linked with 38 Bio2RDF datasets
RefSeq, PantherDB, GO, NCBI gene, HGNC, ENSEMBL, OMIM, …
InterMine-RDFizer script to reproduce with any InterMine instance
Web application to visualize, explore and query the Linked Datasets
Model Organism Linked Database (MO-LD)
RDFization of InterMine
Query InterMine API with Object Model
Convert the tabular results into triples (RDF)
Merge the resources with the same primary keys
Link Data with external datasets
Load the RDF data into a triple store
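The conversion and merge steps above can be sketched roughly as follows. This is a minimal illustration, not the InterMine-RDFizer itself: the namespace `http://example.org/mine/`, the function names, and the sample rows are all hypothetical, and the rows stand in for tabular results that would come back from an InterMine query. Rows sharing a primary key produce the same subject URI, so collecting the triples in a set merges them into one resource.

```python
# Hypothetical namespace for illustration only; not the one used by MO-LD.
BASE = "http://example.org/mine/"


def literal(value):
    """Serialize a value as an escaped N-Triples string literal."""
    text = str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{text}"'


def rows_to_triples(rows, class_name, key_field):
    """Convert tabular results (a list of dicts) into a set of (s, p, o) triples.

    Rows with the same primary key map to the same subject URI, so their
    triples end up attached to a single merged resource.
    """
    triples = set()
    for row in rows:
        subject = f"<{BASE}{class_name}:{row[key_field]}>"
        for field, value in row.items():
            if field == key_field or value is None:
                continue  # the key is encoded in the URI; skip empty cells
            predicate = f"<{BASE}vocabulary:{field}>"
            triples.add((subject, predicate, literal(value)))
    return triples


def to_ntriples(triples):
    """Render triples as N-Triples text, ready to load into a triple store."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in sorted(triples))


# Made-up sample rows: two rows describe the same gene and are merged.
sample_rows = [
    {"primaryIdentifier": "G1", "symbol": "abc1"},
    {"primaryIdentifier": "G1", "name": "example gene"},
    {"primaryIdentifier": "G2", "symbol": "xyz2"},
]
print(to_ntriples(rows_to_triples(sample_rows, "Gene", "primaryIdentifier")))
```

The resulting N-Triples text could then be bulk-loaded into a triple store such as Virtuoso.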
InterMine-LD
External linked datasets (38) with the 6 MODs
Linking MODs with LOD: incomplete linking
InterMine primary key    Identifier     DataSource
00001                    Q6GZX4         Uniprot
00002                    ASIC1          HGNC
00003                    GO:0004396     GO
00004                    AL732629.6     RefSeq
Cross References Table* (from InterMine)
* Also done with Ontology tables
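One way the cross-reference rows could be turned into outgoing links is sketched below. Bio2RDF URIs follow the pattern `http://bio2rdf.org/namespace:identifier`; the mapping from DataSource names to Bio2RDF namespaces, the use of `rdfs:seeAlso` as the link predicate, and the `http://example.org/mine/` subject namespace are assumptions made for illustration, not the project's confirmed choices. Rows whose DataSource has no known mapping are returned separately, which is one source of incomplete linking.

```python
BIO2RDF = "http://bio2rdf.org/"
MINE = "http://example.org/mine/"  # hypothetical subject namespace
SEE_ALSO = "<http://www.w3.org/2000/01/rdf-schema#seeAlso>"  # assumed link predicate

# Assumed mapping from InterMine DataSource names to Bio2RDF namespaces.
PREFIXES = {
    "Uniprot": "uniprot",
    "HGNC": "hgnc.symbol",
    "GO": "go",
    "RefSeq": "refseq",
}


def xref_uri(identifier, data_source):
    """Build a Bio2RDF-style URI, or return None if the source is unmapped."""
    prefix = PREFIXES.get(data_source)
    if prefix is None:
        return None
    # Identifiers such as "GO:0004396" already carry their prefix; drop it.
    head, sep, tail = identifier.partition(":")
    local = tail if sep and head.lower() == prefix else identifier
    return f"{BIO2RDF}{prefix}:{local}"


def link_triples(xrefs):
    """Split (primary_key, identifier, data_source) rows into links and misses."""
    linked, unresolved = [], []
    for primary_key, identifier, data_source in xrefs:
        target = xref_uri(identifier, data_source)
        if target is None:
            unresolved.append((primary_key, identifier, data_source))
        else:
            linked.append(f"<{MINE}{primary_key}> {SEE_ALSO} <{target}> .")
    return linked, unresolved


# Rows from the cross-references table shown above.
table = [
    ("00001", "Q6GZX4", "Uniprot"),
    ("00002", "ASIC1", "HGNC"),
    ("00003", "GO:0004396", "GO"),
    ("00004", "AL732629.6", "RefSeq"),
]
links, misses = link_triples(table)
print("\n".join(links))
```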
Linked Data Platform
SPARQL Query Editor
Faceted Browser (Virtuoso)
RelFinder for Relation Visualization
Application Programming Interface (Swagger.io, OpenAPI Specification)
MO-LD.org
SPARQL Support for Programmers
Get all reactions from KEGG that are associated with genes that are extrinsic components of the cell membrane
Federated Query
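A federated query of this kind relies on the SPARQL 1.1 `SERVICE` keyword, which sends part of the graph pattern to a remote endpoint. The sketch below only builds the query text; the `ex:` vocabulary, the predicate names, the endpoint URL, and the GO term URI are placeholders invented for illustration and do not reflect the actual MO-LD or KEGG schemas.

```python
def federated_query(remote_endpoint, go_term_uri):
    """Build a SPARQL 1.1 federated query as text.

    The local part selects genes annotated with the given GO term; the
    SERVICE block asks a remote endpoint for reactions involving those genes.
    All ex: predicates are hypothetical placeholders.
    """
    return f"""\
PREFIX ex: <http://example.org/vocab#>
SELECT DISTINCT ?gene ?reaction
WHERE {{
  ?gene ex:hasGoTerm <{go_term_uri}> ;
        ex:keggGene ?keggGene .
  SERVICE <{remote_endpoint}> {{
    ?reaction ex:involvesGene ?keggGene .
  }}
}}
"""


# Placeholder endpoint and term URI; substitute real ones before running.
query = federated_query(
    "http://example.org/kegg/sparql",
    "http://example.org/go/EXTRINSIC_COMPONENT_TERM",
)
print(query)
```

The generated text can be pasted into a SPARQL query editor or sent to an endpoint over HTTP by any SPARQL client.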
RelFinder - Find connections between 2 or more entities
Infrastructure Deployment and Reusability
Docker (container engine) to build and deploy the MOLD infrastructure
https://hub.docker.com/u/mold
Microservices architecture for reusability and extensibility: web application, API, and Virtuoso images
Cloud-Ready - tested on Amazon EC2
Tutorial: https://github.com/mo-ld/mold-dock
Only 5 commands to deploy a Linked-MOD!
Reflections
Not all data in MODs are available in the InterMine instance
Not all references are in the cross-references table, which limits Linked Data generation
Team interactions led to change in export process
RDFizer focuses on only two table pairs of the core object model offered as a template by InterMine (CrossReference + DataSource and Ontology + OntologyTerm).
Support for mine-specific tables would also improve coverage of contents and links
Can we improve the quality of the representation by using community vocabularies (FALDO, CiTo, SIO)?
Can we offer high-performance query services (Triple Pattern Fragments/HDT)?
How can we persist data in other archives (wikidata / schema.org+cse)?
Are curation priorities in line with what users want?
Can pan-species analyses tell us something about success in drug discovery?
Future Directions