Supported by the NIH grant 1U24 AI117966-01 to UCSD PI, Co-Investigators at:
Alejandra Gonzalez-Beltran, Susanna-Assunta Sansone, Philippe Rocca-Serra
Oxford e-Research Centre, University of Oxford, UK
Smart Descriptions & Smarter Vocabularies (SDSVoc)
30 November-1 December 2016, CWI Amsterdam Science Park
The DATS model: dataset descriptions for data discovery in DataMed
Just as JATS (Journal Article Tag Suite) is used by PubMed to index literature, DATS (DatA Tag Suite) is needed to index data sources in the DataMed prototype in a scalable way
http://datamed.org
NIH BD2K Data Discovery Index prototype
Biomedical and healthcare datasets
open community-driven development & documentation
http://biocaddie.org/workgroup-3-group-links
http://github.com/biocaddie/WG3-MetadataSpecifications
http://tiny.cc/datswebinar
- Enabling discoverability: find and access datasets available in multiple repositories
- Focusing on surfacing key metadata descriptors, such as
  - information and relations between datasets, creators, publication, funding sources, nature of biological signal and perturbation etc.
- Not the perfect model to represent the experimental details
  - the level of detail and metadata needed to ensure interoperability and reusability are left to the indexed databases
  - We have aimed to have maximum coverage of use cases with a minimal number of data elements and relations
  - Only very few properties are required
  - Follow Best Practices for Data on the Web
What is DATS' remit?
Metadata elements identified by combining the two complementary approaches:
- USE CASES: top-down approach
- SCHEMAS: bottom-up approach
The development process in a nutshell
Model serialized as JSON schemas and mapping to schema.org (v1.0, v1.1, v2.0, v2.1)
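As a rough illustration of what "serialized as JSON schemas" with "only very few properties required" can mean in practice, the sketch below checks a dataset description against a tiny schema-like structure. The schema fragment, its property names and the choice of `title` as the only required field are hypothetical, chosen for illustration; they are not taken from the DATS schemas, and the checker covers only a small subset of JSON Schema.

```python
import json

# Hypothetical, heavily simplified schema fragment. The property names and the
# single required field are illustrative assumptions, not the DATS schemas.
DATASET_SCHEMA = {
    "type": "object",
    "required": ["title"],
    "properties": {
        "title": {"type": "string"},
        "description": {"type": "string"},
        "creators": {"type": "array"},
    },
}

# Mapping from the schema type names used above to Python types.
JSON_TYPES = {"string": str, "array": list, "object": dict}


def validate(instance, schema):
    """Return a list of problems; the list is empty when the instance passes."""
    if not isinstance(instance, JSON_TYPES[schema["type"]]):
        return [f"expected {schema['type']}"]
    problems = []
    # Required properties must be present.
    for name in schema.get("required", []):
        if name not in instance:
            problems.append(f"missing required property: {name}")
    # Properties that are present must have the declared type.
    for name, rule in schema.get("properties", {}).items():
        if name in instance and not isinstance(instance[name], JSON_TYPES[rule["type"]]):
            problems.append(f"property {name} should be of type {rule['type']}")
    return problems


record = json.loads('{"title": "Example expression dataset", "creators": []}')
print(validate(record, DATASET_SCHEMA))
print(validate({"description": "no title given"}, DATASET_SCHEMA))
```

A real deployment would more likely use a full JSON Schema validator against the published schemas; the hand-written check here only keeps the example self-contained.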
Using an existing model?
Considered multiple models; mapped and analyzed these (bottom-up approach):
Generic Models
- schema.org
- DataCite
- RIF-CS (Registry Interchange Format – Collection and Services)
- W3C HCLS dataset descriptions (mapping of many models including Dublin Core, DCAT, PROV, VoID, VoID-ext)
- Project Open Metadata (used by HealthData.gov)
Life Science / Biomedical Models
- ISA (Investigation/Study/Assay)
- BioProject
- BioSample
- MiNIML
- PRIDE-ml
- MAGE-tab
- GA4GH metadata schema
- SRA xml
- CDISC SDM / element of BRIDGE model
Convergence of elements extracted from competency questions and existing (generic and biomedical) data models (incl. DataCite, DCAT, schema.org, HCLS dataset, RIF-CS, ISA-Tab, SRA-xml etc.)
DATS model for scalable indexing
[Figure: core entities and extended entities of the model, plus elements from other models (e.g. dataset/distribution/catalog from DCAT)]
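The dataset/distribution/catalog pattern borrowed from DCAT can be pictured as three nested record types: a catalog lists datasets, and each dataset has one or more distributions through which it can be accessed. The sketch below is a minimal reading of that pattern; the class and field names are illustrative assumptions and do not reproduce the DATS entity definitions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Distribution:
    """One accessible form of a dataset (assumed fields, for illustration)."""
    access_url: str
    media_format: str = ""  # e.g. "text/csv"; optional


@dataclass
class Dataset:
    title: str
    distributions: List[Distribution] = field(default_factory=list)


@dataclass
class Catalog:
    name: str
    datasets: List[Dataset] = field(default_factory=list)


def access_urls(catalog: Catalog) -> List[str]:
    """Collect every access URL reachable from a catalog, in listing order."""
    return [dist.access_url for ds in catalog.datasets for dist in ds.distributions]


# Hypothetical example records; the URLs are placeholders.
catalog = Catalog(
    name="Example repository",
    datasets=[
        Dataset(
            title="Example expression dataset",
            distributions=[
                Distribution("https://example.org/data/1.csv", "text/csv"),
                Distribution("https://example.org/data/1.json", "application/json"),
            ],
        ),
        Dataset(title="Dataset without distributions"),
    ],
)
print(access_urls(catalog))
```

Separating the dataset from its distributions lets an index describe a dataset once while pointing to each place or format in which it is available.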
Serializations and use of schema.org
- DATS model represented as JSON schemas, instances as:
  - JSON* format, and
  - JSON-LD** with vocabulary from schema.org
  - serializations in other formats and with other vocabularies can also be done, as / if needed
- Benefits for DataMed and databases indexed by DataMed
  - Increased visibility (by popular search engines), accessibility (via common query interfaces) and possibly improved ranking
- Use and extensions of schema.org
  - Submitted missing DATS elements to the schema.org tracker
  - Coordinating the extension of schema.org for life science via the bioschemas.org initiative (of which ELIXIR is also part)
* JavaScript Object Notation
** JavaScript Object Notation for Linked Data
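To make the difference between the plain JSON and the JSON-LD serialization concrete, the sketch below wraps a plain JSON record as a schema.org `Dataset` by adding `@context` and `@type` and renaming keys to schema.org property names. The input keys (`title`, `landingPage`) and the mapping table are hypothetical; the actual DATS-to-schema.org mapping is the one maintained by the project and is not reproduced here.

```python
import json

# Assumed mapping from hypothetical plain-JSON keys to schema.org property
# names (name, description and url are existing schema.org properties).
KEY_TO_SCHEMA_ORG = {
    "title": "name",
    "description": "description",
    "landingPage": "url",
}


def to_json_ld(record):
    """Wrap a plain JSON record as a schema.org Dataset expressed in JSON-LD."""
    doc = {"@context": "https://schema.org", "@type": "Dataset"}
    for key, value in record.items():
        # Keys without a known schema.org counterpart are left out of the
        # JSON-LD view rather than guessed at.
        if key in KEY_TO_SCHEMA_ORG:
            doc[KEY_TO_SCHEMA_ORG[key]] = value
    return doc


plain = {
    "title": "Example expression dataset",
    "landingPage": "https://example.org/datasets/1",
    "internalNote": "not mapped",
}
json_ld = to_json_ld(plain)
print(json.dumps(json_ld, indent=2))
```

Embedding such a JSON-LD block in a dataset landing page is what allows general-purpose search engines to pick up the description.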
Other adopters exporting DATS in their APIs
To evaluate DATS model capabilities
Work in progress: documentation and curation guidelines for adopters
Implementations and documentation