RDA Wheat Data Interoperability Cookbook and last developments

9th March 2015, San Diego



The WDI working group in brief

Endorsement: March 2014
Members: approximately 30 members and 15 active members (Wheat scientists, data and metadata technologists)
The goal: contribute to the improvement of Wheat related data interoperability by
● Building a common interoperability framework (metadata, data formats and vocabularies)
● Providing guidelines for describing, representing and linking Wheat related data


Initial plans

Deliverables
● A report of the survey of existing standards
● A cookbook intended for the Wheat data managers community, which provides them with guidelines on what data formats, metadata, vocabularies and ontologies they should use to describe, represent and link different types of Wheat data.
● A library of linked vocabularies and ontologies in machine readable formats with respect to the Linked Data standards.
● A prototype which showcases the gain of interoperability


Where we are


Data formats currently used (standardized, tool specific, non standardized) and recommendations, per data type

Data type SNPs
Data formats currently used VCF; BAM/SAM, BED, VARSCAN, VEP
Recommendations VCF files generated by using the survey sequences of IWGSC + metadata about VCF files to enrich the information about the SNPs.

Data type Genome annotations
Data formats currently used Genbank Flat File, General Feature Format (GFF), EMBL
Recommendations GFF3 + specifications with regard to the description of specific columns

Data type Germplasms
Data formats currently used MCPD, ABCD, Darwin Core, Darwin Core Germplasm; GRIN-Global; tabulated
Recommendations MCPD

Data type Gene expression
Data formats currently used Many format standards laid out by repositories such as NCBI (GEO) and EBI ArrayExpress
Recommendations Existing format standards laid out by the repositories such as NCBI (GEO) and EBI ArrayExpress + ENA

Data type Physical maps
Data formats currently used GFF; Cmap, fpc
Recommendations GFF3

Data type Genetic maps
Data formats currently used Cmap, gnpmap
Recommendations GFF3 (to be confirmed)

Data type Phenotypes
Data formats currently used Drops, ped, ISA-Tab, ephesis; tabulated
Recommendations ISA-Tab
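As an illustration of what "specifications with regard to the description of specific columns" can build on, here is a minimal Python sketch that splits one GFF3 feature line into its nine standard tab-separated columns and decodes the attributes column. The example line (sequence name, source, gene ID and name) is hypothetical and is not taken from any IWGSC annotation.

```python
from urllib.parse import unquote

# The nine tab-separated columns defined by the GFF3 format.
GFF3_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    # Coordinates are 1-based integers.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Column 9 holds semicolon-separated tag=value pairs; values may be
    # comma-separated lists and are URL-encoded.
    attributes = {}
    if record["attributes"] != ".":
        for pair in record["attributes"].split(";"):
            if not pair:
                continue
            key, _, value = pair.partition("=")
            attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    record["attributes"] = attributes
    return record


# Hypothetical feature line, for illustration only.
EXAMPLE_LINE = (
    "chr1A\texample_source\tgene\t1000\t2500\t.\t+\t.\t"
    "ID=gene0001;Name=hypothetical%20gene"
)

record = parse_gff3_line(EXAMPLE_LINE)
print(record["type"], record["start"], record["end"], record["attributes"]["Name"])
```

A community guideline for specific columns could then be checked as simple rules over such records, for example requiring that the type column uses an agreed term or that an ID attribute is always present.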


Examples of use cases

Title Searching for germplasm with specific traits

Description Example of searching for germplasm with specific traits - tagged with ontology terms?

Data types Germplasm, Phenotype

Challenges
● Metadata very important (standardized format)
● Association of genes to traits, linked to germplasm, marker information
● Need for quality controls: how confident are you of the data source?
● Provenance of the germplasm: pedigree, ownership
● Standard system for tracking germplasm, names
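The use case above can be pictured with a small Python sketch: germplasm records carry a set of trait term identifiers and a source, and a search returns the accessions tagged with all requested terms. The record layout, accession names, sources and term identifiers (`TRAIT:0001` and so on) are hypothetical placeholders, not identifiers from a real ontology or genebank.

```python
from dataclasses import dataclass, field


@dataclass
class Germplasm:
    accession: str  # hypothetical accession name
    source: str     # data provider, kept for provenance checks
    trait_terms: set = field(default_factory=set)  # ontology term IDs


def find_by_traits(records, required_terms):
    """Return the records tagged with every term in required_terms."""
    required = set(required_terms)
    return [r for r in records if required <= r.trait_terms]


# Hypothetical records, for illustration only.
RECORDS = [
    Germplasm("ACC-001", "genebank-A", {"TRAIT:0001", "TRAIT:0002"}),
    Germplasm("ACC-002", "genebank-B", {"TRAIT:0002"}),
    Germplasm("ACC-003", "genebank-A", {"TRAIT:0001", "TRAIT:0002", "TRAIT:0003"}),
]

matches = find_by_traits(RECORDS, ["TRAIT:0001", "TRAIT:0002"])
for m in matches:
    print(m.accession, m.source)
```

Matching on term identifiers rather than free-text trait names is what makes the search independent of how each provider labels a trait; the source field is kept so that a user can judge how much confidence to place in each hit.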

Title Identification of wheat genes that control root growth

Description Requires: Annotated genes (Gene Ontology, PFam, and other functional annotation)

Data types Genomic annotations? Gene location? (IWGS-SS ID or MIPS HCS link)

Challenges Mapping between wheat genes and orthologs from other species (deducing function by sequence similarity); access to RNASeq data (genes that are not expressed in roots may be irrelevant); mapping of wheat genes to information on their function based on the literature

Title Query on trial data associated with varieties

Data types Phenotypic data, GIS data, (wheat economy/production data)

Description To search wheat varieties with distribution maps, production figures, performances in wheat mega environments, associated projects worldwide plus layers of climatic data on specific wheat production areas and disease prevention information.

Challenges Phenotypic data should be linked to GIS data. Using keywords or ontology terms a system or a tool should be able to pull out such information from different websites/systems developed by wheat community.


Wheat related ontologies & vocabularies survey

Assess the level of visibility and interoperability of Wheat related vocabularies and ontologies
● Is the vocabulary/ontology updated regularly?
● What license and/or copyright is used?
● Is the vocabulary/ontology part of any ontology communities or listing services?
● Is the vocabulary/ontology used or implemented in any database/repository?
● Does the vocabulary/ontology interlink and/or map to other vocabularies and ontologies?
● Does the vocabulary/ontology

Identify the domain covered by the ontologies and vocabularies
Refine the cookbook
Collect more interoperability use cases
Collect some technical details


Wheat related ontologies & vocabularies survey


The Wheat related BioPortal allows one to search for terms across multiple ontologies, browse mappings between terms in different ontologies, receive recommendations on which ontologies are most relevant for a corpus, and annotate text with terms from ontologies.


Next steps

● Metadata (harmonization, minimal metadata sets)
● Mappings
● Next workshop (summer 2015)
● Review and complete the recommendations
● Refine and complete the guidelines and the best practices
● Finalize the repository of Wheat related vocabularies
● Prototyping: a semantic knowledge base
  ● Integrate data from different data sources
  ● Provide smart search capabilities that leverage the vocabularies used against the metadata.
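One simple way to read "smart search capabilities that leverage the vocabularies used against the metadata" is query expansion: a search term is expanded to all labels a vocabulary lists for the same concept before it is matched against dataset keywords. The Python sketch below assumes a toy vocabulary and toy dataset metadata; the labels, synonyms and dataset identifiers are illustrative and do not come from any real ontology or from the planned prototype.

```python
def expand_query(term, vocabulary):
    """Return every label the vocabulary groups with `term` (lower-cased)."""
    needle = term.strip().lower()
    for preferred, synonyms in vocabulary.items():
        labels = {preferred.lower()} | {s.lower() for s in synonyms}
        if needle in labels:
            return labels
    # Unknown terms are searched as typed.
    return {needle}


def search(datasets, term, vocabulary):
    """Return ids of datasets whose keywords match the expanded query."""
    labels = expand_query(term, vocabulary)
    return [
        d["id"] for d in datasets
        if labels & {k.lower() for k in d["keywords"]}
    ]


# Hypothetical vocabulary: preferred label -> synonyms.
VOCABULARY = {"plant height": ["stem length", "height"]}

# Hypothetical metadata records from different data sources.
DATASETS = [
    {"id": "dataset-A", "keywords": ["Plant height", "field trial"]},
    {"id": "dataset-B", "keywords": ["stem length"]},
    {"id": "dataset-C", "keywords": ["grain yield"]},
]

hits = search(DATASETS, "stem length", VOCABULARY)
print(hits)
```

Here a query for one label also retrieves datasets annotated with another label of the same concept, which plain keyword matching would miss.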


Thank you!