Data Sharing & the NIH Data Catalog
Contributing to the Big Data to Knowledge Initiative at the NIH
Data Sharing and the NIH Data Catalog
Big Data to Knowledge (BD2K)
Big Data 2 Knowledge
Data Catalog Frameworks and Standards
Policies and Data Sharing
Data Sharing Repositories
All NIH-funded data sharing repositories that are open to receiving data submissions from any researcher internationally - whether they are funded by the NIH or not
Data Sharing Policies
All data sharing policies that exist within the NIH that assist researchers in developing a plan to share their research data
NIH Data Catalog
Bringing Data Into the Research Ecosystem
Each dataset will be identified by a Data Unique Identifier (DUID), both in the NIH Data Catalog and in the associated journal
Datasets specified in catalog using MeSH (creation of a dataset Publication Type)
Datasets are discoverable
NIH Data Catalog produces citable data publications
Citability + proper credit = incentives to submit and publish data
Datasets are citable
Data citations linked between and across the NIH Data Catalog with their associated scientific publication in PubMed/PubMed Central
Datasets are linked to the literature
Analysis of trends, impact of data, effect on NIH research funding
Datasets become information in the research ecosystem
Common Metadata Elements
How do current data repositories describe their data?
NIH Data Sharing Repositories
Identifying Metadata Commonalities
Common Metadata Elements
Authorship
Data Description
Date Information
Building a Taxonomy of Metadata Descriptors
• Authorship
  o Attribution
  o Authors
  o Creator(s)
  o Data Authors
  o Data Owner
  o Data Attribution
  o Contributor(s)
  o PI Name(s)
  o Investigator(s)
  o Sequence Authors
  o Responsible Party
  o Data Provider
  o Submitter
• Title information
  o Name Title
  o Collection Type
  o Type of Deposit
  o Service Name
  o Image File Name
  o File Name
  o Data Collection Title
  o Dataset Title
  o Dataset Name and Accession
  o Submission Title
  o Lab Data Title
  o Research Objective
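The taxonomy above can be read as a lookup from repository-specific labels to a shared element. The following is a minimal sketch of that idea, not the catalog's actual implementation: the names `VARIANT_TO_COMMON`, `to_common_element`, and `group_by_common_element` are hypothetical, and only a handful of the variant labels listed above are included.

```python
from typing import Dict, List, Optional

# Hypothetical lookup table: a few of the variant labels listed above,
# lower-cased, mapped to the common element they were grouped under.
VARIANT_TO_COMMON: Dict[str, str] = {
    "attribution": "Authorship",
    "authors": "Authorship",
    "creator(s)": "Authorship",
    "pi name(s)": "Authorship",
    "investigator(s)": "Authorship",
    "submitter": "Authorship",
    "dataset title": "Title information",
    "file name": "Title information",
    "submission title": "Title information",
}


def to_common_element(label: str) -> Optional[str]:
    """Return the common element for a repository label, or None if unknown."""
    return VARIANT_TO_COMMON.get(label.strip().lower())


def group_by_common_element(record: Dict[str, str]) -> Dict[str, List[str]]:
    """Group a repository record's values under common element names.

    Labels that are not in the lookup table are collected under "Unmapped"
    so they can be reviewed by hand.
    """
    grouped: Dict[str, List[str]] = {}
    for label, value in record.items():
        common = to_common_element(label)
        key = common if common is not None else "Unmapped"
        grouped.setdefault(key, []).append(value)
    return grouped
```

Keeping unmapped labels visible, rather than dropping them, matches the exploratory goal of finding out how repositories currently describe their data.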
Common Metadata Elements
Mapping Metadata to Existing Standards
Mapping to DataCite
• DataCite Metadata Schema
  o Identifier
  o Creator
  o Title
  o Publisher
  o PublicationYear
  o Subject
  o Contributor
  o Date
  o Resource Type
  o RelatedIdentifier
  o Rights
  o Description
  o Size, Format, Version
• Common Metadata Elements
  o Data Unique Identifier
  o Authorship
  o Data Title
  o Data Location
  o Data Completion/Release Date
  o Data Descriptors (controlled vocabulary)
  o Data Submitter/Affiliation
  o Date Information
  o Data File Types
  o Related Resources
  o Access Data Restrictions
  o Data Description (narrative)
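One way to make the DataCite mapping concrete is a field-to-field table. The sketch below pairs the two lists above in the order they appear, which is an assumption: the slide does not state the pairing explicitly. The names `DATACITE_TO_COMMON` and `translate` are hypothetical, and DataCite's Size, Format, and Version are left unmapped because the list above shows no counterpart for them.

```python
from typing import Dict, Tuple

# Assumed pairing: the nth DataCite field is matched with the nth common
# element, following the order of the two lists above.
DATACITE_TO_COMMON: Dict[str, str] = {
    "Identifier": "Data Unique Identifier",
    "Creator": "Authorship",
    "Title": "Data Title",
    "Publisher": "Data Location",
    "PublicationYear": "Data Completion/Release Date",
    "Subject": "Data Descriptors (controlled vocabulary)",
    "Contributor": "Data Submitter/Affiliation",
    "Date": "Date Information",
    "Resource Type": "Data File Types",
    "RelatedIdentifier": "Related Resources",
    "Rights": "Access Data Restrictions",
    "Description": "Data Description (narrative)",
}


def translate(datacite_record: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split a DataCite-style record into (common elements, unmapped fields)."""
    common: Dict[str, str] = {}
    unmapped: Dict[str, str] = {}
    for field_name, value in datacite_record.items():
        target = DATACITE_TO_COMMON.get(field_name)
        if target is None:
            unmapped[field_name] = value
        else:
            common[target] = value
    return common, unmapped
```

For example, `translate({"Title": "Dental Caries", "Size": "2 GB"})` would place the title under Data Title and return Size in the unmapped part.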
Mapping to Dryad
• Dryad Metadata Schema
  o dcterms:identifier/Data Package Identifier
  o dcterms:creator/Author
  o dcterms:title/Data Package Title
  o dcterms:relation/Location of related content outside of Dryad
  o dcterms:available/Date Available
  o dcterms:description
  o dcterms:subject/Keyword
  o dwc:scientificName
  o dcterms:references/Associated Dryad publication record ID
• Common Metadata Elements
  o Data Unique Identifier
  o Authorship
  o Data Title
  o Data Location
  o Data Completion/Release Date
  o Data Descriptors (controlled vocabulary)
  o Data Submitter/Affiliation
  o Date Information
  o Data File Types
  o Related Resources
  o Access Data Restrictions
  o Data Description (narrative)
Mapping to MEDLINE
Common Metadata Element: Proposed Definition
Data Unique Identifier: A unique ID string that identifies a dataset within the catalog
Author: Individuals involved in producing or contributing to the data
Affiliation: Affiliation of each author, associated with the appropriate author occurrence
Data Title: Name or title by which the dataset is known
Data Location: The name of the entity that holds, archives, publishes, distributes, releases, issues, or produces the data, with its associated accession number
Date: The year, month, and day when the data was made available
Data Description (structured narrative): Structured narrative description for efficient indexing
Data Descriptors: Metadata describing data contents using controlled labels (e.g., Organism, Disease, Perturbation, Gender, Cell type)
PMIDs: Identifiers that link the dataset to its associated article(s)
Availability/Accessibility of Data: Whether the data is available to use and how to access it
Award Number: Grant/award numbers associated with the dataset
Version: The version of the dataset (represented as a unique record)
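The proposed definitions above could be captured as a simple record type. This is only an illustrative sketch: the class and field names (`CatalogRecord`, `Author`, `missing_fields`) are hypothetical, and the choice of which elements count as required is an assumption, since the slides do not say.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Author:
    name: str
    affiliation: str = ""  # affiliation is tied to each author occurrence


@dataclass
class CatalogRecord:
    duid: str                    # Data Unique Identifier
    authors: List[Author]
    title: str                   # Data Title
    location: str                # holding entity with its accession number
    date: str                    # when the data was made available
    description: str = ""        # structured narrative
    descriptors: Dict[str, str] = field(default_factory=dict)  # e.g. {"Organism": ...}
    pmids: List[str] = field(default_factory=list)
    availability: str = ""
    award_numbers: List[str] = field(default_factory=list)
    version: str = ""

    def missing_fields(self) -> List[str]:
        """List the (assumed) required elements that are empty."""
        required = {
            "duid": self.duid,
            "authors": self.authors,
            "title": self.title,
            "location": self.location,
            "date": self.date,
        }
        return [name for name, value in required.items() if not value]


# Values taken from the example citation in this presentation.
record = CatalogRecord(
    duid="DUID00001",
    authors=[Author("Marazita ML")],
    title="Dental Caries: Whole Genome Association and Gene x Environment Studies",
    location="dbGaP/pht002543.v2.p1",
    date="2014 Jan",
    pmids=["22123456"],
)
```

Calling `record.missing_fields()` on an incomplete record gives a quick check before the record is submitted to the catalog.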
Data Citation - ICMJE
Marazita ML, Weyant RJ, Feingold E, Weeks D, Crout R, McNeill D. Dental Caries: Whole Genome Association and Gene x Environment Studies. NIH Data Catalog. 2014 Jan;1(1):DUID00001. PubMed PMID: 22123456.
SI: dbGaP/pht002543.v2.p1
Components highlighted in the citation:
• Author
• Data Title
• Data Location
• Date data is submitted and paper is ready to publish
• NIH Data Catalog Volume (Issue)
• Data Unique Identifier
• PMID Assigned to NIH Data Catalog Record
• Secondary source ID (Link to actual dataset)
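To show how the citation is assembled from its components, here is a small sketch that rebuilds the example string from individual fields. The function name `format_data_citation` and its parameters are hypothetical; the punctuation simply follows the example citation in this presentation rather than any published formatting rule.

```python
from typing import List


def format_data_citation(
    authors: List[str],
    title: str,
    year: str,
    month: str,
    volume: int,
    issue: int,
    duid: str,
    pmid: str,
    secondary_id: str,
) -> str:
    """Build an ICMJE-style data citation in the layout of the example."""
    author_part = ", ".join(authors)
    first_line = (
        f"{author_part}. {title}. NIH Data Catalog. "
        f"{year} {month};{volume}({issue}):{duid}. PubMed PMID: {pmid}."
    )
    # The secondary source ID (link to the actual dataset) goes on its own line.
    return f"{first_line}\nSI: {secondary_id}"


citation = format_data_citation(
    authors=["Marazita ML", "Weyant RJ", "Feingold E", "Weeks D", "Crout R", "McNeill D"],
    title="Dental Caries: Whole Genome Association and Gene x Environment Studies",
    year="2014",
    month="Jan",
    volume=1,
    issue=1,
    duid="DUID00001",
    pmid="22123456",
    secondary_id="dbGaP/pht002543.v2.p1",
)
print(citation)
```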
NIH Data Catalog Issues and Concerns
What are we missing?
How many NIH datasets actually exist?
How many unique NIH datasets are NOT represented in existing data repositories?
Could these datasets be represented as a data publication instead of in a repository?
If the datasets are already housed somewhere – do we need a one stop shop?
Is a NIH Data Catalog the best solution?
Next Steps
• Find out how many datasets are currently in NIH data sharing repositories
  o How many datasets do these repositories process per year?
• How many datasets are unique and NOT housed in a repository?
  o Search PubMed and PubMed Central and assign categories
• MeSH
  o PT: Electronic Supplementary material
  o SH: Statistical and numerical data
  o MeSH: Databases, Factual
  o Statistical Analysis – exclude datasets that already have a location
• How do we manage these unique datasets?
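The PubMed search described in the steps above could start from a query string built out of the three categories. The sketch below is an assumption-laden illustration: `build_pubmed_query` is a hypothetical helper, combining the clauses with OR is a guess at the intended search logic, and the terms are copied as written on the slide, so their exact indexed wording should be checked against MeSH before use. The `[pt]`, `[sh]`, and `[mh]` suffixes are PubMed's search field tags for publication type, MeSH subheading, and MeSH term.

```python
from typing import List


def build_pubmed_query(
    publication_types: List[str],
    subheadings: List[str],
    mesh_terms: List[str],
) -> str:
    """Combine terms into one PubMed query string using field tags."""
    clauses: List[str] = []
    clauses += [f'"{term}"[pt]' for term in publication_types]  # publication type
    clauses += [f'"{term}"[sh]' for term in subheadings]        # MeSH subheading
    clauses += [f'"{term}"[mh]' for term in mesh_terms]         # MeSH term
    # Assumption: any one of the categories is enough to flag a candidate.
    return " OR ".join(clauses)


query = build_pubmed_query(
    publication_types=["Electronic Supplementary material"],
    subheadings=["Statistical and numerical data"],
    mesh_terms=["Databases, Factual"],
)
print(query)
```

The resulting string can be pasted into the PubMed search box; records that already name a repository location would then be excluded in the later statistical analysis step.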
Questions?
Thank you.