NIH Data Catalog - Updated Results
Building an NIH Data Catalog: Bit by Bit
Kevin Read
Board of Regents
September 11, 2013
Searching for NIH-funded ‘Orphaned’ data sets in PubMed and PubMed Central
NIH-funded articles for 2011:
[Funnel diagram: successive exclusions (Non-PMC Articles, Non-research Articles, Molecular Sequence Data MH, SI Field, PMC Acknowledgements, XML) leading to the remaining articles with orphaned data sets. Counts shown: 113,089; 88,592; 78,901; 75,441; 71,913; 71,680; 69,857.]
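The funnel applies a sequence of exclusion steps and reports how many articles remain after each one. A minimal Python sketch of that pattern follows. The record fields (`in_pmc`, `is_research`, `mesh_terms`, `si_field`) and the toy records are hypothetical, chosen only to illustrate the idea; they are not the project's actual schema or method, and only four of the steps are shown.

```python
from typing import Any, Callable, Dict, List, Tuple

Article = Dict[str, Any]

# Each step pairs a label from the slide with a predicate that returns True
# when an article should be EXCLUDED. Field names are hypothetical.
EXCLUSION_STEPS: List[Tuple[str, Callable[[Article], bool]]] = [
    ("Non-PMC Articles", lambda a: not a["in_pmc"]),
    ("Non-research Articles", lambda a: not a["is_research"]),
    ("Molecular Sequence Data MH",
     lambda a: "Molecular Sequence Data" in a["mesh_terms"]),
    ("SI Field", lambda a: bool(a["si_field"])),
]


def run_funnel(articles, steps=EXCLUSION_STEPS):
    """Apply each exclusion in order; record the count remaining after each."""
    remaining = list(articles)
    counts = [("Start", len(remaining))]
    for label, is_excluded in steps:
        remaining = [a for a in remaining if not is_excluded(a)]
        counts.append((label, len(remaining)))
    return remaining, counts


# Toy records, invented purely for illustration.
toy_articles = [
    {"pmid": "1", "in_pmc": True, "is_research": True,
     "mesh_terms": [], "si_field": []},
    {"pmid": "2", "in_pmc": False, "is_research": True,
     "mesh_terms": [], "si_field": []},
    {"pmid": "3", "in_pmc": True, "is_research": False,
     "mesh_terms": [], "si_field": []},
    {"pmid": "4", "in_pmc": True, "is_research": True,
     "mesh_terms": ["Molecular Sequence Data"], "si_field": []},
    {"pmid": "5", "in_pmc": True, "is_research": True,
     "mesh_terms": [], "si_field": ["GenBank"]},
]

remaining, counts = run_funnel(toy_articles)
for label, n in counts:
    print(f"{label}: {n}")
```

Keeping the steps in an ordered list makes it easy to add the later PMC Acknowledgements and XML steps as further predicates.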
SI Field Exclusions
[Bar chart: excluded articles by repository named in the SI field: ClinicalTrials.gov, PDB, GEO, GenBank, PubChem, RefSeq, ISRCTN, OMIM. Axis: Excluded Articles, 0 to 1,600.]
PMC Acknowledgements Exclusions
[Bar chart: excluded keywords by repository: PDB, ClinicalTrials.gov, GenBank, GEO, IRD, MGI, DIP, FlyBase, dbGaP, SRA, WormBase, MPD, NURSA, RGD, ICPSR, VectorBase. Axis: Excluded keywords, 0 to 800.]
XML Keyword Exclusions
[Bar chart: excluded keywords by repository: GenBank, PDB, GEO, dbSNP, ClinicalTrials.gov, RGD, FlyBase, SRA, DIP, dbGaP, WormBase, MGI, BioGRID, VectorBase, Multiple Keywords. Axis: Excluded keywords, 0 to 600.]
FlyBase:GeneNetwork:Mouse Genome Informatics:Neuroscience Information Framework:Rat Genome Database:WormBase:Zebrafish Model Organism Database
GenBank:PDB
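The keyword exclusions amount to checking an article's acknowledgements or XML text for the names of known repositories. A minimal Python sketch of such a check follows. The keyword list is taken from the XML Keyword Exclusions slide; the function name, the case-sensitive whole-word matching rule, and the sample sentence are assumptions made for illustration, not the method actually used in the project.

```python
import re

# Repository names as listed on the XML Keyword Exclusions slide.
REPOSITORY_KEYWORDS = [
    "GenBank", "PDB", "GEO", "dbSNP", "ClinicalTrials.gov", "RGD",
    "FlyBase", "SRA", "DIP", "dbGaP", "WormBase", "MGI", "BioGRID",
    "VectorBase",
]


def find_repository_keywords(text, keywords=REPOSITORY_KEYWORDS):
    """Return the keywords that occur in text as whole words.

    Matching is case-sensitive (an assumption) so that short acronyms such
    as GEO or DIP do not match ordinary lowercase words.
    """
    found = []
    for keyword in keywords:
        # Lookarounds require that no word character touches the keyword.
        pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
        if re.search(pattern, text):
            found.append(keyword)
    return found


# Invented sample sentence, for illustration only.
sample = "Sequences were deposited in GenBank and structures in the PDB."
hits = find_repository_keywords(sample)

# Articles matching several keywords can be labelled by joining the hits
# with colons, mirroring the multiple-keyword labels on the slide.
label = ":".join(hits)
print(label)
```

Case-sensitive matching trades recall for precision: a variant spelling such as "Flybase" would be missed unless it is added to the list.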
NIH-sponsored data repositories have now been added to the PubMed and PMC search indexes
383
What category of data set was used for the research described in the article?
Were live human or animal subjects used in the collection of the data?
What were the subject(s) of study (from which or whom the data was collected)?
If new data set(s) were created, what type(s) of data were collected?
What existing data set(s), if any, were used?
How many data sets are there in each article?
Measuring blood pressure in mice
Measuring left hemisphere of brain for growth factor
Staining and imaging
Analysis of images using software
Phase One Results
Average number of data sets per article:
2.92
% of data sets that use live subjects
54%
Human
51%
Animal
49%
% of new data
87%
% of data created using pre-existing data sets
13%
Data Types
Image
Genetic or Genomic
Chemical
Biochemical
Electrical (Electrophysiological)
Optical – non-image
Behavioral
Computational Simulation or model
Magnetic Resonance – non-image
Structural
Physiological
Questionnaire/Survey
Clinical Measures
Geospatial
INSUFFICIENT
Inter-rater Reliability:
[Bar chart: total number of datasets found per 25 articles, comparing Total # of datasets (High) with Total # of datasets (Low). Axis: 0 to 800.]
Total
43%
How do we define a data set?
DATA SET
Where in the collection/processing pipeline should data be described?
How do we assign data types to NIH-funded data sets?
What data should be shared in an NIH Data Catalog?
Data sets that can be repurposed
Data sets that make an article easier to understand
Acknowledgements
Project Sponsors: Jerry Sheehan & Mike Huerta
Special Thanks: Lou Knecht & Jim Mork
Annotators: Preeti Kochar, Helen Ochej, Susan Schmidt, Melissa Yorks, Shari Mohary, Olga Printseva, Janice Ward, Oleg Rodionov, Sally Davidson, Jennie Larkin, Peter Lyster, Matt McAuliffe, Greg Farber, Betsy Humphreys, Jerry Sheehan, Mike Huerta, Lou Knecht, Suzy Roy, Swapna Abhyankar, Olivier Bodenreider, Karen Gutzman, Dina Demner Fusman, Laritza Rodriguez, Sonya Shooshan, Samantha Tate, Matthew Simpson, Tracy Edinger, Olubumi Akiwumi, Mary Ann Hantakas, Corinn Sinnott
Support: Kathel Dunn & David Gillikin
Library Operations: Joyce Backus & Dianne Babski
NLM Leadership: Donald Lindberg & Betsy Humphreys
All images are CC
Minimal Metadata
Common Metadata Elements: Proposed Definition
Data Unique Identifier: A unique ID string that identifies a data set within the catalog
Author: Individuals involved in producing or contributing to the data
Affiliation: Affiliation of each author, associated with the appropriate author occurrence
Data Title: Name or title by which the data set is known
Data Location: The name of the entity that holds, archives, publishes, distributes, releases, issues, or produces the data, with its associated accession number
Date: The year, month, and day when the data was made available
Data Description (structured narrative): Structured narrative description for efficient indexing
Data Descriptors: Metadata describing data contents using controlled labels (e.g. Organism, Disease, Perturbation, Gender, Cell type)
PMID: Identifier that will link the data set to its associated article(s) and be provided for the data catalog entry
Availability/Accessibility of Data: Indication of whether the data is available to use and how to access it
Award Number: Grant/award numbers associated with the data set
Related Data: Data that was used in the creation of the new data set
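The proposed elements can be read as the fields of a catalog record. A minimal Python sketch of such a record follows. The class names, field types, the choice of which fields are required, and the `missing_elements` helper are all assumptions made for illustration; every value in the example is a placeholder, not real catalog data.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List


@dataclass
class Author:
    name: str
    affiliation: str  # affiliation tied to this author occurrence


@dataclass
class CatalogRecord:
    data_unique_identifier: str   # unique ID within the catalog
    data_title: str
    data_location: str            # holding entity plus accession number
    date: str                     # date made available, e.g. "YYYY-MM-DD"
    data_description: str         # structured narrative
    availability: str             # whether and how the data can be accessed
    authors: List[Author] = field(default_factory=list)
    data_descriptors: Dict[str, str] = field(default_factory=dict)
    pmids: List[str] = field(default_factory=list)
    award_numbers: List[str] = field(default_factory=list)
    related_data: List[str] = field(default_factory=list)

    def missing_elements(self) -> List[str]:
        """Return the names of elements that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Placeholder values only.
record = CatalogRecord(
    data_unique_identifier="EXAMPLE-0001",
    data_title="Placeholder data set title",
    data_location="Example Repository, accession PLACEHOLDER",
    date="2013-09-11",
    data_description="Placeholder structured narrative.",
    availability="Placeholder access statement",
    authors=[Author("Placeholder Author", "Placeholder Institution")],
    data_descriptors={"Organism": "Mus musculus"},
)

print(record.missing_elements())
```

Pairing each author with an affiliation in one small class follows the definition that an affiliation is associated with a specific author occurrence.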