The PDB An Exemplar for Data Science To Date, But What About the Future?
Philip E. Bourne, Ph.D., Associate Director for Data Science
National Institutes of Health
Background
6/12 2/14 3/14
Findings:
• Sharing data & software through catalogs
• Support methods and applications development
• Need more training
• Hire CSIO
• Continued support throughout the lifecycle
http://acd.od.nih.gov/diwg.htm
Motivation for This Talk
Source Michael Bell http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=830
More Motivation
The way we fund and operate biomedical databases will not scale.
How do we keep the best features of today's resources but also respond to shrinking budgets and changes in the way we do science?
Let's address this question using the PDB as an example.
Disclaimer: This is NOT a talk about the PDB per se. It is a talk about data resources in general, using the PDB as an example because we are all familiar with it and most stakeholders consider it an exemplar.
Good News: We Trust the PDB
PDB
Trust in the data is perhaps the PDB’s biggest achievement
Good News: Trust
Trust is like compound interest
Comes from listening
Comes from engaging the community in every aspect of the process
Comes from data consistency and level of annotation
Comes from responsiveness
Comes from the quality of the delivery service
Good News/Bad News Re Data Quality
Good News:
– If done right in the beginning, 25% of the PDB’s budget could have been saved
– Ontologies can work
– Automation has reduced cost even as the amount of data has increased
– Reproducibility is improved
Bad News:
– Complex ontologies slow adoption
– All data are created equal
– Annotation is limited
Good News/Bad News Re Community
Good News:
– The community is engaged
– The community has driven data sharing
Bad News:
– The community does not reduce costs through active participation
– There is insufficient reward for being part of the community, e.g. as an annotator
How we do science is changing. Do data resources, including the PDB, best serve the needs of the user at this point?
How is Science Changing?
More interdisciplinary
More translational
More access to diverse data types
More computational
More collaborative
Good News/Bad News for the PDB in this Changing Landscape
Bad News:
– Interface complex and uni-data oriented
– Data accessible; methods accessible (sort of); but not together
– Significant redundancy in services offered
Good News:
– Annotation!
– Demand is increasing
– Integrated with other data types
– Restful services
General Problem Statement:
How do we ensure a high-quality annotated data source that provides the optimal environment for accessibility and analysis by a broad community of diverse users?
Okay, so what can the funders do to address a situation where the PDB is currently a best-case scenario?
1. Encourage more understanding of how existing data are used
[Figure: Structure Summary page activity for H1N1 influenza-related structures, Jan. 2008 to Jul. 2010. Credit: Andreas Prlic]
1RUZ: 1918 H1 Hemagglutinin
3B7E: Neuraminidase of A/Brevig Mission/1/1918 H1N1 strain in complex with zanamivir
* http://www.cdc.gov/h1n1flu/estimates/April_March_13.htm
We Need to Learn from Industries Whose Livelihood Addresses the Question of Use
2. Address the Issue that Scholarship is Broken
I have a paper with 17,500 citations that no one has ever read
I have papers in PLOS ONE that have more citations than ones in PNAS
I have data sets I am proud of but few places to put them
I edited a journal but it did not count for much
3. Address the Reward System
4. Enable Reproducibility
Much of the research life cycle is now digital - encourage the reliability, accessibility, findability, usability of data, methods, narrative, publications etc.
How? Data sharing plans
Standards frameworks
Data and software catalogs
PubMedCentral
• The Commons – PMC for the complete lifecycle
• Machine-readable data sharing plans
• Small funding to communities
• Support for training and best practices in eScholarship
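To make the idea of a machine-readable data sharing plan concrete, here is a minimal sketch in Python. The talk does not define a schema, so every field name below (award_id, data_types, repository, release_after_months, license) and the example values are assumptions chosen only for illustration. The sketch shows a plan being serialized to JSON and checked automatically for required fields.

```python
import json
from dataclasses import dataclass, asdict
from typing import List

# Hypothetical schema: these field names are illustrative assumptions,
# not an NIH or PDB standard.
@dataclass
class DataSharingPlan:
    award_id: str
    data_types: List[str]
    repository: str
    release_after_months: int
    license: str = "CC0-1.0"

REQUIRED_FIELDS = ["award_id", "data_types", "repository", "release_after_months"]

def to_json(plan: DataSharingPlan) -> str:
    """Serialize a plan so that software, not only people, can read it."""
    return json.dumps(asdict(plan), sort_keys=True)

def missing_fields(record: dict) -> List[str]:
    """Return the required fields that are absent or empty in a parsed plan."""
    return [
        key for key in REQUIRED_FIELDS
        if key not in record or record[key] in ("", None, [])
    ]

# Example usage with made-up values.
plan = DataSharingPlan(
    award_id="EXAMPLE-0001",
    data_types=["structure coordinates"],
    repository="example-repository",
    release_after_months=12,
)
record = json.loads(to_json(plan))
```

Once a plan is structured like this, a funder or catalog could in principle check completeness or compliance with a script rather than by reading free text.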
5. Establish The Commons
Public/private partnership
Work with ICs, NCBI and CIT to identify and run pilots – cloud, HPC centers
Port dbGaP to the cloud
• Experiment with new funding strategies
Evaluate
Sustainability and Sharing: The Commons
Data
The Long Tail
Core Facilities/HS Centers
Clinical /Patient
The Why: Data Sharing Plans
The Commons
Government
The How:
Data Discovery Index
Sustainable Storage
Quality
Scientific Discovery
Usability
Security/Privacy
Commons == Extramural NCBI == Research Object Sandbox == Collaborative Environment
The End Game:
Knowledge
NIH Awardees
Private Sector
Metrics/Standards
Rest of Academia
Software Standards Index
BD2K Centers
Cloud, Research Objects, Business Models
What Does the Commons Enable?
Dropbox-like storage
The opportunity to apply quality metrics
Bring compute to the data
A place to collaborate
A place to discover
http://100plus.com/wp-content/uploads/Data-Commons-3-1024x825.png
The PDB in the Commons
Components:
– Annotated collection of data files
– APIs to access these data files
– Example methods using these APIs
Potential outcomes:
– Nothing happens?
– A new breed of developer starts to use PDB data in new ways?
– The casual user has a broader set of services than previously?
– Quality declines?
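The three components listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the PDB's actual API: the Entry class and the functions get_entry, search, and count_by_annotation are hypothetical names, and the in-memory list stands in for a collection of data files. The two entry IDs and titles come from the H1N1 example earlier in the talk; the annotation keys and values are simply read off those titles.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Entry:
    entry_id: str
    title: str
    annotations: Dict[str, str]

# Component 1: an annotated collection of data files (in-memory stand-in).
COLLECTION: List[Entry] = [
    Entry("1RUZ", "1918 H1 Hemagglutinin",
          {"protein": "hemagglutinin"}),
    Entry("3B7E",
          "Neuraminidase of A/Brevig Mission/1/1918 H1N1 strain "
          "in complex with zanamivir",
          {"protein": "neuraminidase", "ligand": "zanamivir"}),
]

# Component 2: a small access API (hypothetical function names).
def get_entry(entry_id: str) -> Optional[Entry]:
    """Look up one entry by its identifier, or return None if absent."""
    for entry in COLLECTION:
        if entry.entry_id == entry_id:
            return entry
    return None

def search(key: str, value: str) -> List[Entry]:
    """Return all entries whose annotation `key` equals `value`."""
    return [e for e in COLLECTION if e.annotations.get(key) == value]

# Component 3: an example method built only on the API and annotations.
def count_by_annotation(key: str) -> Counter:
    """Count entries per value of one annotation key."""
    return Counter(
        e.annotations[key] for e in COLLECTION if key in e.annotations
    )
```

The point of the separation is that a new developer could write a method like count_by_annotation without knowing how the collection is stored, which is the kind of reuse the Commons is meant to encourage.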
Some Acknowledgements
Eric Green & Mark Guyer (NHGRI)
Jennie Larkin (NHLBI)
Leigh Finnegan (NHGRI)
Vivien Bonazzi (NHGRI)
Michelle Dunn (NCI)
Mike Huerta (NLM)
David Lipman (NLM)
Jim Ostell (NLM)
Andrea Norris (CIT)
Peter Lyster (NIGMS)
All of the more than 100 folks on the BD2K team
NIH…Turning Discovery Into Health