Reproducibility: A Funder and Data Science Perspective
Philip E. Bourne, PhD, FACMI
University of Virginia
Thanks to Valerie Florence, NIH for some slides
http://www.slideshare.net/pebourne
NetSci Preworkshop 2017
June 19, 2017
6/19/17 1
Who Am I Representing And What Is My Bias?
• I am presenting my views, not necessarily those of NIH
• Now leading an institutional data science initiative
• Total data parasite
• Unnatural interest in scholarly communication
• Co-founded and founding EIC of PLOS Computational Biology – OA advocate
• Prior co-Director, Protein Data Bank
• Amateur student researcher in scholarly communication
Reproducibility is the responsibility of all stakeholders….
Let's start with researchers …
Reproducibility - Examples From My Own Work
It took several months to replicate this work
… And recently …
Phew…
http://www.sdsc.edu/pb/kinases/
Beyond its value to me (and even then the emphasis is not enough), there is too little incentive to make my work reproducible by others …
Tools Fix This Problem Right?
• Extracted all PMC papers with associated Jupyter notebooks available
• Approx. 100
• Took a random sample of 25
• Only 1 ran out of the box
• Several ran with minor modification
• Others lacked libraries, sufficient details to run etc.
It takes more than tools … it takes incentives …
Daniel Mietchen 2017 Personal Communication
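One of the failure modes above – notebooks that lack the libraries they need – can at least be triaged mechanically before trying to execute anything. A minimal, stdlib-only sketch; the notebook fragment below is hand-made illustrative data, not one of the PMC notebooks from the sample:

```python
import ast
import importlib.util

def missing_imports(notebook_json):
    """Return top-level modules imported in a notebook's code cells
    that are not installed in the current environment."""
    missing = set()
    for cell in notebook_json.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = "".join(cell.get("source", []))
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue  # e.g. cells containing IPython magics
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                names = [node.module]
            else:
                continue
            for name in names:
                top = name.split(".")[0]  # e.g. "scipy.stats" -> "scipy"
                if importlib.util.find_spec(top) is None:
                    missing.add(top)
    return sorted(missing)

# Hand-made notebook fragment (illustrative data only)
nb = {"cells": [
    {"cell_type": "code", "source": ["import json\n", "import not_a_real_pkg\n"]},
    {"cell_type": "markdown", "source": ["# Analysis\n"]},
]}
print(missing_imports(nb))  # the stdlib import resolves; the fake package does not
```

A static check like this only catches missing dependencies; as the sample above shows, notebooks also fail for reasons no such check finds – insufficient detail, hard-coded paths, stale data – so execution remains the real test.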
Funders and publishers are the major levers ..
What are funders doing?
Consider the NIH …..
NIH Special Focus Area
https://www.nih.gov/research-training/rigor-reproducibility
Outcomes – General …
Enhancing Reproducibility through Rigor and Transparency NOT-OD-15-103
• Clarifies NIH expectations in 4 areas:
• Scientific premise – describe strengths and weaknesses of prior research
• Rigorous experimental design – how to achieve robust and unbiased outcomes
• Consideration of sex and other relevant biological variables
• Authentication of key biological and/or chemical resources, e.g., cell lines
Outcomes – network based …
Experiment in Moving from Pipes to Platforms
Sangeet Paul Choudary, https://www.slideshare.net/sanguit
Commons & the FAIR Principles
• The Commons is a virtual platform physically located predominantly on public clouds
• Digital assets (objects) within that system are data, software, narrative, course materials etc.
• Assets are FAIR – Findable, Accessible, Interoperable and Reusable
Bonazzi and Bourne 2017, PLoS Biol 15(4): e2001818
FAIR principles: https://www.nature.com/articles/sdata201618
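FAIR can be made operational as a checklist over a digital object's metadata. A minimal sketch in Python – the field names backing each facet and the example record are illustrative assumptions, not an NIH or Commons schema:

```python
# Which metadata fields back each FAIR facet (illustrative choices, not a standard)
REQUIRED = {
    "findable": ["identifier", "title", "keywords"],    # persistent ID + rich metadata
    "accessible": ["access_url", "access_protocol"],    # retrievable via a standard protocol
    "interoperable": ["format", "metadata_standard"],   # shared formats and vocabularies
    "reusable": ["license", "provenance"],              # clear terms and origin
}

def fair_report(record):
    """Report which FAIR facets a metadata record satisfies."""
    return {facet: all(record.get(field) for field in fields)
            for facet, fields in REQUIRED.items()}

record = {  # hypothetical digital object
    "identifier": "doi:10.0000/example",
    "title": "Kinase structure data set",
    "keywords": ["kinase", "structure"],
    "access_url": "https://example.org/data",
    "access_protocol": "https",
    "format": "csv",
    "metadata_standard": "DataCite",
    "license": "CC-BY-4.0",
    "provenance": "derived from PDB entries",
}
print(fair_report(record))  # all four facets satisfied
```

The point of such a check is that FAIRness becomes testable per object, rather than an aspiration asserted for the platform as a whole.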
Just announced … https://commonfund.nih.gov/sites/default/files/RM-17-026_CommonsPilotPhase.pdf
Current Data Commons Pilots
• Commons Platform Pilots – explore feasibility of the Commons Platform; facilitate collaboration and interoperability
• Cloud Credit Model – provide access to the cloud via credits to populate the Commons; connect credits to NIH grants
• Reference Data Sets – make large and/or high-value NIH-funded data sets and tools accessible in the cloud
• Resource Search & Index – develop data & software indexing methods, leveraging BD2K efforts (bioCADDIE et al.) and collaborating with external groups
Commons - Platform Stack
https://datascience.nih.gov/commons
• Compute Platform: Cloud or HPC
• Services: APIs, Containers, Indexing
• Software: Services & Tools – scientific analysis tools/workflows
• Data: “Reference” data sets, user-defined data
• App store/User Interface
• Digital Object Compliance (spans all layers)
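The platform stack can also be written down as plain data. The layer names come from the slide; the top-down ordering and the representation itself are my own illustrative assumptions:

```python
# Commons platform stack, top (user-facing) to bottom (infrastructure)
STACK = [
    ("App store/User Interface", []),
    ("Software: Services & Tools", ["scientific analysis tools/workflows"]),
    ("Services", ["APIs", "Containers", "Indexing"]),
    ("Data", ['"Reference" data sets', "user-defined data"]),
    ("Compute Platform", ["Cloud", "HPC"]),
]
CROSS_CUTTING = ["Digital Object Compliance"]  # applies to every layer

for layer, parts in STACK:
    detail = f" ({', '.join(parts)})" if parts else ""
    print(layer + detail)
print("Cross-cutting:", ", ".join(CROSS_CUTTING))
```

Making the layering explicit is what lets individual BD2K activities (next slide) be mapped to a specific layer rather than to "the Commons" as a whole.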
Mapping BD2K Activities to the Commons Platform
BD2K activities mapped onto the platform stack include:
• NIH + community-defined data sets; possible FOAs and CCM
• BD2K Centers, MODS, HMP & Interoperability Supplements
• Cloud credits model (CCM)
• BioCADDIE/other indexing
• NCI & NIAID Cloud Pilots
(The stack layers – Compute Platform, Services, Software, Data, App store/User Interface, Digital Object Compliance – are as on the previous slide.)
https://datascience.nih.gov/commons
Overarching Questions
• Is the Commons a step towards improved reproducibility?
• Is the Commons approach at odds with other approaches? If not, how best to coordinate?
• Do the pilots enable a full evaluation for a larger scale implementation?
• How best to evaluate the success of the pilots?
Other Questions
• Is a mix of cloud vendors appropriate?
• How to balance the overall metrics of success?
• Reproducibility
• Cost saving
• Efficiency – centralized data vs distributed
• New science
• User satisfaction
• Data integration and reuse – how to measure?
• Data security
• What are the weaknesses?
• Thank You
Acknowledgements
• Vivien Bonazzi, Jennie Larkin, Michelle Dunn, Mark Guyer, Allen Dearry, Sonynka Ngosso, Tonya Scott, Lisa Dunneback, Vivek Navale (CIT/ADDS)
• NLM/NCBI: Patricia Brennan, Mike Huerta, George Komatsoulis
• NHGRI: Eric Green, Valentina di Francesco
• NIGMS: Jon Lorsch, Susan Gregurick, Peter Lyster
• CIT: Andrea Norris
• NIH Common Fund: Jim Anderson, Betsy Wilder, Leslie Derr
• NCI Cloud Pilots/ GDC: Warren Kibbe, Tony Kerlavage, Tanja Davidsen
• Commons Reference Data Set Working Group: Weiniu Gan (HL), Ajay Pillai (HG), Elaine Ayres (BITRIS), Sean Davis (NCI), Vinay Pai (NIBIB), Maria Giovanni (AI), Leslie Derr (CF), Claire Schulkey (AI)
• RIWG Core Team: Ron Margolis (DK), Ian Fore (NCI), Alison Yao (AI), Claire Schulkey (AI), Eric Choi (AI)
• OSP: Dina Paltoo