Reproducibility: A Funder and Data Science Perspective


Transcript of Reproducibility: A Funder and Data Science Perspective

Page 1:

Reproducibility: A Funder and Data Science Perspective

Philip E. Bourne, PhD, FACMI

University of Virginia

Thanks to Valerie Florence, NIH for some slides

http://www.slideshare.net/pebourne

[email protected]

NetSci Preworkshop 2017

June 19, 2017


Page 2:

Who Am I Representing And What Is My Bias?

• I am presenting my views, not necessarily those of NIH
• Now leading an institutional data science initiative
• Total data parasite
• Unnatural interest in scholarly communication
  • Co-founded, and founding EIC of, PLOS Computational Biology – OA advocate
  • Prior co-Director, Protein Data Bank
  • Amateur student researcher in scholarly communication

Page 3:

Reproducibility is the responsibility of all stakeholders…

Page 4:

Page 5:

Let's start with researchers…

Page 6:

Reproducibility – Examples From My Own Work

It took several months to replicate this work

… And recently …

Phew…

http://www.sdsc.edu/pb/kinases/

Page 7:

Beyond value to myself (and even then the emphasis is not enough), there is too little incentive to make my work reproducible by others…

Page 8:

Tools Fix This Problem, Right?

• Extracted all PMC papers with associated Jupyter notebooks available
• Approx. 100
• Took a random sample of 25
• Only 1 ran out of the box
• Several ran with minor modification
• Others lacked libraries, sufficient details to run, etc.

It takes more than tools… it takes incentives…

Daniel Mietchen 2017, Personal Communication
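The check described on this slide can be sketched in a few lines. This is an illustrative reconstruction, not Mietchen's actual script: the `pmc_notebooks` directory name and the use of `jupyter nbconvert --execute` as the "runs out of the box" criterion are assumptions.

```python
# Sketch: sample published notebooks and test whether each executes
# top-to-bottom, unmodified, in a fresh environment.
import random
import subprocess

def pick_sample(notebooks, k=25, seed=0):
    """Draw a reproducible random sample of k notebook paths."""
    rng = random.Random(seed)
    return rng.sample(list(notebooks), k)

def runs_out_of_the_box(path, timeout=600):
    """True only if `jupyter nbconvert --execute` exits cleanly."""
    try:
        proc = subprocess.run(
            ["jupyter", "nbconvert", "--to", "notebook", "--execute",
             "--output", "executed.ipynb", str(path)],
            capture_output=True, timeout=timeout,
        )
        return proc.returncode == 0
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # Hung kernels and missing tooling both count as failures.
        return False

# Usage (assuming notebooks were already downloaded into ./pmc_notebooks):
#   from pathlib import Path
#   sample = pick_sample(Path("pmc_notebooks").glob("*.ipynb"))
#   ok = sum(runs_out_of_the_box(nb) for nb in sample)
```

Even this crude pass/fail test surfaces the failure modes the slide lists: missing libraries, missing data, and insufficient detail to run.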

Page 9:

Funders and publishers are the major levers…

What are funders doing?

Consider the NIH…

Page 10:

Page 11:

NIH Special Focus Area

https://www.nih.gov/research-training/rigor-reproducibility

Page 12:

Outcomes – General…

Page 13:

Enhancing Reproducibility through Rigor and Transparency (NOT-OD-15-103)

• Clarifies NIH expectations in 4 areas:
  • Scientific premise – describe strengths and weaknesses of prior research
  • Rigorous experimental design – how to achieve robust and unbiased outcomes
  • Consideration of sex and other relevant biological variables
  • Authentication of key biological and/or chemical resources, e.g., cell lines

Page 14:

Outcomes – network based…

Page 15:

Experiment in Moving from Pipes to Platforms

Sangeet Paul Choudary https://www.slideshare.net/sanguit

Page 16:

Commons & the FAIR Principles

• The Commons is a virtual platform physically located predominantly on public clouds
• Digital assets (objects) within that system are data, software, narrative, course materials, etc.
• Assets are FAIR – Findable, Accessible, Interoperable and Reusable

https://www.workitdaily.com/job-search-solution/

Bonazzi and Bourne 2017, PLoS Biol 15(4): e2001818

FAIR principles: https://www.nature.com/articles/sdata201618
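To make "FAIR" concrete, here is a minimal sketch of what a FAIR-style metadata record for one digital asset might look like. The field names are illustrative assumptions (loosely DataCite/DCAT-flavored), not an actual NIH Commons schema, and the DOI and URLs are placeholders.

```python
# Illustrative only: a minimal metadata record covering the four FAIR facets.
import json

def fair_record(doi, title, access_url, media_type, license_url, keywords):
    """Build a metadata dict with one field group per FAIR facet."""
    return {
        # Findable: a globally unique, persistent identifier plus rich metadata
        "identifier": doi,
        "title": title,
        "keywords": keywords,
        # Accessible: retrievable by its identifier over a standard protocol
        "access_url": access_url,
        # Interoperable: a formal, broadly used representation format
        "media_type": media_type,
        # Reusable: a clear usage license (provenance would also belong here)
        "license": license_url,
    }

record = fair_record(
    doi="10.5281/zenodo.0000000",  # placeholder DOI
    title="Example reference data set",
    access_url="https://example.org/data.csv",
    media_type="text/csv",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    keywords=["reproducibility", "commons"],
)
print(json.dumps(record, indent=2))
```

The point is not the particular fields but that each FAIR facet maps to machine-actionable metadata that travels with the asset.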

Page 17:

Just announced…
https://commonfund.nih.gov/sites/default/files/RM-17-026_CommonsPilotPhase.pdf

Page 18:

Current Data Commons Pilots

Commons Platform Pilots
• Explore feasibility of the Commons Platform
• Facilitate collaboration and interoperability

Cloud Credit Model
• Provide access to cloud via credits to populate the Commons
• Connecting credits to NIH grants

Reference Data Sets
• Making large and/or high-value NIH-funded data sets and tools accessible in the cloud

Resource Search & Index
• Developing data & software indexing methods
• Leveraging BD2K efforts, bioCADDIE et al.
• Collaborating with external groups

Page 19:

Commons – Platform Stack

https://datascience.nih.gov/commons

Stack layers:
• Compute Platform: Cloud or HPC
• Services: APIs, Containers, Indexing
• Software: Services & Tools – scientific analysis tools/workflows
• Data: "Reference" data sets; user-defined data
• App store/User Interface

Digital Object Compliance spans all layers of the stack.

Page 20: Reproducibility: A Funder and Data Science Perspective

NIH + Community defined data sets

possible FOAs and CCM

BD2K Centers, MODS, HMP & InteroperabilitySupplements

Cloud credits model (CCM)

BioCADDIE/OtherIndexing

NCI & NIAID Cloud Pilots

Compute Platform: Cloud or HPC

Services: APIs, Containers, Indexing,

Software: Services & Tools

scientific analysis tools/workflows

Data“Reference” Data Sets

User defined data

Digital O

bject C

om

plian

ce

App store/User Interface

Mapping BD2K Activities to the Commons Platform

https://datascience.nih.gov/commons

6/19/17 20

Page 21:

Overarching Questions

• Is the Commons a step towards improved reproducibility?
• Is the Commons approach at odds with other approaches, and if not, how best to coordinate?
• Do the pilots enable a full evaluation for a larger-scale implementation?
• How best to evaluate the success of the pilots?

Page 22:

Other Questions

• Is a mix of cloud vendors appropriate?
• How to balance the overall metrics of success?
  • Reproducibility
  • Cost saving
  • Efficiency – centralized data vs. distributed
  • New science
  • User satisfaction
• Data integration and reuse – how to measure?
• Data security
• What are the weaknesses?

Page 23:

Thank You

Page 24:

Acknowledgements

• Vivien Bonazzi, Jennie Larkin, Michelle Dunn, Mark Guyer, Allen Dearry, Sonynka Ngosso, Tonya Scott, Lisa Dunneback, Vivek Navale (CIT/ADDS)
• NLM/NCBI: Patricia Brennan, Mike Huerta, George Komatsoulis
• NHGRI: Eric Green, Valentina di Francesco
• NIGMS: Jon Lorsch, Susan Gregurick, Peter Lyster
• CIT: Andrea Norris
• NIH Common Fund: Jim Anderson, Betsy Wilder, Leslie Derr
• NCI Cloud Pilots/GDC: Warren Kibbe, Tony Kerlavage, Tanja Davidsen
• Commons Reference Data Set Working Group: Weiniu Gan (HL), Ajay Pillai (HG), Elaine Ayres (BITRIS), Sean Davis (NCI), Vinay Pai (NIBIB), Maria Giovanni (AI), Leslie Derr (CF), Claire Schulkey (AI)
• RIWG Core Team: Ron Margolis (DK), Ian Fore (NCI), Alison Yao (AI), Claire Schulkey (AI), Eric Choi (AI)
• OSP: Dina Paltoo