Scientific Reproducibility and Software - Stanford University
Scientific Reproducibility and Software
Victoria Stodden
Information Society Project @ Yale Law School
Institute for Computational Engineering and Sciences
The University of Texas at Austin
February 8, 2010
Agenda
1. Error Control: Verification, Validation, & Error Quantification (V, V, & EQ)
2. Hypothesis: Increased reproducibility is needed to comply with the scientific method
3. Survey: Barriers to Open Code/Data
4. Untangling Intellectual Property Issues
5. Further barriers: software lock-in and idea evolution (Gaussian); as science moves from analog to digital, software needs tracking (survey)
Controlling Error is Central to Scientific Progress
“The scientific method’s central motivation is the ubiquity of error - the awareness that mistakes and self-delusion can creep in absolutely anywhere and that the scientist’s effort is primarily expended in recognizing and rooting out error.” David Donoho et al. (2009)
Software Verification
• Accuracy with which a computational model delivers the results of the underlying mathematical model:
– Solution verification: does the discretization error approach zero as we approach the continuous model?
– Code verification: test suites; problems with known solutions, known rates of convergence.
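As a minimal sketch of code verification in this spirit (my illustration, not from the talk): run a numerical routine on a problem with a known solution and check that the observed convergence rate matches the expected one. The function, step sizes, and tolerance here are illustrative assumptions.

```python
import math

def deriv_central(f, x, h):
    # Central finite-difference approximation to f'(x); truncation error is O(h^2).
    return (f(x + h) - f(x - h)) / (2 * h)

# Known solution: d/dx sin(x) = cos(x); use it to verify the convergence rate.
x = 1.0
exact = math.cos(x)
errors = [abs(deriv_central(math.sin, x, h) - exact) for h in (0.1, 0.05, 0.025)]

# Observed order of convergence, log2(error(h) / error(h/2)), should approach 2.
orders = [math.log2(errors[i] / errors[i + 1]) for i in range(len(errors) - 1)]
```

A test suite would assert that each halving of h reduces the error by roughly a factor of four, flagging a coding error whenever the observed order drifts from the theoretical one.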
Model Validation
• Accuracy of a computational model with respect to observed data:
– Model error: misspecification
– Observation error, measurement error
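A toy sketch of validation (illustrative numbers only, not from the talk): compare model predictions against observed data, which carry observation error, using a simple aggregate metric such as RMSE.

```python
import math

def rmse(predicted, observed):
    # Root-mean-square error: one basic measure of model error against observations.
    assert len(predicted) == len(observed)
    return math.sqrt(
        sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)
    )

# Hypothetical model output vs. measured values (includes measurement error).
model = [1.0, 2.0, 3.0, 4.0]
measured = [1.1, 1.9, 3.2, 3.8]
error = rmse(model, measured)
```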
Uncertainty Quantification
• Numerical accuracy of estimates,
• Sensitivity of given results to boundary conditions,
• Model calibration metrics,
• Parameter estimation, confidence intervals.
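One common device for such confidence intervals is the percentile bootstrap; this sketch uses made-up data and standard-library tools only, as an illustration rather than any method from the talk.

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.mean, n_boot=2000, alpha=0.05, seed=0):
    # Percentile-bootstrap confidence interval for a statistic (illustrative sketch).
    rng = random.Random(seed)
    reps = sorted(
        stat([rng.choice(sample) for _ in sample]) for _ in range(n_boot)
    )
    return reps[int((alpha / 2) * n_boot)], reps[int((1 - alpha / 2) * n_boot)]

# Hypothetical measurements; the interval quantifies uncertainty in their mean.
data = [2.1, 1.9, 2.4, 2.0, 2.2, 1.8, 2.3, 2.1, 2.0, 2.2]
low, high = bootstrap_ci(data)
```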
Controlling Error is Central to the Scientific Method
In stochastic modeling “the possibility of erroneous decisions cannot be eliminated, and the best one can do is to seek methods of making decisions that, in a sense, minimize the risk of mistakes.” Jerzy Neyman, “Statistics - Servant of All Sciences,” Science, 1955, p. 401
Hypothesis
• Verification, Validation, & Error Quantification are necessary but not sufficient for the practice of the scientific method.
• Error control is not insular, but also involves community vetting.
Reproducibility
• Computational science: the researcher works with code or data in generating published results.
• Reproducibility: the ability of others to recreate and verify computational results, given appropriate software and computing resources.
Reproducibility: Hypothesis
• Facts are established through social acceptance (Latour), through independent replication and open inspection.
• Reproducibility of results is essential for computational science to conform with the scientific method.
Science is Changing
• A transformation of the scientific enterprise through massive computation, in scale, scope, and pervasiveness, is currently underway.
• JASA June 1996: 9 of 20 articles computational
• JASA June 2006: 33 of 35 articles computational
Example: Community Climate System Model (CCSM)
• Collaborative system simulation
• Code available by permission
• Data output files by permission
Example: High Energy Physics
• 4 LHC experiments at CERN: 15 petabytes produced annually
• Data shared through grid to mobilize computing power
• Director of CERN (Heuer): “Ten or 20 years ago we might have been able to repeat an experiment. They were simpler, cheaper and on a smaller scale. Today that is not the case. So if we need to re-evaluate the data we collect to test a new theory, or adjust it to a new development, we are going to have to be able to reuse it. That means we are going to need to save it as open data.” Computer Weekly, August 6, 2008
Example: Astrophysics Simulation Collaboratory
• Data and code sharing within community
• Interface for dynamic simulation
• Mid-1930s: calculate the motion of cosmic rays in Earth’s magnetic field.
Example: Proofs
• Mathematical proof via simulation, not deduction
• Breakdown point: 1/sqrt(2 log p)
• A valid proof?
• A contribution to the field of mathematics?
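For a sense of scale, the quoted threshold can be evaluated directly (my illustration, not the talk’s code): 1/sqrt(2 log p) shrinks only slowly as the dimension p grows.

```python
import math

def breakdown_point(p):
    # The slide's threshold 1/sqrt(2 log p); defined for p > 1.
    return 1.0 / math.sqrt(2.0 * math.log(p))

# The threshold decays slowly: even a millionfold increase in p
# only roughly halves it.
thresholds = {p: breakdown_point(p) for p in (10, 10**3, 10**6)}
```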
The Third Branch of the Scientific Method
• Branch 1: Deductive/Theory: e.g. mathematics; logic
• Branch 2: Inductive/Empirical: e.g. the machinery of hypothesis testing; statistical analysis of controlled experiments
• Branch 3: Large-scale extrapolation and prediction: knowledge from computation, or tools for the established branches?
Contention About 3rd Branch
• Anderson: The End of Theory (Wired, June 2008)
• Hillis rebuttal: we are looking for patterns first, then create hypotheses, as we always have (The Edge, June 2008)
• Weinstein: simulation underlies the existing branches:
1. Tools to build intuition (branch 1)
2. Hypotheses to test (branch 2)
Emerging Credibility Crisis in Computational Science
• Typical scientific communication doesn’t include code, data, or test suites.
• Much published computational science is near impossible to replicate.
• Accession to the 3rd branch of the scientific method involves the production of routinely verifiable knowledge.
Potential Solution: Really Reproducible Research
Pioneered by Jon Claerbout:
“An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.” (quote from David Donoho, “Wavelab and Reproducible Research,” 1995)
Barriers to Sharing: Survey
Hypotheses:
1. Scientists are primarily motivated by personal gain or loss.
2. Scientists are primarily worried about being scooped.
Survey of Computational Scientists
• Subfield: Machine Learning
• Sample: American academics registered at a top Machine Learning conference (NIPS).
• Respondents: 134 responses from 638 requests.
Reported Sharing Habits
• 81% claim to reveal some code and 84% claim to reveal some data.
• Visual inspection of their websites: 30% had some code posted, 20% had some data posted.
Top Reasons Not to Share

| Reason | Code | Data |
|---|---|---|
| Time to document and clean up | 77% | 54% |
| Not receiving attribution | 44% | 42% |
| Possibility of patents | 40% | - |
| Legal barriers (i.e. copyright) | 34% | 41% |
| Time to verify release with admin | - | 38% |
| Potential loss of future publications | 30% | 35% |
| Dealing with questions from users | 52% | 34% |
| Competitors may get an advantage | 30% | 33% |
| Web/Disk space limitations | 20% | 29% |
For example..
Top Reasons to Share

| Reason | Code | Data |
|---|---|---|
| Encourage scientific advancement | 91% | 81% |
| Encourage sharing in others | 90% | 79% |
| Be a good community member | 86% | 79% |
| Set a standard for the field | 82% | 76% |
| Improve the caliber of research | 85% | 74% |
| Get others to work on the problem | 81% | 79% |
| Increase in publicity | 85% | 73% |
| Opportunity for feedback | 78% | 71% |
| Finding collaborators | 71% | 71% |
Have you been scooped?
| Idea Theft | Count | Proportion |
|---|---|---|
| At least one publication scooped | 53 | 0.51 |
| 2 or more scooped | 31 | 0.30 |
| No ideas stolen | 50 | 0.49 |
Preliminary Findings
• Surprise: motivated to share by communitarian ideals.
• Not surprising: reasons for not revealing reflect private incentives.
• Surprise: scientists not that worried about being scooped.
• Surprise: scientists quite worried about IP issues.
Legal Barriers to Reproducibility
• Original expression of ideas falls under copyright by default (written expression, code, figures, tables, …)
• Copyright creates exclusive rights vested in the author to:
– reproduce the work
– prepare derivative works based upon the original
– Exceptions and limitations: Fair Use, academic purposes
Creative Commons
• Founded by Larry Lessig to make it easier for artists to share and use creative works
• A suite of licenses that allows the author to determine terms of use attached to works
Creative Commons Licenses
• A notice posted by the author removing the default rights conferred by copyright and adding a selection of:
• BY: if you use the work, attribution must be provided,
• NC: work cannot be used for commercial purposes,
• ND: derivative works not permitted,
• SA: derivative works must carry the same license as the original work.
License Logos
Open Source Software Licensing
• Creative Commons follows the licensing approach used for open source software, but adapted for creative works
• Code licenses:
– BSD license: attribution
– GNU GPL: attribution and share alike
– Hundreds of software licenses…
Apply to Scientific Work?
• Remove copyright’s block to fully reproducible research
• Attach a license with an attribution component to all elements of the research compendium (including code, data), encouraging full release.
Solution: Reproducible Research Standard
Reproducible Research Standard
Realignment of the legal framework with scientific norms:
• Release media components (text, figures) under CC BY.
• Release code components under Modified BSD or similar.
• Both licenses free the scientific work of copying and reuse restrictions and have an attribution component.
“ShareAlike” Inappropriate
• “ShareAlike”: a licensing provision that requires identical licensing of downstream libraries,
• Issue 1: control of independent scientists’ work,
• Issue 2: incompatibility of differing licenses with this provision.
• GPL not suitable for scientific code.
Releasing Data?
• Raw facts not copyrightable.
• Original “selection and arrangement” of these facts is copyrightable. (Feist Publ’ns Inc. v. Rural Tel. Serv. Co., 499 U.S. 340 (1991))
Benefits of RRS
• Focus becomes release of the entire research compendium
• Hook for funders, journals, universities
• Standardization avoids license incompatibilities
• Clarity of rights (beyond Fair Use)
• IP framework supports scientific norms
• Facilitation of research, thus citation, discovery…
Reproducibility is an Open Problem
(and scale matters)
• Simple case: open data and small scripts. Suits the simple definition.
• Hard case: inscrutable code, organic programming.
• Harder case: massive computing platforms, streaming data.
• Can we have reproducibility in the hard cases?
Solutions for Harder Cases
• Tools for reproducibility:
– Standardized testbeds
– Open code for continuous data processing, flags for “continuous verifiability”
– Standards and platforms for data sharing
– Provenance and workflow tracking tools (Mesirov)
• Tools for attribution:
– Generalized contribution tracking
– Legal attribution/license tracking and search (RDFa)
Case Study: mloss.org
• Machine Learning Open Source Software
• Active code repository
• Code release at least as important as data release
• Open question: software support
Case Study: DANSE
• Neutron scattering
• Make new data available
• Unify software for analysis
Case Study: Wolfram|Alpha
• Obscure code - testbeds for verifiability
• Dataset construction methods opaque
• (Claims copyright over outputs)
Openness and Taleb’s Criticism
• Open Access movement removes the notion of a scientific community
Real and Potential Wrinkles
• Reproducibility is neither necessary nor sufficient for correctness, but essential for dispute resolution,
• Software “lock-in” and the evolution of scientific ideas (standards lock-in),
• Attribution in digital communication:
– Legal attribution and academic citation not isomorphic
– Contribution tracking (RDFa)
• RRS: need for the individual scientist to act,
• “Progress depends on artificial aids becoming so familiar they are regarded as natural.” I.J. Good, “How Much Science Can You Have at Your Fingertips,” 1958
Papers and Links
“Enabling Reproducible Research: Open Licensing for Scientific Innovation”
“15 Years of Reproducible Research in Computational Harmonic Analysis”
“The Legal Framework for Reproducible Research in the Sciences: Licensing and Copyright”
http://www.stanford.edu/~vcs
http://www.stanford.edu/~vcs/Conferences/RoundtableNov212009/
Appendix: Attribution
• Legal attribution and academic citation not isomorphic.
• Minimize administrative burden
• Evolving norms / field-specific norms / technology
• “keep intact all copyright notices for the Workand provide, reasonable to the medium ormeans You are utilizing… .”