Reproducible research: Practice
Tobin Magle, PhD, Bioinformationist
Health Science Library, University of Colorado Anschutz Medical Campus
Reproducibility is the practice of distributing all data, software source code, and tools required to reproduce the results discussed in a research publication.
https://www.ctspedia.org/do/view/CTSpedia/ReproducibleResearchStandards
Replication vs. Reproducibility
• Replication: The confirmation of results and conclusions from one study obtained independently in another is considered the scientific gold standard.
  • “Again, and Again, and Again …” BR Jasny et al. Science, 2011. 334(6060) pp. 1225 DOI: 10.1126/science.334.6060.1225
• Some studies can’t be replicated: too big, too costly, too time consuming, one-time event, rare samples
• Reproducibility: minimum standard for assessing the value of scientific claims, particularly when full independent replication of a study is not feasible
  • “Reproducible Research in Computational Science”. RD Peng. Science, 2011. 334(6060) pp. 1226-1227 DOI: 10.1126/science.1213847
Research Lifecycle:
Form Hypothesis
Collect Data
Design Experiment
Publish research
Analyze Data
Write manuscript
Complications
1. Technological advances:
• Huge, complex digital datasets
• Computational power
• Ability to share
2. Human Error:
• Poor Reporting
• Flawed analyses
Complicated Research Lifecycle
Form Hypothesis
Collect Data
Design Experiment
Publish research
Clean Data
Analyze Data
Write manuscript
Share data
Curate data
Plan for data storage
Requires new expertise and infrastructure
Reproducible research tools
• Data Management Plans
• Version control
• Literate Statistical Computing
DMPTool
• Developed by the California Digital Library to help researchers write data management plans
• https://dmptool.org/user_sessions/institution
• Select University of Colorado Anschutz Medical Campus
• Create an account* or sign in
*We’re working with OIT to allow logins with CU passport credentials. Stay tuned.
Data management exercise
• Create a DMPTool account
• Pick a template and create a DMP
• Take 5 minutes to click through the template and think about how these questions relate to your research
Version control
Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.
https://git-scm.com/doc
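As a rough illustration of "recalling a specific version later", here is a minimal sketch using Git, which the later slides introduce. The file name, its contents, and the author identity are hypothetical placeholders, and the work happens in a throwaway directory.

```shell
# Work in a throwaway directory so no existing project is touched.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
# Placeholder identity, set only for this repository, so commits succeed on a fresh machine.
git config user.name "Example Researcher"
git config user.email "researcher@example.org"

# Version 1 of a (hypothetical) analysis settings file.
echo "threshold = 0.05" > analysis.txt
git add analysis.txt
git commit -q -m "Set initial threshold"

# Version 2 overwrites the same file; no manual copy is kept.
echo "threshold = 0.01" > analysis.txt
git commit -q -a -m "Tighten threshold"

# Look at the file as it was one commit ago, without changing the working copy.
git show HEAD~1:analysis.txt

# Bring that earlier version back into the working directory.
git checkout HEAD~1 -- analysis.txt
cat analysis.txt
```

Both of the last two commands print the first version of the file, while the newer version stays stored in the history.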
Intuitive version control
Original (V1), V2, V3
But what if you save a new file into the wrong version?
Local version control system
Figure 1-1. Local version control. https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control
• Keeps files in one place
• No copies
• Keeps track of changes
• Like Apple’s Time Machine
But what if you need to collaborate?
Centralized version control
https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control
Figure 1-2. Centralized version control.
• Keeps files on a server
• No copies
• Keeps track of changes
• Can work simultaneously on different files
But what if the server goes down?
What if you can’t get online?
Distributed version control
https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control
Figure 1-3. Distributed version control.
Git, Mercurial, Bazaar or Darcs
• Keeps files locally AND on a server
• Changes are shared among computers and the server
• Keeps track of changes
• Can work simultaneously
What is Git?
• Distributed version control system developed by the Linux community
• A stream of snapshots
Figure 1-5. Storing data as snapshots of the project over time.
https://git-scm.com/book/en/v2/Getting-Started-Git-Basics
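The "stream of snapshots" idea can be inspected directly. The sketch below, with hypothetical file names and a placeholder identity, makes two commits in a throwaway repository and lists what each snapshot contains.

```shell
# Throwaway repository with a placeholder identity.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git config user.name "Example Researcher"
git config user.email "researcher@example.org"

# Snapshot 1: one data file.
echo "sample_id,value" > data.csv
git add data.csv
git commit -q -m "Add data file"

# Snapshot 2: a second file is added; data.csv is left unchanged.
echo "Collected in the lab" > notes.txt
git add notes.txt
git commit -q -m "Add notes"

# Each commit records the full set of files in the project at that moment.
git ls-tree --name-only HEAD~1   # files in the first snapshot
git ls-tree --name-only HEAD     # files in the second snapshot

# The unchanged file resolves to the same stored object in both snapshots.
git rev-parse HEAD~1:data.csv
git rev-parse HEAD:data.csv
```

The two `git rev-parse` lines print the same identifier, which is how Git avoids storing an unchanged file twice.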
3 states of repository files
• Modified – the file is altered but not committed
• Staged – the file is altered and marked to go into the next commit
• Committed – the file's changes are safely stored in your local database
3 Sections of your directory
Figure 1-6. Working directory, staging area, and Git directory. https://git-scm.com/book/en/v2/Getting-Started-Git-Basics
• Working directory: Modified
• Staging area: Staged
• Git directory: Committed
Important git commands
• Init (Initialize) – start a git repository
• Add – add files to the git repository (for the initial add and for staging); staging of already-tracked files can be skipped with commit -a
• Commit – safely store the files in your git repository
• Clone – make a copy of someone else’s git repository
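The four commands above can be tried end to end without a network connection. This is a minimal sketch in a temporary directory; the repository names, file name, and identity are made-up placeholders, and the clone is taken from a local folder rather than from someone else's online repository.

```shell
# Everything happens under a temporary directory.
work_dir="$(mktemp -d)"
cd "$work_dir"

# init: start a new repository in the folder "project".
git init -q project
cd project
git config user.name "Example Researcher"
git config user.email "researcher@example.org"

# add + commit: stage a new file, then store it in the repository.
echo "Notes" > notes.txt
git add notes.txt
git commit -q -m "Add notes"

# commit -a: stage every already-tracked file that changed and commit in one step.
echo "More notes" >> notes.txt
git commit -q -a -m "Extend notes"

# clone: copy the repository, including its full history.
cd ..
git clone -q project project-copy
git -C project-copy log --oneline
```

The final command lists both commits from inside the copy, showing that a clone carries the whole history and not only the latest files.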
File statuses and how they change
Figure 2-1. The lifecycle of the status of your files. https://git-scm.com/book/en/v2/Git-Basics-Recording-Changes-to-the-Repository
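One way to watch a file move through these statuses is to run `git status --short` after each step. The sketch below uses a hypothetical file, paper.txt, in a throwaway repository; the comments give the two-character code Git prints at each stage.

```shell
# Throwaway repository with a placeholder identity.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git config user.name "Example Researcher"
git config user.email "researcher@example.org"

echo "draft" > paper.txt
git status --short    # "?? paper.txt"  untracked

git add paper.txt
git status --short    # "A  paper.txt"  staged (new file)

git commit -q -m "Add draft"
git status --short    # no output: tracked and unmodified

echo "revision" >> paper.txt
git status --short    # " M paper.txt"  modified, not yet staged

git add paper.txt
git status --short    # "M  paper.txt"  modified and staged
```

The left column of the code describes the staging area and the right column describes the working directory, which is why the "M" shifts position after `git add`.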
Git your hands on git
• Create a GitHub account
• Go to the repository: https://github.com/maglet/hands-on-git
• Clone the repository
Cloning/Branching/Forking
• Cloning: make a local copy of a repository online or elsewhere
• Branching: creating a separate stream to test new features, so you don’t affect the “trunk”; branches depend on the trunk
  • Collaboration
• Forking: making a separate copy of a repository that is not dependent on the original
  • Using others’ work as a starting point; preserving for yourself things that the owner might delete
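Cloning and branching can be sketched locally; forking is a feature of hosting services such as GitHub and is not shown. The repository names, branch name, and file names below are hypothetical, and a local folder stands in for the online repository.

```shell
work_dir="$(mktemp -d)"
cd "$work_dir"

# A stand-in for the repository that would normally live online.
git init -q shared-repo
cd shared-repo
git config user.name "Repository Owner"
git config user.email "owner@example.org"
echo "Shared project" > README.txt
git add README.txt
git commit -q -m "Initial commit"
cd ..

# Cloning: make a local copy to work in.
git clone -q shared-repo my-clone
cd my-clone
git config user.name "Example Researcher"
git config user.email "researcher@example.org"

# Remember the trunk's name (it may be "master" or "main" depending on settings).
trunk="$(git symbolic-ref --short HEAD)"

# Branching: try a change without touching the trunk.
git checkout -q -b new-feature
echo "Experimental idea" > feature.txt
git add feature.txt
git commit -q -m "Try a new feature"

# Back on the trunk, the new file is absent until the branch is merged.
git checkout -q "$trunk"
ls feature.txt 2>/dev/null || echo "trunk does not have feature.txt yet"

# Merging brings the branch's work into the trunk.
git merge -q new-feature
ls feature.txt
```

On a hosting service, the merge step is usually proposed through a pull request so the repository owner can review the change first.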
Exercise
• Go to the repository you cloned earlier
• Create a text file with your name on it
• Add it to the name folder
• Submit a pull request
• Look at what happens to the visual representation
Literate (statistical) programming
• Resulting report is a stream of text (human readable) and code (machine readable)
• Alternate text and code
  • Sweave
  • R Markdown
R Markdown
• Open
• Write
• Embed
• Render
https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf
Install knitr and markdown packages
• Tools > Install Packages
• Enter the package name (will autocomplete)
  • knitr
  • markdown
• OR install.packages("knitr")
• If it fails, try again
Write: useful syntax
• Plain text
• *italics* -> italics
• **bold** -> bold
• # Header -> Header (each additional # decreases the size)
• Can also add:
  • Pictures
  • Ordered and unordered lists
  • Tables
Embed code
• Inline – Use variables in the human readable text
  • `r 2 + 2`
• Code chunks – Include working code that generates output
```{r}
# Code goes here
```
• Display Options –
Render
• Won’t render unless the code runs with no errors
  • You know it should be reproducible
• Render using the knit function
• Output Formats
  • Knit HTML
  • Knit PDF – requires LaTeX
  • Knit Word