Reproducible Research With R and RStudio

306

description

Christopher Gandrud

Transcript of Reproducible Research With R and RStudio

  • Reproducible Research with R

    and RStudio

    K16624_FM.indd 1 5/24/13 3:23 PM

  • Chapman & Hall/CRCThe R Series

    John M. ChambersDepartment of Statistics

    Stanford University Stanford, California, USA

    Duncan Temple LangDepartment of Statistics

    University of California, DavisDavis, California, USA

    Torsten HothornInstitut fr Statistik

    Ludwig-Maximilians-UniversittMnchen, Germany

    Hadley WickhamDepartment of Statistics

    Rice UniversityHouston, Texas, USA

    Aims and ScopeThis book series reflects the recent rapid growth in the development and application of R, the programming language and software environment for statistical computing and graphics. R is now widely used in academic research, education, and industry. It is constantly growing, with new versions of the core software released regularly and more than 4,000 packages available. It is difficult for the documentation to keep pace with the expansion of the software, and this vital book series provides a forum for the publication of books covering many aspects of the development and application of R.

    The scope of the series is wide, covering three main threads: Applications of R to specific disciplines such as biology, epidemiology,

    genetics, engineering, finance, and the social sciences. Using R for the study of topics of statistical methodology, such as linear and

    mixed modeling, time series, Bayesian methods, and missing data. The development of R, including programming, building packages, and

    graphics.

    The books will appeal to programmers and developers of R software, as well as applied statisticians and data analysts in many fields. The books will feature detailed worked examples and R code fully integrated into the text, ensuring their usefulness to researchers, practitioners and students.

    Series Editors

    K16624_FM.indd 2 5/24/13 3:23 PM

  • Published Titles

    Customer and Business Analytics: Applied Data Mining for Business Decision Making Using R, Daniel S. Putler and Robert E. Krider

    Dynamic Documents with R and knitr, Yihui Xie

    Event History Analysis with R, Gran Brostrm

    Programming Graphical User Interfaces with R, Michael F. Lawrence and John Verzani

    R Graphics, Second Edition, Paul Murrell

    Reproducible Research with R and RStudio, Christopher Gandrud

    Statistical Computing in C++ and R, Randall L. Eubank and Ana Kupresanin

    K16624_FM.indd 3 5/24/13 3:23 PM

  • This page intentionally left blankThis page intentionally left blank

  • The R Series

    Reproducible Research with R

    and RStudio

    Christopher Gandrud

    K16624_FM.indd 5 5/24/13 3:23 PM

  • CRC PressTaylor & Francis Group6000 Broken Sound Parkway NW, Suite 300Boca Raton, FL 33487-2742

    2014 by Taylor & Francis Group, LLCCRC Press is an imprint of Taylor & Francis Group, an Informa business

    No claim to original U.S. Government worksVersion Date: 20130524

    International Standard Book Number-13: 978-1-4665-7285-0 (eBook - PDF)

    This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

    Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information stor-age or retrieval system, without written permission from the publishers.

    For permission to photocopy or use material electronically from this work, please access www.copy-right.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that pro-vides licenses and registration for a variety of users. For organizations that have been granted a pho-tocopy license by the CCC, a separate system of payment has been arranged.

    Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.Visit the Taylor & Francis Web site athttp://www.taylorandfrancis.comand the CRC Press Web site athttp://www.crcpress.com

  • Preface

    This book has its genesis in my PhD research at the London School of Eco-nomics. I started the degree with questions about the 2008/09 financial crisisand planned to spend most of my time researching about capital adequacyrequirements. But I quickly realized much of my time would actually be spentlearning the day-to-day tasks of data gathering, analysis, and results pre-sentation. After plodding through for awhile, the breaking point came whilereentering results into a regression table after I had tweaked one of my sta-tistical models, yet again. Surely there was a better way to do research thatwould allow me to spend more time answering my research questions. Makingresearch reproducible for others also means making it better organized andecient for yourself. So, my search for a better way led me straight to thetools for reproducible computational research.

    The reproducible research community is very active, knowledgeable, andhelpful. Nonetheless, I often encountered holes in this collective knowledge,or at least had no resource to bring it all together as a whole. That is myintention for this book: to bring together the skills I have picked up for actuallydoing and presenting computational research. Hopefully, the book, along withmaking reproducible research more common, will save researchers hours ofgoogling, so they can spend more time addressing their research questions.

    I would not have been able to write this book without many peoples adviceand support. Foremost is John Kimmel, acquisitions editor at Chapman andHall. He approached me in Spring 2012 with the general idea and opportunityfor this book. Other editors at Chapman and Hall and Taylor and Francishave greatly contributed to this project, including Marcus Fontaine. I wouldalso like to thank all of the books reviewers whose helpful comments havegreatly improved it. The reviewers include:

    Jeromy Anglim, Deakin University

    Karl Broman, University of Wisconsin, Madison

    Jake Bowers, University of Illinois, Urbana-Champaign

    Corey Chivers, McGill University

    Mark M. Fredrickson, University of Illinois, Urbana-Champaign

    Benjamin Lauderdale, London School of Economics

    Ramnath Vaidyanathan, McGill University

    vii

  • viii

    The developer and blogging community has also been incredibly importantfor making this book possible. Foremost among these people is Yihui Xie. Heis the main developer behind the knitr package and also an avid writer andcommenter of blogs. Without him the ability to do reproducible research wouldbe much harder and the blogging community that spreads knowledge abouthow to do these things would be poorer. Other great contributors to thereproducible research community include Carl Boettiger, Markus Gesmann(who developed googleVis), Rob Hyndman, and Hadley Wickham (who hasdeveloped numerous very useful R packages). Thank you also to VictoriaStodden and Michael Malecki for helpful suggestions.

    The vibrant community at Stack Overflow http://stackoverflow.com/and Stack Exchange http://stackexchange.com/ are always very helpfulfor finding answers to problems that plague any computational researcher.Importantly, the sites make it easy for others to find the answers to questionsthat have already been asked.

    My students at Yonsei University were an important part of making thisbook. One of the reasons that I got interested in using many of the toolscovered in this book, like using knitr in slideshows, was to improve a course Itaught there: Introduction to Social Science Data Analysis. I tested many ofthe explanations and examples in this book on my students. Their feedbackhas been very helpful for making the book clearer and more useful. Theirexperience with using these tools on Windows computers was also importantfor improving the books Windows documentation.

    My wife, Kristina Gandrud, has been immensely supportive and patientwith me throughout the writing of this book (and pretty much my entire aca-demic career). Certainly this is not the proper forum for musing about maritalrelations, but Ill add a musing anyways. Having a person who supports yourinterests, even if they dont completely share them, is immensely helpful fora researcher. It keeps you going.

  • Contents

    Preface vii

    Stylistic Conventions xv

    Required R Packages xvii

    Additional Resources xixChapter Examples . . . . . . . . . . . . . . . . . . . . . . . . xixShort Example Project . . . . . . . . . . . . . . . . . . . . . . xix

    List of Figures xxiii

    List of Tables xxv

    I Getting Started 1

    1 Introducing Reproducible Research 31.1 What is Reproducible Research? . . . . . . . . . . . . . . . . 31.2 Why Should Research Be Reproducible? . . . . . . . . . . . 5

    1.2.1 For science . . . . . . . . . . . . . . . . . . . . . . . . 51.2.2 For you . . . . . . . . . . . . . . . . . . . . . . . . . . 6

    1.3 Who Should Read This Book? . . . . . . . . . . . . . . . . . 71.3.1 Academic researchers . . . . . . . . . . . . . . . . . . . 81.3.2 Students . . . . . . . . . . . . . . . . . . . . . . . . . . 81.3.3 Instructors . . . . . . . . . . . . . . . . . . . . . . . . 81.3.4 Editors . . . . . . . . . . . . . . . . . . . . . . . . . . 91.3.5 Private sector researchers . . . . . . . . . . . . . . . . 9

    1.4 The Tools of Reproducible Research . . . . . . . . . . . . . . 101.5 Why Use R, knitr, and RStudio for Reproducible Research? 11

    1.5.1 Installing the main software . . . . . . . . . . . . . . . 131.6 Book Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 14

    1.6.1 How to read this book . . . . . . . . . . . . . . . . . . 151.6.2 Reproduce this book . . . . . . . . . . . . . . . . . . . 161.6.3 Contents overview . . . . . . . . . . . . . . . . . . . . 16

    ix

  • x2 Getting Started with Reproducible Research 172.1 The Big Picture: A Workflow for Reproducible Research . . 17

    2.1.1 Reproducible theory . . . . . . . . . . . . . . . . . . . 182.2 Practical Tips for Reproducible Research . . . . . . . . . . . 20

    2.2.1 Document everything! . . . . . . . . . . . . . . . . . . 202.2.2 Everything is a (text) file . . . . . . . . . . . . . . . . 222.2.3 All files should be human readable . . . . . . . . . . . 222.2.4 Explicitly tie your files together . . . . . . . . . . . . . 242.2.5 Have a plan to organize, store, & make your files avail-

    able . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

    3 Getting Started with R, RStudio, and knitr 273.1 Using R: the Basics . . . . . . . . . . . . . . . . . . . . . . . 27

    3.1.1 Objects . . . . . . . . . . . . . . . . . . . . . . . . . . 283.1.2 Component selection . . . . . . . . . . . . . . . . . . . 343.1.3 Subscripts . . . . . . . . . . . . . . . . . . . . . . . . . 363.1.4 Functions and commands . . . . . . . . . . . . . . . . 373.1.5 Arguments . . . . . . . . . . . . . . . . . . . . . . . . 383.1.6 The workspace & history . . . . . . . . . . . . . . . . 393.1.7 Global R options . . . . . . . . . . . . . . . . . . . . . 413.1.8 Installing new packages and loading commands . . . . 42

    3.2 Using RStudio . . . . . . . . . . . . . . . . . . . . . . . . . . 433.3 Using knitr : the Basics . . . . . . . . . . . . . . . . . . . . . 44

    3.3.1 What knitr does . . . . . . . . . . . . . . . . . . . . . 453.3.2 File extensions . . . . . . . . . . . . . . . . . . . . . . 453.3.3 Code chunks . . . . . . . . . . . . . . . . . . . . . . . 473.3.4 Global chunk options . . . . . . . . . . . . . . . . . . . 493.3.5 knitr package options . . . . . . . . . . . . . . . . . . 513.3.6 Hooks . . . . . . . . . . . . . . . . . . . . . . . . . . . 523.3.7 knitr & RStudio . . . . . . . . . . . . . . . . . . . . . 523.3.8 knitr & R . . . . . . . . . . . . . . . . . . . . . . . . . 55

    4 Getting Started with File Management 594.1 File Paths & Naming Conventions . . . . . . . . . . . . . . . 60

    4.1.1 Root directories . . . . . . . . . . . . . . . . . . . . . . 604.1.2 Subdirectories & parent directories . . . . . . . . . . . 604.1.3 Spaces in directory & file names . . . . . . . . . . . . 614.1.4 Working directories . . . . . . . . . . . . . . . . . . . . 61

    4.2 Organizing Your Research Project . . . . . . . . . . . . . . . 634.3 Setting Directories as RStudio Projects . . . . . . . . . . . . 644.4 R File Manipulation Commands . . . . . . . . . . . . . . . . 644.5 Unix-like Shell Commands for File Management . . . . . . . 674.6 File Navigation in RStudio . . . . . . . . . . . . . . . . . . . 72

    II Data Gathering and Storage 73

  • xi

    5 Storing, Collaborating, Accessing Files, & Versioning 755.1 Saving Data in Reproducible Formats . . . . . . . . . . . . . 765.2 Storing Your Files in the Cloud: Dropbox . . . . . . . . . . . 77

    5.2.1 Storage . . . . . . . . . . . . . . . . . . . . . . . . . . 785.2.2 Accessing data . . . . . . . . . . . . . . . . . . . . . . 785.2.3 Collaboration . . . . . . . . . . . . . . . . . . . . . . . 805.2.4 Version control . . . . . . . . . . . . . . . . . . . . . . 80

    5.3 Storing Your Files in the Cloud: GitHub . . . . . . . . . . . 815.3.1 Setting up GitHub: basic . . . . . . . . . . . . . . . . 835.3.2 Version control with Git . . . . . . . . . . . . . . . . . 845.3.3 Remote storage on GitHub . . . . . . . . . . . . . . . 915.3.4 Accessing on GitHub . . . . . . . . . . . . . . . . . . . 94

    5.3.4.1 Collaboration with GitHub . . . . . . . . . . 965.3.5 Summing up the GitHub workflow . . . . . . . . . . . 96

    5.4 RStudio & GitHub . . . . . . . . . . . . . . . . . . . . . . . 975.4.1 Setting up Git/GitHub with Projects . . . . . . . . . 985.4.2 Using Git in RStudio Projects . . . . . . . . . . . . . 99

    6 Gathering Data with R 1016.1 Organize Your Data Gathering: Makefiles . . . . . . . . . . . 101

    6.1.1 R Make-like files . . . . . . . . . . . . . . . . . . . . . 1026.1.2 GNU Make . . . . . . . . . . . . . . . . . . . . . . . . 103

    6.1.2.1 Example makefile . . . . . . . . . . . . . . . 1046.1.2.2 Makefiles and RStudio Projects . . . . . . . 1086.1.2.3 Other information about makefiles . . . . . . 109

    6.2 Importing Locally Stored Data Sets . . . . . . . . . . . . . . 1096.3 Importing Data Sets from the Internet . . . . . . . . . . . . 110

    6.3.1 Data from non-secure (http) URLs . . . . . . . . . . . 1116.3.2 Data from secure (https) URLs . . . . . . . . . . . . 1116.3.3 Compressed data stored online . . . . . . . . . . . . . 1146.3.4 Data APIs & feeds . . . . . . . . . . . . . . . . . . . . 115

    6.4 Advanced Automatic Data Gathering: Web Scraping . . . . . 117

    7 Preparing Data for Analysis 1217.1 Cleaning Data for Merging . . . . . . . . . . . . . . . . . . . 121

    7.1.1 Get a handle on your data . . . . . . . . . . . . . . . . 1217.1.2 Reshaping data . . . . . . . . . . . . . . . . . . . . . . 1237.1.3 Renaming variables . . . . . . . . . . . . . . . . . . . . 1267.1.4 Ordering data . . . . . . . . . . . . . . . . . . . . . . . 1267.1.5 Subsetting data . . . . . . . . . . . . . . . . . . . . . . 1277.1.6 Recoding string/numeric variables . . . . . . . . . . . 1297.1.7 Creating new variables from old . . . . . . . . . . . . 1317.1.8 Changing variable types . . . . . . . . . . . . . . . . . 134

    7.2 Merging Data Sets . . . . . . . . . . . . . . . . . . . . . . . . 1357.2.1 Binding . . . . . . . . . . . . . . . . . . . . . . . . . . 135

  • xii

    7.2.2 The merge command . . . . . . . . . . . . . . . . . . . 1357.2.3 Duplicate values . . . . . . . . . . . . . . . . . . . . . 1387.2.4 Duplicate columns . . . . . . . . . . . . . . . . . . . . 139

    III Analysis and Results 143

    8 Statistical Modelling and knitr 1458.1 Incorporating Analyses into the Markup . . . . . . . . . . . 146

    8.1.1 Full code chunks . . . . . . . . . . . . . . . . . . . . . 1468.1.2 Showing code & results inline . . . . . . . . . . . . . . 148

    8.1.2.1 LaTeX . . . . . . . . . . . . . . . . . . . . . 1488.1.2.2 Markdown . . . . . . . . . . . . . . . . . . . 150

    8.1.3 Dynamically including non-R code in code chunks . . 1508.2 Dynamically Including Modular Analysis Files . . . . . . . . 152

    8.2.1 Source from a local file . . . . . . . . . . . . . . . . . . 1528.2.2 Source from a non-secure URL (http) . . . . . . . . . 1538.2.3 Source from a secure URL (https) . . . . . . . . . . . 154

    8.3 Reproducibly Random: set.seed . . . . . . . . . . . . . . . 1558.4 Computationally Intensive Analyses . . . . . . . . . . . . . . 157

    9 Showing Results with Tables 1599.1 Basic knitr Syntax for Tables . . . . . . . . . . . . . . . . . . 1609.2 Table Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . 160

    9.2.1 Tables in LaTeX . . . . . . . . . . . . . . . . . . . . . 1619.2.2 Tables in Markdown/HTML . . . . . . . . . . . . . . 165

    9.3 Creating Tables from R Objects . . . . . . . . . . . . . . . . 1699.3.1 xtable & apsrtable basics with supported class objects 169

    9.3.1.1 apsrtable for LaTeX . . . . . . . . . . . . . . 1729.3.2 xtable with non-supported class objects . . . . . . . . 1759.3.3 Creating variable description documents with xtable . 177

    10 Showing Results with Figures 18110.1 Including Non-knitted Graphics . . . . . . . . . . . . . . . . 181

    10.1.1 Including graphics in LaTeX . . . . . . . . . . . . . . 18210.1.2 Including graphics in Markdown/HTML . . . . . . . . 184

    10.2 Basic knitr Figure Options . . . . . . . . . . . . . . . . . . . 18510.2.1 Chunk options . . . . . . . . . . . . . . . . . . . . . . 18510.2.2 Global options . . . . . . . . . . . . . . . . . . . . . . 187

    10.3 Knitting Rs Default Graphics . . . . . . . . . . . . . . . . . 18710.4 Including ggplot2 Graphics . . . . . . . . . . . . . . . . . . . 192

    10.4.1 Showing regression results with caterpillar plots . . . . 19510.5 JavaScript Graphs with googleVis . . . . . . . . . . . . . . . 198

    IV Presentation Documents 203

  • xiii

    11 Presenting with LaTeX 20511.1 The Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205

    11.1.1 Getting started with LaTeX editors . . . . . . . . . . 20511.1.2 Basic LaTeX command syntax . . . . . . . . . . . . . 20611.1.3 The LaTeX preamble & body . . . . . . . . . . . . . . 20711.1.4 Headings . . . . . . . . . . . . . . . . . . . . . . . . . 21011.1.5 Paragraphs & spacing . . . . . . . . . . . . . . . . . . 21011.1.6 Horizontal lines . . . . . . . . . . . . . . . . . . . . . . 21111.1.7 Text formatting . . . . . . . . . . . . . . . . . . . . . . 21111.1.8 Math . . . . . . . . . . . . . . . . . . . . . . . . . . . 21211.1.9 Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21311.1.10Footnotes . . . . . . . . . . . . . . . . . . . . . . . . . 21411.1.11Cross-references . . . . . . . . . . . . . . . . . . . . . . 214

    11.2 Bibliographies with BibTeX . . . . . . . . . . . . . . . . . . 21511.2.1 The .bib file . . . . . . . . . . . . . . . . . . . . . . . . 21511.2.2 Including citations in LaTeX documents . . . . . . . . 21611.2.3 Generating a BibTeX file of R package citations . . . 217

    11.3 Presentations with LaTeX Beamer . . . . . . . . . . . . . . . 22011.3.1 Beamer basics . . . . . . . . . . . . . . . . . . . . . . 22011.3.2 knitr with LaTeX slideshows . . . . . . . . . . . . . . 223

    12 Large LaTeX Documents: Theses, Books, & Batch Reports 22512.1 Planning Large Documents . . . . . . . . . . . . . . . . . . . 22512.2 Large Documents with Traditional LaTeX . . . . . . . . . . 226

    12.2.1 Inputting/including children . . . . . . . . . . . . . . 22712.2.2 Other common features of large documents . . . . . . 228

    12.3 knitr and Large Documents . . . . . . . . . . . . . . . . . . . 22912.3.1 The parent document . . . . . . . . . . . . . . . . . . 22912.3.2 Knitting child documents . . . . . . . . . . . . . . . . 230

    12.4 Child Documents in a Dierent Markup Language . . . . . . 23112.5 Creating Batch Reports . . . . . . . . . . . . . . . . . . . . . 232

    13 Presenting on the Web with Markdown 23913.1 The Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239

    13.1.1 Getting started with Markdown editors . . . . . . . . 24013.1.2 Preamble and document structure . . . . . . . . . . . 24013.1.3 Headers . . . . . . . . . . . . . . . . . . . . . . . . . . 24313.1.4 Horizontal lines . . . . . . . . . . . . . . . . . . . . . . 24313.1.5 Paragraphs and new lines . . . . . . . . . . . . . . . . 24313.1.6 Italics and bold . . . . . . . . . . . . . . . . . . . . . . 24413.1.7 Links . . . . . . . . . . . . . . . . . . . . . . . . . . . 24413.1.8 Special characters and font customization . . . . . . . 24413.1.9 Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24413.1.10Escape characters . . . . . . . . . . . . . . . . . . . . 24513.1.11Math with MathJax . . . . . . . . . . . . . . . . . . . 245

  • xiv

    13.2 Markdown with Pandoc and Custom CSS . . . . . . . . . . . 24613.2.1 Pandoc . . . . . . . . . . . . . . . . . . . . . . . . . . 24613.2.2 CSS style files and Markdown . . . . . . . . . . . . . . 25013.2.3 knitr s pandoc command . . . . . . . . . . . . . . . . 251

    13.3 Slideshows with Markdown, knitr, and HTML . . . . . . . . 25413.3.1 Slideshows with Markdown, knitr, and RStudios R Pre-

    sentations . . . . . . . . . . . . . . . . . . . . . . . . . 25413.3.2 Slideshows with Markdown, knitr, and slidify . . . . 256

    13.4 Publishing Markdown Documents . . . . . . . . . . . . . . . 26213.4.1 Stand alone HTML files . . . . . . . . . . . . . . . . . 26213.4.2 Hosting webpages with Dropbox . . . . . . . . . . . . 26313.4.3 GitHub Pages . . . . . . . . . . . . . . . . . . . . . . . 263

    14 Conclusion 26514.1 Citing Reproducible Research . . . . . . . . . . . . . . . . . 26514.2 Licensing Your Reproducible Research . . . . . . . . . . . . . 26714.3 Sharing Your Code in Packages . . . . . . . . . . . . . . . . 26714.4 Project Development: Public or Private? . . . . . . . . . . . 26814.5 Is it Possible to Completely Future Proof Your Research? . . 269

    Bibliography 271

    Index 279

  • Stylistic Conventions

    I use the following conventions throughout this book: Abstract VariablesAbstract variables, i.e. variables that do not represent specific objects in anexample, are in ALL CAPS TYPEWRITER TEXT. Clickable ButtonsClickable Buttons are in typewriter text. CodeAll code is in typewriter text. Filenames and DirectoriesFilenames and directories more generally are printed in italics. I use Camel-Back for file and directory names. File extensionsLike filenames, file extensions are italicized. Individual variable valuesIndividual variable values mentioned in the text are in italics. ObjectsObjects are printed in italics. I use CamelBack for object names. Object ColumnsData frame object columns are printed in italics. PackagesR packages are printed in italics. Windows an RStudio PanesOpen windows and RStudio panes are written in italics. Variable NamesVariable names are printed in bold. I use CamelBack for individual variablenames.

    xv

  • This page intentionally left blankThis page intentionally left blank

  • Required R Packages

    In this book I discuss how to use a number of user-written R packages forreproducible research. Many of these packages are not included in the defaultR installation. They need to be installed separately. To install most of theuser-written packages discussed in this book, copy the following code andpaste it into your R console:

    install.packages(c("apsrtable", "brew", "countrycode","devtools", "digest" "formatR", "gdata","ggplot2", "googleVis", "httr", "knitcitations","knitr", "markdown", "openair", "plyr","quantmod", "repmis", "reshape2","RCurl", "rjson", "RJSONIO", "stargazer","texreg", "tools", "treebase","twitteR", "WDI", "XML","xtable", "Zelig"))

    Once you enter this code, you may be asked to select a CRAN mirror todownload the packages from.1 Simply select the mirror closest to you.

    In Chapter 9 we use the Zelig package (Owen et al., 2013) to create asimple Bayesian normal linear regression. For this to work properly you willneed to install an additional package called ZeligBayesian (Owen, 2011). Todo this type the following code into your R console:

    install.packages("ZeligBayesian",repos = "http://r.iq.harvard.edu/",type = "source")

    Ramnath Vaidyanathans slidify package (2012) for creating R Mark-down/HTML slideshows (see Chapter 13) is not currently on CRAN. It can be

    1CRAN stands for the Comprehensive R Archive Network.

    xvii

  • xviii

    downloaded directly from GitHub. To do this first load the devtools package(Wickham and Chang, 2013a). Then download slidify. Here is the completecode:

    # Load devtoolslibrary(devtools)

    # Install slidify and ancillary librariesinstall_github("slidify", "ramnathv")install_github("slidifyLibraries", "ramnathv")

    For more details see the slidify website: http://ramnathv.github.com/slidify/start.html#.

    If you are using Windows you will also need to install Rtools (Ripley andMurdoch, 2012). You can download Rtools from: http://cran.r-project.org/bin/windows/Rtools/. Please use the recommended installation to en-sure that your system PATH is set up correctly. Otherwise your computer willnot know where the tools are.

  • Additional Resources

    Additional resources that supplement the examples in this book can be freelydownloaded and experimented with. These resources include short updates,longer examples discussed in individual chapters as well as a complete shortreproducible research project.

    UpdatesMany of the reproducible research tools discussed in this book are improvingrapidly. Because of this I will regularly post updates to the content covered inthe book at: http://christophergandrud.github.io/RepResR-RStudio/.

    Chapter ExamplesLonger examples discussed in individual chapters, including files to dynam-ically download data, code for creating figures, and markup files for cre-ating presentation documents, can be accessed at: https://github.com/christophergandrud/Rep-Res-Examples. Throughout the book I refer toeach files specific URL so that you can locate it from any computer. Pleasesee Chapter 5 for more information on downloading files from GitHub, wherethe examples are stored.

    Short Example ProjectTo download a full (though very short) example of a reproducible re-search project created using the tools covered in this book go to: https://github.com/christophergandrud/Rep-Res-ExampleProject1. Please fol-low the replication instructions in the main README.md file to fully replicatethe project. It is probably a good idea to hold o looking at this completeexample in detail until after you have become acquainted with the individualtools it uses. Become acquainted with the tools by reading through this bookand working with the individual chapter examples.

    The following two figures give you a sense of how the examples files areorganized. Figure 1 shows how the files are organized in the file system. Figure2 illustrates how the main files are dynamically tied together. In the Datadirectory we have files to gather raw data from the World Bank (2013) onfertilizer consumption and from Pemstein et al. (2010) on countries levels of

    xix

  • xx

    democracy. They are tied to the data through the WDI and download.filecommands. A Makefile can run Gather1.R and Gather2.R to gather and cleanthe data. It runs MergeData.R to merge the data into one data file calledMainData.csv. It also automatically generates a variable description file anda README.md recording the session info.

    The Analysis folder contains two files that create figures presenting thisdata. They are tied toMainData.csv with the read.csv command. These filesare run by the presentation documents when they are knitted. The presen-tation documents tie to the analysis documents with knitr and the sourcecommand.

    Though a simple example, hopefully these files will give you a completesense of how a reproducible research project can be organized. Please feelfree to experiment with dierent ways of organizing the files and tying themtogether to make your research really reproducible.

  • FIGURE1

    ShortE

    xampleProjectF

    ileTree

    Root

    Rep-Res-ExampleProject1

    Data

    MainD

    ata.csv

    Makefile

    MergeData.R

    Gather1.R

    Gather2.R

    MainD

    ata_

    Varia

    bleD

    escriptio

    ns.md

    REA

    DME.Rmd

    Analysis

    GoogleVisM

    ap.R

    ScatterUDSFert.R

    Presentatio

    n

    Article

    Paper.R

    nw

    Main.bib

    Other

    Slideshow Slideshow.Rnw

    Website W

    ebsite.Rmd

    REA

    DME.md

    xxi

  • FIGURE2

    ShortE

    xampleMainFileTies

    Raw

    WDIData

    Gather1.R

    Raw

    UDSData

    Gather2.R

    Makefile

    MergeData.R

    MainD

    ata.csv

    ScatterU

    DSF

    ert.R

    GoogleV

    isMap.R

    Article.Rnw

    Slideshow.Rnw

    Website.Rmd

    Article.pdf

    Slideshow.pdf

    Website.htm

    l

    download.file

    Make

    merge

    WDI

    read.csv

    knitr

    source

    xxii

  • List of Figures

    1 Short Example Project File Tree . . . . . . . . . . . . . . . . xxi2 Short Example Main File Ties . . . . . . . . . . . . . . . . . . xxii

    2.1 Example Workflow & A Selection of Commands to Tie it To-gether . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

    3.1 R Startup Console . . . . . . . . . . . . . . . . . . . . . . . . 283.2 RStudio Startup Panel . . . . . . . . . . . . . . . . . . . . . . 433.3 RStudio Source Code Pane Top Bars . . . . . . . . . . . . . . 453.4 The knitr Process . . . . . . . . . . . . . . . . . . . . . . . . . 463.5 RStudio Notebook Example . . . . . . . . . . . . . . . . . . . 533.6 Folding Code Chunks in RStudio . . . . . . . . . . . . . . . . 54

    4.1 Example Research Project File Tree . . . . . . . . . . . . . . 624.2 An Example RStudio Project Menu . . . . . . . . . . . . . . 634.3 The RStudio Files Pane . . . . . . . . . . . . . . . . . . . . . 72

    5.1 A Basic Git Repository with Hidden .git Folder Revealed . . 835.2 Part of this Books GitHub Repository Webpage . . . . . . . 865.3 Part of this Books GitHub Repository Commit History Page 885.4 Creating RStudio Projects . . . . . . . . . . . . . . . . . . . . 975.5 The RStudio Git More Drop Down Menu . . . . . . . . . . . 985.6 The RStudio Git Tab . . . . . . . . . . . . . . . . . . . . . . 99

    6.1 The RStudio Build Tab . . . . . . . . . . . . . . . . . . . . . 108

    7.1 Density Plot of Fertilizer Consumption (kilograms per hectareof arable land) . . . . . . . . . . . . . . . . . . . . . . . . . . 128

    10.1 An Example Figure in LaTeX . . . . . . . . . . . . . . . . . . 18410.2 Example Simple Scatter Plot Using plot . . . . . . . . . . . 18910.3 Example of a Scatterplot Matrix in a Markdown Document . 19110.4 Example Multi-line Time Series Plot Created with ggplot2 . . 19510.5 An Example Caterpillar Plot Created with ggplot2 . . . . . . 19810.6 Screenshot of a googleVis Geo Map . . . . . . . . . . . . . . . 200

    11.1 RStudio TeX Format Options . . . . . . . . . . . . . . . . . . 20611.2 Knitted Beamer PDF Example . . . . . . . . . . . . . . . . . 221

    xxiii

  • xxiv

    12.1 The brew + knitr Process . . . . . . . . . . . . . . . . . . . . 23212.2 Snippet of an Example PDF Document Created with brew +

    knitr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236

    13.1 Example Rendered R Markdown Document . . . . . . . . . . 24213.2 Example Rendered R Markdown Document with Custom CSS 25213.3 RStudio R Presentation Pane . . . . . . . . . . . . . . . . . . 257

  • List of Tables

    2.1 A Selection of Commands/Packages/Programs for Tying To-gether Your Research Files . . . . . . . . . . . . . . . . . . . 26

    3.1 A Selection of knitr Code Chunk Options . . . . . . . . . . . 50

    5.1 A Selection of Git Commands . . . . . . . . . . . . . . . . . . 87

    7.1 Long Formatted Data Example . . . . . . . . . . . . . . . . . 1237.2 Long Formatted Time-series Cross-sectional Data Example . 1247.3 Wide Formatted Data Example . . . . . . . . . . . . . . . . . 1247.4 Rs Logical Operators . . . . . . . . . . . . . . . . . . . . . . 1307.5 Example Factor Levels . . . . . . . . . . . . . . . . . . . . . . 132

    8.1 A Selection of knitr engine Values . . . . . . . . . . . . . . . 151

    9.1 Example Simple LaTeX Table . . . . . . . . . . . . . . . . . . 1649.2 Linear Regression, Dependent Variable: Exam Score . . . . . 1719.3 Example Nested Estimates Table with aprstable . . . . . . . . 1749.4 Coecient Estimates Predicting Examination Scores in Swiss

    Cantons (1888) Found Using Bayesian Normal Linear Regres-sion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177

    11.1 LaTeX Font Size Commands . . . . . . . . . . . . . . . . . . 21211.2 A Selection of natbib In-text Citation Style Commands . . . . 217

    13.1 A Selection of HTML5 Slideshow Frameworks . . . . . . . . . 258

    xxv

  • This page intentionally left blankThis page intentionally left blank

  • Part I

    Getting Started

    1

  • This page intentionally left blankThis page intentionally left blank

  • 1Introducing Reproducible Research

    Research is often presented in very abridged packages: slideshows, journal arti-cles, books, or maybe even websites. These presentation documents announcea projects findings and try to convince us that the results are correct (Mesirov,2010). Its important to remember that these documents are not the research.Especially in the computational and statistical sciences, these documents arethe advertising. The research is the full software environment, code, anddata that produced the results (Buckheit and Donohue, 1995; Donohue, 2010,385). When we separate the research from its advertisement we are making itdicult for others to verify the findings by reproducing them.

    This book gives you the tools to dynamically combine your research withthe presentation of your findings. The first tool is a workflow for reproducibleresearch that weaves the principles of reproducibility throughout your entireresearch project, from data gathering to the statistical analysis, and the pre-sentation of results. You will also learn how to use a number of computer toolsthat make this workflow possible. These tools include:

    the R statistical language that will allow you to gather data and analyzeit,

    the LaTeX andMarkdown markup languages that you can use to createdocumentsslideshows, articles, books, and webpagesfor presenting yourfindings,

    the knitr package for R and other tools, including command line shellprograms like GNU Make and Git version control, for dynamically tyingyour data gathering, analysis, and presentation documents together sothat they can be easily reproduced,

    RStudio, a program that brings all of these tools together in one place.

    1.1 What is Reproducible Research?Research results are replicable if there is sucient information available forindependent researchers to make the same findings using the same procedures(King, 1995, 444). For research that relies on experiments, this can mean a

    3

  • 4 Reproducible Research with R and RStudio

    researcher not involved in the original research being able to rerun the exper-iment and validate that the new results match the original ones. In computa-tional and quantitative empirical sciences results are replicable if independentresearchers can recreate findings by following the procedures originally usedto gather the data and run the computer code. Of course it is sometimes dif-ficult to replicate the original data set because of limited resources.1 So as anext-best standard we can aim for really reproducible research (Peng, 2011,1226).2 In computational sciences3 this means:

    the data and code used to make a finding are available and they aresucient for an independent researcher to recreate the finding.

    In practice, research needs to be easy for independent researchers to repro-duce (Ball and Medeiros, 2011). If a study is dicult to reproduce its morelikely that no one will reproduce it. If someone does attempt to reproducethis research, it will be dicult for them to tell if any errors they find werein the original research or problems they introduced during the reproduction.In this book you will learn how to avoid these problems.

    In particular you will learn tools for dynamically knitting4 the dataand the source code together with your presentation documents. Combinedwith well organized source files and clearly and completely commented code,independent researchers will be able to understand how you obtained yourresults. This will make your computational research easily reproducible.

    1In this book we will actually aim for replicable research, even if we dont always achieveit. New technologies make it possible to easily replicate some kinds of data sets, especiallyif the original data is available over the internet.

    2The idea of really reproducible computational research was originally thought of andimplemented by Jon Claerbout and the Stanford Exploration Project beginning in the 1980sand early 1990s (Fomel and Claerbout, 2009; Donohue et al., 2009). Further seminal ad-vances were made by Jonathan B. Buckheit and David L. Donohue who created the Wavelablibrary of MatLab routines for their research on wavelets in the mid-1990s (Buckheit andDonohue, 1995).

    3Reproducibility is important for both quantitative and qualitative research (King et al.,1994). Nonetheless, we will focus mainly on on methods for reproducibility in quantitativecomputational research.

    4Much of the reproducible computational research and literate programming literatureshave traditionally used the term weave to describe the process of combining source codeand presentation documents (see Knuth, 1992, 101). In the R community weave is usuallyused to describe the combination of source code and LaTeX documents. The term knitreflects the vocabulary of the knitr R package (knit + R). It is used more generally todescribe weaving with a variety of markup languages. Because of this, I use the term knitrather than weave in this book.

  • Introducing Reproducible Research 5

    1.2 Why Should Research Be Reproducible?

    Reproducibility research is one of the main components of science. If thats notenough reason for you to make your research reproducible, consider that thetools of reproducible research also have direct benefits for you as a researcher.

    1.2.1 For science

    Replicability has been a key part of scientific inquiry from perhaps the 1200s(Bacon, 1859; Nosek et al., 2012). It has even been called the demarcationbetween science and non-science (Braude, 1979, 2). Why is replication soimportant for scientific inquiry?

    Standard to judge scientific claims

    Replication, or at the least reproducibility, opens claims to scrutiny; allowingus to keep what works and discard what doesnt. Science, according to theAmerican Physical Society, is the systematic enterprise of gathering knowl-edge . . . organizing and condensing that knowledge into testable laws andtheories. The ultimate standard for evaluating these scientific claims iswhether or not the claims can be replicated (Peng, 2011; Kelly, 2006). Re-search findings cannot even really be considered genuine contribution[s] tohuman knowledge until they have been verified through replication (Stod-den, 2009b, 38). Replication requires the complete and open exchange ofdata, procedures, and materials. Scientific conclusions that are not replicableshould be abandoned or modified when confronted with more complete orreliable . . . evidence.5

    Avoiding eort duplication & encouraging cumulative knowledge development

    Not only is reproducibility crucial for evaluating scientific claims, it can alsohelp enable the cumulative growth of future scientific knowledge (Kelly, 2006;King, 1995). Reproducible research cuts down on the amount of time scientistshave to spend gathering data or developing procedures that have already beencollected or figured out. Because researchers do not have to discover on theirown things that have already been done, they can more quickly apply thesedata and procedures to building on established findings and developing newknowledge.

    5See the American Physical Societys website at http://www.aps.org/policy/statements/99_6.cfm. See also Fomel and Claerbout (2009).

  • 6 Reproducible Research with R and RStudio

    1.2.2 For youWorking to make your research reproducible does require extra upfront eort.For example, you need to put eort into learning the tools of reproducibleresearch by doing things such as reading this book. But beyond the clear ben-efits for science, why should you make this eort? Using reproducible researchtools can make your research process more eective and (hopefully) ultimatelyeasier.

    Better work habits

    Making a project reproducible from the start encourages you to use betterwork habits. It can spur you to more eectively plan and organize your re-search. It should push you to bring your data and source code up to a higherlevel of quality than you might if you thought no one was looking (Dono-hue, 2010, 386). This forces you to root out errorsa ubiquitous part of compu-tational researchearlier in the research process (Donohue, 2010, 385). Cleardocumentation also makes it easier to find errors.6

    Reproducible research needs to be stored so that other researchers can ac-tually access the data and source code. By taking steps to make your researchaccessible for others you are also making it easier for you to find your data andmethods when you revise your work or begin new projects. You are avoidingpersonal eort duplication, allowing you to cumulatively build on your ownwork more eectively.

    Better teamwork

    The steps you take to make sure an independent researcher can figure outwhat you have done also make it easier for your collaborators to understandyour work and build on it. This applies not only to current collaborators, butalso future collaborators. Bringing new members of a research team up tospeed on a cumulatively growing research project is faster if they can easilyunderstand what has been done already (Donohue, 2010, 386).

    Changes are easier

    A third person may or may not actually reproduce your research even if youmake it easy for them to do so. But, you will almost certainly reproduce parts oreven all of your own research. Almost no actual research process is completelylinear. You almost never gather data, run analyses, and present your resultswithout going backwards to add variables, make changes to your statisticalmodels, create new graphs, alter results tables in light of new findings, and soon. You will probably try to make these changes long after you last workedon the project and long since you remembered the details of how you did

    6Of course, its important to keep in mind that reproducibility is neither necessary norsucient to prevent mistakes (Stodden, 2009a).

  • Introducing Reproducible Research 7

    it. Whether your changes are because of journal reviewers and conferenceparticipants comments or you discover that new and better data has beenmade available since beginning the project, designing your research to bereproducible from the start makes it much easier to change things later on.

    Dynamically reproducible documents in particular can make changingthings much easier. Changes made to one part of a research project havea way of cascading through the other parts. For example, adding a new vari-able to a largely completed analysis requires gathering new data and mergingit with existing data sets. If you used data imputation or matching methodsyou may need to rerun these models. You then have to update your mainstatistical analyses, and recreate the tables and graphs you used to presentthe results. Adding a new variable essentially forces you to reproduce largeportions of your research. If when you started the project you used tools thatmake it easier for others to reproduce your research, you also made it easierto reproduce the work yourself. You will have taken steps to have a betterrelationship with [your] future [self] (Bowers, 2011, 2).

    Higher research impactReproducible research is more likely to be useful for other researchers thannon-reproducible research. Useful research is cited more frequently (Dono-hue, 2002; Piwowar et al., 2007; Vandewalle, 2012). Research that is fullyreproducible contains more information, i.e. more reasons to use and citeit, than presentation documents merely showing findings. Independent re-searchers may use the reproducible data or code to look at other, often unan-ticipated, questions. When they use your work for a new purpose they will(should) cite your work. Because of this, Vandewalle et al. even argue thatthe goal of reproducible research is to have more impact with our research(2007, 1253).

    A reason researchers often avoid making their research fully reproducibleis that they are afraid other people will use their data and code to competewith them. Ill let Donohue et al. address this one:

    True. But competition means that strangers will read your papers, tryto learn from them, cite them, and try to do even better. If you preferobscurity, why are you publishing? (2009, 16)

    1.3 Who Should Read This Book?This book is intended primarily for researchers who want to use a system-atic workflow that encourages reproducibility and the practical state-of-the-art computer tools to put this workflow into practice. This includes profes-sional researchers, upper-level undergraduate, and graduate students working

  • 8 Reproducible Research with R and RStudio

    on computational data-driven projects. Hopefully, editors at academic pub-lishers will also find the book useful for improving their ability to evaluateand edit reproducible research.

    The more researchers that use the tools of reproducibility the better. SoI include enough information in the book for people who have very limitedexperience with these tools, including limited experience with R, LaTeX, andMarkdown. They will be able to start incorporating these tools into theirworkflow right away. The book will also be helpful for people who alreadyhave general experience using technologies such as R and LaTeX, but wouldlike to know how to tie them together for reproducible research.

    1.3.1 Academic researchersHopefully so far in this chapter Ive convinced you that reproducible researchhas benefits for you as a member of the scientific community and personallyas a computational researcher. This book is intended to be a practical guidefor how to actually make your research reproducible. Even if you already usetools such as R and LaTeX you may not be leveraging their full potential.This book will teach you useful ways to get the most out of them as part ofa reproducible research workflow.

    1.3.2 StudentsUpper-level undergraduate and graduate students conducting original compu-tational research should make their research reproducible for the same reasonsthat professional researchers should. Forcing yourself to clearly document thesteps you took will also encourage you to think more clearly about what youare doing and reinforce what you are learning. It will hopefully give you agreater appreciation of research accountability and integrity early in your ca-reer (Barr, 2012; Ball and Medeiros, 2011, 183).

    Even if you dont have extensive experience with computer languages, thisbook will teach you specific habits and tools that you can use throughout yourstudent research and hopefully your careers. Learning these things earlier willsave you considerable time and eort later.

    1.3.3 InstructorsWhen instructors incorporate the tools of reproducible research into their as-signments they not only build students understanding of research best prac-tice, but are also better able to evaluate and provide meaningful feedback onstudents work (Ball and Medeiros, 2011, 183). This book provides a resourcethat you can use with students to put reproducibility into practice.

    If you are teaching computational courses, you may also benefit from mak-ing your lecture material dynamically reproducible. Your slides will be easierto update for the same reasons that it is easier to update research. Making the

  • Introducing Reproducible Research 9

    methods you used to create the material available to students will give themmore information. Clearly documenting how you created lecture material canalso pass information on to future instructors.

    1.3.4 Editors

    Beyond a lack of reproducible research skills among researchers, an imped-iment to actually creating reproducible research is a lack of infrastructureto publish it (Peng, 2011). Hopefully, this book will be useful for editors atacademic publishers who want to be better at evaluating reproducible re-search, editing it, and developing systems to make it more widely available.The journal Biostatistics is a good example of a publication that is encourag-ing (actually requiring) reproducible research. From 2009 the journal has hadan editor for reproducibility that ensures replication files are available andthat results can be replicated using these files (Peng, 2009). The more editorsthere are with the skills to work with reproducible research the more likely itis that researchers will do it.

    1.3.5 Private sector researchers

    Researchers in the private sector may or may not want to make their workeasily reproducible outside of their organization. However, that does not meanthat significant benefits cannot be gained from using the methods of repro-ducible research. First, even if public reproducibility is ruled out to guardproprietary information,7 making your research reproducible to members ofyour organization can spread valuable information about how analyses weredone and data was collected. This will help build your organizations knowl-edge and avoid eort duplication. Just as a lack of reproducibility hinders thespread of information in the scientific community, it can hinder it inside of aprivate organization. Using the sort of dynamic automated processes run withclearly documented source code we will learn in this book can also help createrobust data analysis methods that help your organization avoid errors thatmay come from cutting-and-pasting data across spreadsheets.8

    Also, the tools of reproducible research covered in this book enable you tocreate professional standardized reports that can be easily updated or changedwhen new information is available. In particular, you will learn how to createbatch reports based on quantitative data.

    7There are ways to enable some public reproducibility without revealing confidentialinformation. See Vandewalle et al. (2007) for a discussion of one approach.

    8See this post by David Smith about how the J.P. Morgan Lon-don Whale problem may have been prevented with the type of pro-cesses covered in this book: http://blog.revolutionanalytics.com/2013/02/did-an-excel-error-bring-down-the-london-whale.html (posted 11 February 2013).

  • 10 Reproducible Research with R and RStudio

    1.4 The Tools of Reproducible ResearchThis book will teach you the tools you need to make your research highlyreproducible. Reproducible research involves two broad sets of tools. The firstis a reproducible research environment that includes the statistical toolsyou need to run your analyses as well as the ability to automatically trackthe provenance of data, analyses, and results and to package them (or pointersto persistent versions of them) for redistribution. The second set of tools is areproducible research publisher, which prepares dynamic documents forpresenting results and is easily linked to the reproducible research environment(Mesirov, 2010, 415).

    In this book we will focus on learning how to use the widely availableand highly flexible reproducible research environmentR/RStudio (R CoreTeam, 2013; RStudio, 2013).9 R/RStudio can be linked to numerous repro-ducible research publishers such as LaTeX and Markdown with Yihui Xiesknitr package (2013c). The main tools covered in this book include:

    R: a programming language primarily for statistics and graphics. It canalso be useful for data gathering and creating presentation documents.

    knitr : an R package for literate programming, i.e. it allows you to com-bine your statistical analysis and the presentation of the results into onedocument. It works with R and a number of other languages such as Bash,Python, and Ruby.

    Markup languages: instructions for how to format a presentation doc-ument. In this book we cover LaTeX, Markdown, and a little HTML.

    RStudio: an integrated developer environment (IDE) for R that tightlyintegrates R, knitr , and markup languages.

    Cloud storage & versioning: Services such as Dropbox and Git/Githubthat can store data, code, and presentation files, save previous versions ofthese files, and make this information widely available.

    Unix-like shell programs: These tools are useful for working with largeresearch projects.10 They also allow us to use command line tools includ-ing GNU Make for compiling projects and Pandoc, a program useful forconverting documents from one markup language to another.

    9The book was created with R version 3.0.0 and developer builds of RStudio version0.98.

    10In this book I cover the Bash shell for Linux and Mac as well as Windows PowerShell.

  • Introducing Reproducible Research 11

    1.5 Why Use R, knitr, and RStudio for ReproducibleResearch?

    Why R?

    Why use a statistical programming language like R for reproducible research?R has a very active development community that is constantly expanding whatit is capable of. As we will see in this book this enables researchers across awide range of disciplines to gather data and run statistical analyses. Usingthe knitr package, you can connect your R-based analyses to presentationdocuments created with markup languages such as LaTeX and Markdown.This allows you to dynamically and reproducibly present results in articles,slideshows, and webpages.

    The way you interact with R has benefits for reproducible research. In gen-eral you interact with R (or any other programming and markup language)by explicitly writing down your steps as source code. This promotes repro-ducibility more than your typical interactions with Graphical User Interface(GUI) programs like SPSS11 and Microsoft Word. When you write R code andembed it in presentation documents created using markup languages, you areforced to explicitly state the steps you took to do your research. When you doresearch by clicking through drop down menus in GUI programs, your stepsare lost, or at least documenting them requires considerable extra eort. Alsoit is generally more dicult to dynamically embed your analysis in presenta-tion documents created by GUI word processing programs in a way that willbe accessible to other researchers both now and in the future. Ill come backto these points in Chapter 2.

    Why knitr?

    Literate programming is a crucial part of reproducible quantitative research.12Being able to directly link your analyses, your results, and the code you usedto produce the results makes tracing your steps much easier. There are manydierent literate programming tools for a number of dierent programminglanguages.13 Previously, one of the most common tools for researchers usingR and the LaTeX markup language was Sweave (Leisch, 2002). The package

    11I know you can write scripts in statistical programs like SPSS, but doing so is notencouraged by the programs interface and you often have to learn multiple languages forwriting scripts that run analyses, create graphics, and deal with matrices.

    12Donald Knuth coined the term literate programming in the 1970s to refer to a source filethat could be both run by a computer and woven with a formatted presentation document(Knuth, 1992).

    13A very interesting tool that is worth taking a look at for the Python programminglanguage is HTML Notebooks created with IPython. For more details see http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html.

  • 12 Reproducible Research with R and RStudio

    I am going to focus on in this book is newer and is called knitr . Why are wegoing to use knitr in this book and not Sweave or some other tool?

    The simple answer is that knitr has the same capabilities as Sweave andmore. It can work with markup languages other than LaTeX14 and can evenwork with programming languages other than R. It highlights R code in pre-sentation documents making it easier for your readers to follow.15 It gives youbetter control over the inclusion of graphics and can cache code chunkssavethe output for later. It has the ability to understand Sweave-like syntax, so itwill be easy to convert backwards to Sweave if you want to.16 You also havethe choice to use much simpler and more straightforward syntax with knitr .

    Why RStudio?

    Why use the RStudio integrated development environment for reproducibleresearch? R by itself has the capabilities necessary to gather data, analyze it,and, with a little help from knitr and markup languages, present results in away that is highly reproducible. RStudio allows you to do all of these things,but simplifies many of them and allows you to navigate through them moreeasily. It is a happy medium between Rs text-based interface and a pure GUI.

    Not only does RStudio do many of the things that R can do but more easily,it is also a very good stand alone editor for writing documents with LaTeX andMarkdown. For LaTeX documents it can, for example, insert frequently usedcommands like \section{} for numbered sections (see Chapter 11).17 Thereare many LaTeX editors available, both open source and paid. But RStudio iscurrently the best program for creating reproducible LaTeX and Markdowndocuments. It has full syntax highlighting. Its syntax highlighting can evendistinguish between R code and markup commands in the same document.It can spell check LaTeX and Markdown documents. It handles knitr codechunks beautifully (see Chapter 3).

    Finally, RStudio not only has tight integration with various markup lan-guages, it also has capabilities for using other tools such as C++, CSS,JavaScript, and a few other programming languages. It is closely integratedwith the version control programs Git and SVN. Both of these programs allowyou to keep track of the changes you make to your documents (see Chapter

    14It works with LaTeX, Markdown, HTML, and reStructuredText. We cover the first twoin this book.

    15Syntax highlighting uses dierent colors and fonts to distinguish dierent types of text.For example in the PDF version of this book R commands are highlighted in maroon,while character strings are in lavender.

    16Note that the Sweave-style syntax is not identical to actual Sweave syntax. See YihuiXies discussion of the dierences between the two at: http://yihui.name/knitr/demo/sweave/. knitr has a function (Sweave2knitr) for converting Sweave to knitr syntax.

    17If you are more comfortable with a what-you-see-is-what-you-get (WYSIWYG) wordprocessor like Microsoft Word, you might be interested in exploring Lyx. It is a WYSIWYG-like LaTeX editor that works with knitr. It doesnt work with the other markup languagescovered in this book. For more information see: http://www.lyx.org/. I give some briefinformation on using Lyx with knitr in Chapter 3s Appendix.

  • Introducing Reproducible Research 13

    5). This is important for reproducible research since version control programscan document many of your research steps. It also has a built-in ability tomake HTML slideshows from knitr Markdown documents. Basically, RStudiomakes it easy to create and navigate through complex reproducible researchdocuments.

    1.5.1 Installing the main softwareBefore you read this book you should install the main software. All of thesoftware programs covered in this book are open source and can be easilydownloaded for free. They are available for Windows, Mac, and Linux oper-ating systems. They should run well on most modern computers.

    You should install R before installing RStudio. You can download theprograms from the following websites:

    R: http://www.r-project.org/,

    RStudio: http://www.rstudio.com/ide/download/.

    The download webpages for these programs have comprehensive informationon how to install them, so please refer to those pages for more information.

    After installing R and RStudio you will probably also want to install anumber of user-written packages that are covered in this book. To install allof these user-written packages, please see page xvii.

    Installing markup languagesIf you are planning to create LaTeX documents you need to install a TeXdistribution.18 They are available for Windows, Mac, and Linux systems. Theycan be found at: http://www.latex-project.org/ftp.html. Please refer tothat site for more installation information.

    If you want to create Markdown documents you can separately install themarkdown package in R. You can do this the same way that you install anypackage in R, with the install.packages command.19

    GNU MakeIf you are using a Linux computer you already have GNU Make installed.20Mac users should go to the App Store and download Xcode (its free). OnceXcode is installed, install command line tools, which you will find by openingXcode clicking on Preference Downloads. Windows users will have Make

    18LaTeX is is really a set of macros for the the TeX typesetting system. It is included inall major TeX distributions.

    19The exact command is: install.packages("markdown").20To verify this open the Terminal and type: make version (I used version 3.81 for

    this book). This should output details about the current version of Make installed on yourcomputer.

  • 14 Reproducible Research with R and RStudio

    installed if they have already installed Rtools (see page xviii). Mac and Win-dows users will need to install this software not only so that GNU Make runsproperly, but also so that other command line tools work well.

    Other ToolsWe will discuss other tools such as Git that can be a useful part of a re-producible research workflow. Installation instructions for these tools will bediscussed below.

    1.6 Book OverviewThe purpose of this book is to give you the tools that you will need to doreproducible research with R and RStudio. This book describes a workflowfor reproducible research primarily using R and RStudio. It is designed to giveyou the necessary tools to use this workflow for your own research. It is notdesigned to be a complete reference for R, RStudio, knitr , Git, or any otherprogram that is a part of this workflow. Instead it shows you how these toolscan fit together to make your research more reproducible. To get the most outof these individual programs I will along the way point you to other resourcesthat cover these programs in more detail.

    To that end, I can recommend a number of resources that cover more ofthe nitty-gritty:

    Michael J. Crawleys (2013) encyclopaedic R book, appropriately titledThe R Book, published by Wiley.

    Robert I. Kabacos (2011) book R in Action, published by Manning, isalso useful. In addition he maintains a very helpful website called Quick-R(http://www.statmethods.net/).

    Yihui Xies (2013b) book Dynamic Documents with R and knitr,published by Chapman and Hall, provides a comprehensive look at howto create documents with knitr. Its a good complement to this booksgenerally more research project-level focus.

    Norman Matlos (2011) tour through the programming language aspectsof R called The Art of R Programming: A Tour of StatisticalDesign Software, published by No Starch Press.

    For an excellent introduction to the command line in Linux and Mac, seeWilliam E. Shotts Jr.s (2012) book The Linux Command Line: AComplete Introduction also published by No Starch Press. It is alsohelpful for Windows users running PowerShell (see Chapter 4).

  • Introducing Reproducible Research 15

    The RStudio website (http://www.rstudio.com/ide/docs/) has a num-ber of useful tutorials on how to use knitr with LaTeX and Markdown.

    That being said, my goal is for this book to be self-sucient. A readerwithout a detailed understanding of these programs will be able to understandand use the commands and procedures I cover in this book. While learninghow to use R and the other programs I personally often encountered illustra-tive examples that included commands, variables, and other things that werenot well explained in the texts that I was reading. This caused me to wastemany hours trying to figure out, for example, what the $ is used for (preview:its the component selector, see Section 3.1.2). I hope to save you from thiswasted time by either providing a brief explanation of possibly frustrating andmysterious things and/or pointing you in the direction of good explanations.

    1.6.1 How to read this bookThis book gives you a workflow. It has a beginning, middle, and end. So, unlikea reference book it can and should be read linearly as it takes you throughan empirical research processes from an empty folder to a completed set ofdocuments that reproducibly showcase your findings.

    That being said, readers with more experience using tools like R or LaTeXmay want to skip over the nitty-gritty parts of the book that describe how tomanipulate data frames or compile LaTeX documents into PDFs. Please feelfree to skip these sections.

    More experienced R users

    If you are an experienced R user you may want to skip over the first sectionof Chapter 3: Getting Started with R, RStudio, and knitr. But dont skipover the whole chapter. The latter parts contain important information onthe knitr package. If you are experienced with R data manipulation you mayalso want to skip all of Chapter 7.

    More experienced LaTeX users

    If you are familiar with LaTeX you might want to skip the first part of Chap-ter 11. The second part may be useful as it includes information on how todynamically create BibTeX bibliographies with knitr and how to include knitroutput in a Beamer slideshow.

    Less experienced LaTeX/Markdown users

    If you do not have experience with LaTeX or Markdown you may benefit fromreading, or at least skimming, the introductory chapters on these top topics(chapters 11 and 13) before reading Part III.

  • 16 Reproducible Research with R and RStudio

    1.6.2 Reproduce this bookThis book practices what it preaches. It can be reproduced. I wrote thebook using the programs and methods that I describe. Full documentationand source files can be found at the books GitHub repository. Feel free toread and even use (within reason and with attribution, of course) the bookssource code. You can find it at: https://github.com/christophergandrud/Rep-Res-Book. This is especially useful if you want to know how to do some-thing in the book that I dont directly cover in the text.

    If you notice any errors or places where the book can be improvedplease report them on the books GitHub Issues page: https://github.com/christophergandrud/Rep-Res-Book/issues. Updates and correctionsmade to the book will be posted at: http://christophergandrud.github.io/RepResR-RStudio/.

    1.6.3 Contents overviewThe book is broken into four parts. The first part (chapters 2, 3, and 4)gives an overview of the reproducible research workflow as well as the generalcomputer skills that youll need to use this workflow. Each of the next threeparts of the book guides you through the specific skills you will need for eachpart of the reproducible research process. Part two (chapters 5, 6, and 7)covers the data gathering and file storage process. The third part (chapters8, 9, and 10) teaches you how to dynamically incorporate your statisticalanalysis, results figures, and tables into your presentation documents. The finalpart (chapters 11, 12, and 13) covers how to create reproducible presentationdocuments including LaTeX articles, books, slideshows, and batch reports aswell as Markdown webpages and slideshows.

  • 2Getting Started with Reproducible Research

    Researchers often start thinking about making their work reproducible nearthe end of the research process when they write up their results or maybeeven later when a journal requires their data and code be made available forpublication. Or maybe even later when another researcher asks if they canuse the data from a published article to reproduce the findings. By then theremay be numerous versions of the data set and records of the analyses storedacross multiple folders on the researchers computers. It can be dicult andtime consuming to sift through these files to create an accurate account of howthe results were reached. Waiting until near the end of the research processto start thinking about reproducibility can lead to incomplete documentationthat does not give an accurate account of how findings were made. Focusingon reproducibility from the beginning of the process and continuing to followa few simple guidelines throughout your research can help you avoid theseproblems. Remember reproducibility is not an afterthoughtit is somethingthat must be built into the project from the beginning (Donohue, 2010, 386).

    This chapter first gives you a brief overview of the reproducible researchprocess: a workflow for reproducible research. Then it covers some of the keyguidelines that can help make your research more reproducible.

    2.1 The Big Picture: A Workflow for Reproducible Re-search

    The three basic stages of a typical computational empirical research projectare:

    data gathering,

    data analysis,

    results presentation.

    Each stage is part of the reproducible research workflow covered in this book.Tools for reproducibly gathering data are covered in Part II. Part III teachestools for tying the data we gathered to our statistical analyses and presenting

    17

  • 18 Reproducible Research with R and RStudio

    the results with tables and figures. Part IV discusses how to tie these findingsinto a variety of documents you can use to advertise your findings.

    Instead of starting to use the individual tools of reproducible research assoon as you learn them I recommend briefly stepping back and considering howthe stages of reproducible research tie together overall. This will make yourworkflow more coherent from the beginning and save you a lot of backtrackinglater on. Figure 2.1 illustrates the workflow. Notice that most of the arrowsconnecting the workflows parts point in both directions, indicating that youshould always be thinking how to make it easier to go backwards through yourresearch, i.e. reproduce it, as well as forwards.

    Around the edges of the figure are some of the commands you will learnto make it easier to go forwards and backwards through the process. Thesecommands tie your research together. For example, you can use API-basedR packages to gather data from the internet. You can use Rs merge com-mand to combine data gathered from dierent sources into one data set. ThegetURL from Rs RCurl package (Temple Lang, 2013a) and the read.tablecommands can be used to bring this data set into your statistical analyses.The knitr package then ties your analyses into your presentation documents.This includes the code you used, the figures you created, and, with the helpof tools such as the xtable package, tables of results. You can even tie multi-ple presentation documents together. For example, you can access the samefigure for use in a LaTeX article and a Markdown created website with theincludegraphics and ![]() commands, respectively. This helps you main-tain a consistent presentation of results across multiple document types. Wellcover these commands in detail throughout the book. See Table 2.1 for a briefbut more complete overview of the main tie commands.

    2.1.1 Reproducible theoryAn important part of the research process that I do not discuss in this bookis the theoretical stage. Ideally, if you are using a deductive research design,the bulk of this work will precede and guide the data gathering and analysisstages. Just because I dont cover this stage of the research process doesntmean that theory building cant and shouldnt be reproducible. It can in factbe the easiest part to make reproducible (Vandewalle et al., 2007, 1254).Quotes and paraphrases from previous works in the literature obviously needto be fully cited so that others can verify that they accurately reflect the sourcematerial. For mathematically based theory, clear and complete descriptions ofthe proofs should be given.

    Though I dont actively cover theory replication in depth in this book, Ido touch on some of the ways to incorporate proofs and citations into yourpresentation documents. These tools are covered in Part IV.

  • Getting Started with Reproducible Research 19

    FIGURE2.1

    Exam

    pleWorkflow

    &ASelec

    tionof

    Commands

    toTieitTo

    gether

    Raw

    Data

    Raw

    Data

    Raw

    Data

    DataGather

    Analysis

    LaTeXBo

    ok,

    Article,&

    Slideshow

    Presentatio

    ns

    Markdow

    n/HTMLWebsite

    Presentatio

    ns

    knitr

    inpu

    tin

    clud

    ein

    clud

    egra

    phic

    sPa

    ndoc

    ![](

    )

    knitr

    sour

    ceso

    urce

    _url

    prin

    t(xt

    able

    ())

    sour

    ce_d

    ata

    sour

    ce_D

    ropb

    oxDa

    tare

    ad.t

    able

    getU

    RL

    Make

    down

    load

    .file

    sour

    ce_d

    ata

    sour

    ce_D

    ropbox

    Data

    read

    .tab

    leme

    rge

    getU

    RLAPI-based

    packages

  • 20 Reproducible Research with R and RStudio

    2.2 Practical Tips for Reproducible ResearchBefore we start learning the details of the reproducible research workflow withR and RStudio, its useful to cover a few broad tips that will help you organizeyour research process and put these skills in perspective. The tips are:

    1. Document everything!,2. Everything is a (text) file,3. All files should be human readable,4. Explicitly tie your files together,5. Have a plan to organize, store, and make your files available.

    Using these tips will help make your computational research really repro-ducible.

    2.2.1 Document everything!In order to reproduce your research others must be able to know what you did.You have to tell them what you did by documenting as much of your researchprocess as possible. Ideally, you should tell your readers how you gatheredyour data, analyzed it, and presented the results. Documenting everything isthe key to reproducible research and lies behind all of the other tips in thischapter and tools you will learn throughout the book.

    Document your R session infoBefore discussing the other tips its important to learn a key part of docu-menting with R. You should record your session info. Many things in R havestayed the same since it was introduced in the early 1990s. This makes it easyfor future researchers to recreate what was done in the past. However, thingscan change from one version of R to another and from one version of an Rpackage to another. Also, the way R functions and especially how R packagesare handled can vary across dierent operating systems, so its important tonote what system you used. Finally, you may have R set to load packagesby default (see Section 3.1.8 for information about packages). These packagesmight be necessary to run your code, but other people might not know whatpackages and what versions of the packages were loaded from just looking atyour source code. The sessionInfo command in R prints a record of all ofthese things. The information from the session I used to create this book is:

    # Print R session infosessionInfo()

  • Getting Started with Reproducible Research 21

    ## R version 3.0.1 (2013-05-16)## Platform: x86_64-apple-darwin10.8.0 (64-bit)#### locale:## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8#### attached base packages:## [1] tools grid splines stats graphics grDevices utils## [8] datasets methods base#### other attached packages:## [1] ZeligBayesian_0.1 MCMCpack_1.3-3 coda_0.16-1## [4] Zelig_4.1-3 sandwich_2.2-10 boot_1.3-9## [7] xtable_1.7-1 WDI_2.2 ROAuth_0.9.3## [10] treebase_0.0-6 ape_3.0-8 texreg_1.25## [13] tables_0.7 sqldf_0.4-6.4 RSQLite.extfuns_0.0.1## [16] RSQLite_0.11.3 chron_2.3-43 gsubfn_0.6-5## [19] proto_0.3-10 DBI_0.2-7 slidify_0.3.3## [22] stargazer_3.0.1 RJSONIO_1.0-3 RCurl_1.95-4.1## [25] bitops_1.0-5 reshape2_1.2.2 repmis_0.2.5## [28] quantmod_0.4-0 TTR_0.22-0 xts_0.9-3## [31] zoo_1.7-9 Defaults_1.1-1 plyr_1.8## [34] openair_0.8-5 memisc_0.96-4 MASS_7.3-26## [37] lattice_0.20-15 markdown_0.5.4 knitcitations_0.4-4## [40] bibtex_0.3-5 Hmisc_3.10-1.1 survival_2.37-4## [43] httr_0.2 googleVis_0.4.2 ggplot2_0.9.3.1## [46] gdata_2.12.0.2 formatR_0.7 extrafont_0.14## [49] estout_1.2 digest_0.6.3 devtools_1.2## [52] data.table_1.8.8 countrycode_0.13 brew_1.0-6## [55] apsrtable_0.8-8 animation_2.2 knitr_1.2#### loaded via a namespace (and not attached):## [1] car_2.0-17 cluster_1.14.4 colorspace_1.2-2## [4] dichromat_2.0-0 evaluate_0.4.3 gtable_0.1.2## [7] gtools_2.7.1 labeling_0.1 latticeExtra_0.6-24## [10] Matrix_1.0-12 memoise_0.1 mgcv_1.7-22## [13] munsell_0.4 nlme_3.1-109 parallel_3.0.1## [16] RColorBrewer_1.0-5 rjson_0.2.12 Rttf2pt1_1.2## [19] scales_0.2.3 stringr_0.6.2 tcltk_3.0.1## [22] twitteR_1.1.6 whisker_0.3-2 XML_3.95-0.2## [25] yaml_2.1.7

    Chapter 4 gives specific details about how to create files with dynamicallyincluded session information.

    If you use non-R tools such as Pandoc, you should also record what versionsyou used.

    2.2.2 Everything is a (text) fileYour documentation is stored in files that include data, analysis code, the writeup of results, and explanations of these files (e.g. data set codebooks, session

  • 22 Reproducible Research with R and RStudio

    info files, and so on). Ideally, you should use the simplest file format possibleto store this information. Usually the simplest file format is the humble, butversatile, text file.1

    Text files are extremely nimble. They can hold your data in, for example,comma-separated values (.csv) format. They can contain your analysis codein .R files. And they can be the basis for your presentations as markup doc-uments like .tex or .md, for LaTeX and Markdown files, respectively. All ofthese files can be opened by any program that can read text files.

    One reason reproducible research is best stored in text files is that thishelps future proof your research. Other file formats, like those used by Mi-crosoft Word (.docx) or Excel (.xlsx), change regularly and may not becompatible with future versions of these programs. Text files, on the otherhand, can be opened by a very wide range of currently existing programs and,more likely than not, future ones as well. Even if future researchers do nothave R or a LaTeX distribution, they will still be able to open your text filesand, aided by frequent comments (see below), be able to understand how weconducted your research (Bowers, 2011, 3).

    Text files are also very easy to search and manipulate with a wide rangeof programssuch as R and RStudiothat can find and replace text charactersas well as merge and separate files. Finally, text files are easy to version andchanges can be tracked using programs such as Git (see Chapter 5).

    2.2.3 All files should be human readable

    Treat all of your research files as if someone who has not worked on theproject will, in the future, try to understand them. Computer code is a wayof communicating with the computer. It is machine readable in that thecomputer is able to use it to understand what you want to do.2 However, thereis a very good chance that other people (or you six months in the future) willnot understand what you were telling the computer. So, you need to makeall of your files human readable. To make them human readable, you shouldcomment on your code with the goal of communicating its design and purpose(Wilson et al., 2012). With this in mind it is a good idea to comment frequently(Bowers, 2011, 3) and format your code using a style guide (Nagler, 1995). Forespecially important pieces of code you should use literate programmingwherethe source code and the presentation text describing its design and purposeappear in the same document. Doing this will make it very clear to othershow you accomplished a piece of research.

    1Plain text files are usually given the file extension .txt. Depending on the size of yourdata set it may not be feasible to store it as a text file. Nonetheless, text files can still beused for analysis code and presentation files.

    2Of course, if the computer does not understand it will usually give an error message.

  • Getting Started with Reproducible Research 23

    Commenting

    In R everything on a line after a hash character#(also known as number,pound, or sharp) is ignored by R, but is readable to people who open the file.The hash character is a comment declaration character. You can use the #to place comments telling other people what you are doing. Here are someexamples:

    # A complete comment line2 + 2 # A comment after R code

    ## [1] 4

    On the first line the # is placed at the very beginning, so the entire line istreated as a comment. On the second line the # is placed after the simpleequation 2 + 2. R runs the equation and finds the answer 4, but it ignores allof the words after the hash.

    Dierent languages have dierent comment declaration characters. In La-TeX everything after the % percent sign is treated as a comment, and inmarkdown/HTML comments are placed inside of . The hash characteris used for comment declaration in command line shell scripts.

    Nagler (1995, 491) gives some advice on when and how to use comments:

    write a comment before a block of code describing what the code does,

    comment on any line of code that is ambiguous.

    In this book I follow these guidelines when displaying code.He also suggests that all of your source code files should begin with a

    comment header. At the least the header should include:

    a description of what the file does,

    the date it was last updated,

    the name of the files creator and any contributors.

    You may also want to include other information in the header such as what filesit depends on, what output files it produces, what version of the programminglanguage you are using, or sources that may have influenced the code. Here isan example of a minimal file header for an R source code file that creates thethird figure in an article titled My Article:

  • 24 Reproducible Research with R and RStudio

    ################### R Source code file used to create Figure 3 in My Article# Created by Christopher Gandrud# Updated 1 May 2013##################

    Feel free to use things like the long series of hash marks above and below theheader, white space, and indentations to make your comments more readable.

    Style guidesIn natural language writing you dont necessarily have to follow a style guide.People could probably figure out what you are trying to say, but it is a loteasier for your readers if you use consistent rules. The same is true whenwriting computer code. Its good to follow consistent rules for formatting yourcode so that its easier for you and others to understand.

    There are a number of R style guides. Most of them are similar to theGoogle R Style Guide.3 Hadley Wickham also has a nicely presented R styleguide.4 You may want to use the formatR (Xie, 2012) package to automaticallyreformat your code so that it is easier to read.

    Literate programmingFor particularly important pieces of research code it may be useful to notonly comment on the source file, but also display code in presentation text.For example, you may want to include key parts of the code you used foryour main statistical models and an explanation of this code in an appendixfollowing your article. This is commonly referred to as literate programming(Knuth, 1992).

    2.2.4 Explicitly tie your files togetherIf everything is just a text file then research projects can be thought of asindividual text files that have a relationship with one another. They are tiedtogether. A data file is used as input for an analysis file. The results of ananalysis are shown and discussed in a markup file that is used to create aPDF document. Researchers often do not explicitly document the relationshipsbetween files that they used in their research. For example, the results ofan analysisa table or figuremay be copied and pasted into a presentationdocument. It can be very dicult for future researchers to trace the table orfigure back to a particular statistical model and a particular data set without

    3See: http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html.4You can find it at https://github.com/hadley/devtools/wiki/Style.

  • Getting Started with Reproducible Research 25

    clear documentation. Therefore, it is important to make the links betweenyour files explicit.

    Tie commands are the most dynamic way to explicitly link your files to-gether. These commands instruct the computer program you are using to useinformation from another file. In Table 2.1 I have compiled a selection of keytie commands you will learn how to use in this book. Well discuss many more,but these are some of the most important.

    2.2.5 Have a plan to organize, store, & make your files avail-able

    Finally, in order for independent researchers to reproduce your work they needto be able access the files that instruct them how to do this. Files also needto be organized so that independent researchers can figure out how they fittogether. So, from the beginning of your research process you should have aplan for organizing your files and a way to make them accessible.

    One rule of thumb for organizing your research in files is to limit theamount of content any one file has. Files that contain many dierent opera-tions can be very dicult to navigate, even if they have detailed comments.For example, it would be very dicult to find any particular operation in a filethat contained the code used to gather the data, run all of the statistical mod-els, and create the results figures and tables. If you have a hard time findingthings in a file you created, think of the diculties independent researcherswill have!

    Because we have so many ways to link files together there is really no needto lump many dierent operations into one file. So, we can make our filesmodular. One source code file should be used to complete one or just a fewtasks. Breaking your operations into discrete parts will also make it easier foryou and others to find errors (Nagler, 1995, 490).

    Chapter 4 discusses file organization in much more detail. Chapter 5teaches you a number of ways to make your files accessible through the cloudcomputing services Dropbox and GitHub.

  • 26 Reproducible Research with R and RStudio

    TABLE 2.1A Selection of Commands/Packages/Programs for Tying Together Your Re-search FilesCommand/Package/Program

    Language Description ChaptersDiscussed

    knitr R R package with commands for tyinga