RecordKeepingin Computing - GitHub Pages · RecordKeepingin Computing PERSPECTIVE Good enough...

Record Keeping in Computing

PERSPECTIVE

Good enough practices in scientific computingGreg Wilson1☯*, Jennifer Bryan2☯, Karen Cranston3☯, Justin Kitzes4☯, Lex Nederbragt5☯,Tracy K. Teal6☯

1 Software Carpentry Foundation, Austin, Texas, United States of America, 2 RStudio and Department ofStatistics, University of British Columbia, Vancouver, British Columbia, Canada, 3 Department of Biology,Duke University, Durham, North Carolina, United States of America, 4 Energy and Resources Group,University of California, Berkeley, Berkeley, California, United States of America, 5 Centre for Ecological andEvolutionary Synthesis, University of Oslo, Oslo, Norway, 6 Data Carpentry, Davis, California, United Statesof America

☯ These authors contributed equally to this work.* [email protected]

Author summary

Computers are now essential in all branches of science, but most researchers are nevertaught the equivalent of basic lab skills for research computing. As a result, data can getlost, analyses can take much longer than necessary, and researchers are limited in howeffectively they can work with software and data. Computing workflows need to followthe same practices as lab projects and notebooks, with organized data, documented steps,and the project structured for reproducibility, but researchers new to computing oftendon’t know where to start. This paper presents a set of good computing practices thatevery researcher can adopt, regardless of their current level of computational skill. Thesepractices, which encompass data management, programming, collaborating with col-leagues, organizing projects, tracking work, and writing manuscripts, are drawn from awide variety of published sources from our daily lives and from our work with volunteerorganizations that have delivered workshops to over 11,000 people since 2010.

Overview

We present a set of computing tools and techniques that every researcher can and should con-sider adopting. These recommendations synthesize inspiration from our own work, from theexperiences of the thousands of people who have taken part in Software Carpentry and DataCarpentry workshops over the past 6 years, and from a variety of other guides. Our recom-mendations are aimed specifically at people who are new to research computing.

Introduction

Three years ago, a group of researchers involved in Software Carpentry and Data Carpentrywrote a paper called "Best Practices for Scientific Computing" [1]. That paper provided recom-mendations for people who were already doing significant amounts of computation in theirresearch. However, as computing has become an essential part of science for all researchers,there is a larger group of people new to scientific computing, and the question then becomes,"where to start?"

PLOS Computational Biology | https://doi.org/10.1371/journal.pcbi.1005510 June 22, 2017 1 / 20

a1111111111a1111111111a1111111111a1111111111a1111111111

OPENACCESS

Citation: Wilson G, Bryan J, Cranston K, Kitzes J,Nederbragt L, Teal TK (2017) Good enoughpractices in scientific computing. PLoS ComputBiol 13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510

Editor: Francis Ouellette, Ontario Institute forCancer Research, CANADA

Published: June 22, 2017

Copyright: © 2017 Wilson et al. This is an openaccess article distributed under the terms of theCreative Commons Attribution License, whichpermits unrestricted use, distribution, andreproduction in any medium, provided the originalauthor and source are credited.

Funding: The authors received no specific fundingfor this work.

Competing interests: The authors have declaredthat no competing interests exist.

PERSPECTIVE

Good enough practices in scientific computingGreg Wilson1☯*, Jennifer Bryan2☯, Karen Cranston3☯, Justin Kitzes4☯, Lex Nederbragt5☯,Tracy K. Teal6☯

1 Software Carpentry Foundation, Austin, Texas, United States of America, 2 RStudio and Department ofStatistics, University of British Columbia, Vancouver, British Columbia, Canada, 3 Department of Biology,Duke University, Durham, North Carolina, United States of America, 4 Energy and Resources Group,University of California, Berkeley, Berkeley, California, United States of America, 5 Centre for Ecological andEvolutionary Synthesis, University of Oslo, Oslo, Norway, 6 Data Carpentry, Davis, California, United Statesof America

☯ These authors contributed equally to this work.* [email protected]

Author summary

Computers are now essential in all branches of science, but most researchers are nevertaught the equivalent of basic lab skills for research computing. As a result, data can getlost, analyses can take much longer than necessary, and researchers are limited in howeffectively they can work with software and data. Computing workflows need to followthe same practices as lab projects and notebooks, with organized data, documented steps,and the project structured for reproducibility, but researchers new to computing oftendon’t know where to start. This paper presents a set of good computing practices thatevery researcher can adopt, regardless of their current level of computational skill. Thesepractices, which encompass data management, programming, collaborating with col-leagues, organizing projects, tracking work, and writing manuscripts, are drawn from awide variety of published sources from our daily lives and from our work with volunteerorganizations that have delivered workshops to over 11,000 people since 2010.

Overview

We present a set of computing tools and techniques that every researcher can and should con-sider adopting. These recommendations synthesize inspiration from our own work, from theexperiences of the thousands of people who have taken part in Software Carpentry and DataCarpentry workshops over the past 6 years, and from a variety of other guides. Our recom-mendations are aimed specifically at people who are new to research computing.

Introduction

Three years ago, a group of researchers involved in Software Carpentry and Data Carpentrywrote a paper called "Best Practices for Scientific Computing" [1]. That paper provided recom-mendations for people who were already doing significant amounts of computation in theirresearch. However, as computing has become an essential part of science for all researchers,there is a larger group of people new to scientific computing, and the question then becomes,"where to start?"

PLOS Computational Biology | https://doi.org/10.1371/journal.pcbi.1005510 June 22, 2017 1 / 20

a1111111111a1111111111a1111111111a1111111111a1111111111

OPENACCESS

Citation: Wilson G, Bryan J, Cranston K, Kitzes J,Nederbragt L, Teal TK (2017) Good enoughpractices in scientific computing. PLoS ComputBiol 13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510

Editor: Francis Ouellette, Ontario Institute forCancer Research, CANADA

Published: June 22, 2017

Copyright: © 2017 Wilson et al. This is an openaccess article distributed under the terms of theCreative Commons Attribution License, whichpermits unrestricted use, distribution, andreproduction in any medium, provided the originalauthor and source are credited.

Funding: The authors received no specific fundingfor this work.

Competing interests: The authors have declaredthat no competing interests exist.

http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510

http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510


https://dbsloan.github.io/TS2018/exercises/r_markdown.html

Introduction to R and R Markdown

https://dbsloan.github.io/TS2018/exercises/r_markdown.html

Data Visualization

Images: Circos.ca

Data Visualization

Data Visualization

Quality Figures for Papers and Presentations

• Clear and accurate representation of your data

• Clean, professional, and aesthetically pleasing appearance

• Efficient, reproducible, and automated

Data Visualization

Writing Code to Generate Figuresggplot(cnld) + geom_point(aes(x=CumPos, y=r2, size=0.75, colour=as.factor(ChromPrint), alpha = 1/8)) + scale_size_identity() + theme_bw(base_size=15) + scale_color_manual(values=c(rep(c('black', 'dark gray'),11), 'black', 'red')) + scale_x_continuous(expand = c(0.015, 0.015),labels=c(as.character(1:chrNum), "X"), breaks=bpMidVec) + theme(plot.margin = unit( c(0.03,0.03,0.03,0.03) , "in" ), legend.position='none', axis.text.x = element_text(size=6), axis.text.y = element_text(size=7), axis.title.x = element_text(size=8), axis.title.y = element_text(size=8)) + xlab('Chromosome Position') + ylab(expression(paste("Mitonuclear LD (",r^2, ")")))

Data VisualizationG

ene

Expr

essi

on (R

NA-

seq

Rea

d D

epth

)

Chromosome Position (kb)

Data Visualization

https://dbsloan.github.io/TS2018/exercises/ggplot.html

Plotting Genomic Data with R and ggplot

https://dbsloan.github.io/TS2018/exercises/ggplot.html

Data Visualization

inferences based on gene-order support earlier sequence-based analyses that suggested that these two insect endosym-bionts form a monophyletic group (fig. 5a–c) (Spaulding andvon Dohlen 1998; Thao and Baumann 2004; Sloan andMoran 2012b).

Proliferation of Short Tandem Repeats and the IndelSpectrum in Portiera BT Intergenic Regions

Despite containing a slightly larger number of genes, thePortiera TV genome is approximately 20% smaller in sizethan its counterparts in B. tabaci (table 2). The size differencereflects the presence of much longer intergenic sequences inPortiera BT (fig. 6), which may have resulted from a relaxationof the typical bias in the indel mutation spectrum. In contrastto the pattern observed in other bacteria (Mira et al. 2001;Kuo and Ochman 2009), recent intergenic indels in Portiera BTdo not exhibit an excess of deletions (fig. 7). Notably, thePortiera BT indel spectrum is highly enriched for 7-bp changes(fig. 7). All these indels occur in the context of short tandemrepeats, and the distribution of tandem repeat lengths in thePortiera BT genome exhibits a dramatic spike at 7 bp that isnot found in other insect endosymbionts, including Portiera TV(fig. 8). These repeats are concentrated in intergenic regions(fig. 1) and spatially correlated with structural variation in thePortiera BT genome identified by paired-end read conflicts(r¼ 0.45; P<0.0001). In contrast, tandem repeats in thePortiera TV genome are rare and not significantly correlatedwith paired-end read conflicts (r¼ 0.03; P¼0.43).

Accurately characterizing an indel spectrum requires high-quality genome sequences and an appropriate model for in-ferring ancestral states. In identifying indels that were uniqueamong the four Portiera BT genomes, we found that the twogenomes from B. tabaci biotype B exhibited a higher degree ofsimilarity than the two frombiotype Q. Althoughwe identifiednumerous indels between the two Portiera BT-B genomes,these only occurred in structurally variable regions that werenot conserved with Portiera BT-Q and, therefore, had to beexcluded from the analysis. We found that 23 of 28 of theshort indels (<4bp) that were unique to one of the twoPortiera BT-Q genomes were in or adjacent to single-nucleotide repeat regions (homopolymers). By itself, this is

not surprising because homopolymers are prone to highrates of indel mutations, but we unexpectedly found that aclear majority (18 of 23) of these indels occurred in one of thetwo genomes (GenBank accession CP003835). Although thisgenome may have experienced a higher rate of structuralchange, it is also possible that some of the homopolymer-associated indels in this genome represent sequencingerrors, because it was generated largely with the Roche 454platform, a technology that is known to produce impreciseestimates of homopolymer length (Santos-Garcia et al. 2012).Therefore, sequencing errors probably added noise to ourestimate of the indel spectrum. Although these inaccuraciesshould not pertain to the large number of indels found inmicrosatellite regions, short tandem repeats are prone to re-current mutations (homoplasy) and, therefore, may violateour parsimony assumption for inferring ancestral states.

E. coliPseudomonasHalomonasChromohalobacterCarsonellaPortiera TVPortiera BT

E. coliPseudomonasHalomonasChromohalobacterCarsonellaPortiera TVPortiera BT

E. coliPseudomonasHalomonasChromohalobacterCarsonellaPortiera TVPortiera BT Carsonella

Portiera BT

Portiera TV 14.5

2.5

21.5

A B D C

1.00

1.00 1.00

0.41 100

100 100

97

FIG. 5.—Phylogenetic relationships among select Gammaproteobacteria as inferred from gene order and orientation with MGR (A), TIBA (B), and

BADGER (C). Bipartition support is indicated with bootstrap values and Bayesian posterior probabilities from the TIBA and BADGER analyses, respectively. A

relative rate test using Carsonella as the closest outgroup to Portiera found that inversions had accumulated asymmetrically in the Portiera BT genome (D).

Portiera BT (244.9-278.6 kb)

Portiera TV (183.5-203.6 kb)

FIG. 6.—Abnormally large intergenic regions in the Portiera BT

genome. Gray shading connects orthologous genes that are shared be-

tween strains in a region of conserved synteny. Yellow blocks represent

genes without an intact ortholog in the other strain.

0

4

8

12

16

20

<-20

-1

9 -1

7 -1

5 -1

3 -1

1 -9

-7

-5

-3

-1 1 3 5 7 9 11

13

15

17

19

>20

Num

ber

of In

dels

Indel Size (bp)

32 Insertions (170 bp total) 22 Deletions (-133 bp total)

Mean Indel = +0.69 bp

FIG. 7.—Size distribution of indels in Portiera BT.

Evolution of Genomic Instability GBE

Genome Biol. Evol. 5(5):783–793. doi:10.1093/gbe/evt044 Advance Access publication March 29, 2013 789

by guest on May 3, 2013

http://gbe.oxfordjournals.org/D

ownloaded from

Data Visualization

trichopoda. When mapped back against their correspondingplastid DNA sequences, these fragments cover anywhere from0.5% to 87.2% of the plastid genome (table 1 and fig. 2).Plastid-derived sequences accounted for less than 1% ofmany angiosperm mitochondrial genomes. At the other ex-treme, they represent 10.3% of the Boea hygrometrica mito-chondrial genome. An even higher percentage has beenreported for the mitochondrial genome of Cucurbita pepo(Alverson et al. 2010), but this species was not included inour study because it lacks a sequenced plastid genome. Thesevalues are based on identified fragments of at least 200 bp inlength, but including smaller fragments does not increase thetotals substantially.

The largest mtpt fragment was 12.6 kb in length (foundin Zea mays), but there was clear evidence that some of the ex-isting sequences were part of larger transfers that were sub-sequently broken up by large deletions and rearrangements.

For example, the Silene conica mitochondrial genome con-tains two mtpt fragments from a 35-kb region of plastidDNA (fig. 3). Although these fragments are now located ondifferent chromosomes in the S. conica mitochondrialgenome, their corresponding boundaries precisely abutin the plastid genome, suggesting that they were derivedfrom a single transfer that was subsequently split by a rear-rangement. This transfer appears to have occurred relativelyrecently because it shares the derived inversion found in theS. conica plastid genome (Sloan, Alverson, Wu, et al. 2012).The transferred 35 kb sequence has been reduced to only18 kb by a series of 23 large deletions ranging from 96to 4,615 bp in size. Many of these were likely associatedwith a microhomology-mediated repair process (Derianoand Roth 2013), as 14 of the 23 deletions show small regions(7–18 bp) of sequence similarity between the pair of deletionbreakpoints (fig. 3).

FIG. 2.—Origins of mtpts from the plastid genome. The location of each mtpt fragment (minimum 200 bp) within the plastid genome. Shading indicates

nucleotide sequence identity (excluding gaps) relative to the corresponding plastid sequence. The Nicotiana tabacum plastid genome was used as a reference

for defining position. The map of the N. tabacum plastid genome at the bottom of the figure was generated with OGDRAW v1.2 (Lohse et al. 2007) after

removing the second copy of the inverted repeat.

Sloan and Wu GBE

3214 Genome Biol. Evol. 6(12):3210–3221. doi:10.1093/gbe/evu253 Advance Access publication November 21, 2014

by guest on Decem

ber 20, 2014http://gbe.oxfordjournals.org/

Dow

nloaded from

Data Visualization

DiphylleiaThecamonasHomoCrassostreaTrichoplaxOscarellaMnemiopsisAllomycesSaccharomycesMonosigaNucleariaHartmannellaAcanthamoebaDictyosteliumMa. jakobiformisMa. californianaReclinomonasAndaluciaNaegleriaAcrasisAmoeboid BB2TsukubamonasTrypanosomaEuglena gracilisAlexandriumChromeraPlasmodiumTetrahymenaColponemid−likeNyctotherusChattonellaPhaeodactylumCafeteriaBlastocystisBigelowiellaParacercomonasBrevimastigomonasEmilianiaChrysochromulinaAncoracystaRhodomonasHemiselmisPalpitomonasPicozoan sp. MS584−11CyanidioschyzonPorphyraGlaucocystisCyanophoraChlamydomonasProtothecaPycnococcusNephroselmisMesostigmaMarchantiaNicotiana

rps16

rpl23

rpl35

rpl36

tatA

tufA

rpoC

rpoB

rpl34

rpl32

rpoD

rpoA

rpl27

rpl10

rpl1

atp3

ccmA

rpl18

rpl19

secY

cox11

rpl31

rpl20

rps1

ccmC

sdh4

rps10

nad10

sdh2

ccmB

sdh3

rpl11

rps7

nad11

rps19

atp1

rps11

rpl2

nad8

ccmF

rpl5

rpl14

rps2

atp4

rps8

rpl6

tatC

rps4

rps14

nad9

rps13

nad7

rpl16

nad4L

rps3

nad6

rps12

atp8

nad3

nad5

atp9

nad2

nad4

nad1

cox3

cox2

atp6

cox1

cob

Translation TranscriptionComplex I Complex II Complex III Complex IV Complex V Cyt C Maturation Membrane Transport

Viridiplantae

Gla

Rho

Crypt

Hap

Rhiz

Stramen

Alveolata

Excavata

AmOpisthokonta

Data Visualization

https://dbsloan.github.io/TS2018/exercises/r_figure_drawing.html

Figure Drawing in R

https://dbsloan.github.io/TS2018/exercises/r_figure_drawing.html

Data Visualization

Data Visualization

https://dbsloan.github.io/TS2018/exercises/circos.html

Visualizing Genomic Data with Circos

https://dbsloan.github.io/TS2018/exercises/circos.html

RecordKeepingin Computing - GitHub Pages · RecordKeepingin Computing PERSPECTIVE Good enough...

Documents

Transcript of RecordKeepingin Computing - GitHub Pages · RecordKeepingin Computing PERSPECTIVE Good enough...