Record Keeping in Computing
PERSPECTIVE
Good enough practices in scientific computing
Greg Wilson1☯*, Jennifer Bryan2☯, Karen Cranston3☯, Justin Kitzes4☯, Lex Nederbragt5☯, Tracy K. Teal6☯
1 Software Carpentry Foundation, Austin, Texas, United States of America, 2 RStudio and Department of Statistics, University of British Columbia, Vancouver, British Columbia, Canada, 3 Department of Biology, Duke University, Durham, North Carolina, United States of America, 4 Energy and Resources Group, University of California, Berkeley, Berkeley, California, United States of America, 5 Centre for Ecological and Evolutionary Synthesis, University of Oslo, Oslo, Norway, 6 Data Carpentry, Davis, California, United States of America
☯ These authors contributed equally to this work.
* [email protected]
Author summary
Computers are now essential in all branches of science, but most researchers are never taught the equivalent of basic lab skills for research computing. As a result, data can get lost, analyses can take much longer than necessary, and researchers are limited in how effectively they can work with software and data. Computing workflows need to follow the same practices as lab projects and notebooks, with organized data, documented steps, and the project structured for reproducibility, but researchers new to computing often don’t know where to start. This paper presents a set of good computing practices that every researcher can adopt, regardless of their current level of computational skill. These practices, which encompass data management, programming, collaborating with colleagues, organizing projects, tracking work, and writing manuscripts, are drawn from a wide variety of published sources, from our daily lives, and from our work with volunteer organizations that have delivered workshops to over 11,000 people since 2010.
Overview
We present a set of computing tools and techniques that every researcher can and should consider adopting. These recommendations synthesize inspiration from our own work, from the experiences of the thousands of people who have taken part in Software Carpentry and Data Carpentry workshops over the past 6 years, and from a variety of other guides. Our recommendations are aimed specifically at people who are new to research computing.
Introduction
Three years ago, a group of researchers involved in Software Carpentry and Data Carpentry wrote a paper called "Best Practices for Scientific Computing" [1]. That paper provided recommendations for people who were already doing significant amounts of computation in their research. However, as computing has become an essential part of science for all researchers, there is a larger group of people new to scientific computing, and the question then becomes, "where to start?"
PLOS Computational Biology | https://doi.org/10.1371/journal.pcbi.1005510 June 22, 2017 1 / 20
OPEN ACCESS
Citation: Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, Teal TK (2017) Good enough practices in scientific computing. PLoS Comput Biol 13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510
Editor: Francis Ouellette, Ontario Institute for Cancer Research, CANADA
Published: June 22, 2017
Copyright: © 2017 Wilson et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: The authors received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510
Record Keeping in Computing
https://dbsloan.github.io/TS2018/exercises/r_markdown.html
Introduction to R and R Markdown
Data Visualization
Images: Circos.ca
Quality Figures for Papers and Presentations
• Clear and accurate representation of your data
• Clean, professional, and aesthetically pleasing appearance
• Efficient, reproducible, and automated
Writing Code to Generate Figures
    # cnld, chrNum, and bpMidVec are not defined on this slide;
    # they are assumed to be created earlier in the analysis.
    library(ggplot2)
    library(grid)  # provides unit()

    ggplot(cnld) +
      geom_point(aes(x=CumPos, y=r2, size=0.75, colour=as.factor(ChromPrint), alpha = 1/8)) +
      scale_size_identity() +
      theme_bw(base_size=15) +
      # alternate black and gray between chromosomes; the last group is red
      scale_color_manual(values=c(rep(c('black', 'dark gray'),11), 'black', 'red')) +
      # label each chromosome at the breaks given in bpMidVec
      scale_x_continuous(expand = c(0.015, 0.015), labels=c(as.character(1:chrNum), "X"), breaks=bpMidVec) +
      theme(plot.margin = unit(c(0.03,0.03,0.03,0.03), "in"),
            legend.position='none',
            axis.text.x = element_text(size=6),
            axis.text.y = element_text(size=7),
            axis.title.x = element_text(size=8),
            axis.title.y = element_text(size=8)) +
      xlab('Chromosome Position') +
      ylab(expression(paste("Mitonuclear LD (",r^2, ")")))
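The slide's code depends on objects that are not shown, so it cannot be run on its own. The following is a minimal, self-contained sketch of the same idea (a script that builds a per-chromosome scatter plot and writes it to a file), not the author's actual pipeline. The data frame `toy`, its columns, the `mid_pos` vector, and the output file name are all hypothetical stand-ins invented for illustration.

```r
library(ggplot2)

# Hypothetical toy data standing in for the `cnld` data frame on the slide:
# 3 chromosomes, 50 positions each, with a made-up r^2 value per position.
set.seed(1)
toy <- data.frame(
  Chrom  = rep(1:3, each = 50),
  CumPos = 1:150,
  r2     = runif(150, min = 0, max = 0.1)
)

# Midpoint of each chromosome's position range, used to place x-axis labels
# (one plausible way to build something like the slide's bpMidVec).
mid_pos <- tapply(toy$CumPos, toy$Chrom, function(x) mean(range(x)))

p <- ggplot(toy) +
  geom_point(aes(x = CumPos, y = r2, colour = as.factor(Chrom))) +
  scale_colour_manual(values = c("black", "darkgray", "black")) +
  scale_x_continuous(breaks = as.numeric(mid_pos), labels = names(mid_pos)) +
  theme_bw(base_size = 15) +
  theme(legend.position = "none") +
  xlab("Chromosome Position") +
  ylab(expression(paste("LD (", r^2, ")")))

# Saving from the script (rather than exporting by hand) means the figure
# can be regenerated whenever the data or the styling changes.
ggsave("toy_ld_plot.pdf", plot = p, width = 6, height = 3, units = "in")
```

Because every step from data to saved file lives in one script, rerunning it reproduces the figure exactly, which is the "efficient, reproducible, and automated" goal stated earlier in the slides.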
[Figure: Gene Expression (RNA-seq Read Depth) plotted against Chromosome Position (kb)]
https://dbsloan.github.io/TS2018/exercises/ggplot.html
Plotting Genomic Data with R and ggplot
inferences based on gene-order support earlier sequence-based analyses that suggested that these two insect endosymbionts form a monophyletic group (fig. 5a–c) (Spaulding and von Dohlen 1998; Thao and Baumann 2004; Sloan and Moran 2012b).
Proliferation of Short Tandem Repeats and the Indel Spectrum in Portiera BT Intergenic Regions
Despite containing a slightly larger number of genes, the Portiera TV genome is approximately 20% smaller in size than its counterparts in B. tabaci (table 2). The size difference reflects the presence of much longer intergenic sequences in Portiera BT (fig. 6), which may have resulted from a relaxation of the typical bias in the indel mutation spectrum. In contrast to the pattern observed in other bacteria (Mira et al. 2001; Kuo and Ochman 2009), recent intergenic indels in Portiera BT do not exhibit an excess of deletions (fig. 7). Notably, the Portiera BT indel spectrum is highly enriched for 7-bp changes (fig. 7). All these indels occur in the context of short tandem repeats, and the distribution of tandem repeat lengths in the Portiera BT genome exhibits a dramatic spike at 7 bp that is not found in other insect endosymbionts, including Portiera TV (fig. 8). These repeats are concentrated in intergenic regions (fig. 1) and spatially correlated with structural variation in the Portiera BT genome identified by paired-end read conflicts (r = 0.45; P < 0.0001). In contrast, tandem repeats in the Portiera TV genome are rare and not significantly correlated with paired-end read conflicts (r = 0.03; P = 0.43).
Accurately characterizing an indel spectrum requires high-quality genome sequences and an appropriate model for inferring ancestral states. In identifying indels that were unique among the four Portiera BT genomes, we found that the two genomes from B. tabaci biotype B exhibited a higher degree of similarity than the two from biotype Q. Although we identified numerous indels between the two Portiera BT-B genomes, these only occurred in structurally variable regions that were not conserved with Portiera BT-Q and, therefore, had to be excluded from the analysis. We found that 23 of 28 of the short indels (<4 bp) that were unique to one of the two Portiera BT-Q genomes were in or adjacent to single-nucleotide repeat regions (homopolymers). By itself, this is not surprising because homopolymers are prone to high rates of indel mutations, but we unexpectedly found that a clear majority (18 of 23) of these indels occurred in one of the two genomes (GenBank accession CP003835). Although this genome may have experienced a higher rate of structural change, it is also possible that some of the homopolymer-associated indels in this genome represent sequencing errors, because it was generated largely with the Roche 454 platform, a technology that is known to produce imprecise estimates of homopolymer length (Santos-Garcia et al. 2012). Therefore, sequencing errors probably added noise to our estimate of the indel spectrum. Although these inaccuracies should not pertain to the large number of indels found in microsatellite regions, short tandem repeats are prone to recurrent mutations (homoplasy) and, therefore, may violate our parsimony assumption for inferring ancestral states.
[Figure 5: panels A to D. Taxa shown: E. coli, Pseudomonas, Halomonas, Chromohalobacter, Carsonella, Portiera TV, Portiera BT.]
FIG. 5. Phylogenetic relationships among select Gammaproteobacteria as inferred from gene order and orientation with MGR (A), TIBA (B), and BADGER (C). Bipartition support is indicated with bootstrap values and Bayesian posterior probabilities from the TIBA and BADGER analyses, respectively. A relative rate test using Carsonella as the closest outgroup to Portiera found that inversions had accumulated asymmetrically in the Portiera BT genome (D).
[Figure 6: aligned genome regions, Portiera BT (244.9-278.6 kb) and Portiera TV (183.5-203.6 kb).]
FIG. 6. Abnormally large intergenic regions in the Portiera BT genome. Gray shading connects orthologous genes that are shared between strains in a region of conserved synteny. Yellow blocks represent genes without an intact ortholog in the other strain.
[Figure 7: Number of Indels (y-axis, 0 to 20) plotted against Indel Size (bp) (x-axis, <-20 to >20). Annotations: 32 Insertions (170 bp total); 22 Deletions (-133 bp total); Mean Indel = +0.69 bp.]
FIG. 7. Size distribution of indels in Portiera BT.
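The mean indel size annotated in Figure 7 can be checked against the insertion and deletion totals. This assumes the mean is the net length change divided by the total number of indels; the source does not state how the value was computed.

```latex
\[
\text{Mean indel}
  = \frac{170\ \text{bp} - 133\ \text{bp}}{32 + 22}
  = \frac{37\ \text{bp}}{54}
  \approx +0.69\ \text{bp}
\]
```

The small positive value matches the text's statement that recent intergenic indels in Portiera BT do not show an excess of deletions.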
Evolution of Genomic Instability GBE
Genome Biol. Evol. 5(5):783–793. doi:10.1093/gbe/evt044 Advance Access publication March 29, 2013 789
trichopoda. When mapped back against their corresponding plastid DNA sequences, these fragments cover anywhere from 0.5% to 87.2% of the plastid genome (table 1 and fig. 2). Plastid-derived sequences accounted for less than 1% of many angiosperm mitochondrial genomes. At the other extreme, they represent 10.3% of the Boea hygrometrica mitochondrial genome. An even higher percentage has been reported for the mitochondrial genome of Cucurbita pepo (Alverson et al. 2010), but this species was not included in our study because it lacks a sequenced plastid genome. These values are based on identified fragments of at least 200 bp in length, but including smaller fragments does not increase the totals substantially.
The largest mtpt fragment was 12.6 kb in length (found in Zea mays), but there was clear evidence that some of the existing sequences were part of larger transfers that were subsequently broken up by large deletions and rearrangements.
For example, the Silene conica mitochondrial genome contains two mtpt fragments from a 35-kb region of plastid DNA (fig. 3). Although these fragments are now located on different chromosomes in the S. conica mitochondrial genome, their corresponding boundaries precisely abut in the plastid genome, suggesting that they were derived from a single transfer that was subsequently split by a rearrangement. This transfer appears to have occurred relatively recently because it shares the derived inversion found in the S. conica plastid genome (Sloan, Alverson, Wu, et al. 2012). The transferred 35 kb sequence has been reduced to only 18 kb by a series of 23 large deletions ranging from 96 to 4,615 bp in size. Many of these were likely associated with a microhomology-mediated repair process (Deriano and Roth 2013), as 14 of the 23 deletions show small regions (7–18 bp) of sequence similarity between the pair of deletion breakpoints (fig. 3).
FIG. 2. Origins of mtpts from the plastid genome. The location of each mtpt fragment (minimum 200 bp) within the plastid genome. Shading indicates nucleotide sequence identity (excluding gaps) relative to the corresponding plastid sequence. The Nicotiana tabacum plastid genome was used as a reference for defining position. The map of the N. tabacum plastid genome at the bottom of the figure was generated with OGDRAW v1.2 (Lohse et al. 2007) after removing the second copy of the inverted repeat.
Sloan and Wu GBE
3214 Genome Biol. Evol. 6(12):3210–3221. doi:10.1093/gbe/evu253 Advance Access publication November 21, 2014
[Figure: genes plotted against taxa.
Taxa: Diphylleia, Thecamonas, Homo, Crassostrea, Trichoplax, Oscarella, Mnemiopsis, Allomyces, Saccharomyces, Monosiga, Nuclearia, Hartmannella, Acanthamoeba, Dictyostelium, Ma. jakobiformis, Ma. californiana, Reclinomonas, Andalucia, Naegleria, Acrasis, Amoeboid BB2, Tsukubamonas, Trypanosoma, Euglena gracilis, Alexandrium, Chromera, Plasmodium, Tetrahymena, Colponemid-like, Nyctotherus, Chattonella, Phaeodactylum, Cafeteria, Blastocystis, Bigelowiella, Paracercomonas, Brevimastigomonas, Emiliania, Chrysochromulina, Ancoracysta, Rhodomonas, Hemiselmis, Palpitomonas, Picozoan sp. MS584-11, Cyanidioschyzon, Porphyra, Glaucocystis, Cyanophora, Chlamydomonas, Prototheca, Pycnococcus, Nephroselmis, Mesostigma, Marchantia, Nicotiana.
Genes: rps16, rpl23, rpl35, rpl36, tatA, tufA, rpoC, rpoB, rpl34, rpl32, rpoD, rpoA, rpl27, rpl10, rpl1, atp3, ccmA, rpl18, rpl19, secY, cox11, rpl31, rpl20, rps1, ccmC, sdh4, rps10, nad10, sdh2, ccmB, sdh3, rpl11, rps7, nad11, rps19, atp1, rps11, rpl2, nad8, ccmF, rpl5, rpl14, rps2, atp4, rps8, rpl6, tatC, rps4, rps14, nad9, rps13, nad7, rpl16, nad4L, rps3, nad6, rps12, atp8, nad3, nad5, atp9, nad2, nad4, nad1, cox3, cox2, atp6, cox1, cob.
Gene categories: Translation, Transcription, Complex I, Complex II, Complex III, Complex IV, Complex V, Cyt C Maturation, Membrane Transport.
Taxon group labels (as extracted, some truncated): Viridiplantae, Gla, Rho, Crypt, Hap, Rhiz, Stramen, Alveolata, Excavata, Am, Opisthokonta.]
https://dbsloan.github.io/TS2018/exercises/r_figure_drawing.html
Figure Drawing in R
https://dbsloan.github.io/TS2018/exercises/circos.html
Visualizing Genomic Data with Circos