DiscoPlot: Discordant read visualisation - PeerJ · To address the problem of visualising growing...
Transcript of DiscoPlot: Discordant read visualisation - PeerJ · To address the problem of visualising growing...
DiscoPlot: Discordant read visualisation 1
Mitchell J. Sullivana, Scott A. Beatsona* 2
aAustralian Infectious Diseases Research Centre, and School of Chemistry and Molecular 3
Biosciences, The University of Queensland, Brisbane, QLD 4072, Australia 4
*To whom correspondence should be addressed. Tel: +61 7 3365 4863; Email: [email protected] 5
Keywords: Bioinformatics, software, structural variation, misalignment, microbial genomics 6
PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015
PrePrin
ts
Abstract 7
Over the last decade, the emergence of high-throughput sequencing has led to an increase in 8
both the size and scope of genome sequencing projects. Although genome sequencing and 9
analysis has changed dramatically during this time, the way read alignments are visualised has 10
remained largely unchanged. To address the problem of visualising growing sequencing 11
datasets, we have developed DiscoPlot, a tool for visualising read alignments using a two-12
dimensional scatterplot. DiscoPlot allows the user to quickly identify genomic rearrangements, 13
misassemblies and sequencing artefacts by providing a scalable method for visualising large 14
sections of the genome. It reads single-end or paired read alignments in SAM, BAM or standard 15
BLAST tab format and creates a scatter plot of opaque crosses representing the alignments to 16
a reference. DiscoPlot is freely available (under a GPL license) for download (Mac OS X, Unix 17
and Windows) at https://mjsull.github.io/DiscoPlot. 18
Introduction 19
The emergence of high-throughput sequencing has led to an increase in both the size and scope of 20
genome sequencing projects. Assembly and mapping visualisation tools, such as Tablet (Milne et al. 21
2013), BamView (Carver et al. 2010) and Savant (Fiume et al. 2010) have been developed to be able to 22
handle the computational challenges presented by large volumes of data, however analysing even small 23
genomes can be time consuming. Furthermore, misassemblies and structural rearrangements can be 24
obfuscated by the large volume of data making it difficult to distinguish between discordant reads 25
caused by structural variations and discordant reads caused by spurious mapping or read chimeras. The 26
time consuming nature of visual analysis has led to a task originally accomplished manually, such as 27
identifying structural variations or misassemblies, being automated. Many programs have been 28
developed for identifying structural variations using alignment information such as SVDetect (Zeitouni 29
et al. 2010), ForestSV (Michaelson and Sebat 2012), BreakDancer (Chen et al. 2009) and InGap-SV 30
(Qi and Zhao 2011). Tools have also developed that make use of paired read alignment information to 31
identify misassembled contigs, including REAPR (Hunt et al. 2013) and amosvalidate (Phillippy et al. 32
PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015
PrePrin
ts
2008). These tools may make assumptions that may not be true for all datasets. For example, high levels 33
of read chimeras, highly variable coverage and bimodal insert sizes can all lead to overcalling or 34
undercalling of structural variations. Problems also arise when the original structure and a structural 35
variant are both present in the same sequencing run. Such sites where multiple structural variants exist 36
are often biological significant, such as genes undergoing DNA invertase mediated prophage tail fibre 37
allele switching (Forde et al. 2014). Thus there is a clear need for a scalable, visual approach that 38
enables the user to quickly identify genomic rearrangements, misassemblies and sequencing artefacts. 39
A related problem has underpinned visualisation software development for Hi-C data (Servant et al. 40
2012). Hi-C probes the three-dimensional architecture of genomes by coupling proximity based ligation 41
with high-throughput sequencing (Lieberman-Aiden et al. 2009). The Hi-C methodology produces large 42
volumes of mate-pair sequenced reads with each paired read coming from separate but spatially close 43
regions of the genome. Reads are then mapped back to a reference genome, and regions of the genome 44
that physically interact can be inferred from the concentration of reads that map to a particular region. 45
One way of accomplishing this is by binning reads in a 2-dimensional array whereby the x and y 46
coordinates are determined by the mapping coordinates of the left and right read pair, and visualising 47
the array using a heatmap. This approach provides a method of visualising large volumes of read 48
alignments across an entire genome. 49
A two-dimensional layout provides a scalable method for visualising read alignments. Heatmaps, while 50
good at visualising “low resolution” information, as in the case of Hi-C that involves interaction 51
between large (>100,000 bp) regions of the genome, have several major drawbacks which make them 52
unsuitable for visualising structural variations using paired-end reads. Bin number is limited by the 53
resolution of the image being created. If the bin size is larger than a genomic rearrangement, a two-54
dimensional approach will not work. Single significant bins can be difficult to spot at high resolutions. 55
In contrast, if a wide array of values are present in the bins it can be difficult to find a colour gradient 56
to make significantly different values distinguishable from one another. This problem is magnified if 57
colours are also needed to indicate other information, such as read orientation. 58
PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015
PrePrin
ts
To solve these problems DiscoPlot has been developed. DiscoPlot is a Python application that uses 59
matplotlib (Hunter 2007) to display read alignments as a scatter plot. It uses an opaque cross in which 60
the size of the cross is proportional to the number of reads mapping to the bin. This approach has several 61
inherent advantages over heatmaps. Representing counts as sizes allows colour to be used to display 62
additional information such as read orientation. Markers allow single bins with a significant number of 63
reads to easily be identified even when a large number of bins has been used. Anti-aliasing allows the 64
user to use more bins than resolution permits while still retaining some accuracy obtained by using a 65
small bin size. This approach has also been adapted to work with single-end long reads. The script can 66
create figures from paired alignments using a SAM, indexed BAM file and single-end reads using and 67
alignment in standard BLAST tab format. DiscoPlot is an open source project available on Windows, 68
OSX and GNU/Linux. 69
Implementation 70
DiscoPlot allows genome structural variants, misassemblies, circularisation of the genome and 71
sequencing artefacts such as read chimeras to be quickly identified. A reference genome of width w 72
base pairs is divided into n bins of size b base pairs. The user can either specify the number of bins to 73
use or the size of the bins. Two sparse matrices Ma and Mb of size n×n are built to count direct and 74
inverted alignments, and two arrays Aa and Ab of length n are also created to count unaligned sequence. 75
Paired-end sequencing matrix construction 76
Read alignments are processed from a read-alignment file in standard SAM or BAM format (Li et al. 77
2009). If reads are orientated in opposite directions Mb[⌊x/b⌋][⌊y/b⌋] is incremented by one, where x is 78
the position of the read aligning to the forward strand and y is the position of its pair (Figure 1). If both 79
ends of the read map to the forward strand x is the position of the mate aligning closest to the 5’ end 80
and y is determined by the position of its pair. Similarly, if both ends of the read map to the reverse 81
strand the x is determined by which mate is closest to the 3’ end of the forward strand and the y by its 82
pair. In both cases Ma[⌊x/b⌋][⌊y/b⌋] is incremented by one. The positions (x) of reads with unmapped 83
mates are also tracked in separate arrays, Aa[⌊x/b⌋] is incremented by one for each read that maps to the 84
PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015
PrePrin
ts
forward strand with an unmapped pair. Ab[⌊x/b⌋] is incremented by one for each read that maps to the 85
reverse strand with an unmapped pair. 86
87
Figure 1: Creation of a DiscoPlot using mate-pair reads. In the above figure the pink rectangle 88
represents the reference sequence. The linked arrows represent the left (blue) and right (orange) 89
PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015
PrePrin
ts
pairs of mate-paired reads. The square underneath is the DiscoPlot representation of how the 90
reads map. 1) DiscoPlot represents the chromosome in 2D space. 2) This reference is divided up 91
into bins. 3) When a read maps to the reference DrawGrid represents it as a cross on the grid 92
where the x coordinate of the cross is the start of the read aligning the forward strand maps to, 93
and the y coordinate is determined by mapping position of the start of its pair. 4) Multiple reads 94
can be represented in this way. 4) In the x axis each cross aligns with a read mapping to the 95
forward strand. 5) In the y axis each cross aligns with a read aligning to the reverse strand. 6) The 96
size of the cross represents how many read pairs map to that bin. 7) When mate-pair reads align 97
concordantly with the reference sequence crosses occur at position x, y where y is x minus the 98
insert size of the pair. 8) If read data is not concordant with the reference genome crosses occur 99
in unexpected locations. Read pairs that align to the same strand are indicated by red crosses. 9) 100
A representation of how region of the chromosome that excises into a circular product would 101
look when mapped against the chromosomal reference. 102
Single-end sequencing matrix construction 103
Local alignments of each read to a reference sequence are provided by the user in standard BLAST tab 104
format or generated automatically using NCBI-BLAST (Camacho et al. 2009). To reduce the number 105
of spurious hits, alignments less than the user defined identity or length are ignored (default values 95% 106
and 100 basepairs (bp), respectively). The alignment with the highest Bit-score becomes the primary 107
alignment. If several alignments score equally one is chosen at random to prevent spikes at repetitive 108
sites in the genome. If the entire read does not align in a single High Scoring Pair (HSP), subsequent 109
alignments are iterated through from highest to lowest Bit-score. If the alignment includes a previously 110
unaligned portion of the read it is included in a list of secondary alignments. Each alignment is defined 111
by four positions Qs, Qe, Rs and Re. Qs is the position in the read of the start of the alignment, Qe is the 112
position in the read of the end of the alignment, Qs is always less than Qe. Rs is the position in the 113
reference of the start of the alignment, Re is the position in the reference of the end of the alignment. If 114
alignment is to the forward strand of the reference Rs is less than Re, if the alignment is to the reverse 115
strand Rs is greater than Re. If the primary alignment is to the reverse strand of the reference, the primary 116
PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015
PrePrin
ts
and secondary alignments for the read are converted so that they are the equivalent of those that would 117
be found by aligning the reverse compliment of the read. To reverse these alignments Qs becomes the r 118
- Qe and Qe becomes the r - Qs, where r is the length of the read. Rs and Re are switched. After these 119
values have been set an anchor (a) is set as Rs from the primary alignment. Then, for each position (i) 120
in the length (l) of each alignment a set of x and y values are calculated where x = ⌊(Qs+a+i)/b⌋ and y = 121
⌊(Rs(i/l)+Re(1-i/l))/b⌋. If the alignment is the primary alignment or in the same orientation as the primary 122
alignment each unique combination of x and y is incremented by one in Ma. If the alignment is not in 123
the same orientation as the primary alignment each unique combination of x and y is incremented by 124
one in Mb (Figure 2). If the 5’ end of read mapping to the forward strand or the 3’ end of a read mapping 125
to the reverse strand doesn’t align to the reference Aa[Rs/b] is incremented by one. Similarly, if the 3’ 126
end of a read mapping to the forward strand or the 5’ end of a read mapping to the reverse strand 127
Ab[Re/b] is incremented by one. This process is repeated for all reads. 128
PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015
PrePrin
ts
129
Figure 2: Creation of a DiscoPlot using single end reads. The reference genome is represented by a 130
pink rectangle, which is divided into bins. Reads are represented by orange (forward) and blue 131
(reverse) arrows. The square underneath is the DiscoPlot representation of how the reads map. 1) 132
The anchor is set at the 5’ end of the primary alignment in the reference. A cross is drawn at bin 133
X, Y where X and Y are equal to the bin containing the anchor. For each extra bin the read aligns 134
PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015
PrePrin
ts
to an additional cross is drawn. The X coordinate of the cross is the position in the read relative 135
to the anchor and is located on the X axis relative to the initial cross. The Y coordinate is the 136
position that section of the read aligns to in the reference. For concordant reads this creates a 137
straight diagonal. 2) Additional reads can be added, if two or more reads map to the same bin the 138
size of the cross is increased. 3) The blue rectangle represents the genome from which the reads 139
were generated and grey polygons represent how it aligns to the reference. The X position 140
increments by one for each additional bin the read maps to. The Y position corresponds to the 141
position that section of the read aligns to in the reference. This read has two alignments to the 142
reference, as the second alignment is orientated in the opposite direction of the primary alignment 143
crosses are drawn in blue. 4) Additional reads are added. The resulting DiscoPlot is similar to a 144
Dotplot representation of the genome from which the reads were generated and the reference. 145
Drawing the Graph 146
For each bin in Ma an opaque red × is drawn with x and y coordinates equal to the x and y coordinates 147
of the bin multiplied by b. The width and height of the cross in points equals √2000𝑐1
𝑏×𝑚 where m is 148
the median value of the most populated diagonal in the matrix and c is the count of alignments in that 149
bin. Bins from Mb are drawn as blue opaque × symbols. Bins from Aa are drawn along the x axis as 150
opaque green + symbols and bins from Ab are drawn along the y axis as opaque green + symbols. Size, 151
thickness and opacity can all be adjusted by the user. Crosses are used because they allow deletions and 152
other features to become apparent even when visualising large sections of the reference genome. If more 153
than one reference is present in an alignment, alignments are separated by an opaque green band; this 154
allows potential integration events to be observed. Discoplot also allows the user to define regions of 155
the reference to view enabling precise visualisation of genomic rearrangements (Figure 4). 156
Discoplot uses matplotlib to display read mapping in an interactive window that can be zoomed and 157
scrolled to identify exact coordinates of markers. A subsection of the plot can be displayed; 158
alternatively, the plot can be broken up into a grid to highlight multiple subsections. Images can be 159
created from the GUI or written straight to an image from the command-line (PNG, SVG, PDF 160
PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015
PrePrin
ts
supported on most systems, see matplotlib documentation). Reads associated with markers can be 161
written to a separate BAM or FASTA file using the command-line. 162
Case studies 163
Case study 1: simulated Illumina reads using a mock variant genome of uropathogenic E. coli 164
UTI89 165
A mock variant genome was created using the genome of Escherichia coli UTI89. E. coli str. UTI89 166
is an uropathogenic E. coli isolated from a patient with an acute urinary tract infection. It contains a 167
5,065,741 bp chromosome and an 114,230 bp plasmid. A 300bp inversion and a 3000bp inversion were 168
created at bases 50,000..50,300 and 100,000..100,3000. A 300bp insertion and a 3000bp insertion were 169
created at bases 150,000 and 200,000. Bases in the ranges 250,000..250,300 and 300,000..303,000 were 170
deleted. Bases 400,000..400,300 were moved to position 350,000 and bases 500,000..503,000 were 171
moved to position 450,000. Paired-end reads with an insert size of 300bp were simulated at 200 times 172
coverage from the mock genome using GemSIM (McElroy et al. 2012) and mapped to UTI89 using 173
BWA (Li 2013). A DiscoPlot was generated from the resulting BAM file using a bin size of 100bp 174
(Figure 3). 175
PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015
PrePrin
ts
176
177
Figure 3: DiscoPlot of a mock genome. A mock genome was created by adding genomic 178
rearrangements to the chromosome of E. coli str. UTI89. Paired-end reads generated from the 179
mock genome (query) with GemSim and mapped back to UTI89 (reference). The first ~500 Kbp 180
were then visualised using DiscoPlot. 181
1000bp single-end reads were also generated from the mock genome at 20 times coverage and mapped 182
back to UTI89. Sizes of reads and structural variants were chosen to demonstrate how each structural 183
PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015
PrePrin
ts
variant is represented in the DiscoPlot when the read length or insert size is less than or equal to the 184
structural variant or greater than the structural variant. A close up of each of the structural variants was 185
also generated (Figure 4). 186
187
Figure 4: DiscoPlots of common structural variants. Each box shows a common genomic 188
rearrangement represented by a DiscoPlot. Rows A and B were created using 100 bp long paired-189
end reads with an insert size of 300bp. Rows C and D were created using single-end reads with 190
an average length of 1000bp. For each box the rearrangement in the sequenced genome is listed, 191
PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015
PrePrin
ts
followed by the scale of the gridlines in brackets. A1, C1: 300 bp deletion (400 bp). A2, C2: 300 192
bp insertion (400 bp). A3, C3: 300 bp inversion (400 bp). A4, C4: 300 bp sequence translocated 193
50 Kbp upstream (10 Kbp). B1, D1: 3000 bp deletion (1000 bp). B2, D2: 3000 bp insertion (500 194
bp). B3, D3: 3000 bp inversion (1000 bp). B4, D4: 3000 bp sequence translocated 50 Kbp 195
upstream (10 Kbp). 196
Case study 2: real Illumina reads from sequencing uropathogenic E. coli UTI89 197
An Illumina HiSeq2000 101 bp paired-end sequencing run of Escherichia coli str. UTI89 (ENA acc: 198
ERR687901) was aligned to the publicly available reference genome (CP000243.1) and its plasmid 199
(CP000244). The alignment was then examined using DiscoPlot. Low quality reads were trimmed, 200
where possible, or filtered if the average per-base quality was less than 30 or had one pair trimmed to 201
less than 50 base pairs. In total, 2,433,934 high quality read pairs with an average insert size of 367.25 202
base pairs with a standard deviation of 59.1 were aligned using the BWA mem algorithm (Li 2013). 203
The resulting BAM file was then visualised using DiscoPlot using 100,000 bins revealing several 204
structural variations (Figure 5). 205
PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015
PrePrin
ts
206
Figure 5: The dynamic nature of the genome of E. coli str. UTI89. Discoplot of paired-end reads 207
from a clonal culture of UTI89 mapped back to the published reference chromosome and plasmid. 208
Coordinates from 0 to 5,065,741 represent the chromosome of E. coli UTI89, coordinates ≥ 209
5,066,000 represent the plasmid of E. coli UTI89. 210
Sites with discordantly mapping reads were identified using DiscoPlot (Figure 6). Reads from these 211
sites were randomly sampled and mapped back to the reference using BLASTn (Camacho et al. 2009) 212
to confirm if they had been correctly aligned. Structural variants were inferred from context of region 213
of the genome to which the reads mapped, mapping distance and orientation of the aligned reads. Local 214
alignments of soft-clipped reads were performed to identify exact boundaries of structural variations. 215
PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015
PrePrin
ts
216
Figure 6: Discordant reads in E. coli str. UTI89. a) Read alignment indicates inversion of bases 217
919,638..922,323. 12bp inverted repeat present at terminals of region. Start and stop of inverted 218
region occurs in two probable tail fibre proteins. Two additional tail fibre assembly proteins are 219
encoded within the boundaries of this region. Region is immediately downstream of a putative 220
DNA invertase gene. b, f, h, i) Reads are misaligned as they map equally well in a concordant 221
position. c) Read alignment indicates circularisation of bases 1,653,000..1,662,603. 17bp direct 222
repeats present at terminals of this region. Region also encodes five putative phage-related 223
membrane proteins, two putative phage proteins, three phage hypothetical proteins, four 224
PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015
PrePrin
ts
hypothetical proteins and a single putative phage related secreted protein. Size of crosses 225
indicates coverage of this region is higher than average. Only a single read (indicated by the 226
cross, top left) indicates potential excision of this region. d) Read alignments indicate inversion 227
of bases 2,109,690..2,114,003. Region contains ~100bp inverted repeat at terminals which 228
encodes a tRNA. Region contains 3 hypothetical proteins and an additional tRNA identical to the 229
repeats. A P4-phage integrase is present immediately downstream of the inversion. The lack of 230
concordantly mapping reads at prophage boundary indicates that the inverted region has reached 231
fixation in the population. e) Reads indicate inversion of bases 2,906,008..2,906,936. 15bp 232
inverted repeats present at terminals of this region. The 3’ end of a putative tail fibre assembly 233
gene is encoded by this region. g) Read alignments indicate inversion of bases 234
4,907,424..4,907,737. Regions has 9bp inverted repeat at terminals. It is located in a non-coding 235
region between fimA and fimE which encode the type I fimbriae. 236
The majority of structural variants cannot be explained by misalignment of the reads or sequencing 237
error and therefore have biological significance. Two sites of variation are potentially caused by tail 238
fibre allele switching (Figure 6a and 6e), similar to that previously shown in the genome of 239
uropathogenic E. coli ST131 strain EC958 (Forde et al. 2014). An inversion event occurred at a site 240
with several tRNA (Figure 6d). At another site, phase variation of a fimbriae loci was occurring (Figure 241
6g). A circularising region was also identified (Figure 6c), this region was present in the sequencing 242
data at much higher coverage (243.1× coverage compared to 88.3× the average coverage of the 243
chromosome). It is possible that this phage was induced in a small fraction of the bacteria. Additional 244
analysis, such as PCR or long read sequencing, is needed to confirm the exact nature of the identified 245
structural variants. 246
In addition to the structural variants detected, reads misaligned to several repetitive regions (Figures 247
6b, 6f, 6h, 6i). Misaligned reads all had insert sizes much smaller than the average, which likely 248
explains why they had been misplaced by BWA mem. (Figure 7). 249
250
PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015
PrePrin
ts
251
252
Figure 7: Misaligned reads and insert size. Blue columns indicate frequency of the mapping distances 253
of all aligned reads. Each vertical red line indicates the corrected mapping distances of a 254
misaligned read from site b (Figure 6). 255
For each batch, BWA mem estimates the mean and variance of insert size distributions from reliable 256
single-end hits. BWA then uses a normal distribution, determined by these values, to estimate the 257
likelihood of each of the possible pairings of reads with multiple alignments. This value is combined 258
with the alignment scores of the reads to determine the probable alignment of a read pair. If the putative 259
insert size of an alignment varies significantly from the mean BWA considers the read a chimeric pair; 260
BWA considers all chimeric alignments of a read pair equally likely. As insert sizes from our 261
sequencing run were not normally distributed, BWA considers the reads with small insert sizes 262
chimeric and arbitrarily assigns them to one of the instances of a repeat. This resulted in the 263
misalignment of reads with small insert sizes at large exact repeats. 264
PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015
PrePrin
ts
Conclusions 265
Discoplot provides a simple, scalable method for visualising read alignments to a reference. It allows 266
the user to quickly identify genomic rearrangements, misassemblies and sequencing artefacts. A visual 267
approach is inherently sequence type agnostic, this allows the user to immediately analyse their 268
alignments without calculating the average length or standard deviation of their read library. This is 269
particularly useful when analysing complex libraries such as mate-pair reads that may contain a shadow 270
library of read pairs in the reverse orientation. We have shown how DiscoPlot can be used to quickly 271
identify genome rearrangements and misaligned reads in bacterial genomes. In future, additional 272
colours may be incorporated into to display other metrics such as average number of single nucleotide 273
variants per read in a bin. To make DiscoPlot more accessible, features only available through the use 274
of the command line could be incorporated into a graphical user interface. 275
Acknowledgments 276
The authors would also like to thank Professor Mark Schembri for making available the UTI89 Illumina 277
datasets used in this project. We would also like to thank Mitchell Stanton-Cook for helping make 278
DiscoPlot.py available through PyPI. 279
Funding 280
SAB is supported by a Career Development Fellowship from the National Health and Medical Research 281
Council Career of Australia [grant number APP1090456]. 282
References 283
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. 2009. BLAST+: 284
architecture and applications. BMC bioinformatics 10: 421. 285
Carver T, Böhme U, Otto TD, Parkhill J, Berriman M. 2010. BamView: viewing mapped read 286
alignment data in the context of the reference sequence. Bioinformatics (Oxford, England) 287
26(5): 676-677. 288
PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015
PrePrin
ts
Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, McGrath SD, Wendl MC, 289
Zhang Q, Locke DP et al. 2009. BreakDancer: an algorithm for high-resolution mapping of 290
genomic structural variation. Nat Meth 6(9): 677-681. 291
Fiume M, Williams V, Brook A, Brudno M. 2010. Savant: genome browser for high-throughput 292
sequencing data. Bioinformatics 26(16): 1938-1944. 293
Forde BM, Ben Zakour NL, Stanton-Cook M, Phan M-D, Totsika M, Peters KM, Chan KG, Schembri 294
MA, Upton M, Beatson SA. 2014. The Complete Genome Sequence of Escherichia coli 295
EC958: A High Quality Reference Sequence for the Globally Disseminated Multidrug 296
Resistant E. coli O25b:H4-ST131 Clone. PloS one 9(8): e104400. 297
Hunt M, Kikuchi T, Sanders M, Newbold C, Berriman M, Otto T. 2013. REAPR: a universal tool for 298
genome assembly evaluation. Genome Biology 14(5): R47. 299
Hunter JD. 2007. Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering 9(3): 300
90-95. 301
Li H. 2013. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. In 302
ArXiv e-prints, Vol 1303, p. 3997. 303
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. 2009. 304
The Sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England) 305
25(16): 2078-2079. 306
Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, Amit I, Lajoie 307
BR, Sabo PJ, Dorschner MO et al. 2009. Comprehensive mapping of long-range interactions 308
reveals folding principles of the human genome. Science (New York, NY) 326(5950): 289-293. 309
McElroy K, Luciani F, Thomas T. 2012. GemSIM: general, error-model based simulator of next-310
generation sequencing data. BMC Genomics 13(1): 74. 311
Michaelson JJ, Sebat J. 2012. forestSV: structural variant discovery through statistical learning. Nat 312
Meth 9(8): 819-821. 313
Milne I, Stephen G, Bayer M, Cock PJA, Pritchard L, Cardle L, Shaw PD, Marshall D. 2013. Using 314
Tablet for visual exploration of second-generation sequencing data. Briefings in 315
bioinformatics 14(2): 193-202. 316
PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015
PrePrin
ts
Phillippy AM, Schatz MC, Pop M. 2008. Genome assembly forensics: finding the elusive mis-317
assembly. Genome Biol 9(3): R55. 318
Qi J, Zhao F. 2011. inGAP-sv: a novel scheme to identify and visualize structural variation from 319
paired end mapping data. Nucleic Acids Res 39(Web Server issue): W567-575. 320
Servant N, Lajoie BR, Nora EP, Giorgetti L, Chen CJ, Heard E, Dekker J, Barillot E. 2012. HiTC: 321
exploration of high-throughput 'C' experiments. Bioinformatics (Oxford, England) 28(21): 322
2843-2844. 323
Zeitouni B, Boeva V, Janoueix-Lerosey I, Loeillet S, Legoix-né P, Nicolas A, Delattre O, Barillot E. 324
2010. SVDetect: a tool to identify genomic structural variations from paired-end and mate-325
pair sequencing data. Bioinformatics (Oxford, England) 26(15): 1895-1896. 326
PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015
PrePrin
ts