DiscoPlot: Discordant read visualisation - PeerJ · To address the problem of visualising growing...

20
DiscoPlot: Discordant read visualisation 1 Mitchell J. Sullivan a , Scott A. Beatson a * 2 a Australian Infectious Diseases Research Centre, and School of Chemistry and Molecular 3 Biosciences, The University of Queensland, Brisbane, QLD 4072, Australia 4 *To whom correspondence should be addressed. Tel: +61 7 3365 4863; Email: [email protected] 5 Keywords: Bioinformatics, software, structural variation, misalignment, microbial genomics 6 PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015 PrePrints

Transcript of DiscoPlot: Discordant read visualisation - PeerJ · To address the problem of visualising growing...

Page 1: DiscoPlot: Discordant read visualisation - PeerJ · To address the problem of visualising growing sequencing 12 datasets, we have developed DiscoPlot, a tool for visualising read

DiscoPlot: Discordant read visualisation 1

Mitchell J. Sullivana, Scott A. Beatsona* 2

aAustralian Infectious Diseases Research Centre, and School of Chemistry and Molecular 3

Biosciences, The University of Queensland, Brisbane, QLD 4072, Australia 4

*To whom correspondence should be addressed. Tel: +61 7 3365 4863; Email: [email protected] 5

Keywords: Bioinformatics, software, structural variation, misalignment, microbial genomics 6

PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015

PrePrin

ts

Page 2: DiscoPlot: Discordant read visualisation - PeerJ · To address the problem of visualising growing sequencing 12 datasets, we have developed DiscoPlot, a tool for visualising read

Abstract 7

Over the last decade, the emergence of high-throughput sequencing has led to an increase in 8

both the size and scope of genome sequencing projects. Although genome sequencing and 9

analysis has changed dramatically during this time, the way read alignments are visualised has 10

remained largely unchanged. To address the problem of visualising growing sequencing 11

datasets, we have developed DiscoPlot, a tool for visualising read alignments using a two-12

dimensional scatterplot. DiscoPlot allows the user to quickly identify genomic rearrangements, 13

misassemblies and sequencing artefacts by providing a scalable method for visualising large 14

sections of the genome. It reads single-end or paired read alignments in SAM, BAM or standard 15

BLAST tab format and creates a scatter plot of opaque crosses representing the alignments to 16

a reference. DiscoPlot is freely available (under a GPL license) for download (Mac OS X, Unix 17

and Windows) at https://mjsull.github.io/DiscoPlot. 18

Introduction 19

The emergence of high-throughput sequencing has led to an increase in both the size and scope of 20

genome sequencing projects. Assembly and mapping visualisation tools, such as Tablet (Milne et al. 21

2013), BamView (Carver et al. 2010) and Savant (Fiume et al. 2010) have been developed to be able to 22

handle the computational challenges presented by large volumes of data, however analysing even small 23

genomes can be time consuming. Furthermore, misassemblies and structural rearrangements can be 24

obfuscated by the large volume of data making it difficult to distinguish between discordant reads 25

caused by structural variations and discordant reads caused by spurious mapping or read chimeras. The 26

time consuming nature of visual analysis has led to a task originally accomplished manually, such as 27

identifying structural variations or misassemblies, being automated. Many programs have been 28

developed for identifying structural variations using alignment information such as SVDetect (Zeitouni 29

et al. 2010), ForestSV (Michaelson and Sebat 2012), BreakDancer (Chen et al. 2009) and InGap-SV 30

(Qi and Zhao 2011). Tools have also developed that make use of paired read alignment information to 31

identify misassembled contigs, including REAPR (Hunt et al. 2013) and amosvalidate (Phillippy et al. 32

PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015

PrePrin

ts

Page 3: DiscoPlot: Discordant read visualisation - PeerJ · To address the problem of visualising growing sequencing 12 datasets, we have developed DiscoPlot, a tool for visualising read

2008). These tools may make assumptions that may not be true for all datasets. For example, high levels 33

of read chimeras, highly variable coverage and bimodal insert sizes can all lead to overcalling or 34

undercalling of structural variations. Problems also arise when the original structure and a structural 35

variant are both present in the same sequencing run. Such sites where multiple structural variants exist 36

are often biological significant, such as genes undergoing DNA invertase mediated prophage tail fibre 37

allele switching (Forde et al. 2014). Thus there is a clear need for a scalable, visual approach that 38

enables the user to quickly identify genomic rearrangements, misassemblies and sequencing artefacts. 39

A related problem has underpinned visualisation software development for Hi-C data (Servant et al. 40

2012). Hi-C probes the three-dimensional architecture of genomes by coupling proximity based ligation 41

with high-throughput sequencing (Lieberman-Aiden et al. 2009). The Hi-C methodology produces large 42

volumes of mate-pair sequenced reads with each paired read coming from separate but spatially close 43

regions of the genome. Reads are then mapped back to a reference genome, and regions of the genome 44

that physically interact can be inferred from the concentration of reads that map to a particular region. 45

One way of accomplishing this is by binning reads in a 2-dimensional array whereby the x and y 46

coordinates are determined by the mapping coordinates of the left and right read pair, and visualising 47

the array using a heatmap. This approach provides a method of visualising large volumes of read 48

alignments across an entire genome. 49

A two-dimensional layout provides a scalable method for visualising read alignments. Heatmaps, while 50

good at visualising “low resolution” information, as in the case of Hi-C that involves interaction 51

between large (>100,000 bp) regions of the genome, have several major drawbacks which make them 52

unsuitable for visualising structural variations using paired-end reads. Bin number is limited by the 53

resolution of the image being created. If the bin size is larger than a genomic rearrangement, a two-54

dimensional approach will not work. Single significant bins can be difficult to spot at high resolutions. 55

In contrast, if a wide array of values are present in the bins it can be difficult to find a colour gradient 56

to make significantly different values distinguishable from one another. This problem is magnified if 57

colours are also needed to indicate other information, such as read orientation. 58

PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015

PrePrin

ts

Page 4: DiscoPlot: Discordant read visualisation - PeerJ · To address the problem of visualising growing sequencing 12 datasets, we have developed DiscoPlot, a tool for visualising read

To solve these problems DiscoPlot has been developed. DiscoPlot is a Python application that uses 59

matplotlib (Hunter 2007) to display read alignments as a scatter plot. It uses an opaque cross in which 60

the size of the cross is proportional to the number of reads mapping to the bin. This approach has several 61

inherent advantages over heatmaps. Representing counts as sizes allows colour to be used to display 62

additional information such as read orientation. Markers allow single bins with a significant number of 63

reads to easily be identified even when a large number of bins has been used. Anti-aliasing allows the 64

user to use more bins than resolution permits while still retaining some accuracy obtained by using a 65

small bin size. This approach has also been adapted to work with single-end long reads. The script can 66

create figures from paired alignments using a SAM, indexed BAM file and single-end reads using and 67

alignment in standard BLAST tab format. DiscoPlot is an open source project available on Windows, 68

OSX and GNU/Linux. 69

Implementation 70

DiscoPlot allows genome structural variants, misassemblies, circularisation of the genome and 71

sequencing artefacts such as read chimeras to be quickly identified. A reference genome of width w 72

base pairs is divided into n bins of size b base pairs. The user can either specify the number of bins to 73

use or the size of the bins. Two sparse matrices Ma and Mb of size n×n are built to count direct and 74

inverted alignments, and two arrays Aa and Ab of length n are also created to count unaligned sequence. 75

Paired-end sequencing matrix construction 76

Read alignments are processed from a read-alignment file in standard SAM or BAM format (Li et al. 77

2009). If reads are orientated in opposite directions Mb[⌊x/b⌋][⌊y/b⌋] is incremented by one, where x is 78

the position of the read aligning to the forward strand and y is the position of its pair (Figure 1). If both 79

ends of the read map to the forward strand x is the position of the mate aligning closest to the 5’ end 80

and y is determined by the position of its pair. Similarly, if both ends of the read map to the reverse 81

strand the x is determined by which mate is closest to the 3’ end of the forward strand and the y by its 82

pair. In both cases Ma[⌊x/b⌋][⌊y/b⌋] is incremented by one. The positions (x) of reads with unmapped 83

mates are also tracked in separate arrays, Aa[⌊x/b⌋] is incremented by one for each read that maps to the 84

PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015

PrePrin

ts

Page 5: DiscoPlot: Discordant read visualisation - PeerJ · To address the problem of visualising growing sequencing 12 datasets, we have developed DiscoPlot, a tool for visualising read

forward strand with an unmapped pair. Ab[⌊x/b⌋] is incremented by one for each read that maps to the 85

reverse strand with an unmapped pair. 86

87

Figure 1: Creation of a DiscoPlot using mate-pair reads. In the above figure the pink rectangle 88

represents the reference sequence. The linked arrows represent the left (blue) and right (orange) 89

PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015

PrePrin

ts

Page 6: DiscoPlot: Discordant read visualisation - PeerJ · To address the problem of visualising growing sequencing 12 datasets, we have developed DiscoPlot, a tool for visualising read

pairs of mate-paired reads. The square underneath is the DiscoPlot representation of how the 90

reads map. 1) DiscoPlot represents the chromosome in 2D space. 2) This reference is divided up 91

into bins. 3) When a read maps to the reference DrawGrid represents it as a cross on the grid 92

where the x coordinate of the cross is the start of the read aligning the forward strand maps to, 93

and the y coordinate is determined by mapping position of the start of its pair. 4) Multiple reads 94

can be represented in this way. 4) In the x axis each cross aligns with a read mapping to the 95

forward strand. 5) In the y axis each cross aligns with a read aligning to the reverse strand. 6) The 96

size of the cross represents how many read pairs map to that bin. 7) When mate-pair reads align 97

concordantly with the reference sequence crosses occur at position x, y where y is x minus the 98

insert size of the pair. 8) If read data is not concordant with the reference genome crosses occur 99

in unexpected locations. Read pairs that align to the same strand are indicated by red crosses. 9) 100

A representation of how region of the chromosome that excises into a circular product would 101

look when mapped against the chromosomal reference. 102

Single-end sequencing matrix construction 103

Local alignments of each read to a reference sequence are provided by the user in standard BLAST tab 104

format or generated automatically using NCBI-BLAST (Camacho et al. 2009). To reduce the number 105

of spurious hits, alignments less than the user defined identity or length are ignored (default values 95% 106

and 100 basepairs (bp), respectively). The alignment with the highest Bit-score becomes the primary 107

alignment. If several alignments score equally one is chosen at random to prevent spikes at repetitive 108

sites in the genome. If the entire read does not align in a single High Scoring Pair (HSP), subsequent 109

alignments are iterated through from highest to lowest Bit-score. If the alignment includes a previously 110

unaligned portion of the read it is included in a list of secondary alignments. Each alignment is defined 111

by four positions Qs, Qe, Rs and Re. Qs is the position in the read of the start of the alignment, Qe is the 112

position in the read of the end of the alignment, Qs is always less than Qe. Rs is the position in the 113

reference of the start of the alignment, Re is the position in the reference of the end of the alignment. If 114

alignment is to the forward strand of the reference Rs is less than Re, if the alignment is to the reverse 115

strand Rs is greater than Re. If the primary alignment is to the reverse strand of the reference, the primary 116

PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015

PrePrin

ts

Page 7: DiscoPlot: Discordant read visualisation - PeerJ · To address the problem of visualising growing sequencing 12 datasets, we have developed DiscoPlot, a tool for visualising read

and secondary alignments for the read are converted so that they are the equivalent of those that would 117

be found by aligning the reverse compliment of the read. To reverse these alignments Qs becomes the r 118

- Qe and Qe becomes the r - Qs, where r is the length of the read. Rs and Re are switched. After these 119

values have been set an anchor (a) is set as Rs from the primary alignment. Then, for each position (i) 120

in the length (l) of each alignment a set of x and y values are calculated where x = ⌊(Qs+a+i)/b⌋ and y = 121

⌊(Rs(i/l)+Re(1-i/l))/b⌋. If the alignment is the primary alignment or in the same orientation as the primary 122

alignment each unique combination of x and y is incremented by one in Ma. If the alignment is not in 123

the same orientation as the primary alignment each unique combination of x and y is incremented by 124

one in Mb (Figure 2). If the 5’ end of read mapping to the forward strand or the 3’ end of a read mapping 125

to the reverse strand doesn’t align to the reference Aa[Rs/b] is incremented by one. Similarly, if the 3’ 126

end of a read mapping to the forward strand or the 5’ end of a read mapping to the reverse strand 127

Ab[Re/b] is incremented by one. This process is repeated for all reads. 128

PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015

PrePrin

ts

Page 8: DiscoPlot: Discordant read visualisation - PeerJ · To address the problem of visualising growing sequencing 12 datasets, we have developed DiscoPlot, a tool for visualising read

129

Figure 2: Creation of a DiscoPlot using single end reads. The reference genome is represented by a 130

pink rectangle, which is divided into bins. Reads are represented by orange (forward) and blue 131

(reverse) arrows. The square underneath is the DiscoPlot representation of how the reads map. 1) 132

The anchor is set at the 5’ end of the primary alignment in the reference. A cross is drawn at bin 133

X, Y where X and Y are equal to the bin containing the anchor. For each extra bin the read aligns 134

PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015

PrePrin

ts

Page 9: DiscoPlot: Discordant read visualisation - PeerJ · To address the problem of visualising growing sequencing 12 datasets, we have developed DiscoPlot, a tool for visualising read

to an additional cross is drawn. The X coordinate of the cross is the position in the read relative 135

to the anchor and is located on the X axis relative to the initial cross. The Y coordinate is the 136

position that section of the read aligns to in the reference. For concordant reads this creates a 137

straight diagonal. 2) Additional reads can be added, if two or more reads map to the same bin the 138

size of the cross is increased. 3) The blue rectangle represents the genome from which the reads 139

were generated and grey polygons represent how it aligns to the reference. The X position 140

increments by one for each additional bin the read maps to. The Y position corresponds to the 141

position that section of the read aligns to in the reference. This read has two alignments to the 142

reference, as the second alignment is orientated in the opposite direction of the primary alignment 143

crosses are drawn in blue. 4) Additional reads are added. The resulting DiscoPlot is similar to a 144

Dotplot representation of the genome from which the reads were generated and the reference. 145

Drawing the Graph 146

For each bin in Ma an opaque red × is drawn with x and y coordinates equal to the x and y coordinates 147

of the bin multiplied by b. The width and height of the cross in points equals √2000𝑐1

𝑏×𝑚 where m is 148

the median value of the most populated diagonal in the matrix and c is the count of alignments in that 149

bin. Bins from Mb are drawn as blue opaque × symbols. Bins from Aa are drawn along the x axis as 150

opaque green + symbols and bins from Ab are drawn along the y axis as opaque green + symbols. Size, 151

thickness and opacity can all be adjusted by the user. Crosses are used because they allow deletions and 152

other features to become apparent even when visualising large sections of the reference genome. If more 153

than one reference is present in an alignment, alignments are separated by an opaque green band; this 154

allows potential integration events to be observed. Discoplot also allows the user to define regions of 155

the reference to view enabling precise visualisation of genomic rearrangements (Figure 4). 156

Discoplot uses matplotlib to display read mapping in an interactive window that can be zoomed and 157

scrolled to identify exact coordinates of markers. A subsection of the plot can be displayed; 158

alternatively, the plot can be broken up into a grid to highlight multiple subsections. Images can be 159

created from the GUI or written straight to an image from the command-line (PNG, SVG, PDF 160

PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015

PrePrin

ts

Page 10: DiscoPlot: Discordant read visualisation - PeerJ · To address the problem of visualising growing sequencing 12 datasets, we have developed DiscoPlot, a tool for visualising read

supported on most systems, see matplotlib documentation). Reads associated with markers can be 161

written to a separate BAM or FASTA file using the command-line. 162

Case studies 163

Case study 1: simulated Illumina reads using a mock variant genome of uropathogenic E. coli 164

UTI89 165

A mock variant genome was created using the genome of Escherichia coli UTI89. E. coli str. UTI89 166

is an uropathogenic E. coli isolated from a patient with an acute urinary tract infection. It contains a 167

5,065,741 bp chromosome and an 114,230 bp plasmid. A 300bp inversion and a 3000bp inversion were 168

created at bases 50,000..50,300 and 100,000..100,3000. A 300bp insertion and a 3000bp insertion were 169

created at bases 150,000 and 200,000. Bases in the ranges 250,000..250,300 and 300,000..303,000 were 170

deleted. Bases 400,000..400,300 were moved to position 350,000 and bases 500,000..503,000 were 171

moved to position 450,000. Paired-end reads with an insert size of 300bp were simulated at 200 times 172

coverage from the mock genome using GemSIM (McElroy et al. 2012) and mapped to UTI89 using 173

BWA (Li 2013). A DiscoPlot was generated from the resulting BAM file using a bin size of 100bp 174

(Figure 3). 175

PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015

PrePrin

ts

Page 11: DiscoPlot: Discordant read visualisation - PeerJ · To address the problem of visualising growing sequencing 12 datasets, we have developed DiscoPlot, a tool for visualising read

176

177

Figure 3: DiscoPlot of a mock genome. A mock genome was created by adding genomic 178

rearrangements to the chromosome of E. coli str. UTI89. Paired-end reads generated from the 179

mock genome (query) with GemSim and mapped back to UTI89 (reference). The first ~500 Kbp 180

were then visualised using DiscoPlot. 181

1000bp single-end reads were also generated from the mock genome at 20 times coverage and mapped 182

back to UTI89. Sizes of reads and structural variants were chosen to demonstrate how each structural 183

PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015

PrePrin

ts

Page 12: DiscoPlot: Discordant read visualisation - PeerJ · To address the problem of visualising growing sequencing 12 datasets, we have developed DiscoPlot, a tool for visualising read

variant is represented in the DiscoPlot when the read length or insert size is less than or equal to the 184

structural variant or greater than the structural variant. A close up of each of the structural variants was 185

also generated (Figure 4). 186

187

Figure 4: DiscoPlots of common structural variants. Each box shows a common genomic 188

rearrangement represented by a DiscoPlot. Rows A and B were created using 100 bp long paired-189

end reads with an insert size of 300bp. Rows C and D were created using single-end reads with 190

an average length of 1000bp. For each box the rearrangement in the sequenced genome is listed, 191

PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015

PrePrin

ts

Page 13: DiscoPlot: Discordant read visualisation - PeerJ · To address the problem of visualising growing sequencing 12 datasets, we have developed DiscoPlot, a tool for visualising read

followed by the scale of the gridlines in brackets. A1, C1: 300 bp deletion (400 bp). A2, C2: 300 192

bp insertion (400 bp). A3, C3: 300 bp inversion (400 bp). A4, C4: 300 bp sequence translocated 193

50 Kbp upstream (10 Kbp). B1, D1: 3000 bp deletion (1000 bp). B2, D2: 3000 bp insertion (500 194

bp). B3, D3: 3000 bp inversion (1000 bp). B4, D4: 3000 bp sequence translocated 50 Kbp 195

upstream (10 Kbp). 196

Case study 2: real Illumina reads from sequencing uropathogenic E. coli UTI89 197

An Illumina HiSeq2000 101 bp paired-end sequencing run of Escherichia coli str. UTI89 (ENA acc: 198

ERR687901) was aligned to the publicly available reference genome (CP000243.1) and its plasmid 199

(CP000244). The alignment was then examined using DiscoPlot. Low quality reads were trimmed, 200

where possible, or filtered if the average per-base quality was less than 30 or had one pair trimmed to 201

less than 50 base pairs. In total, 2,433,934 high quality read pairs with an average insert size of 367.25 202

base pairs with a standard deviation of 59.1 were aligned using the BWA mem algorithm (Li 2013). 203

The resulting BAM file was then visualised using DiscoPlot using 100,000 bins revealing several 204

structural variations (Figure 5). 205

PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015

PrePrin

ts

Page 14: DiscoPlot: Discordant read visualisation - PeerJ · To address the problem of visualising growing sequencing 12 datasets, we have developed DiscoPlot, a tool for visualising read

206

Figure 5: The dynamic nature of the genome of E. coli str. UTI89. Discoplot of paired-end reads 207

from a clonal culture of UTI89 mapped back to the published reference chromosome and plasmid. 208

Coordinates from 0 to 5,065,741 represent the chromosome of E. coli UTI89, coordinates ≥ 209

5,066,000 represent the plasmid of E. coli UTI89. 210

Sites with discordantly mapping reads were identified using DiscoPlot (Figure 6). Reads from these 211

sites were randomly sampled and mapped back to the reference using BLASTn (Camacho et al. 2009) 212

to confirm if they had been correctly aligned. Structural variants were inferred from context of region 213

of the genome to which the reads mapped, mapping distance and orientation of the aligned reads. Local 214

alignments of soft-clipped reads were performed to identify exact boundaries of structural variations. 215

PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015

PrePrin

ts

Page 15: DiscoPlot: Discordant read visualisation - PeerJ · To address the problem of visualising growing sequencing 12 datasets, we have developed DiscoPlot, a tool for visualising read

216

Figure 6: Discordant reads in E. coli str. UTI89. a) Read alignment indicates inversion of bases 217

919,638..922,323. 12bp inverted repeat present at terminals of region. Start and stop of inverted 218

region occurs in two probable tail fibre proteins. Two additional tail fibre assembly proteins are 219

encoded within the boundaries of this region. Region is immediately downstream of a putative 220

DNA invertase gene. b, f, h, i) Reads are misaligned as they map equally well in a concordant 221

position. c) Read alignment indicates circularisation of bases 1,653,000..1,662,603. 17bp direct 222

repeats present at terminals of this region. Region also encodes five putative phage-related 223

membrane proteins, two putative phage proteins, three phage hypothetical proteins, four 224

PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015

PrePrin

ts

Page 16: DiscoPlot: Discordant read visualisation - PeerJ · To address the problem of visualising growing sequencing 12 datasets, we have developed DiscoPlot, a tool for visualising read

hypothetical proteins and a single putative phage related secreted protein. Size of crosses 225

indicates coverage of this region is higher than average. Only a single read (indicated by the 226

cross, top left) indicates potential excision of this region. d) Read alignments indicate inversion 227

of bases 2,109,690..2,114,003. Region contains ~100bp inverted repeat at terminals which 228

encodes a tRNA. Region contains 3 hypothetical proteins and an additional tRNA identical to the 229

repeats. A P4-phage integrase is present immediately downstream of the inversion. The lack of 230

concordantly mapping reads at prophage boundary indicates that the inverted region has reached 231

fixation in the population. e) Reads indicate inversion of bases 2,906,008..2,906,936. 15bp 232

inverted repeats present at terminals of this region. The 3’ end of a putative tail fibre assembly 233

gene is encoded by this region. g) Read alignments indicate inversion of bases 234

4,907,424..4,907,737. Regions has 9bp inverted repeat at terminals. It is located in a non-coding 235

region between fimA and fimE which encode the type I fimbriae. 236

The majority of structural variants cannot be explained by misalignment of the reads or sequencing 237

error and therefore have biological significance. Two sites of variation are potentially caused by tail 238

fibre allele switching (Figure 6a and 6e), similar to that previously shown in the genome of 239

uropathogenic E. coli ST131 strain EC958 (Forde et al. 2014). An inversion event occurred at a site 240

with several tRNA (Figure 6d). At another site, phase variation of a fimbriae loci was occurring (Figure 241

6g). A circularising region was also identified (Figure 6c), this region was present in the sequencing 242

data at much higher coverage (243.1× coverage compared to 88.3× the average coverage of the 243

chromosome). It is possible that this phage was induced in a small fraction of the bacteria. Additional 244

analysis, such as PCR or long read sequencing, is needed to confirm the exact nature of the identified 245

structural variants. 246

In addition to the structural variants detected, reads misaligned to several repetitive regions (Figures 247

6b, 6f, 6h, 6i). Misaligned reads all had insert sizes much smaller than the average, which likely 248

explains why they had been misplaced by BWA mem. (Figure 7). 249

250

PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015

PrePrin

ts

Page 17: DiscoPlot: Discordant read visualisation - PeerJ · To address the problem of visualising growing sequencing 12 datasets, we have developed DiscoPlot, a tool for visualising read

251

252

Figure 7: Misaligned reads and insert size. Blue columns indicate frequency of the mapping distances 253

of all aligned reads. Each vertical red line indicates the corrected mapping distances of a 254

misaligned read from site b (Figure 6). 255

For each batch, BWA mem estimates the mean and variance of insert size distributions from reliable 256

single-end hits. BWA then uses a normal distribution, determined by these values, to estimate the 257

likelihood of each of the possible pairings of reads with multiple alignments. This value is combined 258

with the alignment scores of the reads to determine the probable alignment of a read pair. If the putative 259

insert size of an alignment varies significantly from the mean BWA considers the read a chimeric pair; 260

BWA considers all chimeric alignments of a read pair equally likely. As insert sizes from our 261

sequencing run were not normally distributed, BWA considers the reads with small insert sizes 262

chimeric and arbitrarily assigns them to one of the instances of a repeat. This resulted in the 263

misalignment of reads with small insert sizes at large exact repeats. 264

PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015

PrePrin

ts

Page 18: DiscoPlot: Discordant read visualisation - PeerJ · To address the problem of visualising growing sequencing 12 datasets, we have developed DiscoPlot, a tool for visualising read

Conclusions 265

Discoplot provides a simple, scalable method for visualising read alignments to a reference. It allows 266

the user to quickly identify genomic rearrangements, misassemblies and sequencing artefacts. A visual 267

approach is inherently sequence type agnostic, this allows the user to immediately analyse their 268

alignments without calculating the average length or standard deviation of their read library. This is 269

particularly useful when analysing complex libraries such as mate-pair reads that may contain a shadow 270

library of read pairs in the reverse orientation. We have shown how DiscoPlot can be used to quickly 271

identify genome rearrangements and misaligned reads in bacterial genomes. In future, additional 272

colours may be incorporated into to display other metrics such as average number of single nucleotide 273

variants per read in a bin. To make DiscoPlot more accessible, features only available through the use 274

of the command line could be incorporated into a graphical user interface. 275

Acknowledgments 276

The authors would also like to thank Professor Mark Schembri for making available the UTI89 Illumina 277

datasets used in this project. We would also like to thank Mitchell Stanton-Cook for helping make 278

DiscoPlot.py available through PyPI. 279

Funding 280

SAB is supported by a Career Development Fellowship from the National Health and Medical Research 281

Council Career of Australia [grant number APP1090456]. 282

References 283

Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. 2009. BLAST+: 284

architecture and applications. BMC bioinformatics 10: 421. 285

Carver T, Böhme U, Otto TD, Parkhill J, Berriman M. 2010. BamView: viewing mapped read 286

alignment data in the context of the reference sequence. Bioinformatics (Oxford, England) 287

26(5): 676-677. 288

PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015

PrePrin

ts

Page 19: DiscoPlot: Discordant read visualisation - PeerJ · To address the problem of visualising growing sequencing 12 datasets, we have developed DiscoPlot, a tool for visualising read

Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, McGrath SD, Wendl MC, 289

Zhang Q, Locke DP et al. 2009. BreakDancer: an algorithm for high-resolution mapping of 290

genomic structural variation. Nat Meth 6(9): 677-681. 291

Fiume M, Williams V, Brook A, Brudno M. 2010. Savant: genome browser for high-throughput 292

sequencing data. Bioinformatics 26(16): 1938-1944. 293

Forde BM, Ben Zakour NL, Stanton-Cook M, Phan M-D, Totsika M, Peters KM, Chan KG, Schembri 294

MA, Upton M, Beatson SA. 2014. The Complete Genome Sequence of Escherichia coli 295

EC958: A High Quality Reference Sequence for the Globally Disseminated Multidrug 296

Resistant E. coli O25b:H4-ST131 Clone. PloS one 9(8): e104400. 297

Hunt M, Kikuchi T, Sanders M, Newbold C, Berriman M, Otto T. 2013. REAPR: a universal tool for 298

genome assembly evaluation. Genome Biology 14(5): R47. 299

Hunter JD. 2007. Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering 9(3): 300

90-95. 301

Li H. 2013. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. In 302

ArXiv e-prints, Vol 1303, p. 3997. 303

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. 2009. 304

The Sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England) 305

25(16): 2078-2079. 306

Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, Amit I, Lajoie 307

BR, Sabo PJ, Dorschner MO et al. 2009. Comprehensive mapping of long-range interactions 308

reveals folding principles of the human genome. Science (New York, NY) 326(5950): 289-293. 309

McElroy K, Luciani F, Thomas T. 2012. GemSIM: general, error-model based simulator of next-310

generation sequencing data. BMC Genomics 13(1): 74. 311

Michaelson JJ, Sebat J. 2012. forestSV: structural variant discovery through statistical learning. Nat 312

Meth 9(8): 819-821. 313

Milne I, Stephen G, Bayer M, Cock PJA, Pritchard L, Cardle L, Shaw PD, Marshall D. 2013. Using 314

Tablet for visual exploration of second-generation sequencing data. Briefings in 315

bioinformatics 14(2): 193-202. 316

PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015

PrePrin

ts

Page 20: DiscoPlot: Discordant read visualisation - PeerJ · To address the problem of visualising growing sequencing 12 datasets, we have developed DiscoPlot, a tool for visualising read

Phillippy AM, Schatz MC, Pop M. 2008. Genome assembly forensics: finding the elusive mis-317

assembly. Genome Biol 9(3): R55. 318

Qi J, Zhao F. 2011. inGAP-sv: a novel scheme to identify and visualize structural variation from 319

paired end mapping data. Nucleic Acids Res 39(Web Server issue): W567-575. 320

Servant N, Lajoie BR, Nora EP, Giorgetti L, Chen CJ, Heard E, Dekker J, Barillot E. 2012. HiTC: 321

exploration of high-throughput 'C' experiments. Bioinformatics (Oxford, England) 28(21): 322

2843-2844. 323

Zeitouni B, Boeva V, Janoueix-Lerosey I, Loeillet S, Legoix-né P, Nicolas A, Delattre O, Barillot E. 324

2010. SVDetect: a tool to identify genomic structural variations from paired-end and mate-325

pair sequencing data. Bioinformatics (Oxford, England) 26(15): 1895-1896. 326

PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1038v1 | CC-BY 4.0 Open Access | rec: 4 May 2015, publ: 4 May 2015

PrePrin

ts