Improving somatic mutation detection in cancer · 2018. 2. 15. · can result in a knowledge gap...

ESSAY

Improving somatic mutation detection in cancer

Wouter Huiting, MSc

1769863

13-8-2016

Supervisor: Dr. C. van Diemen

2

Contents Preface ............................................................................................................................................................ 3

Abstract .......................................................................................................................................................... 4

1. Introduction .......................................................................................................................................... 5

1.1 The challenges of somatic variant calling in cancer ................................................................ 6

1.2 Goal of this essay ......................................................................................................................... 7

......................................................................................................................................................................... 9

2. Optimizing the workflow of tumour variant calling ..................................................................... 10

2.1 DNA extraction and shearing .................................................................................................. 10

2.2 Library preparation .................................................................................................................... 11

2.3 High-throughput sequencing: base-calling, depth and coverage ........................................ 13

2.4 Read quality assessment and pre-processing ......................................................................... 15

2.5 Read mapping ............................................................................................................................. 16

2.6 Variant calling, annotation and prioritization ........................................................................ 19

3. Discussion ........................................................................................................................................... 21

4. References ........................................................................................................................................... 24

3

Preface

This essay was written in the spring of 2016 in the context of my Master education Behavioral

and Cognitive Neuroscience at the University of Groningen. The execution of this essay was

supervised by Dr. C. van Diemen, whom I would like to thank for the opportunity.

4

Abstract

Somatic variation analysis is important to help us understand the onset and progression of

cancer. Unfortunately, although next-generation sequencing technology has advanced rapidly

over the last decade, most NGS strategies still prove inadequate to accurately grasp the molecular

complexity of cancer, a fact that is largely the result of intratumour heterogeneity. In addition, the

high-throughput of modern NGS platforms has also left us with the difficult task of managing

and analysing vast amounts of data, forcing researchers to rely heavily on the use of

bioinformatics. Advances in experimental and computational techniques designed to cope with

these challenges occur quickly. The result is a rapidly evolving workflow of somatic tumour

variant analysis. It is important that the entire cancer research community is informed regularly

on these advances. Here I try to provide an overview of the complete workflow of tumour

variant analysis in a way that is relevant to all those involved in cancer research. In addition, I

highlight several weak links in this workflow and provide recommendations on how to cope with

them.

5

1. Introduction

Since the dawn of next-generation sequencing (NGS) more than a decade ago, the sequencing

technology has evolved immensely, driving speed and throughput to an unprecedented level

(Goodwin et al. 2016). The financial costs of sequencing an entire human genome have been

brought down from roughly US$10 million to little over a thousand dollars in just a few years’

time (https://www.genome.gov/27541954/dna-sequencing-costs-data). NGS technologies are

now widely used in laboratories around the world, and many acclaim their potential as a

diagnostic tool (Katsanis et al. 2013; Vrijenhoek et al. 2015). The development and widespread

adoption of NGS technology would not have been possible without the efforts of Frederic

Sanger and his colleagues almost forty years ago. In 1977, it was Sanger who introduced the first

automated sequencing method based on his ground breaking ‘chain-termination’ technique. Due

to its high accuracy (e.g. through long read lengths) and ease of use, Sanger sequencing would

become the dominant DNA sequencing technology for the next decades. Around the turn of the

millennium fundamentally different sequencing methods started to spring up, collectively referred

to as ‘next-generation sequencing’ (NGS) techniques. Aided by the development of new

technologies such as high-resolution imaging, these exciting methods offered several advantages

over Sanger sequencing. Most importantly, they all allowed mass parallelisation of sequencing,

greatly improving throughput (Shendure, Ji. 2008). See Box 1 for an overview of the principles

behind Sanger and modern NGS platforms; an in depth overview lies beyond the scope of this

essay (for an excellent recent review see for example Goodwin et al. 2016).

One field that has particularly benefited from this development is cancer research. Cancer

comprises a group of diseases that arise from ‘a clone that has accumulated the requisite

somatically-acquired genetic aberrations, leading to malignant transformation’ (Stratton. 2011;

Watson et al. 2013). The development of effective therapies for cancer patients requires a

comprehensive assessment of the role of these somatic variants in tumour formation (Wang et al.

2013). Before the advent of NGS, studies relied on a costly and low-throughput workflow of

PCR amplification followed by Sanger sequencing to identify candidate cancer drivers. When

NGS technologies became widely used, this meant that the cancer research field could initiate

systematic sequencing ‘screens’ to identify such somatic variants much more effectively. NGS

studies have since then contributed greatly to our understanding of the mutations that drive

tumourigenesis in different tumour types (Watson et al. 2013). However, accurate somatic

mutation ‘calling’ remains highly challenging still.

https://www.genome.gov/27541954/dna-sequencing-costs-data

6

1.1 The challenges of somatic variant calling in cancer

True somatic variant are very difficult to distinguish. Besides the errors in the sequencing process

itself, e.g. contaminations and sequencing errors, there are issues inherent to the short-read

nature of NGS: amplification bias and ambiguities in short read mapping, to name a few (Wang et

al. 2013). However, most of these problems arise with every NGS-project. What makes somatic

cancer variants in particular difficult to identify is the intratumour heterogeneity (Fidler. 1978;

Marusyk et al. 2012). As tumours evolve from single cells, clonal lineages begin to diverse, giving

rise to distinct subpopulations within the tumour (Gerlinger et al. 2012). It is this genomic

diversity that drives tumour proliferation, enabling cancer cells to survive a range of selective

pressures from the tumour’s microenvironment: pH, hypoxia, therapy and others. (Davis, Navin.

2016). The result is a highly resistant tumour in which variants are non-uniformly present

(Landau et al. 2013), with variant allele frequencies (VAFs) as low as 5% having been reported

(Carter et al. 2012). Several sequencing strategies have been employed to disentangle this

heterogeneity, and they can be divided into two different types of strategies. Methods belonging

to the first type aim to sample from different places in the tumour and sequence them

simultaneously. This can be executed with different macroscopic regions of the tumour mass, a

method referred to as multiregion sequencing (Gerlinger et al. 2012; Yates et al. 2015), but also

on a microscopic scale using a range of single cells (Navin et al. 2011; Xu et al. 2012). A related

technique is to fluorescently label and sort cells from the tumour to achieve homogenous

subpopulations of cancer cells (Bolognesi et al. 2016). The latter method has the added benefit of

removing healthy cells from further analysis. The second strategy is to sequence ultra-deep (Nik-

Zainal et al. 2012). As I will discuss later, this strategy has shown great promise of substantially

improving the identification of tumour variants, even with very low VAFs (Griffith et al. 2015).

Although choosing the appropriate sequencing method is key, other aspects have to be

taken into account as well. Among them are practical considerations, for example the choice of a

valid control input to be able to accurately distinguish germline variation from true somatic

variants. Others are more strategic, like the choice to include or omit a PCR amplification step.

As we will see further down, all of these can influence the outcome of the tumour variant

analysis. One element in particular in which there is still significant room for improvement is the

use of bioinformatics tools in the various data analysis steps of somatic variant calling. Modern

NGS platforms generate large amounts of heterogeneous data, with higher error rates and

(generally) shorter read lengths compared to Sanger sequencing platforms (see Box 1c). Because

of this, NGS poses considerable challenges for data management and computational analysis

(Schadt et al. 2010). Consequently, somatic variant analysis is forced to rely heavily on

7

bioinformatics (Pabinger et al. 2013). A wide range of algorithms has been written to aid in

specific parts of the data analysis (Li, Homer. 2010; Bao et al. 2014), but the fact that there are so

many makes choosing the right toolset a strenuous task, especially for the less experienced user

(Pabinger et al. 2013).

1.2 Goal of this essay

Further advances in somatic variation analysis are important to help us understand the onset and

progression of cancer. As existing sequencing technologies are constantly improved, and new

technologies emerge at a fast pace, it is key that the cancer research field is informed regularly on

new methods, tools etc. Indeed, excellent reviews are periodically written on how to optimize

variant calling, but these often highlight only one aspect of the entire workflow. Moreover, the

majority of these reviews is directed at geneticists and bioinformaticians (see for instance Alioto et

al. 2015 or Griffith et al. 2015). This makes it challenging for clinical researchers and cell

biologists to keep track of the advances in somatic variant calling methods in cancer. As NGS

techniques and the associated data analysis are becoming more and more complex, the danger of

a loss of crosstalk between bioinformatics and (clinically driven) biological research arises. This

can result in a knowledge gap that will directly affect the clinical impact of NGS. Such a gap

could prevent clinicians and researchers from smaller labs to exploit the power of NGS

technology in the future. To prevent such a schism from happening it is crucial that we bring

these fields closer to each other. For these reasons, I here present in interdisciplinary review of

the entire workflow of tumour variant calling, starting from the bench-work. For clarity, I focus

on several elements and principles that have arguably the largest effect on the accuracy of somatic

variant calls, including library preparation, sequencing depth and coverage, as well as the choice

of mappers and callers; other factors will be covered more briefly.

9

Box 1 (continued)

d. Sanger and Illumina sequencing

(I) Sanger sequencing workflow. (II-IV) General workflow of Illumina sequencing, the current

marketleader in NGS platforms. See also Box 1a. Images taken from Mardis, 2013. For additional

information, the reader is referred to the excellent recent reviews by Goodwin et al. (2016)

Template-strand

5’ 3’ 3’ 5’

5’

5’

5’

5’

5’

5’

5’

Direction of

electrophoresis

3’

T

C

G

A

A

T

C

5’

Direction of

sequence read

-

-

- - - -

5’ 3’

5’

10

2. Optimizing the workflow of tumour variant calling

2.1 DNA extraction and shearing

Tumour DNA is typically obtained from formaldehyde fixed-paraffin embedded (FFPE) tumour

samples, the standard preservation format for diagnostic surgical pathology (Kokkat et al. 2013).

In parallel to sampling the tumour, DNA is extracted from peripheral blood to serve as a

‘normal’ genome. This allows true somatic variants in the tumour to be discriminated from the

host’s germline variants later in the workflow (Shah et al. 2009; see also Figure 2). If no peripheral

blood is available, DNA from normal tissue surrounding the tumour in the FFPE sample is

sometimes taken as a control (Pleasance et al. 2010). Importantly, choosing the proper source for

a healthy control is no trivial task. For instance, peripheral blood often contains circulating

tumour cells (Pantel, Speicher. 2015), and tissue surrounding the tumour may appear to be

healthy but can in fact harbour cells with significant genomic defects (Sadanandam et al. 2012;

Troester et al. 2016). Both situations are unwanted, as they prevent the accurate identification of

tumour-related somatic alterations. Another potential problem stems from the notion that the

fixation and embedding needed to make FFPE tumour samples can damage the DNA (Ben-Ezra

et al. 1991; Williams et al. 1999). This not only reduces the amount of DNA available for

sequencing, it can also lead to inaccurate variant calling (Akbari et al. 2005). Pre-analytical

molecular sample characterization has been proposed to correct for these problems (Sah et al.

2013).

Next, the obtained DNA has to be sheared into the short oligonucleotides - typically tens

or hundreds of nucleotides long - required for high-throughput NGS (Poptsova et al. 2014).

Although several DNA shearing methods exist, all of them rely on one of two principles:

shearing by mechanical force or shearing through enzymatic fragmentation (Knierim et al.

2011).Few studies have investigated the effects of the various DNA fragmentation protocols on

variant calling accuracy, as it has long been assumed that shearing occurs in a random (i.e.

unbiased) manner. Indeed, there is data supporting the idea that DNA fragmentation is random,

regardless of the method employed, leading some to suggest that ‘a fragmentation method can be

chosen solely according to lab facilities, feasibility and experimental design’ (Knierim et al. 2011).

However, several studies suggest that DNA shearing is in fact subject to a ‘fragmentation bias’, as

both enzymatic- and mechanical fragmentation were shown to be sequence specific (Hansen et al.

2010; Grokhovsky et al. 2011). In an interesting study from 2014, Poptsova and coworkers

showed that ultrasound shearing of genomic DNA can cause an amplified (i.e. higher than

chance) cleavage of GC-rich areas, likely as a result of local variations in DNA structural

dynamics. Regardless of the exact sequence that is preferred, a fragmentation bias will result in a

11

non-random distribution of fragment lengths and sequence ends, which in turn leads to a non-

uniform read coverage after the alignment step (see section 3.3; Finotello et al. 2014). It is

important that sequencing studies are aware of this problem, and try to correct for it. The

solution could be a two-step fragmentation protocol, although that has only been validated for

ChIP-seq (Mokry et al. 2010). Alternatively, it might be possible to negate the fragmentation bias

by adding specific chemical agents in the shearing step (Grokhovsky et al. 2011).

2.2 Library preparation

An important principle of library preparation is ‘library complexity’, referring to the number of

unique fragments present in the library. The main goal when preparing any sequencing library is

to make it as complex (i.e. diverse) as possible, so that it accurately reflects the complexity of the

original genetic sequence (Head et al. 2014), and at the same time avoid bias (Van Dijk et al. 2014).

Once the dsDNA is properly fragmented, the sequencing library can be made. In this

process oligonucleotide adapters (specific to the NGS platform used) are attached to the ends of

the fragments, preparing them for sequencing. Fragment ends are processed before the adapters

can be ligated to generate the blunt-ended fragments that are required for adapter ligation (Head

et al. 2014). End-processing is a two-step process that starts with the enzymatic blunting and 5’

phosphorylation of both sides of the fragment, followed by the addition of an adenine nucleotide

to the 3’ ends. This A-tail not only reduces the risk of fragment-chimeras, but also facilitates the

ligation of the T-tailed adapter oligonucleotides (Quail et al. 2008). This is a crucial step in the

variant calling workflow, as the use of different protocols can result in marked variation in variant

calling effectiveness (see Rhodes et al. 2014; Alioto et al. 2015).

After end-repair and A-tailing, adapters are ligated to the fragments. These adapters are a

crucial part of library preparation as they hybridize to sequencing primers during the sequencing

reaction. Importantly, the choice of adapters not only depends on the NGS platform used, but

also on whether single-end or paired-end reads are pursued (see further down). As a result of

imperfect end-repair or A-tailing, artefacts can arise in the library, including adapter dimers. As

these dimers form clusters very efficiently during sequencing, they take up valuable space and

thus waste the capacity of the sequencing platform (Head et al. 2014). A size-selection step right

after shearing (but before adapter ligation), has been proposed as a means of reducing the effect

of adapter-chimeras (Quail et al. 2009). This also yields a tighter distribution of fragment sizes

resulting in a more homogenous PCR amplification, which in turn will provide a more uniform

read coverage after the alignment step (see section 3.3; Quail et al. 2009).

Standard NGS library preparation protocols then rely on a PCR step to amplify their

library (i.e. enriching for properly ligated fragments) before sequencing (Kebschul, Zador. 2015).

12

The need for this stems largely from the notion that libraries should be carefully quantified

before sequencing commences (see further down), and most quantification protocols require

large amounts of DNA to ensure accurate titration (Meyer et al. 2008; Parkinson et al. 2012).

Unfortunately, PCR is an inherently sensitive process: it not only skews data (and thus reduce

library complexity) but it can also introduce hybrid or erroneous sequences into the library (Aird

et al. 2011). Different factors are thought to underlie these PCR-induced imperfections, among

which fragment length, content dependent amplification of sequences, template switching and

even the intrinsic stochasticity of PCR (Kebschul, Zador. 2015). Variation in the PCR parameters

(temperature, polymerase, buffer) plays a substantial role in this, which can be exacerbated with

every PCR cycle (Dabney, Meyer. 2012; Meyer, Liu. 2014). Because of these issues with PCR

amplification, some studies opt to exclude this step from library preparation (Kozarewa et al.

2009; Quail et al. 2009). The idea behind this is that eliminating the PCR amplification leads to

improved coverage of regions with a high GC content and reduces the amount of duplicate reads

after sequencing (Kozarewa et al. 2009). Omitting PCR indeed results in a more homogenous

distribution of reads and can thus improve variant detection in cancer (Alioto et al. 2015).

However, a PCR-free approach does require a larger amount of starting material, and as FFPE

samples generally yield small quantities of DNA, a PCR-free approach to identify somatic tumour

variants might not always be possible (Luthra et al. 2015).

The last step of library preparation is quantification, during which the precise amount of

adapter ligated fragments present in the library is evaluated. This allows the researcher to load the

correct amount of sample onto the sequencing station (Liu et al. 2012; Loman et al. 2012). This is

important, as sequencing experiments performed with too many or too few correctly ligated

library fragments can yield poor data quality (Laurie et al. 2013). Several methods for library

quantification exist, but the most widespread method is real-time quantitative PCR (qPCR). This

is mainly due to its ability to assess only the amount of adapter ligated fragments (Buehler et al.

2010). Unfortunately, qPCR has also considerable disadvantages that can compromise variant

calling. Template size and sequence content can result in an amplification bias just like with a

regular PCR (Valasek, Repa. 2005). In addition, qPCR demands that a standard curve is created

for each sample, a laborious process that is prone to inaccuracies (Yun et al. 2006). To overcome

these problems White and coworkers developed a digital emulsion PCR method to quantify

libraries (White et al. 2009). This method is based on the (massively parallel) fluorescent detection

of a probe oligonucleotide (e.g. TaqMan) added to the adapter ligated fragments. A droplet digital

PCR (ddPCR) device (e.g. QX100®, Bio-Rad) first generates thousands of droplets; most are

empty, but some contain a single amplified fragment of DNA. Next, droplets are counted and

13

assessed for fluorescence (all or nothing, so binary read-out), after which a the total number of

input molecules can be calculated (White et al. 2009). DdPCR offers superior sensitivity and

stability over convential qPCR quantification (Robin et al. 2016).

2.3 High-throughput sequencing: base-calling, depth and coverage

While new NGS instruments are being developed at an astonishing pace (Goodwin et al. 2016),

the accuracy and speed of the main NGS platforms currently in use is also constantly improved.

Several elements are of particular importance in this respect.

Base-calling algorithms turn the sequencer’s output (fluorescence in the case of Illumina,

current changes in the case of Thermo Fisher) into a base-call. However, due to the inevitable

imperfections in sequencing chemistry and signal detection, errors in base-calling can arise. For

instance, Illumina technology suffers from a number of biases owing to its technology, including

phasing (and prephasing), signal decay and cross-talk (see Figure 1; Cacho et al. 2015). The

standard Illumina base-calling algorithm, ‘Bustard’, is able to reduce the effects of these

uncertainties on base-call accuracy by explicitly modelling these biases. However, the error rates

in Bustard’s calls can still be significantly improved (up to 30%; Nielsen et al. 2011) by more

sophisticated algorithms like BlindCall and freelbis (Das, Vikalo. 2013; Renaud et al. 2013; Ye et

al. 2014). An in depth description of the mathematical principles used by these base-callers lies far

beyond the scope of this report, but an excellent review of recently developed base-callers was

recently written by Cacho and coworkers (2015). Importantly, the use of these advanced

statistical tools instead of Bustard was shown to significantly reduce false positive SNP calls in

tumour variant analysis (Nielsen et al. 2011).

Two other important concepts that are crucial for accurate somatic mutation detection in

cancer are sequencing depth and coverage. Depth and coverage are two highly related terms that

are frequently used interchangeably. Some use coverage to describe the breadth of sequence

coverage, i.e. the percentage of the target genome that is sequenced a given number of times.

Depth can then be thought of as the ‘redundancy of coverage’ (Sims et. 2014), often denoted as n

x (e.g. 30x depth). The importance of sequencing depth becomes instantly clear when one

considers the internal error rate of high-throughput, short read sequencing - without sufficient

depth, it is impossible to distinguish sequencing mistakes from real sequence variants. In

addition, a uniform coverage is needed to eliminate the underrepresentation of SNPs in specific

regions (for instance regions with high GC content) of the genome. Accordingly, an increased

depth and uniformity of coverage can rescue sequencing errors (Sims et. 2014).

14

Figure 1. Commonly modelled base-calling errors for the Illumina platform

(a) Scaled cytosine (C) intensity versus cycle of a single read. A spike indicates a potential C nucleotide at that

position. Phasing is observed as an anticipation signal in the cycle before a C (left arrow) and after (right arrows). It

occurs during the sequencing process when one or more strands within a cluster fail to incorporate the next base in

the read. The reads start lagging behind, distorting the fluorescence emissions. Prephasing occurs when two bases are

incorporated in a single cycle. (b) Maximum intensity (signal) and median intensity (noise) plotted against cycle.

During the sequencing of the complementary strand, some material may be lost , causing a decreased signal to noise

ratio known as signal decay. (c) Intensity versus fluorophore emission spectrum. The spectrum of the guanine (G)

fluorophore bleeds into the optimal spectrum of the thymine (T) filter. Thus, when a G fluorophore is excited, a T

signal will also be detected. This causes a positive correlation between the intensities of these two channels, a

phenomenon known as cross-talk. (d) Two-dimensional histogram of intensity data of the T channel versus G

channel. The G fluorophores (right arrow) transmit to the T channel, hence the positive linearity. However, the T

fluorophores do not transmit to the G channel. A similar situation occurs with A and C channels (not shown). Figure

and text adapted from Cacho et al. 2015.

For the accurate detection of germline single nucleotide variants (SNPs) a 30x average depth in

95% of the genome was shown to be sufficient (Ajay et al. 2011). Most cancer genomes (and

‘normal’ control genomes) are sequenced to comparable depths (Mardis et al. 2012; Borad et al.

2014), as previous studies indicated that depths of 15x-50x are sufficient to detect all SNPs and

15

small indels (Bentley et al. 2008; Ajay et al. 2011). However, these estimates are largely based on

high-purity tumours, while in fact most tumours exhibit severe heterogeneity (as discussed

earlier). As a result, a somatic tumour variant can have a VAF of 5% (Biankin et al. 2012) or even

lower. Such rare variant are unlikely to be picked up with a sequencing depth of 30x. A recent

study by Griffith and coworkers (2015) showed that a 30x-50x depth for whole genome

sequencing is indeed insufficient for adequate variant identification in the face of sample

contamination, aneuploidy or even moderate intratumour heterogeneity. Instead, they

recommend a depth of 500x-1000x for the discovery of novel variants, especially for those with

VAFs < 10%. Performing such ultra-deep sequencing (Wagle et al. 2012) of both the tumour and

normal genomes on current NGS platforms is extremely costly, and so a 100x-300x depth has

been proposed as a compromise (Alioto et al. 2015). In addition, it is also important that the ratio

of tumour : normal depth is kept as close to one as possible, and at least within a 10% range, as

this appears to reduce the amount of false positives (Alioto et al. 2015).

2.4 Read quality assessment and pre-processing

Upon completion of the desired amount of sequencing runs, the output first needs to be

evaluated for its quality (Pabinger et al. 2013). Modern high-throughput sequencing platforms spit

out several millions of short DNA reads with every run (Goodwin et al. 2016). Importantly, not

all of these reads meet the predefined standards as they generally contain several sequencing

artefacts (Dai et al. 2010). These errors have to be removed by trimming and filtering the reads. A

range of tools has been developed to execute the different steps of quality assessment and

subsequent pre-processing (Pabinger et al. 2013). As this is a complex process to grasp fully

without prior bioinformatics training, only the basic principles will be discussed here.

The first step entails the visualization of base quality scores included in the output of the

NGS platform. The output (that is in the text-based FASTQ format) contains not only the

predicted sequence, but also contains for every base the estimated probability of an erroneous

call (as discussed earlier). These error-probability values (P), ranging from 10% to 0.0001%, are

converted into standard quality scores or ‘Phred scores’ by calculating the -10 log (P). As a result,

the Phred scores form an array that is equal in length to the array of base calls, with values

typically ranging from 10 to 40 (in ASCII format, see http://blog.nextgenetics.net/?e=33). Note

that a 0.1% error rate in base calling translates into a Phred score of 10; higher Phred scores

mean higher estimated accuracy (Nielsen et al. 2011). Programs like FastQC (Andrews. 2010)

process these output files and produce graphical summary reports allowing one to quickly assess

the quality of the data. Next, the reads are trimmed and filtered based on both the quality scores

and sequence properties (Pabinger et al. 2013). This is important as read quality is often not

http://blog.nextgenetics.net/?e=33

16

consistent over the entire length of a read (Huse et al. 2007; Dohm et al. 2008), and ignoring low

quality base calls hampers downstream variant calling accuracy (Olson et al. 2015). Trimmomatic,

PRINSEQ and other tools can perform various trimming tasks, among which adapter and primer

trimming and 3’ and 5’ low quality stretch trimming. It is important to also remove reads that do

not meet a minimum average base quality, as well as reads that exceed the minimum or maximum

read length threshold. ClinQC integrates multiple quality control tools, making it a highly useful

tool for clinical research (Pandey et al. 2016). Importantly, performing these various trimming and

filtering steps was shown to result in significant improvements in variant calling (Del Fabbro et al.

2013).

2.5 Read mapping

After preprocessing and quality assessment, the reads are ready for further downstream analysis.

The classic approach used by many variant analysis projects is to align (‘map’) the reads of both

the tumour and the normal sample to a validated human reference genome like GRCh37 or the

newer GRCh38 (http://www.ncbi.nlm.nih.gov/project/genome/assembly/grc/human; see also

Figure 2). This process, commonly referred to as ‘resequencing’, entails multiple complex

bioinformatical steps (an alternative is to computationally ‘stitch’ the reads together, a process

referred to as de novo assembly; see Box 2). These steps have to overcome technical hurdles that

are collectively dubbed the ‘read-mapping problem’ (Trapnell, Salzberg. 2009). The core of this

read-mapping problem is two-fold. In a practical sense, aligning billions of short sequences to a

large genome requires highly efficient algorithms in the absence of extreme computational power

(a PC only offers so much bits of memory). A more strategic problem stems from the complexity

and heterogeneity of the human genome. Chromosomes are not just simple arrays of nucleotides

in which variation consists of the occasional single SNP. On the contrary, the genomic sequence

carries insertions and deletions (indels), translocations, inversions, duplications and copy number

variants (CNVs) (Feuk et al. 2006), and these structural variants likely account for ten times more

variation among human genomes than SNPs do (Pang et al. 2010). As a consequence,

chromosomes differ from person to person (Baker. 2012), and in somatic tissue even from cell to

cell (Astolfi et al. 2010; O’Huallachain et al. 2012). This heterogeneity is exacerbated greatly in

cancer genomes due to their so-called ‘mutator phenotype’ (Albertson et al. 2009; Loeb. 2016).

While the short reads generated by modern high-throughput sequencers are great for picking up

point mutations, they are difficult to work with in the face of (large) structural rearrangements

(Trapnell, Salzberg. 2009). Indeed, NGS technologies have long been ‘biased towards typing

unique tags in the genome’ (Baker. 2012).

http://www.ncbi.nlm.nih.gov/project/genome/assembly/grc/human

17

A range of alignment algorithms (‘mappers’) have been developed over the last years to

tackle the read-mapping problem, including Bowtie (Langmead et al. 2009), BWA (Li, Durbin.

2009) and SOAP/SOAP2 (Li, Yu, Li et al. 2009), to name just a few. Instead of describing for

each of these how they work and how they can be wielded to optimize tumour variant calling, I

try here to provide some general focus points. The first thing to realise when executing a tumour

resequencing project is that it is very important to use an appropriate reference sequence to align

the tumour and normal reads to (Alioto et al. 2015). At this moment the main source of human

reference genomes is the Genome Reference Consortium (GRC). The GRC keeps updating the

human reference assembly, as even the most recent human genome build (GRCh38) contains

gaps, particularly around the centromeres (Chaisson et al. 2015). As a result, rare reads that belong

somewhere else are mapped to the wrong place in the genome (the true location is missing),

leading to local read pile-ups i.e. false positives. This phenomenon can be mitigated by providing

Figure 2. Analysis of tumour and matched normal DNA

Distinguishing somatic tumour variants from germline variants requires the parallel sequencing of DNA from

tumour tissue and DNA from normal tissue. Here, peripheral blood is used as a ‘normal’ (see also section 2.1). After

sequencing, the reads are mapped to the human reference genome (in green). Discrepancies observed in both

samples are germline variants (in this example heterozygous), whereas those observed only in the tumour sample are

inferred to be somatic variants.

a ‘decoy’ sequence additional to the reference genome. A decoy is made up out of sequences

known to be absent from the reference. By including it, rare reads are ‘scavenged’ from the

reference (Li. 2014). An extra control step is to filter the alignment for so-called ‘blacklisted sites’

in the genome. These are sites known to suffer from extensive read pile-up, and they should be

18

excluded from further downstream analysis (Miga et al. 2015). Indeed, the use of decoy sequences

and blacklists can reduce false positives in somatic mutation detection (Miga et al. 2015; Alioto et

al. 2015). To reduce false positives it is also key that reads mapped with many errors are filtered

out (Pabinger et al. 2013).

It is also recommended to use paired-end or mate-paired reads (Pabinger et al. 2013).

While in single-end sequencing short fragments are read only from one end, in paired-end

sequencing both ends of longer fragments are read (Volik et al. 2003). The result is a collection of

paired reads separated by a known distance, so sequence as well as relative positional data can be

inferred. This technique has proven to be very useful in detecting so-called ‘copy-neutral

rearrangements’ in cancer genomes (Bashir et al. 2008; Oesper et al. 2012). Paired-end sequencing

also helps to map reads over repetitive regions more precisely* (Treangen et al. 2011).

Finally, it is important to choose the right alignment software. For this it is wise to

consider the NGS platform that was used, as their output sometimes requires the use of specific

algorithms (Luthra et al. 2015). However, it is far more important to choose an alignment tool

based on the application at hand. Not only do the various algorithms often differ in their ability

to pick up specific genetic alterations, but in addition, they tend to suffer from a significant trade-

off between speed and accuracy (Ruffalo et al. 2011). Indeed, the use of different alignment tools

can clearly impact variant calling (Griffith et al. 2015). For alignment in the context of a tumour

variant workflow, the tools Bowtie2, Novoalign and GMAP appear to be valid choices, as they

show a high accuracy in picking up a range of genomic variations. However, Novoalign is much

slower than the two others, most likely because it is based on a different alignment algorithm

(Bao et al. 2014). Indeed, this speed-accuracy trade-off is something that has to be considered,

especially in a clinical setting. As choosing an appropriate alignment tool is clearly a strenuous

task, it is recommended to use multiple alignment strategies in parallel (Griffith et al. 2015). By

assuming that a consensus in alignment has a higher likelihood of being correct, one can increase

the accuracy of the variant calling workflow (Goode et al. 2013).

One post-alignment processing procedure that should also be mentioned here is the

removal of PCR duplicates. PCR duplicates are reads of the exact same length and sequence

identity that arise during library amplification. These duplicates consequently align with the exact

same mapping coordinates. As a result of PCR bias, some reads are amplified much more than

others, resulting in heterogeneous coverage (Aird et al. 2011), as described earlier. To correct for

this bias it is common practice to remove excess duplicate reads after the alignment step

*A thorough discussion on paired-end sequencing is not provided here; for an excellent review see for example Risca, Greenleaf (2015)

19

(DePristo et al. 2011). Importantly, duplicate removal should not be performed carelessly, as

overcorrecting read counts can produce flawed variant calls (Zhou et al. 2014).

2.6 Variant calling, annotation and prioritization

The last step of the tumour variant analysis workflow is variant calling, followed by variant

annotation and prioritization (Pabinger et al. 2013). By comparing the aligned reads of the tumour

and normal samples with each other, and with a reference genome, a range of somatic tumour

variants can be detected (see Figure 2). It is important that a distinction between germline and

somatic variant is made as these variants frequently play different roles in tumour development

and progression (Pujana. 2014). Like read alignment, variant calling and annotation relies heavily

on the use of bioinformatics - GATK and SAMtools are well known ‘callers’ (Li, Handsaker et al.

2009; McKenna et al. 2010), and some tools like Strelka (Saunders et al. 2012) are specifically

designed to pick up somatic variants. These programs employ different algorithms to identify

candidate variants (Altmann et al. 2012). Basic tools identify variants when the number of high

confidence base-calls that disagree with the reference base exceeds a certain threshold; more

refined tools also take into account strand bias and the quality of neighbouring base-calls (Olson

et al. 2015). Importantly, most commonly used variant callers appear ill-suited to handle ultra-

deep sequencing data. This is likely the result of an ‘over-training’ of parameters and filtering

procedures towards a 30x-40x tumour-normal pair (Griffith et al. 2015). As the performance of

different variant callers can change in different settings (as a result of different algorithm

parameters), ‘the selection of an appropriate algorithm should be driven by each experiment’s

design’ (Griffith et al. 2015). Recently a new variant caller, VarDict was developed specifically for

20

ultra-deep sequencing (Lai et al. 2016). Choosing the right caller is a crucial element in any variant

analysis workflow: callers have to be stringent enough to control false positive calls, but not too

stringent as this will result in false negatives (Olson et al. 2015).

Another factor that has to be taken into account when choosing a caller is the mapper

used in the alignment step. Recently, Alioto and co-workers showed that certain mapper-caller

combinations show a much higher compatibility than others (Alioto et al. 2015). If possible, it is

recommended to use multiple variant callers to correct for this cross-talk, as this was shown to

improve performance of somatic variant calling (for the same reason as using multiple mappers;

Bao et al. 2014; Alioto et al. 2015).

After variants have been called, the data enters another bioinformatics pipeline in which

variants are annotated for clinical relevance. The sheer amount of data generated during the

tumour variant analysis means that manually performing this step would be a long and difficult

process (Dienstmann et al. 2014). For this reason, again a range of software tools has been

developed to streamline this process. Together, these tools help to stepwise filter out calls that

are known to be irrelevant while prioritizing those with the largest clinical significance. First the

less reliable and common variant calls are removed, including those with low coverage, low

quality and those supported by a low-confidence read alignment (Patel et al. 2014). The remaining

variants can then be prioritized relative to the disease and the genomic context (Bao et al. 2014).

In this step, the aim is to identify those somatic tumour variants that can ‘confer diagnostic,

prognostic, or treatment-related information’ (Sukhai et al. 2016). For an overview of the tools

developed to perform these steps see Pabinger et al. 2013 or Bao et al. 2014. A particularly

powerful toolset in this respect is ANNOVAR. ANNOVAR integrates public databases that

store detailed information on possible variants, including experimental and/or clinical evidence,

as well as detailed genomics data from for instance the ENCODE project (Wang et al. 2010). By

doing so, it offers a complete annotation and prioritization of tumour variants, helping clinical

researchers to interpret the data and make an informed decision regarding treatment (Yang,

Wang. 2015). In addition, it is important that findings are communicated with the clinical

oncology practice in a clear and timely fashion. A recent review by Dienstmann and coworkers

describes not only a systematic approach to variant annotation and prioritization, but also

proposes to use a structured format of cancer pathology reports (Dienstmann et al. 2014). It is

initiatives like this that will help us to exploit the enormous potential of big genomics data

(Eisenstein. 2015) in the fight against cancer.

21

3. Discussion

In order to better understand cancer and tumourigenesis it is key that the somatic variants

underlying the disease are characterized well. Tumour variant analysis is however no trivial task,

in particular because of intratumour heterogeneity. At the same time, deciphering this

heterogeneity is one of the primary goals of cancer research, as it is thought to be a driving force

of tumour proliferation and resistance to therapy. Designing an efficient (and cost-effective)

tumour variant analysis workflow is therefore challenging. Researchers do not only have to

decide how to perform the bench-work, but they are also asked to find the appropriate tools to

support their specific NGS data analysis. It is important that all those involved understand how

the various components of the somatic variant analysis workflow are executed, and where there is

room for improvement. Indeed, errors that are picked up only in the final stages of variant calling

might arise at the bench.

The overview provided here points out several key points for improvement of the

somatic tumour variant workflow (see also Table 1). Some of these recommendations are

beginning to be followed up in cancer studies, for instance the use of combined analysis tools:

although researchers have long relied on the use of a single caller, in recent years more studies

have instead combined the output of multiple tools (Field et al. 2015). As pipelines employing

multiple callers significantly outperform those that don’t in terms of accuracy (Alioto et al. 2015),

this trend will likely benefit cancer research greatly. Importantly, the same holds true for the use

of multiple alignment tools (Griffith et al. 2015). Another positive development is the increased

number of studies that use ultra-deep sequencing for tumour variant analysis. However, as the

costs of whole-genome sequencing are still substantial, many researchers opt to get higher depths

by sequencing only parts of the genome. One technique is whole exome sequencing (WES).

Although WES is now a routinely used instrument to detect genetic variation in humans

(Koboldt et al. 2013), several recent studies show that sequencing the entire genome (WGS) is

more accurate than WES when it comes to detecting somatic variants (Fang et al. 2014; Meynert

et al. 2014). The reason for this is that WES produces a more heterogeneous read coverage than

WGS, likely because it is subject to higher levels of sequencing bias (Veal et al. 2012; Belkadi et al.

2015). Nonetheless, WES is still a powerful, and not to mention cheaper method. A technique

that is perhaps clinically more relevant than WES is targeted resequencing of mutational

‘hotspots’ of tumours (Mamanova et al. 2010; Gerstung et al. 2012). With this technique a

restricted gene panel can be sequenced ultra-deep, allowing one to pick up low VAF somatic

mutations in known causative genes (Agrawal et al. 2011). Importantly, both WES and targeted

sequencing do not allow for a complete characterization of a tumour genome (Griffith et al.

22

2015): the first does not capture variation in non-coding DNA*, and the latter sequences only

user-specified genomic stretches. Even with very high depth sequencing it can be extremely

difficult to identify low VAF SNPs (Griffith et al. 2015). The required depth depends heavily on

the complexity of the sequencing library. The question of how much additional information can

be gained when a specific library is sequenced deeper is therefore highly relevant. Recently, Daley

and Smith presented a computational method that can quantify library complexity. Such a

method is likely to prove very useful in controlling the costs of routine tumour variant analysis in

the clinic (Daley, Smith. 2013).

Table 1: the main recommendation to optimize tumour variant calling

Workflow component Recommendations

DNA extraction & shearing - Be aware of poor DNA yield and/or quality when extracting from FFPE

samples. Employ pre-analytical sample characterization

Library preparation - Size-selection after shearing

- PCR free preparation if possible

- Use Kapa HiFI polymerase

- Optimize temperatures and/or duration of PCR steps

- Size selection after PCR amplification

- Use ddPCR to quantify library

- Use paired-end reads

Depth and coverage - Sequence deep (100-300x) , especially with variants of low expected (< 10%)

VAF

- Keep tumour : normal depth ratio close to one

Quality assessment - Use multiple software tools, or a complete toolset like ClinQC

Read mapping - Pick most complete validated reference genome (GRCh37 or GRCh38)

- Use decoy sequences and blacklisted sites

- Use multiple mapping tools

- Use PCR duplicate removal with caution

Variant calling

- Optimize mapper-caller combination / use multiple callers

23

* Non-coding DNA is rapidly shedding its ‘junk-DNA’ nickname as a significant amount of it appears to have some

functional role (Flintoft, 2005; but see also Palazzo, Gregory. 2014); Importantly, somatic variants in non-coding

DNA have also been directly linked to cancer (Khurana et al. 2016).

Other recommendations made here are still far from becoming standard practice. For

example, the majority of cancer studies still relies on a PCR step to amplify their library ahead of

quantification, while this amplification has long been recognized as a source of artificial

mutations as well as substantial bias (Aird et al. 2011). Recently, alternative approaches have been

introduced, for instance the use of ddPCR for library quantification when input material is

limiting (Robin et al. 2016). The arrival of Illumina’s TruSeq® PCR-free technology is another

important step towards reducing PCR artefacts (Alioto et al. 2015). However, as TruSeq requires

very high amounts of starting material compared to other NGS technologies, it is currently not

always considered a viable option in tumour variant analysis (Huptas et al. 2016). Future studies

should evaluate whether ddPCR quantification could enable a PCR-free approach in the face of

FFPE tumour samples.

The ‘age of clinical sequencing’ (Daley, Smith. 2013) is clearly approaching fast. Before

genetic screening of tumours, and with it personalized cancer medicine, can become a routine

application however, several important issues will have to be dealt with. One aspect that is

sometimes overlooked is the duration of the complete variant calling workflow, which means that

in the case of some aggressive cancers NGS analysis might be simply too slow (Goodwin et al.

2016). Indeed, faster systems will need to be developed to allow a ubiquitous deployment of

NGS in the cancer clinic. In addition, the accuracy and sensitivity of somatic variant workflows

will need to be further optimized and standardized, especially in the face of somatic variants with

a low VAF. It is key that new variant calling workflows, as well as their individual pipeline

components, are thoroughly evaluated to identify and isolate potential sources of error (Davies et

al. 2016). This will greatly facilitate the analysis of tumour NGS data, and, ultimately, clinical

interpretation.

24

4. References

Agrawal, N., Frederick, M. J., Pickering, C. R., Bettegowda, C., Chang, K., Li, R. J., . . . Myers, J. N. (2011). Exome

sequencing of head and neck squamous cell carcinoma reveals inactivating mutations in NOTCH1. Science (New

York, N.Y.), 333(6046), 1154-1157.

Aird, D., Ross, M. G., Chen, W. S., Danielsson, M., Fennell, T., Russ, C., . . . Gnirke, A. (2011). Analyzing and

minimizing PCR amplification bias in illumina sequencing libraries. Genome Biology, 12(2), R18-2011-12-2-r18.

Epub 2011 Feb 21.

Ajay, S. S., Parker, S. C., Abaan, H. O., Fajardo, K. V., & Margulies, E. H. (2011). Accurate and comprehensive

sequencing of personal genomes. Genome Research, 21(9), 1498-1505.

Akbari, M., Hansen, M. D., Halgunset, J., Skorpen, F., & Krokan, H. E. (2005). Low copy number DNA template

can render polymerase chain reaction error prone in a sequence-dependent manner. The Journal of Molecular

Diagnostics : JMD, 7(1), 36-39.

Albertson, T. M., Ogawa, M., Bugni, J. M., Hays, L. E., Chen, Y., Wang, Y., . . . Preston, B. D. (2009). DNA

polymerase epsilon and delta proofreading suppress discrete mutator and cancer phenotypes in mice. Proceedings

of the National Academy of Sciences of the United States of America, 106(40), 17101-17104.

Alioto, T. S., Buchhalter, I., Derdak, S., Hutter, B., Eldridge, M. D., Hovig, E., . . . Gut, I. G. (2015). A

comprehensive assessment of somatic mutation detection in cancer using whole-genome sequencing. Nature

Communications, 6, 10001.

Altmann, A., Weber, P., Bader, D., Preuss, M., Binder, E. B., & Muller-Myhsok, B. (2012). A beginners guide to SNP

calling from high-throughput DNA-sequencing data. Human Genetics, 131(10), 1541-1554.

Andrews, S.FastQC: A quality control tool for high throughput sequence data. Available online

at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc.

http://www.bioinformatics.babraham.ac.uk/projects/fastqc

25

Baker, M. (2012). Structural variation: The genome's hidden architecture. Nature Methods, 9(2), 133-137.

Bao, R., Huang, L., Andrade, J., Tan, W., Kibbe, W. A., Jiang, H., & Feng, G. (2014). Review of current methods,

applications, and data management for the bioinformatics analysis of whole exome sequencing. Cancer

Informatics, 13(Suppl 2), 67-82.

Bashir, A., Volik, S., Collins, C., Bafna, V., & Raphael, B. J. (2008). Evaluation of paired-end sequencing strategies

for detection of genome rearrangements in cancer. PLoS Computational Biology, 4(4), e1000051.

Belkadi, A., Bolze, A., Itan, Y., Cobat, A., Vincent, Q. B., Antipenko, A., . . . Abel, L. (2015). Whole-genome

sequencing is more powerful than whole-exome sequencing for detecting exome variants. Proceedings of the

National Academy of Sciences of the United States of America, 112(17), 5473-5478.

Ben-Ezra, J., Johnson, D. A., Rossi, J., Cook, N., & Wu, A. (1991). Effect of fixation on the amplification of nucleic

acids from paraffin-embedded material by the polymerase chain reaction. The Journal of Histochemistry and

Cytochemistry : Official Journal of the Histochemistry Society, 39(3), 351-354.

Bentley, D. R., Balasubramanian, S., Swerdlow, H. P., Smith, G. P., Milton, J., Brown, C. G., . . . Smith, A. J. (2008).

Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456(7218), 53-59.

Berlin, K., Koren, S., Chin, C. S., Drake, J. P., Landolin, J. M., & Phillippy, A. M. (2015). Assembling large genomes

with single-molecule sequencing and locality-sensitive hashing. Nature Biotechnology, 33(6), 623-630.

Biankin, A. V., Waddell, N., Kassahn, K. S., Gingras, M. C., Muthuswamy, L. B., Johns, A. L., . . . Grimmond, S. M.

(2012). Pancreatic cancer genomes reveal aberrations in axon guidance pathway genes. Nature, 491(7424), 399-

405.

Bolognesi, C., Forcato, C., Buson, G., Fontana, F., Mangano, C., Doffini, A., . . . Manaresi, N. (2016). Digital sorting

of pure cell populations enables unambiguous genetic analysis of heterogeneous formalin-fixed paraffin-

embedded tumors by next generation sequencing. Scientific Reports, 6, 20944.

Borad, M. J., Champion, M. D., Egan, J. B., Liang, W. S., Fonseca, R., Bryce, A. H., . . . Carpten, J. D. (2014).

Integrated genomic characterization reveals novel, therapeutically relevant drug targets in FGFR and EGFR

pathways in sporadic intrahepatic cholangiocarcinoma. PLoS Genetics, 10(2), e1004135.

26

Buehler, B., Hogrefe, H. H., Scott, G., Ravi, H., Pabon-Pena, C., O'Brien, S., . . . Happe, S. (2010). Rapid

quantification of DNA libraries for next-generation sequencing. Methods (San Diego, Calif.), 50(4), S15-8.

Cacho, A., Smirnova, E., Huzurbazar, S., & Cui, X. (2015). A comparison of base-calling algorithms for illumina

sequencing technology. Briefings in Bioinformatics.

Carter, S. L., Cibulskis, K., Helman, E., McKenna, A., Shen, H., Zack, T., . . . Getz, G. (2012). Absolute

quantification of somatic DNA alterations in human cancer. Nature Biotechnology, 30(5), 413-421.

Chaisson, M. J., Huddleston, J., Dennis, M. Y., Sudmant, P. H., Malig, M., Hormozdiari, F., . . . Eichler, E. E. (2015).

Resolving the complexity of the human genome using single-molecule sequencing. Nature, 517(7536), 608-611.

Dabney, J., & Meyer, M. (2012). Length and GC-biases during sequencing library amplification: A comparison of

various polymerase-buffer systems with ancient and modern DNA sequencing libraries. Biotechniques, 52(2), 87-

94.

Dai, M., Thompson, R. C., Maher, C., Contreras-Galindo, R., Kaplan, M. H., Markovitz, D. M., . . . Meng, F. (2010).

NGSQC: Cross-platform quality analysis pipeline for deep sequencing data. BMC Genomics, 11 Suppl 4, S7-

2164-11-S4-S7.

Daley, T., & Smith, A. D. (2013). Predicting the molecular complexity of sequencing libraries. Nature Methods, 10(4),

325-327.

Das, S., & Vikalo, H. (2013). Base calling for high-throughput short-read sequencing: Dynamic programming

solutions. BMC Bioinformatics, 14, 129-2105-14-129.

Davies, K. D., Farooqi, M. S., Gruidl, M., Hill, C. E., Woolworth-Hirschhorn, J., Jones, H., . . . Aisner, D. L. (2016).

Multi-institutional FASTQ file exchange as a means of proficiency testing for next-generation sequencing

bioinformatics and variant interpretation. The Journal of Molecular Diagnostics : JMD, 18(4), 572-579.

Davis, A., & Navin, N. E. (2016). Computing tumor trees from single cells. Genome Biology, 17(1), 113-016-0987-z.

Del Fabbro, C., Scalabrin, S., Morgante, M., & Giorgi, F. M. (2013). An extensive evaluation of read trimming effects

on illumina NGS data analysis. PloS One, 8(12), e85024.

27

DePristo, M. A., Banks, E., Poplin, R., Garimella, K. V., Maguire, J. R., Hartl, C., . . . Daly, M. J. (2011). A

framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics,

43(5), 491-498.

Dienstmann, R., Dong, F., Borger, D., Dias-Santagata, D., Ellisen, L. W., Le, L. P., & Iafrate, A. J. (2014).

Standardized decision support in next generation sequencing reports of somatic cancer variants. Molecular

Oncology, 8(5), 859-873.

Eisenstein, M. (2015). Big data: The power of petabytes. Nature, 527(7576), S2-4.

Fang, H., Wu, Y., Narzisi, G., O'Rawe, J. A., Barron, L. T., Rosenbaum, J., . . . Lyon, G. J. (2014). Reducing INDEL

calling errors in whole genome and exome sequencing data. Genome Medicine, 6(10), 89-014-0089-z. eCollection

2014.

Ferrarini, M., Moretto, M., Ward, J. A., Surbanovski, N., Stevanovic, V., Giongo, L., . . . Sargent, D. J. (2013). An

evaluation of the PacBio RS platform for sequencing and de novo assembly of a chloroplast genome. BMC

Genomics, 14, 670-2164-14-670.

Feuk, L., Marshall, C. R., Wintle, R. F., & Scherer, S. W. (2006). Structural variants: Changing the landscape of

chromosomes and design of disease studies. Human Molecular Genetics, 15 Spec No 1, R57-66.

Fidler, I. J. (1978). Tumor heterogeneity and the biology of cancer invasion and metastasis. Cancer Research, 38(9),

2651-2660.

Field, M. A., Cho, V., Andrews, T. D., & Goodnow, C. C. (2015). Reliably detecting clinically important variants

requires both combined variant calls and optimized filtering strategies. PloS One, 10(11), e0143199.

Finotello, F., Lavezzo, E., Bianco, L., Barzon, L., Mazzon, P., Fontana, P., . . . Di Camillo, B. (2014). Reducing bias

in RNA sequencing data: A novel approach to compute counts. BMC Bioinformatics, 15 Suppl 1, S7-2105-15-S1-

S7. Epub 2014 Jan 10.

Flintoft, L. (2005). Genome evolution: an adaptive view of non-coding DNA. Nat.Rev.Genet.

28

Gerlinger, M., Santos, C. R., Spencer-Dene, B., Martinez, P., Endesfelder, D., Burrell, R. A., . . . Swanton, C. (2012).

Genome-wide RNA interference analysis of renal carcinoma survival regulators identifies MCT4 as a warburg

effect metabolic target. The Journal of Pathology, 227(2), 146-156.

Gerstung, M., Beisel, C., Rechsteiner, M., Wild, P., Schraml, P., Moch, H., & Beerenwinkel, N. (2012). Reliable

detection of subclonal single-nucleotide variants in tumour cell populations. Nature Communications, 3, 811.

Goode, D. L., Hunter, S. M., Doyle, M. A., Ma, T., Rowley, S. M., Choong, D., . . . Campbell, I. G. (2013). A simple

consensus approach improves somatic mutation prediction accuracy. Genome Medicine, 5(9), 90.

Goodwin, S., Gurtowski, J., Ethe-Sayers, S., Deshpande, P., Schatz, M. C., & McCombie, W. R. (2015). Oxford

nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome. Genome Research,

25(11), 1750-1756.

Goodwin, S., McPherson, J. D., & McCombie, W. R. (2016). Coming of age: Ten years of next-generation

sequencing technologies. Nature Reviews.Genetics, 17(6), 333-351.

Griffith, M., Miller, C. A., Griffith, O. L., Krysiak, K., Skidmore, Z. L., Ramu, A., . . . Wilson, R. K. (2015).

Optimizing cancer genome sequencing and analysis. Cell Systems, 1(3), 210-223.

Grokhovsky, S. L., Il'icheva, I. A., Nechipurenko, D. Y., Golovkin, M. V., Panchenko, L. A., Polozov, R. V., &

Nechipurenko, Y. D. (2011). Sequence-specific ultrasonic cleavage of DNA. Biophysical Journal, 100(1), 117-125.

Hansen, K. D., Brenner, S. E., & Dudoit, S. (2010). Biases in illumina transcriptome sequencing caused by random

hexamer priming. Nucleic Acids Research, 38(12), e131.

Head, S. R., Komori, H. K., LaMere, S. A., Whisenant, T., Van Nieuwerburgh, F., Salomon, D. R., & Ordoukhanian,

P. (2014). Library construction for next-generation sequencing: Overviews and challenges. Biotechniques, 56(2),

61-4, 66, 68, passim.

Huptas, C., Scherer, S., & Wenning, M. (2016). Optimized illumina PCR-free library preparation for bacterial whole

genome sequencing and analysis of factors influencing de novo assembly. BMC Research Notes, 9, 269-016-2072-

9.

29

Huse, S. M., Huber, J. A., Morrison, H. G., Sogin, M. L., & Welch, D. M. (2007). Accuracy and quality of massively

parallel DNA pyrosequencing. Genome Biology, 8(7), R143.

Katsanis, S. H., & Katsanis, N. (2013). Molecular genetic testing and the future of clinical genomics. Nature

Reviews.Genetics, 14(6), 415-426.

Kebschull, J. M., & Zador, A. M. (2015). Sources of PCR-induced distortions in high-throughput sequencing data

sets. Nucleic Acids Research, 43(21), e143.

Khurana, E., Fu, Y., Chakravarty, D., Demichelis, F., Rubin, M. A., & Gerstein, M. (2016). Role of non-coding

sequence variants in cancer. Nature Reviews.Genetics, 17(2), 93-108.

Knierim, E., Lucke, B., Schwarz, J. M., Schuelke, M., & Seelow, D. (2011). Systematic comparison of three methods

for fragmentation of long-range PCR products for next generation sequencing. PloS One, 6(11), e28240.

Koboldt, D. C., Steinberg, K. M., Larson, D. E., Wilson, R. K., & Mardis, E. R. (2013). The next-generation

sequencing revolution and its impact on genomics. Cell, 155(1), 27-38.

Kokkat, T. J., Patel, M. S., McGarvey, D., LiVolsi, V. A., & Baloch, Z. W. (2013). Archived formalin-fixed paraffin-

embedded (FFPE) blocks: A valuable underexploited resource for extraction of DNA, RNA, and protein.

Biopreservation and Biobanking, 11(2), 101-106.

Kozarewa, I., Ning, Z., Quail, M. A., Sanders, M. J., Berriman, M., & Turner, D. J. (2009). Amplification-free

illumina sequencing-library preparation facilitates improved mapping and assembly of (G+C)-biased genomes.

Nature Methods, 6(4), 291-295.

Lai, Z., Markovets, A., Ahdesmaki, M., Chapman, B., Hofmann, O., McEwen, R., . . . Dry, J. R. (2016). VarDict: A

novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic Acids Research,

44(11), e108.

Landau, D. A., Carter, S. L., Stojanov, P., McKenna, A., Stevenson, K., Lawrence, M. S., . . . Wu, C. J. (2013).

Evolution and impact of subclonal mutations in chronic lymphocytic leukemia. Cell, 152(4), 714-726.

Langmead, B., Trapnell, C., Pop, M., & Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short

DNA sequences to the human genome. Genome Biology, 10(3), R25-2009-10-3-r25. Epub 2009 Mar 4.

30

Laurie, M. T., Bertout, J. A., Taylor, S. D., Burton, J. N., Shendure, J. A., & Bielas, J. H. (2013). Simultaneous digital

quantification and fluorescence-based size characterization of massively parallel sequencing libraries.

Biotechniques, 55(2), 61-67.

Li, H. (2014). Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics

(Oxford, England), 30(20), 2843-2851.

Li, H. (2016). Minimap and miniasm: Fast mapping and de novo assembly for noisy long sequences. Bioinformatics

(Oxford, England), 32(14), 2103-2110.

Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics

(Oxford, England), 25(14), 1754-1760.

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., . . . 1000 Genome Project Data Processing

Subgroup. (2009). The sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England), 25(16),

2078-2079.

Li, H., & Homer, N. (2010). A survey of sequence alignment algorithms for next-generation sequencing. Briefings in

Bioinformatics, 11(5), 473-483.

Li, R., Yu, C., Li, Y., Lam, T. W., Yiu, S. M., Kristiansen, K., & Wang, J. (2009). SOAP2: An improved ultrafast tool

for short read alignment. Bioinformatics (Oxford, England), 25(15), 1966-1967.

Liu, L., Li, Y., Li, S., Hu, N., He, Y., Pong, R., . . . Law, M. (2012). Comparison of next-generation sequencing

systems. Journal of Biomedicine & Biotechnology, 2012, 251364.

Loeb, L. A. (2016). Human cancers express a mutator phenotype: Hypothesis, origin, and consequences. Cancer

Research, 76(8), 2057-2059.

Loman, N. J., Misra, R. V., Dallman, T. J., Constantinidou, C., Gharbia, S. E., Wain, J., & Pallen, M. J. (2012).

Performance comparison of benchtop high-throughput sequencing platforms. Nature Biotechnology, 30(5), 434-

439.

Luthra, R., Chen, H., Roy-Chowdhuri, S., & Singh, R. R. (2015). Next-generation sequencing in clinical molecular

diagnostics of cancer: Advantages and challenges. Cancers, 7(4), 2023-2036.

31

Mamanova, L., Coffey, A. J., Scott, C. E., Kozarewa, I., Turner, E. H., Kumar, A., . . . Turner, D. J. (2010). Target-

enrichment strategies for next-generation sequencing. Nature Methods, 7(2), 111-118.

Mardis, E. R. (2013). Next-generation sequencing platforms. Annual Review of Analytical Chemistry (Palo Alto, Calif.), 6,

287-303.

Marusyk, A., Almendro, V., & Polyak, K. (2012). Intra-tumour heterogeneity: A looking glass for cancer? Nature

Reviews.Cancer, 12(5), 323-334.

McCoy, R. C., Taylor, R. W., Blauwkamp, T. A., Kelley, J. L., Kertesz, M., Pushkarev, D., . . . Fiston-Lavier, A. S.

(2014). Illumina TruSeq synthetic long-reads empower de novo assembly and resolve complex, highly-

repetitive transposable elements. PloS One, 9(9), e106689.

McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., . . . DePristo, M. A. (2010). The

genome analysis toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome

Research, 20(9), 1297-1303.

Meyer, C. A., & Liu, X. S. (2014). Identifying and mitigating bias in next-generation sequencing methods for

chromatin biology. Nature Reviews.Genetics, 15(11), 709-721.

Meyer, M., Briggs, A. W., Maricic, T., Hober, B., Hoffner, B., Krause, J., . . . Hofreiter, M. (2008). From micrograms

to picograms: Quantitative PCR reduces the material demands of high-throughput sequencing. Nucleic Acids

Research, 36(1), e5.

Meynert, A. M., Ansari, M., FitzPatrick, D. R., & Taylor, M. S. (2014). Variant detection sensitivity and biases in

whole genome and exome sequencing. BMC Bioinformatics, 15, 247-2105-15-247.

Miga, K. H., Eisenhart, C., & Kent, W. J. (2015). Utilizing mapping targets of sequences underrepresented in the

reference assembly to reduce false positive alignments. Nucleic Acids Research, 43(20), e133.

Mokry, M., Hatzis, P., de Bruijn, E., Koster, J., Versteeg, R., Schuijers, J., . . . Cuppen, E. (2010). Efficient double

fragmentation ChIP-seq provides nucleotide resolution protein-DNA binding profiles. PloS One, 5(11), e15092.

Mostovoy, Y., Levy-Sakin, M., Lam, J., Lam, E. T., Hastie, A. R., Marks, P., . . . Kwok, P. Y. (2016). A hybrid

approach for de novo human genome sequence assembly and phasing. Nature Methods, 13(7), 587-590.

32

Myers, E. W., Sutton, G. G., Delcher, A. L., Dew, I. M., Fasulo, D. P., Flanigan, M. J., . . . Venter, J. C. (2000). A

whole-genome assembly of drosophila. Science (New York, N.Y.), 287(5461), 2196-2204.

Navin, N., Kendall, J., Troge, J., Andrews, P., Rodgers, L., McIndoo, J., . . . Wigler, M. (2011). Tumour evolution

inferred by single-cell sequencing. Nature, 472(7341), 90-94.

Nielsen, R., Paul, J. S., Albrechtsen, A., & Song, Y. S. (2011). Genotype and SNP calling from next-generation

sequencing data. Nature Reviews.Genetics, 12(6), 443-451.

Nik-Zainal, S., Alexandrov, L. B., Wedge, D. C., Van Loo, P., Greenman, C. D., Raine, K., . . . Breast Cancer

Working Group of the International Cancer Genome Consortium. (2012). Mutational processes molding the

genomes of 21 breast cancers. Cell, 149(5), 979-993.

Oesper, L., Ritz, A., Aerni, S. J., Drebin, R., & Raphael, B. J. (2012). Reconstructing cancer genomes from paired-end

sequencing data. BMC Bioinformatics, 13 Suppl 6, S10-2105-13-S6-S10.

O'Huallachain, M., Karczewski, K. J., Weissman, S. M., Urban, A. E., & Snyder, M. P. (2012). Extensive genetic

variation in somatic human tissues. Proceedings of the National Academy of Sciences of the United States of America,

109(44), 18018-18023.

Olson, N. D., Lund, S. P., Colman, R. E., Foster, J. T., Sahl, J. W., Schupp, J. M., . . . Zook, J. M. (2015). Best

practices for evaluating single nucleotide variant calling methods for microbial genomics. Frontiers in Genetics, 6,

235.

Pabinger, S., Dander, A., Fischer, M., Snajder, R., Sperk, M., Efremova, M., . . . Trajanoski, Z. (2014). A survey of

tools for variant analysis of next-generation genome sequencing data. Briefings in Bioinformatics, 15(2), 256-278.

Palazzo, A. F., & Gregory, T. R. (2014). The case for junk DNA. PLoS Genetics, 10(5), e1004351.

Pandey, R. V., Pabinger, S., Kriegner, A., & Weinhausel, A. (2016). ClinQC: A tool for quality control and cleaning

of sanger and NGS data in clinical research. BMC Bioinformatics, 17, 56-016-0915-y.

Pang, A. W., MacDonald, J. R., Pinto, D., Wei, J., Rafiq, M. A., Conrad, D. F., . . . Scherer, S. W. (2010). Towards a

comprehensive structural variation map of an individual human genome. Genome Biology, 11(5), R52-2010-11-5-

r52. Epub 2010 May 19.

33

Pantel, K., & Speicher, M. R. (2016). The biology of circulating tumor cells. Oncogene, 35(10), 1216-1224.

Parkinson, N. J., Maslau, S., Ferneyhough, B., Zhang, G., Gregory, L., Buck, D., . . . Fischer, M. D. (2012).

Preparation of high-quality next-generation sequencing libraries from picogram quantities of target DNA.

Genome Research, 22(1), 125-133.

Patel, Z. H., Kottyan, L. C., Lazaro, S., Williams, M. S., Ledbetter, D. H., Tromp, H., . . . Kaufman, K. M. (2014).

The struggle to find reliable results in exome sequencing data: Filtering out mendelian errors. Frontiers in

Genetics, 5, 16.

Pleasance, E. D., Cheetham, R. K., Stephens, P. J., McBride, D. J., Humphray, S. J., Greenman, C. D., . . . Stratton,

M. R. (2010). A comprehensive catalogue of somatic mutations from a human cancer genome. Nature,

463(7278), 191-196.

Poptsova, M. S., Il'icheva, I. A., Nechipurenko, D. Y., Panchenko, L. A., Khodikov, M. V., Oparina, N. Y., . . .

Grokhovsky, S. L. (2014). Non-random DNA fragmentation in next-generation sequencing. Scientific Reports, 4,

4532.

Pujana, M. A. (2014). Integrating germline and somatic data towards a personalized cancer medicine. Trends in

Molecular Medicine, 20(8), 413-415.

Quail, M. A., Kozarewa, I., Smith, F., Scally, A., Stephens, P. J., Durbin, R., . . . Turner, D. J. (2008). A large genome

center's improvements to the illumina sequencing system. Nature Methods, 5(12), 1005-1010.

Quail, M. A., Swerdlow, H., & Turner, D. J. (2009). Improved protocols for the illumina genome analyzer

sequencing system. Current Protocols in Human Genetics / Editorial Board, Jonathan L.Haines ...[Et Al.], Chapter 18,

Unit 18.2.

Renaud, G., Kircher, M., Stenzel, U., & Kelso, J. (2013). freeIbis: An efficient basecaller with calibrated quality scores

for illumina sequencers. Bioinformatics (Oxford, England), 29(9), 1208-1209.

Rhodes, J., Beale, M. A., & Fisher, M. C. (2014). Illuminating choices for library prep: A comparison of library

preparation methods for whole genome sequencing of cryptococcus neoformans using illumina HiSeq. PloS

One, 9(11), e113501.

34

Risca, V. I., & Greenleaf, W. J. (2015). Beyond the linear genome: Paired-end sequencing as a biophysical tool. Trends

in Cell Biology, 25(12), 716-719.

Robin, J. D., Ludlow, A. T., LaRanger, R., Wright, W. E., & Shay, J. W. (2016). Comparison of DNA quantification

methods for next generation sequencing. Scientific Reports, 6, 24067.

Ruffalo, M., LaFramboise, T., & Koyuturk, M. (2011). Comparative analysis of algorithms for next-generation

sequencing read alignment. Bioinformatics (Oxford, England), 27(20), 2790-2796.

Sadanandam, A., Lal, A., Benz, S. C., Eppenberger-Castori, S., Scott, G., Gray, J. W., . . . Benz, C. C. (2012).

Genomic aberrations in normal tissue adjacent to HER2-amplified breast cancers: Field cancerization or

contaminating tumor cells? Breast Cancer Research and Treatment, 136(3), 693-703.

Sah, S., Chen, L., Houghton, J., Kemppainen, J., Marko, A. C., Zeigler, R., & Latham, G. J. (2013). Functional DNA

quantification guides accurate next-generation sequencing mutation detection in formalin-fixed, paraffin-

embedded tumor biopsies. Genome Medicine, 5(8), 77.

Saunders, C. T., Wong, W. S., Swamy, S., Becq, J., Murray, L. J., & Cheetham, R. K. (2012). Strelka: Accurate somatic

small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics (Oxford, England), 28(14), 1811-

1817.

Schadt, E. E., Linderman, M. D., Sorenson, J., Lee, L., & Nolan, G. P. (2010). Computational solutions to large-scale

data management and analysis. Nature Reviews.Genetics, 11(9), 647-657.

Shah, S. P., Morin, R. D., Khattra, J., Prentice, L., Pugh, T., Burleigh, A., . . . Aparicio, S. (2009). Mutational

evolution in a lobular breast tumour profiled at single nucleotide resolution. Nature, 461(7265), 809-813.

Shendure, J., & Ji, H. (2008). Next-generation DNA sequencing. Nature Biotechnology, 26(10), 1135-1145.

Sims, D., Sudbery, I., Ilott, N. E., Heger, A., & Ponting, C. P. (2014). Sequencing depth and coverage: Key

considerations in genomic analyses. Nature Reviews.Genetics, 15(2), 121-132.

Sukhai, M. A., Craddock, K. J., Thomas, M., Hansen, A. R., Zhang, T., Siu, L., . . . Kamel-Reid, S. (2016). A

classification system for clinical relevance of somatic variants identified in molecular profiling of cancer.

Genetics in Medicine : Official Journal of the American College of Medical Genetics, 18(2), 128-136.

35

Trapnell, C., & Salzberg, S. L. (2009). How to map billions of short reads onto genomes. Nature Biotechnology, 27(5),

455-457.

Treangen, T. J., & Salzberg, S. L. (2011). Repetitive DNA and next-generation sequencing: Computational challenges

and solutions. Nature Reviews.Genetics, 13(1), 36-46.

Troester, M., Hoadley, K., & D'Arcy, M.,.. (2016). DNA defects, epigenetics, and gene expression in cancer-adjacent

breast: A study from the cancer genome atlas. Npj Breast Cancer, 2

Valasek, M. A., & Repa, J. J. (2005). The power of real-time PCR. Advances in Physiology Education, 29(3), 151-159.

van Dijk, E. L., Jaszczyszyn, Y., & Thermes, C. (2014). Library preparation methods for next-generation sequencing:

Tone down the bias. Experimental Cell Research, 322(1), 12-20.

Veal, C. D., Freeman, P. J., Jacobs, K., Lancaster, O., Jamain, S., Leboyer, M., . . . Brookes, A. J. (2012). A

mechanistic basis for amplification differences between samples and between genome regions. BMC Genomics,

13, 455-2164-13-455.

Volik, S., Zhao, S., Chin, K., Brebner, J. H., Herndon, D. R., Tao, Q., . . . Collins, C. (2003). End-sequence profiling:

Sequence-based analysis of aberrant genomes. Proceedings of the National Academy of Sciences of the United States of

America, 100(13), 7696-7701.

Vrijenhoek, T., Kraaijeveld, K., Elferink, M., de Ligt, J., Kranendonk, E., Santen, G., . . . Cuppen, E. (2015). Next-

generation sequencing-based genome diagnostics across clinical genetics centers: Implementation choices and

their effects. European Journal of Human Genetics : EJHG, 23(9), 1142-1150.

Wagle, N., Berger, M. F., Davis, M. J., Blumenstiel, B., Defelice, M., Pochanard, P., . . . Garraway, L. A. (2012). High-

throughput detection of actionable genomic alterations in clinical tumor samples by targeted, massively parallel

sequencing. Cancer Discovery, 2(1), 82-93.

Wang, K., Li, M., & Hakonarson, H. (2010). ANNOVAR: Functional annotation of genetic variants from high-

throughput sequencing data. Nucleic Acids Research, 38(16), e164.

Wang, Q., Jia, P., Li, F., Chen, H., Ji, H., Hucks, D., . . . Zhao, Z. (2013). Detecting somatic point mutations in

cancer genome sequencing data: A comparison of mutation callers. Genome Medicine, 5(10), 91.

36

Watson, I. R., Takahashi, K., Futreal, P. A., & Chin, L. (2013). Emerging patterns of somatic mutations in cancer.

Nature Reviews.Genetics, 14(10), 703-718.

White, R. A.,3rd, Blainey, P. C., Fan, H. C., & Quake, S. R. (2009). Digital PCR provides sensitive and absolute

calibration for high throughput sequencing. BMC Genomics, 10, 116-2164-10-116.

Williams, C., Ponten, F., Moberg, C., Soderkvist, P., Uhlen, M., Ponten, J., . . . Lundeberg, J. (1999). A high

frequency of sequence alterations is due to formalin fixation of archival specimens. The American Journal of

Pathology, 155(5), 1467-1471.

Xu, X., Hou, Y., Yin, X., Bao, L., Tang, A., Song, L., . . . Wang, J. (2012). Single-cell exome sequencing reveals

single-nucleotide mutation characteristics of a kidney tumor. Cell, 148(5), 886-895.

Yang, H., & Wang, K. (2015). Genomic variant annotation and prioritization with ANNOVAR and wANNOVAR.

Nature Protocols, 10(10), 1556-1566.

Yates, L. R., Gerstung, M., Knappskog, S., Desmedt, C., Gundem, G., Van Loo, P., . . . Campbell, P. J. (2015).

Subclonal diversification of primary breast cancer revealed by multiregion sequencing. Nature Medicine, 21(7),

751-759.

Ye, C., Hsiao, C., & Corrada Bravo, H. (2014). BlindCall: Ultra-fast base-calling of high-throughput sequencing data

by blind deconvolution. Bioinformatics (Oxford, England), 30(9), 1214-1219.

Yun, J. J., Heisler, L. E., Hwang, I. I., Wilkins, O., Lau, S. K., Hyrcza, M., . . . Der, S. D. (2006). Genomic DNA

functions as a universal external standard in quantitative real-time PCR. Nucleic Acids Research, 34(12), e85.

Zerbino, D. R., & Birney, E. (2008). Velvet: Algorithms for de novo short read assembly using de bruijn graphs.

Genome Research, 18(5), 821-829.

Zhou, W., Chen, T., Zhao, H., Eterovic, A. K., Meric-Bernstam, F., Mills, G. B., & Chen, K. (2014). Bias from

removing read duplication in ultra-deep sequencing experiments. Bioinformatics (Oxford, England)

Improving somatic mutation detection in cancer · 2018. 2. 15. · can result in a knowledge gap...

Documents

Transcript of Improving somatic mutation detection in cancer · 2018. 2. 15. · can result in a knowledge gap...