
APPLICATIONS OF NEXT-GENERATION SEQUENCING

REVIEWS

Cancer genome-sequencing study design
Jill C. Mwenifumbo and Marco A. Marra

Abstract | Discoveries from cancer genome sequencing have the potential to translate into advances in cancer prevention, diagnostics, prognostics, treatment and basic biology. Given the diversity of downstream applications, cancer genome-sequencing studies need to be designed to best fulfil specific aims. Knowledge of second-generation cancer genome-sequencing study design also facilitates assessment of the validity and importance of the rapidly growing number of published studies. In this Review, we focus on the practical application of second-generation sequencing technology (also known as next-generation sequencing) to cancer genomics and discuss how aspects of study design and methodological considerations (such as the size and composition of the discovery cohort) can be tailored to serve specific research aims.

Genome Sciences Centre, BC Cancer Research Centre, 675 West 10th Avenue, Vancouver, British Columbia V5Z 1L3, Canada.
Correspondence to M.A.M. e-mail: [email protected]
doi:10.1038/nrg3445

NATURE REVIEWS | GENETICS VOLUME 14 | MAY 2013 | 321

© 2013 Macmillan Publishers Limited. All rights reserved

Cancer pathogenesis is rooted in inherited genetic variation and acquired somatic mutation; accordingly, genomics is integral to cancer research (for a review, see REF. 1). In 2008, the first cancer genome was sequenced using second-generation technology (also known as next-generation sequencing; REF. 2). Four years later, approximately 800 genomes from at least 25 different cancer types have been sequenced. Consider that only 20 years ago, sequencing one human genome took an international collaboration more than 10 years and cost US$3.8 billion (REFS 3–5). Today, accurate and rapid genome sequencing costs only a few thousand dollars. With this advancement in technology comes considerable capacity to increase basic cancer biology knowledge and the opportunity to advance cancer prevention, diagnostics, prognostics and treatment.

Despite the diversity of questions to be addressed using cancer genome-sequencing studies (that is, studies that have used second-generation technology to sequence at least one cancer genome), only a limited number of specific aims have been investigated to date; many remain to be explored. For instance, cancer prevention is an important area that could greatly benefit from well-designed second-generation cancer genome-sequencing studies. Family-based and case–control study designs will be integral to uncovering inherited polymorphisms that predispose individuals to cancer. Knowing cancer predisposition can benefit patients, health practitioners and the health-care system if it results in a change of lifestyle behaviours or medical intervention that reduces cancer risk, morbidity or mortality. The aim of this article is not to review results of cancer genome-sequencing studies but to focus on their archetypal specific aims, methodological requisites and study designs. Throughout this Review, the benefits and limitations of approaches, technologies and interpretation will also be discussed.

Specific aims

Thus far, most cancer genome-sequencing studies have had one or more of four specific aims: discovering driver mutations; identifying somatic mutational signatures; characterizing clonal evolution; and advancing personalized medicine (FIG. 1; Supplementary information S1 (table)). First, determining which somatic mutations are likely to contribute to the cancer phenotype is the most common aim of cancer genome-sequencing studies. Discovering driver mutations leads to improved understanding of basic cancer biology and consequently treatment discovery and development. Take the gene enhancer of zeste 2 (EZH2), for example: second-generation sequencing resulted in the discovery of somatic mutations in EZH2 in lymphoma at a clinically significant frequency (REF. 6), spurring functional characterization (REF. 7) and leading to a promising treatment (REF. 8). Second, identifying somatic mutational signatures has also led to gains in understanding basic cancer biology. For the first time, researchers can uncover global signatures of the mutation processes and DNA repair mechanisms that contribute to the catalogue of somatic mutations in a cancer type. Indeed, the signatures of two newly discovered mutational phenomena, kataegis (REF. 9) and chromothripsis (REF. 10), have been discovered thanks to cancer genome-sequencing studies. Third, characterizing clonal evolution is important, particularly when considering cancer treatment, and it can be achieved at nucleotide resolution using sequencing. For example, major subclones may share a somatic mutation (or mutations) that confers intrinsic drug resistance, whereas minor subclones and de novo mutations may evolve to confer acquired resistance (as has been documented for the emergence of mutated v-Ki-Ras2 Kirsten rat sarcoma oncogene (KRAS) and acquired resistance to epidermal growth factor receptor (EGFR)-targeted therapy; REF. 11).

Driver mutations
Somatic mutations that have a role in creating, controlling and/or directing some aspect of the cancer phenotype.

Kataegis
From the Greek meaning ‘thunderstorm’, this refers to clusters of somatic single-nucleotide variants that often colocalize with somatic structural variants.

Chromothripsis
From the Greek meaning ‘chromosome shattering’, this refers to a single event of genome shattering and reassembly that results in complex somatic structural variations characterized by oscillating copy number and tens to hundreds of rearrangements that localize to one or a few chromosomes.

Finally, advancing personalized medicine is a clear application of using second-generation technology to sequence cancer genomes. The goal of personalized medicine is to reduce toxicity and to improve efficacy through selecting the correct treatment for the correct patient at the correct dose and time. Medulloblastoma could be a model cancer type to further the personalized medicine paradigm as it is a heterogeneous cancer type with respect to both overall survival and molecular signatures; moreover, aggressive treatment reduces mortality at the cost of substantial morbidity (REF. 12). Identifying the patients who would be best served by an aggressive treatment regime has great potential to improve the quality of life of medulloblastoma survivors.

Figure 1 | Cancer genome second-generation-sequencing study designs. This flow diagram is intended to highlight the diversity of second-generation cancer genome-sequencing study designs. Study design choices are in yellow, green and blue boxes. Choosing a path along these boxes, which are connected by arrows, represents a possible cancer genome-sequencing study design. The yellow boxes highlight that single-patient studies are well-suited for personalized medicine, whereas the blue boxes highlight that discovery cohorts are well-suited for discovering driver mutations. Dark grey boxes represent choices for analyses or methods specific to the box that they are connected to. Specifically, clonal evolution can be examined through ultra-deep, multi-sample or multi-region sequencing. Discovery cohorts can be multi-omics studies that combine genome, exome or transcriptome sequencing. Genome sequencing can be either <30‑fold or ≥30‑fold redundant coverage for focused or comprehensive somatic mutation detection, respectively. Validation and extension cohorts can confirm the findings from discovery cohorts or can explore generalizability or clinical importance. Secondary aims are in light grey boxes. Peer-reviewed publications that may serve as models for a particular study design feature or study aim are noted in boxes. SNV, single-nucleotide variant; SV, structural variant.

Figure 1 box labels, each with its example publication:
Single-patient study: e.g. Ley (2008), REF. 2
Discovery cohort: e.g. Campbell (2008), REF. 42
Genome: e.g. Pleasance (2009), REF. 19
Exome: e.g. Muzny (2012), REF. 43
Transcriptome: e.g. Morin (2011), REF. 62
Multi-omics: e.g. Wang (2011), REF. 64
Interaction omics: e.g. Molenaar (2012), REF. 90
SV detection, <30-fold: e.g. Stephens (2011), REF. 10
SNV, indel, SV detection, ≥30-fold: e.g. Berger (2012), REF. 97
Ultra-deep sequencing: e.g. Shah (2012), REF. 60
Multi-sample sequencing: e.g. Ding (2012), REF. 100
Multi-region sequencing: e.g. Tao (2011), REF. 55
Validation or extension cohort: e.g. Jones (2012), REF. 76
No validation or extension cohort: e.g. Sung (2012), REF. 94
Validation: e.g. Cheung (2012), REF. 89
Generalizability: e.g. Wu (2012), REF. 63
Clinical importance: e.g. Molenaar (2012), REF. 90
Driver mutations: e.g. Banerji (2012), REF. 72
Mutational signatures: e.g. Nik-Zainal (2012), REF. 9
Clonal evolution: e.g. Shah (2009), REF. 16
Personalized medicine: e.g. Jones (2010), REF. 17
Recurrent mutation: e.g. Fujimoto (2012), REF. 30
Pathway analysis: e.g. Ellis (2012), REF. 27
Clinically actionable mutation: e.g. Chapman (2011), REF. 65

Redundant sequence coverage
The total number of bases sequenced divided by the total number of bases in the haploid genome.

B allele frequencies
Frequencies equal to B / (A + B), where A is the count for the reference nucleotide at an inherited single-nucleotide polymorphism (SNP) position, and B is the count for the alternate nucleotide at that same SNP position.

The specific aims discussed here are not exhaustive; many more remain to be explored with second-generation sequencing of cancer genomes. For instance, de novo germline variants associated with childhood cancers might be discovered through trio studies that include the proband offspring and both parents. What is more, these specific aims are not mutually exclusive. For example, through characterizing clonal evolution, researchers might find driver mutations.

Methodological requisites

Second-generation cancer genome-sequencing studies have a generally accepted set of working methods (for a review, see REF. 13), including but not limited to full sequencing of the matched normal genome (that is, the patient’s non-cancerous genome), at least 30-fold redundant sequence coverage for the detection of single-nucleotide variation, and verification resequencing to confirm the somatic status of acquired mutations.

Matched normal genome. Subtracting the genetic variation of a non-cancerous ‘normal’ genome from its cancerous counterpart allows the identification of somatic mutations. As a source of the normal genome, studies of haematological cancer types often use skin biopsies (for example, see REFS 2,14,15), whereas studies of solid tumours frequently use peripheral blood mononuclear cells (for example, see REFS 16–18). Both of these sources may be contaminated with circulating tumour DNA or cells. Surgical margins and proximal lymph nodes can also serve as a source of normal DNA (for example, see REFS 19–21). Their collection is the least invasive for the patient if surgical resection is a part of the treatment plan. However, it should be noted that, despite normal appearance, surrounding tissue may contain residual disease cells, early tumour-initiating somatic mutations and/or an altered transcriptome or epigenome. Regardless of the source of the matched normal genome, bioinformatics analyses should allow for low levels of contamination.
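As a minimal sketch of the subtraction idea described above, the following Python treats each sample's variant calls as a set and keeps the tumour calls that are absent from the matched normal. The `Variant` type, the function name and all coordinates are hypothetical and chosen only for illustration; real somatic callers also model contamination and sequencing error rather than performing an exact set difference.

```python
from typing import NamedTuple, Set


class Variant(NamedTuple):
    """A single-nucleotide variant keyed by position and alleles."""
    chrom: str
    pos: int
    ref: str
    alt: str


def candidate_somatic(tumour: Set[Variant], normal: Set[Variant]) -> Set[Variant]:
    """Return tumour variants that are absent from the matched normal."""
    return tumour - normal


# Hypothetical calls: two inherited SNPs shared by both samples,
# plus one variant seen only in the tumour.
normal_calls = {Variant("chr1", 1000, "G", "A"), Variant("chr2", 2000, "C", "T")}
tumour_calls = normal_calls | {Variant("chr7", 3000, "T", "A")}

somatic = candidate_somatic(tumour_calls, normal_calls)
print(sorted(somatic))
```

An exact set difference is deliberately naive: any inherited SNP missed in the normal genome would surface here as a false somatic candidate, which is why complete SNP detection in the normal sample matters.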

The average person inherits 3 to 4 million single-nucleotide polymorphisms (SNPs; for example, see REFS 22–26). Compared to these millions of inherited polymorphisms, there are relatively few (specifically, thousands to tens of thousands) candidate somatic single-nucleotide variants (SNVs) in the cancer genomes of adults (for example, see REFS 27–30). To have confidence that these candidates are real somatic SNVs, researchers must have identified the large majority of inherited SNPs in the matched normal genome (BOX 1). Metrics that help to assess the quality of the SNP calling in non-cancerous genomes include: the proportion of the SNPs that overlap with those found in the US National Center for Biotechnology Information SNP database (dbSNP), which is a public repository of simple genetic variations (REF. 31); the transition to transversion ratio (~2.1 for the whole genome); and concordance with matched SNP genotyping arrays. SNP arrays can be used to estimate a false-negative rate; however, this assumes that array calls are the gold standard.
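The three quality metrics listed above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the representation of SNPs as (reference, alternate) pairs and of sites as set members, and the example data are all assumptions, not part of any published pipeline.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """Transitions stay within purines (A<->G) or within pyrimidines (C<->T)."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES


def ti_tv_ratio(snps):
    """Transition to transversion ratio for (ref, alt) pairs with ref != alt."""
    transitions = sum(1 for ref, alt in snps if is_transition(ref, alt))
    transversions = len(snps) - transitions
    return transitions / transversions if transversions else float("inf")


def known_fraction(called_sites: set, database_sites: set) -> float:
    """Proportion of called SNP sites already present in a database such as dbSNP."""
    return len(called_sites & database_sites) / len(called_sites) if called_sites else 0.0


def false_negative_rate(array_variant_sites: set, called_sites: set) -> float:
    """Fraction of array-detected variant sites missed by sequencing.

    Treats the array as the gold standard, as noted in the text.
    """
    if not array_variant_sites:
        return 0.0
    return len(array_variant_sites - called_sites) / len(array_variant_sites)


# Hypothetical toy data.
example_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "C"), ("G", "T")]
print(ti_tv_ratio(example_snps))                    # 4 transitions, 2 transversions
print(known_fraction({1, 2, 3, 4}, {2, 3, 4, 5}))   # 3 of 4 calls are known
print(false_negative_rate({1, 2, 3, 4}, {1, 2, 3})) # 1 of 4 array sites missed
```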

SNVs are not the only kind of somatic mutation, but they are the most abundant. The somatic status of structural variants, such as copy number variants (CNVs), copy-neutral regions of loss of heterozygosity (cnLOH), inversions and translocations, is also determined by comparison with the non-cancerous genome.

Redundant sequence coverage for studying single-nucleotide variants. Cancer genome-sequencing studies typically produce on the order of 90 Gb of aligned sequence to achieve 30-fold redundant coverage of the 3 Gb haploid human genome, which is generally accepted as sufficient to detect inherited SNPs reliably in diploid genomes (REF. 25). Estimating the optimal redundant coverage for normal genome sequencing is challenging owing to rapid changes of variables that influence the detection of SNPs, such as library construction methods, sequencing chemistry, sequencing read length, read alignment algorithms and bioinformatics tools for variant identification. It has been suggested, however, that on the order of 50-fold coverage is required to detect inherited genotypes confidently (REF. 32).
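The coverage arithmetic in this paragraph follows directly from the glossary definition (total bases sequenced divided by haploid genome size). A minimal Python sketch is given below; the function names and the 100-base read length are illustrative assumptions.

```python
HAPLOID_GENOME_BASES = 3_000_000_000  # ~3 Gb haploid human genome, as in the text


def redundant_coverage(total_aligned_bases: int,
                       genome_bases: int = HAPLOID_GENOME_BASES) -> float:
    """Fold coverage = total aligned bases / bases in the haploid genome."""
    return total_aligned_bases / genome_bases


def bases_required(target_coverage: int,
                   genome_bases: int = HAPLOID_GENOME_BASES) -> int:
    """Aligned bases needed to reach a target fold coverage."""
    return target_coverage * genome_bases


def reads_required(target_coverage: int, read_length: int,
                   genome_bases: int = HAPLOID_GENOME_BASES) -> int:
    """Aligned reads needed, assuming every read has the same length."""
    return bases_required(target_coverage, genome_bases) // read_length


print(redundant_coverage(90_000_000_000))    # 90 Gb over 3 Gb
print(bases_required(50))                    # bases for the suggested 50-fold
print(reads_required(30, read_length=100))   # read length is an assumption
```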

Although cancer originates from a common progenitor, it evolves through clonal expansion, somatic mutation and selection (REFS 1,33), which means that cancer cells from the same patient do not share all somatic mutations. Moreover, general characteristics of cancer, such as aneuploidy, non-cancerous cell contamination and extensive unbalanced structural variation, can add to the variability in mutant allele frequencies (that is, mutant sequencing read count / (mutant + reference sequencing read count)). Both of these facts mean that, unlike inherited SNPs, which exist at B allele frequencies of 0% (in the case of a homozygous reference), 50% (in the case of a heterozygous variant) or 100% (in the case of a homozygous variant), acquired SNVs may exist at a continuous range of mutant allele frequencies. Consequently, the current standard of 30-fold redundant coverage is likely to be insufficient to mitigate false-negative calls of SNVs with low mutant allele frequencies. In fact, to detect somatic mutations in minor subclones that have mutant allele frequencies as low as 1 to 2%, substantially higher sequence coverage (for example, 400- to 500-fold) of the cancer genome would be required.
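To illustrate why low mutant allele frequencies demand deeper coverage, the sketch below computes the mutant allele frequency as defined above and then, under a simple binomial sampling model, the chance of observing at least a minimum number of mutant reads at a given depth. The binomial model, the threshold of three supporting reads and the function names are assumptions for illustration; sequencing error and mapping bias are ignored.

```python
from math import comb


def mutant_allele_frequency(mutant_reads: int, reference_reads: int) -> float:
    """Mutant read count / (mutant + reference read count)."""
    return mutant_reads / (mutant_reads + reference_reads)


def prob_at_least(depth: int, allele_frequency: float, min_mutant_reads: int) -> float:
    """P(X >= min_mutant_reads) for X ~ Binomial(depth, allele_frequency)."""
    p_fewer = sum(
        comb(depth, i) * allele_frequency**i * (1 - allele_frequency) ** (depth - i)
        for i in range(min_mutant_reads)
    )
    return 1.0 - p_fewer


# Compare a 1% subclonal mutation at 30-fold and 500-fold depth,
# requiring three supporting reads (a hypothetical calling threshold).
for depth in (30, 500):
    print(depth, round(prob_at_least(depth, 0.01, 3), 4))
```

Under these assumptions the 1% variant is almost never supported by three reads at 30-fold depth but usually is at 500-fold, which is consistent with the coverage range quoted in the text.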

Paired-end reads
Sequencing reads from each end of the same DNA molecule. Knowing the sequence of both reads and the length of the DNA molecule improves mapping to a reference sequence, de novo assembly and detecting structural variations.

In an effort to minimize false-negative errors, the International Cancer Genome Consortium guidelines suggest that the tumour cell content of a sample is at least 60 to 80% viable cells where possible. Some cancer genome-sequencing studies attempt to reduce non-cancerous cell contamination through microdissection (REF. 34), cell sorting (REF. 35), creation of low passage cell lines (REF. 36) and xenografts (REF. 37). Using one of these techniques is likely to be required for cancer types that have a high stromal content, such as pancreatic cancer, or that have substantial normal cellular content, such as haematological cancer types. Alternatively, increasing redundant coverage can help to compensate for low tumour purity, and in some cases, this is the most straightforward way to do so.
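The trade-off between tumour purity and coverage can be made concrete with a commonly used simplification, which is an assumption here rather than a formula from this Review: the expected mutant allele frequency of a clonal mutation equals the mutated tumour copies divided by all copies at that locus, weighted by purity. The function names and default copy numbers are hypothetical.

```python
def expected_mutant_allele_frequency(purity: float,
                                     mutated_copies: int = 1,
                                     tumour_copies: int = 2,
                                     normal_copies: int = 2) -> float:
    """Expected mutant allele frequency of a clonal mutation.

    Assumes every tumour cell carries `mutated_copies` of the mutation and
    that reads are sampled in proportion to DNA content.
    """
    tumour_alleles = purity * tumour_copies
    normal_alleles = (1.0 - purity) * normal_copies
    return purity * mutated_copies / (tumour_alleles + normal_alleles)


def depth_for_equivalent_support(purity: float, reference_depth: float = 30.0) -> float:
    """Depth at which an impure sample yields the same expected number of
    mutant reads as a pure sample sequenced at `reference_depth`."""
    pure = expected_mutant_allele_frequency(1.0)
    return reference_depth * pure / expected_mutant_allele_frequency(purity)


# Heterozygous clonal mutation in a diploid region at 60% tumour content.
print(expected_mutant_allele_frequency(0.6))
print(depth_for_equivalent_support(0.6))
```

Under these assumptions a sample at the lower end of the suggested 60 to 80% range needs roughly 50-fold rather than 30-fold coverage to give the same expected mutant-read support for a clonal heterozygous mutation.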

Paired-end reads for detecting structural variants. Second-generation cancer genome sequencing using paired-end reads with greater than 30-fold redundant coverage allows detailed characterization and simultaneous detection of SNVs, indels and structural variants (Supplementary information S2 (table)). Sequencing cases to less than 30-fold redundancy still allows a study to detect structural variants. Such studies can: characterize the architecture of structural variants at single-nucleotide resolution (for example, see REFS 38,39); describe the distribution of different types of structural variant in the cancer of an individual patient (for example, see REFS 39,40); examine the patterns of these variants across cancers from different patients (for example, see REFS 10,41); explore the evolution of structural variants (for example, see REFS 36,38); and discover chimeric genes (for example, see REFS 42,43).

Box 1 | Single-nucleotide alteration calling and filtering

Somatic mutation calling has several sources of false positives and false negatives, including technical bias, sequencing error, alignment artefacts, normal DNA contaminated with cancer DNA, tumour heterogeneity, copy number variants (CNVs) and copy-neutral regions of loss of heterozygosity (cnLOH). Single-nucleotide variant (SNV)- and indel-calling methods have features that attempt to minimize false-positive calls, and further filtering can also help to reduce these errors. A small selection of the many tools that can be used to detect SNVs and indels is listed in the table below.

Technical bias
During the standard library construction process, PCR introduces duplicate reads, and strand bias and GC bias can be compounded. Filtering out duplicate reads and SNV calls that are mainly supported by one strand can ameliorate PCR artefact bias. GC bias is a source of uneven redundant coverage across the genome; stretches of high AT and high GC content tend to be under-represented. Because inherited polymorphisms vastly outnumber acquired somatic variants, poor confidence in calls at these under-represented loci means that inherited single-nucleotide polymorphisms (SNPs) and indels are more likely to be mistaken for acquired SNVs and indels. Filtering by minimum read depth at a position and/or by a minimum number of reads that support variant calls helps to reduce this type of false positive. Annotating acquired SNVs and indels that are found in dbSNP or the 1000 Genomes repository is another way to address this issue. Of note, the current build of dbSNP contains somatic mutations.

Sequencing error
Second-generation sequencing has a higher rate of base error than first-generation sequencing. Errors are more likely to occur at the 3′ ends of reads from Illumina platform sequencing and at homopolymeric sequences in the case of Roche 454 Life Sciences sequencing. Requiring a minimum base quality and/or consensus sequence quality also helps to reduce uncertainty at poorly sequenced loci.

Alignment artefacts
Alignment tools produce artefacts in regions of low mappability, such as low-copy and simple tandem repeats. Filtering by mapping quality may help. Alternatively, SNV and indel calls in low mappability regions can be filtered out. Misalignment of reads around indels also occurs; most tools locally realign or assemble reads around indels, but filtering SNVs by proximity to gapped alignments is also an option.

| Tool | Statistic | Multiple samples | Filtering | Indels | URL |
| --- | --- | --- | --- | --- | --- |
| Samtools: mpileup, bcftools | Bayesian genotype likelihood model | Called independently | varFilter | Yes | http://samtools.sourceforge.net/mpileup.shtml |
| Genome Analysis Toolkit: UnifiedGenotyper | Bayesian genotype likelihood model | Called independently | Variant quality score recalibrator | Yes | http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_genotyper_UnifiedGenotyper.html; http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantrecalibration_VariantRecalibrator.html |
| Genome Analysis Toolkit: HaplotypeCaller | Local de novo assembly of haplotypes and affine gap penalty pair hidden Markov model (HMM) likelihood function | Called independently | Variant quality score recalibrator | Yes | http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_haplotypecaller_HaplotypeCaller.html; http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantrecalibration_VariantRecalibrator.html |
| SomaticSniper | Bayesian genotype likelihood model | Tumour–normal pairs called jointly | Built in | Yes | http://gmt.genome.wustl.edu/somatic-sniper/current |
| VarScan2 | Heuristic Fisher’s exact test | Called independently or as tumour–normal pairs | somaticFilter | Yes | http://varscan.sourceforge.net |
| Strelka | Bayesian continuous allele frequencies for both tumour and normal samples | Tumour–normal pairs called jointly | Post-call filtration | Yes | ftp://[email protected] |
| JointSNVMix | Probabilistic graphical models | Tumour–normal pairs called jointly | MutationSeq | Not specified | http://code.google.com/p/joint-snv-mix; http://compbio.bccrc.ca/software/mutationseq |

Affine gap penalty is a penalty that reduces sequence alignment score on the basis of the existence and length of gaps due to indels.
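The filters described in Box 1 (minimum read depth, minimum supporting reads, strand support, base quality and mapping quality) can be combined into a single pass/fail check. The Python below is only a sketch under stated assumptions: the `CandidateCall` fields and every threshold are hypothetical and are not taken from any tool in the table.

```python
from dataclasses import dataclass


@dataclass
class CandidateCall:
    """Summary statistics for one candidate SNV (hypothetical fields)."""
    depth: int                   # total reads covering the position
    alt_forward: int             # variant-supporting reads on the forward strand
    alt_reverse: int             # variant-supporting reads on the reverse strand
    mean_mapping_quality: float
    mean_alt_base_quality: float


def passes_filters(call: CandidateCall,
                   min_depth: int = 10,
                   min_alt_reads: int = 3,
                   min_mapping_quality: float = 30.0,
                   min_base_quality: float = 20.0,
                   max_single_strand_fraction: float = 0.9) -> bool:
    """Apply depth, support, quality and strand-bias filters in turn."""
    alt_reads = call.alt_forward + call.alt_reverse
    if call.depth < min_depth or alt_reads < min_alt_reads:
        return False
    if call.mean_mapping_quality < min_mapping_quality:
        return False
    if call.mean_alt_base_quality < min_base_quality:
        return False
    # Reject calls supported almost entirely by one strand.
    dominant_strand = max(call.alt_forward, call.alt_reverse) / alt_reads
    return dominant_strand <= max_single_strand_fraction


balanced = CandidateCall(depth=40, alt_forward=6, alt_reverse=5,
                         mean_mapping_quality=55.0, mean_alt_base_quality=32.0)
one_strand = CandidateCall(depth=40, alt_forward=11, alt_reverse=0,
                           mean_mapping_quality=55.0, mean_alt_base_quality=32.0)
print(passes_filters(balanced), passes_filters(one_strand))
```

Hard thresholds like these trade sensitivity for specificity; the Bayesian and recalibration approaches listed in the table weigh the same evidence probabilistically instead.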


Chimeric genes
A combination of segments of two or more genes that forms a new gene.

Split reads
Sequencing reads that align to non-contiguous spans of the reference sequence owing to somatic structural variation.

Mass-spectrometric genotyping
A method that generates locus-specific amplicons followed by primer extension that incorporates mass-modified dideoxynucleotides at the single-nucleotide polymorphism position. A mass spectrometer then measures the differential mass of the products.

Structural variants can be predicted from genome sequence on the basis of unexpected mapping distance and orientation of paired-end reads (REF. 44), split reads (for a review, see REF. 45) and/or de novo assembly of sequencing reads using bioinformatics tools such as ABySS (REF. 46). Paired-end reads from short DNA library fragments, on the order of hundreds of base pairs, provide the ability to detect smaller size intrachromosomal rearrangements (for reviews, see REFS 47,48). By contrast, paired-end reads from large DNA library fragments, on the order of thousands of base pairs, provide the ability to detect rearrangements in complex DNA regions, such as repetitive and duplicated sequences, and require less sequencing than short fragment libraries to achieve comparable physical coverage (REFS 40,49). Determining the differential redundant coverage across segments of the cancer genome compared with its matched normal genome is a way to detect CNVs specifically (REF. 42); this method does not depend on paired-end reads. However, paired-end reads improve the upstream process of aligning reads to the reference genome. In samples with a high degree of clonal heterogeneity or low tumour purity, the detection of CNVs can be confounded; however, bioinformatics adjustments for heterogeneity can help to improve detection (REF. 50).
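Two of the signals described above lend themselves to a short sketch: flagging read pairs whose mapping distance or orientation is unexpected, and comparing tumour with matched normal coverage per genomic bin. The Python below is illustrative only; the function names, the three-standard-deviation cut-off, the library-size normalization and the toy counts are assumptions, and real tools additionally correct for GC content, purity and ploidy.

```python
from math import log2
from typing import List, Optional


def is_discordant(insert_size: int, same_chromosome: bool, expected_orientation: bool,
                  mean_insert: float, sd_insert: float, n_sd: float = 3.0) -> bool:
    """Flag a read pair whose mapping distance or orientation is unexpected."""
    if not same_chromosome or not expected_orientation:
        return True
    return abs(insert_size - mean_insert) > n_sd * sd_insert


def log2_coverage_ratios(tumour_counts: List[int],
                         normal_counts: List[int]) -> List[Optional[float]]:
    """Per-bin log2(tumour / normal) read-count ratio after scaling each
    sample by its total read count. Bins with a zero count yield None."""
    tumour_total = sum(tumour_counts)
    normal_total = sum(normal_counts)
    ratios: List[Optional[float]] = []
    for t, n in zip(tumour_counts, normal_counts):
        if t == 0 or n == 0:
            ratios.append(None)
        else:
            ratios.append(log2((t / tumour_total) / (n / normal_total)))
    return ratios


# Toy example: the third bin has twice the tumour reads of its neighbours.
ratios = log2_coverage_ratios([100, 100, 200, 100], [100, 100, 100, 100])
print([round(r, 3) for r in ratios])
print(is_discordant(5000, True, True, mean_insert=300, sd_insert=30))
```

In the toy data the third bin sits one log2 unit above its neighbours, the pattern expected for a relative doubling in copy number before any purity adjustment.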

Structural variants predicted from genome sequencing may result in gene amplification, deletion, disruption or rearrangement. Transcriptome sequencing, however, can be used to identify how structural variants in the genome alter transcription of affected genes. Notably, it can be used to identify transcribed chimeric genes. Chimeric proteins are ideal macromolecules for molecularly targeted drug therapy, which aims to inhibit a specific mutated protein, thus maximizing efficacy while minimizing toxicity. The prototypical model of successful molecular-targeted drug therapy is inhibition of the oncogenic chimaera BCR–ABL by imatinib (REF. 51).

Verification resequencing. Verification resequencing is the use of a different technology to minimize false-positive calls of somatic mutations that occur owing to technology-specific systematic errors, such as library construction artefacts, sequencing errors and biases, and alignment inaccuracy. The objective of verification resequencing is to confirm that the candidate somatic mutation is not present in the normal genome and is present in the cancer genome. It is important for studies to report a verification rate (from which a false-positive rate can be estimated), but given that there may be thousands of candidate somatic mutations per patient, sometimes it is not practical to confirm them all; most somatic mutations in a patient’s cancer genome are unique, and verification resequencing can be costly, both in terms of money and in terms of the quantity of nucleic acid required. Strategies for determining a false-positive rate while reducing the number of individual verifications carried out include: selecting the mutations that are most likely to have an effect on the structure, function or expression level of a protein (such as nonsynonymous SNVs; for example, see REFS 2,16); selecting a random sample of the somatic mutations (for example, see REFS 37,52); or selecting somatic mutations that are deemed of interest (for example, see REFS 53,54). The International Cancer Genome Consortium has proposed that, on the basis of extrapolation from the calculated verification rate, at least 95% of somatic mutants listed in the catalogue for each sample should be real. A margin of error of 5% requires that a minimum of 384 somatic mutants should be verified.
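The figure of 384 verifications is consistent with the standard sample-size formula for estimating a proportion, n = z² p(1 − p) / e², evaluated at a 5% margin of error. The text does not state the confidence level or the assumed proportion, so the 95% confidence level (z = 1.96) and the worst-case proportion p = 0.5 used below are assumptions, as are the function names.

```python
from math import sqrt


def verification_sample_size(margin_of_error: float = 0.05,
                             z: float = 1.96,
                             proportion: float = 0.5) -> float:
    """Unrounded sample size n = z^2 * p * (1 - p) / e^2."""
    return z * z * proportion * (1.0 - proportion) / (margin_of_error ** 2)


def verification_rate(verified: int, tested: int) -> float:
    """Fraction of tested candidate mutations confirmed as somatic."""
    return verified / tested


def margin_for(rate: float, tested: int, z: float = 1.96) -> float:
    """Approximate (normal-theory) margin of error for an observed rate."""
    return z * sqrt(rate * (1.0 - rate) / tested)


n = verification_sample_size()
print(round(n, 2))  # just over 384 under the stated assumptions

# Hypothetical study: 370 of 384 randomly sampled candidates verified.
rate = verification_rate(370, 384)
print(round(rate, 3), round(margin_for(rate, 384), 3))
```

Rounding the unrounded value down gives the 384 quoted in the text; rounding up to 385 would be the more conservative choice. Note that a random sample supports extrapolation to the whole catalogue, whereas verifying only mutations of interest does not.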

A common form of verification is PCR amplification of the locus containing the candidate SNV, indel or structural variant breakpoint, followed by Sanger sequencing (for example, see REFS 2,42). Other methods of verification include mass-spectrometric genotyping (for example, see REFS 21,55) and targeted capture followed by sequencing using a different second-generation platform (for example, see REFS 15,18). Of note, Sanger sequencing and mass-spectrometric genotyping are limited by their inability to detect mutant alleles that are found at a low frequency (for example, see REFS 28,55). Verification of amplifications or deletions can be done through assessing concordance of the CNVs called from sequence data versus those detected with array-comparative genomic hybridization (for example, see REFS 34,39) or SNP arrays (for example, see REFS 14,18); however, the limits of detection with array technology make breakpoint sequencing preferable for smaller CNVs (for example, see REFS 42,56).

Study types

Today, with second-generation technology, the cost of sequencing a human cancer genome can be as little as US$5,000, and it continues to drop. Adequately powered studies have now become economically feasible, but they are not inexpensive. Several study designs have been developed to maximize impact while minimizing cost (FIG. 1; Supplementary information S3 (table)).

Single-patient studies. Single-patient studies are hypothesis-generating and have the potential to inform clinical practice, but they do not allow for the generalization of findings. Researchers can theorize as to which somatic mutations from the whole catalogue are important to the pathogenesis of cancer in an individual patient. These theories may be based on the literature (for example, see REFS 2,15) or on phylogenetic evolutionary analysis (REF. 55); consequently, they are limited in novelty or the strength of the conclusion, respectively. Single-patient second-generation cancer genome-sequencing studies are well-positioned to describe the somatic mutational signature of a cancer type (for example, see REFS 20,57) and to explore clonal evolution (for example, see REFS 16,18). They are best positioned to be integral components of personalized genomic medicine (for example, see REFS 17,58), the objective of which is to inform physician decision-making with respect to treatment.

Genome discovery cohort. The discovery cohort is a group of the same type, or subtype, of cancer that is subject to second-generation sequencing. Discovery cohort studies have the potential to detect recurrent somatic mutations of genes and pathways (that is, mutations found in more than one patient’s cancer; BOX 2). Recurrence in cancers from different patients is good evidence that a mutation might be involved in cancer pathogenesis, but it is not definitive evidence because phenomena such as linkage disequilibrium with a pathogenic gene deletion can result in recurrent somatic deletion of an adjacent gene (or genes) that is not involved in pathogenesis (REF. 59). Second-generation cancer genome sequencing of discovery cohorts can also allow for the identification of somatic mutational signatures or patterns of clonal evolution that may characterize a cancer type or subtype (for example, see REFS 41,52,60,61).

The actual power of the discovery cohort is in its potential for unbiased discovery and novel hypothesis generation. The statistical power of the discovery cohort is a function of the number of patients, inter-tumour genomic heterogeneity and the frequency of the event of interest. Genome discovery cohort studies have often been small (containing between 2 and 97 cases) and thus of limited statistical power (Supplementary information S3 (table)). If the aim of a study is to catalogue the large majority of recurrently somatically mutated genes found in a cancer type, or subtype, then the International Cancer Genome Consortium recommends that approximately 100 matched tumour–normal pairs in a discovery cohort and 400 in a validation cohort (see below) are required to reliably detect genes that are somatically mutated in 3% of cases. Of note, this two-tiered design requires that all somatically mutated genes identified in the discovery cohort are assessed in the validation cohort. If a somatically mutated locus or gene recurs at a fairly high frequency, a large discovery cohort is not necessarily required (for example, see REFS 62,63), but a highly sensitive survey of the mutational landscape of a cancer type cannot be achieved with smaller discovery cohorts.
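The dependence of statistical power on cohort size and event frequency can be illustrated with a simple binomial sketch. It asks how likely it is that a gene mutated in a given fraction of cases is seen in at least two patients of a cohort (that is, observed as recurrent at all). This is an illustration under simplifying assumptions (independent patients, perfect mutation detection, no background mutations) and is not the power calculation used by the International Cancer Genome Consortium.

```python
from math import comb


def prob_at_least_k(n_patients, frequency, k=2):
    """P(X >= k) for X ~ Binomial(n_patients, frequency).

    With k = 2 this is the chance that a gene mutated in `frequency`
    of cases is observed in two or more patients of the cohort.
    """
    p_fewer = sum(
        comb(n_patients, i) * frequency ** i * (1 - frequency) ** (n_patients - i)
        for i in range(k)
    )
    return 1.0 - p_fewer


# Compare small and large discovery cohorts for a gene mutated in 3% of cases.
for n in (10, 50, 100):
    print(n, round(prob_at_least_k(n, 0.03), 3))
```

Under these assumptions, small cohorts rarely observe a 3% event twice, which is one way to see why small discovery cohorts cannot provide a sensitive survey of the mutational landscape.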

Box 2 | Detecting recurrent mutations

In each cancer, there is a collection of somatic mutations, some of which create, control and/or direct the cancer phenotype. There are several statistical methods to identify the somatic mutations that are likely to be contributing to cancer pathogenesis. One way to determine which somatic mutations are probably driver mutations is with bioinformatics tools that find genes that are somatically mutated more often than would be expected by chance (or those that have a higher mutation rate than the background mutation rate) in a cohort. The background mutation rate can vary by the base context (for example, there is an increased rate of C to T mutations in the context CpG), among different regions of the genome (for example, exons versus introns), across a cohort of a cancer type or subtype (for example, hypermutated versus non-hypermutated colorectal cancers) or among individuals. Owing to their length and/or nucleotide composition, some genes may have inherently high mutation rates; thus, these factors may be adjusted for. Different methods calculate the background mutation rate differently. In addition to considering the mutation rate, the transition to transversion ratio and the nonsynonymous to synonymous mutation ratio can be used as proxy indicators of selection.

| Example study | Type of study | Type (or types) of cancer | Method to detect statistically significant recurrence | Method to detect statistically significant evidence of selection | Type (or types) of somatic mutations | Number of significantly recurrently mutated genes | Validation (cohort size/genes assessed) |
|---|---|---|---|---|---|---|---|
| Chapman (2011) (REF. 65) | Multi-ome; 23 WGS; 16 WES | Multiple myeloma | MutSig | Nonsynonymous to synonymous mutations ratio | Non-silent protein coding; non-coding regions with high regulatory potential | 10; 18 | 161/BRAF, IRF4 |
| Berger (2012) (REF. 97) | 25 WGS | Melanoma | MutSig | – | | 11 | 107/PREX2 |
| Ellis (2012) (REF. 27) | Multi-ome; 46 WGS; 31 WES | Breast cancer | MuSiC | – | Tier 1 mutations | 18 | – |
| Fujimoto (2012) (REF. 30) | 27 WGS | Hepatocellular carcinoma | In-house | – | Protein-altering point mutations | 15 | 120/ARID1A |
| Jones (2012) (REF. 76) | Multi-ome; 39 WGS; 21 WES; 65 custom capture | Medulloblastoma | MutSig | – | | 8 | – |
| Morin (2011) (REF. 62) | Multi-ome; 14 WGS or WES; 117 WTS | Non-Hodgkin’s lymphomas | – | In-house, based on REF. 105 | Nonsynonymous point mutation or nonsense mutations of 109 recurrently mutated genes | 26 | 261/MEF2B; 89/MLL2 |
| Shah (2012) (REF. 60) | Multi-ome; 15 WGS; 54 WES; 80 WTS | Breast cancer | – | In-house | | 6 | 159/29 genes |

An empty cell means ‘not specified’, and a dash (–) means ‘not done’. ARID1A, AT-rich interactive domain 1A; BRAF, v-raf murine sarcoma viral oncogene B1; IRF4, interferon regulatory factor 4; MEF2B, myocyte enhancer factor 2B; MLL2, myeloid/lymphoid or mixed-lineage leukaemia 2; PREX2, phosphatidylinositol-3,4,5-trisphosphate-dependent Rac exchange factor 2; WES, whole-exome sequencing; WGS, whole-genome sequencing; WTS, whole-transcriptome sequencing.
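The idea of finding genes that are mutated more often than expected by chance can be sketched as a one-sided binomial test. The sketch below is a deliberately simplified illustration and is not the algorithm used by MutSig, MuSiC or any in-house method listed in the box: it assumes a single uniform background mutation rate per base per patient (whereas the box notes that the rate varies by base context, genomic region and individual), and the function name and parameters are hypothetical.

```python
from math import comb


def recurrence_p_value(n_patients, n_mutated, gene_length_bp, background_rate):
    """One-sided binomial P value for observing at least `n_mutated`
    patients with a mutation in a gene, given the background rate.

    Simplifying assumptions (illustrative only): a uniform per-base,
    per-patient background mutation rate and independent patients.
    """
    # Chance that one patient carries >= 1 background mutation in the gene;
    # longer genes are hit more often, so gene length is adjusted for here.
    p_hit = 1.0 - (1.0 - background_rate) ** gene_length_bp
    # Upper tail of Binomial(n_patients, p_hit).
    return sum(
        comb(n_patients, k) * p_hit ** k * (1.0 - p_hit) ** (n_patients - k)
        for k in range(n_mutated, n_patients + 1)
    )


# Hypothetical example: a 1,500-bp coding region mutated in 10 of 100 patients
# against a background rate of 1 mutation per megabase.
print(recurrence_p_value(100, 10, 1500, 1e-6))
```

In a real analysis, a P value of this kind would be computed for every gene and then corrected for multiple testing; the background model is where the published methods differ most.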


Multi-ome discovery cohort
A cohort of cancer genomes, exomes and/or transcriptomes; more than one omic measurement per sample is not necessary.

Allelic imbalances
Unequal transcript levels of the alleles of a gene.

Multi-ome discovery cohort. Multi-ome discovery cohorts use second-generation technology to sequence an assortment of genomes, exomes (for example, see REFS 64,65) and/or transcriptomes (for example, see REFS 60,62) in a group of cancers of the same type or subtype. Specifically, researchers will typically sequence the genome of a smaller number of samples and the exomes or transcriptomes of a larger number of samples; more than one ‘omic’ measurement per sample is not necessary. The advantages and disadvantages of exome and transcriptome sequencing, and consequently the multi-ome discovery cohort, are further discussed below.

Second-generation technology is widely used to sequence cancer exomes, and this approach has been the source of exciting findings in various cancer types (for example, see REFS 66–69). Here we focus on how exome sequencing can be coupled with genome sequencing. Certain versions of today’s exome technology can capture up to 70 Mb of exons, non-coding RNAs and non-coding regions with high regulatory potential. The exome can also be customized to capture specific regions of interest. Theoretically, sequencing the exome does not improve the detection of any specific type of mutation over genome sequencing; it can detect coding SNVs, indels and structural variants (REFS 70,71). Practically, the substantially lower cost of sequencing per sample allows researchers to focus on a subset of the genome that is manageable in terms of cost and makes greater redundant coverage possible, increasing the sensitivity of detection of low mutation allele frequency somatic coding mutations (for example, see REFS 57,72). However, exome sequencing can miss somatic coding mutations in areas where sequence coverage is poor or where targeted capture probes need improvement (REF. 57).
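The trade-off between target size and redundant coverage can be made concrete with a back-of-the-envelope calculation. In the sketch below, the 70 Mb capture size comes from the text, whereas the approximate genome size (3.1 Gb), the 30-fold genome coverage used as a sequencing budget and the 60% on-target fraction are assumptions chosen for illustration.

```python
def mean_coverage(total_bases_sequenced, target_size_bp, on_target_fraction=1.0):
    """Mean fold-coverage of a target given a total sequencing yield.

    `on_target_fraction` models capture inefficiency (reads that fall
    outside the captured regions); 1.0 means no loss.
    """
    return total_bases_sequenced * on_target_fraction / target_size_bp


GENOME_BP = 3.1e9  # approximate human genome size (assumption)
EXOME_BP = 70e6    # largest capture size mentioned in the text

# Assumed budget: the amount of sequence that gives 30-fold genome coverage.
budget = 30 * GENOME_BP

print(mean_coverage(budget, GENOME_BP))                         # whole genome
print(mean_coverage(budget, EXOME_BP, on_target_fraction=0.6))  # exome capture
```

Even with a substantial loss to off-target reads, the same yield spread over an exome-sized target gives coverage that is more than an order of magnitude deeper, which is the basis for the improved sensitivity to low-frequency coding mutations described above.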

Second-generation sequencing of the transcriptome (using RNA sequencing (RNA-seq); REFS 73,74) allows for the discovery of SNVs (for example, see REF. 62), indels, chimeric transcripts (for example, see REF. 75), novel transcripts, alternative splicing (for example, see REF. 60), allelic imbalances (for example, see REF. 76) and differentially expressed transcripts (for example, see REF. 77) (FIG. 2). It should be noted that SNVs that decrease transcript levels by negatively affecting transcription or transcript stability will not be detected in the transcriptome. Also, small RNAs are not captured in most RNA-seq libraries and require their own library preparation (REF. 78). Compared with array-based technologies, RNA-seq has many advantages because it is digital technology (for a Review, see REF. 79). For example, RNA-seq has an improved ability to compare transcription levels across different genes, samples, experiments, time points and platforms. Moreover, it has a greater dynamic range and increased sensitivity depending on the depth of sequencing. Transcriptome sequencing can be limited, however, by the difficulty of finding a matched ‘normal’ tissue. For example, differential expression analysis comparing cancerous tissues to cell-of-origin material would not be possible if the cell type of origin were unobtainable or unknown.

Figure 2 | The integration of transcriptome and epigenome with whole-genome sequencing. Integration analyses can indicate whether the somatic mutation of a gene results in a pathogenic increase, decrease or change of function. The dark blue tracts represent intergenic regions, the overlaid light blue rectangles represent genic regions, and the light blue rectangles with poly(A)s represent mRNA. The green circles represent cytosine methylation. The red and purple bars represent a translated nonsynonymous single-nucleotide variant (SNV). The yellow bar represents a splice site mutation. The black line represents the altered exon joining. The green tract represents an interchromosomal translocation.

[Figure 2 labels. Alterations shown: translocation, amplification, deletion, nonsynonymous SNV, splice site SNV, promoter hypomethylation, promoter hypermethylation. Outcomes shown: normal transcription, greater transcription, gain-of-function variant and greater transcription, loss-of-function variant, gene silencing, no transcription, altered transcript isoform, chimeric transcript.]

Integration omics
Examining how somatic mutation or deregulation of a genome, transcriptome and/or epigenome converge on a pathway, process or gene; more than one omic measurement per sample is not necessary. For example, gene inactivation through single-nucleotide variants or epigenomic silencing.

Interaction omics
Examining how the somatic mutation or deregulation of the genome, transcriptome and/or epigenome affect one another; more than one omic measurement per sample is ideal. For example, somatic copy number variants can have effects on transcript levels.

Custom capture
Hybridization or amplification of selected regions of the genome to specifically capture loci for second-generation sequencing.

Ultra-deep second-generation resequencing
Greater than 100-fold redundant sequence coverage of a targeted selection of somatic mutations.

The advantages of multi-ome discovery cohorts are twofold. First, because the quantity of sequence required for coverage of exomes and transcriptomes is substantially less than for genomes, the multi-ome discovery cohorts minimize the expenditure of dollars, time and resources required to discover recurrent and potentially high-impact somatic mutations. The key words are ‘recurrent’ and ‘high-impact’. The lower cost of exome and transcriptome sequencing makes possible larger sample sizes, which have a greater power to detect statistically significant recurrent somatic mutations; high-impact refers to somatic mutations that are most likely to affect protein function and/or expression. At the same time, researchers can identify, in the few genome sequences, frequently recurrent structural variants, SNVs or indels that cannot be detected through exome or transcriptome sequencing. The second advantage of multi-ome discovery cohorts (or of any study in which more than one ome is measured) is integrative analyses (that is, integration omics), which explore how the different types of somatic mutation discovered with the different types of omes converge on a mutated locus, gene or pathway (FIG. 2). This is important, as there is no single second-generation sequencing omics measure capable of detecting all the types of abnormalities implicated in cancer pathogenesis. Landmark studies from The Cancer Genome Atlas Research Network use integrative analyses to consider cancer molecular aetiology from as many perspectives as possible (REFS 43,77). Bioinformatics tools such as PathScan (REF. 80) and PARADIGM (REF. 81) facilitate integrative analyses of multi-ome cohorts.

On a separate but related note, exploring the interaction between omes (that is, interaction omics) is an emerging field in cancer genome sequencing. Technically, the task is straightforward and involves making more than one omic measurement on the same sample and then analysing the global interactions between, for example, genomes and transcriptomes or epigenomes. It considers how somatic mutation or deregulation in one ome affects another. Practically, exploring omic interactions is constrained by several factors (for a Review, see REF. 82). First, there are only a few cancer genome studies that have used second-generation sequencing to make more than one omic measurement per sample (Supplementary information S1 (table)). Second, there is a dearth of bioinformatics tools to carry out interaction analyses for second-generation sequence data. This may be, in part, due to the fact that there are a multitude of different specific questions that a researcher may ask when exploring interaction omics. For instance, in exploring the interaction between the genome and transcriptome, researchers have determined the proportion of DNA SNVs that are detectable in the transcriptome (for example, see REFS 62,76,83), discovered cancer-specific RNA-editing events that recode the amino acid sequence (for example, see REFS 16,84), identified splice site mutations that affect transcript length and structure (for example, see REFS 60,62) and determined the association of DNA somatic mutations with transcript levels (for example, see REFS 16,17,60). Few cancer genome-sequencing studies have explored the interaction between the multifaceted epigenome and the somatic mutations of the genome or dysregulation of the transcriptome, even though aberration of the epigenome is a feature of cancer pathogenesis (REF. 85). Chromatin immunoprecipitation followed by second-generation sequencing of the captured DNA fragments (ChIP–seq) is a well-established technique used to survey genome-wide interactions between modified histones and DNA (REFS 82,86). In addition, second-generation-sequencing-based methods that determine the methylation state of cytosines, and associated bioinformatics tools, are emerging (REFS 82,87).
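One of the genome–transcriptome questions mentioned above, the proportion of DNA SNVs that are detectable in the transcriptome, can be sketched as a set comparison. The data layout (tuples of chromosome, position, reference and alternate allele), the RNA depth lookup and the minimum-depth threshold of 8 reads are all hypothetical choices for illustration and are not taken from the cited studies.

```python
def snv_detection_summary(dna_snvs, rna_snvs, rna_depth, min_depth=8):
    """Summarize how many genomic SNVs are also called in RNA-seq data.

    dna_snvs, rna_snvs: iterables of (chrom, pos, ref, alt) tuples.
    rna_depth: dict mapping (chrom, pos) to RNA-seq read depth.
    Only SNVs at positions with adequate RNA depth are 'assessable';
    the rest lie in unexpressed or poorly covered regions.
    """
    dna = set(dna_snvs)
    rna = set(rna_snvs)
    assessable = {v for v in dna if rna_depth.get((v[0], v[1]), 0) >= min_depth}
    detected = assessable & rna
    fraction = len(detected) / len(assessable) if assessable else 0.0
    return {
        "dna_snvs": len(dna),
        "assessable": len(assessable),
        "detected": len(detected),
        "fraction_detected": fraction,
    }


# Toy data (hypothetical coordinates and alleles).
dna_calls = [("chr1", 100, "C", "T"), ("chr2", 200, "G", "A"), ("chr3", 300, "A", "G")]
rna_calls = [("chr1", 100, "C", "T"), ("chr7", 50, "T", "C")]
depth = {("chr1", 100): 40, ("chr2", 200): 25, ("chr3", 300): 2}

print(snv_detection_summary(dna_calls, rna_calls, depth))
```

Separating assessable from non-assessable positions matters because, as noted earlier, SNVs in genes with little or no transcription cannot be observed in the transcriptome; RNA-only calls (such as the chr7 call in the toy data) are the starting point for identifying candidate RNA-editing events.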

Validation and extension cohort. A two-tiered study design includes hypothesis generation using a discovery cohort (that is, single-patient, genome cohort or multi-ome cohort) followed by hypothesis testing using a targeted approach in a larger validation cohort. Targeted methods for exploring somatic mutations of interest are varied (see Supplementary information S3 (table)). They include: genotyping for specific nucleotide changes; sequencing of single exons, the complete coding region of genes or whole genes; and targeting thousands of genes using custom capture followed by ultra-deep second-generation resequencing. When validating patterns of structural variation across samples, often the extension cohort stage consists of analysis of publicly available whole-genome array data sets (for example, see REFS 29,83,88).

The primary aim of a validation cohort is to test the reproducibility of the discovery cohort findings (for example, see REFS 28,89). Some cancer genome-sequencing studies have secondary aims, such as surveying generalizability (for example, see REF. 63) or assessing clinical importance (for example, see REF. 90) of the somatically mutated genes and pathways of interest (FIG. 1; Supplementary information S1 (table)); this goes beyond validating findings and extends them. If a validation or extension cohort is composed of different cancer types, then researchers may determine the extent to which findings from a particular cancer type can be generalized to any of the hundreds of other histological types of cancer. Specifically, on achieving primary aims, researchers may survey generalizability in a cohort of similar cancer subtypes (for example, see REFS 62,83,91), among cancers with a uniting feature (for example, paediatric cancers; REF. 63), between cancers with contrasting features (for example, brainstem versus non-brainstem cancers; REF. 63) or among various common types of cancer (for example, see REF. 38). Somatically mutated genes or pathways that are common to a wide variety of cancer types probably underlie the hallmark phenotypes of cancer pathogenesis (for a review of cancer hallmarks, see REF. 92). However, somatic mutations that are common to a specific cancer subtype may be important to distinctive phenotypic features or optimal treatment, or might serve as molecular subtype biomarkers (for example, see REFS 89,90). If validation or extension cohort cases are linked to clinical data, they can be used to explore clinical importance, which can be defined by the value added to the clinical management of cancer patients (TABLE 1). For instance, in neuroblastoma, somatic mutation of alpha thalassaemia/mental retardation syndrome X-linked (ATRX) associates with age, and age is a prognostic marker of survival (REF. 89). Another example is that chromothripsis associates with poor survival in neuroblastoma (REF. 90) and acute myeloid leukaemia (REF. 93).

Somatic mutations can also predict the response to, or the need for, standard treatment (for example, see REFS 27,64). Before development of somatic mutations as clinically important prognostic or diagnostic biomarkers, researchers will need to consider and to discuss more than whether a finding is statistically significant; clinical significance (that is, the strength of correlations and/or effect size) and technical reliability (that is, sensitivity and specificity) are elemental to clinical importance. Some of the recent studies using large discovery cohorts report statistically significant findings and consequently have no validation cohort (for example, see REFS 90,94). As the sample size of discovery cohorts increases, independent replication studies, rather than validation cohorts, may become the norm.
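Technical reliability, as defined above, can be summarized from a two-by-two comparison of a candidate biomarker assay against a reference standard. The following is a minimal sketch using the standard definitions of sensitivity and specificity; the counts are hypothetical.

```python
def sensitivity_specificity(true_pos, false_pos, true_neg, false_neg):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    sensitivity = TP / (TP + FN): mutated cases correctly called mutated.
    specificity = TN / (TN + FP): unmutated cases correctly called unmutated.
    """
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Hypothetical assay results for 150 samples with known mutation status.
sens, spec = sensitivity_specificity(true_pos=45, false_pos=10, true_neg=90, false_neg=5)
print(sens, spec)
```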

Investigating specific aims

Discovering driver mutations. The first clue that a somatic mutation may be important to cancer pathogenesis is if it affects a known cancer gene or a member of a pathway implicated in cancer. Here the interpretation of cancer genome-sequencing findings is based on current knowledge and is not dependent on statistical analyses; consequently, it does not require a large sample size. A disadvantage of this method is that the current number of cancer genes (approximately 487 according to the Sanger Institute Cancer Gene Census) is probably an underestimation of the true number.

Cases in which similar cancer phenotypes in different patients are found to have recurrent mutations of genes or pathways suggest convergent molecular evolution and thus provide compelling evidence for biological relevance and selection (REFS 95,96). Bioinformatics tools such as MutSig and MuSiC find the statistically significant recurrent somatically mutated genes (REFS 30,97; BOX 2). Perhaps the most common method for discovering driver mutations is by identifying recurrent somatic mutations that are predicted to have translational consequences (for example, missense, nonsense, splice-site SNVs or coding indels). In addition, signatures of negative selection on somatic mutations in promoter regions (REF. 21), decreased expression of genes with somatic mutations in untranslated regions or introns (REF. 52) and significantly recurrent mutations found in non-coding regions (for example, see REFS 30,60,65) suggest that non-coding mutations may also provide insight into the pathogenesis of cancer. Synonymous somatic mutations also deserve attention as they can potentially modulate transcript levels, protein structure and splicing (for example, see REFS 98,99). Given the multiple mutational mechanisms that inactivate, activate, moderate or change the function of genes, researchers should ideally assess whether structural variants or transcriptional or epigenetic deregulation also affect recurrently mutated genes.

After recurrence has been established, patterns can also indicate the functional effect that a putative driver mutation might have. For instance, mutation hotspots (REF. 72) (that is, a single DNA locus in which somatic mutations recur) and somatic mutations that cluster in a particular domain of a gene (for example, see REFS 57,62) suggest an increase of function or a cancer-promoting change of function (REF. 65). Conversely, recurrent somatic mutations that are distributed along a gene are suggestive of a candidate tumour suppressor (REF. 62): a gene that allows for cancer growth when deactivated or deleted. There are also bioinformatics means to detect signatures of selection in recurrently mutated genes, which can reveal candidate driver genes (BOX 2 and Supplementary information S1 (table)). One caveat is the assumption that synonymous mutations are evolutionarily neutral.

Table 1 | Examples of clinically relevant actionable mutations

| Clinically relevant or actionable mutated genes and/or pathways | Treatment | Cancer types | Refs |
|---|---|---|---|
| RET | Sunitinib (multi-targeted receptor tyrosine kinase inhibitor), sorafenib (multi-targeted receptor tyrosine kinase and RAF kinases inhibitor) | Adenocarcinoma, lung adenocarcinoma | 17,84 |
| BRAF | Vemurafenib (small-molecule BRAF V600E kinase inhibitor) | Melanoma, hypermutated colorectal cancer, multiple myeloma, malignant melanoma, triple-negative breast cancer | 19,43,60,65,97 |
| KRAS | Cetuximab (EGFR inhibitor) | Colorectal cancer, metastatic pancreatic cancer, cancer of the ampulla of Vater, lung cancer, early T cell precursor acute lymphoblastic leukaemia | 21,28,36,40,43,58,83 |
| EGFR, ERBB2, ERBB3 signalling | Gefitinib (EGFR inhibitor), erlotinib (EGFR tyrosine kinase inhibitor), cetuximab, lapatinib (ERBB1 and ERBB2 receptor tyrosine kinase inhibitor) | ER-positive breast cancer, hepatocellular carcinoma, lung cancer, colorectal cancer, breast cancer, lobular breast cancer | 16,21,27,30,43,52,60,61,72 |
| EML4–ALK | Crizotinib (ALK and ROS1 inhibitor) | Neuroblastoma | 90 |
| PML–RARA | All trans-retinoic acid therapy | Acute myeloid leukaemia, acute promyelocytic leukaemia | 53,100 |
| LRRK | Candidate treatment: bortezomib | Multiple myeloma, metastatic acral melanomas | 34,65 |

ALK, anaplastic lymphoma receptor tyrosine kinase; BRAF, v-Raf murine sarcoma viral oncogene B1; EGFR, epidermal growth factor receptor; EML4, echinoderm microtubule-associated protein-like 4; ER, oestrogen receptor; ERBB2, v-Erb-b2 erythroblastic leukaemia viral oncogene 2; KRAS, v-Ki-Ras2 Kirsten rat sarcoma viral oncogene; LRRK, leucine-rich repeat kinase 2; PML, promyelocytic leukaemia; RARA, retinoic acid receptor alpha; RET, ret proto-oncogene; ROS1, c-Ros oncogene 1.

Multi-region sequencing
Sequencing of distinct regions of the same solid tumour; this allows for the examination of intra-tumour heterogeneity and clonal evolution.

Clinically actionable drug targets
Biological molecules or processes that can be targeted by an existing or experimental drug.

Pathway analysis is another method used to identify driver genes, and it is particularly important in cancer types with heterogeneous mutational landscapes, where few recurrently mutated genes are found (Supplementary information S1 (table)). The somatic mutations that are important to cancer pathogenesis in individual patients may differ, but different somatic mutations can create similar dysfunction of a biological pathway (for example, see REFS 30,90). On a related note, several somatic mutations may work together to deregulate the same pathway in one patient (for example, see REFS 17,21). In both these scenarios, pathway analysis may uncover driver mutations of interest for further evaluation, bearing in mind the important caveats that not all genes affected by somatic mutations are included in pathway analysis and that understanding of the function of gene products in relation to each other is incomplete. Also, determining mutations that are mutually exclusive can help to define subtypes (for example, see REF. 76) and/or reveal the different somatic mutations that create similar oncogenic dysfunction (for example, see REFS 89,97). A bioinformatics tool called mutation relation test can be used to assess statistically mutually exclusive relationships (REF. 80).
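Mutual exclusivity between two genes can be assessed with a generic one-sided Fisher's exact test on the two-by-two table of patients. The sketch below is a textbook test used for illustration; it is not the mutation relation test cited above, whose implementation is not described in this article, and the patient counts are hypothetical.

```python
from math import comb


def mutual_exclusivity_p(n_both, n_a_only, n_b_only, n_neither):
    """One-sided Fisher's exact P value for mutual exclusivity.

    Returns the probability, under independence with fixed margins, of
    seeing `n_both` or fewer patients with mutations in both genes.
    A small value suggests less co-occurrence than expected by chance.
    """
    n_a = n_both + n_a_only          # patients mutated in gene A
    n_b = n_both + n_b_only          # patients mutated in gene B
    n = n_a + n_b_only + n_neither   # all patients
    lowest = max(0, n_a + n_b - n)   # smallest possible overlap
    total = comb(n, n_b)
    # Hypergeometric lower tail over possible overlaps k.
    tail = sum(comb(n_a, k) * comb(n - n_a, n_b - k) for k in range(lowest, n_both + 1))
    return tail / total


# Hypothetical cohort of 100 patients: 30 with gene A only, 30 with gene B only,
# none with both, 40 with neither.
print(mutual_exclusivity_p(0, 30, 30, 40))
```

A test of this kind ignores differences in overall mutation burden between patients, so hypermutated cases could mask or mimic exclusivity; pairs flagged in this way are candidates for further evaluation rather than conclusions.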

Characterizing clonal evolution. There are three main second-generation genome-sequencing study designs that can demonstrate clonality and the molecular evolution of cancer (Supplementary information S1 (table)). These designs, which are not mutually exclusive, use ultra-deep resequencing (for example, see REFS 2,60,100), multi-region sequencing (for example, see REFS 55,101,102) and/or sequential multi-sample sequencing (for example, see REFS 100,103,104). First, ultra-deep resequencing (that is, >100-fold redundant coverage) of select somatic mutations allows researchers to assess somatic mutant allele frequencies more accurately and to detect those that are at a low frequency (for example, see REFS 16,18,61). Clustering analyses based on mutant allele frequencies can reveal the number of subclones or the intra-tumour heterogeneity (for example, see REFS 16,18,60) and can be used to construct a phylogenetic tree to show the inferred evolutionary relationships among subclones (for example, see REFS 61,100,103). An advantage of this method is that it requires only one sample to be sequenced. Second, multi-region sequencing can also detect intra-tumour clonal heterogeneity and phylogeny in solid tumours without the need for determining somatic mutation allele frequencies (REF. 55). Third, sequential multi-sample sequencing is based on observing a change in somatic mutation allele frequencies over time in related populations of cancer cells; in addition to the primary cancer, the recurrent cancer (for example, see REFS 17,38,54) and/or secondary metastasis (for example, see REFS 34,40,101) are also sequenced. Somatic mutations that change in frequency or that are unique to the subsequent cancer may be important to disease progression and/or acquired drug resistance (in cancer types that have been exposed to the selective pressure of treatment). One of the major hurdles in multi-sample evolutionary examination is the ethics of sequential sampling in which there is no intrinsic benefit for a patient. For haematological cancer, this is less of an issue, as frequent blood draws are used to monitor progression or remission. In most solid tumours, however, invasive biopsies or resection of drug-resistant recurrent cancer or metastasis is not a widespread practice.
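The sequential multi-sample design can be illustrated by comparing mutant allele frequencies between a primary and a later sample. The following is a minimal sketch: the mutation labels, read counts and thresholds (a 2% detection floor and a 10-percentage-point change) are hypothetical, and a real analysis would also account for tumour purity, copy number and sampling error.

```python
def allele_frequency(alt_reads, depth):
    """Mutant allele frequency; 0.0 when the site has no coverage."""
    return alt_reads / depth if depth else 0.0


def classify_changes(primary, later, min_vaf=0.02, min_delta=0.10):
    """Label each mutation by how its allele frequency changed over time.

    primary, later: dicts mapping a mutation label to (alt_reads, depth).
    A mutation missing from a dict is treated as frequency 0, which
    conflates 'absent' with 'not covered' (a simplification).
    """
    labels = {}
    for mutation in sorted(set(primary) | set(later)):
        before = allele_frequency(*primary.get(mutation, (0, 0)))
        after = allele_frequency(*later.get(mutation, (0, 0)))
        if before < min_vaf <= after:
            label = "unique to later sample"
        elif after < min_vaf <= before:
            label = "lost in later sample"
        elif after - before >= min_delta:
            label = "increased"
        elif before - after >= min_delta:
            label = "decreased"
        else:
            label = "stable"
        labels[mutation] = label
    return labels


# Toy read counts (alt_reads, depth) from ultra-deep resequencing.
primary_counts = {"GENE_A": (60, 200), "GENE_B": (20, 200), "GENE_C": (50, 200)}
relapse_counts = {"GENE_A": (62, 200), "GENE_B": (90, 200), "GENE_D": (40, 200)}

for mutation, label in classify_changes(primary_counts, relapse_counts).items():
    print(mutation, label)
```

Mutations labelled as increased or unique to the later sample are the ones the text identifies as candidates for involvement in progression or acquired drug resistance.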

Future perspectives

Personalized medicine is an important area that will greatly benefit from innovative cancer genome-sequencing study designs. Currently, second-generation sequencing of cancer genomes with the intent of guiding therapy decisions is in its infancy, but promise has been shown in single-patient studies. Among the first cancer genome-sequencing studies to inform treatment was work on a rare subtype of treatment-resistant metastatic adenocarcinoma (REF. 17). On the basis of an integrative analysis of genome and transcriptome data, researchers developed a hypothesis for the mechanism driving the cancer. Subsequent therapeutic intervention coincided with partial remission of the metastatic disease, which ultimately progressed. Further sequencing showed that the metastasis had undergone extensive evolution, particularly in the pathway that was targeted by tyrosine kinase inhibitors, and had become drug-resistant.

Moving from single patients to cohort-based personalized medicine research trials has great potential. In fact, recent studies have found that ~20% of triple-negative breast cancers (REF. 16) and more than 60% of lung cancers (REF. 77) have potential clinically actionable drug targets. However, cohort-based personalized medicine research trials will have numerous challenges. If a randomized trial tests the safety and efficacy of cancer genome-sequencing guided treatment decisions, then it might require many different drugs to be used in the treatment arm; this becomes complicated in terms of who funds the trial. If a randomized trial tests the safety and efficacy of a novel treatment and if in order to meet the inclusion criteria patients are required to have a specific somatically mutated gene, pathway or mutational signature, then a large number of potential participants will probably need to be screened. Furthermore, ethical concerns require that personalized medicine approaches be tested in end-stage patients for whom the standard care has failed. Consequently, this paradigm will initially be tested in the patients with the greatest need and the most challenging cases. Despite challenges, second-generation cancer sequencing is rapidly moving towards use in a clinical capacity.


1. Stratton, M. R., Campbell, P. J. & Futreal, P. A. The cancer genome. Nature 458, 719–724 (2009).

2. Ley, T. J. et al. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature 456, 66–72 (2008). This was the first study to use second-generation technology to sequence a cancer genome. It established cancer genome sequencing as an unbiased method for discovering candidate driver mutations.

3. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 431, 931–945 (2004).

4. Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).

5. Battelle Technology Partnership Practice. Economic impact of the Human Genome Project: how a $3.8 billion investment drove $796 billion in economic impact, created 310,000 jobs, and launched the genomic revolution. battelle.org [online], http://www.battelle.org/docs/default-document-library/economic_impact_of_the_human_genome_project.pdf?sfvrsn=2 (2011).

6. Morin, R. D. et al. Somatic mutations altering EZH2 (Tyr641) in follicular and diffuse large B-cell lymphomas of germinal-center origin. Nature Genet. 42, 181–185 (2010).

7. Sneeringer, C. J. et al. Coordinated activities of wild-type plus mutant EZH2 drive tumor-associated hypertrimethylation of lysine 27 on histone H3 (H3K27) in human B-cell lymphomas. Proc. Natl Acad. Sci. USA 107, 20980–20985 (2010).

8. McCabe, M. T. et al. EZH2 inhibition as a therapeutic strategy for lymphoma with EZH2-activating mutations. Nature 492, 108–112 (2012).

9. Nik-Zainal, S. et al. Mutational processes molding the genomes of 21 breast cancers. Cell 149, 979–993 (2012).

10. Stephens, P. J. et al. Massive genomic rearrangement acquired in a single catastrophic event during cancer development. Cell 144, 27–40 (2011).

11. Misale, S. et al. Emergence of KRAS mutations and acquired resistance to anti-EGFR therapy in colorectal cancer. Nature 486, 532–536 (2012).

12. Northcott, P. A. et al. Medulloblastomics: the end of the beginning. Nature Rev. Cancer 12, 818–834 (2012).

13. Meyerson, M., Gabriel, S. & Getz, G. Advances in understanding cancer genomes through second-generation sequencing. Nature Rev. Genet. 11, 685–696 (2010).

14. Mardis, E. R. et al. Recurring mutations found by sequencing an acute myeloid leukemia genome. N. Engl. J. Med. 361, 1058–1066 (2009).

15. Link, D. C. Identification of a novel TP53 cancer susceptibility mutation through whole-genome sequencing of a patient with therapy-related AML. JAMA 305, 1568 (2011).

16. Shah, S. P. et al. Mutational evolution in a lobular breast tumour profiled at single nucleotide resolution. Nature 461, 809–813 (2009). This study used ultra-deep resequencing to characterize clonal evolution and showed that variable somatic mutation allele frequencies can reflect different subclones. Moreover, considerable evolution can occur over time.

17. Jones, S. J. et al. Evolution of an adenocarcinoma in response to selection by targeted kinase inhibitors. Genome Biol. 11, R82 (2010). This work incorporated second-generation sequencing into the personalized medicine framework. Specifically, the intent of the case study was to inform physician decision making with respect to treatment of a rare cancer.

18. Ding, L. et al. Genome remodelling in a basal-like breast cancer metastasis and xenograft. Nature 464, 999–1005 (2010).

19. Pleasance, E. D. et al. A comprehensive catalogue of somatic mutations from a human cancer genome. Nature 463, 191–196 (2009).

20. Pleasance, E. D. et al. A small-cell lung cancer genome with complex signatures of tobacco exposure. Nature 463, 184–190 (2009). This study highlighted that the distribution and composition of somatic mutations across a genome is not uniform. It showed that through examining the mutational signatures, researchers can gain insight into the mechanisms and processes that may have given rise to the mutations.

21. Lee, W. et al. The mutation spectrum revealed by paired genome sequences from a lung cancer patient. Nature 465, 473–477 (2010).

22. Levy, S. et al. The diploid genome sequence of an individual human. PLoS Biol. 5, e254 (2007).

23. Wheeler, D. A. et al. The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872–876 (2008).

24. Wang, J. et al. The diploid genome sequence of an Asian individual. Nature 456, 60–65 (2008).

25. Bentley, D. R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008). An accurate consensus sequence was built with second-generation technology from >30-fold redundant coverage of 35 bp paired-end reads.

26. Pelak, K. et al. The characterization of twenty sequenced human genomes. PLoS Genet. 6, e1001111 (2010).

27. Ellis, M. J. et al. Whole-genome analysis informs breast cancer response to aromatase inhibition. Nature 486, 353–360 (2012).

28. Bass, A. J. et al. Genomic sequencing of colorectal adenocarcinomas identifies a recurrent VTI1A-TCF7L2 fusion. Nature Genet. 43, 964–968 (2011).

29. Berger, M. F. et al. The genomic complexity of primary human prostate cancer. Nature 470, 214–220 (2011).

30. Fujimoto, A. et al. Whole-genome sequencing of liver cancers identifies etiological influences on mutation patterns and recurrent mutations in chromatin regulators. Nature Genet. 44, 760–764 (2012).

31. Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001).

32. Ajay, S. S., Parker, S. C. J., Ozel Abaan, H., Fuentes Fajardo, K. V. & Margulies, E. H. Accurate and comprehensive sequencing of personal genomes. Genome Res. 21, 1498–1505 (2011).

33. Navin, N. et al. Tumour evolution inferred by single-cell sequencing. Nature 472, 90–94 (2011).

34. Turajlic, S. et al. Whole genome sequencing of matched primary and metastatic acral melanomas. Genome Res. 22, 196–207 (2011).

35. Puente, X. S. et al. Whole-genome sequencing identifies recurrent mutations in chronic lymphocytic leukaemia. Nature 475, 101–105 (2011).

36. Campbell, P. J. et al. The patterns and dynamics of genomic instability in metastatic pancreatic cancer. Nature 467, 1109–1113 (2010).

37. Peña-Llopis, S. et al. BAP1 loss defines a new class of renal cell carcinoma. Nature Genet. 44, 751–759 (2012).

38. Ng, C. K. et al. The role of tandem duplicator phenotype in tumour evolution in high-grade serous ovarian cancer. J. Pathol. 226, 703–712 (2012).

39. Stephens, P. J. et al. Complex landscapes of somatic rearrangement in human breast cancer genomes. Nature 462, 1005–1010 (2009).

40. Kloosterman, W. P. et al. Chromothripsis is a common mechanism driving genomic rearrangements in primary and metastatic colorectal cancer. Genome Biol. 12, R103 (2011).

41. McBride, D. J. et al. Tandem duplication of chromosomal segments is common in ovarian and breast cancer genomes. J. Pathol. 227, 446–455 (2012).

42. Campbell, P. J. et al. Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nature Genet. 40, 722–729 (2008). By investigating the paired-end sequencing reads that did not align to the reference genome as expected with respect to each other, the authors were able to demonstrate a high-throughput and high-resolution bioinformatics method to characterize structural variation.

43. Muzny, D. M. et al. Comprehensive molecular characterization of human colon and rectal cancer. Nature 487, 330–337 (2012). With 97 colorectal cancer genomes sequenced to low-to-moderate redundant coverage, this discovery cohort is the largest to date.

44. Korbel, J. O. et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 318, 420–426 (2007).

45. Onishi-Seebacher, M. & Korbel, J. O. Challenges in studying genomic structural variant formation mechanisms: the short-read dilemma and beyond. BioEssays 33, 840–850 (2011).

46. Simpson, J. T. et al. ABySS: a parallel assembler for short read sequence data. Genome Res. 19, 1117–1123 (2009).

47. Fullwood, M. J., Wei, C.-L., Liu, E. T. & Ruan, Y. Next-generation DNA sequencing of paired-end tags (PET) for transcriptome and genome analyses. Genome Res. 19, 521–532 (2009).

48. Medvedev, P., Stanciu, M. & Brudno, M. Computational methods for discovering structural variation with next-generation sequencing. Nature Methods 6, S13–S20 (2009).

49. Hillmer, A. M. et al. Comprehensive long-span paired-end-tag mapping reveals characteristic patterns of structural variations in epithelial cancer genomes. Genome Res. 21, 665–675 (2011).

50. Boeva, V. et al. Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data. Bioinformatics 28, 423–425 (2012).

51. Druker, B. J. et al. Five-year follow-up of patients receiving imatinib for chronic myeloid leukemia. N. Engl. J. Med. 355, 2408–2417 (2006).

52. Lee, E. et al. Landscape of somatic retrotransposition in human cancers. Science 337, 967–971 (2012).

53. Welch, J. S. Use of whole-genome sequencing to diagnose a cryptic fusion oncogene. JAMA 305, 1577 (2011).

54. Weiss, G. J. et al. Paired tumor and normal whole genome sequencing of metastatic olfactory neuroblastoma. PLoS ONE 7, e37029 (2012).

55. Tao, Y. et al. Rapid growth of a hepatocellular carcinoma and the driving mutations revealed by cell-population genetic analysis of whole-genome data. Proc. Natl Acad. Sci. USA 108, 12042–12047 (2011).

56. Bueno, R. et al. Second generation sequencing of the mesothelioma tumor genome. PLoS ONE 5, e10612 (2010).

57. Totoki, Y. et al. High-resolution characterization of a hepatocellular carcinoma genome. Nature Genet. 43, 464–469 (2011).

58. Demeure, M. J. et al. Cancer of the ampulla of Vater: analysis of the whole genome sequence exposes a potential therapeutic vulnerability. Genome Med. 4, 56 (2012).

59. Muller, F. L. et al. Passenger deletions generate therapeutic vulnerabilities in cancer. Nature 488, 337–342 (2012).

60. Shah, S. P. et al. The clonal and mutational evolution spectrum of primary triple-negative breast cancers. Nature 486, 395–399 (2012).

61. Nik-Zainal, S. et al. The life history of 21 breast cancers. Cell 149, 994–1007 (2012). This paper demonstrates the utility of characterizing the somatic mutational signature with the discovery of kataegis.

62. Morin, R. D. et al. Frequent mutation of histone-modifying genes in non-Hodgkin lymphoma. Nature 476, 298–303 (2011).

63. Wu, G. et al. Somatic histone H3 alterations in pediatric diffuse intrinsic pontine gliomas and non-brainstem glioblastomas. Nature Genet. 44, 251–253 (2012).

64. Wang, L. et al. SF3B1 and other novel cancer genes in chronic lymphocytic leukemia. N. Engl. J. Med. 365, 2497–2506 (2011).

65. Chapman, M. A. et al. Initial genome sequencing and analysis of multiple myeloma. Nature 471, 467–472 (2011).

66. Harbour, J. W. et al. Frequent mutation of BAP1 in metastasizing uveal melanomas. Science 330, 1410–1413 (2010). This study discovered a gene that was somatically mutated in an impressive number of metastasizing tumours using second-generation sequencing of exomes. This study highlights that there are novel and valuable candidate therapeutic targets that are yet to be discovered.

67. Yoshida, K. et al. Frequent pathway mutations of splicing machinery in myelodysplasia. Nature 478, 64–69 (2011).

68. Schwartzentruber, J. et al. Driver mutations in histone H3.3 and chromatin remodelling genes in paediatric glioblastoma. Nature 482, 226–231 (2012).

69. Pugh, T. J. et al. Medulloblastoma exome sequencing uncovers subtype-specific somatic mutations. Nature 488, 106–110 (2012).

70. Sathirapongsasuti, J. F. et al. Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV. Bioinformatics 27, 2648–2654 (2011).

71. Karakoc, E. et al. Detection of structural variants and indels within exome data. Nature Methods 9, 176–178 (2012).

FURTHER INFORMATION
1000 Genomes Project: http://www.1000genomes.org
COSMIC: Cancer Gene Census: http://cancer.sanger.ac.uk/cancergenome/projects/census
dbSNP — NCBI: http://www.ncbi.nlm.nih.gov/snp
E.6 Quality Standards of Samples — International Cancer Genome Consortium: http://icgc.org/icgc/goals-structure-policies-guidelines/e6-quality-standards-of-samples
E.7 Study Design and Statistical Issues — International Cancer Genome Consortium: http://www.icgc.org/icgc/goals-structure-policies-guidelines/e7-study-design-and-statistical-issues
E.8 Genome Analyses — International Cancer Genome Consortium: http://www.icgc.org/icgc/goals-structure-policies-guidelines/e8-genome-analyses
Genome MuSiC: http://gmt.genome.wustl.edu/genome-music/current
MutSig: http://www.broadinstitute.org/cancer/cga/mutsig
Nature Reviews Genetics Series on Applications of next-generation sequencing: http://www.nature.com/nrg/series/nextgeneration/index.html
Nature Reviews Genetics Series on Study designs: http://www.nature.com/nrg/series/studydesigns/index.html
WHO — International Classification of Diseases for Oncology, 3rd Edition (ICD-O-3): http://www.who.int/classifications/icd/adaptations/oncology/en

SUPPLEMENTARY INFORMATION
See online article: S1 (table) | S2 (table) | S3 (table)

72. Banerji, S. et al. Sequence analysis of mutations and translocations across breast cancer subtypes. Nature 486, 405–409 (2012).

73. Ruan, Y. et al. Fusion transcripts and transcribed retrotransposed loci discovered through comprehensive transcriptome analysis using paired-end ditags (PETs). Genome Res. 17, 828–838 (2007).

74. Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-seq. Nature Methods 5, 621–628 (2008).

75. Roberts, K. G. et al. Genetic alterations activating kinase and cytokine receptor signaling in high-risk acute lymphoblastic leukemia. Cancer Cell 22, 153–166 (2012).

76. Jones, D. T. W. et al. Dissecting the genomic complexity underlying medulloblastoma. Nature 488, 100–105 (2012).

77. Hammerman, P. S. et al. Comprehensive genomic characterization of squamous cell lung cancers. Nature 489, 519–525 (2012).

78. Morin, R. D. et al. Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells. Genome Res. 18, 610–621 (2008).

79. Wang, Z., Gerstein, M. & Snyder, M. RNA-seq: a revolutionary tool for transcriptomics. Nature Rev. Genet. 10, 57–63 (2009).

80. Dees, N. D. et al. MuSiC: identifying mutational significance in cancer genomes. Genome Res. 22, 1589–1598 (2012).

81. Vaske, C. J. et al. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics 26, i237–i245 (2010).

82. Hawkins, R. D., Hon, G. C. & Ren, B. Next-generation genomics: an integrative approach. Nature Rev. Genet. 11, 476–486 (2010).

83. Zhang, J. et al. The genetic basis of early T-cell precursor acute lymphoblastic leukaemia. Nature 481, 157–163 (2012).

84. Ju, Y. S. et al. A transforming KIF5B and RET gene fusion in lung adenocarcinoma revealed from whole-genome and transcriptome sequencing. Genome Res. 22, 436–445 (2011).

85. Esteller, M. Cancer epigenomics: DNA methylomes and histone-modification maps. Nature Rev. Genet. 8, 286–298 (2007).

86. Barski, A. et al. High-resolution profiling of histone methylations in the human genome. Cell 129, 823–837 (2007).

87. Lister, R. et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 462, 315–322 (2009).

88. Zhang, J. et al. A novel retinoblastoma therapy from genomic and epigenetic analyses. Nature 481, 329–334 (2012).

89. Cheung, N.-K. V. et al. Association of age at diagnosis and genetic mutations in patients with neuroblastoma. JAMA 307, 1062–1071 (2012).

90. Molenaar, J. J. et al. Sequencing of neuroblastoma identifies chromothripsis and defects in neuritogenesis genes. Nature 483, 589–593 (2012). This is one of the largest discovery cohorts to date. Researchers sequenced the genomes of 87 tumour–normal pairs to at least 30-fold redundant coverage.

91. Collins, C. C. et al. Next generation sequencing of prostate cancer from a patient identifies a deficiency of methylthioadenosine phosphorylase, an exploitable tumor target. Mol. Cancer Ther. 11, 775–783 (2012).

92. Hanahan, D. & Weinberg, R. A. Hallmarks of cancer: the next generation. Cell 144, 646–674 (2011).

93. Rausch, T. et al. Genome sequencing of pediatric medulloblastoma links catastrophic DNA rearrangements with TP53 mutations. Cell 148, 59–71 (2012).

94. Sung, W.-K. et al. Genome-wide survey of recurrent HBV integration in hepatocellular carcinoma. Nature Genet. 44, 765–769 (2012).

95. Klein, G. Lymphoma development in mice and humans: diversity of initiation is followed by convergent cytogenetic evolution. Proc. Natl Acad. Sci. USA 76, 2442–2446 (1979).

96. Castoe, T. A., De Koning, A. P. J. & Pollock, D. D. Adaptive molecular convergence: molecular evolution versus molecular phylogenetics. Commun. Integr. Biol. 3, 67–69 (2010).

97. Berger, M. F. et al. Melanoma genome sequencing reveals frequent PREX2 mutations. Nature 485, 502–506 (2012).

98. Kimchi-Sarfaty, C. et al. A ‘silent’ polymorphism in the MDR1 gene changes substrate specificity. Science 315, 525–528 (2007).

99. Pagani, F., Raponi, M. & Baralle, F. E. Synonymous mutations in CFTR exon 12 affect splicing and are not neutral in evolution. Proc. Natl Acad. Sci. USA 102, 6368–6372 (2005).

100. Ding, L. et al. Clonal evolution in relapsed acute myeloid leukaemia revealed by whole-genome sequencing. Nature 481, 506–510 (2012).

101. Wu, C. et al. Integrated genome and transcriptome sequencing identifies a novel form of hybrid and aggressive prostate cancer. J. Pathol. 227, 53–61 (2012).

102. Gerlinger, M. et al. Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. N. Engl. J. Med. 366, 883–892 (2012).

103. Walter, M. J. et al. Clonal architecture of secondary acute myeloid leukemia. N. Engl. J. Med. 366, 1090–1098 (2012).

104. Jones, S. et al. Comparative lesion sequencing provides insights into tumor evolution. Proc. Natl Acad. Sci. USA 105, 4283–4288 (2008).

105. Greenman, C., Wooster, R., Futreal, P. A., Stratton, M. R. & Easton, D. F. Statistical analysis of pathogenicity of somatic mutations in cancer. Genetics 173, 2187–2198 (2006).

Acknowledgements
J.C.M. thanks the Canadian Institutes of Health Research and the Michael Smith Foundation for Health Research for their support. M.A.M. is the University of British Columbia, Canada Research Chair in Genome Science.

Competing interests statement
The authors declare no competing financial interests.
