Precise common and rare germline CNV calling with GATK...


Source: genomics.broadinstitute.org/data-sheets/gCNV_AACR2018.pdf · 2018. 5. 1.

Precise common and rare germline CNV calling with GATK gCNV

Mehrtash Babadi¹, Samuel K. Lee¹, Andrey Smirnov¹, Lee Lichtenstein¹, Laura D. Gauthier¹, Daniel P. Howrigan², Timothy Poterba¹

¹ Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
² Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA

Introduction

We implement and evaluate a new tool, GATK gCNV, for the discovery of rare and common copy-number variations (CNVs). Given coverage data, the tool simultaneously models systematic sequencing noise and copy-number events. Furthermore, in contrast to many previous approaches to CNV analysis, we explicitly model regions of common and multiallelic copy-number variation. We demonstrate that GATK gCNV significantly outperforms XHMM and CODEX on WES data as well as Genome STRiP on WGS data.

Algorithms

GATK gCNV is based on a hierarchical graphical model that simultaneously infers the noise profile of the coverage data and makes predictions on the copy-number activity. In addition, we infer the locations of common and multiallelic regions, imposing different prior distributions on CNV activity at those regions.

[Figure: graphical-model diagram in plate notation, with a plate over samples. Legible labels include n_st, μ_st, d_s, m_t, c_1 to c_4, n_1 to n_4, τ_t ∈ {common, silent}, π_common, π_silent, and p_alt.]

Graphical model for jointly determining multiallelic CNV sites, denoising, and calling CNV events.
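As a rough illustration of the idea of modeling coverage noise and copy number together, the sketch below simulates read counts whose Poisson rate is set jointly by a per-sample depth, a per-interval bias, a low-rank sample-by-interval noise term, and an integer copy number. The functional form, the names (`depth`, `interval_bias`, `W`, `z`), and every numeric value are assumptions made for illustration; this is a toy model, not the gCNV model specification.

```python
import numpy as np

def simulate_counts(n_samples=20, n_intervals=200, n_factors=2, seed=0):
    """Simulate read counts from a toy coverage model (illustrative assumptions only)."""
    rng = np.random.default_rng(seed)
    # Per-sample depth: expected reads per interval per copy.
    depth = rng.uniform(40.0, 60.0, size=n_samples)
    # Per-interval log-bias shared by all samples.
    interval_bias = rng.normal(0.0, 0.3, size=n_intervals)
    # Low-rank systematic noise: interval loadings W and per-sample factors z.
    W = rng.normal(0.0, 0.1, size=(n_intervals, n_factors))
    z = rng.normal(0.0, 1.0, size=(n_samples, n_factors))
    # Diploid baseline with one heterozygous deletion in sample 0.
    copy_number = np.full((n_samples, n_intervals), 2)
    copy_number[0, 50:60] = 1
    log_bias = interval_bias[None, :] + z @ W.T
    rate = depth[:, None] * copy_number * np.exp(log_bias)
    counts = rng.poisson(rate)
    return counts, copy_number, depth, log_bias

counts, copy_number, depth, log_bias = simulate_counts()
# If the bias were known exactly, dividing it out gives a noisy copy-number estimate.
cn_estimate = counts / (depth[:, None] * np.exp(log_bias))
```

In this toy setting the bias terms are known by construction; the point of a joint model is that in real data they are not, so they have to be inferred together with the copy-number states.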

Inference method

The inference algorithm for gCNV is implemented with PyMC3 and Theano. We use Automatic Differentiation Variational Inference (ADVI) for automated, scalable inference, together with an annealing protocol for finding the optimal mean-field variational posterior.
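To make the combination of mean-field variational inference and annealing concrete, here is a minimal NumPy-only sketch on a toy model (a Gaussian mean with a Gaussian prior). It does not use PyMC3 or Theano, and the annealing form shown here, a temperature that multiplies the entropy term and decays linearly to 1, is an assumption chosen for illustration rather than a description of the gCNV protocol. Gradients are written out by hand for this toy model; an ADVI implementation would obtain them by automatic differentiation.

```python
import numpy as np

def fit_mean_field_advi(x, prior_sd=10.0, n_steps=2000, n_anneal=1000,
                        t_start=2.0, n_mc=32, seed=0):
    """Fit q(theta) = Normal(m, sd^2) to the posterior of a toy model:
    x_i ~ Normal(theta, 1), theta ~ Normal(0, prior_sd^2)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # Exact posterior precision of the toy model, used only to pick a stable step size.
    precision = x.size + 1.0 / prior_sd**2
    lr = 0.5 / precision
    m, rho = 0.0, 0.0  # variational mean and log standard deviation
    for step in range(n_steps):
        # Temperature falls linearly from t_start to 1, then stays at 1.
        temp = max(1.0, t_start - (t_start - 1.0) * step / n_anneal)
        eps = rng.standard_normal(n_mc)
        sd = np.exp(rho)
        theta = m + sd * eps  # reparameterized samples from q
        # Derivative of the log joint density with respect to theta, per sample.
        dlogp = x.sum() - x.size * theta - theta / prior_sd**2
        grad_m = dlogp.mean()
        # Chain rule through theta = m + exp(rho) * eps; the entropy of q is
        # rho + const, so its tempered gradient with respect to rho is temp.
        grad_rho = (dlogp * eps * sd).mean() + temp
        m += lr * grad_m
        rho += lr * grad_rho
    return m, np.exp(rho)

data = np.random.default_rng(1).normal(1.5, 1.0, size=50)
m, sd = fit_mean_field_advi(data)
# Closed-form posterior for comparison (available only because the model is conjugate).
exact_precision = data.size + 1.0 / 10.0**2
exact_mean = data.sum() / exact_precision
exact_sd = exact_precision ** -0.5
```

Because this toy posterior is itself Gaussian, the fitted `m` and `sd` can be checked against `exact_mean` and `exact_sd`; for a model like gCNV no such closed form exists, which is why a variational method is used.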

Features

― Best-in-class sensitivity and specificity for detecting rare and common germline CNV events

― Automatic karyotyping and CNV calling on sex chromosomes

― GC-bias correction and automatic discovery of latent-bias features

― Easy to run and cloud-ready pipelines

Summary and future work

― We implemented, optimized, and evaluated a fully Bayesian model for discovery of rare and common copy-number variation in WES and WGS data

― GATK gCNV is capable of calling events in areas of common copy-number variation with very high sensitivity compared to other tools

― Using Bayesian techniques for hyperparameter selection could further improve sensitivity

― Improving ground-truth resources could lead to more accurate performance measurements

― Expanding the model to call non-integer copy-number states would enable analysis of somatic data

References

[1] R.E. Handsaker et al., "Large multiallelic copy number variations in humans", Nature Genetics 47, 296-303 (2015)

[2] M. Fromer et al., "Discovery and statistical genotyping of copy-number variation from whole-exome sequencing depth", Am. Journal of Human Genetics 91(4) (2012)

[3] Y. Jiang et al., "CODEX: a normalization and copy number variation detection method for whole-exome sequencing", Nucl. Acids Res. 43(6) (2015)

[4] M.J.P. Chaisson et al., "Multi-platform discovery of haplotype-resolved structural variation in human genomes", bioRxiv 193144 (2017)

Optimization and performance on WES

Hyperparameter optimization

We perform a hyperparameter search using the first 3 chromosomes of a GPC2 dataset (180 samples). To optimize performance, we search over the following parameters:

• Prior probability of non-reference copy-number states

• Dosage of CNV-active loci

• Penalty for unexplained coverage variance

• Correlation length of copy-number events

The result of the hyperparameter optimization is shown in the accompanying ROC plot.
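A search of this kind can be sketched as a plain grid search that scores every parameter combination and keeps the most sensitive one under a false-positive cap. The poster does not state the search strategy or the selection rule, so both are assumptions here; the parameter names in `example_grid` are hypothetical labels, not tool options, and `evaluate` stands for a user-supplied function that runs the caller and returns a (true positive rate, false positive rate) pair.

```python
from itertools import product

def grid_search(evaluate, grid, max_fpr):
    """Return (params, tpr, fpr) with the highest TPR among settings whose
    FPR does not exceed max_fpr, or None if no setting qualifies."""
    names = list(grid)
    best = None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        tpr, fpr = evaluate(params)  # user-supplied: run the caller, score against truth
        if fpr <= max_fpr and (best is None or tpr > best[1]):
            best = (params, tpr, fpr)
    return best

# Hypothetical names and values, for illustration only.
example_grid = {
    "prior_prob_alt": [1e-6, 1e-4, 1e-2],
    "coverage_variance_penalty": [0.1, 1.0, 10.0],
    "cnv_correlation_length": [1e3, 1e4, 1e5],
}
```

Each evaluation is independent of the others, so in practice the loop body is the natural unit to distribute across machines.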

Trio validation

As an orthogonal validation, we calculate transmission-rate statistics on heterozygous CNV deletion calls for a cohort of Taiwanese trios (102 samples). The observed transmission rate is consistent with the expected value, which is 50% for a heterozygous variant under Mendelian inheritance.

[Figure: Transmission-rate concordance.]
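One way such a transmission rate could be computed is sketched below. The data layout (sample IDs, sets of site IDs) and the choice to count only sites called heterozygous-deleted in exactly one parent are assumptions for illustration; the poster does not describe the exact procedure.

```python
def transmission_rate(trios, het_deletions):
    """Fraction of informative parental deletions that the child also carries.

    trios: list of (child, father, mother) sample IDs.
    het_deletions: dict mapping sample ID -> set of site IDs called as
        heterozygous deletions in that sample.
    A site is informative for a trio if exactly one parent carries it.
    """
    transmitted = 0
    informative = 0
    for child, father, mother in trios:
        father_sites = het_deletions.get(father, set())
        mother_sites = het_deletions.get(mother, set())
        child_sites = het_deletions.get(child, set())
        # Symmetric difference: sites present in exactly one parent.
        for site in father_sites ^ mother_sites:
            informative += 1
            transmitted += site in child_sites
    return transmitted / informative if informative else float("nan")
```

With accurate calls the result should be close to 0.5. False-positive calls in parents or missed calls in children would pull the observed rate below that value, which is what makes the statistic useful as a check that does not rely on a ground-truth callset.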

[Figure: ROC plot of true positive rate against false positive rate for GATK gCNV (optimization), XHMM, and CODEX.]

Hyperparameter optimization. Black curves represent GATK gCNV runs using different hyperparameter values.

[Figure: three panels comparing GATK gCNV, CODEX, and XHMM.]

Performance results for GPC2 cohort. ROC curve (left), sensitivity vs. specificity (middle), sensitivity vs. variant frequency (right).

GATK gCNV outperforms XHMM and CODEX

For our main evaluation, we compare the performance of GATK gCNV on WES data against that of two other coverage-based tools, CODEX [3] and XHMM [2]. As ground truth, we use a callset produced by Genome STRiP [1] on a set of matched WGS samples. We compare sensitivity and specificity on the entire callset, as well as sensitivity as a function of variant frequency. GATK gCNV is the most sensitive and accurate of the three tools.
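The metrics named above can be computed along the following lines. This is a sketch under stated assumptions: calls and truth are represented as boolean sample-by-site matrices (True meaning a non-reference copy number), and variant frequency is taken as the fraction of samples carrying a variant at a site in the truth set. The poster does not specify how calls were matched to the truth callset.

```python
import numpy as np

def tpr_fpr(calls, truth):
    """True and false positive rates over all (sample, site) pairs."""
    calls = np.asarray(calls, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tpr = np.sum(calls & truth) / truth.sum()
    fpr = np.sum(calls & ~truth) / (~truth).sum()
    return tpr, fpr

def sensitivity_by_frequency(calls, truth, bin_edges):
    """Sensitivity within bins (lo, hi] of per-site variant frequency."""
    calls = np.asarray(calls, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    freq = truth.mean(axis=0)  # fraction of samples with a variant at each site
    sensitivities = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (freq > lo) & (freq <= hi)
        n_true = truth[:, in_bin].sum()
        n_hit = np.sum(calls[:, in_bin] & truth[:, in_bin])
        sensitivities.append(n_hit / n_true if n_true else np.nan)
    return sensitivities
```

Sweeping a quality threshold on the calls and recomputing `tpr_fpr` at each value traces out an ROC curve; binning by frequency separates performance on rare events from performance on common ones.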

To test whether these results hold on a different dataset, we run all three tools on a TCGA cohort (168 samples), again using Genome STRiP calls on matched WGS samples as ground truth. The results are similar to those for the GPC2 cohort.

[Figure: three panels comparing GATK gCNV, CODEX, and XHMM.]

Performance results for TCGA cohort. ROC curve (left), sensitivity vs. specificity (middle), sensitivity vs. variant frequency (right).

WGS performance

GATK gCNV compares favorably to Genome STRiP in sensitivity to CNVs that were detected with Illumina short-read coverage (9 samples) and validated with PacBio long reads [4]. The sensitivity of call sets generated by breakpoint-based methods or jumping libraries is also shown for comparison.

[Figure: sensitivity vs. minimum event length (bp, 0 to 10000) for GATK gCNV, Genome STRiP, LUMPY, Manta, liWGS, and DELLY.]

Comparison of GATK gCNV and Genome STRiP sensitivity on WGS data to that of orthogonal methods.
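A curve of sensitivity against minimum event length can be computed as sketched below. The event representation (ID, start, half-open end in bp) is an assumption, and so is the premise that each validated event has already been marked as detected or missed; the poster does not give the matching criterion, so it is left outside the function.

```python
def sensitivity_by_min_length(truth_events, detected_ids, thresholds):
    """Sensitivity among truth events at least as long as each threshold.

    truth_events: list of (event_id, start, end) tuples, end exclusive, in bp.
    detected_ids: set of event IDs the caller recovered (matching rule is up
        to the user, e.g. a reciprocal-overlap cutoff).
    thresholds: iterable of minimum event lengths in bp.
    """
    result = {}
    for min_len in thresholds:
        long_enough = [eid for eid, start, end in truth_events
                       if end - start >= min_len]
        if long_enough:
            hits = sum(eid in detected_ids for eid in long_enough)
            result[min_len] = hits / len(long_enough)
        else:
            result[min_len] = float("nan")  # no truth events of this size
    return result
```

Coverage-based callers generally gain power as events span more bins, so sensitivity is expected to rise with the minimum length, while breakpoint-based methods depend less on event size.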