Source: genomics.broadinstitute.org/data-sheets/gCNV_AACR2018.pdf (2018-05-01)
Precise common and rare germline CNV calling with GATK gCNV
Mehrtash Babadi¹, Samuel K. Lee¹, Andrey Smirnov¹, Lee Lichtenstein¹, Laura D. Gauthier¹, Daniel P. Howrigan², Timothy Poterba¹
¹Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
²Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA
Introduction
We implement and evaluate a new tool, GATK gCNV, for the discovery of rare and common copy-number variations (CNVs). Given coverage data, the tool simultaneously models systematic sequencing noise and copy-number events. Furthermore, in contrast to many previous approaches to CNV analysis, we explicitly model regions of common and multiallelic copy-number variation. We demonstrate that GATK gCNV significantly outperforms XHMM and CODEX on WES data as well as Genome STRiP on WGS data.
Algorithms
GATK gCNV is based on a hierarchical graphical model that simultaneously infers the noise profile of the coverage data and makes predictions on the copy-number activity. In addition, we infer the locations of common and multiallelic regions, imposing different prior distributions on CNV activity at those regions.
[Figure: Graphical model for jointly determining multiallelic CNV sites, denoising, and calling CNV events. Legible labels include the region class τ_t ∈ {common, silent} with priors π_common and π_silent.]
Inference method
The inference algorithm for gCNV uses PyMC3 and Theano. We use Automatic Differentiation Variational Inference (ADVI) to enable automated and scalable inference, together with an annealing protocol for finding the optimal mean-field variational posterior.
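The idea of annealed mean-field variational inference can be sketched on a toy problem. The example below is not the gCNV implementation and does not use PyMC3 or Theano: it fits a Gaussian variational posterior to the mean of Gaussian data in plain Python, using reparameterized gradients. Treating the annealing protocol as a temperature that weights the entropy term and is lowered linearly to 1 is an assumption, as are all names (`advi_gaussian_mean`, `initial_temperature`, `n_pairs`).

```python
import math
import random

def advi_gaussian_mean(xs, prior_sd=10.0, steps=1000, lr=0.01,
                       n_pairs=25, initial_temperature=5.0, seed=0):
    """Fit q(mu) = N(m, s^2) for x_i ~ N(mu, 1), mu ~ N(0, prior_sd^2)."""
    rng = random.Random(seed)
    n, sum_x = len(xs), sum(xs)
    m, omega = 0.0, 0.0  # variational mean and log standard deviation

    def grad_log_joint(mu):
        # Derivative of log-likelihood plus log-prior with respect to mu.
        return (sum_x - n * mu) - mu / prior_sd ** 2

    anneal_steps = steps // 2
    for step in range(steps):
        # Assumed schedule: temperature falls linearly to 1, then stays there.
        frac = min(1.0, step / anneal_steps)
        temperature = initial_temperature + (1.0 - initial_temperature) * frac
        s = math.exp(omega)
        grad_m = grad_omega = 0.0
        for _ in range(n_pairs):
            eps = rng.gauss(0.0, 1.0)
            for e in (eps, -eps):  # antithetic pair lowers gradient variance
                g = grad_log_joint(m + s * e)  # reparameterization: mu = m + s*e
                grad_m += g
                grad_omega += g * e * s
        n_draws = 2 * n_pairs
        grad_m /= n_draws
        # The Gaussian entropy is omega + const, so its gradient is 1,
        # scaled here by the current temperature.
        grad_omega = grad_omega / n_draws + temperature
        m += lr * grad_m
        omega += lr * grad_omega
    return m, math.exp(omega)

data_rng = random.Random(1)
xs = [data_rng.gauss(2.0, 1.0) for _ in range(20)]
m, s = advi_gaussian_mean(xs)

# Exact conjugate posterior, for comparison with the variational fit.
precision = len(xs) + 1.0 / 10.0 ** 2
exact_mean, exact_sd = sum(xs) / precision, precision ** -0.5
print(m, s, exact_mean, exact_sd)
```

Because the exact posterior here is itself Gaussian, the variational fit should land close to it once the temperature reaches 1. A higher starting temperature only broadens the approximation early in the run.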
Features
― Best-in-class sensitivity and specificity for detecting rare and common germline CNV events
― Automatic karyotyping and CNV calling on sex chromosomes
― GC-bias correction and automatic discovery of latent-bias features
― Easy to run and cloud-ready pipelines
Summary and future work
― We implemented, optimized, and evaluated a fully Bayesian model for discovery of rare and common copy-number variation in WES and WGS data
― GATK gCNV is capable of calling events in areas of common copy-number variation with very high sensitivity compared to other tools
― Using Bayesian techniques for hyperparameter selection could further improve sensitivity
― Improving ground-truth resources could lead to more accurate performance measurements
― Expanding the model to call non-integer copy-number states would enable analysis of somatic data
References
[1] R. E. Handsaker et al., "Large multiallelic copy number variations in humans", Nature Genetics 47, 296-303 (2015).
[2] M. Fromer et al., "Discovery and statistical genotyping of copy-number variation from whole-exome sequencing depth", Am. Journal of Human Genetics 91(4) (2012).
[3] Y. Jiang et al., "CODEX: a normalization and copy number variation detection method for whole-exome sequencing", Nucl. Acids Res. 43(6) (2015).
[4] M. J. P. Chaisson et al., "Multi-platform discovery of haplotype-resolved structural variation in human genomes", bioRxiv 193144 (2017).
Optimization and performance on WES
Hyperparameter optimization
We perform a hyperparameter optimization search using the first 3 chromosomes of a GPC2 dataset (180 samples).
To optimize performance, we search over the following parameters:
• Prior probability of non-reference copy-number states
• Dosage of CNV-active loci
• Penalty for unexplained coverage variance
• Correlation length of copy-number events
The results of the hyperparameter optimization are shown in the "Hyperparameter optimization" figure.
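To make two of the searched hyperparameters concrete, the sketch below shows how a prior probability of non-reference states and a correlation length could enter a simple hidden Markov model over integer copy-number states. This is not the gCNV model. The Poisson emission, the exponential transition rule, and every name (`viterbi_copy_number`, `depth_per_copy`, `p_alt`, `correlation_length_bp`, `spacing_bp`) are assumptions made for illustration. The other two hyperparameters in the list are not modeled here.

```python
import math

def poisson_logpmf(k, lam):
    """Log probability of observing k reads when lam are expected."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def viterbi_copy_number(counts, depth_per_copy, p_alt=1e-3,
                        correlation_length_bp=1_000_000, spacing_bp=10_000,
                        max_cn=4, ref_cn=2):
    """Most likely copy-number path for one sample (toy model, not gCNV)."""
    states = list(range(max_cn + 1))
    # Prior: the reference state gets 1 - p_alt; the others share p_alt.
    prior = [1 - p_alt if c == ref_cn else p_alt / max_cn for c in states]
    # Assumed rule: the chance of keeping the same state decays with distance.
    stay = math.exp(-spacing_bp / correlation_length_bp)

    def log_emit(n, c):
        # A small floor keeps the expected count positive for copy number 0.
        return poisson_logpmf(n, depth_per_copy * max(c, 0.01))

    def log_trans(a, b):
        # Either keep the previous state or redraw the state from the prior.
        return math.log((1 - stay) * prior[b] + (stay if a == b else 0.0))

    score = [math.log(prior[c]) + log_emit(counts[0], c) for c in states]
    backpointers = []
    for n in counts[1:]:
        new_score, pointers = [], []
        for b in states:
            best_a = max(states, key=lambda a: score[a] + log_trans(a, b))
            new_score.append(score[best_a] + log_trans(best_a, b) + log_emit(n, b))
            pointers.append(best_a)
        score = new_score
        backpointers.append(pointers)

    # Trace the best path back from the highest-scoring final state.
    path = [max(states, key=lambda c: score[c])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

# Read counts for eight consecutive intervals; about 50 reads per copy.
counts = [100, 98, 103, 52, 49, 51, 101, 99]
path = viterbi_copy_number(counts, depth_per_copy=50)
print(path)  # [2, 2, 2, 1, 1, 1, 2, 2]
```

In this toy setting, a smaller `p_alt` or a longer `correlation_length_bp` makes it costlier to leave the reference state, which trades sensitivity for specificity. That trade-off is the kind of effect a hyperparameter search would explore.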
Trio validation
We perform an orthogonal validation by calculating transmission-rate statistics on heterozygous CNV deletion calls for a cohort of Taiwanese trios (102 samples). We show that the transmission rate is consistent with expected values.
[Figure: Transmission-rate concordance.]
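A transmission-rate statistic of this kind can be sketched as follows. The poster does not describe its exact procedure, so this is only one plausible version: it counts heterozygous deletions called in exactly one parent and asks how often the child carries them, which under Mendelian inheritance should be about half the time. The data layout (sets of site identifiers per family member) and the name `transmission_rate` are hypothetical.

```python
def transmission_rate(trios):
    """Fraction of single-parent heterozygous deletions seen in the child.

    Each trio is (father_sites, mother_sites, child_sites), where every
    element is a set of identifiers for heterozygous deletion calls.
    Returns (rate, number_of_informative_sites).
    """
    transmitted = informative = 0
    for father, mother, child in trios:
        # Keep sites called in exactly one parent, so the Mendelian
        # expectation for a heterozygous deletion is 0.5.
        for site in father ^ mother:
            informative += 1
            transmitted += site in child
    if informative == 0:
        return float("nan"), 0
    return transmitted / informative, informative

trios = [
    ({"del1", "del2"}, {"del3"}, {"del1", "del3"}),
    ({"del4"}, {"del4", "del5"}, {"del4"}),
]
rate, n_sites = transmission_rate(trios)
print(rate, n_sites)  # 0.5 4
```

A rate well below 0.5 would point to false positives in the parents or missed calls in the children. A rate well above it would suggest the reverse.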
[Figure: Hyperparameter optimization. ROC plot of True Positive Rate vs. False Positive Rate for GATK gCNV (optimization), XHMM, and CODEX. Black curves represent GATK gCNV runs using different hyperparameter values.]
[Figure: Performance results for GPC2 cohort. ROC curve (left; True Positive Rate vs. False Positive Rate), sensitivity vs. specificity (middle), sensitivity vs. variant frequency (right). Legend in each panel: GATK gCNV, CODEX, XHMM.]
GATK gCNV outperforms XHMM and CODEX
For our main evaluation, we compare the performance of GATK gCNV on WES data against that of two other coverage-based tools, CODEX [3] and XHMM [2]. As ground truth, we use a callset produced by Genome STRiP [1] on a set of matched WGS samples. We compare sensitivity and specificity on the entire callset, as well as sensitivity as a function of variant frequency. In these comparisons, GATK gCNV is the most sensitive and accurate of the three tools.
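The metrics in this comparison can be illustrated with a small sketch. It assumes that calls and ground truth are both reduced to sets of (sample, site) pairs over a fixed universe of evaluated pairs. The poster does not state how calls were matched to the Genome STRiP callset, so the representation and the names `roc_point` and `sensitivity_by_frequency` are hypothetical.

```python
from collections import defaultdict

def roc_point(calls, truth, universe):
    """Return (true positive rate, false positive rate) for one callset."""
    tp = len(calls & truth)
    fp = len(calls - truth)
    fn = len(truth - calls)
    tn = len(universe) - tp - fp - fn
    return tp / (tp + fn), fp / (fp + tn)

def sensitivity_by_frequency(calls, truth, n_samples):
    """Sensitivity grouped by the fraction of samples carrying each site."""
    carriers = defaultdict(set)
    for sample, site in truth:
        carriers[site].add(sample)
    found, total = defaultdict(int), defaultdict(int)
    for site, samples in carriers.items():
        freq = len(samples) / n_samples
        for sample in samples:
            total[freq] += 1
            found[freq] += (sample, site) in calls
    return {freq: found[freq] / total[freq] for freq in sorted(total)}

samples = ["s1", "s2", "s3", "s4"]
sites = ["A", "B", "C"]
universe = {(s, site) for s in samples for site in sites}
truth = {("s1", "A"), ("s2", "A"), ("s3", "A"), ("s4", "A"), ("s1", "B")}
calls = {("s1", "A"), ("s2", "A"), ("s3", "A"), ("s1", "C")}

tpr, fpr = roc_point(calls, truth, universe)
by_freq = sensitivity_by_frequency(calls, truth, n_samples=len(samples))
print(tpr, fpr, by_freq)
```

In this made-up example the common site A (carried by every sample) is recovered in three of four carriers, while the rare site B is missed. Sweeping a quality threshold on the calls and recomputing `roc_point` at each value would trace out a ROC curve.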
To test these results on a different dataset, we run all three tools on a TCGA cohort (168 samples), again using Genome STRiP calls on matched WGS samples as ground truth. The figure below shows results similar to those for the GPC2 cohort.
[Figure: Performance results for TCGA cohort. ROC curve (left; True Positive Rate vs. False Positive Rate), sensitivity vs. specificity (middle), sensitivity vs. variant frequency (right). Legend in each panel: GATK gCNV, CODEX, XHMM.]
WGS performance and trio validation
WGS performance
GATK gCNV compares favorably to Genome STRiP in sensitivity to CNVs that were detected with Illumina short-read coverage (9 samples) and validated with PacBio long reads [4]. The sensitivity of call sets generated by breakpoint-based methods or jumping libraries is also shown for comparison.
[Figure: Comparison of GATK gCNV and Genome STRiP sensitivity on WGS data to that of orthogonal methods. Sensitivity vs. Minimum Event Length (bp), 0 to 10000 bp. Legend: GATK gCNV, Genome STRiP, LUMPY, manta, liWGS, DELLY.]
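Sensitivity as a function of minimum event length can be computed along the following lines. This is a sketch under stated assumptions: events are half-open (start, end) intervals on a single chromosome, and a validated event counts as detected when some call overlaps it reciprocally by at least 50%. That matching rule and the names `reciprocal_overlap` and `sensitivity_at_min_length` are not taken from the poster.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions between intervals a and b."""
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))

def sensitivity_at_min_length(truth, calls, min_length, min_overlap=0.5):
    """Fraction of truth events of at least min_length bp matched by a call."""
    kept = [t for t in truth if t[1] - t[0] >= min_length]
    if not kept:
        return float("nan")
    hits = sum(
        any(reciprocal_overlap(t, c) >= min_overlap for c in calls)
        for t in kept
    )
    return hits / len(kept)

# Made-up validated events and calls, as (start, end) in bp.
truth = [(1000, 1500), (5000, 9000), (20000, 32000)]
calls = [(5200, 9100), (20000, 30000)]

for min_length in (0, 2000):
    print(min_length, sensitivity_at_min_length(truth, calls, min_length))
```

Here the 500 bp event is missed, so sensitivity rises from 2/3 to 1.0 once events shorter than 2000 bp are excluded. This mirrors the way short events tend to be harder for coverage-based callers.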