Performance benchmarking for calling and phasing of single ...

1
Performance benchmarking for calling and phasing of single-nucleotide polymorphisms and structural variants The length, accuracy and low bias of nanopore reads makes them ideally suited to the characterisation and phasing of structural variants and single-nucleotide polymorphisms across the entire genome Fig. 1 Calling structural variants (SV) with nanopore reads a) bioinformatics workflow b) and c) precision and recall d) calling different SV classes e) calling very long insertions and deletions Our newest version of the SV analysis pipeline benefits from LRA’s improved performance in aligning long reads in areas of structural variation and repeats. Furthermore, cuteSV improves the exact identification of break points, alternative allele sequences and variant genotypes compared to older tools. To validate this pipeline we called SVs in our publicly available 50x HG002 dataset and compared to the most recent GIAB truth-set. We found high overall performance with >95% precision and 97.5% recall (Figs. 1b and 1c). The size distribution of the detected SVs clearly shows the characteristic ALU peak around 300 bp and LINE1 peak above 5,000 bp that is expected in human samples. Low-complexity short tandem repeats are known to be under-represented in the current human reference genome and this is seen as a higher number of insertions than deletions in our data. We found that high precision is independent of read depth. Recall is more directly influenced by read depth with 30x being required for identifying >95% of all variants. However, even with as little as 15x >90% are detected. Performance is largely independent of SV size (Fig. 1d); even very large SVs are called with base-pair precision (Fig. 1e and f). These examples show SV-calling combined with read phasing: Fig. 1e shows a homozygous 40 kb insertion which has been captured in single reads from each haplotype and Fig. 1f shows a 30 kb heterozygous deletion that covers LCE3B and LCE3C. A deletion of those two genes is known to be a susceptibility factor for psoriasis. The read lengths enabled by nanopore sequencing make it feasible to call long structural variants with high sensitivity and specificity, even in difficult regions of the human genome Fig. 2 SNP-calling a) recall and precision using publicly available datasets b)-d) phasing e) phasing genes in the major incompatibility complex To test SNP-calling performance we used DeepVariant on the publicly available FDA Precision Challenge HG002 dataset. Results show high precision (>99.7%) and recall (>99.5%) for substitutions across the whole genome, including highly repetitive regions, segmental duplications, and the MHC region (Fig. 2a). The length of nanopore reads can be leveraged to assign SNP, SVs, and methylation calls to their homologous chromosome of origin. To illustrate phasing on a typical dataset, we used whatshap on an internal 40x HG002 dataset with a read N50 of 23 kb. By using information from overlapping long reads the resulting phase blocks can span several tens of megabases of the genome, typically only interrupted by long stretches of homozygosity (Fig. 2b). Phase-block lengths are influenced by read length. For 10-15 kb reads we observe a phase-block length of between 100-200 kb, sufficient to fully phase roughly 60% of all gene bodies. Increasing read length to >30 kb causes median phase-block length to increase tenfold, exceeding 1.5 Mb, allowing full phasing of >90% of all genes in the human genome. A high percentage of phased reads allows phasing of small variants, SVs and methylation calls. We used whatshap haplotag to scan raw reads for the presence of phased variants. With a read N50 of 23 kb we were able to assign >85% of all sequenced nucleotides to a haplotype (Fig. 2c). We found that >98% of the entire human genome (excluding centromeres and telomeres) is covered by phased reads (Fig. 2d). As a demonstration of phasing performance, we choose the MHC region, where we were able to cover 99% of the 6 Mb locus with 5 phase blocks and fully phase >99% of all genes (Fig. 2e). The combination of long reads with highly sensitive and specific calling of single-nucleotide polymorphisms simplifies phasing and results in exceptionally long phase-blocks Contact: [email protected] More information at: www.nanoporetech.com and publications.nanoporetech.com e) b) LCE3D LCE3C LCE3B 41,147 [0 - 58] 54,123,200 bp 54,123,300 bp p13 p12.2 p11.23 p11.21 q11.23 q13.13 q13.33 Chr20 p12.3 p12.1 q13.2 q13.12 q12 283 bp 41,268 41,005 41,099 41,344 40,943 41,071 41,441 41,044 41,344 41,096 40,595 Depth 32,192 32,192 32,192 152,560 kb 152,570 kb 152,580 kb 152,590 kb 32,192 32,192 32,192 32,192 32,192 32,192 32,192 32,192 32,192 43 kb Chr 1 q12 q32.1 p31.1 p31.2 q41 q43 p34.3 Depth Haplotype 1 reads Haplotype 2 reads Haplotype 1 reads Haplotype 2 reads a) 220 min 15 min FASTQ (q-score filtered) LRA cuteSV (genotyping) Filtering Visualization (IGV & Ribbon) Evaluation (truvari v2.0 & GIAB) c) 40% 60% 80% 100% 0 5 10 15 20 25 30 35 40 Read depth Detecting and characterising structural variation Precision Recall Precision Recall Metric -3000 -1000-750 -500 -250 0 250 500 750 10003000 Size [bp] SV count (bars) 0% 20% 40% 60% 80% 100% 95.47 97.53 Overall 0 400 800 1200 1600 Split by type and size 0% 20% 40% 60% 80% 100% Precision / Recall (lines) Deletions Insertions d) f) a) SNP 90 94 98 92 96 FDA Precision Challenge data, HG002 c) e) ONT whole genome (2.5 Gb, 55x) ONT MHC region (6 Mb Gb, 55x) F1 Recall Precision 100 SNP 90 94 98 92 96 100 b) Chr 6 Depth [0-70] Depth HLA-A 29,941,000 bp 29,943,000 bp 29,945,000 bp 29,947,000 bp ~7.4 kb 33,000 kb 4,962 kb 32,000 kb 31,000 kb 30,000 kb 29,000 kb p22.3 p12.3 q12 q14.1 q16.1 q21 q22.31 Haplotype 1 reads Haplotype 2 reads MHC locus [0-79] LINC00533 HCG15 OR14J1 LINC01015 IFITM4P HCG9 HCG17 HLA-E PPP1R18 SFTA2 C6orf15 HLA-B NFKBIL1 DDAH2 C2 C4A PPT2 TSBP1-AS1 HLA-DRB6 HLA-DMA COL11A2 KIFC1 ~3.5 kb Phase-block length N50 for different read lengths (HG002) 0 1,000 2,000 500 1,500 2,500 3,000 Expected phasing block length (kbp) Method of shearing Control (25 kb) g-Tube (11 kb) SRE+MRIII (17 kb) SRE+MRIII (26 kb) SRE+MRIII (32 kb) SRE+MRIII (36 kb) Sequenced base pairs (%) 0 40 80 20 60 Overall 100 84.3 15.7 Unphased 0 4 8 2 6 Split by read length 10 0 2500 5000 7500 10000 12500 15000 17500 20000 22500 25000 27500 30000 32500 35000 37500 40000 42500 45000 d) Haplotypes covered with >5 phased reads 1 3 5 2 4 6 7 9 11 8 10 12 13 15 19 14 16 20 17 21 18 22 X M Y Chr Unphased Reference gaps Haplotype 1 Haplotype 2 Phased Oxford Nanopore Technologies and the Wheel icon are registered trademarks of Oxford Nanopore Technologies in various countries. All other brands are the property of their respective owners. © 2020 Oxford Nanopore Technologies. All rights reserved. Oxford Nanopore Technologies’ products are for research use only. P20005 - Version 1

Transcript of Performance benchmarking for calling and phasing of single ...

Page 1: Performance benchmarking for calling and phasing of single ...

Performance benchmarking for calling and phasing of single-nucleotide polymorphisms and structural variantsThe length, accuracy and low bias of nanopore reads makes them ideally suited to the characterisation and phasing of structural variants and single-nucleotide polymorphisms across the entire genome

Fig. 1 Calling structural variants (SV) with nanopore reads a) bioinformatics workflow b) and c) precision and recall d) calling different SV classes e) calling very long insertions and deletions

Our newest version of the SV analysis pipeline benefits from LRA’s improved performance in aligning long reads in areas of structural variation and repeats. Furthermore, cuteSV improves the exact identification of break points, alternative allele sequences and variant genotypes compared to older tools. To validate this pipeline we called SVs in our publicly available 50x HG002 dataset and compared to the most recent GIAB truth-set. We found high overall performance with >95% precision and 97.5% recall (Figs. 1b and 1c). The size distribution of the detected SVs clearly shows the characteristic ALU peak around 300 bp and LINE1 peak above 5,000 bp that is expected in human samples. Low-complexity short tandem repeats are known to be under-represented in the current human reference genome and this is seen as a higher number of insertions than deletions in our data. We found that high precision is independent of read depth. Recall is more directly influenced by read depth with 30x being required for identifying >95% of all variants. However, even with as little as 15x >90% are detected. Performance is largely independent of SV size (Fig. 1d); even very large SVs are called with base-pair precision (Fig. 1e and f). These examples show SV-calling combined with read phasing: Fig. 1e shows a homozygous 40 kb insertion which has been captured in single reads from each haplotype and Fig. 1f shows a 30 kb heterozygous deletion that covers LCE3B and LCE3C. A deletion of those two genes is known to be a susceptibility factor for psoriasis.

The read lengths enabled by nanopore sequencing make it feasible to call long structural variants with high sensitivity and specificity, even in difficult regions of the human genome

Fig. 2 SNP-calling a) recall and precision using publicly available datasets b)-d) phasing e) phasing genes in the major incompatibility complex

To test SNP-calling performance we used DeepVariant on the publicly available FDA Precision Challenge HG002 dataset. Results show high precision (>99.7%) and recall (>99.5%) for substitutions across the whole genome, including highly repetitive regions, segmental duplications, and the MHC region (Fig. 2a). The length of nanopore reads can be leveraged to assign SNP, SVs, and methylation calls to their homologous chromosome of origin. To illustrate phasing on a typical dataset, we used whatshap on an internal 40x HG002 dataset with a read N50 of 23 kb. By using information from overlapping long reads the resulting phase blocks can span several tens of megabases of the genome, typically only interrupted by long stretches of homozygosity (Fig. 2b). Phase-block lengths are influenced by read length. For 10-15 kb reads we observe a phase-block length of between 100-200 kb, sufficient to fully phase roughly 60% of all gene bodies. Increasing read length to >30 kb causes median phase-block length to increase tenfold, exceeding 1.5 Mb, allowing full phasing of >90% of all genes in the human genome. A high percentage of phased reads allows phasing of small variants, SVs and methylation calls. We used whatshap haplotag to scan raw reads for the presence of phased variants. With a read N50 of 23 kb we were able to assign >85% of all sequenced nucleotides to a haplotype (Fig. 2c). We found that >98% of the entire human genome (excluding centromeres and telomeres) is covered by phased reads (Fig. 2d). As a demonstration of phasing performance, we choose the MHC region, where we were able to cover 99% of the 6 Mb locus with 5 phase blocks and fully phase >99% of all genes (Fig. 2e).

The combination of long reads with highly sensitive and specific calling of single-nucleotide polymorphisms simplifies phasing and results in exceptionally long phase-blocks

Contact: [email protected] More information at: www.nanoporetech.com and publications.nanoporetech.com

e)

b)

LCE3D LCE3C LCE3B

41,147

[0 - 58]

54,123,200 bp 54,123,300 bp

p13 p12.2 p11.23 p11.21 q11.23 q13.13 q13.33Chr20 p12.3 p12.1 q13.2q13.12q12

283 bp

41,26841,00541,09941,34440,94341,071

41,44141,04441,34441,096

40,595

Depth

32,19232,19232,192

152,560 kb 152,570 kb 152,580 kb 152,590 kb

32,19232,19232,192

32,19232,19232,192

32,19232,19232,192

43 kb

Chr 1 q12 q32.1p31.1p31.2 q41 q43p34.3

Depth

Haplotype 1reads

Haplotype 2reads

Haplotype 1reads

Haplotype 2reads

a)

220 min 15 min

FASTQ(q-score filtered)

LRA cuteSV(genotyping)

Filtering

Visualization(IGV & Ribbon)

Evaluation(truvari v2.0 & GIAB)

c)

40%

60%

80%

100%

0 5 10 15 20 25 30 35 40

Read depth

Detecting and characterising structural variation

Precision

Recall

Precision Recall

Metric

-3000 -1000-750 -500 -250 0 250 500 750 10003000

Size [bp]

SV

cou

nt (b

ars)

0%

20%

40%

60%

80%

100%95.47 97.53

Overall

0

400

800

1200

1600

Split by type and size

0%

20%

40%

60%

80%

100%

Pre

cisi

on /

Rec

all (

lines

)

Deletions Insertions

d)

f)

a)

SN

P

90

94

98

92

96

FDA Precision Challenge data, HG002 c) e)ONT whole genome

(2.5 Gb, 55x)ONT MHC region

(6 Mb Gb, 55x)

F1

RecallPrecision

100

SN

P

90

94

98

92

96

100

b)

Chr 6

Depth

[0-70]

Depth

HLA-A

29,941,000 bp 29,943,000 bp 29,945,000 bp 29,947,000 bp

~7.4 kb

33,000 kb

4,962 kb32,000 kb31,000 kb30,000 kb29,000 kb

p22.3 p12.3 q12 q14.1 q16.1 q21 q22.31

Haplotype 1reads

Haplotype 2reads

MHClocus

[0-79]

LINC00533 HCG15 OR14J1 LINC01015 IFITM4P HCG9 HCG17 HLA-E PPP1R18 SFTA2 C6orf15 HLA-B NFKBIL1 DDAH2 C2 C4A PPT2 TSBP1-AS1 HLA-DRB6 HLA-DMA COL11A2 KIFC1

~3.5 kb

Phase-block length N50 for different read lengths (HG002)

0

1,000

2,000

500

1,500

2,500

3,000

Exp

ecte

d ph

asin

g bl

ock

leng

th (k

bp)

Method of shearing

Control(25 kb)

g-Tube(11 kb)

SRE+MRIII(17 kb)

SRE+MRIII(26 kb)

SRE+MRIII(32 kb)

SRE+MRIII(36 kb)

Seq

uenc

ed b

ase

pairs

(%)

0

40

80

20

60

Overall100

84.3

15.7

Unpha

sed

0

4

8

2

6

Split by read length10

02500

50007500

1000012500

1500017500

2000022500

2500027500

3000032500

3500037500

4000042500

45000

d) Haplotypes covered with >5 phased reads

1 3 52 4 6 7 9 118 10 12 13 15 1914 16 2017 2118 22 XM YChr

UnphasedReference gaps

Haplotype 1Haplotype 2

Phase

d

Oxford Nanopore Technologies and the Wheel icon are registered trademarks of Oxford Nanopore Technologies in various countries. All other brands are the property of their respective owners. © 2020 Oxford Nanopore Technologies. All rights reserved. Oxford Nanopore Technologies’ products are for research use only. P20005 - Version 1