Sept2016 sv 10_x
Transcript of Sept2016 sv 10_x
SV-calling from 10X Linked-Read data
GIAB SV Data Jamboree
Sofia Kyriazopoulou-Panagiotopoulou, Sr Scientist, 10X [email protected]
Sept 15, 2016
2
Making Linked-Reads
1.0ng high-molecular-weight gDNA = 300 haploid copies of the genome
Collect
GEMs
Oil
Barcoded primer library
Pool
Enzyme
Primers with the same barcode
0.5ng DNA (150 haploid copies of the genome) split into 1M partitions
Long input molecule
P5 | 16bp BC | R1 | Nmer | gDNA Insert
Linked-Reads
Sequence
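The mass-to-copies figures on this slide (1.0ng = 300 copies, 0.5ng = 150 copies) can be checked with a short calculation. This is a minimal sketch: the genome size (~3.1 Gb) and the average mass of a base pair (~650 g/mol) are assumptions that do not appear on the slide, and `haploid_copies` is a hypothetical helper name.

```python
AVOGADRO = 6.022e23          # molecules per mole
GENOME_BP = 3.1e9            # assumed haploid human genome size in bp
GRAMS_PER_MOL_BP = 650.0     # assumed average mass of one base pair (g/mol)


def haploid_copies(mass_ng: float) -> float:
    """Approximate number of haploid genome copies in `mass_ng` of gDNA."""
    grams_per_genome = GENOME_BP * GRAMS_PER_MOL_BP / AVOGADRO  # ~3.3 pg
    return mass_ng * 1e-9 / grams_per_genome


if __name__ == "__main__":
    for mass in (1.0, 0.5):
        print(f"{mass} ng ~ {haploid_copies(mass):.0f} haploid copies")
```

Under these assumptions one haploid genome weighs about 3.3 pg, which is where the round numbers of 300 and 150 copies come from.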
3
What Linked-Reads are not
150X avg molecule coverage
chr13: BRCA2
> 30X avg read coverage
• Each GEM contains 150/1M ≈ 1/6,700 of the genome (~500 Kb).
• If the average molecule length is 50Kb: 10 molecules/GEM.
• At an average of 30X sequencing depth, the read depth per molecule is 30/150 = 0.2X.
• Roughly 35 read-pairs (Linked-Reads) per molecule.
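The arithmetic in the bullets above can be reproduced as follows. This is a sketch only: the genome size (3.2 Gb) and the 2x150bp read-pair length are assumptions not stated on the slide, and `linked_read_stats` is a hypothetical name.

```python
def linked_read_stats(haploid_copies=150, partitions=1_000_000,
                      genome_bp=3.2e9, molecule_bp=50_000,
                      read_depth=30.0, read_pair_bp=300):
    """Back-of-the-envelope Linked-Read numbers.

    genome_bp and read_pair_bp (2 x 150bp) are assumed values.
    """
    genome_fraction_per_gem = haploid_copies / partitions
    dna_per_gem_bp = genome_fraction_per_gem * genome_bp
    molecules_per_gem = dna_per_gem_bp / molecule_bp
    # Total read depth is shared across all copies of the genome.
    depth_per_molecule = read_depth / haploid_copies
    read_pairs_per_molecule = depth_per_molecule * molecule_bp / read_pair_bp
    return {
        "genome_fraction_per_gem": genome_fraction_per_gem,
        "dna_per_gem_bp": dna_per_gem_bp,
        "molecules_per_gem": molecules_per_gem,
        "depth_per_molecule": depth_per_molecule,
        "read_pairs_per_molecule": read_pairs_per_molecule,
    }


if __name__ == "__main__":
    for name, value in linked_read_stats().items():
        print(f"{name}: {value:g}")
```

With these inputs each GEM holds roughly 480 Kb of DNA in about 10 molecules, and each 50Kb molecule is covered at 0.2X by about 33 read-pairs, in line with the rounded figures on the slide.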
4
Linked-Reads make SV detection easier
Short-read data
-> 10X barcoding -> Barcoded short-read data
-> Molecule inference -> Linked-Read data
-> Phasing -> Phased Linked-Read data
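The molecule inference step could look roughly like the sketch below: reads that share a barcode and lie close together on the same chromosome are grouped into one inferred molecule. This is an illustration only, not the 10X implementation; the function name, the `(barcode, chrom, pos)` input layout, and the 60Kb gap threshold are all assumptions.

```python
from collections import defaultdict


def infer_molecules(reads, max_gap=60_000):
    """Group barcoded reads into inferred molecules.

    reads: iterable of (barcode, chrom, pos) tuples.
    max_gap: hypothetical threshold; a larger gap between consecutive
             reads of the same barcode starts a new molecule.
    Returns a list of (barcode, chrom, start, end, n_reads) tuples.
    """
    by_key = defaultdict(list)
    for barcode, chrom, pos in reads:
        by_key[(barcode, chrom)].append(pos)

    molecules = []
    for (barcode, chrom), positions in sorted(by_key.items()):
        positions.sort()
        start = prev = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - prev > max_gap:
                # Gap too large: close the current molecule, open a new one.
                molecules.append((barcode, chrom, start, prev, count))
                start, count = pos, 0
            prev = pos
            count += 1
        molecules.append((barcode, chrom, start, prev, count))
    return molecules
```

For example, three reads with barcode `BC1` at positions 100, 5,000 and 200,000 on `chr1` would yield two inferred molecules, because the last read is more than `max_gap` away from the first two.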
5
Deletion detection from Linked-Read data
Linked-Read alignment + Phasing
Candidate generation:
• Coverage drop detection (HMM)
• Discordant read-pair clustering
• Local assembly by haplotype
Candidate filtering:
• Probabilistic modeling of insert sizes, errors, phasing (EM)
Final phased HET/HOM deletion calls
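To make the "Coverage drop detection (HMM)" step more concrete, here is a minimal two-state HMM decoded with the Viterbi algorithm over binned read counts. The slide only says that an HMM is used; the Poisson emissions, the two states (normal vs. dropped coverage), and every parameter value below are assumptions chosen for illustration.

```python
import math


def poisson_logpmf(k, lam):
    """Log-probability of observing count k under Poisson(lam)."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def call_coverage_drops(counts, normal_rate, drop_rate=0.5, stay_prob=0.99):
    """Viterbi path over states 0 (normal) and 1 (coverage drop).

    counts: reads per bin (for example, per haplotype).
    normal_rate / drop_rate: assumed mean reads per bin in each state.
    stay_prob: assumed probability of staying in the same state.
    """
    rates = (normal_rate, drop_rate)
    log_stay, log_switch = math.log(stay_prob), math.log(1 - stay_prob)
    score = [math.log(0.5) + poisson_logpmf(counts[0], r) for r in rates]
    back = []
    for k in counts[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            cands = [score[p] + (log_stay if p == s else log_switch)
                     for p in (0, 1)]
            best = 0 if cands[0] >= cands[1] else 1
            pointers.append(best)
            new_score.append(cands[best] + poisson_logpmf(k, rates[s]))
        score = new_score
        back.append(pointers)
    # Trace back the most likely state sequence.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def drop_intervals(path):
    """Convert a state path into half-open (start_bin, end_bin) intervals."""
    intervals, start = [], None
    for i, s in enumerate(path):
        if s == 1 and start is None:
            start = i
        elif s == 0 and start is not None:
            intervals.append((start, i))
            start = None
    if start is not None:
        intervals.append((start, len(path)))
    return intervals
```

Running this on the reads of a single haplotype would flag heterozygous deletions as drops on that haplotype only. The resulting intervals are candidates, which the slide's later filtering step would still need to confirm.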
6
Large SV detection from Linked-Read data
[Figure: segment diagram with three panels. Reference: A B C D E V W X Y Z. Inversion: A B C W V E D X Y Z. Aligning to the reference: A B C X Y Z; D E V W.]
• If the event is heterozygous we see a mixture of the two types of signal.
• Probabilistic model of molecule length distribution and read depth to call and phase variants (deletions, inversions, duplications, translocations).
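A much simpler stand-in for the probabilistic model mentioned above is to test whether two distant loci share more barcodes than expected by chance. The sketch below uses a hypergeometric tail test, which is a swapped-in technique for illustration and not the model described on the slide; the function name and inputs are hypothetical.

```python
import math


def shared_barcode_pvalue(barcodes_a, barcodes_b, total_barcodes):
    """Test for excess barcode sharing between two genomic windows.

    barcodes_a, barcodes_b: barcodes seen in window A and window B.
    total_barcodes: number of distinct barcodes in the library (assumed known).
    Returns (n_shared, p_value) where p_value = P(X >= n_shared) for
    X ~ Hypergeometric(total_barcodes, |A|, |B|).
    """
    a, b = set(barcodes_a), set(barcodes_b)
    shared = len(a & b)
    n_a, n_b = len(a), len(b)
    denom = math.comb(total_barcodes, n_b)
    tail = sum(
        math.comb(n_a, k) * math.comb(total_barcodes - n_a, n_b - k)
        for k in range(shared, min(n_a, n_b) + 1)
    )
    return shared, tail / denom
```

Two windows far apart on the reference should share few barcodes. A very small p-value suggests that single molecules span both windows, as happens across the breakpoints of a large deletion, inversion, duplication, or translocation.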
7
A/J and CEPH trio calls
[Figure panels: Large-scale SVs (>30Kb); Deletions 50bp-30Kb]
• Mostly deletions
• 2-3 inversions in the son/daughter
A/J calls against hg19 and GRCh38 deposited to GIAB
8
SV-calling in hard regions
• 840bp HET deletion call in child and dad, no call in mom
• No mappability for short reads: the region lies within a 200Kb segmental duplication whose copy has >98% identity.
[Figure tracks: 10X Genomics; PCR-free TruSeq. Rows: Hap 1, Hap 2, Unphased.]
9
SV-calling in hard regions
• 73bp HET deletion call in child and mom, no call in dad.
• Overlaps simple repeat.
• Supported by 10X de-novo assembly.
[Figure tracks: 10X Genomics; PCR-free TruSeq; 10X de-novo assembly.]
10
Improved breakpoint resolution over short reads
• ~30Kb HET deletion call in child and dad, no call in mom.
• No read-pair support in PCR-free TruSeq data; low mappability at breakpoints (LINEs).
• Barcode-aware alignment allows us to get near-bp resolution.
[Figure tracks: 10X Genomics; PCR-free TruSeq.]
11
Beyond deletions
• HOM inversion
• No read-pair support, low mappability at breakpoints (LINEs)
• Breakpoints resolved within <1.5Kb in all three individuals of the trio
90Kb inversion causes molecules to “jump” between breakpoints (labeled A and B in the figure).
[Figure tracks: 10X Genomics; PCR-free TruSeq.]
12
Future development
• Not all repetitive/hard-to-map regions are resolved by the Linked-Read aligner.
• Highly polymorphic regions and assembly artifacts lead to false positives: SV whitelist/blacklist?
• How do we compare/overlap SV calls, especially in repeat-rich areas?
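One common convention for the comparison question raised on this slide is reciprocal overlap: two calls match if each covers at least a given fraction of the other. The sketch below is only one possible answer, not a method endorsed by the talk; the 50% threshold and the `(chrom, start, end)` layout are assumptions.

```python
def reciprocal_overlap(a, b):
    """Reciprocal overlap of two calls given as (chrom, start, end), half-open."""
    if a[0] != b[0]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    # Fraction of the longer call that is covered by the overlap.
    return min(overlap / (a[2] - a[1]), overlap / (b[2] - b[1]))


def match_calls(calls_a, calls_b, min_ro=0.5):
    """Return all (call_a, call_b) pairs with reciprocal overlap >= min_ro."""
    return [
        (a, b)
        for a in calls_a
        for b in calls_b
        if reciprocal_overlap(a, b) >= min_ro
    ]
```

In repeat-rich areas the same event can be reported at shifted coordinates, so a pure coordinate-overlap criterion like this one may miss true matches. That limitation is part of what the slide's question points at.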