Sept2016 sv illumina

COMPANY CONFIDENTIAL – FOR INTERNAL USE ONLY © 2016 Illumina, Inc. All rights reserved. Illumina, 24sure, BaseSpace, BeadArray, BlueFish, BlueFuse, BlueGnome, cBot, CSPro, CytoChip, DesignStudio, Epicentre, ForenSeq, Genetic Energy, GenomeStudio, GoldenGate, HiScan, HiSeq, HiSeq X, Infinium, iScan, iSelect, MiniSeq, MiSeq, MiSeqDx, MiSeq FGx, NeoPrep, NextBio, Nextera, NextSeq, Powered by Illumina, SureMDA, TruGenome, TruSeq, TruSight, Understand Your Genome, UYG, VeraCode, verifi, VeriSeq, the pumpkin orange color, and the streaming bases design are trademarks of Illumina, Inc. and/or its affiliate(s) in the US and/or other countries. All other names, logos, and other trademarks are the property of their respective owners. A Population-Based SV Callset Michael Eberle September 15, 2016


A Population-Based SV Callset Michael Eberle September 15, 2016

Populations as a CNV/SV discovery tool

●  Collected high-depth WGS data from 3,000-4,000 samples

●  Current SV & CNV callers and SNP information can be used for hypothesis generation
   -  Manta, Canvas & SNP GT information (HWE)

●  Confirm CNVs and SVs and refine break points
   -  Combination of depth and assembly

●  Create targeted callers for common SVs
   -  Develop improved methods to call these variants on any sample (improved consistency and accuracy)

Hypothesis generation

●  Manta (split read) calls from ~3,000 samples
   -  Good for deletions up to ~10kb
   -  These variants should work well for graph-based calling

●  Canvas (depth) calls from ~3,000 samples
   -  Primarily deletions >10kb
   -  These variants should be callable using targeted depth analysis

●  Population genetics (SNP calls from ~2,200 Europeans)
   -  Good for most size ranges but relies on SNPs overlapping CNV
   -  May identify variants that split read methods miss
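
The population-genetics signal can be illustrated with a small sketch. A SNP that sits under a hemizygous deletion is typically genotyped as homozygous, so SNPs overlapping a common deletion tend to show fewer heterozygotes than Hardy-Weinberg equilibrium (HWE) predicts. The function names below are hypothetical and the chi-square test is only one possible choice of statistic; the slides do not say which test was used.

```python
def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square statistic (1 degree of freedom) for departure from HWE."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic sites).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def het_deficit(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Observed minus expected heterozygote count; negative means a deficit."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    return n_ab - 2 * n * p * (1.0 - p)
```

A run of adjacent SNPs with a large statistic and a negative `het_deficit` would be a candidate region to hand to the depth and assembly steps.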

Small SV Validation and boundary resolution

Assembly of putative deletions

●  Assemble common deletions using different methods
   -  K-mer approach (SPAdes)
   -  String assembly (SGA)

●  Large number of samples should improve assembly
   -  5% deletion ~ 4,500x depth (in 3,000 sequenced samples)
   -  Easy if we can identify just the reads associated with the deletion
   -  Low complexity (e.g. STRs) and variability around break-ends are problematic (but highlight issues that need to be resolved)

●  Assembled ~9,800 deletions to break point resolution
   -  Starting from Manta calls in PG and population
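
The 4,500x figure above can be reproduced with a short calculation. This is a sketch under one assumption not stated on the slide: that each sample is sequenced to roughly 30x diploid coverage, so each haplotype contributes about 15x. The function name is hypothetical.

```python
def pooled_allele_depth(n_samples: int, allele_freq: float,
                        per_sample_depth: float = 30.0) -> float:
    """Approximate pooled read depth on haplotypes carrying a given allele.

    per_sample_depth is an ASSUMED diploid coverage (30x), not a value
    taken from the slides.
    """
    haplotypes_with_allele = 2 * n_samples * allele_freq
    depth_per_haplotype = per_sample_depth / 2
    return haplotypes_with_allele * depth_per_haplotype


# 3,000 samples, 5% allele frequency -> 300 haplotypes at ~15x each
print(pooled_allele_depth(3000, 0.05))
```

Under that assumption a 5% deletion is carried on about 300 haplotypes, which gives roughly 4,500x of pooled coverage across the breakpoint if the supporting reads can be isolated.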

Genotyping deletions with graphs

●  Graph alignment + genotyping tool
   -  Create a sequence graph for deletions
   -  Align + count reads on BP edges and on BP boundaries

●  “Genotype” deletions via mixture model on read counts.

●  Very preliminary results and some obvious improvements are still in progress
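
As an illustration of genotyping from breakpoint read counts, here is a simplified stand-in for the mixture model: a three-component binomial model with fixed ALT read fractions per genotype, rather than a mixture fitted to the data. The fractions, the error rate, and the function names are assumptions made for this sketch, not details of the actual tool.

```python
import math

# ASSUMED expected fraction of ALT-supporting reads for each genotype;
# 0.01 stands in for a sequencing/alignment error rate.
EXPECTED_ALT_FRACTION = {"0/0": 0.01, "0/1": 0.5, "1/1": 0.99}


def genotype_log_likelihoods(ref_reads: int, alt_reads: int) -> dict:
    """Binomial log-likelihood of the read counts under each genotype."""
    n = ref_reads + alt_reads
    log_choose = (math.lgamma(n + 1) - math.lgamma(alt_reads + 1)
                  - math.lgamma(ref_reads + 1))
    return {
        gt: log_choose + alt_reads * math.log(f) + ref_reads * math.log(1.0 - f)
        for gt, f in EXPECTED_ALT_FRACTION.items()
    }


def call_genotype(ref_reads: int, alt_reads: int, min_reads: int = 1) -> str:
    """Return the maximum-likelihood genotype, or './.' with too few reads."""
    if ref_reads + alt_reads < min_reads:
        return "./."
    log_liks = genotype_log_likelihoods(ref_reads, alt_reads)
    return max(log_liks, key=log_liks.get)
```

Here `ref_reads` would be reads counted on the REF breakpoint boundaries and `alt_reads` reads counted on the ALT breakpoint edge of the graph.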

[Figure: sequence graph for a deletion. Labels: REF+ALT flanks, REF, REF BP start, REF BP end, ALT BP; windows of 25 bp up and 25 bp down around each breakpoint.]

Candidate events in Platinum Genomes:

PG Pedigree Genotyping

●  High-level variant statistics
   -  ~3,700 hom-ref, 830 hom-alt
   -  ~1,100 variants consistent
   -  ~2,000 variants probably wrong in <= 2 samples
   -  ~1,900 need improvement

GIAB Trio Analysis

●  GIAB Ashkenazim Trio:
   -  HG002 – Son
   -  HG003 – Father
   -  HG004 – Mother

●  Trio Statistics (Total: 9,739)

   Outcome                            Count   Percentage
   Homref everywhere                  3,425   35.2%
   Pass                               2,131   21.9%
   Conflict (5) / Male X is het (28)     33    0.3%
   Parents het                        1,045   10.7%
   Called in 2                        1,637   16.8%
   Called in <= 1                     1,468   15.1%
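
The basic check behind a "Conflict" outcome can be sketched as a Mendelian consistency test on autosomal trio genotypes. The slide does not define the outcome categories precisely, so this only shows the core test; genotypes are encoded as ALT allele counts (0, 1, 2), and the function names are hypothetical. Male chromosome X calls would need separate handling.

```python
from itertools import product


def transmittable(genotype: int) -> set:
    """ALT allele counts (0 or 1) a parent with this genotype can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def is_mendelian_consistent(child: int, father: int, mother: int) -> bool:
    """True if the child's genotype can arise from one allele per parent."""
    return any(
        a + b == child
        for a, b in product(transmittable(father), transmittable(mother))
    )


# HG002 (son) het, HG003 (father) hom-ref, HG004 (mother) het: consistent
print(is_mendelian_consistent(1, 0, 1))
# Son hom-alt with a hom-ref father: a Mendelian conflict
print(is_mendelian_consistent(2, 0, 1))
```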

Depth-based validation and boundary resolution

Canvas-identified deletion in the population

●  Characterization of a high frequency call

●  Use fine-grained depth-map to isolate boundaries of the call
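
One way to read boundaries off a fine-grained depth map is to look for the longest run of bins whose depth falls well below the local baseline. The sketch below assumes fixed-width bins, a known baseline depth, and a 0.75 ratio threshold (a heterozygous deletion is expected near 0.5); all of these, and the function name, are illustrative choices rather than the method used in the talk.

```python
def deletion_boundaries(depths, bin_size: int, baseline: float,
                        max_ratio: float = 0.75):
    """Return (start, end) offsets in bp of the longest low-depth run.

    depths: mean depth per bin; baseline: expected diploid depth (> 0).
    Returns None when no bin falls below max_ratio * baseline.
    """
    best = None
    run_start = None
    # A sentinel bin at baseline depth closes a run that reaches the end.
    for i, depth in enumerate(list(depths) + [baseline]):
        low = depth / baseline < max_ratio
        if low and run_start is None:
            run_start = i
        elif not low and run_start is not None:
            if best is None or i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = None
    if best is None:
        return None
    return best[0] * bin_size, best[1] * bin_size


# 100 bp bins around a heterozygous deletion, baseline depth 30x
print(deletion_boundaries([30, 31, 29, 15, 14, 16, 15, 30, 29], 100, 30.0))
```

The resolution is limited to the bin size, so the result is a starting window for assembly rather than a base-pair breakpoint.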

Improving breakpoints – depth pileup

●  Characterization of a high frequency call

●  Use fine-grained depth-map to isolate boundaries of the call

[Figure: depth pileup around the call boundaries; scale bars: 1 kb]

Improving breakpoints – six double deletions

[Figure: breakpoint regions in the six double-deletion samples; scale bars: 180 bp]

●  Assembly from 6 hom del samples resolves breakpoints and co-inserted sequence

●  Expect ~5 double deletions in 3,000 samples for deletions with ~4% frequency
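
The expectation of ~5 double (homozygous) deletions follows from Hardy-Weinberg proportions: with allele frequency q, a fraction q^2 of samples is expected to be homozygous. A minimal sketch, assuming HWE holds and using a hypothetical function name:

```python
def expected_homozygotes(n_samples: int, allele_freq: float) -> float:
    """Expected number of homozygous carriers under Hardy-Weinberg."""
    return n_samples * allele_freq ** 2


# 3,000 samples at 4% allele frequency: 3000 * 0.04^2, about 4.8
print(expected_homozygotes(3000, 0.04))
```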

Next steps (for deletions)

●  Complete merging of all the putative common deletions
   -  Currently some parts of this work are proceeding in parallel and we have not merged the analysis/variants yet

●  Attempt to validate and refine all deletions
   -  LD, depth, assembly

●  Finish graph-based GT and depth-based GT
   -  QC these GT callers using both Mendel errors and HWE

●  Map calls onto PG and GIAB samples

●  Demonstrate other variants on other samples
   -  Sequencing 150 Coriell samples of diverse ethnicity to provide additional support of calls within (and outside of) GIAB families

Acknowledgements

Mitch Bekritsky (SNP analysis)

Andy Gross (population depth)

Nathan Johnson (assembly)

Peter Krusche (graph genotyping)