171017 giab for giab grc workshop

Genome in a Bottle: Developing benchmark sets for large indels and structural variants Justin Zook, Marc Salit, and the GIAB Consortium NIST Genome-Scale Measurements Group Joint Initiative for Metrology in Biology (JIMB) Oct 16, 2017


Page 1: 171017 giab for giab grc workshop

Genome in a Bottle: Developing benchmark sets for large indels and structural variants

Justin Zook, Marc Salit, and the GIAB Consortium

NIST Genome-Scale Measurements Group

Joint Initiative for Metrology in Biology (JIMB)

Oct 16, 2017

Page 2: 171017 giab for giab grc workshop

Take-home Messages

• Genome in a Bottle is authoritatively characterizing human genomes

• Current characterization enables benchmarking of “easier” variants/regions in germline genomes

– Clinical validation

– Technology development, optimization, and demonstration

• Now working on difficult variants and regions

– Draft variant calls >=20bp available and feedback requested

– Many challenges remain and collaborations welcome!

Page 3: 171017 giab for giab grc workshop

Why are we doing this?

• Technologies evolving rapidly

• Different sequencing and bioinformatics methods give different results

• Now have concordance in easy regions, but not in difficult regions

• Challenge:

– How do we characterize 6 billion bases in the genome with high confidence?

O’Rawe et al., Genome Medicine, 2013. https://doi.org/10.1186/gm432

Page 4: 171017 giab for giab grc workshop

GIAB is evolving

2012

• No human benchmark calls available

• GIAB Consortium formed

2014

• Small variant genotypes for ~77% of pilot genome NA12878

2015

• NIST releases first human genome Reference Material

2016

• 4 new genomes

• Small variants for 90% of 5 genomes for GRCh37/38

2017+

• Characterizing difficult variants

Page 5: 171017 giab for giab grc workshop

Genome in a Bottle Consortium

Authoritative Characterization of Human Genomes

Sample

gDNA isolation

Library Prep

Sequencing

Alignment/Mapping

Variant Calling

Confidence Estimates

Downstream Analysis

• gDNA reference materials to evaluate performance

• GIAB is developing:

– reference materials

– Reference data

– Methods

– Tools to calculate performance metrics

generic measurement process

www.slideshare.net/genomeinabottle

Page 6: 171017 giab for giab grc workshop

Bringing Principles of Metrology to the Genome

• Reference materials

– DNA in a tube from NIST

• Extensive state-of-the-art characterization

• “Upgradable” as technology develops

• Commercial innovation

– PGP genomes suitable for commercial derived products

• Benchmarking tools and software

– with GA4GH

• Enhance new technologies

Page 7: 171017 giab for giab grc workshop

GIAB has characterized 5 human genome RMs

• Pilot genome

– NA12878

• PGP Human Genomes

– Ashkenazi Jewish son

– Ashkenazi Jewish trio

– Chinese son

• Parents also characterized

National Institute of Standards & Technology

Report of Investigation

Reference Material 8391

Human DNA for Whole-Genome Variant Assessment (Son of Eastern European Ashkenazim Jewish Ancestry)

This Reference Material (RM) is intended for validation, optimization, and process evaluation purposes. It consists of a male whole human genome sample of Eastern European Ashkenazim Jewish ancestry, and it can be used to assess performance of variant calling from genome sequencing. A unit of RM 8391 consists of a vial containing human genomic DNA extracted from a single large growth of human lymphoblastoid cell line GM24385 from the Coriell Institute for Medical Research (Camden, NJ). The vial contains approximately 10 µg of genomic DNA, with the peak of the nominal length distribution longer than 48.5 kb, as referenced by Lambda DNA, and the DNA is in TE buffer (10 mM TRIS, 1 mM EDTA, pH 8.0).

This material is intended for assessing performance of human genome sequencing variant calling by obtaining estimates of true positives, false positives, true negatives, and false negatives. Sequencing applications could include whole genome sequencing, whole exome sequencing, and more targeted sequencing such as gene panels. This genomic DNA is intended to be analyzed in the same way as any other sample a lab would process and analyze extracted DNA. Because the RM is extracted DNA, it is not useful for assessing pre-analytical steps such as DNA extraction, but it does challenge sequencing library preparation, sequencing machines, and the bioinformatics steps of mapping, alignment, and variant calling. This RM is not intended to assess subsequent bioinformatics steps such as functional or clinical interpretation.

Information Values: Information values are provided for single nucleotide polymorphisms (SNPs), small insertions and deletions (indels), and homozygous reference genotypes for approximately 88 % of the genome, using methods similar to described in reference 1. An information value is considered to be a value that will be of interest and use to the RM user, but insufficient information is available to assess the uncertainty associated with the value. We describe and disseminate our best, most confident, estimate of the genotypes using the data and methods currently available. These data and genomic characterizations will be maintained over time as new data accrue and measurement and informatics methods become available. The information values are given as a variant call file (vcf) that contains the high-confidence SNPs and small indels, as well as a tab-delimited “bed” file that describes the regions that are called high-confidence. Information values cannot be used to establish metrological traceability. The files referenced in this report are available at the Genome in a Bottle ftp site hosted by the National Center for Biotechnology Information (NCBI). The Genome in a Bottle ftp site for the high-confidence vcf and high confidence regions is:

ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/latest

Expiration of Value Assignment: RM 8391 is valid, until 23 December 2024, provided the RM is handled and stored in accordance with instructions given in this report (see “Instructions for Storage and Use”). This material and associated information values are nullified if the RM is damaged, contaminated, or otherwise modified.

Maintenance of RM: This report will be updated periodically to reflect important new releases as the high-confidence calls and regions are updated. NIST will monitor this RM over the period of its validity. If substantive technical changes occur that affect the value assignment before the expiration of this report, NIST will notify the purchaser. Registration (see attached sheet or register online) will facilitate notification.

Overall direction and coordination of the analyses was performed by J. Zook and M. Salit of the NIST Biosystems and Biomaterials Division.

Anne L. Plant, Chief

Biosystems and Biomaterials Division

Steven J. Choquette, Director

Office of Reference Materials

Gaithersburg, MD 20899

Report Issue Date: 08 September 2016

Page 8: 171017 giab for giab grc workshop

Integration of diverse data types and analyses

• Data publicly available

– Deep short reads

– Linked reads

– Long reads

– Optical/nanopore mapping

• Analyses

– Small variant calling

– SV calling

– Local and global assembly

• Discover & refine sequence-resolved calls from multiple datasets & analyses

• Compare variant and genotype calls from different methods

• Evaluate/genotype calls with other data

• Identify features associated with reliability of calls from each method

• Form benchmark calls using heuristics & machine learning

• Compare benchmarks to high-quality callsets and examine differences

Page 9: 171017 giab for giab grc workshop

Paper describing data…

• 51 authors

• 14 institutions

• 12 datasets

• 7 genomes

• Data described in ISA-tab

Page 10: 171017 giab for giab grc workshop

Evolution of high-confidence small variants

Calls | HC Regions | HC Calls | HC indels | Concordant with PG | NIST-only in beds | PG-only in beds | PG-only | Variants Phased
v2.19 | 2.22 Gb | 3153247 | 352937 | 3030703 | 87 | 404 | 1018795 | 0.3%
v3.2.2 | 2.53 Gb | 3512990 | 335594 | 3391783 | 57 | 52 | 657715 | 3.9%
v3.3 | 2.57 Gb | 3566076 | 358753 | 3441361 | 40 | 60 | 608137 | 8.8%
v3.3.2 | 2.58 Gb | 3691156 | 487841 | 3529641 | 47 | 61 | 469202 | 99.6%

5-7 errors in NIST

1-7 errors in NIST

~2 FPs and ~2 FNs per million NIST variants in PG and NIST bed files

Page 11: 171017 giab for giab grc workshop

Global Alliance for Genomics and Health Benchmarking Task Team

• Developed standardized definitions for performance metrics like TP, FP, and FN.

• Developing sophisticated benchmarking tools

• Integrated into a single framework with standardized inputs and outputs

• Standardized bed files with difficult genome contexts for stratification

https://github.com/ga4gh/benchmarking-tools

Variant types can change when decomposing or recomposing variants:

Complex variant:

chr1 201586350 CTCTCTCTCT CA

DEL + SNP:

chr1 201586350 CTCTCTCTC C

chr1 201586359 T A

Credit: Peter Krusche, IlluminaGA4GH Benchmarking Team
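The equivalence between the complex record and its decomposition can be checked by applying each representation to the same reference window and comparing the resulting haplotypes. The sketch below is illustrative only: the flanking bases and the helper name `apply_variants` are assumptions, not part of the GA4GH tools. The deletion is written here with a nine-base REF (CTCTCTCTC) so that it ends before the SNP at 201586359; a ten-base REF would overlap that position and the helper would reject it.

```python
def apply_variants(ref_seq, ref_start, variants):
    """Apply non-overlapping (pos, ref, alt) records to a reference window.

    ref_start is the 1-based coordinate of ref_seq[0]; positions in
    variants are 1-based, as in VCF.
    """
    pieces = []
    cursor = ref_start  # next reference coordinate not yet consumed
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError("overlapping variants cannot share a haplotype")
        offset = pos - ref_start
        if ref_seq[offset:offset + len(ref)] != ref:
            raise ValueError(f"REF allele does not match reference at {pos}")
        pieces.append(ref_seq[cursor - ref_start:offset])  # untouched bases
        pieces.append(alt)                                 # substituted allele
        cursor = pos + len(ref)
    pieces.append(ref_seq[cursor - ref_start:])
    return "".join(pieces)


# Hypothetical window: two made-up flanking bases on each side of the repeat.
WINDOW_START = 201586348
WINDOW = "GA" + "CTCTCTCTCT" + "GG"

complex_call = [(201586350, "CTCTCTCTCT", "CA")]
decomposed = [(201586350, "CTCTCTCTC", "C"), (201586359, "T", "A")]

same_haplotype = (
    apply_variants(WINDOW, WINDOW_START, complex_call)
    == apply_variants(WINDOW, WINDOW_START, decomposed)
)
```

Comparing haplotype sequences rather than individual records is what lets a comparison tool treat differently represented calls as matching.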

Page 12: 171017 giab for giab grc workshop

Benchmarking Tools

Standardized comparison, counting, and stratification with Hap.py + vcfeval

https://precision.fda.gov/

https://github.com/ga4gh/benchmarking-tools
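Once a comparison has produced TP, FP, and FN counts, the usual summary metrics follow directly. The sketch below is a minimal illustration under stated assumptions: the class and function names are hypothetical and the counts are made up, and it is not the implementation used by Hap.py or vcfeval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    tp: int  # benchmark variants matched by the query callset
    fp: int  # query variants with no match in the benchmark
    fn: int  # benchmark variants missed by the query callset


def precision(c: Counts) -> float:
    called = c.tp + c.fp
    return c.tp / called if called else float("nan")


def recall(c: Counts) -> float:
    truth = c.tp + c.fn
    return c.tp / truth if truth else float("nan")


def f1(c: Counts) -> float:
    denom = 2 * c.tp + c.fp + c.fn
    return 2 * c.tp / denom if denom else float("nan")


# Made-up counts for one stratum, e.g. a single difficult genome context.
example = Counts(tp=990, fp=10, fn=20)
summary = {
    "precision": precision(example),
    "recall": recall(example),
    "f1": f1(example),
}
```

Computing the same three numbers per stratification BED file is one way to see where a pipeline loses accuracy.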

Page 13: 171017 giab for giab grc workshop

What are we accessing and what is still challenging?

Type of variant | Genome context | Fraction of variants called* | Number of variants missing* | How to improve?
Simple SNPs | Not repetitive | ~97% | >100k | Machine learning
Simple indels | Not repetitive | ~93% | >10k | Machine learning
All variants | Low mappability | <30% | >170k | Use linked reads and long reads
All variants | Regions not in GRCh37/38 | 0 | >>100k??? | De novo assembly; long reads
Small indels | Tandem repeats and homopolymers | <50% | >200k | STR/homopolymer callers; long reads; better handle complex and compound variants
Indels 15-50bp | All | <25% | >30k | Assembly-based callers; integrate larger variants differently; long reads
Indels >50bp | All | <1% | >20k |

* Approximate values based on fraction of variants in GATKHC or FermiKit that are inside v3.3.2 High-confidence regions
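The footnote's "fraction of variants inside high-confidence regions" can be estimated with a simple interval lookup. The sketch below is one way to do it, assuming the BED intervals are already merged (non-overlapping) and that variant positions are 1-based as in VCF, while BED intervals are 0-based and half-open. The function names and the toy data are hypothetical.

```python
import bisect


def parse_bed(lines):
    """Read BED lines into {chrom: sorted list of (start, end)} (0-based, half-open)."""
    regions = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.rstrip("\n").split("\t")[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    for intervals in regions.values():
        intervals.sort()
    return regions


def in_regions(regions, chrom, pos_1based):
    """True if a 1-based position falls inside any interval (assumes merged intervals)."""
    intervals = regions.get(chrom, [])
    pos0 = pos_1based - 1  # convert VCF coordinate to BED coordinate
    i = bisect.bisect_right(intervals, (pos0, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos0 < intervals[i][1]


def fraction_inside(regions, variants):
    """Fraction of (chrom, pos) variants that lie inside the regions."""
    variants = list(variants)
    if not variants:
        return 0.0
    return sum(in_regions(regions, c, p) for c, p in variants) / len(variants)


# Toy data standing in for a high-confidence BED and a callset.
toy_bed = ["chr1\t100\t200\n", "chr1\t300\t400\n"]
toy_variants = [("chr1", 101), ("chr1", 201), ("chr1", 350), ("chr2", 5)]
toy_fraction = fraction_inside(parse_bed(toy_bed), toy_variants)
```

This checks only the variant's start position; a deletion that begins inside a region but extends past its end would need a stricter containment rule.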

Page 14: 171017 giab for giab grc workshop

How can we extend our approach to structural variants?

Similarities to small variants

• Collect callsets from multiple technologies

• Compare callsets to find calls supported by multiple technologies

Differences from small variants

• Callsets have limited sensitivity

• Variants are often imprecisely characterized

– breakpoints, size, type, etc.

• Representation of variants is poorly standardized, especially when complex

• Comparison tools in infancy

Page 15: 171017 giab for giab grc workshop

Our strategy

Collect many candidate calls for AJ Trio

• Gather candidate calls from a variety of approaches

– Many technologies

• Short, linked, and long reads

• Optical and nanopore mapping

– Many approaches

• Small variant callers

• Structural variant callers

• Local and global de novo assemblies

• Community submitted >1 million calls from 30+ methods using 5+ technologies

Refine/evaluate/genotype candidates

• Obtain sequence-resolved calls as often as possible using assembly-based approaches

• Compare sequence predictions of candidate calls and merge similar calls

• Determine raw data’s support of each sequence-resolved call and its genotype
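Comparing sequence predictions and merging similar calls can be pictured as clustering on a normalized edit distance. The sketch below is a simplified stand-in, not the consortium's pipeline: it uses a plain Levenshtein distance, greedy clustering against the first member of each cluster, and a hypothetical 20% threshold, and it ignores position and variant type, which a real merge would also have to consider.

```python
def edit_distance(a, b):
    """Levenshtein distance via a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def relative_difference(a, b):
    """Edit distance scaled by the longer sequence length (0.0 to 1.0)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def merge_similar(sequences, max_diff=0.2):
    """Greedily group sequences whose difference from a cluster's first member is < max_diff."""
    clusters = []
    for seq in sequences:
        for cluster in clusters:
            if relative_difference(seq, cluster[0]) < max_diff:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


# Made-up predicted insertion sequences from three hypothetical callers.
predictions = ["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT"]
clusters = merge_similar(predictions)
```

Greedy clustering depends on input order; an all-against-all comparison would be more robust but costs more for >1 million candidate calls.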

Page 16: 171017 giab for giab grc workshop

Evaluation/genotyping suite of methods

Current approaches

• svviz – maps reads to REF or ALT alleles

– PacBio

– Illumina paired end and mate-pair

– 10X haplotype-separated

• BioNano – compare size predictions

• Nabsys – evaluates large deletions
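Once reads have been assigned to the REF or ALT allele, a genotype can be assigned from the two counts. The sketch below shows only the general idea with hypothetical thresholds (`min_reads`, `het_low`, `hom_high`); it is not svviz's API or its actual genotyping logic.

```python
def genotype_from_counts(ref_reads, alt_reads, min_reads=8,
                         het_low=0.2, hom_high=0.8):
    """Assign a diploid genotype from REF- and ALT-supporting read counts.

    All thresholds are illustrative assumptions, not values from the source.
    """
    total = ref_reads + alt_reads
    if total < min_reads:
        return "./."          # too little evidence to call
    alt_fraction = alt_reads / total
    if alt_fraction < het_low:
        return "0/0"          # homozygous reference
    if alt_fraction > hom_high:
        return "1/1"          # homozygous variant
    return "0/1"              # heterozygous


# Made-up counts for one candidate call across several datasets.
per_dataset = {
    "PacBio": genotype_from_counts(10, 10),
    "Illumina paired end": genotype_from_counts(14, 12),
    "Illumina mate-pair": genotype_from_counts(2, 1),
}
```

Genotypes from different technologies can then be compared, and disagreements flagged for manual curation.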

Future approaches

• Separate haplotypes on other data types for svviz using whatshap

• Online manual curation of svviz, IGV, dotplots, gEVAL, etc.

– Volunteers needed!

• PCR-Sanger targeted sequencing

– Collaborations welcome!

Page 17: 171017 giab for giab grc workshop

Integrating Sequence-resolved Calls >=20bp

>1 million calls from 30+ sequence-resolved callsets from 4 techs for AJ Trio

>500k unique sequence-resolved calls

30k INS and 32k DEL with 2+ techs or 5+ callers predicting sequences <20% different or BioNano/Nabsys support

28k INS and 29k DEL genotyped by svviz in 1+ individuals

v0.4.0

http://tinyurl.com/GIABSV0-4-0
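The support criterion in the funnel above can be written as a small predicate. This is one reading of the slide text, assumed here: a merged cluster of calls (already grouped by <20% sequence difference) is kept if it is supported by 2+ technologies, or by 5+ callers, or by BioNano/Nabsys evidence. The class and field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CandidateCluster:
    """A group of candidate calls already merged by sequence similarity."""
    technologies: set = field(default_factory=set)  # e.g. {"PacBio", "Illumina"}
    callers: set = field(default_factory=set)       # names of contributing callsets
    mapping_support: bool = False                   # BioNano or Nabsys agrees


def passes_support_filter(cluster, min_techs=2, min_callers=5):
    """Keep clusters with multi-technology, multi-caller, or mapping support."""
    return (len(cluster.technologies) >= min_techs
            or len(cluster.callers) >= min_callers
            or cluster.mapping_support)


# Made-up clusters illustrating each branch of the rule.
examples = [
    CandidateCluster({"PacBio", "Illumina"}, {"callerA", "callerB"}),
    CandidateCluster({"Illumina"}, {"c1", "c2", "c3", "c4", "c5"}),
    CandidateCluster({"PacBio"}, {"callerA"}, mapping_support=True),
    CandidateCluster({"PacBio"}, {"callerA"}),
]
kept = [passes_support_filter(c) for c in examples]
```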

Page 18: 171017 giab for giab grc workshop

Size Distribution of v0.4.0 Calls

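[Figure: size distributions of deletions and insertions, shown separately for calls in tandem repeats and calls not in tandem repeats; peaks labeled Alu and LINE appear for both deletions and insertions.]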

Page 19: 171017 giab for giab grc workshop

Sequence-resolved insertion size relative to BioNano

Page 20: 171017 giab for giab grc workshop

Insertion sequence prediction accuracy differs between methods

[Figure: relative distance from exact match for three methods: Illumina local assembly, PacBio raw read, and PacBio consensus assembly.]

Page 21: 171017 giab for giab grc workshop

Developing web-based manual curation tools

https://github.com/svviz/svviz

Page 22: 171017 giab for giab grc workshop

Outstanding challenges and future work

• Large sequence-resolved insertions

– Many fewer multi-kb insertions than multi-kb deletions

• Dense calls

– ~1/3 of v0.4.0 calls are within 1kb of another v0.4.0 call

– Sequence-resolved insertion size doesn’t always match BioNano

– Phasing will be important for these (e.g., with 10X, whatshap)

• Calls with inaccurate or incomplete sequence change

– Exploring training a model to predict sequence accuracy

• Homozygous Reference calls

– Can we definitively state there is no SV in some regions?

– E.g., using diploid assembly?

• Benchmarking tool development

– How to compare SVs to a benchmark?

– What performance metrics are important?
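The "dense calls" statistic above (the share of calls lying within 1kb of another call) can be computed with one sorted pass per chromosome. The sketch below assumes "within 1kb" means the gap between the end of one call and the start of the next is at most 1000 bp; that definition, the function name, and the toy coordinates are assumptions.

```python
from collections import defaultdict


def fraction_near_neighbor(calls, max_gap=1000):
    """Fraction of (chrom, start, end) calls within max_gap bp of another call."""
    by_chrom = defaultdict(list)
    for chrom, start, end in calls:
        by_chrom[chrom].append((start, end))

    total = near = 0
    for intervals in by_chrom.values():
        intervals.sort()
        furthest_end = None  # largest end coordinate among earlier calls
        for i, (start, end) in enumerate(intervals):
            close_before = (furthest_end is not None
                            and start - furthest_end <= max_gap)
            close_after = (i + 1 < len(intervals)
                           and intervals[i + 1][0] - end <= max_gap)
            near += close_before or close_after
            total += 1
            furthest_end = end if furthest_end is None else max(furthest_end, end)
    return near / total if total else 0.0


# Toy callset: two neighbouring calls, one isolated call, one on another chromosome.
toy_calls = [
    ("chr1", 1000, 1050),
    ("chr1", 1500, 1600),
    ("chr1", 10000, 10100),
    ("chr2", 1200, 1300),
]
toy_density = fraction_near_neighbor(toy_calls)
```

Calls flagged this way are natural candidates for phasing or for evaluation as a combined, compound event.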

Page 23: 171017 giab for giab grc workshop

New public data planned for late 2017

• PacBio Sequel sequencing of GIAB Chinese trio

– Collaboration with Mt. Sinai

– 60x/30x/30x coverage planned

– Potentially >15kb N50 read length

• Oxford Nanopore sequencing of Ashkenazim trio

– Collaboration with Nick Loman and Matt Loose

– ~50x/25x/25x coverage planned

– Ultralong read sequencing (50-100kb+ N50 read length)

Page 24: 171017 giab for giab grc workshop

New Samples

Additional ancestries

• Shorter term

– Use existing PGP individual samples

– Use existing integration pipeline

• Data-based selection

– Proportion of potential genomes from different ancestries

• 3 to 8 new samples

• Longer term

– Recruit large family

– Recruit trios from other ancestry groups

Cancer samples

• Longer term

• Make PGP-consented tumor and normal cell lines from same individual

• Select tumor with diversity of mutation types

Page 25: 171017 giab for giab grc workshop

Take-home Messages

• Genome in a Bottle is authoritatively characterizing human genomes

• Current characterization enables robust benchmarking of “easier” variants/regions

• Actively working on difficult variants and regions

– Draft variant calls >=20bp available – feedback requested!

• New public long and ultralong read datasets coming!

• What can we help enable?

– Clinical applications – precision medicine

– Research applications – how to know new methods are measuring difficult regions/variants well

http://tinyurl.com/GIABSV0-4-0

Page 26: 171017 giab for giab grc workshop

Acknowledgements

• NIST/JIMB

– Marc Salit

– Jenny McDaniel

– Lindsay Vang

– David Catoe

– Lesley Chapman

• Genome in a Bottle Consortium

• GA4GH Benchmarking Team

• FDA

Page 27: 171017 giab for giab grc workshop

For More Information

www.genomeinabottle.org - sign up for general GIAB and Analysis Team google group emails

github.com/genome-in-a-bottle – Guide to GIAB data & ftp

www.slideshare.net/genomeinabottle

SVs: http://tinyurl.com/GIABSV0-4-0

Data: http://www.nature.com/articles/sdata201625

Global Alliance Benchmarking Team

– https://github.com/ga4gh/benchmarking-tools

– precision.fda.gov – GA4GH benchmarking app

Biweekly Analysis Team calls (open to all)

– https://groups.google.com/forum/#!forum/giab-analysis-team

Public workshops

– Next workshop Jan 25-26, 2018 in Stanford, CA

– http://jimb.stanford.edu/giabworkshops for info and registration

NIST/JIMB postdoc opportunities available!

Justin Zook: [email protected]

Marc Salit: [email protected]