March 2013 NIST Reference Material Program and Data Integration

NIST Program for Human Genome Reference Materials

Marc Salit and Justin Zook, NIST

Some use cases for a well-characterized, stable RM

• Obtain metrics for validation, QC, QA, PT

• Determine sources and types of bias/error

• Learn to resolve difficult structural variants

• Improve reference genome assembly

• Optimization

– integration of data from multiple platforms

– sequencing and analysis

• Enable regulated applications

Comparison of SNP Calls for NA12878 on 2 platforms, 3 analysis methods

Measurement Process

Sample

gDNA isolation

Library Prep

Sequencing

Alignment/Mapping

Variant Calling

Confidence Estimates

Downstream Analysis

• gDNA reference materials will be developed to characterize performance of a part of process

– materials will be certified for their variants against a reference sequence, with confidence estimates

generic measurement process

Variants of Interest

• SNPs (and larger polymorphisms)

• Indels

• Longer insertions/deletions

• Inversions

• Rearrangements

• CNV (different lengths)

– Deletions, tandem and dispersed dups

– duplications with SNPs/indels

• Mobile Element Insertions

• NIST working with GiaB to select genomes

• Current plan

– NA12878 HapMap sample as Pilot sample

  • part of 17-member pedigree

– trios from PGP as more complete set

  • 8 trios, focus on children

  • varying biogeographic ancestry

CEPH Utah Pedigree 1463

Putting “Genomes” in Bottles

Consenting Genomes for use as Reference Materials

• Risk of re-identification

– this is a real risk

– privacy

– implications for family members

• Meaning of possibility of withdrawal

• Commercial application

– indirect, research

– direct, derived products

• PGP project currently state-of-art

– broad and direct

– test to demonstrate understanding

• “Wild West”

Characterization Methods

Whole Genome Sequencing

• ABI 5500 (1kb, 6kb, and 10kb mate-pair libraries)

• Illumina

• Complete Genomics

– including LFR

• Emerging technologies

– Ion Proton

– nanopore?

• 3x replication of sequencing (3 library preps)

• …

Other

• Genotyping microarrays

• Array CGH

• Targeted sequencing

• Fosmid sequencing?

• Optical Mapping?

Father Mother

NA12878 Husband

Son Daughter

Timeline

Consortium Activity

• WG Telecons

– Starting up in April

– Info to be posted on www.genomeinabottle.org

  • schedules

  • agendas

  • summaries

• Website forums

– general and supporting each WG

• Upcoming Workshops

– Proposed 8/2013

  • NIST, Gaithersburg, MD

NIST RM Activity

• 80 mg gDNA for NA12878 expected @ NIST 4/2013

– 8000 samples

– available for characterization within GiaB immediately

– target for release as NIST RM 2/2014

  • SNPs, small indels

• PGP Samples coming

• IRB Status

– working to establish policy

  • looks good for release of NA12878 as pilot RM

  • PGP samples expected to gain approval

http://www.genomeinabottle.org/

Artificial Constructs

• useful as spike-ins

– QC on clinical samples

• a panel of druggable targets in development at NCI

– pDNA with a mutation insert

  • ‘barcoded’ adjacent to mutation of interest

• large-scale constructs may be useful for SV and specific contexts

• recapitulate “difficult” sequence contexts

– simple sequence

– duplications

Reference Samples

Sample Preparation

Sequencing

Bioinformatics

Microbial Genome RMs

Variant List, Performance

Metrics

Extracted DNA

DATA INTEGRATION

Multiple data sets present both an opportunity for integration and the question of just how to do it.

Datasets

• 9 whole genome – Illumina, CG, 454, SOLiD

• 3 whole exome – Illumina, Ion Torrent

Integration of Data to Form “Gold Standard” Genotype Calls

Find all possible variant sites

Find highly confident sites across multiple datasets

Identify sites with atypical characteristics signifying sequencing, mapping, or alignment bias

For each site, remove datasets with decreasingly atypical characteristics until all datasets agree

Even if all datasets agree, identify them as uncertain if few have typical characteristics

Candidate variants

Confident variants

Find characteristics of bias

Arbitration

Confidence Level
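
The arbitration steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the consortium's implementation: the `Call` record, the single `atypicality` score, and the `min_typical` and `typical_cutoff` thresholds are all hypothetical stand-ins for the per-dataset characteristics the slide describes.

```python
from dataclasses import dataclass


@dataclass
class Call:
    dataset: str        # hypothetical dataset label, e.g. "platform_A"
    genotype: str       # "0/0", "0/1" or "1/1"
    atypicality: float  # hypothetical score: larger means stronger evidence of bias


def arbitrate(calls, min_typical=2, typical_cutoff=1.0):
    """Return (genotype, label) for one site; thresholds are illustrative only."""
    if not calls:
        return None, "uncertain"
    # Order datasets from most typical to most atypical.
    remaining = sorted(calls, key=lambda c: c.atypicality)
    # Drop the most atypical remaining dataset until the rest agree.
    while len({c.genotype for c in remaining}) > 1:
        remaining.pop()
    # Even if all remaining datasets agree, flag the site when few look typical.
    n_typical = sum(c.atypicality <= typical_cutoff for c in remaining)
    label = "confident" if n_typical >= min_typical else "uncertain"
    return remaining[0].genotype, label


site = [Call("platform_A", "0/1", 0.2),
        Call("platform_B", "0/1", 0.5),
        Call("platform_C", "1/1", 3.0)]
print(arbitrate(site))
```

In this toy site the discordant, most atypical dataset is removed first, and the two remaining datasets agree on a heterozygous call. A real pipeline would draw on several annotations per dataset rather than one score.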

Characteristics of Sequence Data/Genotype associated with bias

• Systematic sequencing errors

– Strand bias

– Base Quality Rank Sum Test

• Local Alignment problems

– Distance from end of read

– Mean position within read

– Read Position Rank Sum

– HaplotypeScore

– Mean length of aligned reads

• Mapping problems

– Mapping Quality

– Higher (or lower) than expected coverage – CNV

– Length of aligned reads

• Abnormal allele balance or Quality/Depth

– Allele Balance

– Quality/Depth

Example of Arbitration: SSE suspected from strand bias

Platform B

Platform A

Homopolymer

Strand Bias (SNP overrepresented on reverse strands)
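
One way to put a number on strand bias is a Fisher's exact test on the 2x2 table of reference/alternate reads by forward/reverse strand. The sketch below is an assumption about how this could be quantified, not the method used on the slide; the function names and the example read counts are invented for illustration.

```python
from math import comb, log10


def strand_bias_p_value(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Two-sided Fisher's exact test on the 2x2 allele-by-strand read counts."""
    ref_total = ref_fwd + ref_rev
    alt_total = alt_fwd + alt_rev
    fwd_total = ref_fwd + alt_fwd
    n = ref_total + alt_total
    if n == 0:
        return 1.0

    def prob(a):
        # Hypergeometric probability of `a` forward-strand reference reads
        # given the fixed row and column totals.
        return comb(ref_total, a) * comb(alt_total, fwd_total - a) / comb(n, fwd_total)

    observed = prob(ref_fwd)
    lowest = max(0, fwd_total - alt_total)
    highest = min(ref_total, fwd_total)
    # Sum every table that is no more likely than the observed one.
    total = sum(p for p in map(prob, range(lowest, highest + 1))
                if p <= observed * (1 + 1e-9))
    return min(1.0, total)


def phred(p):
    """Phred-scale a p-value; larger values mean stronger evidence of bias."""
    return -10 * log10(max(p, 1e-300))


# Invented counts: the alternate allele appears only on reverse-strand reads.
p = strand_bias_p_value(ref_fwd=20, ref_rev=20, alt_fwd=0, alt_rev=10)
print(p, phred(p))
```

A site whose alternate allele is seen on only one strand gets a small p-value, which is the kind of signal that could lead arbitration to set a platform's call aside at that site.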

Performance Assessment of Genotype Calling

• For our purposes, we consider three categories of genotype calls

– homozygous reference

– heterozygous

– homozygous variant

• by convention

– Negative: homozygous reference

– Positive: anything else

• our approach looks at 3x3 matrix of call concordance

• Fourth category: Uncertain Genotype

– developing

• Three performance assessments:

– Individual dataset and Consensus calls against Omni SNP Array

– Individual dataset against Omni SNP Array and Consensus

– Individual dataset with two different genotype callers against Consensus
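
The call-concordance matrix and the negative/positive convention described above can be sketched as follows. This is a minimal example under assumptions: the genotype labels (`hom_ref`, `het`, `hom_var`), the site keys, and the choice to treat a missing call as homozygous reference are illustrative, not taken from the authors' code.

```python
from collections import Counter


def concordance(assessed, truth):
    """Count (assessed, truth) genotype pairs over every site in `truth`.

    Sites missing from `assessed` are treated as homozygous reference,
    i.e. "no call" is pooled with homozygous reference (an assumption).
    """
    counts = Counter()
    for site, truth_gt in truth.items():
        counts[(assessed.get(site, "hom_ref"), truth_gt)] += 1
    return counts


def fn_fp(counts):
    """Apply the convention: negative = hom_ref, positive = anything else."""
    fn = sum(n for (a, t), n in counts.items()
             if a == "hom_ref" and t in ("het", "hom_var"))
    fp = sum(n for (a, t), n in counts.items()
             if a in ("het", "hom_var") and t == "hom_ref")
    return fn, fp


# Invented toy sites keyed by position.
truth = {"chr1:100": "het", "chr1:200": "hom_var",
         "chr1:300": "hom_ref", "chr1:400": "het"}
assessed = {"chr1:100": "het", "chr1:200": "het", "chr1:300": "het"}

matrix = concordance(assessed, truth)
print(matrix)
print(fn_fp(matrix))
```

Note that a het call at a site whose truth is hom_var counts as neither FN nor FP under this convention; it is a genotype discordance that only the full matrix exposes.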

Genotype Comparison Tables

Columns: Method as “Truth” (Hom. Ref., Heterozygous, Hom. Variant, Uncertain)

Rows: Method being Assessed (Hom. Ref., Het., Hom. Var., Uncertain)

Cells involving the Uncertain category are marked “?”

* current state of research: only consensus process has “Uncertain” category

Consensus has lower FN rate than individual datasets

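Rows: HiSeq – GATK (method being assessed). Columns: Illumina Omni SNP Array (method as “truth”).

Genotype | Homozygous Reference | Heterozygous | Homozygous Variant | Uncertain
Homozygous Reference/No Call | 1.45M | 7.24k (1.34%) | 5.28k (0.65%) | N/A
Heterozygous | 196 (0.03%) | 411k (60.7%) | 133 (0.02%) | N/A
Homozygous Variant | 154 (0.02%) | 150 (0.02%) | 249k (37.0%) | N/A

Rows: Integrated Consensus Genotypes (method being assessed). Columns: Illumina Omni SNP Array (method as “truth”).

Genotype | Homozygous Reference | Heterozygous | Homozygous Variant | Uncertain
Homozygous Reference | 1.45M | 613 (0.09%) | 977 (0.15%) | N/A
Homozygous Variant | 152 (0.02%) | 61 (0.01%) | 249k (36.9%) | N/A
Uncertain | 5458 (0.81%) | 3421 (0.51%) | 4808 (0.71%) | N/A

In both tables, “FNs” marks the Homozygous Reference row under the Heterozygous and Homozygous Variant columns, and “FPs*” marks the Heterozygous and Homozygous Variant rows under the Homozygous Reference column.

* Note that most or all of the putative FPs seem to actually be FNs on the microarray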


SNP arrays overestimate performance


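Rows: HiSeq – GATK (method being assessed). Columns: Illumina Omni SNP Array (method as “truth”).

Genotype | Homozygous Reference | Heterozygous | Homozygous Variant | Uncertain
Homozygous Reference/No Call | 1.45M | 7.24k (1.34%) | 5.28k (0.65%) | N/A
Heterozygous | 196 (0.03%) | 411k (60.7%) | 133 (0.02%) | N/A
Homozygous Variant | 154 (0.02%) | 150 (0.02%) | 249k (37.0%) | N/A

Rows: HiSeq – GATK (method being assessed). Columns: Integrated Consensus Genotypes (method as “truth”).

Genotype | Homozygous Reference | Heterozygous | Homozygous Variant | Uncertain
Homozygous Reference/No Call | 1.52M | 157k (4.68%) | 30.3k (0.90%) | 4.17M
Heterozygous | 47 (0.00%) | 1.90M (56.4%) | 34 (0.00%) | 16.9k (0.50%)
Homozygous Variant | 1 (0.00%) | 298 (0.01%) | 1.19M (35.3%) | 73.3k (2.18%)

In both tables, “FNs” marks the Homozygous Reference/No Call row under the Heterozygous and Homozygous Variant columns. The Heterozygous and Homozygous Variant rows under the Homozygous Reference column are marked “FPs*” against the SNP array and “FPs” against the consensus.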


Samtools has higher FP and lower FN than GATK


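Rows: HiSeq – samtools (method being assessed). Columns: Integrated Consensus Genotypes (method as “truth”).

Genotype | Homozygous Reference | Heterozygous | Homozygous Variant | Uncertain
Homozygous Reference/No Call | 1.51M | 49.6k (1.47%) | 6.74k (0.20%) | 3.93M
Heterozygous | 3141 (0.09%) | 2.00M (59.6%) | 74 (0.00%) | 175k (5.19%)
Homozygous Variant | 21 (0.00%) | 777 (0.02%) | 1.21M (36.0%) | 192k (5.71%)

Rows: HiSeq – GATK (method being assessed). Columns: Integrated Consensus Genotypes (method as “truth”).

Genotype | Homozygous Reference | Heterozygous | Homozygous Variant | Uncertain
Homozygous Reference/No Call | 1.52M | 157k (4.68%) | 30.3k (0.90%) | 4.17M
Heterozygous | 47 (0.00%) | 1.90M (56.4%) | 34 (0.00%) | 16.9k (0.50%)
Homozygous Variant | 1 (0.00%) | 298 (0.01%) | 1.19M (35.3%) | 73.3k (2.18%)

In both tables, “FNs” marks the Homozygous Reference/No Call row under the Heterozygous and Homozygous Variant columns, and “FPs” marks the Heterozygous and Homozygous Variant rows under the Homozygous Reference column.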

Performance Metrics: Characteristics of Mis-calls

[Figure: QUAL/Depth of Coverage, Strand Bias, … shown for each cell of the genotype comparison. Rows: HiSeq/GATK (Heterozygous, Hom. Variant, Hom. Ref./No call). Columns: Consensus Genotypes (Heterozygous, Hom. Variant, Hom. Ref., Uncertain).]

Challenges with assessing performance

• All variant types are not equal

• Nearby variants are often difficult to align

• All regions of the genome are not equal

– Homopolymers, STRs, duplications

– Can be similar or different in different genomes

• Labeling difficult variants as uncertain leads to higher apparent accuracy when assessing performance

• Genotypes fall in 3+ categories (not positive/negative)

– standard diagnostic accuracy measures not well posed

• Data from multiple platforms and library preparations

– when characterizing a Reference Material

– when assessing performance of a test platform
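
The point about uncertain labels raising apparent accuracy can be shown with a toy calculation. The data below are invented for illustration and the function name is hypothetical; the sketch only shows that dropping hard sites from the comparison raises the concordance figure without any change in the calls.

```python
def apparent_concordance(sites, exclude_uncertain):
    """sites: iterable of (assessed_genotype, truth_genotype, is_uncertain)."""
    kept = [(a, t) for a, t, uncertain in sites
            if not (exclude_uncertain and uncertain)]
    if not kept:
        return float("nan")
    return sum(a == t for a, t in kept) / len(kept)


# Toy data, invented for illustration: 8 easy sites called correctly,
# 2 difficult sites of which 1 is miscalled.
toy = [("het", "het", False)] * 8 + [("het", "het", True),
                                     ("het", "hom_var", True)]

print(apparent_concordance(toy, exclude_uncertain=False))  # 0.9
print(apparent_concordance(toy, exclude_uncertain=True))   # 1.0
```

Reporting the fraction of sites labeled uncertain next to any accuracy figure is one way to keep the two numbers from being read in isolation.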

Genome-in-a-Bottle Consortium

• Genome-in-a-Bottle

– www.genomeinabottle.org

  • newsletters, blogs, forums, announcements

– new partners welcome!

– targeting pilot reference material availability in 2013

– working to identify best practice for consent of subject genome as a whole-genome reference material

• Developing genomic DNA reference materials for small number of microbial species

– to enable performance assessment of sequencing platforms

– range of GC

– range of complexity


QUESTIONS?

Microbial Reference Material Considerations

• Variation in GC Content

– Genomes with a range of GC to challenge platforms

– Within genome variation to challenge analytical process to define mobile genetic and insertion elements

• Structural variations to challenge the ability to recognize

– Repetitive sequences (e.g. palindromic repeats)

– Homopolymers (>14 bases)

– Insertion elements

– Chromosomal rearrangements

– SNP calls (e.g. variant silencing due to motifs)

• Reference data available on multiple platforms

• Pedigree/phylogeny of strains

• Phenotypic characterization
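
Two of the considerations above, GC range and long homopolymers, are simple to measure on a candidate genome sequence. The sketch below is a minimal illustration: the function names, the window sizes, and the toy sequences are assumptions, and a real analysis would read sequences from FASTA files.

```python
import re


def gc_fraction(seq):
    """Fraction of G and C among the unambiguous A/C/G/T bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return float("nan")
    return (seq.count("G") + seq.count("C")) / acgt


def windowed_gc(seq, window, step):
    """GC fraction in sliding windows, to expose within-genome variation."""
    return [gc_fraction(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]


def long_homopolymers(seq, min_len=15):
    """Return (start, base, length) for single-base runs of at least min_len.

    The default of 15 corresponds to runs of more than 14 bases.
    """
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(1), len(m.group(0)))
            for m in pattern.finditer(seq.upper())]


toy = "GC" * 10 + "A" * 16 + "TA" * 5  # invented toy sequence
print(gc_fraction(toy))
print(windowed_gc("GGCCAATT", window=4, step=4))
print(long_homopolymers(toy))
```

Windowed GC is the more useful view for reference material selection, since a genome with a moderate overall GC fraction can still contain extreme local regions.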

Interesting work on assessing performance for microbial sequencing

• Quail et al. at Sanger report on using 4 different microbial genomes to characterize sequencer performance

– ~20% - ~68% GC overall

– Bordetella pertussis

  • 67.7 % GC, with some regions in excess of 90 % GC content

– Salmonella Pullorum

  • 52 % GC

– Staphylococcus aureus

  • 33 % GC

– Plasmodium falciparum

  • 19.3 % GC, with some regions close to 0 % GC content

• “We routinely use these to test new sequencing technologies, as together their sequences represent the range of genomic landscapes that one might encounter.”

Quail, M. et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics 13, 341 (2012).
