March 2013 NIST Reference Material Program and Data Integration
NIST Program for Human Genome Reference Materials
Marc Salit and Justin Zook, NIST
Some use cases for a well-characterized, stable RM
• Obtain metrics for validation, QC, QA, PT
• Determine sources and types of bias/error
• Learn to resolve difficult structural variants
• Improve reference genome assembly
• Optimization
– integration of data from multiple platforms
– sequencing and analysis
• Enable regulated applications
Comparison of SNP Calls for NA12878 on 2 platforms, 3 analysis methods
Measurement Process
Sample
gDNA isolation
Library Prep
Sequencing
Alignment/Mapping
Variant Calling
Confidence Estimates
Downstream Analysis
[Diagram label: generic measurement process]
• gDNA reference materials will be developed to characterize performance of a part of the process
– materials will be certified for their variants against a reference sequence, with confidence estimates
Variants of Interest
• SNPs (and larger polymorphisms)
• Indels
• Longer insertions/deletions
• Inversions
• Rearrangements
• CNV (different lengths)
– Deletions, tandem and dispersed duplications
– duplications with SNPs/indels
• Mobile Element Insertions
• NIST working with GiaB to select genomes
• Current plan
– NA12878 HapMap sample as Pilot sample
• part of 17-member pedigree
– trios from PGP as more complete set
• 8 trios, focus on children
• varying biogeographic ancestry
CEPH Utah Pedigree 1463
Putting “Genomes” in Bottles
Consenting Genomes for use as Reference Materials
• Risk of re-identification
– this is a real risk
– privacy
– implications for family members
• Meaning of possibility of withdrawal
• Commercial application
– indirect, research
– direct, derived products
• PGP project currently state-of-art
– broad and direct
– test to demonstrate understanding
• “Wild West”
Characterization Methods
Whole Genome Sequencing
• ABI 5500 (1kb, 6kb, and 10kb mate-pair libraries)
• Illumina
• Complete Genomics
– including LFR
• Emerging technologies
– Ion Proton
– nanopore?
• 3x replication of sequencing (3 library preps)
• …
Other
• Genotyping microarrays
• Array CGH
• Targeted sequencing
• Fosmid sequencing?
• Optical Mapping?
[Pedigree figure: Father, Mother; NA12878, Husband; Son, Daughter]
Timeline
Consortium Activity
• WG Telecons
– Starting up in April
– Info to be posted on www.genomeinabottle.org
• schedules
• agendas
• summaries
• Website forums
– general and supporting each WG
• Upcoming Workshops
– Proposed 8/2013
• NIST, Gaithersburg, MD
NIST RM Activity
• 80 mg gDNA for NA12878 expected @ NIST 4/2013
– 8000 samples
– available for characterization within GiaB immediately
– target for release as NIST RM 2/2014
• SNPs, small indels
• PGP Samples coming
• IRB Status
– working to establish policy
• looks good for release of NA12878 as pilot RM
• PGP samples expected to gain approval
Artificial Constructs
• useful as spike-ins
– QC on clinical samples
• a panel of druggable targets in development at NCI
– pDNA with a mutation insert
• ‘barcoded’ adjacent to mutation of interest
• large-scale constructs may be useful for SV and specific contexts
• recapitulate “difficult” sequence contexts
– simple sequence
– duplications
Reference Samples
Sample Preparation
Sequencing
Bioinformatics
Microbial Genome RMs
Variant List, Performance Metrics
Extracted DNA
DATA INTEGRATION
Multiple data sets offer an opportunity for integration and raise the question of just how to do it.
Datasets
• 9 whole genome
– Illumina, CG, 454, SOLiD
• 3 whole exome
– Illumina, Ion Torrent
Integration of Data to Form “Gold Standard” Genotype Calls
• Candidate variants: find all possible variant sites
• Confident variants: find highly confident sites across multiple datasets
• Find characteristics of bias: identify sites with atypical characteristics signifying sequencing, mapping, or alignment bias
• Arbitration: for each site, remove datasets with decreasingly atypical characteristics until all datasets agree
• Confidence Level: even if all datasets agree, identify them as uncertain if few have typical characteristics
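The arbitration steps above can be sketched in Python as follows. This is only an illustration, not the NIST implementation: the `Call` record, the single `atypicality` score per dataset, and the `typical_cutoff` and `min_typical` thresholds are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Call:
    dataset: str        # hypothetical dataset label
    genotype: str       # "hom_ref", "het", or "hom_var"
    atypicality: float  # hypothetical score: higher means more atypical evidence


def arbitrate(calls, typical_cutoff=1.0, min_typical=2):
    """Return (genotype, status) for one candidate site.

    Datasets are dropped from most to least atypical until the remaining
    ones agree. The site is flagged "uncertain" if too few of the agreeing
    datasets have typical characteristics.
    """
    remaining = sorted(calls, key=lambda c: c.atypicality)  # most typical first
    while remaining and len({c.genotype for c in remaining}) > 1:
        remaining.pop()  # drop the most atypical dataset still in play
    if not remaining:
        return None, "uncertain"
    n_typical = sum(c.atypicality <= typical_cutoff for c in remaining)
    status = "confident" if n_typical >= min_typical else "uncertain"
    return remaining[0].genotype, status


site = [
    Call("platform_A", "het", 0.2),
    Call("platform_B", "het", 0.4),
    Call("platform_C", "hom_ref", 3.1),  # e.g. strong strand bias
]
print(arbitrate(site))
```

In this toy site the atypical dataset is removed first, the two remaining datasets agree, and both look typical, so the site is reported as a confident heterozygous call.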
Characteristics of Sequence Data/Genotype associated with bias
• Systematic sequencing errors
– Strand bias
– Base Quality Rank Sum Test
• Local Alignment problems
– Distance from end of read
– Mean position within read
– Read Position Rank Sum
– HaplotypeScore
– Mean length of aligned reads
• Mapping problems
– Mapping Quality
– Higher (or lower) than expected coverage – CNV
– Length of aligned reads
• Abnormal allele balance or Quality/Depth
– Allele Balance
– Quality/Depth
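Two of the characteristics listed above, allele balance and Quality/Depth, are simple ratios. The sketch below shows one way to compute them and flag a heterozygous call as atypical. The definition of allele balance used here (fraction of reads supporting the alternate allele) is one possible convention, and the thresholds are illustrative assumptions, not values from the talk.

```python
def allele_balance(ref_reads, alt_reads):
    """Fraction of reads supporting the alternate allele (one possible convention)."""
    total = ref_reads + alt_reads
    return alt_reads / total if total else float("nan")


def quality_by_depth(qual, depth):
    """Variant quality score normalized by read depth."""
    return qual / depth if depth else float("nan")


def looks_atypical_het(ref_reads, alt_reads, qual, ab_range=(0.3, 0.7), min_qd=2.0):
    """Flag a heterozygous call whose evidence falls outside illustrative thresholds."""
    ab = allele_balance(ref_reads, alt_reads)
    qd = quality_by_depth(qual, ref_reads + alt_reads)
    return not (ab_range[0] <= ab <= ab_range[1]) or qd < min_qd


print(looks_atypical_het(ref_reads=18, alt_reads=16, qual=400.0))  # balanced, well supported
print(looks_atypical_het(ref_reads=45, alt_reads=5, qual=60.0))    # skewed toward reference
```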
Example of Arbitration: SSE suspected from strand bias
[Figure: Platform B and Platform A panels; annotations: Homopolymer, Strand Bias (SNP overrepresented on reverse strands)]
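One common way to quantify strand bias at a site is Fisher's exact test on a 2x2 table of allele (reference, alternate) by strand (forward, reverse). The talk does not say which statistic was used, so the sketch below is an assumption for illustration, and the read counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


# Hypothetical read counts at one site:
#            forward  reverse
# reference     20       18
# alternate      1       15
p = fisher_exact_two_sided(20, 18, 1, 15)
print(f"strand-bias p-value: {p:.4g}")
```

A small p-value means the alternate allele is distributed across strands differently from the reference allele, which is the pattern described on the slide (SNP overrepresented on reverse strands).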
Performance Assessment of Genotype Calling
• For our purposes, we consider three categories of genotype calls
– homozygous reference
– heterozygous
– homozygous variant
• by convention
– Negative: homozygous reference
– Positive: anything else
• our approach looks at 3x3 matrix of call concordance
• Fourth category: Uncertain Genotype
– developing
• Three performance assessments:
– Individual dataset and Consensus calls against Omni SNP Array
– Individual dataset against Omni SNP Array and Consensus
– Individual dataset with two different genotype callers against Consensus
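A minimal sketch of the 3x3 call-concordance tally and of the negative/positive convention described above. The site identifiers and genotype labels are hypothetical, and treating a site missing from the assessed calls as homozygous reference/no call is an assumption of this example.

```python
from collections import Counter

GENOTYPES = ("hom_ref", "het", "hom_var")


def concordance_matrix(assessed, truth):
    """Count (assessed genotype, truth genotype) pairs over the truth sites."""
    counts = Counter()
    for site, truth_gt in truth.items():
        # Assumption: a site missing from the assessed calls is hom_ref / no call
        counts[(assessed.get(site, "hom_ref"), truth_gt)] += 1
    return counts


def collapse(counts):
    """Collapse to TP/FP/FN/TN with negative = hom_ref, positive = anything else."""
    labels = {(True, True): "TP", (True, False): "FP",
              (False, True): "FN", (False, False): "TN"}
    out = Counter()
    for (assessed_gt, truth_gt), n in counts.items():
        out[labels[(assessed_gt != "hom_ref", truth_gt != "hom_ref")]] += n
    return out


# Hypothetical calls at five sites
truth = {"s1": "hom_ref", "s2": "het", "s3": "het", "s4": "hom_var", "s5": "hom_ref"}
assessed = {"s1": "hom_ref", "s2": "het", "s3": "hom_ref", "s4": "het", "s5": "het"}

matrix = concordance_matrix(assessed, truth)
for row in GENOTYPES:
    print(row, [matrix[(row, col)] for col in GENOTYPES])
print(dict(collapse(matrix)))
```

Site `s4` is called heterozygous where the truth is homozygous variant. After collapsing it counts as a true positive, while the 3x3 matrix keeps it visible as a genotype discordance.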
Genotype Comparison Tables
Columns (Method as “Truth”): Hom. Ref., Heterozygous, Hom. Variant, Uncertain
Rows (Method being Assessed): Hom. Ref., Het., Hom. Var., Uncertain
[Table template: several cells are marked “?” on the slide]
* current state of research: only consensus process has “Uncertain” category
Consensus has lower FN rate than individual datasets
Rows: method being assessed. Columns: Illumina Omni SNP Array as “truth”.
HiSeq – GATK                    Homozygous Reference  Heterozygous    Homozygous Variant  Uncertain
Homozygous Reference/No Call    1.45M                 7.24k (1.34%)   5.28k (0.65%)       N/A
Heterozygous                    196 (0.03%)           411k (60.7%)    133 (0.02%)         N/A
Homozygous Variant              154 (0.02%)           150 (0.02%)     249k (37.0%)        N/A
Integrated Consensus Genotypes  Homozygous Reference  Heterozygous    Homozygous Variant  Uncertain
Homozygous Reference            1.45M                 613 (0.09%)     977 (0.15%)         N/A
Heterozygous                    241 (0.04%)           414k (61.5%)    173 (0.03%)         N/A
Homozygous Variant              152 (0.02%)           61 (0.01%)      249k (36.9%)        N/A
Uncertain                       5458 (0.81%)          3421 (0.51%)    4808 (0.71%)        N/A
Both tables are annotated with “FNs” and “FPs*”.
* Note that most or all of the putative FPs seem to actually be FNs on the microarray
SNP arrays overestimate performance
Rows: HiSeq – GATK. Columns: Illumina Omni SNP Array as “truth” (annotated “FNs” and “FPs*”).
HiSeq – GATK                    Homozygous Reference  Heterozygous    Homozygous Variant  Uncertain
Homozygous Reference/No Call    1.45M                 7.24k (1.34%)   5.28k (0.65%)       N/A
Heterozygous                    196 (0.03%)           411k (60.7%)    133 (0.02%)         N/A
Homozygous Variant              154 (0.02%)           150 (0.02%)     249k (37.0%)        N/A
Rows: HiSeq – GATK. Columns: Integrated Consensus Genotypes as “truth” (annotated “FNs” and “FPs”).
HiSeq – GATK                    Homozygous Reference  Heterozygous    Homozygous Variant  Uncertain
Homozygous Reference/No Call    1.52M                 157k (4.68%)    30.3k (0.90%)       4.17M
Heterozygous                    47 (0.00%)            1.90M (56.4%)   34 (0.00%)          16.9k (0.50%)
Homozygous Variant              1 (0.00%)             298 (0.01%)     1.19M (35.3%)       73.3k (2.18%)
Samtools has higher FP and lower FN than GATK
Rows: method being assessed. Columns: Integrated Consensus Genotypes as “truth”. Both tables are annotated with “FNs” and “FPs”.
HiSeq – samtools                Homozygous Reference  Heterozygous    Homozygous Variant  Uncertain
Homozygous Reference/No Call    1.51M                 49.6k (1.47%)   6.74k (0.20%)       3.93M
Heterozygous                    3141 (0.09%)          2.00M (59.6%)   74 (0.00%)          175k (5.19%)
Homozygous Variant              21 (0.00%)            777 (0.02%)     1.21M (36.0%)       192k (5.71%)
HiSeq – GATK                    Homozygous Reference  Heterozygous    Homozygous Variant  Uncertain
Homozygous Reference/No Call    1.52M                 157k (4.68%)    30.3k (0.90%)       4.17M
Heterozygous                    47 (0.00%)            1.90M (56.4%)   34 (0.00%)          16.9k (0.50%)
Homozygous Variant              1 (0.00%)             298 (0.01%)     1.19M (35.3%)       73.3k (2.18%)
Performance Metrics: Characteristics of Mis-calls
[Figure: panels for QUAL/Depth of Coverage, Strand Bias, …; HiSeq/GATK genotypes (Heterozygous, Hom. Variant, Hom. Ref./No call) against Consensus Genotypes (Heterozygous, Hom. Variant, Hom. Ref., Uncertain)]
Challenges with assessing performance
• All variant types are not equal
• Nearby variants are often difficult to align
• All regions of the genome are not equal
– Homopolymers, STRs, duplications
– Can be similar or different in different genomes
• Labeling difficult variants as uncertain leads to higher apparent accuracy when assessing performance
• Genotypes fall in 3+ categories (not positive/negative)
– standard diagnostic accuracy measures not well posed
• Data from multiple platforms and library preparations
– when characterizing a Reference Material
– when assessing performance of a test platform
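The point above about uncertain labels raising apparent accuracy can be seen with a small worked example. The counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def error_rate(groups):
    """groups: list of (n_sites, n_errors) pairs; returns the overall error fraction."""
    sites = sum(n for n, _ in groups)
    errors = sum(e for _, e in groups)
    return errors / sites


# Hypothetical counts: difficult sites are fewer but far more error-prone
easy = (9000, 9)
difficult = (1000, 100)

all_sites = error_rate([easy, difficult])  # every site is assessed
confident_only = error_rate([easy])        # difficult sites labeled uncertain and excluded
print(f"all sites: {all_sites:.4f}, confident only: {confident_only:.4f}")
```

With these counts, excluding the difficult sites lowers the measured error rate about tenfold, even though the calls themselves have not changed.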
Genome-in-a-Bottle Consortium
• Genome-in-a-Bottle
– www.genomeinabottle.org
• newsletters, blogs, forums, announcements
– new partners welcome!
– targeting pilot reference material availability in 2013
– working to identify best practice for consent of subject genome as a whole-genome reference material
• Developing genomic DNA reference materials for small number of microbial species
– to enable performance assessment of sequencing platforms
– range of GC
– range of complexity
QUESTIONS?
Microbial Reference Material Considerations
• Variation in GC Content
– Genomes with a range of GC to challenge platforms
– Within genome variation to challenge analytical process to define mobile genetic and insertion elements
• Structural variations to challenge the ability to recognize
– Repetitive sequences (e.g. palindromic repeats)
– Homopolymers (>14 bases)
– Insertion elements
– Chromosomal rearrangements
– SNP calls (e.g. variant silencing due to motifs)
• Reference data available on multiple platforms
• Pedigree/phylogeny of strains
• Phenotypic characterization
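Several of the considerations above (overall GC content, within-genome GC variation, homopolymers longer than 14 bases) can be measured directly from a sequence. The sketch below is an illustration on a toy string. The function names and window sizes are assumptions, and a real genome would be read from a FASTA file.

```python
import re


def gc_fraction(seq):
    """Fraction of G/C bases among unambiguous A/C/G/T bases."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def windowed_gc(seq, window, step):
    """GC fraction in sliding windows, to expose within-genome variation."""
    return [gc_fraction(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]


def long_homopolymers(seq, min_len=15):
    """Return (start, base, length) for homopolymer runs of at least min_len bases."""
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(1), len(m.group(0)))
            for m in pattern.finditer(seq.upper())]


toy = "ATATATATGCGCGCGC" + "A" * 16 + "GGCC"
print(gc_fraction(toy))
print(windowed_gc(toy, window=8, step=8))
print(long_homopolymers(toy))
```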
Interesting work on assessing performance for microbial sequencing
• Quail et al. at Sanger report on using 4 different microbial genomes to characterize sequencer performance
– ~20% to ~68% GC overall
– Bordetella pertussis
• 67.7 % GC, with some regions in excess of 90 % GC content
– Salmonella Pullorum
• 52 % GC
– Staphylococcus aureus
• 33 % GC
– Plasmodium falciparum
• 19.3 % GC, with some regions close to 0 % GC content
• “We routinely use these to test new sequencing technologies, as together their sequences represent the range of genomic landscapes that one might encounter.”
Quail, M. et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics 13, 341 (2012).