Jan2015 ga4 gh variant comparison
GA4GH work towards a standardized variant comparison tool
Kevin Jacobs
1/30/2015
Simple Question
Is a variant present within a genome?
Simple Question?
Is a variant present within a genome?
• sequence at a location:
– allele
– haplotype
– genotype
• something like a:
– VCF genotype
– HGVS string
– dbSNP/ClinVar/HGMD entry
Simple Question?
Is a variant present within a genome?
• Collection of variants (and reference) in
– VCF/gVCF/BCF file
– var/MasterVar file
– dbSNP/ClinVar/HGMD/etc.
– your fancy new file format
– your fancy new database
Problem Statement
Is a variant present within a genome?
Surely this must be a solved problem?
• “Sometimes the questions are complicated and the answers are simple.” (Dr. Seuss)
Is this a simple question?
• It also depends on how we define…
– variant, genome, location, genotype, present
• Can we answer this question?
– Is the location well defined?
– Did we observe reads at that location?
– Could we infer a single most-likely genotype at that location?
– Are we asking about “simple” variation in a “nice” region of the genome?
• If yes to all of these, then we can almost always answer our question correctly.
Don’t Panic!
Why is this so hard?
• Consider c.2_4delCTAinsGC
– REF: ACTAC
– H1: =G-C=
• It can also be spelled
– c.[2C>G; 3del; 4A>C]
– c.[2C>G; 3T>C; 4del]
– c.[2del; 3T>G; 4A>C]
– …
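The equivalence of these spellings can be checked mechanically. The following is a minimal sketch, not an HGVS parser: the helper names `delins` and `per_base` are hypothetical, and each spelling is transcribed by hand into position-level edits before being applied to the reference from the slide.

```python
REF = "ACTAC"  # reference from the slide; positions are 1-based, as in HGVS

def delins(ref, start, end, inserted):
    """Replace the 1-based inclusive interval [start, end] with `inserted`."""
    return ref[:start - 1] + inserted + ref[end:]

def per_base(ref, edits):
    """Apply per-position edits {pos: replacement}; "" means the base is deleted."""
    return "".join(edits.get(pos, base) for pos, base in enumerate(ref, start=1))

# Each spelling from the slide, hand-transcribed (an assumption of this sketch).
haplotypes = {
    "c.2_4delCTAinsGC": delins(REF, 2, 4, "GC"),
    "c.[2C>G; 3del; 4A>C]": per_base(REF, {2: "G", 3: "", 4: "C"}),
    "c.[2C>G; 3T>C; 4del]": per_base(REF, {2: "G", 3: "C", 4: ""}),
    "c.[2del; 3T>G; 4A>C]": per_base(REF, {2: "", 3: "G", 4: "C"}),
}

for spelling, sequence in haplotypes.items():
    print(f"{spelling:24s} -> {sequence}")
```

All four spellings produce the same sequence, AGCC, even though no two of them are equal as strings.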
Assumptions and notation
• We have an accurate reference genome sequence
• Queries are relative to well-defined non-ambiguous regions of the reference sequence
• Simple sequence query / assertion:
– VCF: (chrom, pos, ref, alts, geno)
• E.g. (chrZ, 55, A, G, 1/1)
– Generic: (chrom, start, stop, alleles)
• E.g. (chrZ, 54, 55, G, G)
– These representations are equivalent, modulo some strange VCF encoding rules relating to null alleles
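The conversion between the two notations can be sketched in a few lines. This assumes that the generic form uses 0-based, half-open coordinates (which is what the chrZ example implies) and sidesteps the null-allele encoding rules by mapping a "." call to None; the function name `vcf_to_generic` is hypothetical.

```python
def vcf_to_generic(chrom, pos, ref, alts, geno):
    """Convert a VCF-style assertion to (chrom, start, stop, allele, ...).

    Assumes `pos` is 1-based (as in VCF) and the generic interval is
    0-based half-open. A "." genotype call becomes None.
    """
    alleles = [ref] + list(alts)          # index 0 is REF, 1.. are ALTs
    start = pos - 1
    stop = start + len(ref)
    calls = geno.replace("|", "/").split("/")
    called = tuple(None if c == "." else alleles[int(c)] for c in calls)
    return (chrom, start, stop) + called

print(vcf_to_generic("chrZ", 55, "A", ["G"], "1/1"))
# ('chrZ', 54, 55, 'G', 'G')
```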
Most basic model
• A “genome” G is a set of sequence assertions
• A “query” is a proposition q∈G where q is a sequence assertion
• E.g.
– G = { (chrZ, 55, G, G) }
– Q1 : (chrZ, 55, A, G) ∈ G = False
– Q2 : (chrZ, 55, G, G) ∈ G = True
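In code, the most basic model is nothing more than set membership on tuples. The sketch below copies the tuples from the example above verbatim; representing an assertion as a plain Python tuple is an assumption made for illustration.

```python
# A "genome" is a set of sequence assertions; a query is a membership test.
G = {("chrZ", 55, "G", "G")}

Q1 = ("chrZ", 55, "A", "G") in G
Q2 = ("chrZ", 55, "G", "G") in G

print(Q1, Q2)  # False True
```

The test is purely syntactic: two assertions match only if they are spelled identically, which is the source of the limitations discussed later.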
Basic model extensions
• Simple extensions
– Indels / MNVs
– Reference calls (like gVCF)
– No calls, partial calls
– Arbitrary ploidy
– Phase, quality, filters, etc. (not shown)
G = {(chrZ, 0, 24, =, =), (chrZ, 24, 25, G, G), (chrZ, 25, 53, =, =), (chrZ, 53, 55, NN, NN),
(chrZ, 55, 88, =, =), (chrZ, 88, 92, ATAT, NNNN), (chrZ, 92, 96, =, =), (chrZ, 96, 98, A, ☐),(chrZ, 98, 100, =, =)}
Limitations of the basic model
• Sequence assertions do not have unique representations
– Alignments are not unique
– Alignment models differ
– Nearby variants / phase information
– Missing data and uncertainty
• Sometimes we aren’t asking the right question
Alignments are not unique
• Precedence of insertions, deletions and mismatches:
– REF: ACAC
– H1: =-G= (AGC)
– H2: =G-= (AGC)
Limitations of the basic model
• Sequence assertions do not have unique representations
– REF: TCACACACAG
– H1: T--CACACAG (REF, 1, 3, ☐)
– H2: TC--ACACAG (REF, 2, 4, ☐)
– H3: TCA--CACAG (REF, 3, 5, ☐)
– H4: TCAC--ACAG (REF, 4, 6, ☐)
– H5: TCACA--CAG (REF, 5, 7, ☐)
– H6: TCACAC--AG (REF, 6, 8, ☐)
– H7: TCACACA--G (REF, 7, 9, ☐)
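A quick way to confirm that the seven placements are the same haplotype is to apply each deletion and collect the results. This is a minimal sketch that assumes 0-based, half-open intervals, as in the (REF, start, stop, ☐) tuples above; `delete` is a hypothetical helper.

```python
REF = "TCACACACAG"

def delete(ref, start, stop):
    """Remove the 0-based half-open interval [start, stop) from ref."""
    return ref[:start] + ref[stop:]

# The seven placements H1..H7 from the slide: (1, 3), (2, 4), ..., (7, 9).
placements = [(start, start + 2) for start in range(1, 8)]
results = {delete(REF, start, stop) for start, stop in placements}

print(results)  # {'TCACACAG'}
```

Seven distinct assertions, one haplotype: a set-membership query on the assertion tuples would treat them as seven different things.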
Alignment models differ
• Different alignment scoring:
– REF: A--CAC
– H1: =GG--= (REF, 1, 1, ☐, GG) (REF, 1, 3, CA, ☐)
– H2: =--GG= (REF, 1, 3, CA, GG)
• Base-quality-aware alignment algorithms are even more susceptible to non-unique alignments
Ignoring phase or phase uncertainty introduces ambiguity
– REF: ACGT
– H1: =A== (REF, 1, 2, C, A)
– H2: ==C= (REF, 2, 3, G, C)
• Vs
– REF: ACGT
– H1: =AC= (REF, 1, 2, C, A) (REF, 2, 3, G, C)
– H2: ====
• Vs
– REF: ACGT
– H1: =AC= (REF, 1, 3, CG, AC)
– H2: ====
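The ambiguity can be made concrete by enumerating every pair of haplotypes consistent with two heterozygous SNV calls when no phase is recorded. This is a sketch under stated assumptions: a diploid sample, SNVs only, 0-based half-open coordinates, and hypothetical helper names.

```python
from itertools import product

REF = "ACGT"
# Two heterozygous SNV calls without phase: (start, stop, ref_allele, alt_allele)
calls = [(1, 2, "C", "A"), (2, 3, "G", "C")]

def apply_snvs(ref, variants):
    """Apply same-length substitutions to ref."""
    seq = list(ref)
    for start, stop, ref_allele, alt in variants:
        assert ref[start:stop] == ref_allele
        seq[start:stop] = alt
    return "".join(seq)

def possible_diplotypes(ref, het_calls):
    """Enumerate the unordered haplotype pairs the unphased calls allow."""
    found = set()
    for assignment in product((0, 1), repeat=len(het_calls)):
        haps = ([], [])
        for variant, hap_index in zip(het_calls, assignment):
            haps[hap_index].append(variant)
        found.add(frozenset(apply_snvs(ref, h) for h in haps))
    return found

for pair in possible_diplotypes(REF, calls):
    print(sorted(pair))
```

Two diplotypes come out: AAGT with ACCT (the variants on different haplotypes) and AACT with ACGT (both on one haplotype). Without phase, a query for the two-base change CG to AC cannot be answered definitively.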
Missing data
G = {(chrZ, 0, 24, =, =), (chrZ, 24, 25, G, G), (chrZ, 25, 53, =, =), (chrZ, 53, 55, NN, NN),
(chrZ, 55, 88, =, =), (chrZ, 88, 92, ATAT, NNNN), (chrZ, 92, 96, =, =), (chrZ, 96, 98, A, ☐),(chrZ, 98, 100, =, =)}
• Q: (chrZ, 54, 55, A, T) ∈ G = False
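One way to see why "False" is an unsatisfying answer here is to contrast plain membership with a check that notices no-calls. The sketch below is an illustration only, not the proposed tool: it returns None (unknown) when the query overlaps an assertion containing N, and it encodes the empty allele ☐ as an empty string, which is an assumption.

```python
G = [
    ("chrZ", 0, 24, "=", "="),
    ("chrZ", 24, 25, "G", "G"),
    ("chrZ", 25, 53, "=", "="),
    ("chrZ", 53, 55, "NN", "NN"),
    ("chrZ", 55, 88, "=", "="),
    ("chrZ", 88, 92, "ATAT", "NNNN"),
    ("chrZ", 92, 96, "=", "="),
    ("chrZ", 96, 98, "A", ""),   # "" stands in for the empty allele
    ("chrZ", 98, 100, "=", "="),
]

def query(genome, q):
    """Return True, False, or None (unknown because of a no-call)."""
    if q in genome:
        return True
    chrom, start, stop = q[:3]
    for g_chrom, g_start, g_stop, *alleles in genome:
        overlaps = g_chrom == chrom and g_start < stop and start < g_stop
        if overlaps and any("N" in allele for allele in alleles):
            return None  # the region was not called, so absence is not evidence
    return False

print(("chrZ", 54, 55, "A", "T") in G)       # False (plain membership)
print(query(G, ("chrZ", 54, 55, "A", "T")))  # None  (overlaps the NN no-call)
```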
Multiple alleles/samples
• Remember our friend:
– REF: TCACACACAG
– H1: T--CACACAG (REF, 1, 3, ☐)
– H2: TCACACA--G (REF, 7, 9, ☐)
Multiple alleles/samples
• What will left-normalizing H2 look like in VCF?
– REF: TCACACACAG
– H2: TCACACA--G (REF, 7, 9, ☐)
– H3: TCACACACTG (REF, 8, 9, T)
– H4: TCACTCACAG (REF, 4, 5, T)
– H5: TTACACACAG (REF, 1, 2, T)
– H1: T--CACACAG (REF, 1, 3, ☐)
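Left-normalizing a deletion can be sketched as repeatedly shifting it one base to the left for as long as the resulting haplotype stays the same. This is a minimal illustration for pure deletions on 0-based half-open intervals, not the normalization procedure of any particular tool; `left_shift_deletion` is a hypothetical name.

```python
REF = "TCACACACAG"

def left_shift_deletion(ref, start, stop):
    """Shift the deletion [start, stop) left while the haplotype is unchanged.

    Deleting [start-1, stop-1) gives the same sequence as deleting
    [start, stop) exactly when ref[start-1] == ref[stop-1].
    """
    while start > 0 and ref[start - 1] == ref[stop - 1]:
        start -= 1
        stop -= 1
    return start, stop

print(left_shift_deletion(REF, 7, 9))  # (1, 3)
```

H2 (REF, 7, 9, ☐) slides all the way to (REF, 1, 3, ☐), the same spelling as H1. Along the way it passes the positions of the SNVs in H3, H4 and H5, which is why normalizing one allele at a time interacts with the other alleles placed in the same region.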
Bottom Line
• Is there a canonical form for sequence assertions?
– If so, then we can normalize our data into that form and rely on simple set-existential queries
– If not, then we need a better model
– In the meantime, we rely on heuristics to perform comparisons and understand that they are imperfect
Better models
• Two basic approaches
1. Standardize alignment and representations so that we can always derive a unique canonical representation
2. Make the comparison model “spelling agnostic”
Reference graph model
• Convert (g)VCF and other file formats into a graph representation
• Compute whether the graph can “generate” the query haplotype or genotype
– Supporting multiple forms of ambiguity that are inherent in the biological questions we ask.
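The idea of asking whether a representation can "generate" a haplotype can be illustrated without a graph. The sketch below is a much simpler stand-in, not the reference graph model itself: it applies a list of assertions to the reference and compares the resulting sequences, so two different spellings match if they produce the same haplotype. The helper `generate` is hypothetical, and the example reuses the ACAC alignment-scoring case from earlier, with ☐ encoded as an empty string.

```python
REF = "ACAC"

def generate(ref, assertions):
    """Apply (start, stop, ref_allele, alt_allele) assertions to ref.

    Coordinates are 0-based half-open. Assertions are applied right to
    left so earlier coordinates stay valid after each edit.
    """
    seq = ref
    for start, stop, ref_allele, alt in sorted(assertions, reverse=True):
        assert seq[start:stop] == ref_allele, "assertion disagrees with reference"
        seq = seq[:start] + alt + seq[stop:]
    return seq

spelling_1 = [(1, 1, "", "GG"), (1, 3, "CA", "")]  # insertion plus deletion
spelling_2 = [(1, 3, "CA", "GG")]                  # single substitution

print(generate(REF, spelling_1), generate(REF, spelling_2))  # AGGC AGGC
print(spelling_1 == spelling_2)                              # False
```

Comparing generated sequences is "spelling agnostic" for a single fully specified haplotype; handling phase uncertainty, no-calls and genotypes is what motivates a graph rather than a single string.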
Phase constraint
Related Problems
• What are all of the differences between two genomes?
• Collect all alleles observed across multiple genomes
• Merge genomes into a single coherent representation
• Efficiently store and query a large number of genomes
Implementation plan
• Build a reference implementation
– Open source, free, and hosted by GA4GH
– Built in Python + Cython
– Include an extensive test suite
• Not inventing any new file formats
• Implementation underway
– VCF processor built on htslib
– Rest of the engine in progress
– Accounting and testing coming soon after
Thanks to:
• Justin Zook and the other GIAB organizers
• Geneticists, who have been doing this right all along
• Complete Genomics for their calldiff algorithm
• Great discussions and debates with friends and colleagues at NCI, NCBI, Invitae, 23andMe, 1000 Genomes, GA4GH, etc.