Jan2015 ga4 gh variant comparison

GA4GH work towards a standardized variant comparison tool

Kevin Jacobs

1/30/2015

Simple Question

Is a variant present within a genome?

Simple Question?


• sequence at a location:

– allele

– haplotype

– genotype

• something like a

– VCF genotype

– HGVS string

– dbSNP/ClinVar/HGMD entry

Simple Question?


• Collection of variants (and reference) in

– VCF/gVCF/BCF file

– var/MasterVar file

– dbSNP/ClinVar/HGMD/etc.

– your fancy new file format

– your fancy new database

Problem Statement


Surely this must be a solved problem?

• “Sometimes the questions are complicated and the answers are simple.” (Dr. Seuss)

Is this a simple question?

• It also depends on how we define…

– variant, genome, location, genotype, present

• Can we answer this question?

– Is the location well defined?

– Did we observe reads at that location?

– Could we infer a single most-likely genotype at that location?

– Are we asking about “simple” variation in a “nice” region of the genome?

• If yes to all of these, then we can almost always answer our question correctly.

Don’t Panic!

Why is this so hard?

• Consider c.2_4delCTAinsGC

– REF: ACTAC

– H1: =G-C=

• It can also be spelled

– c.[2C>G; 3del; 4A>C]

– c.[2C>G; 3T>C; 4del]

– c.[2del; 3T>G; 4A>C]

– …
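The equivalence of these spellings can be checked by applying each one to REF and comparing the resulting sequences. The sketch below is a minimal illustration, not an HGVS parser: the `apply_edits` helper and its tuple encoding (1-based position, reference bases, replacement bases) are assumptions made for this example.

```python
def apply_edits(ref, edits):
    """Apply non-overlapping edits to ref and return the new sequence.

    Each edit is (pos, ref_bases, alt_bases) with a 1-based position,
    loosely mirroring the spellings above. An empty alt_bases string
    means the reference bases are deleted.
    """
    out = []
    cursor = 0  # 0-based index of the next unconsumed reference base
    for pos, ref_bases, alt_bases in sorted(edits):
        start = pos - 1
        if ref[start:start + len(ref_bases)] != ref_bases:
            raise ValueError(f"reference mismatch at position {pos}")
        out.append(ref[cursor:start])
        out.append(alt_bases)
        cursor = start + len(ref_bases)
    out.append(ref[cursor:])
    return "".join(out)


REF = "ACTAC"
spellings = {
    "c.2_4delCTAinsGC": [(2, "CTA", "GC")],
    "c.[2C>G; 3del; 4A>C]": [(2, "C", "G"), (3, "T", ""), (4, "A", "C")],
    "c.[2C>G; 3T>C; 4del]": [(2, "C", "G"), (3, "T", "C"), (4, "A", "")],
    "c.[2del; 3T>G; 4A>C]": [(2, "C", ""), (3, "T", "G"), (4, "A", "C")],
}
haplotypes = {name: apply_edits(REF, edits) for name, edits in spellings.items()}
```

Every entry in `haplotypes` is the same sequence, AGCC, even though no two spellings share a single identical edit.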

Assumptions and notation

• We have an accurate reference genome sequence

• Queries are relative to well-defined non-ambiguous regions of the reference sequence

• Simple sequence query / assertion:

– VCF: (chrom, pos, ref, alts, geno)

• E.g. (chrZ, 55, A, G, 1/1)

– Generic: (chrom, start, stop, alleles)

• E.g. (chrZ, 54, 55, G, G)

– These representations are equivalent modulo some strange encoding rules for VCF relating to null alleles
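The mapping between the two forms can be sketched as follows. The function name `vcf_to_generic` is hypothetical, and the coordinate convention (1-based VCF position, 0-based half-open generic interval) is inferred from the example above. No-calls and the VCF padding-base rules for indels are deliberately left out, since those are the "strange encoding rules" this sketch does not attempt to cover.

```python
def vcf_to_generic(chrom, pos, ref, alts, geno):
    """Convert a simple VCF-style call to (chrom, start, stop, *alleles).

    Assumes a 1-based VCF position and a genotype string such as "0/1"
    or "1|1". Missing alleles (".") are not handled.
    """
    choices = [ref] + list(alts)  # allele index 0 is the reference
    separator = "|" if "|" in geno else "/"
    alleles = tuple(choices[int(index)] for index in geno.split(separator))
    start = pos - 1
    return (chrom, start, start + len(ref)) + alleles


call = vcf_to_generic("chrZ", 55, "A", ["G"], "1/1")
```

Here `call` is `("chrZ", 54, 55, "G", "G")`, matching the generic example above.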

Most basic model

• A “genome” G is a set of sequence assertions

• A “query” is a proposition q∈G where q is a sequence assertion

• E.g.

– G = { (chrZ, 55, G, G) }

– Q1 : (chrZ, 55, A, G) ∈ G = False

– Q2 : (chrZ, 55, G, G) ∈ G = True
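In code, this model is nothing more than a set and a membership test. The sketch below uses the same shortened tuples as the example. The helper names are hypothetical, and sorting the alleles so that an unphased G/A equals A/G is an assumption added for this illustration.

```python
def assertion(chrom, pos, *alleles):
    """Build a hashable sequence assertion.

    Alleles are sorted so that an unphased G/A call equals A/G; that
    ordering rule is an assumption of this sketch, not part of the model.
    """
    return (chrom, pos) + tuple(sorted(alleles))


def is_present(genome, query):
    """A query is a plain set-membership test against the genome."""
    return query in genome


G = {assertion("chrZ", 55, "G", "G")}
Q1 = is_present(G, assertion("chrZ", 55, "A", "G"))  # False
Q2 = is_present(G, assertion("chrZ", 55, "G", "G"))  # True
```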

Basic model extensions

• Simple extensions

– Indels / MNVs

– Reference calls (like gVCF)

– No calls, partial calls

– Arbitrary ploidy

– Phase, quality, filters, etc. (not shown)

G = {(chrZ, 0, 24, =, =), (chrZ, 24, 25, G, G), (chrZ, 25, 53, =, =), (chrZ, 53, 55, NN, NN), (chrZ, 55, 88, =, =), (chrZ, 88, 92, ATAT, NNNN), (chrZ, 92, 96, =, =), (chrZ, 96, 98, A, ☐), (chrZ, 98, 100, =, =)}

Limitations of the basic model

• Sequence assertions do not have unique representations

– Alignments are not unique

– Alignment models differ

– Nearby variants / phase information

– Missing data and uncertainty

• Sometimes we aren’t asking the right question

Alignments are not unique

• Precedence of insertions, deletions and mismatches:

– REF: ACAC

– H1: =-G= (AGC)

– H2: =G-= (AGC)

Limitations of the basic model

• Sequence assertions do not have unique representations

– REF: TCACACACAG

– H1: T--CACACAG (REF, 1, 3, ☐)

– H2: TC--ACACAG (REF, 2, 4, ☐)

– H3: TCA--CACAG (REF, 3, 5, ☐)

– H4: TCAC--ACAG (REF, 4, 6, ☐)

– H5: TCACA--CAG (REF, 5, 7, ☐)

– H6: TCACAC--AG (REF, 6, 8, ☐)

– H7: TCACACA--G (REF, 7, 9, ☐)
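All seven spellings delete two bases from the same CA repeat, so they describe one haplotype. The sketch below checks that and collapses them with left-shifting, a common normalization heuristic. It is offered as an illustration rather than as the method used by any particular tool, and the function names are hypothetical. Intervals are 0-based and half-open, as in the tuples above.

```python
def delete(ref, start, stop):
    """Return ref with the bases in [start, stop) removed."""
    return ref[:start] + ref[stop:]


def left_shift_deletion(ref, start, stop):
    """Slide a deletion as far left as possible without changing the result.

    Moving the interval one base left preserves the haplotype exactly when
    the base entering on the left equals the base leaving on the right.
    """
    while start > 0 and ref[start - 1] == ref[stop - 1]:
        start -= 1
        stop -= 1
    return start, stop


REF = "TCACACACAG"
spellings = [(start, start + 2) for start in range(1, 8)]  # H1 .. H7

sequences = {delete(REF, start, stop) for start, stop in spellings}
normalized = {left_shift_deletion(REF, start, stop) for start, stop in spellings}
```

`sequences` contains the single sequence TCACACAG, and `normalized` contains the single interval (1, 3), which is the H1 spelling.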

Alignment models differ

• Different alignment scoring:

– REF: A--CAC

– H1: =GG--= (REF, 1, 1, ☐, GG) (REF, 1, 3, CA, ☐)

– H2: =--GG= (REF, 1, 3, CA, GG)

• Base-quality-aware alignment algorithms are even more susceptible to non-unique alignments

Ignoring phase or phase uncertainty introduces ambiguity

– REF: ACGT

– H1: =A== (REF, 1, 2, C, A)

– H2: ==C= (REF, 2, 3, G, C)

• Vs

– REF: ACGT

– H1: =AC= (REF, 1, 2, C, A) (REF, 2, 3, G, C)

– H2: ====

• Vs

– REF: ACGT

– H1: =AC= (REF, 1, 3, CG, AC)

– H2: ====
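The three cases can be compared by building both haplotypes for each. In the sketch below the helper names and the representation of a genotype as a sorted pair of sequences are assumptions made for illustration. It handles same-length substitutions only, which is all these examples need.

```python
def substitute(ref, edits):
    """Apply same-length substitutions given as (start, stop, ref, alt)."""
    seq = list(ref)
    for start, stop, ref_bases, alt_bases in edits:
        if ref[start:stop] != ref_bases or len(ref_bases) != len(alt_bases):
            raise ValueError("edit does not match the reference")
        seq[start:stop] = alt_bases
    return "".join(seq)


def genotype(ref, hap1_edits, hap2_edits):
    """Unordered pair of haplotype sequences."""
    pair = (substitute(ref, hap1_edits), substitute(ref, hap2_edits))
    return tuple(sorted(pair))


REF = "ACGT"
snv_a = (1, 2, "C", "A")
snv_b = (2, 3, "G", "C")
mnv = (1, 3, "CG", "AC")

trans = genotype(REF, [snv_a], [snv_b])        # one SNV on each haplotype
cis_split = genotype(REF, [snv_a, snv_b], [])  # both SNVs on H1
cis_joint = genotype(REF, [mnv], [])           # the same H1 written as one MNV

# With phase dropped, the first two cases collapse to the same set of
# heterozygous assertions even though their haplotypes differ.
unphased_trans = {snv_a} | {snv_b}
unphased_cis = {snv_a, snv_b}
```

The second and third cases produce identical haplotype pairs despite different spellings, while the first and second produce different haplotype pairs despite identical unphased assertions.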

Missing data

G = {(chrZ, 0, 24, =, =), (chrZ, 24, 25, G, G), (chrZ, 25, 53, =, =), (chrZ, 53, 55, NN, NN), (chrZ, 55, 88, =, =), (chrZ, 88, 92, ATAT, NNNN), (chrZ, 92, 96, =, =), (chrZ, 96, 98, A, ☐), (chrZ, 98, 100, =, =)}

• Q: (chrZ, 54, 55, A, T) ∈ G = False, although (chrZ, 53, 55, NN, NN) is a no-call, so the real answer is unknown
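One way to keep a no-call from being reported as an absence is to let the query return a third value. The sketch below is a hypothetical illustration of that idea, not the behavior of any existing tool. It matches assertions exactly, treats any overlapping record with an N allele as unknown, writes the empty allele ☐ as an empty string, and does not expand the "=" reference calls.

```python
from typing import Optional

EMPTY = ""  # stands in for the empty allele written as ☐ above

GENOME = [
    ("chrZ", 0, 24, "=", "="),
    ("chrZ", 24, 25, "G", "G"),
    ("chrZ", 25, 53, "=", "="),
    ("chrZ", 53, 55, "NN", "NN"),
    ("chrZ", 55, 88, "=", "="),
    ("chrZ", 88, 92, "ATAT", "NNNN"),
    ("chrZ", 92, 96, "=", "="),
    ("chrZ", 96, 98, "A", EMPTY),
    ("chrZ", 98, 100, "=", "="),
]


def query(genome, q) -> Optional[bool]:
    """Return True, False, or None (unknown) for a sequence assertion."""
    if q in genome:
        return True
    chrom, start, stop = q[:3]
    for rec_chrom, rec_start, rec_stop, *alleles in genome:
        overlaps = rec_chrom == chrom and rec_start < stop and start < rec_stop
        if overlaps and any("N" in allele for allele in alleles):
            return None  # a no-call or partial call covers the query
    return False


answer = query(GENOME, ("chrZ", 54, 55, "A", "T"))
```

For the query above, `answer` is `None` rather than `False`, because the only record covering position 54 is a no-call.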

Multiple alleles/samples

• Remember our friend:

– REF: TCACACACAG

– H1: T--CACACAG (REF, 1, 3, ☐)

– H2: TCACACA--G (REF, 7, 9, ☐)

Multiple alleles/samples

• What will left-normalizing H2 look like in VCF?

– REF: TCACACACAG

– H2: TCACACA--G (REF, 7, 9, ☐)

– H3: TCACACACTG (REF, 8, 9, T)

– H4: TCACTCACAG (REF, 4, 5, T)

– H5: TTACACACAG (REF, 1, 2, T)

– H1: T--CACACAG (REF, 1, 3, ☐)

Bottom Line

• Is there a canonical form for sequence assertions?

– If so, then we can normalize our data into that form and rely on simple set-existential queries

– If not, then we need a better model

– In the meantime, we rely on heuristics to perform comparisons and understand that they are imperfect

Better models

• Two basic approaches

1. Standardize alignment and representations so that we can always derive a unique canonical representation

2. Make the comparison model “spelling agnostic”
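As a rough illustration of the second approach, two spellings can be compared by the sequence they produce rather than by their edits. The sketch below is one simple way to do that over a short reference window and is not the design of the planned engine. The helper names are hypothetical, and edits use the (start, stop, ref, alt) tuples from the earlier slides, with the empty allele written as an empty string.

```python
def apply_edits(ref, edits):
    """Apply non-overlapping (start, stop, ref_bases, alt_bases) edits.

    Intervals are 0-based and half-open; an insertion has start == stop.
    """
    out, cursor = [], 0
    for start, stop, ref_bases, alt_bases in sorted(edits):
        if start < cursor or ref[start:stop] != ref_bases:
            raise ValueError("overlapping edit or reference mismatch")
        out += [ref[cursor:start], alt_bases]
        cursor = stop
    return "".join(out) + ref[cursor:]


def same_haplotype(ref, spelling_a, spelling_b):
    """True when both spellings generate the same sequence from ref."""
    return apply_edits(ref, spelling_a) == apply_edits(ref, spelling_b)


REF = "ACAC"
two_events = [(1, 1, "", "GG"), (1, 3, "CA", "")]  # insertion plus deletion
one_event = [(1, 3, "CA", "GG")]                   # single substitution block

equivalent = same_haplotype(REF, two_events, one_event)
```

`equivalent` is `True`: both spellings yield AGGC, so a comparison made at the sequence level does not depend on which alignment model produced the calls.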

Reference graph model

• Convert (g)VCF and other file formats into a graph representation

• Compute whether graph can “generate” the query haplotype or genotype

– Supporting multiple forms of ambiguity that are inherent in the biological questions we ask.
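A toy version of the "generate" test is sketched below, under strong simplifying assumptions: a single short reference, variants given as hypothetical (start, stop, alt) edges over 0-based half-open intervals, no pure insertions, and no phase constraints, so any combination of variant edges is allowed. It is meant to convey the idea, not to describe the reference implementation.

```python
from functools import lru_cache


def build_graph(ref, variants):
    """Map each reference position to its outgoing (next_node, label) edges."""
    edges = {i: [(i + 1, ref[i])] for i in range(len(ref))}
    for start, stop, alt in variants:
        if not 0 <= start < stop <= len(ref):
            raise ValueError("this sketch supports deletions and substitutions only")
        edges[start].append((stop, alt))
    return edges


def can_generate(ref, variants, query):
    """True if some path through the graph spells out the query sequence."""
    edges = build_graph(ref, variants)
    end = len(ref)

    @lru_cache(maxsize=None)
    def walk(node, offset):
        if node == end:
            return offset == len(query)
        return any(
            query.startswith(label, offset) and walk(nxt, offset + len(label))
            for nxt, label in edges[node]
        )

    return walk(0, 0)


REF = "TCACACACAG"
VARIANTS = [(7, 9, ""), (4, 5, "T")]  # the H2 deletion and the H4 SNV

found = can_generate(REF, VARIANTS, "TCACACAG")
```

`found` is `True` for the deleted haplotype whichever of the seven spellings the caller had in mind, because the query is a sequence rather than an edit.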

Phase constraint

Related Problems

• What are all of the differences between two genomes?

• Collect all alleles observed across multiple genomes

• Merge genomes into a single coherent representation

• Efficiently store and query a large number of genomes

Implementation plan

• Build a reference implementation

– Open source, free, and hosted by GA4GH

– Built in Python + Cython

– Include an extensive test suite

• Not inventing any new file formats

• Implementation underway

– VCF processor built on htslib

– Rest of the engine in progress

– Accounting and testing coming soon after

Thanks to:

• Justin Zook and the other GIAB organizers

• Geneticists, who have been doing this right all along

• Complete Genomics for their calldiff algorithm

• Great discussions and debates with friends and colleagues at NCI, NCBI, Invitae, 23andMe, 1000 Genomes, GA4GH, etc.
