Computational Problems in Perfect Phylogeny Haplotyping: Xor-Genotypes and Tag SNPs
Tamar Barzuza (1), Jacques S. Beckmann (2,3), Ron Shamir (4), Itsik Pe’er (5)
(1) Computer Science and Applied Mathematics, Weizmann Institute of Science
(2) Molecular Genetics, Weizmann Institute of Science
(3) Génétique Médicale, Universitätsspital Lausanne
(4) School of Computer Science, Tel-Aviv University
(5) Medical and Population Genetics Group, Broad Institute
Overview
  Introduction
  Xor PPH
    Theoretical outlines and results
    Experimental results
  Informative SNPs
    Theoretical results
  Summary and Future research
Chromosomes
AATATATCGCTATCCGTATACCTAATTGGGGGTGTGTGTACGTAATGCTAGCACGCGCGCCAGGATTAGCTGCCACA
AATATATCGCTTTCCGTATACCTAATTTGGGGTGTGTGTACGTAATGCTAGCACGCGCGCCAGGATTAGCTGCCACA
AATATATCGCTTTCCGTATACCTAATTTGGGGTGTGTGTACGTACTGCTAGCACGCGCGCCAGGATTAGCTGCCACA
AATATATCGCTATCCGTATACCTAATTTGGGGTGTGTGTACGTACTGCTAGCACGCGCGCTAGGATTAGCTGCCACA
AATATATCGCTATCCGTATACCTAATTGGGGGTGTGTGTACGTACTGCTAGCACGCGCGCTAGGATTAGCTGCCACA
AATATATCGCTATCCGTATACCTAATTGGGGGTGTGTGTACGTACTGCTAGCACGCGCGCTAGGATTAGCTGCCACA
SNP – Single nucleotide polymorphism
Alleles at the four variable sites, one row per SNP, read across the six chromosomes:
ATTAAA
GTTTGG
AACCCC
CCCTTT
Haplotypes, Genotypes and XOR-Genotypes
SNP:           1    2    3    4
Haplotypes:    A    G    A    C
               T    T    A    C
Genotype:      A/T  T/G  A    C
XOR-Genotype:  Het  Het  Hom  Hom
Binary encoding of the SNPs across the six chromosomes:
SNP 1:  ATTAAA  100111
SNP 2:  GTTTGG  100011
SNP 3:  AACCCC  001111
SNP 4:  CCCTTT  111000
Haplotypes, Genotypes and XOR-Genotypes (binary encoding)
SNP:           1 2 3 4
Haplotypes:    1 1 0 1
               0 0 0 1
Genotype:      2 2 0 1   (2 marks a heterozygous SNP)
XOR-Genotype:  {1, 2}
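The encoding on this slide can be sketched in a few lines of Python. The function names `genotype` and `xor_genotype` are illustrative choices, not names from the talk; the 0/1/2 genotype coding follows the example above.

```python
def genotype(h1, h2):
    """Per-SNP genotype: the shared allele (0 or 1) when homozygous, 2 when heterozygous."""
    return [a if a == b else 2 for a, b in zip(h1, h2)]


def xor_genotype(h1, h2):
    """Set of 1-based SNP indices at which the two haplotypes differ (the het SNPs)."""
    return {i for i, (a, b) in enumerate(zip(h1, h2), start=1) if a != b}


# The haplotype pair shown on the slide.
h1 = [1, 1, 0, 1]
h2 = [0, 0, 0, 1]
print(genotype(h1, h2))      # [2, 2, 0, 1]
print(xor_genotype(h1, h2))  # {1, 2}
```

The xor-genotype keeps only the positions of the 2s, so it carries strictly less information than the genotype.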
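Perfect Phylogeny
Haplotypes (rows) over SNPs 1-5 (columns):
1 0 0 0 0
1 0 0 1 0
1 0 1 0 0
1 1 0 0 0
1 0 0 1 1
0 0 0 1 0
[Figure: perfect phylogeny tree on these haplotypes; each edge is labeled by the single SNP that mutates along it]
1 0 0 1 0
  4: 1→0   1 0 0 0 0
    2: 0→1   1 1 0 0 0
    3: 0→1   1 0 1 0 0
  1: 1→0   0 0 0 1 0
  5: 0→1   1 0 0 1 1
[Figure: the same tree shown with SNP labels only]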
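Whether a set of binary haplotypes admits a perfect phylogeny can be checked with the four-gamete test, a standard characterization that the slides do not spell out. The sketch below is a quadratic-in-SNPs check under that assumption, with `is_perfect_phylogeny` as an illustrative name; it is not one of the algorithms cited in the talk.

```python
from itertools import combinations


def is_perfect_phylogeny(haplotypes):
    """Four-gamete test: no pair of SNPs may show all of 00, 01, 10 and 11."""
    num_snps = len(haplotypes[0])
    for i, j in combinations(range(num_snps), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            return False
    return True


# The six haplotypes listed on the slide.
HAPLOTYPES = [
    [1, 0, 0, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
]
print(is_perfect_phylogeny(HAPLOTYPES))                        # True
print(is_perfect_phylogeny([[0, 0], [0, 1], [1, 0], [1, 1]]))  # False
```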
Previous work
Haplotyping: haplotypes from genotypes
Input: Genotypes $G=\{G_1,\dots,G_n\}$ on SNPs $S=\{s_1,\dots,s_m\}$
Output: Find the haplotypes $H=\{H_1,\dots,H_{2n}\}$ that gave rise to $G$
General heuristics:
  Clark ’90
  Excoffier+Slatkin ’95
PPH: Perfect phylogeny haplotyping ($n$ genotypes, $m$ SNPs):
  Gusfield 2002       $O(nm\,\alpha(n,m))$
  Bafna et al. 2002   $O(nm^2)$
  Eskin et al. 2003   $O(nm^2)$
Graph Realization
Previous work:
  Tutte 1959               $O(n^2 m)$
  Gavril and Tamari 1983   $O(nm^2)$
  Bixby and Wagner 1988    $O(nm\,\alpha(n,m))$
The graph realization problem:
Input: A hypergraph $H=(\{1,\dots,m\}, P)$, $P=\{P_1,P_2,\dots,P_n\}$, $P_i \subseteq \{1,\dots,m\}$
Goal: A tree $T=(V,E)$ whose edges are labeled by $\{1,\dots,m\}$ s.t. each $P_i$ labels a path in $T$
Example:
Input: { {1,2}, {2,3} }
Output: [Figure: trees with edges labeled 1, 2, 3 in which {1,2} and {2,3} each form a path]
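To make the goal concrete, here is a small verifier that checks whether a given edge-labeled tree realizes a hypergraph. It only checks a candidate solution; it does not construct one, and it is not any of the cited algorithms. The representation (a dict from edge label to endpoint pair) and the names `labels_a_path` and `realizes` are assumptions made for this sketch.

```python
from collections import defaultdict


def labels_a_path(tree_edges, labels):
    """True if the tree edges named in `labels` form one simple path."""
    degree = defaultdict(int)
    adjacency = defaultdict(list)
    for label in labels:
        u, v = tree_edges[label]
        degree[u] += 1
        degree[v] += 1
        adjacency[u].append(v)
        adjacency[v].append(u)
    # A path has no branching node and exactly two end nodes.
    if any(d > 2 for d in degree.values()):
        return False
    ends = [node for node, d in degree.items() if d == 1]
    if len(ends) != 2:
        return False
    # Walk from one end to confirm the chosen edges are connected.
    seen, stack = {ends[0]}, [ends[0]]
    while stack:
        node = stack.pop()
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == len(degree)


def realizes(tree_edges, hyperedges):
    """True if every hyperedge labels a path in the tree."""
    return all(labels_a_path(tree_edges, p) for p in hyperedges)


# Two candidate trees for the slide's input { {1,2}, {2,3} }.
path_tree = {1: ("a", "b"), 2: ("b", "c"), 3: ("c", "d")}
star_tree = {1: ("c", "x"), 2: ("c", "y"), 3: ("c", "z")}
print(realizes(path_tree, [{1, 2}, {2, 3}]))  # True
print(realizes(star_tree, [{1, 2}, {2, 3}]))  # True
print(realizes(path_tree, [{1, 3}]))          # False
```

Both trees realize the example input, which illustrates why counting the number of realizations matters later in the talk.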
XPPH - Xor perfect phylogeny haplotyping
Xor-haplotyping: haplotypes from xor-genotypes
Input: 1. Xor-genotype data (can be obtained by DHPLC)
       2. Three genotypes
Goal: Resolve the haplotypes and their perfect phylogeny
Xor-genotypes   Genotypes             Haplotypes
{1, 2}          0/1  0/1  0    1      1 1 0 1 / 0 0 0 1
{2, 4}          0    0/1  0    0/1    0 1 0 1 / 0 0 0 0
{2, 3, 4}       0    0/1  0/1  0/1    0 1 1 1 / 0 0 0 0
{1, 2, 4}       0/1  0/1  0    0/1    1 1 0 1 / 0 0 0 0
{1}             0/1  1    0    1      1 1 0 1 / 0 1 0 1
XPPH - Xor perfect phylogeny haplotyping
Strategy:
  1. Input: Xor-genotype data
     Goal: Find the perfect phylogeny
  2. Additional input: 3 genotypes
     Goal: Find haplotypes
Step 1:
  Xor-genotype = {Het SNPs} = A path in the perfect phylogeny
  Build a tree from its paths ⇒ Graph realization
  Input reduction: Merge SNPs that are equivalent in the xor-data
  Proof: Unique graph realization solution ⇒ A perfect phylogeny
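The input-reduction step can be sketched as follows, assuming that two SNPs are "equivalent in the xor-data" when they appear in exactly the same xor-genotypes. That reading, and the function name `merge_equivalent_snps`, are assumptions of this sketch.

```python
from collections import defaultdict


def merge_equivalent_snps(xor_genotypes, num_snps):
    """Group SNPs (1-based) that occur in exactly the same xor-genotypes."""
    groups = defaultdict(list)
    for snp in range(1, num_snps + 1):
        # Signature = indices of the individuals whose xor-genotype contains this SNP.
        signature = frozenset(
            i for i, xor in enumerate(xor_genotypes) if snp in xor
        )
        groups[signature].append(snp)
    return sorted(groups.values())


# The xor-genotypes from the XPPH example: no two SNPs are equivalent.
xor_data = [{1, 2}, {2, 4}, {2, 3, 4}, {1, 2, 4}, {1}]
print(merge_equivalent_snps(xor_data, 4))  # [[1], [2], [3], [4]]

# A toy input in which SNPs 2 and 3 always occur together.
print(merge_equivalent_snps([{1, 2, 3}, {2, 3}, {1}], 3))  # [[1], [2, 3]]
```

Each merged group can then be treated as a single edge when the tree is built from its paths.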
GREAL
We implemented Gavril & Tamari’s algorithm (’83) for graph realization: $O(m^2 n)$
  Find graph realization or determine that none exists
  Count the number of graph realization solutions for the data
  Stable and fast
  Available at http://www.cs.tau.ac.il/~rshamir/greal/
Simulations
  Simulate data of $n$ individuals using Hudson 2002
  Remove all SNPs with <5% minor allele frequency
  Apply GREAL: Is there a single solution?
  Repeat 5000 times for each $n$
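The minor-allele-frequency filter in the simulation pipeline might look like the sketch below. It covers only the filtering step on a 0/1 haplotype matrix; the coalescent simulation and the GREAL call are not reproduced, and `filter_by_maf` is a hypothetical helper name.

```python
def filter_by_maf(haplotypes, threshold=0.05):
    """Drop SNP columns whose minor allele frequency is below `threshold`.

    Returns the filtered haplotypes and the 0-based indices of the kept columns.
    """
    n = len(haplotypes)
    kept = []
    for j in range(len(haplotypes[0])):
        ones = sum(h[j] for h in haplotypes)
        maf = min(ones, n - ones) / n
        if maf >= threshold:
            kept.append(j)
    return [[h[j] for j in kept] for h in haplotypes], kept


# 40 toy haplotypes: column 0 has MAF 1/40, column 1 has 20/40, column 2 has 3/40.
toy = [[1 if i == 0 else 0, i % 2, 1 if i < 3 else 0] for i in range(40)]
filtered, kept = filter_by_maf(toy)
print(kept)  # [1, 2]
```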
Results
[Figure: The percentage of single solutions vs sample size. Reference shown on the plot: R.H. Chung and D. Gusfield 2003]
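XPPH
Step 2: Perfect phylogeny → ? → Haplotypes
Xor-genotypes: {1, 2}, {1, 3}, {2, 3}
[Figure: the tree on SNPs 1, 2, 3 with two haplotype assignments that are both consistent with the xor-genotypes]
Assignment A    Assignment B
0 0 0           1 0 0
1 1 0           0 1 0
1 0 1           0 0 1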
XPPH
Resolution up to bit flipping: gives the haplotypes structure
Xor-genotypes: {1, 2}, {1, 3}, {2, 3}
Genotype: 1 2 2
Haplotypes:
1 x x
1 x x
0 x x
SNP #1 homozygous ⇒ Can infer SNP #1 for all haplotypes
Need individuals with $\bigcap$ xor-genotypes (= $\bigcap$ {het SNPs}) $= \emptyset$
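The bit-flipping ambiguity, and how one genotype removes it for a homozygous SNP, can be illustrated with the three-haplotype example. In this sketch every pair of haplotypes is treated as one individual, which is an assumption made to keep the example small; the helper names are illustrative.

```python
from itertools import combinations


def xor_genotypes(haplotypes):
    """Xor-genotypes (sets of 1-based het SNPs) of all haplotype pairs."""
    return {
        frozenset(i + 1 for i, (a, b) in enumerate(zip(h1, h2)) if a != b)
        for h1, h2 in combinations(haplotypes, 2)
    }


def flip_snp(haplotypes, snp):
    """Complement the given 1-based SNP in every haplotype."""
    return [[1 - v if i == snp - 1 else v for i, v in enumerate(h)] for h in haplotypes]


def genotype(h1, h2):
    """Shared allele when homozygous, 2 when heterozygous."""
    return [a if a == b else 2 for a, b in zip(h1, h2)]


A = [[0, 0, 0], [1, 1, 0], [1, 0, 1]]
B = flip_snp(A, 1)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# Flipping SNP 1 everywhere leaves the xor-genotypes unchanged.
print(xor_genotypes(A) == xor_genotypes(B))  # True

# The individual with xor-genotype {2, 3} is homozygous at SNP 1,
# so its genotype tells the two assignments apart.
print(genotype(A[1], A[2]))  # [1, 2, 2]
print(genotype(B[1], B[2]))  # [0, 2, 2]
```

A genotype fixes only the SNPs at which that individual is homozygous, which is why the chosen individuals must have xor-genotypes with an empty common intersection.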
XPPH
Step 2: Perfect phylogeny → ? → Haplotypes
Theorem: $\bigcap$ xor-genotypes $= \emptyset$ $\Rightarrow$ there are three xor-genotypes with empty intersection
Proof:
  (!) xor-genotypes are tree paths (otherwise: NP-hard)
  (1) The intersection of two tree paths is an interval
  (2) Pick $X_1$ arbitrarily, take $X_1 \cap X_2$, $X_1 \cap X_3$, …, $X_1 \cap X_n$
  (3) $X_L$ ends first, $X_R$ begins last
  $X_1 \cap X_L \cap X_R = \emptyset$
[Figure: the intervals $X_1 \cap X_i$ drawn along the path $X_1$, with $X_L$ and $X_R$ highlighted]
Find 3 individuals to genotype in $O(nm)$
Resolve the haplotypes
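The selection of three individuals can be sketched directly from the proof steps. The sketch assumes that the order of the SNPs of $X_1$ along its tree path is already known from the phylogeny (passed in as `x1_order`) and that at least three xor-genotypes are given; the function name and argument layout are illustrative, not taken from the talk.

```python
def three_with_empty_intersection(xors, x1_order):
    """Return indices of three xor-genotypes with empty intersection, or None.

    xors[0] plays the role of X1; x1_order lists the SNPs of X1 in the order
    they appear along its path in the perfect phylogeny (assumed known).
    """
    position = {snp: k for k, snp in enumerate(x1_order)}
    intervals = []  # (first position, last position, index) of each X1 ∩ Xi
    for idx in range(1, len(xors)):
        hits = [position[s] for s in xors[idx] if s in position]
        if not hits:
            # X1 and Xi are already disjoint; any third individual will do.
            other = next(k for k in range(1, len(xors)) if k != idx)
            return (0, idx, other)
        intervals.append((min(hits), max(hits), idx))
    left = min(intervals, key=lambda t: t[1])   # X_L: its interval ends first
    right = max(intervals, key=lambda t: t[0])  # X_R: its interval begins last
    if left[1] < right[0]:
        return (0, left[2], right[2])
    return None  # all xor-genotypes share at least one SNP


# X1 = {1, 2, 4}; along its tree path the SNPs appear in the order 4, 2, 1.
xors = [{1, 2, 4}, {1, 2}, {2, 4}, {2, 3, 4}, {1}]
chosen = three_with_empty_intersection(xors, [4, 2, 1])
print(chosen)
print(set.intersection(*(xors[i] for i in chosen)))  # set()
```

Once the path order is available, each xor-genotype is scanned once, which is in line with the $O(nm)$ bound stated on the slide.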
Informative SNPs
Informative SNPs (Bafna et al. 2003):
Input:
  1. Haplotypes $H=\{H_1,\dots,H_n\}$ on SNPs $S=\{s_1,\dots,s_m\}$
  2. A set of interesting SNPs $S'' \subseteq S$
Output: Minimal set $S' \subseteq S \setminus S''$ that distinguishes the same haplotypes as $S''$
Example haplotypes (rows) over SNPs 1-5 (columns):
1 0 0 0 0
0 0 1 0 0
0 0 0 1 1
0 1 0 1 0
Not perfect phylogeny: NP-hard (MINIMUM TEST SET)
Perfect phylogeny, 1 interesting SNP: $O(nm)$, Bafna et al. 2003
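The definition can be made concrete with a small checker. It reads "distinguishes the same haplotypes" as: every pair of haplotypes that differs on some interesting SNP also differs on some SNP of the candidate set. That reading, and the names `distinguished_pairs` and `is_informative`, are assumptions of this sketch.

```python
from itertools import combinations


def distinguished_pairs(haplotypes, snps):
    """Pairs of haplotype indices that differ on at least one of the 1-based SNPs."""
    return {
        (a, b)
        for a, b in combinations(range(len(haplotypes)), 2)
        if any(haplotypes[a][s - 1] != haplotypes[b][s - 1] for s in snps)
    }


def is_informative(haplotypes, candidate, interesting):
    """True if `candidate` separates every pair that `interesting` separates."""
    return distinguished_pairs(haplotypes, interesting) <= distinguished_pairs(
        haplotypes, candidate
    )


# The example matrix from the slide, with SNP 5 taken as the interesting SNP.
H = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 1, 0, 1, 0],
]
print(is_informative(H, {4}, {5}))     # False
print(is_informative(H, {2, 4}, {5}))  # True
```

SNP 4 alone fails because the third and fourth haplotypes agree on it, while SNP 5 separates them.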
Informative SNPs
Informative SNPs (generalization of the previous definition):
Input:
  1. Haplotypes $H=\{H_1,\dots,H_n\}$ on SNPs $S=\{s_1,\dots,s_m\}$
  2. A set of interesting SNPs $S'' \subseteq S$
  3. A perfect phylogeny for $H$
  4. A cost function $C: S \to \mathbb{R}^+$
Output: $S' \subseteq S \setminus S''$ with minimal cost that distinguishes the same haplotypes as $S''$
Example haplotypes (rows) over SNPs 1-5 (columns):
1 0 0 0 0
0 0 1 0 0
0 0 0 1 1
0 1 0 1 0
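For tiny inputs the cost-based definition can be explored by exhaustive search. The sketch below is exponential in the number of SNPs and ignores the perfect phylogeny entirely, so it is only a reference for checking small cases and not the dynamic programming algorithm of the talk. The name `min_cost_informative` and the dict-based cost function are assumptions.

```python
from itertools import combinations


def min_cost_informative(haplotypes, interesting, cost):
    """Brute-force search for a cheapest informative set; returns (cost, set) or None."""
    pairs = list(combinations(haplotypes, 2))

    def separated(snps):
        # Indices of haplotype pairs that differ on at least one 1-based SNP.
        return {
            k for k, (a, b) in enumerate(pairs)
            if any(a[s - 1] != b[s - 1] for s in snps)
        }

    target = separated(interesting)
    num_snps = len(haplotypes[0])
    candidates = [s for s in range(1, num_snps + 1) if s not in interesting]
    best = None
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            if target <= separated(subset):
                total = sum(cost[s] for s in subset)
                if best is None or total < best[0]:
                    best = (total, set(subset))
    return best


H = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 1, 0, 1, 0],
]
print(min_cost_informative(H, {5}, {1: 1, 2: 1, 3: 1, 4: 1}))  # (2, {2, 4})
print(min_cost_informative(H, {5}, {1: 1, 2: 1, 3: 1, 4: 5}))  # (3, {1, 2, 3})
```

Raising the cost of SNP 4 changes the optimum, which is the situation the cost function is meant to capture.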
We find informative SNPs set
  Of minimal cost
  For any number of interesting SNPs
  In $O(m)$
  By a dynamic programming algorithm that climbs up the perfect phylogeny tree
We prove that the definition of informative SNPs generalizes to a more practical definition
  Under the perfect phylogeny model, informative SNPs on genotypes and haplotypes are equivalent
Summary
Xor-haplotyping:
  Definition
  Resolve haplotypes given xor-data and 3 genotypes in $O(nm\,\alpha(m,n))$
  Implementation
  Experimental results
Selection of tag SNPs:
  Generalize to
    arbitrary cost
    many interesting SNPs
  Find optimal informative SNPs set in $O(m)$ time
  Combinatorial observation allows practical uses
Future research
  Relax the strong assumption of perfect phylogeny
  Deal with data errors and missing data
  Obtain empirical results for the theoretical work on informative SNPs
    Preliminary results show that blocks of up to 600 SNPs are distinguishable by ~20 informative SNPs
Theorem: All genotypes are distinct within a block
Proof: Assume to the contrary equivalency of two:
[Figure: two haplotype pairs (Haplotype Pair 1, Haplotype Pair 2) and the genotypes they produce (Genotype 1, Genotype 2), compared SNP by SNP]
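The statement can be checked empirically on small examples. The sketch below enumerates every unordered pair of haplotypes, including a haplotype paired with itself, and tests whether any two pairs give the same genotype; it uses the six perfect-phylogeny haplotypes from the introduction and, as a contrast, a set that violates the four-gamete condition. The names are illustrative, and the check is a sanity test rather than a proof.

```python
from itertools import combinations_with_replacement


def genotype(h1, h2):
    """Shared allele when homozygous, 2 when heterozygous."""
    return tuple(a if a == b else 2 for a, b in zip(h1, h2))


def genotypes_all_distinct(haplotypes):
    """True if no two different haplotype pairs produce the same genotype."""
    seen = set()
    for h1, h2 in combinations_with_replacement(haplotypes, 2):
        g = genotype(h1, h2)
        if g in seen:
            return False
        seen.add(g)
    return True


perfect = [
    [1, 0, 0, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
]
not_perfect = [[0, 0], [1, 1], [0, 1], [1, 0]]
print(genotypes_all_distinct(perfect))      # True
print(genotypes_all_distinct(not_perfect))  # False
```

In the second set, the pairs (00, 11) and (01, 10) both yield the genotype (2, 2), which is the kind of collision the perfect phylogeny assumption rules out.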