Yun S. Song, Yufeng Wu and Dan Gusfield University of California, Davis
WABI 2005
Algorithms for Imperfect Phylogeny Haplotyping (IPPH) with a Single Homoplasy or Recombination Event
Yun S. Song, Yufeng Wu and Dan Gusfield
University of California, Davis
Haplotyping Problem
• Diploid organisms have two (not identical) copies of each chromosome.
• A single copy is a haplotype, a vector of Single Nucleotide Polymorphisms (SNPs)
• SNP: a site where two types of nucleotides occur frequently, encoded as 0 or 1
• The mixed description is the genotype, a vector of 0, 1, 2
– If both haplotypes are 0, genotype is 0
– If both haplotypes are 1, genotype is 1
– If one is 0 and the other is 1, genotype is 2
Haplotypes and Genotypes
Sites: 1 2 3 4 5 6 7 8 9
Two haplotypes per individual:
0 1 1 1 0 0 1 1 0
1 1 0 1 0 0 1 0 0
Merge the haplotypes
Genotype for the individual:
2 1 2 1 0 0 1 2 0
• Haplotype Inference (HI) Problem: given a set of n genotypes, infer n haplotype pairs that form the given genotypes
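The merge rule from the slides (0 and 0 gives 0, 1 and 1 gives 1, a mixed site gives 2) can be sketched in a few lines. This is only an illustration; the function name merge_haplotypes is an assumption, not something from the talk.

```python
def merge_haplotypes(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype (hypothetical helper)."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of sites")
    genotype = []
    for a, b in zip(h1, h2):
        # Homozygous sites keep their allele; heterozygous sites are coded 2.
        genotype.append(a if a == b else 2)
    return genotype


# The two haplotypes from the "Haplotypes and Genotypes" slide.
h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
print(merge_haplotypes(h1, h2))  # prints [2, 1, 2, 1, 0, 0, 1, 2, 0]
```

Haplotype inference is the inverse direction: given only the genotype, choose a pair (h1, h2) that merges back to it.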
Perfect Phylogeny Haplotyping (PPH)
• Finding the original haplotypes in nature is hopeless without a genetic model to guide the choice of solution
• Gusfield (2002) introduced the PPH problem
• PPH is to find HI solutions that fit into a perfect phylogeny.
• Nice results for PPH, including a linear-time algorithm
The Perfect Phylogeny Model for Haplotypes
[Figure: a perfect phylogeny over sites 1 2 3 4 5]
Ancestral sequence: 00000
Site mutations on edges (labels 1, 2, 3, 4, 5)
Extant sequences at the leaves: 10100, 10000, 01011, 00010, 01010
The tree derives the set M: 10100, 10000, 01011, 01010, 00010
Assume at most 1 mutation at each site
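Whether a set of 0/1 sequences fits such a tree can be checked without building the tree. With an all-zero ancestral sequence, a perfect phylogeny exists exactly when no pair of sites shows all three of the combinations 01, 10 and 11. The sketch below applies that standard pairwise test to the set M; it is not the linear-time PPH algorithm mentioned in the talk, and the name has_perfect_phylogeny is an assumption.

```python
from itertools import combinations


def has_perfect_phylogeny(rows):
    """Pairwise test for a perfect phylogeny rooted at the all-zero sequence."""
    n_sites = len(rows[0])
    for i, j in combinations(range(n_sites), 2):
        # Collect the (site i, site j) state pairs that occur in the data.
        seen = {(r[i], r[j]) for r in rows}
        # 01, 10 and 11 together would force a second mutation at i or j.
        if {(0, 1), (1, 0), (1, 1)} <= seen:
            return False
    return True


M = ["10100", "10000", "01011", "01010", "00010"]
rows = [[int(c) for c in seq] for seq in M]
print(has_perfect_phylogeny(rows))  # prints True
```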
PPH Example
[Figure: Genotypes, Inferred Haplotypes, Perfect Phylogeny]
Imperfect Phylogeny Haplotyping (IPPH): Extending PPH
• Often, real biological data does not have a PPH solution.
• Eskin et al. (2003) found that deleting a small part of the data may lead to a PPH solution (heuristic)
• Our approach: IPPH with an explicit genetic model, allowing a small amount of
– Homoplasy, i.e. back or recurrent mutation
– Recombination
• Goal: Extend the usage of PPH
– Real data: may be a small perturbation from PPH
– Haplotype block: low recombination or homoplasy
Back/Recurrent Mutation for Haplotypes
Data: 000, 010, 101, 110
[Figure: tree rooted at 000 with node labels 000, 110, 010, 101, 010, 100 and edge labels 2, 1, 3, 1]
More than one mutation at a site
Recombinations: Single Crossover
• Recombination is one of the principal genetic forces shaping genetic variation
• Two equal-length sequences generate a third sequence of the same length
Parent sequences: 110001111111001 and 000110000001111
Recombinant: prefix 11000 + suffix 0000001111, joined at the breakpoint
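The single-crossover event on this slide can be sketched as string slicing: the recombinant takes a prefix of one parent and the matching suffix of the other. The function name and the convention that the breakpoint counts the sites taken from the first parent are assumptions made for this illustration.

```python
def single_crossover(prefix_parent, suffix_parent, breakpoint):
    """Return prefix_parent[:breakpoint] followed by suffix_parent[breakpoint:]."""
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have equal length")
    if not 0 <= breakpoint <= len(prefix_parent):
        raise ValueError("breakpoint must lie within the sequence")
    return prefix_parent[:breakpoint] + suffix_parent[breakpoint:]


# The two parents from the slide, with the breakpoint after site 5.
child = single_crossover("110001111111001", "000110000001111", 5)
print(child)  # prints 110000000001111
```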
IPPH (Imperfect Phylogeny Haplotyping) Problems
• Small deviation from PPH
• H-1 IPPH problem
– Find a tree that allows exactly one site to mutate twice
– The rest of the sites can mutate at most once
– Derive haplotypes for the given genotypes
• R-1 IPPH problem
– Find a network that has exactly one recombination event
– Each site mutates at most once
– Derive haplotypes for the given genotypes
Number of Minimum Recombinations for Haplotypes
Rmin   Rho=1   Rho=3   Rho=5
0      60.8%   23.6%    8.4%
1      31.8%   35.2%   27.6%
2       6.8%   24.8%   27.8%
3              11.6%   21.6%
4               3.8%    9.0%
5               0.8%    3.6%
6               0.2%    1.4%
Frequency of the minimum number of recombinations for small rho (scaled recombination rate)
20 sequences, 30 sites, 500 simulations
Haplotyping with One Homoplasy
Genotype:
   s1 s2 s3
a  0  2  0
b  1  2  2
Haplotype:
    s1 s2 s3
a1  0  0  0
a2  0  1  0
b1  1  0  1
b2  1  1  0
[Figure: 1 Homoplasy Tree, rooted at 000, with internal nodes 010 and 100, leaves a1, b2, a2, b1, and edge labels 2, 1, 3, 1]
More than one mutation at site 1
Algorithm for H1-IPPH
• For each site s in the input genotype data M
– Test whether M-{s} has PPH solutions
– If not, move to the next site.
– Otherwise, check whether 1 homoplasy at site s can lead to HI solutions
– If yes, stop and report the result
• Assume only one PPH solution for M-{s}
• But how to find solutions with 1 homoplasy at s efficiently?
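The outer loop of this procedure can be sketched as follows. Only the control flow comes from the slide. The two callables pph_solve and extend_with_homoplasy are hypothetical placeholders for a PPH solver and for the homoplasy check, neither of which is specified here; each is assumed to return None on failure.

```python
def solve_h1_ipph(genotypes, pph_solve, extend_with_homoplasy):
    """Try each site s as the single homoplasy site (control-flow sketch only).

    pph_solve(reduced) and extend_with_homoplasy(pph, column, s) are
    hypothetical callables supplied by the caller; both return None on failure.
    """
    n_sites = len(genotypes[0])
    for s in range(n_sites):
        # M - {s}: the genotype matrix with column s removed.
        reduced = [g[:s] + g[s + 1:] for g in genotypes]
        pph = pph_solve(reduced)
        if pph is None:
            continue  # no PPH solution without s, move to the next site
        column = [g[s] for g in genotypes]
        result = extend_with_homoplasy(pph, column, s)
        if result is not None:
            return s, result  # stop and report the first success
    return None
```

Passing the two steps in as parameters keeps the sketch runnable with stand-in functions while leaving the hard part, the efficient homoplasy check, open.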
Example
[Figure: M is split at site i3 into M-{i3} and {i3}; PPH on M-{i3} gives Mh-{i3}, shown next to h{i3}; row labels r2, r2’, s2, s2’]
Assume Mh-{i3} is fixed.
Haplotypes for the same genotype must pair up.
Two ways to pair.
Combine Mh-{i3} with h{i3}
• 4 ways to try pairing i3.
• Exponential number in general, even for one PPH solution
• Need a polynomial-time method to avoid trying all the pairings
[Figure: Mh-{i3} combined with h{i3} into candidates Mh1, Mh2]
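To see where the exponential count comes from, the brute-force enumeration can be sketched directly: every genotype that is heterozygous (2) at the extra site offers two ways to attach 0 and 1 to its fixed haplotype pair, so k such genotypes give 2^k candidates. This is the naive enumeration the talk wants to avoid, not the polynomial-time method. The name pairings_at_site and the tuple representation are assumptions.

```python
from itertools import product


def pairings_at_site(fixed_pairs, column):
    """Yield every way to extend fixed haplotype pairs by one extra site.

    fixed_pairs: one (r, r_prime) pair of 0/1 tuples per genotype.
    column: the 0/1/2 genotype values at the extra site.
    """
    options = []
    for (r, r_prime), g in zip(fixed_pairs, column):
        if g in (0, 1):
            # Homozygous at the site: both haplotypes get the same value.
            options.append([(r + (g,), r_prime + (g,))])
        else:
            # Heterozygous: two ordered choices (identical if r == r_prime).
            options.append([(r + (0,), r_prime + (1,)),
                            (r + (1,), r_prime + (0,))])
    for choice in product(*options):
        yield list(choice)


fixed = [((0, 0), (0, 1)), ((1, 0), (1, 1))]
print(len(list(pairings_at_site(fixed, [2, 2]))))  # prints 4
```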
Move to Trees
Convert the perfect phylogeny tree from the PPH solution to an unrooted tree
1 Homoplasy: from T to Tr, Ts
[Figure: Tree T with a recurrent mutation at site s on two edges, and leaf sets L1, L2, O1, O2]
s induces a split Ts: L1, L2 | O1, O2
Deleting s induces tree Tr
From Tr, Ts to T
Find two subtrees Ts1, Ts2 in Tr, s.t. Ts1, Ts2 correspond to one side of Ts
1. Pick one side of the partition from Ts
2. Pick the leaves from Tr corresponding to the chosen partition side
3. Check whether the selected leaves fit into two subtrees
– May need to refine a non-binary vertex before picking a subtree
[Figure: Tree Tr; split Ts with sides L and O; Tree T with leaf sets L1, L - L1, O1, O2 and site s on two edges]
s2 can pair with r2’
Solution
Algorithms and Results
• Efficient graph-coloring-based method to select two subtrees (skipped)
• Implemented in C++
• Simulation on data generated with the program ms.
• Compared to PHASE (a haplotyping program)
– Accuracy: comparable
– Speed: at least 10x faster
– 100x100 data: about 3 seconds
• Can identify the homoplasy site with high accuracy: >95% in simulation
Algorithm for R1-IPPH
[Figure: M split into ML and MR]
Split M by cutting between two sites
PPH Solutions
Build a perfect phylogeny for each of the two partitions
1-SPR operation
SPR: subtree-prune-regraft operation
The 1-recombination condition is equivalent to distance-SPR(TL, TR) = 1
Algorithm for R1-IPPH
• The brute-force 1-SPR idea leads to exponential time when TL or TR is not binary.
• Trickier than H1-IPPH, but with care, R1-IPPH can be solved in polynomial time (not in the paper).
Conclusions
• Contributions
– Assuming a bounded number of PPH solutions
1. Polynomial-time algorithm for the H1-IPPH problem
2. Polynomial-time algorithm for the R1-IPPH problem
3. Possible extension to more than 1 homoplasy event.
• Open problems
– Efficient haplotyping with more than 1 recombination.
– Remove the assumption that the number of PPH solutions for M-{s} is bounded.
Thank you
• Questions?